From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <5284A689.70903@jp.fujitsu.com>
Date: Thu, 14 Nov 2013 19:31:37 +0900
From: HATAYAMA Daisuke
Subject: Re: /proc/vmcore mmap() failure issue
References: <20131113204130.GD7613@redhat.com>
In-Reply-To: <20131113204130.GD7613@redhat.com>
To: Vivek Goyal
Cc: Baoquan He, Kexec Mailing List, linux kernel mailing list, Atsushi Kumagai, "Eric W. Biederman", Dave Young, WANG Chao

(2013/11/14 5:41), Vivek Goyal wrote:
> Hi Hatayama,
>
> We are facing some /proc/vmcore mmap() failure issues and then makedumpfile
> exits without saving dump and system reboots.
>
> I tried latest makedumpfile (devel branch) with 3.12 kernel.
>
> I think this issue happens only on some machines. And it looks like it
> happens when end of system RAM chunk in first kernel is not page aligned. For
> example, I have one machine where I noticed it and this is how system
> RAM looks like.
>
> 00100000-dafa57ff : System RAM
>   01000000-015892fa : Kernel code
>   015892fb-0195c9ff : Kernel data
>   01ae6000-01d31fff : Kernel bss
>   24000000-33ffffff : Crash kernel
> dafa5800-dbffffff : reserved
>
> Notice that dafa57ff does not end at page boundary and next reserved
> range does not start at page boundary. I think that next reserved
> range is referenced through some ACPI data. More on this later.
>
> So we put some printk() messages to get more info. In a nutshell,
> remap_pfn_range() fails when we try to map the last section of system
> RAM not ending on page boundary.
>
> remap_pfn_range()
>   track_pfn_remap() {
>         /*
>          * For anything smaller than the vma size we set prot based on the
>          * lookup.
>          */
>         flags = lookup_memtype(paddr);
>
>         /* Check memtype for the remaining pages */
>         while (size > PAGE_SIZE) {
>                 size -= PAGE_SIZE;
>                 paddr += PAGE_SIZE;
>                 if (flags != lookup_memtype(paddr))
>                         return -EINVAL; <---------------- Failure.
>         }
>
>   }
>
>
> So we pass in a range to track_pfn_remap. Say pfn=0xdad62 size=0x244000.
> Now we call lookup_memtype() on every page in the range and make sure
> they all are same, otherwise we fail. Guess what, all are the same except
> the last page (which does not end at page boundary).
>
> I dived deeper into lookup_memtype() and noticed that all regular
> ranges are not registered anywhere and their flags are _PAGE_CACHE_UC_MINUS.
> But the last unaligned page/range is registered in memtype rb tree and
> has attribute _PAGE_CACHE_WB.
>
> Then I hooked into reserve_memtype() to figure out who is registering
> page 0xdafa5000 and it is acpi_init() which does it.
>
> [ 0.721655] Hardware name:
> [ 0.730590] ffff8800340f3830 ffff8800340f37c0 ffffffff81575509 00000000dafa5000
> [ 0.738010] ffff8800340f3800 ffffffff810566cc 00000000000dafa5 00000000dafa5000
> [ 0.745428] 00000000dafa6000 00000000dafa5000 0000000000000000 0000000000001000
> [ 0.752845] Call Trace:
> [ 0.755288] [] dump_stack+0x45/0x56
> [ 0.760414] [] reserve_memtype+0x31c/0x3f0
> [ 0.766144] [] __ioremap_caller+0x12f/0x360
> [ 0.771963] [] ? acpi_os_release_object+0xe/0x12
> [ 0.778217] [] ? acpi_os_map_memory+0xf6/0x14e
> [ 0.784295] [] ioremap_cache+0x14/0x20
> [ 0.789679] [] acpi_os_map_memory+0xf6/0x14e
> [ 0.795582] [] acpi_ex_system_memory_space_handler+0xdd/0x1ca
> [ 0.802961] [] acpi_ev_address_space_dispatch+0x1b0/0x208
> [ 0.809993] [] acpi_ex_access_region+0x20e/0x2a2
> [ 0.816244] [] ? __alloc_pages_nodemask+0x134/0x300
> [ 0.822754] [] acpi_ex_field_datum_io+0xf6/0x171
> [ 0.829004] [] acpi_ex_extract_from_field+0xd7/0x20a
> [ 0.835602] [] ? acpi_ut_create_internal_object_dbg+0x23/0x8a
> [ 0.842981] [] acpi_ex_read_data_from_field+0x10f/0x14b
> [ 0.849838] [] acpi_ex_resolve_node_to_value+0x18e/0x21c
> [ 0.856780] [] acpi_ex_resolve_to_value+0x202/0x209
> [ 0.863291] [] acpi_ds_evaluate_name_path+0x7b/0xf5
> [ 0.869803] [] acpi_ds_exec_end_op+0x98/0x3e8
> [ 0.875793] [] acpi_ps_parse_loop+0x514/0x560
> [ 0.881784] [] acpi_ps_parse_aml+0x98/0x28c
> [ 0.887601] [] acpi_ps_execute_method+0x1c1/0x26c
> [ 0.893939] [] acpi_ns_evaluate+0x1c1/0x258
> [ 0.899755] [] acpi_ev_execute_reg_method+0xca/0x112
> [ 0.906353] [] acpi_ev_reg_run+0x48/0x52
> [ 0.911910] [] acpi_ns_walk_namespace+0xc8/0x17f
> [ 0.918160] [] ? acpi_ev_detach_region+0x146/0x146
> [ 0.924585] [] acpi_ev_execute_reg_methods+0x44/0xf7
> [ 0.931184] [] ? acpi_sleep_proc_init+0x2a/0x2a
> [ 0.937349] [] ? acpi_os_wait_semaphore+0x43/0x57
> [ 0.943686] [] ? acpi_ut_acquire_mutex+0x48/0x88
> [ 0.949938] [] acpi_ev_initialize_op_regions+0x49/0x71
> [ 0.956709] [] ? acpi_sleep_proc_init+0x2a/0x2a
> [ 0.962873] [] acpi_initialize_objects+0x23/0x4f
> [ 0.969125] [] acpi_init+0x90/0x268
>
> So basically, this split page seems to be a problem. Some other code
> thinks that it has access to full page and goes ahead and registers
> that with PAT rb tree and this causes problems in mmap() code.
>
> I suspect we might have to go back to idea of copying first and last
> non page aligned ranges in new kernel's memory and read it from there
> to solve this issue. Do you have other ideas?
>

Sorry for the delayed response; it looks like you have already found a
way to fix this issue.

BTW, I previously found a part of makedumpfile that truncated the first
and last pages if they were not page-aligned. According to Kumagai-san,
the truncation was performed on an ia64 system where he then found valid
data in the truncated area, so the latest makedumpfile no longer does
this truncation. The commit is:

commit f854b37adba223d5b4801accbedd17b447266d51
Author: Atsushi Kumagai
Date:   Fri Jun 21 15:25:31 2013 +0900

    [PATCH 2/2] Fix the handling of the pages correspond to border of PT_LOAD.

    The pages correspond to border of PT_LOAD were removed as holes.
    For example, pfn:N showed below was removed but we know even odd
    region like [0x40ffda7000 - 0x40ffda8000] can include valid dates,
    so we shouldn't remove it as holes.

                     phys_start
                     = 0x40ffda7000
        |<-- frac_head -->|------------- PT_LOAD -------------
    ----+-----------------------+---------------------+----
        |         pfn:N         |       pfn:N+1       | ...
    ----+-----------------------+---------------------+----
        |
    pfn_to_paddr(pfn:N)               # page size = 16k
    = 0x40ffda4000

    This patch handles such odd regions correctly. Then read pfn:N
    and write it to disk, the ranges not covered by any PT_LOAD entries
    will be filled with 0.

    Signed-off-by: Atsushi Kumagai

The log on the web is:
http://lists.infradead.org/pipermail/kexec/2013-May/008875.html

So, without this change, you would not have seen this issue.
The original reason the code was implemented that way might have been
issues similar to this one.

Next, I think we need to consider whether to revert the above commit,
since makedumpfile fails on some systems, as you reported.

--
Thanks.
HATAYAMA, Daisuke
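
The failure described in the quoted report can be reproduced on paper with
a small user-space model of the loop in track_pfn_remap(). The following is
only a sketch under stated assumptions, not kernel code: sim_lookup_memtype()
is a hypothetical stand-in for lookup_memtype() that hard-codes the one fact
from the report (page 0xdafa5000 is registered as WB, everything else
resolves to UC_MINUS), and a 4 KiB page size is assumed.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SIM_PAGE_SIZE 0x1000ULL /* assumed 4 KiB pages, as on x86 */

enum sim_memtype { SIM_UC_MINUS, SIM_WB };

/*
 * Hypothetical stand-in for lookup_memtype(): only the page that ACPI
 * ioremap'ed (0xdafa5000 in the report) is treated as registered WB.
 */
static enum sim_memtype sim_lookup_memtype(uint64_t paddr)
{
    uint64_t page = paddr & ~(SIM_PAGE_SIZE - 1);

    return page == 0xdafa5000ULL ? SIM_WB : SIM_UC_MINUS;
}

/*
 * Model of the consistency check quoted above: every page in the range
 * must have the same memtype as the first page, otherwise -EINVAL.
 */
static int sim_track_pfn_remap(uint64_t pfn, uint64_t size)
{
    uint64_t paddr = pfn * SIM_PAGE_SIZE;
    enum sim_memtype flags;

    assert(size > 0 && size % SIM_PAGE_SIZE == 0);

    flags = sim_lookup_memtype(paddr);

    /* Check memtype for the remaining pages */
    while (size > SIM_PAGE_SIZE) {
        size -= SIM_PAGE_SIZE;
        paddr += SIM_PAGE_SIZE;
        if (flags != sim_lookup_memtype(paddr))
            return -EINVAL;
    }
    return 0;
}
```

With the values from the report, sim_track_pfn_remap(0xdad62, 0x244000)
covers 0xdad62000 up to 0xdafa6000, so the last page checked is 0xdafa5000
and the call returns -EINVAL. Shortening the range by one page (size
0x243000) stops at 0xdafa4000 and returns 0, which matches the observation
that only the split page is the problem.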
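
The border-page handling described in the commit message above (read pfn:N,
write it out, and fill the part not covered by any PT_LOAD entry with 0) can
be illustrated in a similar way. This is a minimal sketch, not makedumpfile's
actual code: the helper names border_page_start(), border_frac_head() and
border_fill_page() are made up for illustration, the page size is assumed to
be a power of two, and the segment is assumed to overlap the page.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Round paddr down to the start of its page (page_size: power of two). */
static uint64_t border_page_start(uint64_t paddr, uint64_t page_size)
{
    return paddr & ~(page_size - 1);
}

/* frac_head in the diagram: bytes between the page start and phys_start. */
static uint64_t border_frac_head(uint64_t phys_start, uint64_t page_size)
{
    return phys_start - border_page_start(phys_start, page_size);
}

/*
 * Build the image of a border page: bytes inside [seg_start, seg_end)
 * are copied from the page, bytes outside the segment are zero-filled.
 */
static void border_fill_page(unsigned char *out, const unsigned char *page,
                             uint64_t page_start, uint64_t page_size,
                             uint64_t seg_start, uint64_t seg_end)
{
    uint64_t page_end = page_start + page_size;
    uint64_t lo = seg_start > page_start ? seg_start - page_start : 0;
    uint64_t hi = seg_end < page_end ? seg_end - page_start : page_size;

    assert(lo <= hi && hi <= page_size); /* segment must overlap the page */

    memset(out, 0, page_size);
    memcpy(out + lo, page + lo, hi - lo);
}
```

For the example in the commit message (phys_start = 0x40ffda7000, 16k
pages), border_page_start() gives 0x40ffda4000 and border_frac_head() gives
0x3000, so the first 0x3000 bytes of pfn:N are zero-filled and only the odd
region [0x40ffda7000 - 0x40ffda8000] carries data from the first kernel.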