public inbox for linux-kernel@vger.kernel.org
From: Yinghai Lu <yinghai@kernel.org>
To: Jens Axboe <jens.axboe@oracle.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>,
	Linux Kernel <linux-kernel@vger.kernel.org>,
	mingo@elte.hu, rdreier@cisco.com,
	Suresh Siddha <suresh.b.siddha@intel.com>,
	"linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>,
	"H. Peter Anvin" <hpa@zytor.com>,
	Huang Ying <ying.huang@intel.com>
Subject: Re: kexec boot regression
Date: Tue, 15 Dec 2009 13:43:00 -0800
Message-ID: <4B2802E4.40404@kernel.org>
In-Reply-To: <20091215214009.GD28252@kernel.dk>

[-- Attachment #1: Type: text/plain, Size: 2992 bytes --]

Jens Axboe wrote:
> On Tue, Dec 15 2009, Jens Axboe wrote:
>> On Tue, Dec 15 2009, Yinghai Lu wrote:
>>> Jens Axboe wrote:
>>>> On Tue, Dec 15 2009, Jens Axboe wrote:
>>>>> On Tue, Dec 15 2009, Jens Axboe wrote:
>>>>>> On Tue, Dec 15 2009, Yinghai Lu wrote:
>>>>>>> Jens Axboe wrote:
>>>>>>>> On Tue, Dec 15 2009, Yinghai Lu wrote:
>>>>>>>>> Jens Axboe wrote:
>>>>>>>>>> On Tue, Dec 15 2009, Yinghai Lu wrote:
>>>>>>>>>>> Jens Axboe wrote:
>>>>>>>>>>>> On Tue, Dec 15 2009, Yinghai Lu wrote:
>>>>>>>>>>>>> [PATCH] x86/pci: intel ioh bus num reg accessing fix
>>>>>>>>>>>>>
>>>>>>>>>>>>> it is above 0x100, so if mmconf is not enable, need to skip it
>>>>>>>>>>>> This works, it kexecs kernels fine. But since 2.6.32 doesn't have the
>>>>>>>>>>>> mmconf problem to begin with, are we now just working around the issue?
>>>>>>>>>>>> SRAT still reports issues, numa doesn't work.
>>>>>>>>>>> that patch will be bullet proof... we need it.
>>>>>>>>>>>
>>>>>>>>>>> also still need to figure out why memmap range is not passed properly.
>>>>>>>>>>>
>>>>>>>>>>> do you mean 2.6.32 kexec 2.6.32 it have worked mmconf and numa in
>>>>>>>>>>> second kernel?
>>>>>>>>>> Yes, 2.6.32 booted and 2.6.32 kexec'ed works just fine, no SRAT
>>>>>>>>>> complaints and NUMA works fine.
>>>>>>>>> do you need 
>>>>>>>>> memmap=62G@4G
>>>>>>>>> in this case?
>>>>>>>> Yes, I've needed that always.
>>>>>>> good,
>>>>>>>
>>>>>>> can you enable debug option in kexec to see why kexec can not pass
>>>>>>> whole 38? range to second kernel?
>>>>>> Not getting any output so far, -d doesn't do much. Poking around in the
>>>>>> source...
>>>>> OK, cold boot and kexec 2.0.1 gets all 39 ranges passed properly to
>>>>> kexec'ed kernels. Since the older kexec stopped at range 30 (31 ranges
>>>>> total), that smells like just a kexec bug. Retesting -git...
>>>> Current -git works fine when all the ranges are passed correctly. So, I
>>>> think, the only existing regression is the SRAT issue.
>>> did you change node_shift?
>> Yes:
>>
>> CONFIG_NODES_SHIFT=6
>>
>> What I don't get is that 2.6.32 and -git print the same PXM map, and in
>> both cases it's totalling exactly 64G. Yet it says:
>>
>> SRAT: PXMs only cover 49035MB of your 65419MB e820 RAM. Not used.
> 
> Clue:
> 
> [    0.000000] SRAT: Node 0 PXM 0 0-80000000
> [    0.000000] SRAT: Node 0 PXM 0 100000000-480000000
> [    0.000000] SRAT: Node 2 PXM 1 480000000-880000000
> [    0.000000] SRAT: Node 1 PXM 2 880000000-c80000000
> [    0.000000] SRAT: Node 3 PXM 3 c80000000-1080000000
> [    0.000000] NUMA: Using 31 for the hash shift.
> [    0.000000] pxm0: 0-480000 (4718592), absent 553990
> [    0.000000] pxm1: 880000-c80000 (4194304), absent 0
> [    0.000000] pxm2: 480000-880000 (4194304), absent 4194304
> [    0.000000] pxm3: c80000-1080000 (4194304), absent 0
> [    0.000000] SRAT: PXMs only cover 49035MB of your 65419MB e820 RAM.  Not used.
> [    0.000000] SRAT: SRAT not used.
> 

Oh, I posted a patch last week.

Can you check it?

YH

[-- Attachment #2: Attached Message --]
[-- Type: message/rfc822, Size: 5721 bytes --]

From: Yinghai Lu <yinghai@kernel.org>
To: Ingo Molnar <mingo@elte.hu>, Thomas Gleixner <tglx@linutronix.de>,  "H. Peter Anvin" <hpa@zytor.com>, Andrew Morton <akpm@linux-foundation.org>, Mel Gorman <mel@csn.ul.ie>,  Suresh Siddha <suresh.b.siddha@intel.com>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: [PATCH] x86: fix checking of SRAT when node0 ram is not from 0 -v2
Date: Sun, 13 Dec 2009 15:33:38 -0800
Message-ID: <4B2579D2.3010201@kernel.org>



Found one system that boots from socket1 instead of socket0; its SRAT gets rejected...

[    0.000000] SRAT: Node 1 PXM 0 0-a0000
[    0.000000] SRAT: Node 1 PXM 0 100000-80000000
[    0.000000] SRAT: Node 1 PXM 0 100000000-2080000000
[    0.000000] SRAT: Node 0 PXM 1 2080000000-4080000000
[    0.000000] SRAT: Node 2 PXM 2 4080000000-6080000000
[    0.000000] SRAT: Node 3 PXM 3 6080000000-8080000000
[    0.000000] SRAT: Node 4 PXM 4 8080000000-a080000000
[    0.000000] SRAT: Node 5 PXM 5 a080000000-c080000000
[    0.000000] SRAT: Node 6 PXM 6 c080000000-e080000000
[    0.000000] SRAT: Node 7 PXM 7 e080000000-10080000000
...
[    0.000000] NUMA: Allocated memnodemap from 500000 - 701040
[    0.000000] NUMA: Using 20 for the hash shift.
[    0.000000] Adding active range (0, 0x2080000, 0x4080000) 0 entries of 3200 used
[    0.000000] Adding active range (1, 0x0, 0x96) 1 entries of 3200 used
[    0.000000] Adding active range (1, 0x100, 0x7f750) 2 entries of 3200 used
[    0.000000] Adding active range (1, 0x100000, 0x2080000) 3 entries of 3200 used
[    0.000000] Adding active range (2, 0x4080000, 0x6080000) 4 entries of 3200 used
[    0.000000] Adding active range (3, 0x6080000, 0x8080000) 5 entries of 3200 used
[    0.000000] Adding active range (4, 0x8080000, 0xa080000) 6 entries of 3200 used
[    0.000000] Adding active range (5, 0xa080000, 0xc080000) 7 entries of 3200 used
[    0.000000] Adding active range (6, 0xc080000, 0xe080000) 8 entries of 3200 used
[    0.000000] Adding active range (7, 0xe080000, 0x10080000) 9 entries of 3200 used
[    0.000000] SRAT: PXMs only cover 917504MB of your 1048566MB e820 RAM. Not used.
[    0.000000] SRAT: SRAT not used.

The early_node_map is not sorted, because node0, which has a non-zero start, is registered first.

So sort it right after all regions are registered.

-v2: make it more robust, to handle cross-node cases like node0 [0, 4g), [8g, 12g) and node1 [4g, 8g), [12g, 16g)

Signed-off-by: Yinghai Lu <yinghai@kernel.org>

---
 arch/x86/mm/srat_32.c |    2 ++
 arch/x86/mm/srat_64.c |    4 +++-
 include/linux/mm.h    |    3 +++
 mm/page_alloc.c       |    4 ++--
 4 files changed, 10 insertions(+), 3 deletions(-)

Index: linux-2.6/arch/x86/mm/srat_32.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/srat_32.c
+++ linux-2.6/arch/x86/mm/srat_32.c
@@ -267,6 +267,8 @@ int __init get_memcfg_from_srat(void)
 		e820_register_active_regions(chunk->nid, chunk->start_pfn,
 					     min(chunk->end_pfn, max_pfn));
 	}
+	/* for out of order entries in SRAT */
+	sort_node_map();
 
 	for_each_online_node(nid) {
 		unsigned long start = node_start_pfn[nid];
Index: linux-2.6/arch/x86/mm/srat_64.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/srat_64.c
+++ linux-2.6/arch/x86/mm/srat_64.c
@@ -317,7 +317,7 @@ static int __init nodes_cover_memory(con
 		unsigned long s = nodes[i].start >> PAGE_SHIFT;
 		unsigned long e = nodes[i].end >> PAGE_SHIFT;
 		pxmram += e - s;
-		pxmram -= absent_pages_in_range(s, e);
+		pxmram -= __absent_pages_in_range(i, s, e);
 		if ((long)pxmram < 0)
 			pxmram = 0;
 	}
@@ -373,6 +373,8 @@ int __init acpi_scan_nodes(unsigned long
 	for_each_node_mask(i, nodes_parsed)
 		e820_register_active_regions(i, nodes[i].start >> PAGE_SHIFT,
 						nodes[i].end >> PAGE_SHIFT);
+	/* for out of order entries in SRAT */
+	sort_node_map();
 	if (!nodes_cover_memory(nodes)) {
 		bad_srat();
 		return -1;
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -1022,6 +1022,9 @@ extern void add_active_range(unsigned in
 extern void remove_active_range(unsigned int nid, unsigned long start_pfn,
 					unsigned long end_pfn);
 extern void remove_all_active_ranges(void);
+void sort_node_map(void);
+unsigned long __absent_pages_in_range(int nid, unsigned long start_pfn,
+						unsigned long end_pfn);
 extern unsigned long absent_pages_in_range(unsigned long start_pfn,
 						unsigned long end_pfn);
 extern void get_pfn_range_for_nid(unsigned int nid,
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -3573,7 +3573,7 @@ static unsigned long __meminit zone_span
  * Return the number of holes in a range on a node. If nid is MAX_NUMNODES,
  * then all holes in the requested range will be accounted for.
  */
-static unsigned long __meminit __absent_pages_in_range(int nid,
+unsigned long __meminit __absent_pages_in_range(int nid,
 				unsigned long range_start_pfn,
 				unsigned long range_end_pfn)
 {
@@ -4102,7 +4102,7 @@ static int __init cmp_node_active_region
 }
 
 /* sort the node_map by start_pfn */
-static void __init sort_node_map(void)
+void __init sort_node_map(void)
 {
 	sort(early_node_map, (size_t)nr_nodemap_entries,
 			sizeof(struct node_active_region),


