From: Yinghai Lu <yinghai@kernel.org>
To: Jens Axboe <jens.axboe@oracle.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>,
	Linux Kernel <linux-kernel@vger.kernel.org>,
	mingo@elte.hu, rdreier@cisco.com,
	Suresh Siddha <suresh.b.siddha@intel.com>,
	"linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>,
	"H. Peter Anvin" <hpa@zytor.com>,
	Huang Ying <ying.huang@intel.com>
Subject: Re: kexec boot regression
Date: Tue, 15 Dec 2009 13:43:00 -0800
Message-ID: <4B2802E4.40404@kernel.org>
In-Reply-To: <20091215214009.GD28252@kernel.dk>

[-- Attachment #1: Type: text/plain, Size: 2992 bytes --]

Jens Axboe wrote:
> On Tue, Dec 15 2009, Jens Axboe wrote:
>> On Tue, Dec 15 2009, Yinghai Lu wrote:
>>> Jens Axboe wrote:
>>>> On Tue, Dec 15 2009, Jens Axboe wrote:
>>>>> On Tue, Dec 15 2009, Jens Axboe wrote:
>>>>>> On Tue, Dec 15 2009, Yinghai Lu wrote:
>>>>>>> Jens Axboe wrote:
>>>>>>>> On Tue, Dec 15 2009, Yinghai Lu wrote:
>>>>>>>>> Jens Axboe wrote:
>>>>>>>>>> On Tue, Dec 15 2009, Yinghai Lu wrote:
>>>>>>>>>>> Jens Axboe wrote:
>>>>>>>>>>>> On Tue, Dec 15 2009, Yinghai Lu wrote:
>>>>>>>>>>>>> [PATCH] x86/pci: intel ioh bus num reg accessing fix
>>>>>>>>>>>>>
>>>>>>>>>>>>> it is above 0x100, so if mmconf is not enabled, we need to skip it
>>>>>>>>>>>> This works, it kexecs kernels fine. But since 2.6.32 doesn't have the
>>>>>>>>>>>> mmconf problem to begin with, are we now just working around the issue?
>>>>>>>>>>>> SRAT still reports issues, numa doesn't work.
>>>>>>>>>>> that patch will be bullet proof... we need it.
>>>>>>>>>>>
>>>>>>>>>>> also still need to figure out why memmap range is not passed properly.
>>>>>>>>>>>
>>>>>>>>>>> do you mean that with 2.6.32 kexec'ing 2.6.32, mmconf and numa work in
>>>>>>>>>>> the second kernel?
>>>>>>>>>> Yes, 2.6.32 booted and 2.6.32 kexec'ed works just fine, no SRAT
>>>>>>>>>> complaints and NUMA works fine.
>>>>>>>>> do you need 
>>>>>>>>> memmap=62G@4G
>>>>>>>>> in this case?
>>>>>>>> Yes, I've needed that always.
>>>>>>> good,
>>>>>>>
>>>>>>> can you enable the debug option in kexec to see why kexec cannot pass
>>>>>>> the whole 38? ranges to the second kernel?
>>>>>> Not getting any output so far, -d doesn't do much. Poking around in the
>>>>>> source...
>>>>> OK, cold boot and kexec 2.0.1 gets all 39 ranges passed properly to
>>>>> kexec'ed kernels. Since the older kexec stopped at range 30 (31 ranges
>>>>> total), that smells like just a kexec bug. Retesting -git...
>>>> Current -git works fine when all the ranges are passed correctly. So, I
>>>> think, the only existing regression is the SRAT issue.
>>> did you change node_shift?
>> Yes:
>>
>> CONFIG_NODES_SHIFT=6
>>
>> What I don't get is that 2.6.32 and -git print the same PXM map, and in
>> both cases it's totalling exactly 64G. Yet it says:
>>
>> SRAT: PXMs only cover 49035MB of your 65419MB e820 RAM. Not used.
> 
> Clue:
> 
> [    0.000000] SRAT: Node 0 PXM 0 0-80000000
> [    0.000000] SRAT: Node 0 PXM 0 100000000-480000000
> [    0.000000] SRAT: Node 2 PXM 1 480000000-880000000
> [    0.000000] SRAT: Node 1 PXM 2 880000000-c80000000
> [    0.000000] SRAT: Node 3 PXM 3 c80000000-1080000000
> [    0.000000] NUMA: Using 31 for the hash shift.
> [    0.000000] pxm0: 0-480000 (4718592), absent 553990
> [    0.000000] pxm1: 880000-c80000 (4194304), absent 0
> [    0.000000] pxm2: 480000-880000 (4194304), absent 4194304
> [    0.000000] pxm3: c80000-1080000 (4194304), absent 0
> [    0.000000] SRAT: PXMs only cover 49035MB of your 65419MB e820 RAM.  Not used.
> [    0.000000] SRAT: SRAT not used.
> 
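
As a cross-check of the numbers in the log above, the coverage figure can be reproduced from the four pxm lines alone. This is a minimal userspace sketch, not kernel code; it assumes 4 KiB pages, and `pxm_line`, `pxm_log` and `covered_mb` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* One "pxmN: start-end (pages), absent M" line from the boot log. */
struct pxm_line {
	unsigned long long start_pfn;
	unsigned long long end_pfn;
	unsigned long long absent;
};

/* Values copied from the log, indexed by the N in "pxmN". */
static const struct pxm_line pxm_log[] = {
	{ 0x0,      0x480000,  553990 },
	{ 0x880000, 0xc80000,  0 },
	{ 0x480000, 0x880000,  4194304 },
	{ 0xc80000, 0x1080000, 0 },
};
#define PXM_LOG_LEN (sizeof(pxm_log) / sizeof(pxm_log[0]))

/*
 * Sum (span - absent) over all entries and convert pages to MB.
 * If zero_absent_for >= 0, that entry's absent count is treated as 0,
 * to see what the total would be without it.
 */
static unsigned long long covered_mb(const struct pxm_line *p, size_t n,
				     int zero_absent_for)
{
	unsigned long long pages = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		unsigned long long absent = p[i].absent;

		assert(p[i].end_pfn >= p[i].start_pfn);
		if ((int)i == zero_absent_for)
			absent = 0;
		pages += (p[i].end_pfn - p[i].start_pfn) - absent;
	}
	return (pages << PAGE_SHIFT) >> 20;
}
```

Calling `covered_mb(pxm_log, PXM_LOG_LEN, -1)` gives 49035, the figure in the SRAT message. Calling it with `zero_absent_for` set to 2 gives 65419, the e820 figure, so the whole shortfall appears to be the "absent 4194304" reported for pxm2.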

Oh, I posted a patch for this last week.

Can you check it?

YH

[-- Attachment #2: Attached Message --]
[-- Type: message/rfc822, Size: 5721 bytes --]

From: Yinghai Lu <yinghai@kernel.org>
To: Ingo Molnar <mingo@elte.hu>, Thomas Gleixner <tglx@linutronix.de>,  "H. Peter Anvin" <hpa@zytor.com>, Andrew Morton <akpm@linux-foundation.org>, Mel Gorman <mel@csn.ul.ie>,  Suresh Siddha <suresh.b.siddha@intel.com>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: [PATCH] x86: fix checking of SRAT when node0 ram is not from 0 -v2
Date: Sun, 13 Dec 2009 15:33:38 -0800
Message-ID: <4B2579D2.3010201@kernel.org>



Found one system that boots from socket1 instead of socket0, and its SRAT gets rejected...

[    0.000000] SRAT: Node 1 PXM 0 0-a0000
[    0.000000] SRAT: Node 1 PXM 0 100000-80000000
[    0.000000] SRAT: Node 1 PXM 0 100000000-2080000000
[    0.000000] SRAT: Node 0 PXM 1 2080000000-4080000000
[    0.000000] SRAT: Node 2 PXM 2 4080000000-6080000000
[    0.000000] SRAT: Node 3 PXM 3 6080000000-8080000000
[    0.000000] SRAT: Node 4 PXM 4 8080000000-a080000000
[    0.000000] SRAT: Node 5 PXM 5 a080000000-c080000000
[    0.000000] SRAT: Node 6 PXM 6 c080000000-e080000000
[    0.000000] SRAT: Node 7 PXM 7 e080000000-10080000000
...
[    0.000000] NUMA: Allocated memnodemap from 500000 - 701040
[    0.000000] NUMA: Using 20 for the hash shift.
[    0.000000] Adding active range (0, 0x2080000, 0x4080000) 0 entries of 3200 used
[    0.000000] Adding active range (1, 0x0, 0x96) 1 entries of 3200 used
[    0.000000] Adding active range (1, 0x100, 0x7f750) 2 entries of 3200 used
[    0.000000] Adding active range (1, 0x100000, 0x2080000) 3 entries of 3200 used
[    0.000000] Adding active range (2, 0x4080000, 0x6080000) 4 entries of 3200 used
[    0.000000] Adding active range (3, 0x6080000, 0x8080000) 5 entries of 3200 used
[    0.000000] Adding active range (4, 0x8080000, 0xa080000) 6 entries of 3200 used
[    0.000000] Adding active range (5, 0xa080000, 0xc080000) 7 entries of 3200 used
[    0.000000] Adding active range (6, 0xc080000, 0xe080000) 8 entries of 3200 used
[    0.000000] Adding active range (7, 0xe080000, 0x10080000) 9 entries of 3200 used
[    0.000000] SRAT: PXMs only cover 917504MB of your 1048566MB e820 RAM. Not used.
[    0.000000] SRAT: SRAT not used.

The early_node_map is not sorted, because node0, which has a non-zero start, comes first.

So sort it right after all regions are registered.

-v2: make it more robust, to handle the cross-node case such as node0 [0, 4g), [8g, 12g) and node1 [4g, 8g), [12g, 16g)

Signed-off-by: Yinghai Lu <yinghai@kernel.org>

---
 arch/x86/mm/srat_32.c |    2 ++
 arch/x86/mm/srat_64.c |    4 +++-
 include/linux/mm.h    |    3 +++
 mm/page_alloc.c       |    4 ++--
 4 files changed, 10 insertions(+), 3 deletions(-)

Index: linux-2.6/arch/x86/mm/srat_32.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/srat_32.c
+++ linux-2.6/arch/x86/mm/srat_32.c
@@ -267,6 +267,8 @@ int __init get_memcfg_from_srat(void)
 		e820_register_active_regions(chunk->nid, chunk->start_pfn,
 					     min(chunk->end_pfn, max_pfn));
 	}
+	/* for out of order entries in SRAT */
+	sort_node_map();
 
 	for_each_online_node(nid) {
 		unsigned long start = node_start_pfn[nid];
Index: linux-2.6/arch/x86/mm/srat_64.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/srat_64.c
+++ linux-2.6/arch/x86/mm/srat_64.c
@@ -317,7 +317,7 @@ static int __init nodes_cover_memory(con
 		unsigned long s = nodes[i].start >> PAGE_SHIFT;
 		unsigned long e = nodes[i].end >> PAGE_SHIFT;
 		pxmram += e - s;
-		pxmram -= absent_pages_in_range(s, e);
+		pxmram -= __absent_pages_in_range(i, s, e);
 		if ((long)pxmram < 0)
 			pxmram = 0;
 	}
@@ -373,6 +373,8 @@ int __init acpi_scan_nodes(unsigned long
 	for_each_node_mask(i, nodes_parsed)
 		e820_register_active_regions(i, nodes[i].start >> PAGE_SHIFT,
 						nodes[i].end >> PAGE_SHIFT);
+	/* for out of order entries in SRAT */
+	sort_node_map();
 	if (!nodes_cover_memory(nodes)) {
 		bad_srat();
 		return -1;
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -1022,6 +1022,9 @@ extern void add_active_range(unsigned in
 extern void remove_active_range(unsigned int nid, unsigned long start_pfn,
 					unsigned long end_pfn);
 extern void remove_all_active_ranges(void);
+void sort_node_map(void);
+unsigned long __absent_pages_in_range(int nid, unsigned long start_pfn,
+						unsigned long end_pfn);
 extern unsigned long absent_pages_in_range(unsigned long start_pfn,
 						unsigned long end_pfn);
 extern void get_pfn_range_for_nid(unsigned int nid,
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -3573,7 +3573,7 @@ static unsigned long __meminit zone_span
  * Return the number of holes in a range on a node. If nid is MAX_NUMNODES,
  * then all holes in the requested range will be accounted for.
  */
-static unsigned long __meminit __absent_pages_in_range(int nid,
+unsigned long __meminit __absent_pages_in_range(int nid,
 				unsigned long range_start_pfn,
 				unsigned long range_end_pfn)
 {
@@ -4102,7 +4102,7 @@ static int __init cmp_node_active_region
 }
 
 /* sort the node_map by start_pfn */
-static void __init sort_node_map(void)
+void __init sort_node_map(void)
 {
 	sort(early_node_map, (size_t)nr_nodemap_entries,
 			sizeof(struct node_active_region),
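
To illustrate why the patch above sorts the map before checking coverage, here is a minimal userspace sketch of a single-pass hole counter. It is a simplified stand-in, not the kernel's __absent_pages_in_range(); `region`, `count_absent`, `count_absent_sorted` and `ANY_NODE` are hypothetical names, and the data is the first five "Adding active range" lines from the log in the patch description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define ANY_NODE (-1)	/* hypothetical stand-in for "all nodes" */
#define MAX_REGIONS 16

struct region {
	int nid;
	unsigned long start_pfn;
	unsigned long end_pfn;
};

/* First entries of the "Adding active range" log, in registration order. */
static const struct region log_map[] = {
	{ 0, 0x2080000, 0x4080000 },
	{ 1, 0x0,       0x96 },
	{ 1, 0x100,     0x7f750 },
	{ 1, 0x100000,  0x2080000 },
	{ 2, 0x4080000, 0x6080000 },
};
#define LOG_MAP_LEN (sizeof(log_map) / sizeof(log_map[0]))

/* Order regions by ascending start_pfn. */
static int cmp_region(const void *a, const void *b)
{
	const struct region *ra = a, *rb = b;

	return (ra->start_pfn > rb->start_pfn) - (ra->start_pfn < rb->start_pfn);
}

static unsigned long clamp_pfn(unsigned long v, unsigned long lo,
			       unsigned long hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/*
 * Count pfns in [start, end) not covered by a region of node nid
 * (any node when nid == ANY_NODE). Single forward pass: the result is
 * only meaningful when the walked regions are in ascending start order.
 */
static unsigned long count_absent(const struct region *map, size_t n, int nid,
				  unsigned long start, unsigned long end)
{
	unsigned long prev_end = start, holes = 0;
	size_t i;

	assert(start <= end);
	for (i = 0; i < n; i++) {
		unsigned long rs, re;

		if (nid != ANY_NODE && map[i].nid != nid)
			continue;
		rs = clamp_pfn(map[i].start_pfn, start, end);
		re = clamp_pfn(map[i].end_pfn, start, end);
		if (rs > prev_end)
			holes += rs - prev_end;	/* gap before this region */
		if (re > prev_end)
			prev_end = re;
	}
	return holes + (end - prev_end);	/* tail gap, if any */
}

/* Same count, but on a copy that is sorted by start_pfn first. */
static unsigned long count_absent_sorted(const struct region *map, size_t n,
					 int nid, unsigned long start,
					 unsigned long end)
{
	struct region tmp[MAX_REGIONS];

	assert(n <= MAX_REGIONS);
	memcpy(tmp, map, n * sizeof(tmp[0]));
	qsort(tmp, n, sizeof(tmp[0]), cmp_region);
	return count_absent(tmp, n, nid, start, end);
}
```

For the range [0x0, 0x2080000), `count_absent(log_map, LOG_MAP_LEN, ANY_NODE, ...)` on the unsorted map reports the whole range (0x2080000 pages) as absent, because the node0 entry that comes first lies entirely above the range. `count_absent_sorted()` reports 526618 pages for the same range, which is just the two real holes (0x96-0x100 and 0x7f750-0x100000).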

