LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH v5 01/45] percpu_rwlock: Introduce the global reader-writer lock backend
From: Srivatsa S. Bhat @ 2013-01-22 19:58 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-doc, peterz, fweisbec, linux-kernel, mingo, linux-arch,
	linux, xiaoguangrong, wangyun, paulmck, nikunj, linux-pm, rusty,
	rjw, namhyung, tglx, linux-arm-kernel, netdev, oleg, sbw, tj,
	akpm, linuxppc-dev
In-Reply-To: <1358883152.21576.55.camel@gandalf.local.home>

On 01/23/2013 01:02 AM, Steven Rostedt wrote:
> On Tue, 2013-01-22 at 13:03 +0530, Srivatsa S. Bhat wrote:
>> A straight-forward (and obvious) algorithm to implement Per-CPU Reader-Writer
>> locks can also lead to too many deadlock possibilities which can make it very
>> hard/impossible to use. This is explained in the example below, which helps
>> justify the need for a different algorithm to implement flexible Per-CPU
>> Reader-Writer locks.
>>
>> We can use global rwlocks as shown below safely, without fear of deadlocks:
>>
>> Readers:
>>
>>          CPU 0                                CPU 1
>>          ------                               ------
>>
>> 1.    spin_lock(&random_lock);             read_lock(&my_rwlock);
>>
>>
>> 2.    read_lock(&my_rwlock);               spin_lock(&random_lock);
>>
>>
>> Writer:
>>
>>          CPU 2:
>>          ------
>>
>>        write_lock(&my_rwlock);
>>
> 
> I thought global locks are now fair. That is, a reader will block if a
> writer is waiting. Hence, the above should deadlock on the current
> rwlock_t types.
> 

Oh is it? Last I checked, lockdep didn't complain about this ABBA scenario!

> We need to fix those locations (or better yet, remove all rwlocks ;-)
> 

:-)

The challenge with stop_machine() removal is that the replacement on the
reader side must have the (locking) flexibility comparable to preempt_disable().
Otherwise, that solution most likely won't be viable because we'll hit way
too many locking problems and go crazy by the time we convert them over..(if
we can, that is!)

Regards,
Srivatsa S. Bhat

^ permalink raw reply

* Re: [PATCH v5 01/45] percpu_rwlock: Introduce the global reader-writer lock backend
From: Srivatsa S. Bhat @ 2013-01-22 19:41 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: linux-doc, peterz, fweisbec, linux-kernel, mingo, linux-arch,
	linux, xiaoguangrong, wangyun, paulmck, nikunj, linux-pm, rusty,
	rostedt, rjw, namhyung, tglx, linux-arm-kernel, netdev, oleg, sbw,
	tj, akpm, linuxppc-dev
In-Reply-To: <20130122104506.32b4e581@nehalam.linuxnetplumber.net>

On 01/23/2013 12:15 AM, Stephen Hemminger wrote:
> On Tue, 22 Jan 2013 13:03:22 +0530
> "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> wrote:
> 
>> A straight-forward (and obvious) algorithm to implement Per-CPU Reader-Writer
>> locks can also lead to too many deadlock possibilities which can make it very
>> hard/impossible to use. This is explained in the example below, which helps
>> justify the need for a different algorithm to implement flexible Per-CPU
>> Reader-Writer locks.
>>
>> We can use global rwlocks as shown below safely, without fear of deadlocks:
>>
>> Readers:
>>
>>          CPU 0                                CPU 1
>>          ------                               ------
>>
>> 1.    spin_lock(&random_lock);             read_lock(&my_rwlock);
>>
>>
>> 2.    read_lock(&my_rwlock);               spin_lock(&random_lock);
>>
>>
>> Writer:
>>
>>          CPU 2:
>>          ------
>>
>>        write_lock(&my_rwlock);
>>
>>
>> We can observe that there is no possibility of deadlocks or circular locking
>> dependencies here. Its perfectly safe.
>>
>> Now consider a blind/straight-forward conversion of global rwlocks to per-CPU
>> rwlocks like this:
>>
>> The reader locks its own per-CPU rwlock for read, and proceeds.
>>
>> Something like: read_lock(per-cpu rwlock of this cpu);
>>
>> The writer acquires all per-CPU rwlocks for write and only then proceeds.
>>
>> Something like:
>>
>>   for_each_online_cpu(cpu)
>> 	write_lock(per-cpu rwlock of 'cpu');
>>
>>
>> Now let's say that for performance reasons, the above scenario (which was
>> perfectly safe when using global rwlocks) was converted to use per-CPU rwlocks.
>>
>>
>>          CPU 0                                CPU 1
>>          ------                               ------
>>
>> 1.    spin_lock(&random_lock);             read_lock(my_rwlock of CPU 1);
>>
>>
>> 2.    read_lock(my_rwlock of CPU 0);       spin_lock(&random_lock);
>>
>>
>> Writer:
>>
>>          CPU 2:
>>          ------
>>
>>       for_each_online_cpu(cpu)
>>         write_lock(my_rwlock of 'cpu');
>>
>>
>> Consider what happens if the writer begins his operation in between steps 1
>> and 2 at the reader side. It becomes evident that we end up in a (previously
>> non-existent) deadlock due to a circular locking dependency between the 3
>> entities, like this:
>>
>>
>> (holds              Waiting for
>>  random_lock) CPU 0 -------------> CPU 2  (holds my_rwlock of CPU 0
>>                                                for write)
>>                ^                   |
>>                |                   |
>>         Waiting|                   | Waiting
>>           for  |                   |  for
>>                |                   V
>>                 ------ CPU 1 <------
>>
>>                 (holds my_rwlock of
>>                  CPU 1 for read)
>>
>>
>>
>> So obviously this "straight-forward" way of implementing percpu rwlocks is
>> deadlock-prone. One simple measure for (or characteristic of) safe percpu
>> rwlock should be that if a user replaces global rwlocks with per-CPU rwlocks
>> (for performance reasons), he shouldn't suddenly end up in numerous deadlock
>> possibilities which never existed before. The replacement should continue to
>> remain safe, and perhaps improve the performance.
>>
>> Observing the robustness of global rwlocks in providing a fair amount of
>> deadlock safety, we implement per-CPU rwlocks as nothing but global rwlocks,
>> as a first step.
>>
>>
>> Cc: David Howells <dhowells@redhat.com>
>> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
> 
> We got rid of brlock years ago, do we have to reintroduce it like this?
> The problem was that brlock caused starvation.
> 

Um? I still see it in include/linux/lglock.h and its users in fs/ directory.

BTW, I'm not advocating that everybody start converting their global reader-writer
locks to per-cpu rwlocks, because such a conversion probably won't make sense
in all scenarios.

The thing is, for CPU hotplug in particular, the "preempt_disable() at the reader;
stop_machine() at the writer" scheme had some very desirable properties at the
reader side (even though people might hate stop_machine() with all their
heart ;-)), namely : 

At the reader side:

o No need to hold locks to prevent CPU offline
o Extremely fast/optimized updates (the preempt count)
o No need for heavy memory barriers
o Extremely flexible nesting rules

So this made perfect sense at the reader for CPU hotplug, because it is expected
that CPU hotplug operations are very infrequent, and it is well-known that quite
a few atomic hotplug readers are in very hot paths. The problem was that the
stop_machine() at the writer was not only a little too heavy, but also inflicted
real-time latencies on the system because it needed cooperation from _all_ CPUs
synchronously, to take one CPU down.

So the idea is to get rid of stop_machine() without hurting the reader side.
And this scheme of per-cpu rwlocks comes close to ensuring that. (You can look
at the previous versions of this patchset [links given in cover letter] to see
what other schemes we hashed out before coming to this one).

The only reason I exposed this as a generic locking scheme was because Tejun
pointed out that, complex locking schemes implemented in individual subsystems
is not such a good idea. And also this comes at a time when per-cpu rwsemaphores
have just been introduced in the kernel and Oleg had ideas about converting the
cpu hotplug (sleepable) locking to use them.

Regards,
Srivatsa S. Bhat

^ permalink raw reply

* Re: [PATCH v5 01/45] percpu_rwlock: Introduce the global reader-writer lock backend
From: Steven Rostedt @ 2013-01-22 19:32 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: linux-doc, peterz, fweisbec, linux-kernel, mingo, linux-arch,
	linux, xiaoguangrong, wangyun, paulmck, nikunj, linux-pm, rusty,
	rjw, namhyung, tglx, linux-arm-kernel, netdev, oleg, sbw, tj,
	akpm, linuxppc-dev
In-Reply-To: <20130122073315.13822.27093.stgit@srivatsabhat.in.ibm.com>

On Tue, 2013-01-22 at 13:03 +0530, Srivatsa S. Bhat wrote:
> A straight-forward (and obvious) algorithm to implement Per-CPU Reader-Writer
> locks can also lead to too many deadlock possibilities which can make it very
> hard/impossible to use. This is explained in the example below, which helps
> justify the need for a different algorithm to implement flexible Per-CPU
> Reader-Writer locks.
> 
> We can use global rwlocks as shown below safely, without fear of deadlocks:
> 
> Readers:
> 
>          CPU 0                                CPU 1
>          ------                               ------
> 
> 1.    spin_lock(&random_lock);             read_lock(&my_rwlock);
> 
> 
> 2.    read_lock(&my_rwlock);               spin_lock(&random_lock);
> 
> 
> Writer:
> 
>          CPU 2:
>          ------
> 
>        write_lock(&my_rwlock);
> 

I thought global locks are now fair. That is, a reader will block if a
writer is waiting. Hence, the above should deadlock on the current
rwlock_t types.

We need to fix those locations (or better yet, remove all rwlocks ;-)

-- Steve

^ permalink raw reply

* Re: [PATCH v5 01/45] percpu_rwlock: Introduce the global reader-writer lock backend
From: Stephen Hemminger @ 2013-01-22 18:45 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: linux-doc, peterz, fweisbec, linux-kernel, mingo, linux-arch,
	linux, xiaoguangrong, wangyun, paulmck, nikunj, linux-pm, rusty,
	rostedt, rjw, namhyung, tglx, linux-arm-kernel, netdev, oleg, sbw,
	tj, akpm, linuxppc-dev
In-Reply-To: <20130122073315.13822.27093.stgit@srivatsabhat.in.ibm.com>

On Tue, 22 Jan 2013 13:03:22 +0530
"Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> wrote:

> A straight-forward (and obvious) algorithm to implement Per-CPU Reader-Writer
> locks can also lead to too many deadlock possibilities which can make it very
> hard/impossible to use. This is explained in the example below, which helps
> justify the need for a different algorithm to implement flexible Per-CPU
> Reader-Writer locks.
> 
> We can use global rwlocks as shown below safely, without fear of deadlocks:
> 
> Readers:
> 
>          CPU 0                                CPU 1
>          ------                               ------
> 
> 1.    spin_lock(&random_lock);             read_lock(&my_rwlock);
> 
> 
> 2.    read_lock(&my_rwlock);               spin_lock(&random_lock);
> 
> 
> Writer:
> 
>          CPU 2:
>          ------
> 
>        write_lock(&my_rwlock);
> 
> 
> We can observe that there is no possibility of deadlocks or circular locking
> dependencies here. Its perfectly safe.
> 
> Now consider a blind/straight-forward conversion of global rwlocks to per-CPU
> rwlocks like this:
> 
> The reader locks its own per-CPU rwlock for read, and proceeds.
> 
> Something like: read_lock(per-cpu rwlock of this cpu);
> 
> The writer acquires all per-CPU rwlocks for write and only then proceeds.
> 
> Something like:
> 
>   for_each_online_cpu(cpu)
> 	write_lock(per-cpu rwlock of 'cpu');
> 
> 
> Now let's say that for performance reasons, the above scenario (which was
> perfectly safe when using global rwlocks) was converted to use per-CPU rwlocks.
> 
> 
>          CPU 0                                CPU 1
>          ------                               ------
> 
> 1.    spin_lock(&random_lock);             read_lock(my_rwlock of CPU 1);
> 
> 
> 2.    read_lock(my_rwlock of CPU 0);       spin_lock(&random_lock);
> 
> 
> Writer:
> 
>          CPU 2:
>          ------
> 
>       for_each_online_cpu(cpu)
>         write_lock(my_rwlock of 'cpu');
> 
> 
> Consider what happens if the writer begins his operation in between steps 1
> and 2 at the reader side. It becomes evident that we end up in a (previously
> non-existent) deadlock due to a circular locking dependency between the 3
> entities, like this:
> 
> 
> (holds              Waiting for
>  random_lock) CPU 0 -------------> CPU 2  (holds my_rwlock of CPU 0
>                                                for write)
>                ^                   |
>                |                   |
>         Waiting|                   | Waiting
>           for  |                   |  for
>                |                   V
>                 ------ CPU 1 <------
> 
>                 (holds my_rwlock of
>                  CPU 1 for read)
> 
> 
> 
> So obviously this "straight-forward" way of implementing percpu rwlocks is
> deadlock-prone. One simple measure for (or characteristic of) safe percpu
> rwlock should be that if a user replaces global rwlocks with per-CPU rwlocks
> (for performance reasons), he shouldn't suddenly end up in numerous deadlock
> possibilities which never existed before. The replacement should continue to
> remain safe, and perhaps improve the performance.
> 
> Observing the robustness of global rwlocks in providing a fair amount of
> deadlock safety, we implement per-CPU rwlocks as nothing but global rwlocks,
> as a first step.
> 
> 
> Cc: David Howells <dhowells@redhat.com>
> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>

We got rid of brlock years ago, do we have to reintroduce it like this?
The problem was that brlock caused starvation.

^ permalink raw reply

* Re: [PATCH 6/6][v3] perf: Document the ABI of perf sysfs entries
From: Jiri Olsa @ 2013-01-22 17:10 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Andi Kleen, Peter Zijlstra, robert.richter, Greg KH,
	Anton Blanchard, linux-kernel, Stephane Eranian, linuxppc-dev,
	Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo
In-Reply-To: <20130118174654.GA12575@us.ibm.com>

On Fri, Jan 18, 2013 at 09:46:54AM -0800, Sukadev Bhattiprolu wrote:
> Jiri Olsa [jolsa@redhat.com] wrote:

SNIP

> +
> +Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
> +		Linux Powerpc mailing list <linuxppc-dev@ozlabs.org>
> +
> +Description:	POWER-systems specific performance monitoring events
> +
> +		A collection of performance monitoring events that may be
> +		supported by the POWER CPU. These events can be monitored
> +		using the 'perf(1)' tool.
> +
> +		These events may not be supported by other CPUs.
> +
> +		The contents of each file would look like:
> +
> +			event=0xNNNN
> +
> +		where 'N' is a hex digit and the number '0xNNNN' shows the
> +		"raw code" for the perf event identified by the file's
> +		"basename".
> +
> +		Further, multiple terms like 'event=0xNNNN' can be specified
> +		and separated with comma. All available terms are defined in
> +		the /sys/bus/event_source/devices/<dev>/format file.

Acked-by: Jiri Olsa <jolsa@redhat.com>

thanks,
jirka

^ permalink raw reply

* Re: [PATCH] perf: Fix compile warnings in tests/attr.c
From: Jiri Olsa @ 2013-01-22 13:57 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: linuxppc-dev, Anton Blanchard, paulus, linux-kernel, acme
In-Reply-To: <20130121213823.GA4774@us.ibm.com>

On Mon, Jan 21, 2013 at 01:38:23PM -0800, Sukadev Bhattiprolu wrote:
> Jiri Olsa [jolsa@redhat.com] wrote:
> | On Fri, Jan 18, 2013 at 05:30:52PM -0800, Sukadev Bhattiprolu wrote:

SNIP

> __u64 is 'unsigned long long' on x86 and PRIu64 is 'llu' which is fine.
> 
> __u64 is 'unsigned long' on Power and PRIu64 is 'lu' which is again fine.
> 
> But __u64 is 'unsigned long long' on x86_64, but PRIu64 is '%lu' bc __WORDSIZE
> is 64.
> 
> On x86_64, shouldn't __u64, be defined as 'unsigned long' rather than
> 'unsigned long long' - ie include 'int-l64.h' rather than 'int-ll64.h' ?

hum, not sure ;-) will try to find some time to look on that

> 
> BTW, does 'perf' with my patch compile, (with warnings) for you on x86_64
> with 'WERROR=0 make' ?

this one passes with warnings

jirka

^ permalink raw reply

* [PATCH Bug fix 4/4] Rename movablecore_map to movablemem_map.
From: Tang Chen @ 2013-01-22 11:46 UTC (permalink / raw)
  To: akpm, rientjes, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, hpa, linfeng,
	laijs, mgorman, yinghai, glommer, jiang.liu, julian.calaby, sfr
  Cc: linux-mm, x86, linuxppc-dev, linux-kernel, linux-acpi
In-Reply-To: <1358855181-6160-1-git-send-email-tangchen@cn.fujitsu.com>

Since "core" could be confused with cpu cores, but here it is memory,
so rename the boot option movablecore_map to movablemem_map.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 Documentation/kernel-parameters.txt |    8 ++--
 include/linux/memblock.h            |    2 +-
 include/linux/mm.h                  |    8 ++--
 mm/memblock.c                       |    8 ++--
 mm/page_alloc.c                     |   96 +++++++++++++++++-----------------
 5 files changed, 61 insertions(+), 61 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index f02aa4c..7770611 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1637,7 +1637,7 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			that the amount of memory usable for all allocations
 			is not too small.
 
-	movablecore_map=nn[KMG]@ss[KMG]
+	movablemem_map=nn[KMG]@ss[KMG]
 			[KNL,X86,IA-64,PPC] This parameter is similar to
 			memmap except it specifies the memory map of
 			ZONE_MOVABLE.
@@ -1647,11 +1647,11 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			ss to the end of the 1st node will be ZONE_MOVABLE,
 			and all the rest nodes will only have ZONE_MOVABLE.
 			If memmap is specified at the same time, the
-			movablecore_map will be limited within the memmap
+			movablemem_map will be limited within the memmap
 			areas. If kernelcore or movablecore is also specified,
-			movablecore_map will have higher priority to be
+			movablemem_map will have higher priority to be
 			satisfied. So the administrator should be careful that
-			the amount of movablecore_map areas are not too large.
+			the amount of movablemem_map areas are not too large.
 			Otherwise kernel won't have enough memory to start.
 
 	MTD_Partition=	[MTD]
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index ac52bbc..1094952 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -60,7 +60,7 @@ int memblock_reserve(phys_addr_t base, phys_addr_t size);
 void memblock_trim_memory(phys_addr_t align);
 
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
-extern struct movablecore_map movablecore_map;
+extern struct movablemem_map movablemem_map;
 
 void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
 			  unsigned long *out_end_pfn, int *out_nid);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1559e35..7cef651 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1359,15 +1359,15 @@ extern void free_bootmem_with_active_regions(int nid,
 						unsigned long max_low_pfn);
 extern void sparse_memory_present_with_active_regions(int nid);
 
-#define MOVABLECORE_MAP_MAX MAX_NUMNODES
-struct movablecore_entry {
+#define MOVABLEMEM_MAP_MAX MAX_NUMNODES
+struct movablemem_entry {
 	unsigned long start_pfn;    /* start pfn of memory segment */
 	unsigned long end_pfn;      /* end pfn of memory segment (exclusive) */
 };
 
-struct movablecore_map {
+struct movablemem_map {
 	int nr_map;
-	struct movablecore_entry map[MOVABLECORE_MAP_MAX];
+	struct movablemem_entry map[MOVABLEMEM_MAP_MAX];
 };
 
 #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
diff --git a/mm/memblock.c b/mm/memblock.c
index 0218231..c47ddd5 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -105,7 +105,7 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
 {
 	phys_addr_t this_start, this_end, cand;
 	u64 i;
-	int curr = movablecore_map.nr_map - 1;
+	int curr = movablemem_map.nr_map - 1;
 
 	/* pump up @end */
 	if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
@@ -124,15 +124,15 @@ restart:
 			continue;
 
 		for (; curr >= 0; curr--) {
-			if ((movablecore_map.map[curr].start_pfn << PAGE_SHIFT)
+			if ((movablemem_map.map[curr].start_pfn << PAGE_SHIFT)
 			    < this_end)
 				break;
 		}
 
 		cand = round_down(this_end - size, align);
 		if (curr >= 0 &&
-		    cand < movablecore_map.map[curr].end_pfn << PAGE_SHIFT) {
-			this_end = movablecore_map.map[curr].start_pfn
+		    cand < movablemem_map.map[curr].end_pfn << PAGE_SHIFT) {
+			this_end = movablemem_map.map[curr].start_pfn
 				   << PAGE_SHIFT;
 			goto restart;
 		}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2bd529e..3978797 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -202,7 +202,7 @@ static unsigned long __meminitdata dma_reserve;
 
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 /* Movable memory ranges, will also be used by memblock subsystem. */
-struct movablecore_map movablecore_map;
+struct movablemem_map movablemem_map;
 
 static unsigned long __meminitdata arch_zone_lowest_possible_pfn[MAX_NR_ZONES];
 static unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
@@ -4375,7 +4375,7 @@ static unsigned long __meminit zone_absent_pages_in_node(int nid,
  * sanitize_zone_movable_limit() - Sanitize the zone_movable_limit array.
  *
  * zone_movable_limit is initialized as 0. This function will try to get
- * the first ZONE_MOVABLE pfn of each node from movablecore_map, and
+ * the first ZONE_MOVABLE pfn of each node from movablemem_map, and
  * assigne them to zone_movable_limit.
  * zone_movable_limit[nid] == 0 means no limit for the node.
  *
@@ -4386,7 +4386,7 @@ static void __meminit sanitize_zone_movable_limit(void)
 	int map_pos = 0, i, nid;
 	unsigned long start_pfn, end_pfn;
 
-	if (!movablecore_map.nr_map)
+	if (!movablemem_map.nr_map)
 		return;
 
 	/* Iterate all ranges from minimum to maximum */
@@ -4420,22 +4420,22 @@ static void __meminit sanitize_zone_movable_limit(void)
 		if (start_pfn >= end_pfn)
 			continue;
 
-		while (map_pos < movablecore_map.nr_map) {
-			if (end_pfn <= movablecore_map.map[map_pos].start_pfn)
+		while (map_pos < movablemem_map.nr_map) {
+			if (end_pfn <= movablemem_map.map[map_pos].start_pfn)
 				break;
 
-			if (start_pfn >= movablecore_map.map[map_pos].end_pfn) {
+			if (start_pfn >= movablemem_map.map[map_pos].end_pfn) {
 				map_pos++;
 				continue;
 			}
 
 			/*
 			 * The start_pfn of ZONE_MOVABLE is either the minimum
-			 * pfn specified by movablecore_map, or 0, which means
+			 * pfn specified by movablemem_map, or 0, which means
 			 * the node has no ZONE_MOVABLE.
 			 */
 			zone_movable_limit[nid] = max(start_pfn,
-					movablecore_map.map[map_pos].start_pfn);
+					movablemem_map.map[map_pos].start_pfn);
 
 			break;
 		}
@@ -4898,12 +4898,12 @@ static void __init find_zone_movable_pfns_for_nodes(void)
 	}
 
 	/*
-	 * If neither kernelcore/movablecore nor movablecore_map is specified,
-	 * there is no ZONE_MOVABLE. But if movablecore_map is specified, the
+	 * If neither kernelcore/movablecore nor movablemem_map is specified,
+	 * there is no ZONE_MOVABLE. But if movablemem_map is specified, the
 	 * start pfn of ZONE_MOVABLE has been stored in zone_movable_limit[].
 	 */
 	if (!required_kernelcore) {
-		if (movablecore_map.nr_map)
+		if (movablemem_map.nr_map)
 			memcpy(zone_movable_pfn, zone_movable_limit,
 				sizeof(zone_movable_pfn));
 		goto out;
@@ -5168,14 +5168,14 @@ early_param("kernelcore", cmdline_parse_kernelcore);
 early_param("movablecore", cmdline_parse_movablecore);
 
 /**
- * insert_movablecore_map() - Insert a memory range in to movablecore_map.map.
+ * insert_movablemem_map() - Insert a memory range in to movablemem_map.map.
  * @start_pfn:	start pfn of the range
  * @end_pfn:	end pfn of the range
  *
  * This function will also merge the overlapped ranges, and sort the array
  * by start_pfn in monotonic increasing order.
  */
-static void __init insert_movablecore_map(unsigned long start_pfn,
+static void __init insert_movablemem_map(unsigned long start_pfn,
 					  unsigned long end_pfn)
 {
 	int pos, overlap;
@@ -5184,31 +5184,31 @@ static void __init insert_movablecore_map(unsigned long start_pfn,
 	 * pos will be at the 1st overlapped range, or the position
 	 * where the element should be inserted.
 	 */
-	for (pos = 0; pos < movablecore_map.nr_map; pos++)
-		if (start_pfn <= movablecore_map.map[pos].end_pfn)
+	for (pos = 0; pos < movablemem_map.nr_map; pos++)
+		if (start_pfn <= movablemem_map.map[pos].end_pfn)
 			break;
 
 	/* If there is no overlapped range, just insert the element. */
-	if (pos == movablecore_map.nr_map ||
-	    end_pfn < movablecore_map.map[pos].start_pfn) {
+	if (pos == movablemem_map.nr_map ||
+	    end_pfn < movablemem_map.map[pos].start_pfn) {
 		/*
 		 * If pos is not the end of array, we need to move all
 		 * the rest elements backward.
 		 */
-		if (pos < movablecore_map.nr_map)
-			memmove(&movablecore_map.map[pos+1],
-				&movablecore_map.map[pos],
-				sizeof(struct movablecore_entry) *
-				(movablecore_map.nr_map - pos));
-		movablecore_map.map[pos].start_pfn = start_pfn;
-		movablecore_map.map[pos].end_pfn = end_pfn;
-		movablecore_map.nr_map++;
+		if (pos < movablemem_map.nr_map)
+			memmove(&movablemem_map.map[pos+1],
+				&movablemem_map.map[pos],
+				sizeof(struct movablemem_entry) *
+				(movablemem_map.nr_map - pos));
+		movablemem_map.map[pos].start_pfn = start_pfn;
+		movablemem_map.map[pos].end_pfn = end_pfn;
+		movablemem_map.nr_map++;
 		return;
 	}
 
 	/* overlap will be at the last overlapped range */
-	for (overlap = pos + 1; overlap < movablecore_map.nr_map; overlap++)
-		if (end_pfn < movablecore_map.map[overlap].start_pfn)
+	for (overlap = pos + 1; overlap < movablemem_map.nr_map; overlap++)
+		if (end_pfn < movablemem_map.map[overlap].start_pfn)
 			break;
 
 	/*
@@ -5216,29 +5216,29 @@ static void __init insert_movablecore_map(unsigned long start_pfn,
 	 * and move the rest elements forward.
 	 */
 	overlap--;
-	movablecore_map.map[pos].start_pfn = min(start_pfn,
-					movablecore_map.map[pos].start_pfn);
-	movablecore_map.map[pos].end_pfn = max(end_pfn,
-					movablecore_map.map[overlap].end_pfn);
+	movablemem_map.map[pos].start_pfn = min(start_pfn,
+					movablemem_map.map[pos].start_pfn);
+	movablemem_map.map[pos].end_pfn = max(end_pfn,
+					movablemem_map.map[overlap].end_pfn);
 
-	if (pos != overlap && overlap + 1 != movablecore_map.nr_map)
-		memmove(&movablecore_map.map[pos+1],
-			&movablecore_map.map[overlap+1],
-			sizeof(struct movablecore_entry) *
-			(movablecore_map.nr_map - overlap - 1));
+	if (pos != overlap && overlap + 1 != movablemem_map.nr_map)
+		memmove(&movablemem_map.map[pos+1],
+			&movablemem_map.map[overlap+1],
+			sizeof(struct movablemem_entry) *
+			(movablemem_map.nr_map - overlap - 1));
 
-	movablecore_map.nr_map -= overlap - pos;
+	movablemem_map.nr_map -= overlap - pos;
 }
 
 /**
- * movablecore_map_add_region() - Add a memory range into movablecore_map.
+ * movablemem_map_add_region() - Add a memory range into movablemem_map.
  * @start:	physical start address of range
  * @end:	physical end address of range
  *
  * This function transform the physical address into pfn, and then add the
- * range into movablecore_map by calling insert_movablecore_map().
+ * range into movablemem_map by calling insert_movablemem_map().
  */
-static void __init movablecore_map_add_region(u64 start, u64 size)
+static void __init movablemem_map_add_region(u64 start, u64 size)
 {
 	unsigned long start_pfn, end_pfn;
 
@@ -5246,8 +5246,8 @@ static void __init movablecore_map_add_region(u64 start, u64 size)
 	if (start + size <= start)
 		return;
 
-	if (movablecore_map.nr_map >= ARRAY_SIZE(movablecore_map.map)) {
-		pr_err("movable_memory_map: too many entries;"
+	if (movablemem_map.nr_map >= ARRAY_SIZE(movablemem_map.map)) {
+		pr_err("movablemem_map: too many entries;"
 			" ignoring [mem %#010llx-%#010llx]\n",
 			(unsigned long long) start,
 			(unsigned long long) (start + size - 1));
@@ -5256,19 +5256,19 @@ static void __init movablecore_map_add_region(u64 start, u64 size)
 
 	start_pfn = PFN_DOWN(start);
 	end_pfn = PFN_UP(start + size);
-	insert_movablecore_map(start_pfn, end_pfn);
+	insert_movablemem_map(start_pfn, end_pfn);
 }
 
 /*
- * cmdline_parse_movablecore_map() - Parse boot option movablecore_map.
+ * cmdline_parse_movablemem_map() - Parse boot option movablemem_map.
  * @p:	The boot option of the following format:
- * 	movablecore_map=nn[KMG]@ss[KMG]
+ * 	movablemem_map=nn[KMG]@ss[KMG]
  *
  * This option sets the memory range [ss, ss+nn) to be used as movable memory.
  *
  * Return: 0 on success or -EINVAL on failure.
  */
-static int __init cmdline_parse_movablecore_map(char *p)
+static int __init cmdline_parse_movablemem_map(char *p)
 {
 	char *oldp;
 	u64 start_at, mem_size;
@@ -5287,13 +5287,13 @@ static int __init cmdline_parse_movablecore_map(char *p)
 		if (p == oldp || *p != '\0')
 			goto err;
 
-		movablecore_map_add_region(start_at, mem_size);
+		movablemem_map_add_region(start_at, mem_size);
 		return 0;
 	}
 err:
 	return -EINVAL;
 }
-early_param("movablecore_map", cmdline_parse_movablecore_map);
+early_param("movablemem_map", cmdline_parse_movablemem_map);
 
 #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
 
-- 
1.7.1

^ permalink raw reply related

* [PATCH Bug fix 2/4] Bug fix: Fix the doc format.
From: Tang Chen @ 2013-01-22 11:46 UTC (permalink / raw)
  To: akpm, rientjes, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, hpa, linfeng,
	laijs, mgorman, yinghai, glommer, jiang.liu, julian.calaby, sfr
  Cc: linux-mm, x86, linuxppc-dev, linux-kernel, linux-acpi
In-Reply-To: <1358855181-6160-1-git-send-email-tangchen@cn.fujitsu.com>

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 mm/page_alloc.c |   23 ++++++++++++++---------
 1 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 00037a3..cd6f8a6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4372,7 +4372,7 @@ static unsigned long __meminit zone_absent_pages_in_node(int nid,
 }
 
 /**
- * sanitize_zone_movable_limit - Sanitize the zone_movable_limit array.
+ * sanitize_zone_movable_limit() - Sanitize the zone_movable_limit array.
  *
  * zone_movable_limit is initialized as 0. This function will try to get
  * the first ZONE_MOVABLE pfn of each node from movablecore_map, and
@@ -5173,9 +5173,9 @@ early_param("kernelcore", cmdline_parse_kernelcore);
 early_param("movablecore", cmdline_parse_movablecore);
 
 /**
- * insert_movablecore_map - Insert a memory range in to movablecore_map.map.
- * @start_pfn: start pfn of the range
- * @end_pfn: end pfn of the range
+ * insert_movablecore_map() - Insert a memory range in to movablecore_map.map.
+ * @start_pfn:	start pfn of the range
+ * @end_pfn:	end pfn of the range
  *
  * This function will also merge the overlapped ranges, and sort the array
  * by start_pfn in monotonic increasing order.
@@ -5236,9 +5236,9 @@ static void __init insert_movablecore_map(unsigned long start_pfn,
 }
 
 /**
- * movablecore_map_add_region - Add a memory range into movablecore_map.
- * @start: physical start address of range
- * @end: physical end address of range
+ * movablecore_map_add_region() - Add a memory range into movablecore_map.
+ * @start:	physical start address of range
+ * @end:	physical end address of range
  *
  * This function transform the physical address into pfn, and then add the
  * range into movablecore_map by calling insert_movablecore_map().
@@ -5265,8 +5265,13 @@ static void __init movablecore_map_add_region(u64 start, u64 size)
 }
 
 /*
- * movablecore_map=nn[KMG]@ss[KMG] sets the region of memory to be used as
- * movable memory.
+ * cmdline_parse_movablecore_map() - Parse boot option movablecore_map.
+ * @p:	The boot option of the following format:
+ * 	movablecore_map=nn[KMG]@ss[KMG]
+ *
+ * This option sets the memory range [ss, ss+nn) to be used as movable memory.
+ *
+ * Return: 0 on success or -EINVAL on failure.
  */
 static int __init cmdline_parse_movablecore_map(char *p)
 {
-- 
1.7.1

^ permalink raw reply related

* [PATCH Bug fix 3/4] Bug fix: Remove the unused sanitize_zone_movable_limit() definition.
From: Tang Chen @ 2013-01-22 11:46 UTC (permalink / raw)
  To: akpm, rientjes, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, hpa, linfeng,
	laijs, mgorman, yinghai, glommer, jiang.liu, julian.calaby, sfr
  Cc: linux-mm, x86, linuxppc-dev, linux-kernel, linux-acpi
In-Reply-To: <1358855181-6160-1-git-send-email-tangchen@cn.fujitsu.com>

When CONFIG_HAVE_MEMBLOCK_NODE_MAP is not defined, sanitize_zone_movable_limit()
is also not used. So remove it.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 mm/page_alloc.c |    5 -----
 1 files changed, 0 insertions(+), 5 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cd6f8a6..2bd529e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4459,11 +4459,6 @@ static inline unsigned long __meminit zone_absent_pages_in_node(int nid,
 
 	return zholes_size[zone_type];
 }
-
-static void __meminit sanitize_zone_movable_limit(void)
-{
-}
-
 #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
 
 static void __meminit calculate_node_totalpages(struct pglist_data *pgdat,
-- 
1.7.1

^ permalink raw reply related

* [PATCH Bug fix 1/4] Bug fix: Use CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect movablecore_map in memblock_overlaps_region().
From: Tang Chen @ 2013-01-22 11:46 UTC (permalink / raw)
  To: akpm, rientjes, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, hpa, linfeng,
	laijs, mgorman, yinghai, glommer, jiang.liu, julian.calaby, sfr
  Cc: linux-mm, x86, linuxppc-dev, linux-kernel, linux-acpi
In-Reply-To: <1358855181-6160-1-git-send-email-tangchen@cn.fujitsu.com>

The definition of struct movablecore_map is protected by
CONFIG_HAVE_MEMBLOCK_NODE_MAP but its use in memblock_overlaps_region()
is not. So add CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect the use of
movablecore_map in memblock_overlaps_region().

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 include/linux/memblock.h |    3 ++-
 mm/memblock.c            |   34 ++++++++++++++++++++++++++++++++++
 2 files changed, 36 insertions(+), 1 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 6e25597..ac52bbc 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -42,7 +42,6 @@ struct memblock {
 
 extern struct memblock memblock;
 extern int memblock_debug;
-extern struct movablecore_map movablecore_map;
 
 #define memblock_dbg(fmt, ...) \
 	if (memblock_debug) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
@@ -61,6 +60,8 @@ int memblock_reserve(phys_addr_t base, phys_addr_t size);
 void memblock_trim_memory(phys_addr_t align);
 
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
+extern struct movablecore_map movablecore_map;
+
 void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
 			  unsigned long *out_end_pfn, int *out_nid);
 
diff --git a/mm/memblock.c b/mm/memblock.c
index 1e48774..0218231 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -92,9 +92,13 @@ static long __init_memblock memblock_overlaps_region(struct memblock_type *type,
  *
  * Find @size free area aligned to @align in the specified range and node.
  *
+ * If we have CONFIG_HAVE_MEMBLOCK_NODE_MAP defined, we need to check if the
+ * memory we found if not in hotpluggable ranges.
+ *
  * RETURNS:
  * Found address on success, %0 on failure.
  */
+#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
 					phys_addr_t end, phys_addr_t size,
 					phys_addr_t align, int nid)
@@ -139,6 +143,36 @@ restart:
 
 	return 0;
 }
+#else /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
+phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
+					phys_addr_t end, phys_addr_t size,
+					phys_addr_t align, int nid)
+{
+	phys_addr_t this_start, this_end, cand;
+	u64 i;
+
+	/* pump up @end */
+	if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
+		end = memblock.current_limit;
+
+	/* avoid allocating the first page */
+	start = max_t(phys_addr_t, start, PAGE_SIZE);
+	end = max(start, end);
+
+	for_each_free_mem_range_reverse(i, nid, &this_start, &this_end, NULL) {
+		this_start = clamp(this_start, start, end);
+		this_end = clamp(this_end, start, end);
+
+		if (this_end < size)
+			continue;
+
+		cand = round_down(this_end - size, align);
+		if (cand >= this_start)
+			return cand;
+	}
+	return 0;
+}
+#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
 
 /**
  * memblock_find_in_range - find free area in given range
-- 
1.7.1

^ permalink raw reply related

* [PATCH Bug fix 0/4] Bug fix for movablecore_map boot option.
From: Tang Chen @ 2013-01-22 11:46 UTC (permalink / raw)
  To: akpm, rientjes, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, hpa, linfeng,
	laijs, mgorman, yinghai, glommer, jiang.liu, julian.calaby, sfr
  Cc: linux-mm, x86, linuxppc-dev, linux-kernel, linux-acpi

Hi Andrew,

patch1 ~ patch3 fix some problems of movablecore_map boot option.
And since the name "core" could be confused, patch4 rename this option
to movablemem_map.

All these patches are based on the latest -mm tree.
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git akpm

Tang Chen (4):
  Bug fix: Use CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect movablecore_map
    in memblock_overlaps_region().
  Bug fix: Fix the doc format.
  Bug fix: Remove the unused sanitize_zone_movable_limit() definition.
  Rename movablecore_map to movablemem_map.

 Documentation/kernel-parameters.txt |    8 +-
 include/linux/memblock.h            |    3 +-
 include/linux/mm.h                  |    8 +-
 mm/memblock.c                       |   42 +++++++++++-
 mm/page_alloc.c                     |  116 +++++++++++++++++-----------------
 5 files changed, 106 insertions(+), 71 deletions(-)

^ permalink raw reply

* [PATCH Bug fix 4/5] cpu-hotplug, memory-hotplug: clear cpu_to_node() when offlining the node
From: Tang Chen @ 2013-01-22 11:45 UTC (permalink / raw)
  To: akpm, rjw, len.brown, mingo, tglx, minchan.kim, rientjes, benh,
	paulus, cl, kosaki.motohiro, isimatu.yasuaki, wujianguo, wency,
	hpa, linfeng, laijs, mgorman, yinghai, glommer, jiang.liu,
	julian.calaby, sfr
  Cc: linux-acpi, Mel Gorman, x86, linux-kernel, linux-mm,
	Peter Zijlstra, linuxppc-dev, Jiang Liu
In-Reply-To: <1358855156-6126-1-git-send-email-tangchen@cn.fujitsu.com>

From: Wen Congyang <wency@cn.fujitsu.com>

When the node is offlined, there is no memory/cpu on the node. If a
sleep task runs on a cpu of this node, it will be migrated to the
cpu on the other node. So we can clear cpu-to-node mapping.

Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 mm/memory_hotplug.c |   30 +++++++++++++++++++++++++++++-
 1 files changed, 29 insertions(+), 1 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index edd1773..022583b 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1702,6 +1702,34 @@ static int check_cpu_on_node(void *data)
 	return 0;
 }
 
+static void unmap_cpu_on_node(void *data)
+{
+#ifdef CONFIG_ACPI_NUMA
+	struct pglist_data *pgdat = data;
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		if (cpu_to_node(cpu) == pgdat->node_id)
+			numa_clear_node(cpu);
+#endif
+}
+
+static int check_and_unmap_cpu_on_node(void *data)
+{
+	int ret = check_cpu_on_node(data);
+
+	if (ret)
+		return ret;
+
+	/*
+	 * the node will be offlined when we come here, so we can clear
+	 * the cpu_to_node() now.
+	 */
+
+	unmap_cpu_on_node(data);
+	return 0;
+}
+
 /* offline the node if all memory sections of this node are removed */
 void try_offline_node(int nid)
 {
@@ -1728,7 +1756,7 @@ void try_offline_node(int nid)
 		return;
 	}
 
-	if (stop_machine(check_cpu_on_node, pgdat, NULL))
+	if (stop_machine(check_and_unmap_cpu_on_node, pgdat, NULL))
 		return;
 
 	/*
-- 
1.7.1

^ permalink raw reply related

* [PATCH Bug fix 5/5] Do not use cpu_to_node() to find an offlined cpu's node.
From: Tang Chen @ 2013-01-22 11:45 UTC (permalink / raw)
  To: akpm, rjw, len.brown, mingo, tglx, minchan.kim, rientjes, benh,
	paulus, cl, kosaki.motohiro, isimatu.yasuaki, wujianguo, wency,
	hpa, linfeng, laijs, mgorman, yinghai, glommer, jiang.liu,
	julian.calaby, sfr
  Cc: linux-acpi, Mel Gorman, x86, linux-kernel, linux-mm,
	Peter Zijlstra, linuxppc-dev, Jiang Liu
In-Reply-To: <1358855156-6126-1-git-send-email-tangchen@cn.fujitsu.com>

If a cpu is offline, its nid will be set to -1, and cpu_to_node(cpu) will
return -1. As a result, cpumask_of_node(nid) will return NULL. In this case,
find_next_bit() in for_each_cpu will get a NULL pointer and cause panic.

Here is a call trace:
[  609.824017] Call Trace:
[  609.824017]  <IRQ>
[  609.824017]  [<ffffffff810b0721>] select_fallback_rq+0x71/0x190
[  609.824017]  [<ffffffff810b086e>] ? try_to_wake_up+0x2e/0x2f0
[  609.824017]  [<ffffffff810b0b0b>] try_to_wake_up+0x2cb/0x2f0
[  609.824017]  [<ffffffff8109da08>] ? __run_hrtimer+0x78/0x320
[  609.824017]  [<ffffffff810b0b85>] wake_up_process+0x15/0x20
[  609.824017]  [<ffffffff8109ce62>] hrtimer_wakeup+0x22/0x30
[  609.824017]  [<ffffffff8109da13>] __run_hrtimer+0x83/0x320
[  609.824017]  [<ffffffff8109ce40>] ? update_rmtp+0x80/0x80
[  609.824017]  [<ffffffff8109df56>] hrtimer_interrupt+0x106/0x280
[  609.824017]  [<ffffffff810a72c8>] ? sd_free_ctl_entry+0x68/0x70
[  609.824017]  [<ffffffff8167cf39>] smp_apic_timer_interrupt+0x69/0x99
[  609.824017]  [<ffffffff8167be2f>] apic_timer_interrupt+0x6f/0x80

There is a hrtimer process sleeping, whose cpu has already been offlined.
When it is waken up, it tries to find another cpu to run, and get a -1 nid.
As a result, cpumask_of_node(-1) returns NULL, and causes ernel panic.

This patch fixes this problem by judging if the nid is -1.
If nid is not -1, a cpu on the same node will be picked.
Else, a online cpu on another node will be picked.

Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 kernel/sched/core.c |   28 +++++++++++++++++++---------
 1 files changed, 19 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 257002c..035ee9f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1132,18 +1132,28 @@ EXPORT_SYMBOL_GPL(kick_process);
  */
 static int select_fallback_rq(int cpu, struct task_struct *p)
 {
-	const struct cpumask *nodemask = cpumask_of_node(cpu_to_node(cpu));
+	int nid = cpu_to_node(cpu);
+	const struct cpumask *nodemask = NULL;
 	enum { cpuset, possible, fail } state = cpuset;
 	int dest_cpu;
 
-	/* Look for allowed, online CPU in same node. */
-	for_each_cpu(dest_cpu, nodemask) {
-		if (!cpu_online(dest_cpu))
-			continue;
-		if (!cpu_active(dest_cpu))
-			continue;
-		if (cpumask_test_cpu(dest_cpu, tsk_cpus_allowed(p)))
-			return dest_cpu;
+	/*
+	 * If the node that the cpu is on has been offlined, cpu_to_node()
+	 * will return -1. There is no cpu on the node, and we should
+	 * select the cpu on the other node.
+	 */
+	if (nid != -1) {
+		nodemask = cpumask_of_node(nid);
+
+		/* Look for allowed, online CPU in same node. */
+		for_each_cpu(dest_cpu, nodemask) {
+			if (!cpu_online(dest_cpu))
+				continue;
+			if (!cpu_active(dest_cpu))
+				continue;
+			if (cpumask_test_cpu(dest_cpu, tsk_cpus_allowed(p)))
+				return dest_cpu;
+		}
 	}
 
 	for (;;) {
-- 
1.7.1

^ permalink raw reply related

* [PATCH Bug fix 3/5] cpu-hotplug, memory-hotplug: try offline the node when hotremoving a cpu
From: Tang Chen @ 2013-01-22 11:45 UTC (permalink / raw)
  To: akpm, rjw, len.brown, mingo, tglx, minchan.kim, rientjes, benh,
	paulus, cl, kosaki.motohiro, isimatu.yasuaki, wujianguo, wency,
	hpa, linfeng, laijs, mgorman, yinghai, glommer, jiang.liu,
	julian.calaby, sfr
  Cc: linux-acpi, Mel Gorman, x86, linux-kernel, linux-mm,
	Peter Zijlstra, linuxppc-dev, Jiang Liu
In-Reply-To: <1358855156-6126-1-git-send-email-tangchen@cn.fujitsu.com>

From: Wen Congyang <wency@cn.fujitsu.com>

The node will be offlined when all memory/cpu on the node is hotremoved.
So we should try offline the node when hotremoving a cpu on the node.

Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 drivers/acpi/processor_driver.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/acpi/processor_driver.c b/drivers/acpi/processor_driver.c
index a24ee43..81745f1 100644
--- a/drivers/acpi/processor_driver.c
+++ b/drivers/acpi/processor_driver.c
@@ -45,6 +45,7 @@
 #include <linux/cpuidle.h>
 #include <linux/slab.h>
 #include <linux/acpi.h>
+#include <linux/memory_hotplug.h>
 
 #include <asm/io.h>
 #include <asm/cpu.h>
@@ -641,6 +642,7 @@ static int acpi_processor_remove(struct acpi_device *device, int type)
 
 	per_cpu(processors, pr->id) = NULL;
 	per_cpu(processor_device_array, pr->id) = NULL;
+	try_offline_node(cpu_to_node(pr->id));
 
 free:
 	free_cpumask_var(pr->throttling.shared_cpu_map);
-- 
1.7.1

^ permalink raw reply related

* [PATCH Bug fix 1/5] cpu_hotplug: clear apicid to node when the cpu is hotremoved
From: Tang Chen @ 2013-01-22 11:45 UTC (permalink / raw)
  To: akpm, rjw, len.brown, mingo, tglx, minchan.kim, rientjes, benh,
	paulus, cl, kosaki.motohiro, isimatu.yasuaki, wujianguo, wency,
	hpa, linfeng, laijs, mgorman, yinghai, glommer, jiang.liu,
	julian.calaby, sfr
  Cc: linux-acpi, Mel Gorman, x86, linux-kernel, linux-mm,
	Peter Zijlstra, linuxppc-dev, Jiang Liu
In-Reply-To: <1358855156-6126-1-git-send-email-tangchen@cn.fujitsu.com>

From: Wen Congyang <wency@cn.fujitsu.com>

When a cpu is hotpluged, we call acpi_map_cpu2node() in _acpi_map_lsapic()
to store the cpu's node and apicid's node. But we don't clear the cpu's node
in acpi_unmap_lsapic() when this cpu is hotremove. If the node is also
hotremoved, we will get the following messages:
[ 1646.771485] kernel BUG at include/linux/gfp.h:329!
[ 1646.828729] invalid opcode: 0000 [#1] SMP
[ 1646.877872] Modules linked in: ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat xt_CHECKSUM iptable_mangle bridge stp llc sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel microcode pcspkr i2c_i801 i2c_core lpc_ich mfd_core ioatdma e1000e i7core_edac edac_core sg acpi_memhotplug igb dca sd_mod crc_t10dif megaraid_sas mptsas mptscsih mptbase scsi_transport_sas scsi_mod
[ 1647.588773] Pid: 3126, comm: init Not tainted 3.6.0-rc3-tangchen-hostbridge+ #13 FUJITSU-SV PRIMEQUEST 1800E/SB
[ 1647.711545] RIP: 0010:[<ffffffff811bc3fd>]  [<ffffffff811bc3fd>] allocate_slab+0x28d/0x300
[ 1647.810492] RSP: 0018:ffff88078a049cf8  EFLAGS: 00010246
[ 1647.874028] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
[ 1647.959339] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000000000246
[ 1648.044659] RBP: ffff88078a049d38 R08: 00000000000040d0 R09: 0000000000000001
[ 1648.129953] R10: 0000000000000000 R11: 0000000000000b5f R12: 00000000000052d0
[ 1648.215259] R13: ffff8807c1417300 R14: 0000000000030038 R15: 0000000000000003
[ 1648.300572] FS:  00007fa9b1b44700(0000) GS:ffff8807c3800000(0000) knlGS:0000000000000000
[ 1648.397272] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1648.465985] CR2: 00007fa9b09acca0 CR3: 000000078b855000 CR4: 00000000000007e0
[ 1648.551265] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1648.636565] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1648.721838] Process init (pid: 3126, threadinfo ffff88078a048000, task ffff8807bb6f2650)
[ 1648.818534] Stack:
[ 1648.842548]  ffff8807c39d7fa0 ffffffff000040d0 00000000000000bb 00000000000080d0
[ 1648.931469]  ffff8807c1417300 ffff8807c39d7fa0 ffff8807c1417300 0000000000000001
[ 1649.020410]  ffff88078a049d88 ffffffff811bc4a0 ffff8807c1410c80 0000000000000000
[ 1649.109464] Call Trace:
[ 1649.138713]  [<ffffffff811bc4a0>] new_slab+0x30/0x1b0
[ 1649.199075]  [<ffffffff811bc978>] __slab_alloc+0x358/0x4c0
[ 1649.264683]  [<ffffffff810b71c0>] ? alloc_fair_sched_group+0xd0/0x1b0
[ 1649.341695]  [<ffffffff811be7d4>] kmem_cache_alloc_node_trace+0xb4/0x1e0
[ 1649.421824]  [<ffffffff8109d188>] ? hrtimer_init+0x48/0x100
[ 1649.488414]  [<ffffffff810b71c0>] ? alloc_fair_sched_group+0xd0/0x1b0
[ 1649.565402]  [<ffffffff810b71c0>] alloc_fair_sched_group+0xd0/0x1b0
[ 1649.640297]  [<ffffffff810a8bce>] sched_create_group+0x3e/0x110
[ 1649.711040]  [<ffffffff810bdbcd>] sched_autogroup_create_attach+0x4d/0x180
[ 1649.793260]  [<ffffffff81089614>] sys_setsid+0xd4/0xf0
[ 1649.854694]  [<ffffffff8167a029>] system_call_fastpath+0x16/0x1b
[ 1649.926483] Code: 89 c4 e9 73 fe ff ff 31 c0 89 de 48 c7 c7 45 de 9e 81 44 89 45 c8 e8 22 05 4b 00 85 db 44 8b 45 c8 0f 89 4f ff ff ff 0f 0b eb fe <0f> 0b 90 eb fd 0f 0b eb fe 89 de 48 c7 c7 45 de 9e 81 31 c0 44
[ 1650.161454] RIP  [<ffffffff811bc3fd>] allocate_slab+0x28d/0x300
[ 1650.232348]  RSP <ffff88078a049cf8>
[ 1650.274029] ---[ end trace adf84c90f3fea3e5 ]---

The reason is that: the cpu's node is not NUMA_NO_NODE, we will call
alloc_pages_exact_node() to alloc memory on the node, but the node
is offlined.

If the node is onlined, we still need cpu's node. For example:
a task on the cpu is sleeped when the cpu is hotremoved. We will
choose another cpu to run this task when it is waked up. If we
know the cpu's node, we will choose the cpu on the same node first.
So we should clear cpu-to-node mapping when the node is offlined.

This patch only clears apicid-to-node mapping when the cpu is hotremoved.

Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/x86/kernel/acpi/boot.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index bacf4b0..7d53833 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -697,6 +697,10 @@ EXPORT_SYMBOL(acpi_map_lsapic);
 
 int acpi_unmap_lsapic(int cpu)
 {
+#ifdef CONFIG_ACPI_NUMA
+	set_apicid_to_node(per_cpu(x86_cpu_to_apicid, cpu), NUMA_NO_NODE);
+#endif
+
 	per_cpu(x86_cpu_to_apicid, cpu) = -1;
 	set_cpu_present(cpu, false);
 	num_processors--;
-- 
1.7.1

^ permalink raw reply related

* [PATCH Bug fix 2/5] memory-hotplug: export the function try_offline_node()
From: Tang Chen @ 2013-01-22 11:45 UTC (permalink / raw)
  To: akpm, rjw, len.brown, mingo, tglx, minchan.kim, rientjes, benh,
	paulus, cl, kosaki.motohiro, isimatu.yasuaki, wujianguo, wency,
	hpa, linfeng, laijs, mgorman, yinghai, glommer, jiang.liu,
	julian.calaby, sfr
  Cc: linux-acpi, Mel Gorman, x86, linux-kernel, linux-mm,
	Peter Zijlstra, linuxppc-dev, Jiang Liu
In-Reply-To: <1358855156-6126-1-git-send-email-tangchen@cn.fujitsu.com>

From: Wen Congyang <wency@cn.fujitsu.com>

The node will be offlined when all memory/cpu on the node
have been hotremoved. So we need the function try_offline_node()
in cpu-hotplug path.

If the memory-hotplug is disabled, and cpu-hotplug is enabled
1. no memory no the node
   we don't online the node, and cpu's node is the nearest node.
2. the node contains some memory
   the node has been onlined, and cpu's node is still needed
   to migrate the sleep task on the cpu to the same node.
So we do nothing in try_offline_node() in this case.

Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 include/linux/memory_hotplug.h |    2 ++
 mm/memory_hotplug.c            |    3 ++-
 2 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 69903cc..0b2878e 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -193,6 +193,7 @@ extern void get_page_bootmem(unsigned long ingo, struct page *page,
 
 void lock_memory_hotplug(void);
 void unlock_memory_hotplug(void);
+extern void try_offline_node(int nid);
 
 #else /* ! CONFIG_MEMORY_HOTPLUG */
 /*
@@ -227,6 +228,7 @@ static inline void register_page_bootmem_info_node(struct pglist_data *pgdat)
 
 static inline void lock_memory_hotplug(void) {}
 static inline void unlock_memory_hotplug(void) {}
+static inline void try_offline_node(int nid) {}
 
 #endif /* ! CONFIG_MEMORY_HOTPLUG */
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index f0dc9ad..edd1773 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1703,7 +1703,7 @@ static int check_cpu_on_node(void *data)
 }
 
 /* offline the node if all memory sections of this node are removed */
-static void try_offline_node(int nid)
+void try_offline_node(int nid)
 {
 	pg_data_t *pgdat = NODE_DATA(nid);
 	unsigned long start_pfn = pgdat->node_start_pfn;
@@ -1759,6 +1759,7 @@ static void try_offline_node(int nid)
 	 */
 	memset(pgdat, 0, sizeof(*pgdat));
 }
+EXPORT_SYMBOL(try_offline_node);
 
 int __ref remove_memory(int nid, u64 start, u64 size)
 {
-- 
1.7.1

^ permalink raw reply related

* [PATCH Bug fix 0/5] Bug fix for node offline
From: Tang Chen @ 2013-01-22 11:45 UTC (permalink / raw)
  To: akpm, rjw, len.brown, mingo, tglx, minchan.kim, rientjes, benh,
	paulus, cl, kosaki.motohiro, isimatu.yasuaki, wujianguo, wency,
	hpa, linfeng, laijs, mgorman, yinghai, glommer, jiang.liu,
	julian.calaby, sfr
  Cc: linux-mm, x86, linuxppc-dev, linux-kernel, linux-acpi

Based on physical memory hot-remove functionality, we can implement
node hot-remove. But there are some problems in cpu driver when offlining
a node. This patch-set will fix them.

All these patches are based on the latest -mm tree.
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git akpm

Tang Chen (1):
  Do not use cpu_to_node() to find an offlined cpu's node.

Wen Congyang (4):
  cpu_hotplug: clear apicid to node when the cpu is hotremoved
  memory-hotplug: export the function try_offline_node()
  cpu-hotplug, memory-hotplug: try offline the node when hotremoving a
    cpu
  cpu-hotplug,memory-hotplug: clear cpu_to_node() when offlining the
    node

 arch/x86/kernel/acpi/boot.c     |    4 ++++
 drivers/acpi/processor_driver.c |    2 ++
 include/linux/memory_hotplug.h  |    2 ++
 kernel/sched/core.c             |   28 +++++++++++++++++++---------
 mm/memory_hotplug.c             |   33 +++++++++++++++++++++++++++++++--
 5 files changed, 58 insertions(+), 11 deletions(-)

^ permalink raw reply

* [PATCH Bug fix 5/5] Bug fix: Fix the doc format in drivers/firmware/memmap.c
From: Tang Chen @ 2013-01-22 11:43 UTC (permalink / raw)
  To: akpm, rientjes, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, hpa, linfeng,
	laijs, mgorman, yinghai, glommer, jiang.liu, julian.calaby, sfr
  Cc: linux-mm, x86, linuxppc-dev, linux-kernel, linux-acpi
In-Reply-To: <1358854984-6073-1-git-send-email-tangchen@cn.fujitsu.com>

Make the comments in drivers/firmware/memmap.c kernel-doc compliant.

Reported-by: Julian Calaby <julian.calaby@gmail.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 drivers/firmware/memmap.c |   12 ++++++------
 1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/firmware/memmap.c b/drivers/firmware/memmap.c
index 658fdd4..0b5b5f6 100644
--- a/drivers/firmware/memmap.c
+++ b/drivers/firmware/memmap.c
@@ -209,7 +209,7 @@ static inline void remove_sysfs_fw_map_entry(struct firmware_map_entry *entry)
 }
 
 /*
- * firmware_map_find_entry_in_list: Search memmap entry in a given list.
+ * firmware_map_find_entry_in_list() - Search memmap entry in a given list.
  * @start: Start of the memory range.
  * @end:   End of the memory range (exclusive).
  * @type:  Type of the memory range.
@@ -219,7 +219,7 @@ static inline void remove_sysfs_fw_map_entry(struct firmware_map_entry *entry)
  * given list. The caller must hold map_entries_lock, and must not release
  * the lock until the processing of the returned entry has completed.
  *
- * Return pointer to the entry to be found on success, or NULL on failure.
+ * Return: Pointer to the entry to be found on success, or NULL on failure.
  */
 static struct firmware_map_entry * __meminit
 firmware_map_find_entry_in_list(u64 start, u64 end, const char *type,
@@ -237,7 +237,7 @@ firmware_map_find_entry_in_list(u64 start, u64 end, const char *type,
 }
 
 /*
- * firmware_map_find_entry: Search memmap entry in map_entries.
+ * firmware_map_find_entry() - Search memmap entry in map_entries.
  * @start: Start of the memory range.
  * @end:   End of the memory range (exclusive).
  * @type:  Type of the memory range.
@@ -246,7 +246,7 @@ firmware_map_find_entry_in_list(u64 start, u64 end, const char *type,
  * The caller must hold map_entries_lock, and must not release the lock
  * until the processing of the returned entry has completed.
  *
- * Return pointer to the entry to be found on success, or NULL on failure.
+ * Return: Pointer to the entry to be found on success, or NULL on failure.
  */
 static struct firmware_map_entry * __meminit
 firmware_map_find_entry(u64 start, u64 end, const char *type)
@@ -255,7 +255,7 @@ firmware_map_find_entry(u64 start, u64 end, const char *type)
 }
 
 /*
- * firmware_map_find_entry_bootmem: Search memmap entry in map_entries_bootmem.
+ * firmware_map_find_entry_bootmem() - Search memmap entry in map_entries_bootmem.
  * @start: Start of the memory range.
  * @end:   End of the memory range (exclusive).
  * @type:  Type of the memory range.
@@ -263,7 +263,7 @@ firmware_map_find_entry(u64 start, u64 end, const char *type)
  * This function is similar to firmware_map_find_entry except that it find the
  * given entry in map_entries_bootmem.
  *
- * Return pointer to the entry to be found on success, or NULL on failure.
+ * Return: Pointer to the entry to be found on success, or NULL on failure.
  */
 static struct firmware_map_entry * __meminit
 firmware_map_find_entry_bootmem(u64 start, u64 end, const char *type)
-- 
1.7.1

^ permalink raw reply related

* [PATCH Bug fix 4/5] Bug fix: Fix section mismatch problem of release_firmware_map_entry().
From: Tang Chen @ 2013-01-22 11:43 UTC (permalink / raw)
  To: akpm, rientjes, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, hpa, linfeng,
	laijs, mgorman, yinghai, glommer, jiang.liu, julian.calaby, sfr
  Cc: linux-mm, x86, linuxppc-dev, linux-kernel, linux-acpi
In-Reply-To: <1358854984-6073-1-git-send-email-tangchen@cn.fujitsu.com>

The function release_firmware_map_entry() references the function
__meminit firmware_map_find_entry_in_list(). So it should also have
__meminit.

And since the firmware_map_entry->kobj is initialized with memmap_ktype,
the memmap_ktype should also be prefixed by __refdata.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 drivers/firmware/memmap.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/firmware/memmap.c b/drivers/firmware/memmap.c
index 0710179..658fdd4 100644
--- a/drivers/firmware/memmap.c
+++ b/drivers/firmware/memmap.c
@@ -103,7 +103,7 @@ to_memmap_entry(struct kobject *kobj)
 	return container_of(kobj, struct firmware_map_entry, kobj);
 }
 
-static void release_firmware_map_entry(struct kobject *kobj)
+static void __meminit release_firmware_map_entry(struct kobject *kobj)
 {
 	struct firmware_map_entry *entry = to_memmap_entry(kobj);
 
@@ -127,7 +127,7 @@ static void release_firmware_map_entry(struct kobject *kobj)
 	kfree(entry);
 }
 
-static struct kobj_type memmap_ktype = {
+static struct kobj_type __refdata memmap_ktype = {
 	.release	= release_firmware_map_entry,
 	.sysfs_ops	= &memmap_attr_ops,
 	.default_attrs	= def_attrs,
-- 
1.7.1

^ permalink raw reply related

* [PATCH Bug fix 3/5] Bug fix: Do not split pages when freeing pagetable pages.
From: Tang Chen @ 2013-01-22 11:43 UTC (permalink / raw)
  To: akpm, rientjes, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, hpa, linfeng,
	laijs, mgorman, yinghai, glommer, jiang.liu, julian.calaby, sfr
  Cc: linux-mm, x86, linuxppc-dev, linux-kernel, linux-acpi
In-Reply-To: <1358854984-6073-1-git-send-email-tangchen@cn.fujitsu.com>

The old way we free pagetable pages was wrong.

The old way is:
When we got a hugepage, we split it into smaller pages. And sometimes,
we only need to free some of the smaller pages because the others are
still in use. And the whole larger page will be freed if all the smaller
pages are not in use. All these were done in remove_pmd/pud_table().

But there is no way to know if the larger page has been split. As a result,
there is no way to decide when to split.

Actually, there is no need to split larger pages into smaller ones.

We do it in the following new way:
1) For direct mapped pages, all the pages were freed when they were offlined.
   And since menmory offline is done section by section, all the memory ranges
   being removed are aligned to PAGE_SIZE. So only need to deal with unaligned
   pages when freeing vmemmap pages.

2) For vmemmap pages being used to store page_struct, if part of the larger
   page is still in use, just fill the unused part with 0xFD. And when the
   whole page is fulfilled with 0xFD, then free the larger page.

This problem is caused by the following related patch:
memory-hotplug-common-apis-to-support-page-tables-hot-remove.patch
memory-hotplug-common-apis-to-support-page-tables-hot-remove-fix.patch
memory-hotplug-common-apis-to-support-page-tables-hot-remove-fix-fix.patch
memory-hotplug-common-apis-to-support-page-tables-hot-remove-fix-fix-fix.patch
memory-hotplug-common-apis-to-support-page-tables-hot-remove-fix-fix-fix-fix.patch

This patch will fix the problem based on the above patches.

Reported-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/x86/mm/init_64.c |  148 +++++++++++++++---------------------------------
 1 files changed, 46 insertions(+), 102 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 6d1a768..9fbed24 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -777,7 +777,7 @@ static bool __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd)
 
 static void __meminit
 remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
-		 bool direct, bool split)
+		 bool direct)
 {
 	unsigned long next, pages = 0;
 	pte_t *pte;
@@ -807,23 +807,9 @@ remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
 			/*
 			 * Do not free direct mapping pages since they were
 			 * freed when offlining, or simplely not in use.
-			 *
-			 * Do not free pages split from larger page since only
-			 * the _count of the 1st page struct is available.
-			 * Free the larger page when it is fulfilled with 0xFD.
 			 */
-			if (!direct) {
-				if (split) {
-					/*
-					 * Fill the split 4KB page with 0xFD.
-					 * When the whole 2MB page is fulfilled
-					 * with 0xFD, it could be freed.
-					 */
-					memset((void *)addr, PAGE_INUSE,
-						PAGE_SIZE);
-				} else
-					free_pagetable(pte_page(*pte), 0);
-			}
+			if (!direct)
+				free_pagetable(pte_page(*pte), 0);
 
 			spin_lock(&init_mm.page_table_lock);
 			pte_clear(&init_mm, addr, pte);
@@ -833,26 +819,24 @@ remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
 			pages++;
 		} else {
 			/*
+			 * If we are here, we are freeing vmemmap pages since
+			 * direct mapped memory ranges to be freed are aligned.
+			 *
 			 * If we are not removing the whole page, it means
-			 * other ptes in this page are being used and we cannot
-			 * remove them. So fill the unused ptes with 0xFD, and
-			 * remove the page when it is wholly filled with 0xFD.
+			 * other page structs in this page are being used and
+			 * we canot remove them. So fill the unused page_structs
+			 * with 0xFD, and remove the page when it is wholly
+			 * filled with 0xFD.
 			 */
 			memset((void *)addr, PAGE_INUSE, next - addr);
 
-			/*
-			 * If the range is not aligned to PAGE_SIZE, then the
-			 * page is definitely not split from larger page.
-			 */
 			page_addr = page_address(pte_page(*pte));
 			if (!memchr_inv(page_addr, PAGE_INUSE, PAGE_SIZE)) {
-				if (!direct)
-					free_pagetable(pte_page(*pte), 0);
+				free_pagetable(pte_page(*pte), 0);
 
 				spin_lock(&init_mm.page_table_lock);
 				pte_clear(&init_mm, addr, pte);
 				spin_unlock(&init_mm.page_table_lock);
-				pages++;
 			}
 		}
 	}
@@ -865,13 +849,12 @@ remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
 
 static void __meminit
 remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
-		 bool direct, bool split)
+		 bool direct)
 {
-	unsigned long pte_phys, next, pages = 0;
+	unsigned long next, pages = 0;
 	pte_t *pte_base;
 	pmd_t *pmd;
 	void *page_addr;
-	bool split_pmd = split, split_pte = false;
 
 	pmd = pmd_start + pmd_index(addr);
 	for (; addr < end; addr = next, pmd++) {
@@ -883,63 +866,35 @@ remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
 		if (pmd_large(*pmd)) {
 			if (IS_ALIGNED(addr, PMD_SIZE) &&
 			    IS_ALIGNED(next, PMD_SIZE)) {
-				if (!direct) {
-					if (split_pmd) {
-						/*
-						 * Fill the split 2MB page with
-						 * 0xFD. When the whole 1GB page
-						 * is fulfilled with 0xFD, it
-						 * could be freed.
-						 */
-						memset((void *)addr, PAGE_INUSE,
-							PMD_SIZE);
-					} else {
-						free_pagetable(pmd_page(*pmd),
+				if (!direct)
+					free_pagetable(pmd_page(*pmd),
 						       get_order(PMD_SIZE));
-					}
-				}
 
 				spin_lock(&init_mm.page_table_lock);
 				pmd_clear(pmd);
 				spin_unlock(&init_mm.page_table_lock);
-
-				/*
-				 * For non-direct mapping, pages means
-				 * nothing.
-				 */
 				pages++;
+			} else {
+				/* If here, we are freeing vmemmap pages. */
+				memset((void *)addr, PAGE_INUSE, next - addr);
+
+				page_addr = page_address(pmd_page(*pmd));
+				if (!memchr_inv(page_addr, PAGE_INUSE,
+						PMD_SIZE)) {
+					free_pagetable(pmd_page(*pmd),
+						       get_order(PMD_SIZE));
 
-				continue;
+					spin_lock(&init_mm.page_table_lock);
+					pmd_clear(pmd);
+					spin_unlock(&init_mm.page_table_lock);
+				}
 			}
 
-			/*
-			 * We use 2M page, but we need to remove part of them,
-			 * so split 2M page to 4K page.
-			 */
-			pte_base = (pte_t *)alloc_low_page(&pte_phys);
-			BUG_ON(!pte_base);
-			__split_large_page((pte_t *)pmd, addr,
-					   (pte_t *)pte_base);
-			split_pte = true;
-
-			spin_lock(&init_mm.page_table_lock);
-			pmd_populate_kernel(&init_mm, pmd, __va(pte_phys));
-			spin_unlock(&init_mm.page_table_lock);
-
-			flush_tlb_all();
+			continue;
 		}
 
 		pte_base = (pte_t *)map_low_page((pte_t *)pmd_page_vaddr(*pmd));
-		remove_pte_table(pte_base, addr, next, direct, split_pte);
-
-		if (!direct && split_pte) {
-			page_addr = page_address(pmd_page(*pmd));
-			if (!memchr_inv(page_addr, PAGE_INUSE, PMD_SIZE)) {
-				free_pagetable(pmd_page(*pmd),
-					       get_order(PMD_SIZE));
-			}
-		}
-
+		remove_pte_table(pte_base, addr, next, direct);
 		free_pte_table(pte_base, pmd);
 		unmap_low_page(pte_base);
 	}
@@ -953,11 +908,10 @@ static void __meminit
 remove_pud_table(pud_t *pud_start, unsigned long addr, unsigned long end,
 		 bool direct)
 {
-	unsigned long pmd_phys, next, pages = 0;
+	unsigned long next, pages = 0;
 	pmd_t *pmd_base;
 	pud_t *pud;
 	void *page_addr;
-	bool split_pmd = false;
 
 	pud = pud_start + pud_index(addr);
 	for (; addr < end; addr = next, pud++) {
@@ -977,37 +931,27 @@ remove_pud_table(pud_t *pud_start, unsigned long addr, unsigned long end,
 				pud_clear(pud);
 				spin_unlock(&init_mm.page_table_lock);
 				pages++;
-				continue;
-			}
+			} else {
+				/* If here, we are freeing vmemmap pages. */
+				memset((void *)addr, PAGE_INUSE, next - addr);
 
-			/*
-			 * We use 1G page, but we need to remove part of them,
-			 * so split 1G page to 2M page.
-			 */
-			pmd_base = (pmd_t *)alloc_low_page(&pmd_phys);
-			BUG_ON(!pmd_base);
-			__split_large_page((pte_t *)pud, addr,
-					   (pte_t *)pmd_base);
-			split_pmd = true;
+				page_addr = page_address(pud_page(*pud));
+				if (!memchr_inv(page_addr, PAGE_INUSE,
+						PUD_SIZE)) {
+					free_pagetable(pud_page(*pud),
+						       get_order(PUD_SIZE));
 
-			spin_lock(&init_mm.page_table_lock);
-			pud_populate(&init_mm, pud, __va(pmd_phys));
-			spin_unlock(&init_mm.page_table_lock);
+					spin_lock(&init_mm.page_table_lock);
+					pud_clear(pud);
+					spin_unlock(&init_mm.page_table_lock);
+				}
+			}
 
-			flush_tlb_all();
+			continue;
 		}
 
 		pmd_base = (pmd_t *)map_low_page((pmd_t *)pud_page_vaddr(*pud));
-		remove_pmd_table(pmd_base, addr, next, direct, split_pmd);
-
-		if (!direct && split_pmd) {
-			page_addr = page_address(pud_page(*pud));
-			if (!memchr_inv(page_addr, PAGE_INUSE, PUD_SIZE)) {
-				free_pagetable(pud_page(*pud),
-					       get_order(PUD_SIZE));
-			}
-		}
-
+		remove_pmd_table(pmd_base, addr, next, direct);
 		free_pmd_table(pmd_base, pud);
 		unmap_low_page(pmd_base);
 	}
-- 
1.7.1

^ permalink raw reply related

* [PATCH Bug fix 2/5] Bug-fix: mempolicy: fix is_valid_nodemask()
From: Tang Chen @ 2013-01-22 11:43 UTC (permalink / raw)
  To: akpm, rientjes, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, hpa, linfeng,
	laijs, mgorman, yinghai, glommer, jiang.liu, julian.calaby, sfr
  Cc: linux-mm, x86, linuxppc-dev, linux-kernel, linux-acpi
In-Reply-To: <1358854984-6073-1-git-send-email-tangchen@cn.fujitsu.com>

From: Lai Jiangshan <laijs@cn.fujitsu.com>

is_valid_nodemask() is introduced by 19770b32. but it does not match
its comments, because it does not check the zone which > policy_zone.

Also in b377fd, this commits told us, if highest zone is ZONE_MOVABLE,
we should also apply memory policies to it. so ZONE_MOVABLE should be valid zone
for policies. is_valid_nodemask() need to be changed to match it.

Fix: check all zones, even its zoneid > policy_zone.
Use nodes_intersects() instead open code to check it.

Reported-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 mm/mempolicy.c |   36 ++++++++++++++++++++++--------------
 1 files changed, 22 insertions(+), 14 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index af8a121..6f7979c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -161,19 +161,7 @@ static const struct mempolicy_operations {
 /* Check that the nodemask contains at least one populated zone */
 static int is_valid_nodemask(const nodemask_t *nodemask)
 {
-	int nd, k;
-
-	for_each_node_mask(nd, *nodemask) {
-		struct zone *z;
-
-		for (k = 0; k <= policy_zone; k++) {
-			z = &NODE_DATA(nd)->node_zones[k];
-			if (z->managed_pages > 0)
-				return 1;
-		}
-	}
-
-	return 0;
+	return nodes_intersects(*nodemask, node_states[N_MEMORY]);
 }
 
 static inline int mpol_store_user_nodemask(const struct mempolicy *pol)
@@ -1644,6 +1632,26 @@ struct mempolicy *get_vma_policy(struct task_struct *task,
 	return pol;
 }
 
+static int apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
+{
+	enum zone_type dynamic_policy_zone = policy_zone;
+
+	BUG_ON(dynamic_policy_zone == ZONE_MOVABLE);
+
+	/*
+	 * if policy->v.nodes has movable memory only,
+	 * we apply policy when gfp_zone(gfp) = ZONE_MOVABLE only.
+	 *
+	 * policy->v.nodes is intersect with node_states[N_MEMORY].
+	 * so if the following test faile, it implies
+	 * policy->v.nodes has movable memory only.
+	 */
+	if (!nodes_intersects(policy->v.nodes, node_states[N_HIGH_MEMORY]))
+		dynamic_policy_zone = ZONE_MOVABLE;
+
+	return zone >= dynamic_policy_zone;
+}
+
 /*
  * Return a nodemask representing a mempolicy for filtering nodes for
  * page allocation
@@ -1652,7 +1660,7 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
 {
 	/* Lower zones don't get a nodemask applied for MPOL_BIND */
 	if (unlikely(policy->mode == MPOL_BIND) &&
-			gfp_zone(gfp) >= policy_zone &&
+			apply_policy_zone(policy, gfp_zone(gfp)) &&
 			cpuset_nodemask_valid_mems_allowed(&policy->v.nodes))
 		return &policy->v.nodes;
 
-- 
1.7.1

^ permalink raw reply related

* [PATCH Bug fix 0/5] Bug fix for physical memory hot-remove.
From: Tang Chen @ 2013-01-22 11:42 UTC (permalink / raw)
  To: akpm, rientjes, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, hpa, linfeng,
	laijs, mgorman, yinghai, glommer, jiang.liu, julian.calaby, sfr
  Cc: linux-mm, x86, linuxppc-dev, linux-kernel, linux-acpi

Here are some bug fix patches for physical memory hot-remove. All these
patches are based on the latest -mm tree.
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git akpm

And patch1 and patch3 are very important.
patch1: free compound pages when freeing memmap, otherwise the kernel
        will panic the next time memory is hot-added.
patch3: the old way of freeing pagetable pages was wrong. We should never
        split larger pages into small ones.


Lai Jiangshan (1):
  Bug-fix: mempolicy: fix is_valid_nodemask()

Tang Chen (3):
  Bug fix: Do not split pages when freeing pagetable pages.
  Bug fix: Fix section mismatch problem of
    release_firmware_map_entry().
  Bug fix: Fix the doc format in drivers/firmware/memmap.c

Wen Congyang (1):
  Bug fix: consider compound pages when free memmap

 arch/x86/mm/init_64.c     |  148 ++++++++++++++-------------------------------
 drivers/firmware/memmap.c |   16 +++---
 mm/mempolicy.c            |   36 +++++++----
 mm/sparse.c               |    2 +-
 4 files changed, 77 insertions(+), 125 deletions(-)

^ permalink raw reply

* [PATCH Bug fix 1/5] Bug fix: consider compound pages when free memmap
From: Tang Chen @ 2013-01-22 11:43 UTC (permalink / raw)
  To: akpm, rientjes, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, hpa, linfeng,
	laijs, mgorman, yinghai, glommer, jiang.liu, julian.calaby, sfr
  Cc: linux-mm, x86, linuxppc-dev, linux-kernel, linux-acpi
In-Reply-To: <1358854984-6073-1-git-send-email-tangchen@cn.fujitsu.com>

From: Wen Congyang <wency@cn.fujitsu.com>

usemap could also be allocated as compound pages. Should also
consider compound pages when freeing memmap.

If we don't fix it, there could be problems when we free vmemmap
pagetables which are stored in compound pages. The old pagetables
will not be freed properly, and when we add the memory again,
no new pagetable will be created. And the old pagetable entry is
used, than the kernel will panic.

The call trace is like the following:

[  691.175487] BUG: unable to handle kernel paging request at ffffea0040000000
[  691.258872] IP: [<ffffffff816a483f>] sparse_add_one_section+0xef/0x166
[  691.336971] PGD 7ff7d4067 PUD 78e035067 PMD 78e11d067 PTE 0
[  691.403952] Oops: 0002 [#1] SMP
[  691.442695] Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle iptable_filter ip_tables bridge stp llc sunrpc binfmt_misc dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel microcode pcspkr sg lpc_ich mfd_core i2c_i801 i2c_core i7core_edac edac_core ioatdma e1000e igb dca ptp pps_core sd_mod crc_t10dif megaraid_sas mptsas mptscsih mptbase scsi_transport_sas scsi_mod
[  692.042726] CPU 0
[  692.064641] Pid: 4, comm: kworker/0:0 Tainted: G        W 3.8.0-rc3-phy-hot-remove+ #3 FUJITSU-SV PRIMEQUEST 1800E/SB
[  692.196723] RIP: 0010:[<ffffffff816a483f>]  [<ffffffff816a483f>] sparse_add_one_section+0xef/0x166
[  692.303885] RSP: 0018:ffff8807bdcb35d8  EFLAGS: 00010006
[  692.367331] RAX: 0000000000000000 RBX: 0000000000000200 RCX: 0000000000200000
[  692.452578] RDX: ffff88078df01148 RSI: 0000000000000282 RDI: ffffea0040000000
[  692.537822] RBP: ffff8807bdcb3618 R08: 4cf05005b019467a R09: 0cd98fa09631467a
[  692.623071] R10: 0000000000000000 R11: 0000000000030e20 R12: 0000000000008000
[  692.708319] R13: ffffea0040000000 R14: ffff88078df66248 R15: ffff88078ea13b10
[  692.793562] FS:  0000000000000000(0000) GS:ffff8807c1a00000(0000) knlGS:0000000000000000
[  692.890233] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  692.958870] CR2: ffffea0040000000 CR3: 0000000001c0c000 CR4: 00000000000007f0
[  693.044119] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  693.129367] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  693.214617] Process kworker/0:0 (pid: 4, threadinfo ffff8807bdcb2000, task ffff8807bde18000)
[  693.315437] Stack:
[  693.339421]  0000000000000000 0000000000000282 0000000000000000 ffff88078df40f00
[  693.428208]  0000000000000001 0000000000000200 00000000000002ff 0000000000000200
[  693.516981]  ffff8807bdcb3668 ffffffff816940e5 0000000000000000 0000000001000000
[  693.605761] Call Trace:
[  693.634949]  [<ffffffff816940e5>] __add_pages+0x85/0x120
[  693.698398]  [<ffffffff8104f1d1>] arch_add_memory+0x71/0xf0
[  693.764960]  [<ffffffff81079bff>] ? request_resource_conflict+0x8f/0xa0
[  693.843982]  [<ffffffff81694796>] add_memory+0xd6/0x1f0
[  693.906393]  [<ffffffff814044df>] acpi_memory_device_add+0x170/0x20c
[  693.982302]  [<ffffffff813c1de2>] acpi_device_probe+0x50/0x18a
[  694.051977]  [<ffffffff8125a9d3>] ? sysfs_create_link+0x13/0x20
[  694.122691]  [<ffffffff8146c31c>] really_probe+0x6c/0x320
[  694.187170]  [<ffffffff8146c617>] driver_probe_device+0x47/0xa0
[  694.257885]  [<ffffffff8146c720>] ? __driver_attach+0xb0/0xb0
[  694.326521]  [<ffffffff8146c720>] ? __driver_attach+0xb0/0xb0
[  694.395157]  [<ffffffff8146c773>] __device_attach+0x53/0x60
[  694.461719]  [<ffffffff8146a34c>] bus_for_each_drv+0x6c/0xa0
[  694.529316]  [<ffffffff8146c298>] device_attach+0xa8/0xc0
[  694.593799]  [<ffffffff8146af70>] bus_probe_device+0xb0/0xe0
[  694.661398]  [<ffffffff814699c1>] device_add+0x301/0x570
[  694.724842]  [<ffffffff81469c4e>] device_register+0x1e/0x30
[  694.791403]  [<ffffffff813c354a>] acpi_device_register+0x1d8/0x27c
[  694.865230]  [<ffffffff813c37cd>] acpi_add_single_object+0x1df/0x2b9
[  694.941140]  [<ffffffff813fa078>] ? acpi_ut_release_mutex+0xac/0xb5
[  695.016009]  [<ffffffff813c39b9>] acpi_bus_check_add+0x112/0x18f
[  695.087764]  [<ffffffff810df61d>] ? trace_hardirqs_on+0xd/0x10
[  695.157445]  [<ffffffff810a1b0f>] ? up+0x2f/0x50
[  695.212585]  [<ffffffff813bdddb>] ? acpi_os_signal_semaphore+0x6b/0x74
[  695.290573]  [<ffffffff813ec519>] acpi_ns_walk_namespace+0x105/0x255
[  695.366478]  [<ffffffff813c38a7>] ? acpi_add_single_object+0x2b9/0x2b9
[  695.444459]  [<ffffffff813c38a7>] ? acpi_add_single_object+0x2b9/0x2b9
[  695.522439]  [<ffffffff813ecb6c>] acpi_walk_namespace+0xcf/0x118
[  695.594190]  [<ffffffff813c3a91>] acpi_bus_scan+0x5b/0x7c
[  695.658676]  [<ffffffff813c3b1e>] acpi_bus_add+0x2a/0x2c
[  695.722121]  [<ffffffff81402905>] container_notify_cb+0x112/0x1a9
[  695.794914]  [<ffffffff813d5859>] acpi_ev_notify_dispatch+0x46/0x61
[  695.869781]  [<ffffffff813be072>] acpi_os_execute_deferred+0x27/0x34
[  695.945687]  [<ffffffff81091c6e>] process_one_work+0x20e/0x5c0
[  696.015361]  [<ffffffff81091bff>] ? process_one_work+0x19f/0x5c0
[  696.087113]  [<ffffffff813be04b>] ? acpi_os_wait_events_complete+0x23/0x23
[  696.169248]  [<ffffffff81093d0e>] worker_thread+0x12e/0x370
[  696.235807]  [<ffffffff81093be0>] ? manage_workers+0x180/0x180
[  696.305485]  [<ffffffff81099e4e>] kthread+0xee/0x100
[  696.364773]  [<ffffffff810e1179>] ? __lock_release+0x129/0x190
[  696.434450]  [<ffffffff81099d60>] ? __init_kthread_worker+0x70/0x70
[  696.509317]  [<ffffffff816b34ac>] ret_from_fork+0x7c/0xb0
[  696.573799]  [<ffffffff81099d60>] ? __init_kthread_worker+0x70/0x70
[  696.648662] Code: 00 00 48 89 df 48 89 45 c8 e8 3e 71 b1 ff 48 89 c2 48 8b 75 c8 b8 ef ff ff ff f6 02 01 75 4b 49 63 cc 31 c0 4c 89 ef 48 c1 e1 06 <f3> aa 48 8b 02 48 83 c8 01 48 85 d2 48 89 02 74 29 a8 01 74 25
[  696.880997] RIP  [<ffffffff816a483f>] sparse_add_one_section+0xef/0x166
[  696.960128]  RSP <ffff8807bdcb35d8>
[  697.001768] CR2: ffffea0040000000
[  697.041336] ---[ end trace e7f94e3a34c442d4 ]---
[  697.096474] Kernel panic - not syncing: Fatal exception

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 mm/sparse.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/sparse.c b/mm/sparse.c
index ef29496..7ca6dc8 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -698,7 +698,7 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap)
 	/*
 	 * Check to see if allocation came from hot-plug-add
 	 */
-	if (PageSlab(usemap_page)) {
+	if (PageSlab(usemap_page) || PageCompound(usemap_page)) {
 		kfree(usemap);
 		if (memmap)
 			__kfree_section_memmap(memmap, PAGES_PER_SECTION);
-- 
1.7.1

^ permalink raw reply related

* [PATCH v5 45/45] Documentation/cpu-hotplug: Remove references to stop_machine()
From: Srivatsa S. Bhat @ 2013-01-22  7:45 UTC (permalink / raw)
  To: tglx, peterz, tj, oleg, paulmck, rusty, mingo, akpm, namhyung
  Cc: linux-arch, linux, nikunj, linux-pm, fweisbec, linux-doc,
	linux-kernel, rostedt, xiaoguangrong, rjw, sbw, wangyun,
	srivatsa.bhat, netdev, linuxppc-dev, linux-arm-kernel
In-Reply-To: <20130122073210.13822.50434.stgit@srivatsabhat.in.ibm.com>

Since stop_machine() is no longer used in the CPU offline path, we cannot
disable CPU hotplug using preempt_disable()/local_irq_disable() etc. We
need to use the newly introduced get/put_online_cpus_atomic() APIs.
Reflect this in the documentation.

Cc: Rob Landley <rob@landley.net>
Cc: linux-doc@vger.kernel.org
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 Documentation/cpu-hotplug.txt |   17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/Documentation/cpu-hotplug.txt b/Documentation/cpu-hotplug.txt
index 9f40135..7f907ec 100644
--- a/Documentation/cpu-hotplug.txt
+++ b/Documentation/cpu-hotplug.txt
@@ -113,13 +113,15 @@ Never use anything other than cpumask_t to represent bitmap of CPUs.
 	#include <linux/cpu.h>
 	get_online_cpus() and put_online_cpus():
 
-The above calls are used to inhibit cpu hotplug operations. While the
+The above calls are used to inhibit cpu hotplug operations, when invoked from
+non-atomic context (because the above functions can sleep). While the
 cpu_hotplug.refcount is non zero, the cpu_online_mask will not change.
-If you merely need to avoid cpus going away, you could also use
-preempt_disable() and preempt_enable() for those sections.
-Just remember the critical section cannot call any
-function that can sleep or schedule this process away. The preempt_disable()
-will work as long as stop_machine_run() is used to take a cpu down.
+
+However, if you are executing in atomic context (ie., you can't afford to
+sleep), and you merely need to avoid cpus going offline, you can use
+get_online_cpus_atomic() and put_online_cpus_atomic() for those sections.
+Just remember the critical section cannot call any function that can sleep or
+schedule this process away.
 
 CPU Hotplug - Frequently Asked Questions.
 
@@ -360,6 +362,9 @@ A: There are two ways.  If your code can be run in interrupt context, use
 		return err;
 	}
 
+   If my_func_on_cpu() itself cannot block, use get/put_online_cpus_atomic()
+   instead of get/put_online_cpus() to prevent CPUs from going offline.
+
 Q: How do we determine how many CPUs are available for hotplug.
 A: There is no clear spec defined way from ACPI that can give us that
    information today. Based on some input from Natalie of Unisys,

^ permalink raw reply related

* [PATCH v5 44/45] CPU hotplug, stop_machine: Decouple CPU hotplug from stop_machine() in Kconfig
From: Srivatsa S. Bhat @ 2013-01-22  7:45 UTC (permalink / raw)
  To: tglx, peterz, tj, oleg, paulmck, rusty, mingo, akpm, namhyung
  Cc: linux-arch, linux, nikunj, linux-pm, fweisbec, linux-doc,
	linux-kernel, rostedt, xiaoguangrong, rjw, sbw, wangyun,
	srivatsa.bhat, netdev, linuxppc-dev, linux-arm-kernel
In-Reply-To: <20130122073210.13822.50434.stgit@srivatsabhat.in.ibm.com>

... and also cleanup a comment that refers to CPU hotplug being dependent on
stop_machine().

Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 include/linux/stop_machine.h |    2 +-
 init/Kconfig                 |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/stop_machine.h b/include/linux/stop_machine.h
index 3b5e910..ce2d3c4 100644
--- a/include/linux/stop_machine.h
+++ b/include/linux/stop_machine.h
@@ -120,7 +120,7 @@ int stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus);
  * @cpus: the cpus to run the @fn() on (NULL = any online cpu)
  *
  * Description: This is a special version of the above, which assumes cpus
- * won't come or go while it's being called.  Used by hotplug cpu.
+ * won't come or go while it's being called.
  */
 int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus);
 
diff --git a/init/Kconfig b/init/Kconfig
index be8b7f5..048a0c5 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1711,7 +1711,7 @@ config INIT_ALL_POSSIBLE
 config STOP_MACHINE
 	bool
 	default y
-	depends on (SMP && MODULE_UNLOAD) || HOTPLUG_CPU
+	depends on (SMP && MODULE_UNLOAD)
 	help
 	  Need stop_machine() primitive.
 

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox