LinuxPPC-Dev Archive on lore.kernel.org


* [PATCH 0/6][v4]: perf: Make POWER7 events available in sysfs
From: Sukadev Bhattiprolu @ 2013-01-23  6:22 UTC (permalink / raw)
  To: Peter Zijlstra, Paul Mackerras, Ingo Molnar
  Cc: Andi Kleen, robert.richter, Anton Blanchard, linux-kernel,
	Stephane Eranian, linuxppc-dev, Arnaldo Carvalho de Melo,
	Jiri Olsa

[PATCH 0/6][v4]: perf: Make POWER7 events available in sysfs

Make the generic and some POWER7-specific perf events available in sysfs.
These changes mainly extend similar functionality implemented in x86 to
work on POWER architecture as well.

Thanks to input from Stephane Eranian, Robert Richter, Peter Zijlstra
and Jiri Olsa.

Changelog[v4]:
	[Jiri Olsa]: Document that multiple event= like terms can be specified
	in the 'events' file.
	[Jiri Olsa]: Remove the documentation for the 'config format' file
	as it is already documented in 'Documentation/ABI/testing/'
	[Jiri Olsa]: Move the ABI documentation from 'stable/' to 'testing/'.

Changelog[v3]:
	[Jiri Olsa]: No need to define EVENT_ID, PMU_EVENT_PTR() if used only
	once
	[Greg KH]: Document the new sysfs interfaces in Documentation/ABI

Changelog[v2]:
	[Jiri Olsa] Use PMU_FORMAT_ATTR() rather than duplicating code.

Sukadev Bhattiprolu (6):
  perf/Power7: Use macros to identify perf events
  perf: Make EVENT_ATTR global
  perf/POWER7: Make generic event translations available in sysfs
  perf/POWER7: Make some POWER7 events available in sysfs
  perf: Create a sysfs entry for Power event format
  perf: Document the ABI of perf sysfs entries

 .../testing/sysfs-bus-event_source-devices-events  |   62 +++++++++++++++
 arch/powerpc/include/asm/perf_event_server.h       |   32 ++++++++
 arch/powerpc/perf/core-book3s.c                    |   24 ++++++
 arch/powerpc/perf/power7-pmu.c                     |   81 ++++++++++++++++++--
 arch/x86/kernel/cpu/perf_event.c                   |   13 +---
 include/linux/perf_event.h                         |   11 +++
 6 files changed, 205 insertions(+), 18 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-bus-event_source-devices-events

^ permalink raw reply

* Re: [PATCH 1/6][v3] perf/Power7: Use macros to identify perf events
From: Michael Ellerman @ 2013-01-23  3:50 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Andi Kleen, Peter Zijlstra, robert.richter, Anton Blanchard,
	linux-kernel, Stephane Eranian, linuxppc-dev, Ingo Molnar,
	Paul Mackerras, Arnaldo Carvalho de Melo, Jiri Olsa
In-Reply-To: <20130110010347.GA32590@us.ibm.com>

On Wed, 2013-01-09 at 17:03 -0800, Sukadev Bhattiprolu wrote:
> [PATCH 1/6][v3] perf/Power7: Use macros to identify perf events
> 
> Define and use macros to identify perf events codes. This would make it
> easier and more readable when these event codes need to be used in more
> than one place.
> 
> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
> ---
>  arch/powerpc/perf/power7-pmu.c |   28 ++++++++++++++++++++--------
>  1 files changed, 20 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/powerpc/perf/power7-pmu.c b/arch/powerpc/perf/power7-pmu.c
> index 441af08..44e70d2 100644
> --- a/arch/powerpc/perf/power7-pmu.c
> +++ b/arch/powerpc/perf/power7-pmu.c
> @@ -51,6 +51,18 @@
>  #define MMCR1_PMCSEL_MSK	0xff
>  
>  /*
> + * Power7 event codes.
> + */
> +#define	PME_PM_CYC			0x1e
> +#define	PME_PM_GCT_NOSLOT_CYC		0x100f8
> +#define	PME_PM_CMPLU_STALL		0x4000a
> +#define	PME_PM_INST_CMPL		0x2
> +#define	PME_PM_LD_REF_L1		0xc880
> +#define	PME_PM_LD_MISS_L1		0x400f0
> +#define	PME_PM_BRU_FIN			0x10068
> +#define	PME_PM_BRU_MPRED		0x400f6
> +
> +/*
>   * Layout of constraint bits:
>   * 6666555555555544444444443333333333222222222211111111110000000000
>   * 3210987654321098765432109876543210987654321098765432109876543210
> @@ -296,14 +308,14 @@ static void power7_disable_pmc(unsigned int pmc, unsigned long mmcr[])
>  }
>  
>  static int power7_generic_events[] = {
> -	[PERF_COUNT_HW_CPU_CYCLES] = 0x1e,
> -	[PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] = 0x100f8, /* GCT_NOSLOT_CYC */
> -	[PERF_COUNT_HW_STALLED_CYCLES_BACKEND] = 0x4000a,  /* CMPLU_STALL */
> -	[PERF_COUNT_HW_INSTRUCTIONS] = 2,
> -	[PERF_COUNT_HW_CACHE_REFERENCES] = 0xc880,	/* LD_REF_L1_LSU*/
> -	[PERF_COUNT_HW_CACHE_MISSES] = 0x400f0,		/* LD_MISS_L1	*/
> -	[PERF_COUNT_HW_BRANCH_INSTRUCTIONS] = 0x10068,	/* BRU_FIN	*/
> -	[PERF_COUNT_HW_BRANCH_MISSES] = 0x400f6,	/* BR_MPRED	*/
> +	[PERF_COUNT_HW_CPU_CYCLES] =			PME_PM_CYC,
> +	[PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] =	PME_PM_GCT_NOSLOT_CYC,
> +	[PERF_COUNT_HW_STALLED_CYCLES_BACKEND] =	PME_PM_CMPLU_STALL,
> +	[PERF_COUNT_HW_INSTRUCTIONS] =			PME_PM_INST_CMPL,
> +	[PERF_COUNT_HW_CACHE_REFERENCES] =		PME_PM_LD_REF_L1,
> +	[PERF_COUNT_HW_CACHE_MISSES] =			PME_PM_LD_MISS_L1,


Your patch is good, but raises the question why we're using L1 events
for HW_CACHE.

AFAICS on Intel they use 0x4f2e/0x412e, which are last-level-cache (LLC)
events.
        
        PMU name : ix86arch (Intel X86 architectural PMU)
        Name     : LLC_REFERENCES
        Desc     : count each request originating from the core to
        reference a cache line in the last level cache. The count may
        include speculation, but excludes cache line fills due to
        hardware prefetch
        Code     : 0x4f2e

        PMU name : ix86arch (Intel X86 architectural PMU)
        Name     : LLC_MISSES
        Desc     : count each cache miss condition for references to the
        last level cache. The event count may include speculation, but
        excludes cache line fills due to hardware prefetch
        Code     : 0x412e


That would seem to more closely match our PM_L3_LD_HIT/MISS?

cheers

^ permalink raw reply

* [PATCH][v2] KVM: PPC: add paravirt idle loop for 64-bit book E
From: Stuart Yoder @ 2013-01-22 23:54 UTC (permalink / raw)
  To: agraf, benh; +Cc: linuxppc-dev, kvm-ppc, kvm, Stuart Yoder

From: Stuart Yoder <stuart.yoder@freescale.com>

Signed-off-by: Stuart Yoder <stuart.yoder@freescale.com>
---

-v2
   -macro'ized loop in idle_book3e.S to avoid code 
    duplication, paravirt loop is now in idle_book3e.S

 arch/powerpc/kernel/epapr_hcalls.S |    2 ++
 arch/powerpc/kernel/idle_book3e.S  |   30 ++++++++++++++++++++++++++++--
 2 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/epapr_hcalls.S b/arch/powerpc/kernel/epapr_hcalls.S
index 62c0dc2..9f1ebf7 100644
--- a/arch/powerpc/kernel/epapr_hcalls.S
+++ b/arch/powerpc/kernel/epapr_hcalls.S
@@ -17,6 +17,7 @@
 #include <asm/asm-compat.h>
 #include <asm/asm-offsets.h>
 
+#ifndef CONFIG_PPC64
 /* epapr_ev_idle() was derived from e500_idle() */
 _GLOBAL(epapr_ev_idle)
 	CURRENT_THREAD_INFO(r3, r1)
@@ -42,6 +43,7 @@ epapr_ev_idle_start:
 	 * _TLF_NAPPING.
 	 */
 	b	idle_loop
+#endif
 
 /* Hypercall entry point. Will be patched with device tree instructions. */
 .global epapr_hypercall_start
diff --git a/arch/powerpc/kernel/idle_book3e.S b/arch/powerpc/kernel/idle_book3e.S
index 4c7cb400..e1c9acd 100644
--- a/arch/powerpc/kernel/idle_book3e.S
+++ b/arch/powerpc/kernel/idle_book3e.S
@@ -16,11 +16,13 @@
 #include <asm/ppc-opcode.h>
 #include <asm/processor.h>
 #include <asm/thread_info.h>
+#include <asm/epapr_hcalls.h>
 
 /* 64-bit version only for now */
 #ifdef CONFIG_PPC64
 
-_GLOBAL(book3e_idle)
+.macro BOOK3E_IDLE name loop
+_GLOBAL(\name)
 	/* Save LR for later */
 	mflr	r0
 	std	r0,16(r1)
@@ -67,7 +69,31 @@ _GLOBAL(book3e_idle)
 
 	/* We can now re-enable hard interrupts and go to sleep */
 	wrteei	1
-1:	PPC_WAIT(0)
+	\loop
+
+.endm
+
+.macro BOOK3E_IDLE_LOOP
+1:
+	PPC_WAIT(0)
 	b	1b
+.endm
+
+.macro EPAPR_EV_IDLE_LOOP
+idle_loop:
+       LOAD_REG_IMMEDIATE(r11, EV_HCALL_TOKEN(EV_IDLE))
+
+.global epapr_ev_idle_start
+epapr_ev_idle_start:
+       li      r3, -1
+       nop
+       nop
+       nop
+       b       idle_loop
+.endm
+
+BOOK3E_IDLE epapr_ev_idle, EPAPR_EV_IDLE_LOOP
+
+BOOK3E_IDLE book3e_idle BOOK3E_IDLE_LOOP
 
 #endif /* CONFIG_PPC64 */
-- 
1.7.9.7

^ permalink raw reply related

* Re: [PATCH] perf: Fix compile warnings in tests/attr.c
From: Michael Ellerman @ 2013-01-22 23:45 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Anton Blanchard, linux-kernel, linuxppc-dev, paulus, acme, mingo,
	Jiri Olsa
In-Reply-To: <20130121213823.GA4774@us.ibm.com>

On Mon, 2013-01-21 at 13:38 -0800, Sukadev Bhattiprolu wrote:
> Jiri Olsa [jolsa@redhat.com] wrote:
> | On Fri, Jan 18, 2013 at 05:30:52PM -0800, Sukadev Bhattiprolu wrote:
> | > From 4d266e5040c33103f5d226b0d16b89f8ef79e3ad Mon Sep 17 00:00:00 2001
> | > From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
> | > Date: Fri, 18 Jan 2013 11:14:28 -0800
> | > Subject: [PATCH] perf: Fix compile warnings in tests/attr.c
> | > 
> | > Replace '%llu' in printf()s with 'PRIu64' in 'tools/perf/tests/attr.c'
> | > to fix compile warnings (which become errors due to -Werror).
> | 
> | i386 and x86_64 compiles fine for me with gcc versions 4.6.3-2 and 4.7.2-2
> 
> But is broken on Power for 64bit :-( I am trying to fix that and thought
> that use of format specifiers like 'PRIu64' was the way to go.
> 
> | 
> | with your patch for x86_64 I'm getting following warnings/errors:
> 
> | 
> |     CC tests/attr.o
> | tests/attr.c: In function ‘store_event’:
> | tests/attr.c:69:4: error: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 6 has type ‘__u64’ [-Werror=format]
> 
> Here is what I see on an x86_64 box, RHEL6.2 box:
> 
> 	$ rpm -qf /usr/include/linux/types.h
> 	kernel-headers-2.6.32-220.4.2.el6.x86_64
> 
> 	$ cat foo.c
> 	#include <linux/types.h>
> 
> 	$ cc -Werror -Wall foo.c
> 	In file included from /usr/include/asm-generic/types.h:7,
> 			 from /usr/include/asm/types.h:6,
> 			 from /usr/include/linux/types.h:4,
> 			 from foo5.c:1:
> 	/usr/include/asm-generic/int-ll64.h:31:2: error: #error __u64 defined as unsigned long long
> 
> where the #error is my debug message.
> 
> <snip>
> 
> | make: *** [tests/attr.o] Error 1
> | 
> | i386 compiles fine
> 
> __u64 is 'unsigned long long' on x86 and PRIu64 is 'llu' which is fine.
> 
> __u64 is 'unsigned long' on Power and PRIu64 is 'lu' which is again fine.
> 
> But __u64 is 'unsigned long long' on x86_64, but PRIu64 is '%lu' bc __WORDSIZE
> is 64.

This is a bit of a mess, but let me see if I can help explain it.

The root of the problem is that you're mixing up the kernel type __u64,
with the userspace format specifier PRIu64.

PRIu64 is the format specifier for printing a uint64_t, it _may_ also be
the right specifier for a __u64, but there's no guarantee of that - as
you have discovered.

Inside the kernel both x86 and powerpc use unsigned long long always, in
32-bit and 64-bit code. That means in the kernel we can always use %llu.

On x86 that definition is also exported to userspace, so on x86 __u64 is
always unsigned long long. As you noticed this potentially differs from
uint64_t, which can be confusing. However it means in x86 userspace code
you can always print a __u64 with %llu.

On powerpc we default to using definitions that match userspace, so
__u64 changes depending on your wordsize, and so you must use PRIu64
etc. to print them.

There is however support in recent powerpc kernels to switch to using
unsigned long long even on 64-bit. See commit 2c9c6ce.

You need to define __SANE_USERSPACE_TYPES__ before including types.h.
Then you can always use %llu to print __u64.
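
As a minimal sketch of the above, assuming a Linux userspace build with the
kernel UAPI headers installed: the define has to appear before the first
include of <linux/types.h>. The helper name format_u64() is made up for this
illustration and is not part of perf. On architectures whose headers already
use int-ll64.h unconditionally, the define should simply have no effect.

```c
/* Must come before the first include of <linux/types.h>. */
#define __SANE_USERSPACE_TYPES__

#include <linux/types.h>
#include <stdio.h>

/*
 * format_u64() is a hypothetical helper for this example. It prints a
 * kernel-style __u64 into buf with a plain %llu. With the define above,
 * __u64 is unsigned long long, so neither a cast nor PRIu64 is needed.
 * Returns what snprintf() returns: the length of the formatted number.
 */
int format_u64(char *buf, size_t len, __u64 value)
{
	return snprintf(buf, len, "%llu", value);
}
```

Without the define, the same %llu would be expected to trigger the -Wformat
warning on 64-bit Power, where __u64 defaults to unsigned long.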

cheers

^ permalink raw reply

* Re: [PATCH 1/2] powerpc/5200: Fix size to request_mem_region() call
From: Anatolij Gustschin @ 2013-01-22 21:10 UTC (permalink / raw)
  To: Grant Likely; +Cc: linuxppc-dev, linux-kernel
In-Reply-To: <1358473200-17886-1-git-send-email-grant.likely@secretlab.ca>

Hi Grant,

On Fri, 18 Jan 2013 01:39:59 +0000
Grant Likely <grant.likely@secretlab.ca> wrote:

> The Bestcomm driver requests a memory region larger than the one
> described in the device tree. This is due to an extra undocumented field
> in the bestcomm register structure. This hasn't been a problem up to
> now, but there is a patch pending to make the DT platform_bus support
> code use platform_device_add() which tightens the rules and provides
> extra checks for drivers to stay within the specified register regions.
> 
> Alternately, I could have removed the extra field from the structure,
> but I'm not sure if it is still needed for resume to work. Better be
> safe and leave it in.
> 
> Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Anatolij Gustschin <agust@denx.de>
> ---
>  arch/powerpc/sysdev/bestcomm/bestcomm.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

There is a patch moving this driver to drivers/dma,

http://patchwork.ozlabs.org/patch/191153/

I've applied it to my 5xxx next branch.

> diff --git a/arch/powerpc/sysdev/bestcomm/bestcomm.c b/arch/powerpc/sysdev/bestcomm/bestcomm.c
> index d913063..81c3314 100644
> --- a/arch/powerpc/sysdev/bestcomm/bestcomm.c
> +++ b/arch/powerpc/sysdev/bestcomm/bestcomm.c
> @@ -414,7 +414,7 @@ static int mpc52xx_bcom_probe(struct platform_device *op)
>  		goto error_sramclean;
>  	}
>  
> -	if (!request_mem_region(res_bcom.start, sizeof(struct mpc52xx_sdma),
> +	if (!request_mem_region(res_bcom.start, resource_size(&res_bcom),
>  				DRIVER_NAME)) {
>  		printk(KERN_ERR DRIVER_NAME ": "
>  			"Can't request registers region\n");

A similar change is needed for release_mem_region() in the error path
and in the driver's remove() function.
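
To illustrate why the request and release sizes should agree, here is a
small userspace model. It is only a sketch: struct region, struct claim,
claim_region() and release_claim() are hypothetical stand-ins, and the
strict "release must name exactly the claimed range" rule is an assumption
of the model. The size arithmetic mirrors resource_size(), where the end
address is inclusive.

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for the start/end fields of a resource. */
struct region {
	uint64_t start;	/* first byte of the region */
	uint64_t end;	/* last byte of the region, inclusive */
};

/* Same arithmetic as resource_size(): end is inclusive, hence the +1. */
uint64_t region_size(const struct region *res)
{
	return res->end - res->start + 1;
}

/* One-slot model of a claimed range (hypothetical, not a kernel API). */
struct claim {
	bool busy;
	uint64_t start;
	uint64_t size;
};

/* Record the claimed range; fails if the slot is already taken. */
bool claim_region(struct claim *c, uint64_t start, uint64_t size)
{
	if (c->busy)
		return false;
	c->busy = true;
	c->start = start;
	c->size = size;
	return true;
}

/* Release succeeds only when start and size match what was claimed. */
bool release_claim(struct claim *c, uint64_t start, uint64_t size)
{
	if (!c->busy || c->start != start || c->size != size)
		return false;
	c->busy = false;
	return true;
}
```

In this model, claiming with region_size() and releasing with a structure
size that differs from it leaves the claim in place, which is the kind of
mismatch the error path and remove() would otherwise keep.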

Thanks,

Anatolij

^ permalink raw reply

* Re: [PATCH v5 01/45] percpu_rwlock: Introduce the global reader-writer lock backend
From: Steven Rostedt @ 2013-01-22 20:54 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: linux-doc, peterz, fweisbec, linux-kernel, mingo, linux-arch,
	linux, xiaoguangrong, wangyun, paulmck, nikunj, linux-pm, rusty,
	rjw, namhyung, tglx, linux-arm-kernel, netdev, oleg, sbw, tj,
	akpm, linuxppc-dev
In-Reply-To: <50FEEF5D.6080302@linux.vnet.ibm.com>

On Wed, 2013-01-23 at 01:28 +0530, Srivatsa S. Bhat wrote:

> > I thought global locks are now fair. That is, a reader will block if a
> > writer is waiting. Hence, the above should deadlock on the current
> > rwlock_t types.
> > 
> 
> Oh is it? Last I checked, lockdep didn't complain about this ABBA scenario!

It doesn't and Peter Zijlstra said we need to fix that ;-)  It only
recently became an issue with the new "fair" locking of rwlocks.

-- Steve

^ permalink raw reply

* Re: [PATCH v5 01/45] percpu_rwlock: Introduce the global reader-writer lock backend
From: Srivatsa S. Bhat @ 2013-01-22 19:58 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-doc, peterz, fweisbec, linux-kernel, mingo, linux-arch,
	linux, xiaoguangrong, wangyun, paulmck, nikunj, linux-pm, rusty,
	rjw, namhyung, tglx, linux-arm-kernel, netdev, oleg, sbw, tj,
	akpm, linuxppc-dev
In-Reply-To: <1358883152.21576.55.camel@gandalf.local.home>

On 01/23/2013 01:02 AM, Steven Rostedt wrote:
> On Tue, 2013-01-22 at 13:03 +0530, Srivatsa S. Bhat wrote:
>> A straight-forward (and obvious) algorithm to implement Per-CPU Reader-Writer
>> locks can also lead to too many deadlock possibilities which can make it very
>> hard/impossible to use. This is explained in the example below, which helps
>> justify the need for a different algorithm to implement flexible Per-CPU
>> Reader-Writer locks.
>>
>> We can use global rwlocks as shown below safely, without fear of deadlocks:
>>
>> Readers:
>>
>>          CPU 0                                CPU 1
>>          ------                               ------
>>
>> 1.    spin_lock(&random_lock);             read_lock(&my_rwlock);
>>
>>
>> 2.    read_lock(&my_rwlock);               spin_lock(&random_lock);
>>
>>
>> Writer:
>>
>>          CPU 2:
>>          ------
>>
>>        write_lock(&my_rwlock);
>>
> 
> I thought global locks are now fair. That is, a reader will block if a
> writer is waiting. Hence, the above should deadlock on the current
> rwlock_t types.
> 

Oh is it? Last I checked, lockdep didn't complain about this ABBA scenario!

> We need to fix those locations (or better yet, remove all rwlocks ;-)
> 

:-)

The challenge with stop_machine() removal is that the replacement on the
reader side must have the (locking) flexibility comparable to preempt_disable().
Otherwise, that solution most likely won't be viable because we'll hit way
too many locking problems and go crazy by the time we convert them over..(if
we can, that is!)

Regards,
Srivatsa S. Bhat

^ permalink raw reply

* Re: [PATCH v5 01/45] percpu_rwlock: Introduce the global reader-writer lock backend
From: Srivatsa S. Bhat @ 2013-01-22 19:41 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: linux-doc, peterz, fweisbec, linux-kernel, mingo, linux-arch,
	linux, xiaoguangrong, wangyun, paulmck, nikunj, linux-pm, rusty,
	rostedt, rjw, namhyung, tglx, linux-arm-kernel, netdev, oleg, sbw,
	tj, akpm, linuxppc-dev
In-Reply-To: <20130122104506.32b4e581@nehalam.linuxnetplumber.net>

On 01/23/2013 12:15 AM, Stephen Hemminger wrote:
> On Tue, 22 Jan 2013 13:03:22 +0530
> "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> wrote:
> 
>> A straight-forward (and obvious) algorithm to implement Per-CPU Reader-Writer
>> locks can also lead to too many deadlock possibilities which can make it very
>> hard/impossible to use. This is explained in the example below, which helps
>> justify the need for a different algorithm to implement flexible Per-CPU
>> Reader-Writer locks.
>>
>> We can use global rwlocks as shown below safely, without fear of deadlocks:
>>
>> Readers:
>>
>>          CPU 0                                CPU 1
>>          ------                               ------
>>
>> 1.    spin_lock(&random_lock);             read_lock(&my_rwlock);
>>
>>
>> 2.    read_lock(&my_rwlock);               spin_lock(&random_lock);
>>
>>
>> Writer:
>>
>>          CPU 2:
>>          ------
>>
>>        write_lock(&my_rwlock);
>>
>>
>> We can observe that there is no possibility of deadlocks or circular locking
>> dependencies here. It's perfectly safe.
>>
>> Now consider a blind/straight-forward conversion of global rwlocks to per-CPU
>> rwlocks like this:
>>
>> The reader locks its own per-CPU rwlock for read, and proceeds.
>>
>> Something like: read_lock(per-cpu rwlock of this cpu);
>>
>> The writer acquires all per-CPU rwlocks for write and only then proceeds.
>>
>> Something like:
>>
>>   for_each_online_cpu(cpu)
>> 	write_lock(per-cpu rwlock of 'cpu');
>>
>>
>> Now let's say that for performance reasons, the above scenario (which was
>> perfectly safe when using global rwlocks) was converted to use per-CPU rwlocks.
>>
>>
>>          CPU 0                                CPU 1
>>          ------                               ------
>>
>> 1.    spin_lock(&random_lock);             read_lock(my_rwlock of CPU 1);
>>
>>
>> 2.    read_lock(my_rwlock of CPU 0);       spin_lock(&random_lock);
>>
>>
>> Writer:
>>
>>          CPU 2:
>>          ------
>>
>>       for_each_online_cpu(cpu)
>>         write_lock(my_rwlock of 'cpu');
>>
>>
>> Consider what happens if the writer begins his operation in between steps 1
>> and 2 at the reader side. It becomes evident that we end up in a (previously
>> non-existent) deadlock due to a circular locking dependency between the 3
>> entities, like this:
>>
>>
>> (holds              Waiting for
>>  random_lock) CPU 0 -------------> CPU 2  (holds my_rwlock of CPU 0
>>                                                for write)
>>                ^                   |
>>                |                   |
>>         Waiting|                   | Waiting
>>           for  |                   |  for
>>                |                   V
>>                 ------ CPU 1 <------
>>
>>                 (holds my_rwlock of
>>                  CPU 1 for read)
>>
>>
>>
>> So obviously this "straight-forward" way of implementing percpu rwlocks is
>> deadlock-prone. One simple measure for (or characteristic of) safe percpu
>> rwlock should be that if a user replaces global rwlocks with per-CPU rwlocks
>> (for performance reasons), he shouldn't suddenly end up in numerous deadlock
>> possibilities which never existed before. The replacement should continue to
>> remain safe, and perhaps improve the performance.
>>
>> Observing the robustness of global rwlocks in providing a fair amount of
>> deadlock safety, we implement per-CPU rwlocks as nothing but global rwlocks,
>> as a first step.
>>
>>
>> Cc: David Howells <dhowells@redhat.com>
>> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
> 
> We got rid of brlock years ago, do we have to reintroduce it like this?
> The problem was that brlock caused starvation.
> 

Um? I still see it in include/linux/lglock.h and its users in fs/ directory.

BTW, I'm not advocating that everybody start converting their global reader-writer
locks to per-cpu rwlocks, because such a conversion probably won't make sense
in all scenarios.

The thing is, for CPU hotplug in particular, the "preempt_disable() at the reader;
stop_machine() at the writer" scheme had some very desirable properties at the
reader side (even though people might hate stop_machine() with all their
heart ;-)), namely : 

At the reader side:

o No need to hold locks to prevent CPU offline
o Extremely fast/optimized updates (the preempt count)
o No need for heavy memory barriers
o Extremely flexible nesting rules

So this made perfect sense at the reader for CPU hotplug, because it is expected
that CPU hotplug operations are very infrequent, and it is well-known that quite
a few atomic hotplug readers are in very hot paths. The problem was that the
stop_machine() at the writer was not only a little too heavy, but also inflicted
real-time latencies on the system because it needed cooperation from _all_ CPUs
synchronously, to take one CPU down.

So the idea is to get rid of stop_machine() without hurting the reader side.
And this scheme of per-cpu rwlocks comes close to ensuring that. (You can look
at the previous versions of this patchset [links given in cover letter] to see
what other schemes we hashed out before coming to this one).

The only reason I exposed this as a generic locking scheme was because Tejun
pointed out that implementing complex locking schemes in individual subsystems
is not such a good idea. And also this comes at a time when per-cpu rwsemaphores
have just been introduced in the kernel and Oleg had ideas about converting the
cpu hotplug (sleepable) locking to use them.
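
For reference, the blind per-CPU conversion from the quoted patch
description can be replayed with a single-threaded toy model. This is a
sketch only: it tracks lock state with plain integers, has no atomicity,
and is not the algorithm this patchset implements. All names
(percpu_rwlock_model and the model_* functions) are hypothetical.

```c
#include <stdbool.h>

#define NR_CPUS 3

/* Toy lock state, one slot per CPU. Not a real lock: no atomics. */
struct percpu_rwlock_model {
	int readers[NR_CPUS];		/* active readers on each CPU's lock */
	bool write_held[NR_CPUS];	/* writer owns this CPU's lock */
};

/* Reader side: look only at this CPU's own lock. */
bool model_read_trylock(struct percpu_rwlock_model *l, int cpu)
{
	if (l->write_held[cpu])
		return false;	/* a real reader would spin here */
	l->readers[cpu]++;
	return true;
}

void model_read_unlock(struct percpu_rwlock_model *l, int cpu)
{
	l->readers[cpu]--;
}

/*
 * Writer side: take every CPU's lock in order. Returns how many locks
 * were acquired. A result below NR_CPUS means the writer is stuck at
 * that index while still holding all the earlier locks for write.
 */
int model_write_lock_all(struct percpu_rwlock_model *l)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (l->readers[cpu] > 0)
			break;
		l->write_held[cpu] = true;
	}
	return cpu;
}
```

Replaying the scenario: CPU 1 takes its read lock, the writer then gets as
far as CPU 0's lock and stalls on CPU 1's, and CPU 0's later read attempt
fails because its own lock is already write-held. Add random_lock and that
is the three-way cycle in the diagram.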

Regards,
Srivatsa S. Bhat

^ permalink raw reply

* Re: [PATCH v5 01/45] percpu_rwlock: Introduce the global reader-writer lock backend
From: Steven Rostedt @ 2013-01-22 19:32 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: linux-doc, peterz, fweisbec, linux-kernel, mingo, linux-arch,
	linux, xiaoguangrong, wangyun, paulmck, nikunj, linux-pm, rusty,
	rjw, namhyung, tglx, linux-arm-kernel, netdev, oleg, sbw, tj,
	akpm, linuxppc-dev
In-Reply-To: <20130122073315.13822.27093.stgit@srivatsabhat.in.ibm.com>

On Tue, 2013-01-22 at 13:03 +0530, Srivatsa S. Bhat wrote:
> A straight-forward (and obvious) algorithm to implement Per-CPU Reader-Writer
> locks can also lead to too many deadlock possibilities which can make it very
> hard/impossible to use. This is explained in the example below, which helps
> justify the need for a different algorithm to implement flexible Per-CPU
> Reader-Writer locks.
> 
> We can use global rwlocks as shown below safely, without fear of deadlocks:
> 
> Readers:
> 
>          CPU 0                                CPU 1
>          ------                               ------
> 
> 1.    spin_lock(&random_lock);             read_lock(&my_rwlock);
> 
> 
> 2.    read_lock(&my_rwlock);               spin_lock(&random_lock);
> 
> 
> Writer:
> 
>          CPU 2:
>          ------
> 
>        write_lock(&my_rwlock);
> 

I thought global locks are now fair. That is, a reader will block if a
writer is waiting. Hence, the above should deadlock on the current
rwlock_t types.

We need to fix those locations (or better yet, remove all rwlocks ;-)
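
The difference can be checked with a toy wait-for graph. This is a sketch
under stated assumptions: each CPU is a node, an edge means "cannot proceed
until that CPU releases a lock", and "fair" means a new reader queues behind
a waiting writer. The names wait_graph, has_deadlock() and build_scenario()
are made up for the illustration.

```c
#include <stdbool.h>

#define NR_CPUS 3

/* waits_for[a][b]: CPU a cannot proceed until CPU b releases a lock. */
struct wait_graph {
	bool waits_for[NR_CPUS][NR_CPUS];
};

/* Depth-first search; true if a cycle is reachable from 'cpu'. */
static bool visit(const struct wait_graph *g, int cpu,
		  bool *on_path, bool *done)
{
	if (on_path[cpu])
		return true;	/* back edge: circular wait */
	if (done[cpu])
		return false;
	on_path[cpu] = true;
	for (int next = 0; next < NR_CPUS; next++)
		if (g->waits_for[cpu][next] &&
		    visit(g, next, on_path, done))
			return true;
	on_path[cpu] = false;
	done[cpu] = true;
	return false;
}

/* A deadlock in this model is any cycle in the wait-for graph. */
bool has_deadlock(const struct wait_graph *g)
{
	bool on_path[NR_CPUS] = { false };
	bool done[NR_CPUS] = { false };

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (visit(g, cpu, on_path, done))
			return true;
	return false;
}

/*
 * Quoted scenario after step 1, with the writer on CPU 2 arriving
 * before step 2 runs.
 */
struct wait_graph build_scenario(bool fair_rwlock)
{
	struct wait_graph g = { { { false } } };

	g.waits_for[1][0] = true;	/* CPU 1 wants random_lock (CPU 0 has it) */
	g.waits_for[2][1] = true;	/* writer waits for CPU 1's read lock */
	if (fair_rwlock)
		g.waits_for[0][2] = true; /* new reader queues behind the writer */
	return g;
}
```

With fair_rwlock false the graph is a chain (2 -> 1 -> 0) and CPU 0 can
still take the read lock and make progress. With fair_rwlock true the extra
0 -> 2 edge closes the loop.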

-- Steve

^ permalink raw reply

* Re: [PATCH v5 01/45] percpu_rwlock: Introduce the global reader-writer lock backend
From: Stephen Hemminger @ 2013-01-22 18:45 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: linux-doc, peterz, fweisbec, linux-kernel, mingo, linux-arch,
	linux, xiaoguangrong, wangyun, paulmck, nikunj, linux-pm, rusty,
	rostedt, rjw, namhyung, tglx, linux-arm-kernel, netdev, oleg, sbw,
	tj, akpm, linuxppc-dev
In-Reply-To: <20130122073315.13822.27093.stgit@srivatsabhat.in.ibm.com>

On Tue, 22 Jan 2013 13:03:22 +0530
"Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> wrote:

> A straight-forward (and obvious) algorithm to implement Per-CPU Reader-Writer
> locks can also lead to too many deadlock possibilities which can make it very
> hard/impossible to use. This is explained in the example below, which helps
> justify the need for a different algorithm to implement flexible Per-CPU
> Reader-Writer locks.
> 
> We can use global rwlocks as shown below safely, without fear of deadlocks:
> 
> Readers:
> 
>          CPU 0                                CPU 1
>          ------                               ------
> 
> 1.    spin_lock(&random_lock);             read_lock(&my_rwlock);
> 
> 
> 2.    read_lock(&my_rwlock);               spin_lock(&random_lock);
> 
> 
> Writer:
> 
>          CPU 2:
>          ------
> 
>        write_lock(&my_rwlock);
> 
> 
> We can observe that there is no possibility of deadlocks or circular locking
> dependencies here. It's perfectly safe.
> 
> Now consider a blind/straight-forward conversion of global rwlocks to per-CPU
> rwlocks like this:
> 
> The reader locks its own per-CPU rwlock for read, and proceeds.
> 
> Something like: read_lock(per-cpu rwlock of this cpu);
> 
> The writer acquires all per-CPU rwlocks for write and only then proceeds.
> 
> Something like:
> 
>   for_each_online_cpu(cpu)
> 	write_lock(per-cpu rwlock of 'cpu');
> 
> 
> Now let's say that for performance reasons, the above scenario (which was
> perfectly safe when using global rwlocks) was converted to use per-CPU rwlocks.
> 
> 
>          CPU 0                                CPU 1
>          ------                               ------
> 
> 1.    spin_lock(&random_lock);             read_lock(my_rwlock of CPU 1);
> 
> 
> 2.    read_lock(my_rwlock of CPU 0);       spin_lock(&random_lock);
> 
> 
> Writer:
> 
>          CPU 2:
>          ------
> 
>       for_each_online_cpu(cpu)
>         write_lock(my_rwlock of 'cpu');
> 
> 
> Consider what happens if the writer begins his operation in between steps 1
> and 2 at the reader side. It becomes evident that we end up in a (previously
> non-existent) deadlock due to a circular locking dependency between the 3
> entities, like this:
> 
> 
> (holds              Waiting for
>  random_lock) CPU 0 -------------> CPU 2  (holds my_rwlock of CPU 0
>                                                for write)
>                ^                   |
>                |                   |
>         Waiting|                   | Waiting
>           for  |                   |  for
>                |                   V
>                 ------ CPU 1 <------
> 
>                 (holds my_rwlock of
>                  CPU 1 for read)
> 
> 
> 
> So obviously this "straight-forward" way of implementing percpu rwlocks is
> deadlock-prone. One simple measure for (or characteristic of) safe percpu
> rwlock should be that if a user replaces global rwlocks with per-CPU rwlocks
> (for performance reasons), he shouldn't suddenly end up in numerous deadlock
> possibilities which never existed before. The replacement should continue to
> remain safe, and perhaps improve the performance.
> 
> Observing the robustness of global rwlocks in providing a fair amount of
> deadlock safety, we implement per-CPU rwlocks as nothing but global rwlocks,
> as a first step.
> 
> 
> Cc: David Howells <dhowells@redhat.com>
> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>

We got rid of brlock years ago, do we have to reintroduce it like this?
The problem was that brlock caused starvation.

^ permalink raw reply

* Re: [PATCH 6/6][v3] perf: Document the ABI of perf sysfs entries
From: Jiri Olsa @ 2013-01-22 17:10 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Andi Kleen, Peter Zijlstra, robert.richter, Greg KH,
	Anton Blanchard, linux-kernel, Stephane Eranian, linuxppc-dev,
	Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo
In-Reply-To: <20130118174654.GA12575@us.ibm.com>

On Fri, Jan 18, 2013 at 09:46:54AM -0800, Sukadev Bhattiprolu wrote:
> Jiri Olsa [jolsa@redhat.com] wrote:

SNIP

> +
> +Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
> +		Linux Powerpc mailing list <linuxppc-dev@ozlabs.org>
> +
> +Description:	POWER-systems specific performance monitoring events
> +
> +		A collection of performance monitoring events that may be
> +		supported by the POWER CPU. These events can be monitored
> +		using the 'perf(1)' tool.
> +
> +		These events may not be supported by other CPUs.
> +
> +		The contents of each file would look like:
> +
> +			event=0xNNNN
> +
> +		where 'N' is a hex digit and the number '0xNNNN' shows the
> +		"raw code" for the perf event identified by the file's
> +		"basename".
> +
> +		Further, multiple terms like 'event=0xNNNN' can be specified
> +		and separated with comma. All available terms are defined in
> +		the /sys/bus/event_source/devices/<dev>/format file.
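
A reader of such a file could split the contents as sketched below. This is
an illustration only, not how perf parses these files: struct event_term,
TERM_NAME_MAX and parse_event_terms() are hypothetical names, and the
sketch assumes every term has the 'name=0xNNNN' shape described above.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TERM_NAME_MAX 32

struct event_term {
	char name[TERM_NAME_MAX];	/* e.g. "event" */
	uint64_t value;			/* parsed hex value */
};

/*
 * Parse "name=0xNNNN[,name=0xNNNN...]" into terms[]. Returns the number
 * of terms parsed, or -1 on malformed input or when max_terms is too
 * small. A trailing newline is tolerated.
 */
int parse_event_terms(const char *text, struct event_term *terms,
		      int max_terms)
{
	int count = 0;

	while (*text != '\0') {
		const char *eq = strchr(text, '=');
		const char *end_of_term = text + strcspn(text, ",\n");
		size_t name_len;
		char *num_end;

		if (eq == NULL || eq >= end_of_term || count == max_terms)
			return -1;
		name_len = (size_t)(eq - text);
		if (name_len == 0 || name_len >= TERM_NAME_MAX)
			return -1;

		memcpy(terms[count].name, text, name_len);
		terms[count].name[name_len] = '\0';

		/* Base 16 accepts an optional "0x" prefix. */
		terms[count].value = strtoull(eq + 1, &num_end, 16);
		if (num_end == eq + 1 || num_end != end_of_term)
			return -1;
		count++;

		text = end_of_term;
		if (*text == ',')
			text++;		/* move on to the next term */
		else if (*text == '\n')
			break;		/* end of the file contents */
	}
	return count;
}
```

For a file containing 'event=0x1e' (the PM_CYC code from patch 1/6), this
yields one term named "event" with value 0x1e.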

Acked-by: Jiri Olsa <jolsa@redhat.com>

thanks,
jirka

^ permalink raw reply

* Re: [PATCH] perf: Fix compile warnings in tests/attr.c
From: Jiri Olsa @ 2013-01-22 13:57 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: linuxppc-dev, Anton Blanchard, paulus, linux-kernel, acme
In-Reply-To: <20130121213823.GA4774@us.ibm.com>

On Mon, Jan 21, 2013 at 01:38:23PM -0800, Sukadev Bhattiprolu wrote:
> Jiri Olsa [jolsa@redhat.com] wrote:
> | On Fri, Jan 18, 2013 at 05:30:52PM -0800, Sukadev Bhattiprolu wrote:

SNIP

> __u64 is 'unsigned long long' on x86 and PRIu64 is 'llu' which is fine.
> 
> __u64 is 'unsigned long' on Power and PRIu64 is 'lu' which is again fine.
> 
> But __u64 is 'unsigned long long' on x86_64, but PRIu64 is '%lu' bc __WORDSIZE
> is 64.
> 
> On x86_64, shouldn't __u64, be defined as 'unsigned long' rather than
> 'unsigned long long' - ie include 'int-l64.h' rather than 'int-ll64.h' ?

hum, not sure ;-) will try to find some time to look into that

> 
> BTW, does 'perf' with my patch compile, (with warnings) for you on x86_64
> with 'WERROR=0 make' ?

this one passes with warnings

jirka

^ permalink raw reply

* [PATCH Bug fix 4/4] Rename movablecore_map to movablemem_map.
From: Tang Chen @ 2013-01-22 11:46 UTC (permalink / raw)
  To: akpm, rientjes, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, hpa, linfeng,
	laijs, mgorman, yinghai, glommer, jiang.liu, julian.calaby, sfr
  Cc: linux-mm, x86, linuxppc-dev, linux-kernel, linux-acpi
In-Reply-To: <1358855181-6160-1-git-send-email-tangchen@cn.fujitsu.com>

Since "core" could be confused with CPU cores, while here it refers to
memory, rename the boot option movablecore_map to movablemem_map.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 Documentation/kernel-parameters.txt |    8 ++--
 include/linux/memblock.h            |    2 +-
 include/linux/mm.h                  |    8 ++--
 mm/memblock.c                       |    8 ++--
 mm/page_alloc.c                     |   96 +++++++++++++++++-----------------
 5 files changed, 61 insertions(+), 61 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index f02aa4c..7770611 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1637,7 +1637,7 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			that the amount of memory usable for all allocations
 			is not too small.
 
-	movablecore_map=nn[KMG]@ss[KMG]
+	movablemem_map=nn[KMG]@ss[KMG]
 			[KNL,X86,IA-64,PPC] This parameter is similar to
 			memmap except it specifies the memory map of
 			ZONE_MOVABLE.
@@ -1647,11 +1647,11 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			ss to the end of the 1st node will be ZONE_MOVABLE,
 			and all the rest nodes will only have ZONE_MOVABLE.
 			If memmap is specified at the same time, the
-			movablecore_map will be limited within the memmap
+			movablemem_map will be limited within the memmap
 			areas. If kernelcore or movablecore is also specified,
-			movablecore_map will have higher priority to be
+			movablemem_map will have higher priority to be
 			satisfied. So the administrator should be careful that
-			the amount of movablecore_map areas are not too large.
+			the amount of movablemem_map areas are not too large.
 			Otherwise kernel won't have enough memory to start.
 
 	MTD_Partition=	[MTD]
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index ac52bbc..1094952 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -60,7 +60,7 @@ int memblock_reserve(phys_addr_t base, phys_addr_t size);
 void memblock_trim_memory(phys_addr_t align);
 
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
-extern struct movablecore_map movablecore_map;
+extern struct movablemem_map movablemem_map;
 
 void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
 			  unsigned long *out_end_pfn, int *out_nid);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1559e35..7cef651 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1359,15 +1359,15 @@ extern void free_bootmem_with_active_regions(int nid,
 						unsigned long max_low_pfn);
 extern void sparse_memory_present_with_active_regions(int nid);
 
-#define MOVABLECORE_MAP_MAX MAX_NUMNODES
-struct movablecore_entry {
+#define MOVABLEMEM_MAP_MAX MAX_NUMNODES
+struct movablemem_entry {
 	unsigned long start_pfn;    /* start pfn of memory segment */
 	unsigned long end_pfn;      /* end pfn of memory segment (exclusive) */
 };
 
-struct movablecore_map {
+struct movablemem_map {
 	int nr_map;
-	struct movablecore_entry map[MOVABLECORE_MAP_MAX];
+	struct movablemem_entry map[MOVABLEMEM_MAP_MAX];
 };
 
 #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
diff --git a/mm/memblock.c b/mm/memblock.c
index 0218231..c47ddd5 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -105,7 +105,7 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
 {
 	phys_addr_t this_start, this_end, cand;
 	u64 i;
-	int curr = movablecore_map.nr_map - 1;
+	int curr = movablemem_map.nr_map - 1;
 
 	/* pump up @end */
 	if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
@@ -124,15 +124,15 @@ restart:
 			continue;
 
 		for (; curr >= 0; curr--) {
-			if ((movablecore_map.map[curr].start_pfn << PAGE_SHIFT)
+			if ((movablemem_map.map[curr].start_pfn << PAGE_SHIFT)
 			    < this_end)
 				break;
 		}
 
 		cand = round_down(this_end - size, align);
 		if (curr >= 0 &&
-		    cand < movablecore_map.map[curr].end_pfn << PAGE_SHIFT) {
-			this_end = movablecore_map.map[curr].start_pfn
+		    cand < movablemem_map.map[curr].end_pfn << PAGE_SHIFT) {
+			this_end = movablemem_map.map[curr].start_pfn
 				   << PAGE_SHIFT;
 			goto restart;
 		}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2bd529e..3978797 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -202,7 +202,7 @@ static unsigned long __meminitdata dma_reserve;
 
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 /* Movable memory ranges, will also be used by memblock subsystem. */
-struct movablecore_map movablecore_map;
+struct movablemem_map movablemem_map;
 
 static unsigned long __meminitdata arch_zone_lowest_possible_pfn[MAX_NR_ZONES];
 static unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
@@ -4375,7 +4375,7 @@ static unsigned long __meminit zone_absent_pages_in_node(int nid,
  * sanitize_zone_movable_limit() - Sanitize the zone_movable_limit array.
  *
  * zone_movable_limit is initialized as 0. This function will try to get
- * the first ZONE_MOVABLE pfn of each node from movablecore_map, and
+ * the first ZONE_MOVABLE pfn of each node from movablemem_map, and
  * assigne them to zone_movable_limit.
  * zone_movable_limit[nid] == 0 means no limit for the node.
  *
@@ -4386,7 +4386,7 @@ static void __meminit sanitize_zone_movable_limit(void)
 	int map_pos = 0, i, nid;
 	unsigned long start_pfn, end_pfn;
 
-	if (!movablecore_map.nr_map)
+	if (!movablemem_map.nr_map)
 		return;
 
 	/* Iterate all ranges from minimum to maximum */
@@ -4420,22 +4420,22 @@ static void __meminit sanitize_zone_movable_limit(void)
 		if (start_pfn >= end_pfn)
 			continue;
 
-		while (map_pos < movablecore_map.nr_map) {
-			if (end_pfn <= movablecore_map.map[map_pos].start_pfn)
+		while (map_pos < movablemem_map.nr_map) {
+			if (end_pfn <= movablemem_map.map[map_pos].start_pfn)
 				break;
 
-			if (start_pfn >= movablecore_map.map[map_pos].end_pfn) {
+			if (start_pfn >= movablemem_map.map[map_pos].end_pfn) {
 				map_pos++;
 				continue;
 			}
 
 			/*
 			 * The start_pfn of ZONE_MOVABLE is either the minimum
-			 * pfn specified by movablecore_map, or 0, which means
+			 * pfn specified by movablemem_map, or 0, which means
 			 * the node has no ZONE_MOVABLE.
 			 */
 			zone_movable_limit[nid] = max(start_pfn,
-					movablecore_map.map[map_pos].start_pfn);
+					movablemem_map.map[map_pos].start_pfn);
 
 			break;
 		}
@@ -4898,12 +4898,12 @@ static void __init find_zone_movable_pfns_for_nodes(void)
 	}
 
 	/*
-	 * If neither kernelcore/movablecore nor movablecore_map is specified,
-	 * there is no ZONE_MOVABLE. But if movablecore_map is specified, the
+	 * If neither kernelcore/movablecore nor movablemem_map is specified,
+	 * there is no ZONE_MOVABLE. But if movablemem_map is specified, the
 	 * start pfn of ZONE_MOVABLE has been stored in zone_movable_limit[].
 	 */
 	if (!required_kernelcore) {
-		if (movablecore_map.nr_map)
+		if (movablemem_map.nr_map)
 			memcpy(zone_movable_pfn, zone_movable_limit,
 				sizeof(zone_movable_pfn));
 		goto out;
@@ -5168,14 +5168,14 @@ early_param("kernelcore", cmdline_parse_kernelcore);
 early_param("movablecore", cmdline_parse_movablecore);
 
 /**
- * insert_movablecore_map() - Insert a memory range in to movablecore_map.map.
+ * insert_movablemem_map() - Insert a memory range in to movablemem_map.map.
  * @start_pfn:	start pfn of the range
  * @end_pfn:	end pfn of the range
  *
  * This function will also merge the overlapped ranges, and sort the array
  * by start_pfn in monotonic increasing order.
  */
-static void __init insert_movablecore_map(unsigned long start_pfn,
+static void __init insert_movablemem_map(unsigned long start_pfn,
 					  unsigned long end_pfn)
 {
 	int pos, overlap;
@@ -5184,31 +5184,31 @@ static void __init insert_movablecore_map(unsigned long start_pfn,
 	 * pos will be at the 1st overlapped range, or the position
 	 * where the element should be inserted.
 	 */
-	for (pos = 0; pos < movablecore_map.nr_map; pos++)
-		if (start_pfn <= movablecore_map.map[pos].end_pfn)
+	for (pos = 0; pos < movablemem_map.nr_map; pos++)
+		if (start_pfn <= movablemem_map.map[pos].end_pfn)
 			break;
 
 	/* If there is no overlapped range, just insert the element. */
-	if (pos == movablecore_map.nr_map ||
-	    end_pfn < movablecore_map.map[pos].start_pfn) {
+	if (pos == movablemem_map.nr_map ||
+	    end_pfn < movablemem_map.map[pos].start_pfn) {
 		/*
 		 * If pos is not the end of array, we need to move all
 		 * the rest elements backward.
 		 */
-		if (pos < movablecore_map.nr_map)
-			memmove(&movablecore_map.map[pos+1],
-				&movablecore_map.map[pos],
-				sizeof(struct movablecore_entry) *
-				(movablecore_map.nr_map - pos));
-		movablecore_map.map[pos].start_pfn = start_pfn;
-		movablecore_map.map[pos].end_pfn = end_pfn;
-		movablecore_map.nr_map++;
+		if (pos < movablemem_map.nr_map)
+			memmove(&movablemem_map.map[pos+1],
+				&movablemem_map.map[pos],
+				sizeof(struct movablemem_entry) *
+				(movablemem_map.nr_map - pos));
+		movablemem_map.map[pos].start_pfn = start_pfn;
+		movablemem_map.map[pos].end_pfn = end_pfn;
+		movablemem_map.nr_map++;
 		return;
 	}
 
 	/* overlap will be at the last overlapped range */
-	for (overlap = pos + 1; overlap < movablecore_map.nr_map; overlap++)
-		if (end_pfn < movablecore_map.map[overlap].start_pfn)
+	for (overlap = pos + 1; overlap < movablemem_map.nr_map; overlap++)
+		if (end_pfn < movablemem_map.map[overlap].start_pfn)
 			break;
 
 	/*
@@ -5216,29 +5216,29 @@ static void __init insert_movablecore_map(unsigned long start_pfn,
 	 * and move the rest elements forward.
 	 */
 	overlap--;
-	movablecore_map.map[pos].start_pfn = min(start_pfn,
-					movablecore_map.map[pos].start_pfn);
-	movablecore_map.map[pos].end_pfn = max(end_pfn,
-					movablecore_map.map[overlap].end_pfn);
+	movablemem_map.map[pos].start_pfn = min(start_pfn,
+					movablemem_map.map[pos].start_pfn);
+	movablemem_map.map[pos].end_pfn = max(end_pfn,
+					movablemem_map.map[overlap].end_pfn);
 
-	if (pos != overlap && overlap + 1 != movablecore_map.nr_map)
-		memmove(&movablecore_map.map[pos+1],
-			&movablecore_map.map[overlap+1],
-			sizeof(struct movablecore_entry) *
-			(movablecore_map.nr_map - overlap - 1));
+	if (pos != overlap && overlap + 1 != movablemem_map.nr_map)
+		memmove(&movablemem_map.map[pos+1],
+			&movablemem_map.map[overlap+1],
+			sizeof(struct movablemem_entry) *
+			(movablemem_map.nr_map - overlap - 1));
 
-	movablecore_map.nr_map -= overlap - pos;
+	movablemem_map.nr_map -= overlap - pos;
 }
 
 /**
- * movablecore_map_add_region() - Add a memory range into movablecore_map.
+ * movablemem_map_add_region() - Add a memory range into movablemem_map.
  * @start:	physical start address of range
  * @end:	physical end address of range
  *
  * This function transform the physical address into pfn, and then add the
- * range into movablecore_map by calling insert_movablecore_map().
+ * range into movablemem_map by calling insert_movablemem_map().
  */
-static void __init movablecore_map_add_region(u64 start, u64 size)
+static void __init movablemem_map_add_region(u64 start, u64 size)
 {
 	unsigned long start_pfn, end_pfn;
 
@@ -5246,8 +5246,8 @@ static void __init movablecore_map_add_region(u64 start, u64 size)
 	if (start + size <= start)
 		return;
 
-	if (movablecore_map.nr_map >= ARRAY_SIZE(movablecore_map.map)) {
-		pr_err("movable_memory_map: too many entries;"
+	if (movablemem_map.nr_map >= ARRAY_SIZE(movablemem_map.map)) {
+		pr_err("movablemem_map: too many entries;"
 			" ignoring [mem %#010llx-%#010llx]\n",
 			(unsigned long long) start,
 			(unsigned long long) (start + size - 1));
@@ -5256,19 +5256,19 @@ static void __init movablecore_map_add_region(u64 start, u64 size)
 
 	start_pfn = PFN_DOWN(start);
 	end_pfn = PFN_UP(start + size);
-	insert_movablecore_map(start_pfn, end_pfn);
+	insert_movablemem_map(start_pfn, end_pfn);
 }
 
 /*
- * cmdline_parse_movablecore_map() - Parse boot option movablecore_map.
+ * cmdline_parse_movablemem_map() - Parse boot option movablemem_map.
  * @p:	The boot option of the following format:
- * 	movablecore_map=nn[KMG]@ss[KMG]
+ * 	movablemem_map=nn[KMG]@ss[KMG]
  *
  * This option sets the memory range [ss, ss+nn) to be used as movable memory.
  *
  * Return: 0 on success or -EINVAL on failure.
  */
-static int __init cmdline_parse_movablecore_map(char *p)
+static int __init cmdline_parse_movablemem_map(char *p)
 {
 	char *oldp;
 	u64 start_at, mem_size;
@@ -5287,13 +5287,13 @@ static int __init cmdline_parse_movablecore_map(char *p)
 		if (p == oldp || *p != '\0')
 			goto err;
 
-		movablecore_map_add_region(start_at, mem_size);
+		movablemem_map_add_region(start_at, mem_size);
 		return 0;
 	}
 err:
 	return -EINVAL;
 }
-early_param("movablecore_map", cmdline_parse_movablecore_map);
+early_param("movablemem_map", cmdline_parse_movablemem_map);
 
 #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
 
-- 
1.7.1
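
The insert-and-merge behaviour documented for insert_movablemem_map() in the
diff above can be tried outside the kernel. The sketch below is a userspace
re-creation under stated assumptions: struct range, struct range_map,
RANGE_MAX and range_map_insert() are hypothetical names, and the capacity is
an arbitrary small constant rather than MAX_NUMNODES.

```c
#include <assert.h>
#include <string.h>

#define RANGE_MAX 8	/* hypothetical capacity; the patch uses MAX_NUMNODES */

struct range {
	unsigned long start;	/* first pfn of the range */
	unsigned long end;	/* end pfn (exclusive) */
};

struct range_map {
	int nr;
	struct range map[RANGE_MAX];
};

static inline unsigned long ul_min(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

static inline unsigned long ul_max(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/* Insert [start, end) while keeping the array sorted and merged. */
static void range_map_insert(struct range_map *m, unsigned long start,
			     unsigned long end)
{
	int pos, overlap;

	/* pos: first overlapping (or touching) entry, or the insertion slot */
	for (pos = 0; pos < m->nr; pos++)
		if (start <= m->map[pos].end)
			break;

	/* No overlap: shift the tail up by one and store the new entry. */
	if (pos == m->nr || end < m->map[pos].start) {
		assert(m->nr < RANGE_MAX);
		if (pos < m->nr)
			memmove(&m->map[pos + 1], &m->map[pos],
				sizeof(struct range) * (m->nr - pos));
		m->map[pos].start = start;
		m->map[pos].end = end;
		m->nr++;
		return;
	}

	/* overlap: last entry that overlaps (or touches) the new range */
	for (overlap = pos + 1; overlap < m->nr; overlap++)
		if (end < m->map[overlap].start)
			break;
	overlap--;

	/* Fold entries pos..overlap into one and close the gap behind it. */
	m->map[pos].start = ul_min(start, m->map[pos].start);
	m->map[pos].end = ul_max(end, m->map[overlap].end);

	if (pos != overlap && overlap + 1 != m->nr)
		memmove(&m->map[pos + 1], &m->map[overlap + 1],
			sizeof(struct range) * (m->nr - overlap - 1));

	m->nr -= overlap - pos;
}
```

Because the comparisons use <= and <, ranges that merely touch (one ends
where the next starts) are merged as well, which matches the comparisons in
the diff.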

^ permalink raw reply related

* [PATCH Bug fix 2/4] Bug fix: Fix the doc format.
From: Tang Chen @ 2013-01-22 11:46 UTC (permalink / raw)
  To: akpm, rientjes, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, hpa, linfeng,
	laijs, mgorman, yinghai, glommer, jiang.liu, julian.calaby, sfr
  Cc: linux-mm, x86, linuxppc-dev, linux-kernel, linux-acpi
In-Reply-To: <1358855181-6160-1-git-send-email-tangchen@cn.fujitsu.com>

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 mm/page_alloc.c |   23 ++++++++++++++---------
 1 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 00037a3..cd6f8a6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4372,7 +4372,7 @@ static unsigned long __meminit zone_absent_pages_in_node(int nid,
 }
 
 /**
- * sanitize_zone_movable_limit - Sanitize the zone_movable_limit array.
+ * sanitize_zone_movable_limit() - Sanitize the zone_movable_limit array.
  *
  * zone_movable_limit is initialized as 0. This function will try to get
  * the first ZONE_MOVABLE pfn of each node from movablecore_map, and
@@ -5173,9 +5173,9 @@ early_param("kernelcore", cmdline_parse_kernelcore);
 early_param("movablecore", cmdline_parse_movablecore);
 
 /**
- * insert_movablecore_map - Insert a memory range in to movablecore_map.map.
- * @start_pfn: start pfn of the range
- * @end_pfn: end pfn of the range
+ * insert_movablecore_map() - Insert a memory range in to movablecore_map.map.
+ * @start_pfn:	start pfn of the range
+ * @end_pfn:	end pfn of the range
  *
  * This function will also merge the overlapped ranges, and sort the array
  * by start_pfn in monotonic increasing order.
@@ -5236,9 +5236,9 @@ static void __init insert_movablecore_map(unsigned long start_pfn,
 }
 
 /**
- * movablecore_map_add_region - Add a memory range into movablecore_map.
- * @start: physical start address of range
- * @end: physical end address of range
+ * movablecore_map_add_region() - Add a memory range into movablecore_map.
+ * @start:	physical start address of range
+ * @end:	physical end address of range
  *
  * This function transform the physical address into pfn, and then add the
  * range into movablecore_map by calling insert_movablecore_map().
@@ -5265,8 +5265,13 @@ static void __init movablecore_map_add_region(u64 start, u64 size)
 }
 
 /*
- * movablecore_map=nn[KMG]@ss[KMG] sets the region of memory to be used as
- * movable memory.
+ * cmdline_parse_movablecore_map() - Parse boot option movablecore_map.
+ * @p:	The boot option of the following format:
+ * 	movablecore_map=nn[KMG]@ss[KMG]
+ *
+ * This option sets the memory range [ss, ss+nn) to be used as movable memory.
+ *
+ * Return: 0 on success or -EINVAL on failure.
  */
 static int __init cmdline_parse_movablecore_map(char *p)
 {
-- 
1.7.1
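
The nn[KMG]@ss[KMG] format described in the new comment can be illustrated
with a small userspace parser. This is a sketch only: parse_size() and
parse_movable_range() are hypothetical names, and parse_size() is a
simplified stand-in for the kernel's memparse() that handles only the K, M
and G suffixes.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Simplified stand-in for memparse(): a number with an optional K/M/G suffix. */
static unsigned long long parse_size(const char *p, char **endp)
{
	unsigned long long v = strtoull(p, endp, 0);

	switch (**endp) {
	case 'G': case 'g':
		v <<= 10;
		/* fall through */
	case 'M': case 'm':
		v <<= 10;
		/* fall through */
	case 'K': case 'k':
		v <<= 10;
		(*endp)++;
		break;
	default:
		break;
	}
	return v;
}

/*
 * Parse "nn[KMG]@ss[KMG]" into a start address and a size in bytes.
 * Returns 0 on success or -EINVAL on a malformed string.
 */
static int parse_movable_range(const char *p, unsigned long long *start,
			       unsigned long long *size)
{
	char *end;

	assert(start != NULL && size != NULL);
	if (!p)
		return -EINVAL;

	*size = parse_size(p, &end);
	if (end == p || *end != '@')
		return -EINVAL;

	p = end + 1;
	*start = parse_size(p, &end);
	if (end == p || *end != '\0')
		return -EINVAL;

	return 0;
}
```

With this sketch, "4G@8G" describes the range [8G, 12G): a size of 4 GiB
starting at 8 GiB. A string without the '@' part, such as "4G", is rejected.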

^ permalink raw reply related

* [PATCH Bug fix 3/4] Bug fix: Remove the unused sanitize_zone_movable_limit() definition.
From: Tang Chen @ 2013-01-22 11:46 UTC (permalink / raw)
  To: akpm, rientjes, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, hpa, linfeng,
	laijs, mgorman, yinghai, glommer, jiang.liu, julian.calaby, sfr
  Cc: linux-mm, x86, linuxppc-dev, linux-kernel, linux-acpi
In-Reply-To: <1358855181-6160-1-git-send-email-tangchen@cn.fujitsu.com>

When CONFIG_HAVE_MEMBLOCK_NODE_MAP is not defined, the empty
sanitize_zone_movable_limit() stub is not used either. So remove it.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 mm/page_alloc.c |    5 -----
 1 files changed, 0 insertions(+), 5 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cd6f8a6..2bd529e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4459,11 +4459,6 @@ static inline unsigned long __meminit zone_absent_pages_in_node(int nid,
 
 	return zholes_size[zone_type];
 }
-
-static void __meminit sanitize_zone_movable_limit(void)
-{
-}
-
 #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
 
 static void __meminit calculate_node_totalpages(struct pglist_data *pgdat,
-- 
1.7.1

^ permalink raw reply related

* [PATCH Bug fix 1/4] Bug fix: Use CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect movablecore_map in memblock_overlaps_region().
From: Tang Chen @ 2013-01-22 11:46 UTC (permalink / raw)
  To: akpm, rientjes, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, hpa, linfeng,
	laijs, mgorman, yinghai, glommer, jiang.liu, julian.calaby, sfr
  Cc: linux-mm, x86, linuxppc-dev, linux-kernel, linux-acpi
In-Reply-To: <1358855181-6160-1-git-send-email-tangchen@cn.fujitsu.com>

The definition of struct movablecore_map is protected by
CONFIG_HAVE_MEMBLOCK_NODE_MAP but its use in memblock_overlaps_region()
is not. So add CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect the use of
movablecore_map in memblock_overlaps_region().

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 include/linux/memblock.h |    3 ++-
 mm/memblock.c            |   34 ++++++++++++++++++++++++++++++++++
 2 files changed, 36 insertions(+), 1 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 6e25597..ac52bbc 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -42,7 +42,6 @@ struct memblock {
 
 extern struct memblock memblock;
 extern int memblock_debug;
-extern struct movablecore_map movablecore_map;
 
 #define memblock_dbg(fmt, ...) \
 	if (memblock_debug) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
@@ -61,6 +60,8 @@ int memblock_reserve(phys_addr_t base, phys_addr_t size);
 void memblock_trim_memory(phys_addr_t align);
 
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
+extern struct movablecore_map movablecore_map;
+
 void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
 			  unsigned long *out_end_pfn, int *out_nid);
 
diff --git a/mm/memblock.c b/mm/memblock.c
index 1e48774..0218231 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -92,9 +92,13 @@ static long __init_memblock memblock_overlaps_region(struct memblock_type *type,
  *
  * Find @size free area aligned to @align in the specified range and node.
  *
+ * If we have CONFIG_HAVE_MEMBLOCK_NODE_MAP defined, we need to check if the
+ * memory we found if not in hotpluggable ranges.
+ *
  * RETURNS:
  * Found address on success, %0 on failure.
  */
+#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
 					phys_addr_t end, phys_addr_t size,
 					phys_addr_t align, int nid)
@@ -139,6 +143,36 @@ restart:
 
 	return 0;
 }
+#else /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
+phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
+					phys_addr_t end, phys_addr_t size,
+					phys_addr_t align, int nid)
+{
+	phys_addr_t this_start, this_end, cand;
+	u64 i;
+
+	/* pump up @end */
+	if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
+		end = memblock.current_limit;
+
+	/* avoid allocating the first page */
+	start = max_t(phys_addr_t, start, PAGE_SIZE);
+	end = max(start, end);
+
+	for_each_free_mem_range_reverse(i, nid, &this_start, &this_end, NULL) {
+		this_start = clamp(this_start, start, end);
+		this_end = clamp(this_end, start, end);
+
+		if (this_end < size)
+			continue;
+
+		cand = round_down(this_end - size, align);
+		if (cand >= this_start)
+			return cand;
+	}
+	return 0;
+}
+#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
 
 /**
  * memblock_find_in_range - find free area in given range
-- 
1.7.1
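
The fallback variant added under #else above is a plain top-down search:
walk the free ranges from highest to lowest, clamp each to [start, end], and
take the highest aligned candidate that still fits. A userspace sketch of
that loop follows; struct free_range and find_top_down() are hypothetical
names, the free ranges are passed in as a sorted array instead of being
walked with for_each_free_mem_range_reverse(), and the "avoid the first
page" adjustment is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct free_range {
	uint64_t start;
	uint64_t end;	/* exclusive */
};

static uint64_t clamp_u64(uint64_t v, uint64_t lo, uint64_t hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/* align must be a power of two */
static uint64_t round_down_u64(uint64_t v, uint64_t align)
{
	return v & ~(align - 1);
}

/*
 * Search ranges r[0..n-1] (sorted by address, ascending) from the top down.
 * Returns the found address, or 0 on failure.
 */
static uint64_t find_top_down(const struct free_range *r, size_t n,
			      uint64_t start, uint64_t end,
			      uint64_t size, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);

	if (end < start)
		end = start;

	for (size_t i = n; i-- > 0; ) {
		uint64_t this_start = clamp_u64(r[i].start, start, end);
		uint64_t this_end = clamp_u64(r[i].end, start, end);
		uint64_t cand;

		/* Range ends too low for the subtraction below to be valid. */
		if (this_end < size)
			continue;

		cand = round_down_u64(this_end - size, align);
		if (cand >= this_start)
			return cand;
	}
	return 0;
}
```

The CONFIG_HAVE_MEMBLOCK_NODE_MAP variant in the patch adds one more step to
this loop: if the candidate falls inside a movablecore_map entry, the search
restarts below that entry.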

^ permalink raw reply related

* [PATCH Bug fix 0/4] Bug fix for movablecore_map boot option.
From: Tang Chen @ 2013-01-22 11:46 UTC (permalink / raw)
  To: akpm, rientjes, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, hpa, linfeng,
	laijs, mgorman, yinghai, glommer, jiang.liu, julian.calaby, sfr
  Cc: linux-mm, x86, linuxppc-dev, linux-kernel, linux-acpi

Hi Andrew,

Patches 1-3 fix some problems with the movablecore_map boot option.
Since the name "core" could be confusing, patch 4 renames this option
to movablemem_map.

All these patches are based on the latest -mm tree.
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git akpm

Tang Chen (4):
  Bug fix: Use CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect movablecore_map
    in memblock_overlaps_region().
  Bug fix: Fix the doc format.
  Bug fix: Remove the unused sanitize_zone_movable_limit() definition.
  Rename movablecore_map to movablemem_map.

 Documentation/kernel-parameters.txt |    8 +-
 include/linux/memblock.h            |    3 +-
 include/linux/mm.h                  |    8 +-
 mm/memblock.c                       |   42 +++++++++++-
 mm/page_alloc.c                     |  116 +++++++++++++++++-----------------
 5 files changed, 106 insertions(+), 71 deletions(-)

^ permalink raw reply

* [PATCH Bug fix 4/5] cpu-hotplug, memory-hotplug: clear cpu_to_node() when offlining the node
From: Tang Chen @ 2013-01-22 11:45 UTC (permalink / raw)
  To: akpm, rjw, len.brown, mingo, tglx, minchan.kim, rientjes, benh,
	paulus, cl, kosaki.motohiro, isimatu.yasuaki, wujianguo, wency,
	hpa, linfeng, laijs, mgorman, yinghai, glommer, jiang.liu,
	julian.calaby, sfr
  Cc: linux-acpi, Mel Gorman, x86, linux-kernel, linux-mm,
	Peter Zijlstra, linuxppc-dev, Jiang Liu
In-Reply-To: <1358855156-6126-1-git-send-email-tangchen@cn.fujitsu.com>

From: Wen Congyang <wency@cn.fujitsu.com>

When the node is offlined, there is no memory/cpu on the node. If a
sleeping task last ran on a cpu of this node, it will be migrated to a
cpu on another node when it is woken up. So we can clear the cpu-to-node
mapping.

Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 mm/memory_hotplug.c |   30 +++++++++++++++++++++++++++++-
 1 files changed, 29 insertions(+), 1 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index edd1773..022583b 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1702,6 +1702,34 @@ static int check_cpu_on_node(void *data)
 	return 0;
 }
 
+static void unmap_cpu_on_node(void *data)
+{
+#ifdef CONFIG_ACPI_NUMA
+	struct pglist_data *pgdat = data;
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		if (cpu_to_node(cpu) == pgdat->node_id)
+			numa_clear_node(cpu);
+#endif
+}
+
+static int check_and_unmap_cpu_on_node(void *data)
+{
+	int ret = check_cpu_on_node(data);
+
+	if (ret)
+		return ret;
+
+	/*
+	 * the node will be offlined when we come here, so we can clear
+	 * the cpu_to_node() now.
+	 */
+
+	unmap_cpu_on_node(data);
+	return 0;
+}
+
 /* offline the node if all memory sections of this node are removed */
 void try_offline_node(int nid)
 {
@@ -1728,7 +1756,7 @@ void try_offline_node(int nid)
 		return;
 	}
 
-	if (stop_machine(check_cpu_on_node, pgdat, NULL))
+	if (stop_machine(check_and_unmap_cpu_on_node, pgdat, NULL))
 		return;
 
 	/*
-- 
1.7.1

^ permalink raw reply related

* [PATCH Bug fix 5/5] Do not use cpu_to_node() to find an offlined cpu's node.
From: Tang Chen @ 2013-01-22 11:45 UTC (permalink / raw)
  To: akpm, rjw, len.brown, mingo, tglx, minchan.kim, rientjes, benh,
	paulus, cl, kosaki.motohiro, isimatu.yasuaki, wujianguo, wency,
	hpa, linfeng, laijs, mgorman, yinghai, glommer, jiang.liu,
	julian.calaby, sfr
  Cc: linux-acpi, Mel Gorman, x86, linux-kernel, linux-mm,
	Peter Zijlstra, linuxppc-dev, Jiang Liu
In-Reply-To: <1358855156-6126-1-git-send-email-tangchen@cn.fujitsu.com>

If a cpu is offline, its nid will be set to -1, and cpu_to_node(cpu) will
return -1. As a result, cpumask_of_node(nid) will return NULL. In this case,
find_next_bit() in for_each_cpu() will get a NULL pointer and cause a panic.

Here is a call trace:
[  609.824017] Call Trace:
[  609.824017]  <IRQ>
[  609.824017]  [<ffffffff810b0721>] select_fallback_rq+0x71/0x190
[  609.824017]  [<ffffffff810b086e>] ? try_to_wake_up+0x2e/0x2f0
[  609.824017]  [<ffffffff810b0b0b>] try_to_wake_up+0x2cb/0x2f0
[  609.824017]  [<ffffffff8109da08>] ? __run_hrtimer+0x78/0x320
[  609.824017]  [<ffffffff810b0b85>] wake_up_process+0x15/0x20
[  609.824017]  [<ffffffff8109ce62>] hrtimer_wakeup+0x22/0x30
[  609.824017]  [<ffffffff8109da13>] __run_hrtimer+0x83/0x320
[  609.824017]  [<ffffffff8109ce40>] ? update_rmtp+0x80/0x80
[  609.824017]  [<ffffffff8109df56>] hrtimer_interrupt+0x106/0x280
[  609.824017]  [<ffffffff810a72c8>] ? sd_free_ctl_entry+0x68/0x70
[  609.824017]  [<ffffffff8167cf39>] smp_apic_timer_interrupt+0x69/0x99
[  609.824017]  [<ffffffff8167be2f>] apic_timer_interrupt+0x6f/0x80

A task sleeping on an hrtimer last ran on a cpu that has since been offlined.
When it is woken up, it tries to find another cpu to run on and gets a nid of -1.
As a result, cpumask_of_node(-1) returns NULL and causes a kernel panic.

This patch fixes the problem by checking whether the nid is -1.
If the nid is not -1, a cpu on the same node will be picked.
Otherwise, an online cpu on another node will be picked.

Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 kernel/sched/core.c |   28 +++++++++++++++++++---------
 1 files changed, 19 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 257002c..035ee9f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1132,18 +1132,28 @@ EXPORT_SYMBOL_GPL(kick_process);
  */
 static int select_fallback_rq(int cpu, struct task_struct *p)
 {
-	const struct cpumask *nodemask = cpumask_of_node(cpu_to_node(cpu));
+	int nid = cpu_to_node(cpu);
+	const struct cpumask *nodemask = NULL;
 	enum { cpuset, possible, fail } state = cpuset;
 	int dest_cpu;
 
-	/* Look for allowed, online CPU in same node. */
-	for_each_cpu(dest_cpu, nodemask) {
-		if (!cpu_online(dest_cpu))
-			continue;
-		if (!cpu_active(dest_cpu))
-			continue;
-		if (cpumask_test_cpu(dest_cpu, tsk_cpus_allowed(p)))
-			return dest_cpu;
+	/*
+	 * If the node that the cpu is on has been offlined, cpu_to_node()
+	 * will return -1. There is no cpu on the node, and we should
+	 * select the cpu on the other node.
+	 */
+	if (nid != -1) {
+		nodemask = cpumask_of_node(nid);
+
+		/* Look for allowed, online CPU in same node. */
+		for_each_cpu(dest_cpu, nodemask) {
+			if (!cpu_online(dest_cpu))
+				continue;
+			if (!cpu_active(dest_cpu))
+				continue;
+			if (cpumask_test_cpu(dest_cpu, tsk_cpus_allowed(p)))
+				return dest_cpu;
+		}
 	}
 
 	for (;;) {
-- 
1.7.1
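
The guard added in the diff above can be modelled in userspace. The sketch
below rests on several assumptions: struct cpu_state, NR_CPUS_SKETCH, NO_NODE
and pick_fallback_cpu() are hypothetical names, the allowed set is a plain
bitmask, and the kernel's second stage (the cpuset/possible/fail state
machine) is reduced to "first online, allowed cpu anywhere".

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_SKETCH 4	/* hypothetical small system */
#define NO_NODE (-1)		/* node id of a cpu whose node was offlined */

struct cpu_state {
	int node;	/* NO_NODE once the mapping has been cleared */
	bool online;
};

/*
 * Pick a cpu for a task that last ran on 'cpu'.
 * 'allowed' is a bitmask of cpus the task may run on.
 * Returns the chosen cpu, or -1 if none is usable.
 */
static int pick_fallback_cpu(const struct cpu_state cpus[NR_CPUS_SKETCH],
			     int cpu, unsigned int allowed)
{
	int nid, dest;

	assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
	nid = cpus[cpu].node;

	/* Only search the same node if the cpu still belongs to one. */
	if (nid != NO_NODE) {
		for (dest = 0; dest < NR_CPUS_SKETCH; dest++) {
			if (cpus[dest].node != nid || !cpus[dest].online)
				continue;
			if (allowed & (1u << dest))
				return dest;
		}
	}

	/* Otherwise (or if that failed) take any online, allowed cpu. */
	for (dest = 0; dest < NR_CPUS_SKETCH; dest++)
		if (cpus[dest].online && (allowed & (1u << dest)))
			return dest;

	return -1;
}
```

The point of the sketch is the first branch: with a node id of -1 there is
no per-node mask to iterate, so the same-node search has to be skipped
rather than attempted.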

^ permalink raw reply related

* [PATCH Bug fix 3/5] cpu-hotplug, memory-hotplug: try offline the node when hotremoving a cpu
From: Tang Chen @ 2013-01-22 11:45 UTC (permalink / raw)
  To: akpm, rjw, len.brown, mingo, tglx, minchan.kim, rientjes, benh,
	paulus, cl, kosaki.motohiro, isimatu.yasuaki, wujianguo, wency,
	hpa, linfeng, laijs, mgorman, yinghai, glommer, jiang.liu,
	julian.calaby, sfr
  Cc: linux-acpi, Mel Gorman, x86, linux-kernel, linux-mm,
	Peter Zijlstra, linuxppc-dev, Jiang Liu
In-Reply-To: <1358855156-6126-1-git-send-email-tangchen@cn.fujitsu.com>

From: Wen Congyang <wency@cn.fujitsu.com>

The node will be offlined when all memory/cpu on the node is hotremoved.
So we should try to offline the node when hotremoving a cpu on the node.

Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 drivers/acpi/processor_driver.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/acpi/processor_driver.c b/drivers/acpi/processor_driver.c
index a24ee43..81745f1 100644
--- a/drivers/acpi/processor_driver.c
+++ b/drivers/acpi/processor_driver.c
@@ -45,6 +45,7 @@
 #include <linux/cpuidle.h>
 #include <linux/slab.h>
 #include <linux/acpi.h>
+#include <linux/memory_hotplug.h>
 
 #include <asm/io.h>
 #include <asm/cpu.h>
@@ -641,6 +642,7 @@ static int acpi_processor_remove(struct acpi_device *device, int type)
 
 	per_cpu(processors, pr->id) = NULL;
 	per_cpu(processor_device_array, pr->id) = NULL;
+	try_offline_node(cpu_to_node(pr->id));
 
 free:
 	free_cpumask_var(pr->throttling.shared_cpu_map);
-- 
1.7.1

^ permalink raw reply related

* [PATCH Bug fix 1/5] cpu_hotplug: clear apicid to node when the cpu is hotremoved
From: Tang Chen @ 2013-01-22 11:45 UTC (permalink / raw)
  To: akpm, rjw, len.brown, mingo, tglx, minchan.kim, rientjes, benh,
	paulus, cl, kosaki.motohiro, isimatu.yasuaki, wujianguo, wency,
	hpa, linfeng, laijs, mgorman, yinghai, glommer, jiang.liu,
	julian.calaby, sfr
  Cc: linux-acpi, Mel Gorman, x86, linux-kernel, linux-mm,
	Peter Zijlstra, linuxppc-dev, Jiang Liu
In-Reply-To: <1358855156-6126-1-git-send-email-tangchen@cn.fujitsu.com>

From: Wen Congyang <wency@cn.fujitsu.com>

When a cpu is hotplugged, we call acpi_map_cpu2node() in _acpi_map_lsapic()
to store the cpu's node and apicid's node. But we don't clear the cpu's node
in acpi_unmap_lsapic() when this cpu is hotremoved. If the node is also
hotremoved, we will get the following messages:
[ 1646.771485] kernel BUG at include/linux/gfp.h:329!
[ 1646.828729] invalid opcode: 0000 [#1] SMP
[ 1646.877872] Modules linked in: ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat xt_CHECKSUM iptable_mangle bridge stp llc sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel microcode pcspkr i2c_i801 i2c_core lpc_ich mfd_core ioatdma e1000e i7core_edac edac_core sg acpi_memhotplug igb dca sd_mod crc_t10dif megaraid_sas mptsas mptscsih mptbase scsi_transport_sas scsi_mod
[ 1647.588773] Pid: 3126, comm: init Not tainted 3.6.0-rc3-tangchen-hostbridge+ #13 FUJITSU-SV PRIMEQUEST 1800E/SB
[ 1647.711545] RIP: 0010:[<ffffffff811bc3fd>]  [<ffffffff811bc3fd>] allocate_slab+0x28d/0x300
[ 1647.810492] RSP: 0018:ffff88078a049cf8  EFLAGS: 00010246
[ 1647.874028] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
[ 1647.959339] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000000000246
[ 1648.044659] RBP: ffff88078a049d38 R08: 00000000000040d0 R09: 0000000000000001
[ 1648.129953] R10: 0000000000000000 R11: 0000000000000b5f R12: 00000000000052d0
[ 1648.215259] R13: ffff8807c1417300 R14: 0000000000030038 R15: 0000000000000003
[ 1648.300572] FS:  00007fa9b1b44700(0000) GS:ffff8807c3800000(0000) knlGS:0000000000000000
[ 1648.397272] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1648.465985] CR2: 00007fa9b09acca0 CR3: 000000078b855000 CR4: 00000000000007e0
[ 1648.551265] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1648.636565] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1648.721838] Process init (pid: 3126, threadinfo ffff88078a048000, task ffff8807bb6f2650)
[ 1648.818534] Stack:
[ 1648.842548]  ffff8807c39d7fa0 ffffffff000040d0 00000000000000bb 00000000000080d0
[ 1648.931469]  ffff8807c1417300 ffff8807c39d7fa0 ffff8807c1417300 0000000000000001
[ 1649.020410]  ffff88078a049d88 ffffffff811bc4a0 ffff8807c1410c80 0000000000000000
[ 1649.109464] Call Trace:
[ 1649.138713]  [<ffffffff811bc4a0>] new_slab+0x30/0x1b0
[ 1649.199075]  [<ffffffff811bc978>] __slab_alloc+0x358/0x4c0
[ 1649.264683]  [<ffffffff810b71c0>] ? alloc_fair_sched_group+0xd0/0x1b0
[ 1649.341695]  [<ffffffff811be7d4>] kmem_cache_alloc_node_trace+0xb4/0x1e0
[ 1649.421824]  [<ffffffff8109d188>] ? hrtimer_init+0x48/0x100
[ 1649.488414]  [<ffffffff810b71c0>] ? alloc_fair_sched_group+0xd0/0x1b0
[ 1649.565402]  [<ffffffff810b71c0>] alloc_fair_sched_group+0xd0/0x1b0
[ 1649.640297]  [<ffffffff810a8bce>] sched_create_group+0x3e/0x110
[ 1649.711040]  [<ffffffff810bdbcd>] sched_autogroup_create_attach+0x4d/0x180
[ 1649.793260]  [<ffffffff81089614>] sys_setsid+0xd4/0xf0
[ 1649.854694]  [<ffffffff8167a029>] system_call_fastpath+0x16/0x1b
[ 1649.926483] Code: 89 c4 e9 73 fe ff ff 31 c0 89 de 48 c7 c7 45 de 9e 81 44 89 45 c8 e8 22 05 4b 00 85 db 44 8b 45 c8 0f 89 4f ff ff ff 0f 0b eb fe <0f> 0b 90 eb fd 0f 0b eb fe 89 de 48 c7 c7 45 de 9e 81 31 c0 44
[ 1650.161454] RIP  [<ffffffff811bc3fd>] allocate_slab+0x28d/0x300
[ 1650.232348]  RSP <ffff88078a049cf8>
[ 1650.274029] ---[ end trace adf84c90f3fea3e5 ]---

The reason is that the cpu's node is not NUMA_NO_NODE, so we call
alloc_pages_exact_node() to allocate memory on that node, but the node
has already been offlined.

If the node is still online, we still need the cpu's node. For example,
a task on the cpu may be sleeping when the cpu is hotremoved. When the
task is woken up, we choose another cpu to run it, and if we know the
old cpu's node, we prefer a cpu on the same node. So the cpu-to-node
mapping should be cleared when the node is offlined, not when the cpu
is hotremoved.

This patch only clears the apicid-to-node mapping when the cpu is hotremoved.
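
The ordering in the hunk below matters: the apicid has to be read before
the per-cpu value is reset to -1. A minimal userspace sketch of that
ordering follows. It is only a model; the arrays and the model_*() names
are hypothetical stand-ins, not the kernel's data structures or API.

```c
#include <assert.h>

#define NR_CPUS      4
#define MAX_APICID   8
#define NUMA_NO_NODE (-1)
#define BAD_APICID   (-1)

/* Hypothetical stand-in for per_cpu(x86_cpu_to_apicid, cpu). */
static int cpu_to_apicid[NR_CPUS] = {
	BAD_APICID, BAD_APICID, BAD_APICID, BAD_APICID,
};

/* Hypothetical stand-in for the apicid-to-node table. */
static int apicid_to_node[MAX_APICID] = {
	NUMA_NO_NODE, NUMA_NO_NODE, NUMA_NO_NODE, NUMA_NO_NODE,
	NUMA_NO_NODE, NUMA_NO_NODE, NUMA_NO_NODE, NUMA_NO_NODE,
};

/* Record both mappings when a cpu is hot-added. */
static void model_map_cpu(int cpu, int apicid, int node)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	assert(apicid >= 0 && apicid < MAX_APICID);
	cpu_to_apicid[cpu] = apicid;
	apicid_to_node[apicid] = node;
}

/* Clear the apicid-to-node entry first, then forget the apicid. */
static void model_unmap_cpu(int cpu)
{
	int apicid;

	assert(cpu >= 0 && cpu < NR_CPUS);
	apicid = cpu_to_apicid[cpu];
	if (apicid != BAD_APICID)
		apicid_to_node[apicid] = NUMA_NO_NODE;
	cpu_to_apicid[cpu] = BAD_APICID;
}
```

If the two steps in model_unmap_cpu() were swapped, the apicid would
already be lost and the stale node entry could not be found.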

Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/x86/kernel/acpi/boot.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index bacf4b0..7d53833 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -697,6 +697,10 @@ EXPORT_SYMBOL(acpi_map_lsapic);
 
 int acpi_unmap_lsapic(int cpu)
 {
+#ifdef CONFIG_ACPI_NUMA
+	set_apicid_to_node(per_cpu(x86_cpu_to_apicid, cpu), NUMA_NO_NODE);
+#endif
+
 	per_cpu(x86_cpu_to_apicid, cpu) = -1;
 	set_cpu_present(cpu, false);
 	num_processors--;
-- 
1.7.1

^ permalink raw reply related

* [PATCH Bug fix 2/5] memory-hotplug: export the function try_offline_node()
From: Tang Chen @ 2013-01-22 11:45 UTC (permalink / raw)
  To: akpm, rjw, len.brown, mingo, tglx, minchan.kim, rientjes, benh,
	paulus, cl, kosaki.motohiro, isimatu.yasuaki, wujianguo, wency,
	hpa, linfeng, laijs, mgorman, yinghai, glommer, jiang.liu,
	julian.calaby, sfr
  Cc: linux-acpi, Mel Gorman, x86, linux-kernel, linux-mm,
	Peter Zijlstra, linuxppc-dev, Jiang Liu
In-Reply-To: <1358855156-6126-1-git-send-email-tangchen@cn.fujitsu.com>

From: Wen Congyang <wency@cn.fujitsu.com>

The node will be offlined when all memory/cpu on the node
have been hotremoved. So we need the function try_offline_node()
in the cpu-hotplug path.

If memory-hotplug is disabled and cpu-hotplug is enabled:
1. no memory on the node
   we don't online the node, and the cpu's node is the nearest node.
2. the node contains some memory
   the node has been onlined, and the cpu's node is still needed
   to migrate the sleeping task on the cpu to the same node.
So we do nothing in try_offline_node() in this case.
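
The "do nothing" case is the usual config-stub pattern: callers always
call the function, and the disabled configuration supplies an empty
static inline. A minimal userspace sketch, assuming a hypothetical
MODEL_MEMORY_HOTPLUG macro in place of CONFIG_MEMORY_HOTPLUG and
hypothetical model_*() names:

```c
#include <assert.h>

/* Counts how often the real implementation ran (illustration only). */
static int offline_attempts;

#ifdef MODEL_MEMORY_HOTPLUG
/* Enabled: the real function would check the node and offline it. */
static void model_try_offline_node(int nid)
{
	(void)nid;
	offline_attempts++;
}
#else
/* Disabled: an empty stub, so callers need no #ifdef of their own. */
static inline void model_try_offline_node(int nid)
{
	(void)nid;
}
#endif

/* The cpu hot-remove path calls the function unconditionally. */
static void model_cpu_hotremove(int nid)
{
	assert(nid >= 0);
	model_try_offline_node(nid);
}
```

Built without MODEL_MEMORY_HOTPLUG, model_cpu_hotremove() compiles and
runs but never touches offline_attempts.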

Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 include/linux/memory_hotplug.h |    2 ++
 mm/memory_hotplug.c            |    3 ++-
 2 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 69903cc..0b2878e 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -193,6 +193,7 @@ extern void get_page_bootmem(unsigned long ingo, struct page *page,
 
 void lock_memory_hotplug(void);
 void unlock_memory_hotplug(void);
+extern void try_offline_node(int nid);
 
 #else /* ! CONFIG_MEMORY_HOTPLUG */
 /*
@@ -227,6 +228,7 @@ static inline void register_page_bootmem_info_node(struct pglist_data *pgdat)
 
 static inline void lock_memory_hotplug(void) {}
 static inline void unlock_memory_hotplug(void) {}
+static inline void try_offline_node(int nid) {}
 
 #endif /* ! CONFIG_MEMORY_HOTPLUG */
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index f0dc9ad..edd1773 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1703,7 +1703,7 @@ static int check_cpu_on_node(void *data)
 }
 
 /* offline the node if all memory sections of this node are removed */
-static void try_offline_node(int nid)
+void try_offline_node(int nid)
 {
 	pg_data_t *pgdat = NODE_DATA(nid);
 	unsigned long start_pfn = pgdat->node_start_pfn;
@@ -1759,6 +1759,7 @@ static void try_offline_node(int nid)
 	 */
 	memset(pgdat, 0, sizeof(*pgdat));
 }
+EXPORT_SYMBOL(try_offline_node);
 
 int __ref remove_memory(int nid, u64 start, u64 size)
 {
-- 
1.7.1

^ permalink raw reply related

* [PATCH Bug fix 0/5] Bug fix for node offline
From: Tang Chen @ 2013-01-22 11:45 UTC (permalink / raw)
  To: akpm, rjw, len.brown, mingo, tglx, minchan.kim, rientjes, benh,
	paulus, cl, kosaki.motohiro, isimatu.yasuaki, wujianguo, wency,
	hpa, linfeng, laijs, mgorman, yinghai, glommer, jiang.liu,
	julian.calaby, sfr
  Cc: linux-mm, x86, linuxppc-dev, linux-kernel, linux-acpi

Node hot-remove can be implemented on top of the physical memory
hot-remove functionality, but the cpu driver has some problems when
offlining a node. This patch set fixes them.

All these patches are based on the latest -mm tree.
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git akpm

Tang Chen (1):
  Do not use cpu_to_node() to find an offlined cpu's node.

Wen Congyang (4):
  cpu_hotplug: clear apicid to node when the cpu is hotremoved
  memory-hotplug: export the function try_offline_node()
  cpu-hotplug, memory-hotplug: try offline the node when hotremoving a
    cpu
  cpu-hotplug,memory-hotplug: clear cpu_to_node() when offlining the
    node

 arch/x86/kernel/acpi/boot.c     |    4 ++++
 drivers/acpi/processor_driver.c |    2 ++
 include/linux/memory_hotplug.h  |    2 ++
 kernel/sched/core.c             |   28 +++++++++++++++++++---------
 mm/memory_hotplug.c             |   33 +++++++++++++++++++++++++++++++--
 5 files changed, 58 insertions(+), 11 deletions(-)

^ permalink raw reply

* [PATCH Bug fix 5/5] Bug fix: Fix the doc format in drivers/firmware/memmap.c
From: Tang Chen @ 2013-01-22 11:43 UTC (permalink / raw)
  To: akpm, rientjes, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, hpa, linfeng,
	laijs, mgorman, yinghai, glommer, jiang.liu, julian.calaby, sfr
  Cc: linux-mm, x86, linuxppc-dev, linux-kernel, linux-acpi
In-Reply-To: <1358854984-6073-1-git-send-email-tangchen@cn.fujitsu.com>

Make the comments in drivers/firmware/memmap.c kernel-doc compliant.
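
For reference, a minimal sketch of the kernel-doc layout the comments
are moved to: "name() - short description", one "@param:" line per
argument, and a "Return:" section. The struct and the find_range()
helper are hypothetical and exist only to carry the comment. Note that
the scripts/kernel-doc tool only processes comments that open with
"/**", so the sketch uses that opener.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical range type, used only for this illustration. */
struct range {
	unsigned long long start;
	unsigned long long end;
};

/**
 * find_range() - Search for a range in an array.
 * @ranges: Array to search.
 * @n:      Number of entries in @ranges.
 * @start:  Start of the memory range.
 * @end:    End of the memory range (exclusive).
 *
 * Return: Pointer to the matching entry on success, or NULL on failure.
 */
static const struct range *find_range(const struct range *ranges, size_t n,
				      unsigned long long start,
				      unsigned long long end)
{
	size_t i;

	assert(ranges != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (ranges[i].start == start && ranges[i].end == end)
			return &ranges[i];
	return NULL;
}
```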

Reported-by: Julian Calaby <julian.calaby@gmail.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 drivers/firmware/memmap.c |   12 ++++++------
 1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/firmware/memmap.c b/drivers/firmware/memmap.c
index 658fdd4..0b5b5f6 100644
--- a/drivers/firmware/memmap.c
+++ b/drivers/firmware/memmap.c
@@ -209,7 +209,7 @@ static inline void remove_sysfs_fw_map_entry(struct firmware_map_entry *entry)
 }
 
 /*
- * firmware_map_find_entry_in_list: Search memmap entry in a given list.
+ * firmware_map_find_entry_in_list() - Search memmap entry in a given list.
  * @start: Start of the memory range.
  * @end:   End of the memory range (exclusive).
  * @type:  Type of the memory range.
@@ -219,7 +219,7 @@ static inline void remove_sysfs_fw_map_entry(struct firmware_map_entry *entry)
  * given list. The caller must hold map_entries_lock, and must not release
  * the lock until the processing of the returned entry has completed.
  *
- * Return pointer to the entry to be found on success, or NULL on failure.
+ * Return: Pointer to the entry to be found on success, or NULL on failure.
  */
 static struct firmware_map_entry * __meminit
 firmware_map_find_entry_in_list(u64 start, u64 end, const char *type,
@@ -237,7 +237,7 @@ firmware_map_find_entry_in_list(u64 start, u64 end, const char *type,
 }
 
 /*
- * firmware_map_find_entry: Search memmap entry in map_entries.
+ * firmware_map_find_entry() - Search memmap entry in map_entries.
  * @start: Start of the memory range.
  * @end:   End of the memory range (exclusive).
  * @type:  Type of the memory range.
@@ -246,7 +246,7 @@ firmware_map_find_entry_in_list(u64 start, u64 end, const char *type,
  * The caller must hold map_entries_lock, and must not release the lock
  * until the processing of the returned entry has completed.
  *
- * Return pointer to the entry to be found on success, or NULL on failure.
+ * Return: Pointer to the entry to be found on success, or NULL on failure.
  */
 static struct firmware_map_entry * __meminit
 firmware_map_find_entry(u64 start, u64 end, const char *type)
@@ -255,7 +255,7 @@ firmware_map_find_entry(u64 start, u64 end, const char *type)
 }
 
 /*
- * firmware_map_find_entry_bootmem: Search memmap entry in map_entries_bootmem.
+ * firmware_map_find_entry_bootmem() - Search memmap entry in map_entries_bootmem.
  * @start: Start of the memory range.
  * @end:   End of the memory range (exclusive).
  * @type:  Type of the memory range.
@@ -263,7 +263,7 @@ firmware_map_find_entry(u64 start, u64 end, const char *type)
  * This function is similar to firmware_map_find_entry except that it find the
  * given entry in map_entries_bootmem.
  *
- * Return pointer to the entry to be found on success, or NULL on failure.
+ * Return: Pointer to the entry to be found on success, or NULL on failure.
  */
 static struct firmware_map_entry * __meminit
 firmware_map_find_entry_bootmem(u64 start, u64 end, const char *type)
-- 
1.7.1

^ permalink raw reply related

* [PATCH Bug fix 4/5] Bug fix: Fix section mismatch problem of release_firmware_map_entry().
From: Tang Chen @ 2013-01-22 11:43 UTC (permalink / raw)
  To: akpm, rientjes, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, hpa, linfeng,
	laijs, mgorman, yinghai, glommer, jiang.liu, julian.calaby, sfr
  Cc: linux-mm, x86, linuxppc-dev, linux-kernel, linux-acpi
In-Reply-To: <1358854984-6073-1-git-send-email-tangchen@cn.fujitsu.com>

The function release_firmware_map_entry() references the __meminit
function firmware_map_find_entry_in_list(), so it should also be marked
__meminit.

And since firmware_map_entry->kobj is initialized with memmap_ktype,
which points to that function, memmap_ktype should also be annotated
with __refdata.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 drivers/firmware/memmap.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/firmware/memmap.c b/drivers/firmware/memmap.c
index 0710179..658fdd4 100644
--- a/drivers/firmware/memmap.c
+++ b/drivers/firmware/memmap.c
@@ -103,7 +103,7 @@ to_memmap_entry(struct kobject *kobj)
 	return container_of(kobj, struct firmware_map_entry, kobj);
 }
 
-static void release_firmware_map_entry(struct kobject *kobj)
+static void __meminit release_firmware_map_entry(struct kobject *kobj)
 {
 	struct firmware_map_entry *entry = to_memmap_entry(kobj);
 
@@ -127,7 +127,7 @@ static void release_firmware_map_entry(struct kobject *kobj)
 	kfree(entry);
 }
 
-static struct kobj_type memmap_ktype = {
+static struct kobj_type __refdata memmap_ktype = {
 	.release	= release_firmware_map_entry,
 	.sysfs_ops	= &memmap_attr_ops,
 	.default_attrs	= def_attrs,
-- 
1.7.1

^ permalink raw reply related
