LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [GIT PULL] tree-wide: clean up no longer required #include <linux/init.h>
From: Paul Gortmaker @ 2014-02-05 14:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-arch, Stephen Rothwell, linux-m68k, rusty, linux-ia64, kvm,
	linux-mips, linuxppc-dev, x86, linux-alpha, netdev, gregkh,
	sparclinux, akpm, Linus Torvalds, linux-arm-kernel, linux-s390
In-Reply-To: <20140205064150.GA31568@gmail.com>

[Re: [GIT PULL] tree-wide: clean up no longer required #include <linux/init.h>] On 05/02/2014 (Wed 07:41) Ingo Molnar wrote:

> 
> * Stephen Rothwell <sfr@canb.auug.org.au> wrote:
> 
> > Hi Ingo,
> > 
> > On Wed, 5 Feb 2014 07:06:33 +0100 Ingo Molnar <mingo@kernel.org> wrote:
> > > 
> > > So, if you meant Linus to pull it, you probably want to cite a real 
> > > Git URI along the lines of:
> > > 
> > >    git://git.kernel.org/pub/scm/linux/kernel/git/paulg/init.git
> > 
> > Paul provided the proper git url further down in the mail along with the
> > usual pull request message (I guess he should have put that bit at the
> > top).
> 
> Yeah, indeed, and it even comes with a signed tag, which is an extra 
> nice touch:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux.git tags/init-cleanup
> 
> (I guess the https was mentioned first to lower expectations.)

Just to clarify, the init.git was the repo of raw commits+series file
that was used for testing on linux next; now useless, except for showing
the last several weeks of history (hence the visual http link).  The
signed tag [separate repo] is the application of those commits against
the 3.14-rc1 tag, which was the end goal from the beginning.

Does history matter?  In the case of a cleanup like this, it does only
in the immediate context of this pull request; to help distinguish this
work from some short lived half baked idea that also had its testing
invalidated by arbitrarily rebasing onto the latest shiny tag.

I wouldn't have even mentioned the patch repo, except for the fact that
I know how Linus loves arbitrary rebases [and malformed pull requests]  :)

Thanks,
Paul.
--

> 
> Thanks,
> 
> 	Ingo

^ permalink raw reply

* [PATCH v2] powerpc/eeh: drop taken reference to driver on eeh_rmv_device
From: Thadeu Lima de Souza Cascardo @ 2014-02-05 18:20 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: paulus, shangw, Thadeu Lima de Souza Cascardo

Commit f5c57710dd62dd06f176934a8b4b8accbf00f9f8 ("powerpc/eeh: Use
partial hotplug for EEH unaware drivers") introduces eeh_rmv_device,
which may grab a reference to a driver, but not release it.

That prevents a driver from being removed after it has gone through EEH
recovery.

This patch drops the reference if it was taken.

Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Acked-by: Gavin Shan <shangw@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/eeh_driver.c |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index 7bb30dc..fdc679d 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -362,9 +362,13 @@ static void *eeh_rmv_device(void *data, void *userdata)
 	 */
 	if (!dev || (dev->hdr_type & PCI_HEADER_TYPE_BRIDGE))
 		return NULL;
+
 	driver = eeh_pcid_get(dev);
-	if (driver && driver->err_handler)
-		return NULL;
+	if (driver) {
+		eeh_pcid_put(dev);
+		if (driver->err_handler)
+			return NULL;
+	}
 
 	/* Remove it from PCI subsystem */
 	pr_debug("EEH: Removing %s without EEH sensitive driver\n",
-- 
1.7.1

^ permalink raw reply related

* Re: [PATCH v2] powerpc/eeh: drop taken reference to driver on eeh_rmv_device
From: Nishanth Aravamudan @ 2014-02-05 18:43 UTC (permalink / raw)
  To: Thadeu Lima de Souza Cascardo; +Cc: paulus, linuxppc-dev, shangw
In-Reply-To: <1391624445-9095-1-git-send-email-cascardo@linux.vnet.ibm.com>

On 05.02.2014 [16:20:45 -0200], Thadeu Lima de Souza Cascardo wrote:
> Commit f5c57710dd62dd06f176934a8b4b8accbf00f9f8 ("powerpc/eeh: Use
> partial hotplug for EEH unaware drivers") introduces eeh_rmv_device,
> which may grab a reference to a driver, but not release it.
> 
> That prevents a driver from being removed after it has gone through EEH
> recovery.
> 
> This patch drops the reference if it was taken.
> 
> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
> Acked-by: Gavin Shan <shangw@linux.vnet.ibm.com>
> ---
>  arch/powerpc/kernel/eeh_driver.c |    8 ++++++--
>  1 files changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
> index 7bb30dc..fdc679d 100644
> --- a/arch/powerpc/kernel/eeh_driver.c
> +++ b/arch/powerpc/kernel/eeh_driver.c
> @@ -362,9 +362,13 @@ static void *eeh_rmv_device(void *data, void *userdata)
>  	 */
>  	if (!dev || (dev->hdr_type & PCI_HEADER_TYPE_BRIDGE))
>  		return NULL;
> +

This appears to be unnecessary whitespace change?

-Nish

>  	driver = eeh_pcid_get(dev);
> -	if (driver && driver->err_handler)
> -		return NULL;
> +	if (driver) {
> +		eeh_pcid_put(dev);
> +		if (driver->err_handler)
> +			return NULL;
> +	}
> 
>  	/* Remove it from PCI subsystem */
>  	pr_debug("EEH: Removing %s without EEH sensitive driver\n",
> -- 
> 1.7.1
> 
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
> 

^ permalink raw reply

* Re: [PATCH v2] powerpc/eeh: drop taken reference to driver on eeh_rmv_device
From: Thadeu Lima de Souza Cascardo @ 2014-02-05 19:08 UTC (permalink / raw)
  To: Nishanth Aravamudan; +Cc: linuxppc-dev, paulus, shangw
In-Reply-To: <20140205184338.GA5433@linux.vnet.ibm.com>

On Wed, Feb 05, 2014 at 10:43:38AM -0800, Nishanth Aravamudan wrote:
> On 05.02.2014 [16:20:45 -0200], Thadeu Lima de Souza Cascardo wrote:
> > Commit f5c57710dd62dd06f176934a8b4b8accbf00f9f8 ("powerpc/eeh: Use
> > partial hotplug for EEH unaware drivers") introduces eeh_rmv_device,
> > which may grab a reference to a driver, but not release it.
> > 
> > That prevents a driver from being removed after it has gone through EEH
> > recovery.
> > 
> > This patch drops the reference if it was taken.
> > 
> > Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
> > Acked-by: Gavin Shan <shangw@linux.vnet.ibm.com>
> > ---
> >  arch/powerpc/kernel/eeh_driver.c |    8 ++++++--
> >  1 files changed, 6 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
> > index 7bb30dc..fdc679d 100644
> > --- a/arch/powerpc/kernel/eeh_driver.c
> > +++ b/arch/powerpc/kernel/eeh_driver.c
> > @@ -362,9 +362,13 @@ static void *eeh_rmv_device(void *data, void *userdata)
> >  	 */
> >  	if (!dev || (dev->hdr_type & PCI_HEADER_TYPE_BRIDGE))
> >  		return NULL;
> > +
> 
> This appears to be unnecessary whitespace change?
> 
> -Nish
> 

Hi, Nish.

I originally add it there for readability, giving both more evidence to
the code below, where a driver reference is get and put, and to the code
above and its respective comment. Leaving those together could give the
impression the comment also applies to the code below.

But I have no strong feelings about that.

Ben, do you want me to send a new version?

Regards.
Cascardo.

> >  	driver = eeh_pcid_get(dev);
> > -	if (driver && driver->err_handler)
> > -		return NULL;
> > +	if (driver) {
> > +		eeh_pcid_put(dev);
> > +		if (driver->err_handler)
> > +			return NULL;
> > +	}
> > 
> >  	/* Remove it from PCI subsystem */
> >  	pr_debug("EEH: Removing %s without EEH sensitive driver\n",
> > -- 
> > 1.7.1
> > 
> > _______________________________________________
> > Linuxppc-dev mailing list
> > Linuxppc-dev@lists.ozlabs.org
> > https://lists.ozlabs.org/listinfo/linuxppc-dev
> > 
> 
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
> 

^ permalink raw reply

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
From: Christoph Lameter @ 2014-02-05 19:28 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Han Pingtian, David Rientjes, penberg, linux-mm, paulus,
	Anton Blanchard, mpm, Joonsoo Kim, linuxppc-dev, Wanpeng Li
In-Reply-To: <20140205001352.GC10101@linux.vnet.ibm.com>

On Tue, 4 Feb 2014, Nishanth Aravamudan wrote:

> > If the target node allocation fails (for whatever reason) then I would
> > recommend for simplicities sake to change the target node to
> > NUMA_NO_NODE and just take whatever is in the current cpu slab. A more
> > complex solution would be to look through partial lists in increasing
> > distance to find a partially used slab that is reasonable close to the
> > current node. Slab has logic like that in fallback_alloc(). Slubs
> > get_any_partial() function does something close to what you want.
>
> I apologize for my own ignorance, but I'm having trouble following.
> Anton's original patch did fallback to the current cpu slab, but I'm not
> sure any NUMA_NO_NODE change is necessary there. At the point we're
> deactivating the slab (in the current code, in __slab_alloc()), we have
> successfully allocated from somewhere, it's just not on the node we
> expected to be on.

Right so if we are ignoring the node then the simplest thing to do is to
not deactivate the current cpu slab but to take an object from it.

> So perhaps you are saying to make a change lower in the code? I'm not
> sure where it makes sense to change the target node in that case. I'd
> appreciate any guidance you can give.

This not an easy thing to do. If the current slab is not the right node
but would be the node from which the page allocator would be returning
memory then the current slab can still be allocated from. If the fallback
is to another node then the current cpu slab needs to be deactivated and
the allocation from that node needs to proceeed. Have a look at
fallback_alloc() in the slab allocator.

A allocation attempt from the page allocator can be restricted to a
specific node through GFP_THIS_NODE.

^ permalink raw reply

* Re: [PATCH 2/2] clocksource: Make clocksource register functions void
From: Thomas Gleixner @ 2014-02-05 20:39 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: linux-mips, Kevin Hilman, linux, Sekhar Nori, Paul Mackerras,
	H. Peter Anvin, Yijing Wang, Daniel Walker,
	Hans-Christian Egtvedt, Jonas Bonn, Kukjin Kim, Russell King,
	Richard Weinberger, x86, Tony Lindgren, Ingo Molnar,
	linux-arm-msm, David Brown, Haavard Skinnemoen, Mike Frysinger,
	user-mode-linux-devel, Nicolas Ferre, Jeff Dike, Barry Song,
	linux-samsung-soc, John Stultz, user-mode-linux-user, linux-omap,
	linux-arm-kernel, davinci-linux-open-source, Michal Simek,
	Jim Cromie, microblaze-uclinux, Hanjun Guo, linux-kernel,
	Ralf Baechle, Tony Prisk, Bryan Huntsman, uclinux-dist-devel,
	linuxppc-dev
In-Reply-To: <52E0D575.5050702@linaro.org>

On Thu, 23 Jan 2014, Daniel Lezcano wrote:
> On 01/23/2014 08:12 AM, Yijing Wang wrote:
> > Currently, clocksource_register() and __clocksource_register_scale()
> > functions always return 0, it's pointless, make functions void.
> > And remove the dead code that check the clocksource_register_hz()
> > return value.
> > 
> > Signed-off-by: Yijing Wang <wangyijing@huawei.com>
> 
> Well, do we really want to change all these files to not take care of a return
> value ? What about is we have to check it again later ?
> 
> I would recommend to investigate __clocksource_register_scale and the
> underneath functions if there is not an error to be returned in the call stack
> somewhere which is ignored today.
> 
> The same applies for clocksource_register.

There is really no point in making it fail. It's so low level that
anything more than a proper printk/BUG/WARN is overkill.

Thanks,

	tglx

^ permalink raw reply

* RE: [PATCH 2/2] clocksource: Make clocksource register functions void
From: Thomas Gleixner @ 2014-02-05 20:40 UTC (permalink / raw)
  To: David Laight
  Cc: linux-mips@linux-mips.org, x86@kernel.org, Kevin Hilman,
	linux@lists.openrisc.net, Sekhar Nori, Michal Simek,
	Paul Mackerras, Ralf Baechle, H. Peter Anvin, Yijing Wang,
	Daniel Walker, Hans-Christian Egtvedt, Jonas Bonn, Kukjin Kim,
	Russell King, Richard Weinberger, Daniel Lezcano, Tony Lindgren,
	Ingo Molnar, microblaze-uclinux@itee.uq.edu.au, David Brown,
	Haavard Skinnemoen, Mike Frysinger,
	user-mode-linux-devel@lists.sourceforge.net,
	linux-arm-msm@vger.kernel.org, Jeff Dike,
	davinci-linux-open-source@linux.davincidsp.com,
	linux-samsung-soc@vger.kernel.org, John Stultz,
	user-mode-linux-user@lists.sourceforge.net,
	linux-omap@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	Barry Song, Jim Cromie, Hanjun Guo, linux-kernel@vger.kernel.org,
	Nicolas Ferre, 'Tony Prisk', Bryan Huntsman,
	uclinux-dist-devel@blackfin.uclinux.org,
	linuxppc-dev@lists.ozlabs.org
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6D46489C@AcuExch.aculab.com>

Yijing,

On Thu, 23 Jan 2014, David Laight wrote:

> From: Linuxppc-dev Tony Prisk
> > On 23/01/14 20:12, Yijing Wang wrote:
> > > Currently, clocksource_register() and __clocksource_register_scale()
> > > functions always return 0, it's pointless, make functions void.
> > > And remove the dead code that check the clocksource_register_hz()
> > > return value.
> > ......
> > > -static inline int clocksource_register_hz(struct clocksource *cs, u32 hz)
> > > +static inline void clocksource_register_hz(struct clocksource *cs, u32 hz)
> > >   {
> > >   	return __clocksource_register_scale(cs, 1, hz);
> > >   }
> > 
> > This doesn't make sense - you are still returning a value on a function
> > declared void, and the return is now from a function that doesn't return
> > anything either ?!?!
> > Doesn't this throw a compile-time warning??
> 
> It depends on the compiler.
> Recent gcc allow it.
> I don't know if it is actually valid C though.
> 
> There is no excuse for it on lines like the above though.

Can you please resend with that fixed against 3.14-rc1 ?

Thanks,

	tglx

^ permalink raw reply

* [PATCH 13/51] powerpc, sysfs: Fix CPU hotplug callback registration
From: Srivatsa S. Bhat @ 2014-02-05 22:06 UTC (permalink / raw)
  To: paulus, oleg, rusty, peterz, tglx, akpm
  Cc: ego, walken, linux, linux-kernel, Wang Dongsheng, Olof Johansson,
	Madhavan Srinivasan, Paul Mackerras, Srivatsa S. Bhat, tj,
	paulmck, linuxppc-dev, mingo
In-Reply-To: <20140205220251.19080.92336.stgit@srivatsabhat.in.ibm.com>

Subsystems that want to register CPU hotplug callbacks, as well as perform
initialization for the CPUs that are already online, often do it as shown
below:

	get_online_cpus();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	register_cpu_notifier(&foobar_cpu_notifier);

	put_online_cpus();

This is wrong, since it is prone to ABBA deadlocks involving the
cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
with CPU hotplug operations).

Instead, the correct and race-free way of performing the callback
registration is:

	cpu_maps_update_begin();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	/* Note the use of the double underscored version of the API */
	__register_cpu_notifier(&foobar_cpu_notifier);

	cpu_maps_update_done();


Fix the sysfs code in powerpc by using this latter form of callback
registration.

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Cc: Olof Johansson <olof@lixom.net>
Cc: Wang Dongsheng <dongsheng.wang@freescale.com>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 arch/powerpc/kernel/sysfs.c |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index 97e1dc9..c29ad44 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -975,7 +975,8 @@ static int __init topology_init(void)
 	int cpu;
 
 	register_nodes();
-	register_cpu_notifier(&sysfs_cpu_nb);
+
+	cpu_maps_update_begin();
 
 	for_each_possible_cpu(cpu) {
 		struct cpu *c = &per_cpu(cpu_devices, cpu);
@@ -999,6 +1000,11 @@ static int __init topology_init(void)
 		if (cpu_online(cpu))
 			register_cpu_online(cpu);
 	}
+
+	__register_cpu_notifier(&sysfs_cpu_nb);
+
+	cpu_maps_update_done();
+
 #ifdef CONFIG_PPC64
 	sysfs_create_dscr_default();
 #endif /* CONFIG_PPC64 */

^ permalink raw reply related

* Re: [PATCH 4/8] powerpc: add hv_gpci interface header
From: Cody P Schafer @ 2014-02-05 23:14 UTC (permalink / raw)
  To: Michael Ellerman, Linux PPC
  Cc: Ingo Molnar, Paul Mackerras, Peter Zijlstra,
	Arnaldo Carvalho de Melo, LKML
In-Reply-To: <20140201055807.267F12C00BA@ozlabs.org>


>> diff --git a/arch/powerpc/include/asm/hv_gpci.h b/arch/powerpc/include/asm/hv_gpci.h
>
> Any reason this can't just live in arch/powerpc/perf ?
>

Nope, it should be able to keep the header there for now. As this 
interface allows determination of the HW topology, we may have some code 
that exposes that (in sysfs) at some point, which doesn't really belong 
in arch/powerpc/perf (though we could just put it there anyhow).

^ permalink raw reply

* [PATCH-next] powerpc: delete old PrPMC 280/2800 support
From: Paul Gortmaker @ 2014-02-06  1:35 UTC (permalink / raw)
  To: linux-kernel; +Cc: Paul Gortmaker, Paul Mackerras, linuxppc-dev

This processor/memory module was mostly used on ATCA blades and
before that, on cPCI blades.  It wasn't really user friendly, with
custom non u-boot bootloaders (powerboot/motload) and no real way
to recover corrupted boot flash (which was a common problem).

As such, it had its day back before the big ppc --> powerpc move
to device trees, and that was largely through commercial BSPs that
started to dry up around 2007.

Systems using one were largely in a "deploy and sustain" mode,
so interest in upgrading to new kernels in the field was nil.
Also, requiring 50A, 48V power supplies and a 2'x2'x2' ATCA
chassis largely rules out any hobbyist/enthusiast interest.

The point of all this, is that we might as well delete the in
kernel files relating to this platform.  No point in continuing
to build it via walking the defconfigs or via linux-next testing.

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---

[The above description is not idle speculation based on 2nd hand
info -- I personally suffered with these platforms, and when faced
with a chance to take some for free vs. letting them go to e-waste,
I happily chose the latter option.   So lets get this out early in
the 3.15 cycle so any objections have a full window to appear.  ]

 arch/powerpc/boot/Makefile                     |   5 +-
 arch/powerpc/configs/prpmc2800_defconfig       | 108 -----------------
 arch/powerpc/platforms/embedded6xx/Kconfig     |   8 --
 arch/powerpc/platforms/embedded6xx/Makefile    |   1 -
 arch/powerpc/platforms/embedded6xx/prpmc2800.c | 156 -------------------------
 5 files changed, 2 insertions(+), 276 deletions(-)
 delete mode 100644 arch/powerpc/configs/prpmc2800_defconfig
 delete mode 100644 arch/powerpc/platforms/embedded6xx/prpmc2800.c

diff --git a/arch/powerpc/boot/Makefile b/arch/powerpc/boot/Makefile
index 90e9d9548660..a1f8c7f1ec60 100644
--- a/arch/powerpc/boot/Makefile
+++ b/arch/powerpc/boot/Makefile
@@ -54,7 +54,7 @@ zlib       := inffast.c inflate.c inftrees.c
 zlibheader := inffast.h inffixed.h inflate.h inftrees.h infutil.h
 zliblinuxheader := zlib.h zconf.h zutil.h
 
-$(addprefix $(obj)/,$(zlib) cuboot-c2k.o gunzip_util.o main.o prpmc2800.o): \
+$(addprefix $(obj)/,$(zlib) cuboot-c2k.o gunzip_util.o main.o): \
 	$(addprefix $(obj)/,$(zliblinuxheader)) $(addprefix $(obj)/,$(zlibheader))
 
 libfdt       := fdt.c fdt_ro.c fdt_wip.c fdt_sw.c fdt_rw.c fdt_strerror.c
@@ -95,7 +95,7 @@ src-plat-$(CONFIG_FSL_SOC_BOOKE) += cuboot-85xx.c cuboot-85xx-cpm2.c
 src-plat-$(CONFIG_EMBEDDED6xx) += cuboot-pq2.c cuboot-mpc7448hpc2.c \
 					cuboot-c2k.c gamecube-head.S \
 					gamecube.c wii-head.S wii.c holly.c \
-					prpmc2800.c fixed-head.S mvme5100.c
+					fixed-head.S mvme5100.c
 src-plat-$(CONFIG_AMIGAONE) += cuboot-amigaone.c
 src-plat-$(CONFIG_PPC_PS3) += ps3-head.S ps3-hvcall.S ps3.c
 src-plat-$(CONFIG_EPAPR_BOOT) += epapr.c epapr-wrapper.c
@@ -204,7 +204,6 @@ image-$(CONFIG_PPC_CHRP)		+= zImage.chrp
 image-$(CONFIG_PPC_EFIKA)		+= zImage.chrp
 image-$(CONFIG_PPC_PMAC)		+= zImage.pmac
 image-$(CONFIG_PPC_HOLLY)		+= dtbImage.holly
-image-$(CONFIG_PPC_PRPMC2800)		+= dtbImage.prpmc2800
 image-$(CONFIG_DEFAULT_UIMAGE)		+= uImage
 image-$(CONFIG_EPAPR_BOOT)		+= zImage.epapr
 
diff --git a/arch/powerpc/configs/prpmc2800_defconfig b/arch/powerpc/configs/prpmc2800_defconfig
deleted file mode 100644
index cd80fb615d34..000000000000
--- a/arch/powerpc/configs/prpmc2800_defconfig
+++ /dev/null
@@ -1,108 +0,0 @@
-CONFIG_ALTIVEC=y
-CONFIG_EXPERIMENTAL=y
-CONFIG_SYSVIPC=y
-CONFIG_POSIX_MQUEUE=y
-CONFIG_LOG_BUF_SHIFT=14
-CONFIG_BLK_DEV_INITRD=y
-# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
-# CONFIG_BLK_DEV_BSG is not set
-# CONFIG_IOSCHED_DEADLINE is not set
-# CONFIG_IOSCHED_CFQ is not set
-# CONFIG_PPC_CHRP is not set
-# CONFIG_PPC_PMAC is not set
-CONFIG_EMBEDDED6xx=y
-CONFIG_PPC_PRPMC2800=y
-CONFIG_HIGHMEM=y
-CONFIG_NO_HZ=y
-CONFIG_HIGH_RES_TIMERS=y
-CONFIG_BINFMT_MISC=y
-CONFIG_SPARSE_IRQ=y
-# CONFIG_SECCOMP is not set
-CONFIG_NET=y
-CONFIG_PACKET=y
-CONFIG_UNIX=y
-CONFIG_XFRM_USER=y
-CONFIG_INET=y
-CONFIG_IP_MULTICAST=y
-CONFIG_IP_PNP=y
-CONFIG_IP_PNP_DHCP=y
-CONFIG_IP_PNP_BOOTP=y
-CONFIG_SYN_COOKIES=y
-# CONFIG_IPV6 is not set
-CONFIG_UEVENT_HELPER_PATH="/sbin/hotplug"
-CONFIG_MTD=y
-CONFIG_MTD_CONCAT=y
-CONFIG_MTD_PARTITIONS=y
-CONFIG_MTD_CHAR=y
-CONFIG_MTD_BLOCK=y
-CONFIG_MTD_CFI=y
-CONFIG_MTD_JEDECPROBE=y
-CONFIG_MTD_CFI_INTELEXT=y
-CONFIG_MTD_PHYSMAP_OF=y
-CONFIG_PROC_DEVICETREE=y
-CONFIG_BLK_DEV_LOOP=y
-CONFIG_BLK_DEV_RAM=y
-CONFIG_BLK_DEV_RAM_SIZE=131072
-CONFIG_IDE=y
-CONFIG_BLK_DEV_GENERIC=y
-CONFIG_BLK_DEV_PDC202XX_NEW=y
-CONFIG_BLK_DEV_SD=y
-CONFIG_ATA=y
-CONFIG_SATA_MV=y
-CONFIG_MACINTOSH_DRIVERS=y
-CONFIG_NETDEVICES=y
-CONFIG_NET_ETHERNET=y
-CONFIG_NET_PCI=y
-CONFIG_E100=y
-CONFIG_8139TOO=y
-# CONFIG_8139TOO_PIO is not set
-CONFIG_E1000=y
-CONFIG_MV643XX_ETH=y
-# CONFIG_INPUT_KEYBOARD is not set
-# CONFIG_INPUT_MOUSE is not set
-# CONFIG_SERIO is not set
-CONFIG_SERIAL_MPSC=y
-CONFIG_SERIAL_MPSC_CONSOLE=y
-# CONFIG_HW_RANDOM is not set
-CONFIG_I2C=y
-CONFIG_I2C_CHARDEV=y
-CONFIG_I2C_MV64XXX=y
-CONFIG_VIDEO_OUTPUT_CONTROL=y
-CONFIG_HID_DRAGONRISE=y
-CONFIG_HID_GYRATION=y
-CONFIG_HID_TWINHAN=y
-CONFIG_HID_NTRIG=y
-CONFIG_HID_ORTEK=y
-CONFIG_HID_PANTHERLORD=y
-CONFIG_HID_PETALYNX=y
-CONFIG_HID_SAMSUNG=y
-CONFIG_HID_SONY=y
-CONFIG_HID_SUNPLUS=y
-CONFIG_HID_GREENASIA=y
-CONFIG_HID_SMARTJOYPLUS=y
-CONFIG_HID_TOPSEED=y
-CONFIG_HID_THRUSTMASTER=y
-CONFIG_THRUSTMASTER_FF=y
-CONFIG_HID_ZEROPLUS=y
-CONFIG_ZEROPLUS_FF=y
-CONFIG_USB=y
-CONFIG_USB_DEVICEFS=y
-# CONFIG_USB_DEVICE_CLASS is not set
-CONFIG_USB_MON=y
-CONFIG_USB_EHCI_HCD=y
-CONFIG_USB_OHCI_HCD=y
-CONFIG_RTC_CLASS=y
-CONFIG_RTC_DRV_MAX6900=y
-CONFIG_EXT2_FS=y
-CONFIG_EXT3_FS=y
-# CONFIG_EXT3_DEFAULTS_TO_ORDERED is not set
-CONFIG_INOTIFY=y
-CONFIG_PROC_KCORE=y
-CONFIG_TMPFS=y
-CONFIG_NFS_FS=y
-CONFIG_ROOT_NFS=y
-CONFIG_PARTITION_ADVANCED=y
-CONFIG_CRC_T10DIF=y
-# CONFIG_RCU_CPU_STALL_DETECTOR is not set
-CONFIG_SYSCTL_SYSCALL_CHECK=y
-# CONFIG_CRYPTO_ANSI_CPRNG is not set
diff --git a/arch/powerpc/platforms/embedded6xx/Kconfig b/arch/powerpc/platforms/embedded6xx/Kconfig
index 6d3c7a9fd047..3fdc8bc6258f 100644
--- a/arch/powerpc/platforms/embedded6xx/Kconfig
+++ b/arch/powerpc/platforms/embedded6xx/Kconfig
@@ -49,14 +49,6 @@ config PPC_HOLLY
 	  Select PPC_HOLLY if configuring for an IBM 750GX/CL Eval
 	  Board with TSI108/9 bridge (Hickory/Holly)
 
-config PPC_PRPMC2800
-	bool "Motorola-PrPMC2800"
-	depends on EMBEDDED6xx
-	select MV64X60
-	select NOT_COHERENT_CACHE
-	help
-	  This option enables support for the Motorola PrPMC2800 board
-
 config PPC_C2K
 	bool "SBS/GEFanuc C2K board"
 	depends on EMBEDDED6xx
diff --git a/arch/powerpc/platforms/embedded6xx/Makefile b/arch/powerpc/platforms/embedded6xx/Makefile
index cdd48d402b93..f126a2a09981 100644
--- a/arch/powerpc/platforms/embedded6xx/Makefile
+++ b/arch/powerpc/platforms/embedded6xx/Makefile
@@ -5,7 +5,6 @@ obj-$(CONFIG_MPC7448HPC2)	+= mpc7448_hpc2.o
 obj-$(CONFIG_LINKSTATION)	+= linkstation.o ls_uart.o
 obj-$(CONFIG_STORCENTER)	+= storcenter.o
 obj-$(CONFIG_PPC_HOLLY)		+= holly.o
-obj-$(CONFIG_PPC_PRPMC2800)	+= prpmc2800.o
 obj-$(CONFIG_PPC_C2K)		+= c2k.o
 obj-$(CONFIG_USBGECKO_UDBG)	+= usbgecko_udbg.o
 obj-$(CONFIG_GAMECUBE_COMMON)	+= flipper-pic.o
diff --git a/arch/powerpc/platforms/embedded6xx/prpmc2800.c b/arch/powerpc/platforms/embedded6xx/prpmc2800.c
deleted file mode 100644
index d455f08bea53..000000000000
--- a/arch/powerpc/platforms/embedded6xx/prpmc2800.c
+++ /dev/null
@@ -1,156 +0,0 @@
-/*
- * Board setup routines for the Motorola PrPMC2800
- *
- * Author: Dale Farnsworth <dale@farnsworth.org>
- *
- * 2007 (c) MontaVista, Software, Inc.  This file is licensed under
- * the terms of the GNU General Public License version 2.  This program
- * is licensed "as is" without any warranty of any kind, whether express
- * or implied.
- */
-
-#include <linux/stddef.h>
-#include <linux/kernel.h>
-#include <linux/delay.h>
-#include <linux/interrupt.h>
-#include <linux/seq_file.h>
-
-#include <asm/machdep.h>
-#include <asm/prom.h>
-#include <asm/time.h>
-
-#include <mm/mmu_decl.h>
-
-#include <sysdev/mv64x60.h>
-
-#define MV64x60_MPP_CNTL_0	0x0000
-#define MV64x60_MPP_CNTL_2	0x0008
-
-#define MV64x60_GPP_IO_CNTL	0x0000
-#define MV64x60_GPP_LEVEL_CNTL	0x0010
-#define MV64x60_GPP_VALUE_SET	0x0018
-
-#define PLATFORM_NAME_MAX	32
-
-static char prpmc2800_platform_name[PLATFORM_NAME_MAX];
-
-static void __iomem *mv64x60_mpp_reg_base;
-static void __iomem *mv64x60_gpp_reg_base;
-
-static void __init prpmc2800_setup_arch(void)
-{
-	struct device_node *np;
-	phys_addr_t paddr;
-	const unsigned int *reg;
-
-	/*
-	 * ioremap mpp and gpp registers in case they are later
-	 * needed by prpmc2800_reset_board().
-	 */
-	np = of_find_compatible_node(NULL, NULL, "marvell,mv64360-mpp");
-	reg = of_get_property(np, "reg", NULL);
-	paddr = of_translate_address(np, reg);
-	of_node_put(np);
-	mv64x60_mpp_reg_base = ioremap(paddr, reg[1]);
-
-	np = of_find_compatible_node(NULL, NULL, "marvell,mv64360-gpp");
-	reg = of_get_property(np, "reg", NULL);
-	paddr = of_translate_address(np, reg);
-	of_node_put(np);
-	mv64x60_gpp_reg_base = ioremap(paddr, reg[1]);
-
-#ifdef CONFIG_PCI
-	mv64x60_pci_init();
-#endif
-
-	printk("Motorola %s\n", prpmc2800_platform_name);
-}
-
-static void prpmc2800_reset_board(void)
-{
-	u32 temp;
-
-	local_irq_disable();
-
-	temp = in_le32(mv64x60_mpp_reg_base + MV64x60_MPP_CNTL_0);
-	temp &= 0xFFFF0FFF;
-	out_le32(mv64x60_mpp_reg_base + MV64x60_MPP_CNTL_0, temp);
-
-	temp = in_le32(mv64x60_gpp_reg_base + MV64x60_GPP_LEVEL_CNTL);
-	temp |= 0x00000004;
-	out_le32(mv64x60_gpp_reg_base + MV64x60_GPP_LEVEL_CNTL, temp);
-
-	temp = in_le32(mv64x60_gpp_reg_base + MV64x60_GPP_IO_CNTL);
-	temp |= 0x00000004;
-	out_le32(mv64x60_gpp_reg_base + MV64x60_GPP_IO_CNTL, temp);
-
-	temp = in_le32(mv64x60_mpp_reg_base + MV64x60_MPP_CNTL_2);
-	temp &= 0xFFFF0FFF;
-	out_le32(mv64x60_mpp_reg_base + MV64x60_MPP_CNTL_2, temp);
-
-	temp = in_le32(mv64x60_gpp_reg_base + MV64x60_GPP_LEVEL_CNTL);
-	temp |= 0x00080000;
-	out_le32(mv64x60_gpp_reg_base + MV64x60_GPP_LEVEL_CNTL, temp);
-
-	temp = in_le32(mv64x60_gpp_reg_base + MV64x60_GPP_IO_CNTL);
-	temp |= 0x00080000;
-	out_le32(mv64x60_gpp_reg_base + MV64x60_GPP_IO_CNTL, temp);
-
-	out_le32(mv64x60_gpp_reg_base + MV64x60_GPP_VALUE_SET, 0x00080004);
-}
-
-static void prpmc2800_restart(char *cmd)
-{
-	volatile ulong i = 10000000;
-
-	prpmc2800_reset_board();
-
-	while (i-- > 0);
-	panic("restart failed\n");
-}
-
-#ifdef CONFIG_NOT_COHERENT_CACHE
-#define PPRPM2800_COHERENCY_SETTING "off"
-#else
-#define PPRPM2800_COHERENCY_SETTING "on"
-#endif
-
-void prpmc2800_show_cpuinfo(struct seq_file *m)
-{
-	seq_printf(m, "Vendor\t\t: Motorola\n");
-	seq_printf(m, "coherency\t: %s\n", PPRPM2800_COHERENCY_SETTING);
-}
-
-/*
- * Called very early, device-tree isn't unflattened
- */
-static int __init prpmc2800_probe(void)
-{
-	unsigned long root = of_get_flat_dt_root();
-	unsigned long len = PLATFORM_NAME_MAX;
-	void *m;
-
-	if (!of_flat_dt_is_compatible(root, "motorola,PrPMC2800"))
-		return 0;
-
-	/* Update ppc_md.name with name from dt */
-	m = of_get_flat_dt_prop(root, "model", &len);
-	if (m)
-		strncpy(prpmc2800_platform_name, m,
-			min((int)len, PLATFORM_NAME_MAX - 1));
-
-	_set_L2CR(_get_L2CR() | L2CR_L2E);
-	return 1;
-}
-
-define_machine(prpmc2800){
-	.name			= prpmc2800_platform_name,
-	.probe			= prpmc2800_probe,
-	.setup_arch		= prpmc2800_setup_arch,
-	.init_early		= mv64x60_init_early,
-	.show_cpuinfo		= prpmc2800_show_cpuinfo,
-	.init_IRQ		= mv64x60_init_irq,
-	.get_irq		= mv64x60_get_irq,
-	.restart		= prpmc2800_restart,
-	.calibrate_decr		= generic_calibrate_decr,
-};
-- 
1.8.5.2

^ permalink raw reply related

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
From: Nishanth Aravamudan @ 2014-02-06  2:07 UTC (permalink / raw)
  To: David Rientjes
  Cc: Han Pingtian, penberg, linux-mm, paulus, Anton Blanchard, mpm,
	Christoph Lameter, linuxppc-dev, Joonsoo Kim, Wanpeng Li
In-Reply-To: <alpine.DEB.2.02.1401241618500.20466@chino.kir.corp.google.com>

On 24.01.2014 [16:25:58 -0800], David Rientjes wrote:
> On Fri, 24 Jan 2014, Nishanth Aravamudan wrote:
> 
> > Thank you for clarifying and providing  a test patch. I ran with this on
> > the system showing the original problem, configured to have 15GB of
> > memory.
> > 
> > With your patch after boot:
> > 
> > MemTotal:       15604736 kB
> > MemFree:         8768192 kB
> > Slab:            3882560 kB
> > SReclaimable:     105408 kB
> > SUnreclaim:      3777152 kB
> > 
> > With Anton's patch after boot:
> > 
> > MemTotal:       15604736 kB
> > MemFree:        11195008 kB
> > Slab:            1427968 kB
> > SReclaimable:     109184 kB
> > SUnreclaim:      1318784 kB
> > 
> > 
> > I know that's fairly unscientific, but the numbers are reproducible. 
> > 
> 
> I don't think the goal of the discussion is to reduce the amount of slab 
> allocated, but rather get the most local slab memory possible by use of 
> kmalloc_node().  When a memoryless node is being passed to kmalloc_node(), 
> which is probably cpu_to_node() for a cpu bound to a node without memory, 
> my patch is allocating it on the most local node; Anton's patch is 
> allocating it on whatever happened to be the cpu slab.
> 
> > > diff --git a/mm/slub.c b/mm/slub.c
> > > --- a/mm/slub.c
> > > +++ b/mm/slub.c
> > > @@ -2278,10 +2278,14 @@ redo:
> > > 
> > >  	if (unlikely(!node_match(page, node))) {
> > >  		stat(s, ALLOC_NODE_MISMATCH);
> > > -		deactivate_slab(s, page, c->freelist);
> > > -		c->page = NULL;
> > > -		c->freelist = NULL;
> > > -		goto new_slab;
> > > +		if (unlikely(!node_present_pages(node)))
> > > +			node = numa_mem_id();
> > > +		if (!node_match(page, node)) {
> > > +			deactivate_slab(s, page, c->freelist);
> > > +			c->page = NULL;
> > > +			c->freelist = NULL;
> > > +			goto new_slab;
> > > +		}
> > 
> > Semantically, and please correct me if I'm wrong, this patch is saying
> > if we have a memoryless node, we expect the page's locality to be that
> > of numa_mem_id(), and we still deactivate the slab if that isn't true.
> > Just wanting to make sure I understand the intent.
> > 
> 
> Yeah, the default policy should be to fallback to local memory if the node 
> passed is memoryless.
> 
> > What I find odd is that there are only 2 nodes on this system, node 0
> > (empty) and node 1. So won't numa_mem_id() always be 1? And every page
> > should be coming from node 1 (thus node_match() should always be true?)
> > 
> 
> The nice thing about slub is its debugging ability, what is 
> /sys/kernel/slab/cache/objects showing in comparison between the two 
> patches?

Ok, I finally got around to writing a script that compares the objects
output from both kernels.

log1 is with CONFIG_HAVE_MEMORYLESS_NODES on, my kthread locality patch
and Joonsoo's patch.

log2 is with CONFIG_HAVE_MEMORYLESS_NODES on, my kthread locality patch
and Anton's patch.

slab                           objects    objects   percent
                               log1       log2      change
-----------------------------------------------------------
:t-0000104                     71190      85680      20.353982 %
UDP                            4352       3392       22.058824 %
inode_cache                    54302      41923      22.796582 %
fscache_cookie_jar             3276       2457       25.000000 %
:t-0000896                     438        292        33.333333 %
:t-0000080                     310401     195323     37.073978 %
ext4_inode_cache               335        201        40.000000 %
:t-0000192                     89408      128898     44.168307 %
:t-0000184                     151300     81880      45.882353 %
:t-0000512                     49698      73648      48.191074 %
:at-0000192                    242867     120948     50.199904 %
xfs_inode                      34350      15221      55.688501 %
:t-0016384                     11005      17257      56.810541 %
proc_inode_cache               103868     34717      66.575846 %
tw_sock_TCP                    768        256        66.666667 %
:t-0004096                     15240      25672      68.451444 %
nfs_inode_cache                1008       315        68.750000 %
:t-0001024                     14528      24720      70.154185 %
:t-0032768                     655        1312       100.305344%
:t-0002048                     14242      30720      115.700042%
:t-0000640                     1020       2550       150.000000%
:t-0008192                     10005      27905      178.910545%

FWIW, the configuration of this LPAR has slightly changed. It is now configured
for maximally 400 CPUs, of which 200 are present. The result is that even with
Joonsoo's patch (log1 above), we OOM pretty easily and Anton's slab usage
script reports:

slab                                   mem     objs    slabs
                                      used   active   active
------------------------------------------------------------
kmalloc-512                        1182 MB    2.03%  100.00%
kmalloc-192                        1182 MB    1.38%  100.00%
kmalloc-16384                       966 MB   17.66%  100.00%
kmalloc-4096                        353 MB   15.92%  100.00%
kmalloc-8192                        259 MB   27.28%  100.00%
kmalloc-32768                       207 MB    9.86%  100.00%

In comparison (log2 above):

slab                                   mem     objs    slabs
                                      used   active   active
------------------------------------------------------------
kmalloc-16384                       273 MB   98.76%  100.00%
kmalloc-8192                        225 MB   98.67%  100.00%
pgtable-2^11                        114 MB  100.00%  100.00%
pgtable-2^12                        109 MB  100.00%  100.00%
kmalloc-4096                        104 MB   98.59%  100.00%

I appreciate all the help so far, if anyone has any ideas how best to
proceed further, or what they'd like debugged more, I'm happy to get
this fixed. We're hitting this on a couple of different systems and I'd
like to find a good resolution to the problem.

Thanks,
Nish

^ permalink raw reply

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
From: Nishanth Aravamudan @ 2014-02-06  2:08 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Han Pingtian, David Rientjes, penberg, linux-mm, paulus,
	Anton Blanchard, mpm, Joonsoo Kim, linuxppc-dev, Wanpeng Li
In-Reply-To: <alpine.DEB.2.10.1402051312430.21661@nuc>

On 05.02.2014 [13:28:03 -0600], Christoph Lameter wrote:
> On Tue, 4 Feb 2014, Nishanth Aravamudan wrote:
> 
> > > If the target node allocation fails (for whatever reason) then I would
> > > recommend for simplicities sake to change the target node to
> > > NUMA_NO_NODE and just take whatever is in the current cpu slab. A more
> > > complex solution would be to look through partial lists in increasing
> > > distance to find a partially used slab that is reasonable close to the
> > > current node. Slab has logic like that in fallback_alloc(). Slubs
> > > get_any_partial() function does something close to what you want.
> >
> > I apologize for my own ignorance, but I'm having trouble following.
> > Anton's original patch did fallback to the current cpu slab, but I'm not
> > sure any NUMA_NO_NODE change is necessary there. At the point we're
> > deactivating the slab (in the current code, in __slab_alloc()), we have
> > successfully allocated from somewhere, it's just not on the node we
> > expected to be on.
> 
> Right so if we are ignoring the node then the simplest thing to do is to
> not deactivate the current cpu slab but to take an object from it.

Ok, that's what Anton's patch does, I believe. Are you ok with that
patch as it is?

> > So perhaps you are saying to make a change lower in the code? I'm not
> > sure where it makes sense to change the target node in that case. I'd
> > appreciate any guidance you can give.
> 
> This not an easy thing to do. If the current slab is not the right node
> but would be the node from which the page allocator would be returning
> memory then the current slab can still be allocated from. If the fallback
> is to another node then the current cpu slab needs to be deactivated and
> the allocation from that node needs to proceeed. Have a look at
> fallback_alloc() in the slab allocator.
> 
> A allocation attempt from the page allocator can be restricted to a
> specific node through GFP_THIS_NODE.

Thanks for the pointers, I will try and take a look.

Thanks,
Nish

^ permalink raw reply

* arch/powerpc/math-emu/mtfsf.c - incorrect mask?
From: Stephen N Chivers @ 2014-02-06  2:09 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Chris Proctor

I have a MPC8548e based board and an application that makes
extensive use of floating point including numerous calls to cos.
In the same program there is the use of an sqlite database.

The kernel is derived from 2.6.31 and is compiled with math emulation.

At some point after the reading of the SQLITE database, the
return result from cos goes from an in range value to an out
of range value.

This is as a result of the FP rounding mode mutating from "round to 
nearest"
to "round toward zero".

The cos in the glibc version being used is known to be sensitive to 
rounding
direction and Joseph Myers has previously fixed glibc.

The failure does not occur on a machine that has a hardware floating
point unit (a MPC7410 processor).

I have traced the mutation to the following series of instructions:

        mffs            f0
        mtfsb1          4*cr7+so
        mtfsb0          4*cr7+eq
        fadd            f13,f1,f2
        mtfsf           1, f0

The instructions are part of the stream emitted by gcc for the conversion
of a 128 bit floating point value into an integer in the sqlite database 
read.

Immediately before the execution of the mffs instruction the "rounding
mode" is "round to nearest".

On the MPC8548 board, the execution of the mtfsf instruction does not
restore the rounding mode to "round to nearest".

I believe that the mask computation in mtfsf.c is incorrect and is 
reversed.

In the latest version of the file (linux-3.14-rc1), the mask is computed 
by:

                 mask = 0;
                 if (FM & (1 << 0))
                        mask |= 0x90000000;
                 if (FM & (1 << 1))
                        mask |= 0x0f000000;
                 if (FM & (1 << 2))
                        mask |= 0x00f00000;
                 if (FM & (1 << 3))
                        mask |= 0x000f0000;
                 if (FM & (1 << 4))
                        mask |= 0x0000f000;
                 if (FM & (1 << 5))
                        mask |= 0x00000f00;
                 if (FM & (1 << 6))
                        mask |= 0x000000f0;
                 if (FM & (1 << 7))
                        mask |= 0x0000000f;

I think it should be:

                mask = 0;
                if (FM & (1 << 0))
                        mask |= 0x0000000f;
                if (FM & (1 << 1))
                        mask |= 0x000000f0;
                if (FM & (1 << 2))
                        mask |= 0x00000f00;
                if (FM & (1 << 3))
                        mask |= 0x0000f000;
                if (FM & (1 << 4))
                        mask |= 0x000f0000;
                if (FM & (1 << 5))
                        mask |= 0x00f00000;
                if (FM & (1 << 6))
                        mask |= 0x0f000000;
                if (FM & (1 << 7))
                        mask |= 0x90000000;

With the above mask computation I get consistent results for both the 
MPC8548
and MPC7410 boards.

Am I missing something subtle?

Stephen Chivers,
CSC Australia Pty. Ltd.

^ permalink raw reply

* [PATCH V3 1/3] time: Change the return type of clockevents_notify() to integer
From: Preeti U Murthy @ 2014-02-06  5:49 UTC (permalink / raw)
  To: rafael.j.wysocki, linux-pm, peterz, fweisbec, daniel.lezcano,
	linux-kernel, paulus, benh, tglx, linuxppc-dev, mingo
  Cc: deepthi, paulmck, srivatsa.bhat
In-Reply-To: <20140206054158.4595.17827.stgit@preeti>

The broadcast framework can potentially be made use of by archs which do not have an
external clock device as well. Then, it is required that one of the CPUs need
to handle the broadcasting of wakeup IPIs to the CPUs in deep idle. As a
result its local timers should remain functional all the time. For such
a CPU, the BROADCAST_ENTER notification has to fail indicating that its clock
device cannot be shutdown. To make way for this support, change the return
type of tick_broadcast_oneshot_control() and hence clockevents_notify() to
indicate such scenarios.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---

 include/linux/clockchips.h   |    6 +++---
 kernel/time/clockevents.c    |    8 +++++---
 kernel/time/tick-broadcast.c |    6 ++++--
 kernel/time/tick-internal.h  |    6 +++---
 4 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h
index 493aa02..e0c5a6c 100644
--- a/include/linux/clockchips.h
+++ b/include/linux/clockchips.h
@@ -186,9 +186,9 @@ static inline int tick_check_broadcast_expired(void) { return 0; }
 #endif
 
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
-extern void clockevents_notify(unsigned long reason, void *arg);
+extern int clockevents_notify(unsigned long reason, void *arg);
 #else
-static inline void clockevents_notify(unsigned long reason, void *arg) {}
+static inline int clockevents_notify(unsigned long reason, void *arg) { return 0; }
 #endif
 
 #else /* CONFIG_GENERIC_CLOCKEVENTS_BUILD */
@@ -196,7 +196,7 @@ static inline void clockevents_notify(unsigned long reason, void *arg) {}
 static inline void clockevents_suspend(void) {}
 static inline void clockevents_resume(void) {}
 
-static inline void clockevents_notify(unsigned long reason, void *arg) {}
+static inline int clockevents_notify(unsigned long reason, void *arg) { return 0; }
 static inline int tick_check_broadcast_expired(void) { return 0; }
 
 #endif
diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index 086ad60..79b8685 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -524,12 +524,13 @@ void clockevents_resume(void)
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
 /**
  * clockevents_notify - notification about relevant events
+ * Returns 0 on success, any other value on error
  */
-void clockevents_notify(unsigned long reason, void *arg)
+int clockevents_notify(unsigned long reason, void *arg)
 {
 	struct clock_event_device *dev, *tmp;
 	unsigned long flags;
-	int cpu;
+	int cpu, ret = 0;
 
 	raw_spin_lock_irqsave(&clockevents_lock, flags);
 
@@ -542,7 +543,7 @@ void clockevents_notify(unsigned long reason, void *arg)
 
 	case CLOCK_EVT_NOTIFY_BROADCAST_ENTER:
 	case CLOCK_EVT_NOTIFY_BROADCAST_EXIT:
-		tick_broadcast_oneshot_control(reason);
+		ret = tick_broadcast_oneshot_control(reason);
 		break;
 
 	case CLOCK_EVT_NOTIFY_CPU_DYING:
@@ -585,6 +586,7 @@ void clockevents_notify(unsigned long reason, void *arg)
 		break;
 	}
 	raw_spin_unlock_irqrestore(&clockevents_lock, flags);
+	return ret;
 }
 EXPORT_SYMBOL_GPL(clockevents_notify);
 
diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index 43780ab..ddf2ac2 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -633,14 +633,15 @@ again:
 /*
  * Powerstate information: The system enters/leaves a state, where
  * affected devices might stop
+ * Returns 0 on success, -EBUSY if the cpu is used to broadcast wakeups.
  */
-void tick_broadcast_oneshot_control(unsigned long reason)
+int tick_broadcast_oneshot_control(unsigned long reason)
 {
 	struct clock_event_device *bc, *dev;
 	struct tick_device *td;
 	unsigned long flags;
 	ktime_t now;
-	int cpu;
+	int cpu, ret = 0;
 
 	/*
 	 * Periodic mode does not care about the enter/exit of power
@@ -746,6 +747,7 @@ void tick_broadcast_oneshot_control(unsigned long reason)
 	}
 out:
 	raw_spin_unlock_irqrestore(&tick_broadcast_lock, flags);
+	return ret;
 }
 
 /*
diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h
index 8329669..f0dc03c 100644
--- a/kernel/time/tick-internal.h
+++ b/kernel/time/tick-internal.h
@@ -46,7 +46,7 @@ extern int tick_switch_to_oneshot(void (*handler)(struct clock_event_device *));
 extern void tick_resume_oneshot(void);
 # ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
 extern void tick_broadcast_setup_oneshot(struct clock_event_device *bc);
-extern void tick_broadcast_oneshot_control(unsigned long reason);
+extern int tick_broadcast_oneshot_control(unsigned long reason);
 extern void tick_broadcast_switch_to_oneshot(void);
 extern void tick_shutdown_broadcast_oneshot(unsigned int *cpup);
 extern int tick_resume_broadcast_oneshot(struct clock_event_device *bc);
@@ -58,7 +58,7 @@ static inline void tick_broadcast_setup_oneshot(struct clock_event_device *bc)
 {
 	BUG();
 }
-static inline void tick_broadcast_oneshot_control(unsigned long reason) { }
+static inline int tick_broadcast_oneshot_control(unsigned long reason) { return 0; }
 static inline void tick_broadcast_switch_to_oneshot(void) { }
 static inline void tick_shutdown_broadcast_oneshot(unsigned int *cpup) { }
 static inline int tick_broadcast_oneshot_active(void) { return 0; }
@@ -87,7 +87,7 @@ static inline void tick_broadcast_setup_oneshot(struct clock_event_device *bc)
 {
 	BUG();
 }
-static inline void tick_broadcast_oneshot_control(unsigned long reason) { }
+static inline int tick_broadcast_oneshot_control(unsigned long reason) { return 0; }
 static inline void tick_shutdown_broadcast_oneshot(unsigned int *cpup) { }
 static inline int tick_resume_broadcast_oneshot(struct clock_event_device *bc)
 {

^ permalink raw reply related

* [PATCH V3 0/3] time/cpuidle: Support in tick broadcast framework in absence of external clock device
From: Preeti U Murthy @ 2014-02-06  5:49 UTC (permalink / raw)
  To: rafael.j.wysocki, linux-pm, peterz, fweisbec, daniel.lezcano,
	linux-kernel, paulus, benh, tglx, linuxppc-dev, mingo
  Cc: deepthi, paulmck, srivatsa.bhat

On some architectures, the local timers of CPUs stop in deep idle states.
They will need to depend on an external clock device to wake them up. However
certain implementations of archs do not have an external clock device.

This patchset provides support in the tick broadcast framework for such
architectures so as to enable the CPUs to get into deep idle.

Presently we are in need of this support on certain implementations of
PowerPC. This patchset has thus been tested on the same.

V1: https://lkml.org/lkml/2013/12/12/687.
V2: https://lkml.org/lkml/2014/1/24/28

Changes in V3:
1. Modified comments and code around programming of the broadcast hrtimer.

---

Preeti U Murthy (2):
      time: Change the return type of clockevents_notify() to integer
      time/cpuidle:Handle failed call to BROADCAST_ENTER on archs with CPUIDLE_FLAG_TIMER_STOP set

Thomas Gleixner (1):
      tick/cpuidle: Initialize hrtimer mode of broadcast


 drivers/cpuidle/cpuidle.c            |   38 +++++++-----
 include/linux/clockchips.h           |   15 ++++-
 kernel/time/Makefile                 |    2 -
 kernel/time/clockevents.c            |    8 ++-
 kernel/time/tick-broadcast-hrtimer.c |  105 ++++++++++++++++++++++++++++++++++
 kernel/time/tick-broadcast.c         |   51 ++++++++++++++++-
 kernel/time/tick-internal.h          |    6 +-
 7 files changed, 197 insertions(+), 28 deletions(-)
 create mode 100644 kernel/time/tick-broadcast-hrtimer.c

-- 

^ permalink raw reply

* [PATCH V3 3/3] time/cpuidle:Handle failed call to BROADCAST_ENTER on archs with CPUIDLE_FLAG_TIMER_STOP set
From: Preeti U Murthy @ 2014-02-06  5:50 UTC (permalink / raw)
  To: rafael.j.wysocki, linux-pm, peterz, fweisbec, daniel.lezcano,
	linux-kernel, paulus, benh, tglx, linuxppc-dev, mingo
  Cc: deepthi, paulmck, srivatsa.bhat
In-Reply-To: <20140206054158.4595.17827.stgit@preeti>

Some archs set the CPUIDLE_FLAG_TIMER_STOP flag for idle states in which the
local timers stop. The cpuidle_idle_call() currently handles such idle states
by calling into the broadcast framework so as to wakeup CPUs at their next
wakeup event. With the hrtimer mode of broadcast, the BROADCAST_ENTER call
into the broadcast frameowork can fail for archs that do not have an external
clock device to handle wakeups and the CPU in question has to thus be made
the stand by CPU. This patch handles such cases by failing the call into
cpuidle so that the arch can take some default action. The arch will certainly
not enter a similar idle state because a failed cpuidle call will also implicitly
indicate that the broadcast framework has not registered this CPU to be woken up.
Hence we are safe if we fail the cpuidle call.

In the process move the functions that trace idle statistics just before and
after the entry and exit into idle states respectively. In other
scenarios where the call to cpuidle fails, we end up not tracing idle
entry and exit since a decision on an idle state could not be taken. Similarly
when the call to broadcast framework fails, we skip tracing idle statistics
because we are in no further position to take a decision on an alternative
idle state to enter into.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---

 drivers/cpuidle/cpuidle.c |   38 +++++++++++++++++++++++---------------
 1 file changed, 23 insertions(+), 15 deletions(-)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index a55e68f..8f42033 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -117,15 +117,19 @@ int cpuidle_idle_call(void)
 {
 	struct cpuidle_device *dev = __this_cpu_read(cpuidle_devices);
 	struct cpuidle_driver *drv;
-	int next_state, entered_state;
-	bool broadcast;
+	int next_state, entered_state, ret = 0;
+	bool broadcast = false;
 
-	if (off || !initialized)
-		return -ENODEV;
+	if (off || !initialized) {
+		ret = -ENODEV;
+		goto out;
+	}
 
 	/* check if the device is ready */
-	if (!dev || !dev->enabled)
-		return -EBUSY;
+	if (!dev || !dev->enabled) {
+		ret = -EBUSY;
+		goto out;
+	}
 
 	drv = cpuidle_get_cpu_driver(dev);
 
@@ -137,15 +141,18 @@ int cpuidle_idle_call(void)
 		if (cpuidle_curr_governor->reflect)
 			cpuidle_curr_governor->reflect(dev, next_state);
 		local_irq_enable();
-		return 0;
+		goto out;
 	}
 
-	trace_cpu_idle_rcuidle(next_state, dev->cpu);
-
 	broadcast = !!(drv->states[next_state].flags & CPUIDLE_FLAG_TIMER_STOP);
 
-	if (broadcast)
-		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu);
+	if (broadcast) {
+		ret = clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu);
+		if (ret)
+			goto out;
+	}
+
+	trace_cpu_idle_rcuidle(next_state, dev->cpu);
 
 	if (cpuidle_state_is_coupled(dev, drv, next_state))
 		entered_state = cpuidle_enter_state_coupled(dev, drv,
@@ -153,16 +160,17 @@ int cpuidle_idle_call(void)
 	else
 		entered_state = cpuidle_enter_state(dev, drv, next_state);
 
-	if (broadcast)
-		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &dev->cpu);
-
 	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);
 
 	/* give the governor an opportunity to reflect on the outcome */
 	if (cpuidle_curr_governor->reflect)
 		cpuidle_curr_governor->reflect(dev, entered_state);
 
-	return 0;
+out:	if (broadcast)
+		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &dev->cpu);
+
+
+	return ret;
 }
 
 /**

^ permalink raw reply related

* [PATCH V3 2/3] tick/cpuidle: Initialize hrtimer mode of broadcast
From: Preeti U Murthy @ 2014-02-06  5:50 UTC (permalink / raw)
  To: rafael.j.wysocki, linux-pm, peterz, fweisbec, daniel.lezcano,
	linux-kernel, paulus, benh, tglx, linuxppc-dev, mingo
  Cc: deepthi, paulmck, srivatsa.bhat
In-Reply-To: <20140206054158.4595.17827.stgit@preeti>

From: Thomas Gleixner <tglx@linutronix.de>

On some architectures, in certain CPU deep idle states the local timers stop.
An external clock device is used to wakeup these CPUs. The kernel support for the
wakeup of these CPUs is provided by the tick broadcast framework by using the
external clock device as the wakeup source.

However not all implementations of architectures provide such an external
clock device. This patch includes support in the broadcast framework to handle
the wakeup of the CPUs in deep idle states on such systems by queuing a hrtimer
on one of the CPUs, which is meant to handle the wakeup of CPUs in deep idle states.

This patchset introduces a pseudo clock device which can be registered by the
archs as tick_broadcast_device in the absence of a real external clock
device. Once registered, the broadcast framework will work as is for these
architectures as long as the archs take care of the BROADCAST_ENTER
notification failing for one of the CPUs. This CPU is made the stand by CPU to
handle wakeup of the CPUs in deep idle and it *must not enter deep idle states*.

The CPU with the earliest wakeup is chosen to be this CPU. Hence this way the
stand by CPU dynamically moves around and so does the hrtimer which is queued
to trigger at the next earliest wakeup time. This is consistent with the case where
an external clock device is present. The smp affinity of this clock device is
set to the CPU with the earliest wakeup. This patchset handles the hotplug of
the stand by CPU as well by moving the hrtimer on to the CPU handling the CPU_DEAD
notification.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
[Added Changelog and code to handle reprogramming of hrtimer]
---

 include/linux/clockchips.h           |    9 +++
 kernel/time/Makefile                 |    2 -
 kernel/time/tick-broadcast-hrtimer.c |  105 ++++++++++++++++++++++++++++++++++
 kernel/time/tick-broadcast.c         |   45 ++++++++++++++-
 4 files changed, 159 insertions(+), 2 deletions(-)
 create mode 100644 kernel/time/tick-broadcast-hrtimer.c

diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h
index e0c5a6c..dbe9e14 100644
--- a/include/linux/clockchips.h
+++ b/include/linux/clockchips.h
@@ -62,6 +62,11 @@ enum clock_event_mode {
 #define CLOCK_EVT_FEAT_DYNIRQ		0x000020
 #define CLOCK_EVT_FEAT_PERCPU		0x000040
 
+/*
+ * Clockevent device is based on a hrtimer for broadcast
+ */
+#define CLOCK_EVT_FEAT_HRTIMER		0x000080
+
 /**
  * struct clock_event_device - clock event device descriptor
  * @event_handler:	Assigned by the framework to be called by the low
@@ -83,6 +88,7 @@ enum clock_event_mode {
  * @name:		ptr to clock event name
  * @rating:		variable to rate clock event devices
  * @irq:		IRQ number (only for non CPU local devices)
+ * @bound_on:		Bound on CPU
  * @cpumask:		cpumask to indicate for which CPUs this device works
  * @list:		list head for the management code
  * @owner:		module reference
@@ -113,6 +119,7 @@ struct clock_event_device {
 	const char		*name;
 	int			rating;
 	int			irq;
+	int			bound_on;
 	const struct cpumask	*cpumask;
 	struct list_head	list;
 	struct module		*owner;
@@ -180,9 +187,11 @@ extern int tick_receive_broadcast(void);
 #endif
 
 #if defined(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST) && defined(CONFIG_TICK_ONESHOT)
+extern void tick_setup_hrtimer_broadcast(void);
 extern int tick_check_broadcast_expired(void);
 #else
 static inline int tick_check_broadcast_expired(void) { return 0; }
+static void tick_setup_hrtimer_broadcast(void) {};
 #endif
 
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
diff --git a/kernel/time/Makefile b/kernel/time/Makefile
index 9250130..06151ef 100644
--- a/kernel/time/Makefile
+++ b/kernel/time/Makefile
@@ -3,7 +3,7 @@ obj-y += timeconv.o posix-clock.o alarmtimer.o
 
 obj-$(CONFIG_GENERIC_CLOCKEVENTS_BUILD)		+= clockevents.o
 obj-$(CONFIG_GENERIC_CLOCKEVENTS)		+= tick-common.o
-obj-$(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST)	+= tick-broadcast.o
+obj-$(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST)	+= tick-broadcast.o tick-broadcast-hrtimer.o
 obj-$(CONFIG_GENERIC_SCHED_CLOCK)		+= sched_clock.o
 obj-$(CONFIG_TICK_ONESHOT)			+= tick-oneshot.o
 obj-$(CONFIG_TICK_ONESHOT)			+= tick-sched.o
diff --git a/kernel/time/tick-broadcast-hrtimer.c b/kernel/time/tick-broadcast-hrtimer.c
new file mode 100644
index 0000000..af1e119
--- /dev/null
+++ b/kernel/time/tick-broadcast-hrtimer.c
@@ -0,0 +1,105 @@
+/*
+ * linux/kernel/time/tick-broadcast-hrtimer.c
+ * This file emulates a local clock event device
+ * via a pseudo clock device.
+ */
+#include <linux/cpu.h>
+#include <linux/err.h>
+#include <linux/hrtimer.h>
+#include <linux/interrupt.h>
+#include <linux/percpu.h>
+#include <linux/profile.h>
+#include <linux/clockchips.h>
+#include <linux/sched.h>
+#include <linux/smp.h>
+#include <linux/module.h>
+
+#include "tick-internal.h"
+
+static struct hrtimer bctimer;
+
+static void bc_set_mode(enum clock_event_mode mode,
+			struct clock_event_device *bc)
+{
+	switch (mode) {
+	case CLOCK_EVT_MODE_SHUTDOWN:
+		/*
+		 * Note, we cannot cancel the timer here as we might
+		 * run into the following live lock scenario:
+		 *
+		 * cpu 0		cpu1
+		 * lock(broadcast_lock);
+		 *			hrtimer_interrupt()
+		 *			bc_handler()
+		 *			   tick_handle_oneshot_broadcast();
+		 *			    lock(broadcast_lock);
+		 * hrtimer_cancel()
+		 *  wait_for_callback()
+		 */
+		hrtimer_try_to_cancel(&bctimer);
+		break;
+	default:
+		break;
+	}
+}
+
+/*
+ * This is called from the guts of the broadcast code when the cpu
+ * which is about to enter idle has the earliest broadcast timer event.
+ */
+static int bc_set_next(ktime_t expires, struct clock_event_device *bc)
+{
+	/*
+	 * We try to cancel the timer first. If the callback is on
+	 * flight on some other cpu then we let it handle it. If we
+	 * were able to cancel the timer nothing can rearm it as we
+	 * own broadcast_lock.
+	 *
+	 * However we can also be called from the event handler of
+	 * ce_broadcast_hrtimer itself when it expires. We cannot therefore
+	 * restart the timer since it is on flight on the same CPU. But
+	 * due to the same reason we can reset it.
+	 */
+	if (hrtimer_try_to_cancel(&bctimer) >= 0) {
+		hrtimer_start(&bctimer, expires, HRTIMER_MODE_ABS_PINNED);
+		/* Bind the "device" to the cpu */
+		bc->bound_on = smp_processor_id();
+	} else if (bc->bound_on == smp_processor_id()) {
+		hrtimer_set_expires(&bctimer, expires);
+	}
+	return 0;
+}
+
+static struct clock_event_device ce_broadcast_hrtimer = {
+	.set_mode		= bc_set_mode,
+	.set_next_ktime		= bc_set_next,
+	.features		= CLOCK_EVT_FEAT_ONESHOT |
+				  CLOCK_EVT_FEAT_KTIME |
+				  CLOCK_EVT_FEAT_HRTIMER,
+	.rating			= 0,
+	.bound_on		= -1,
+	.min_delta_ns		= 1,
+	.max_delta_ns		= KTIME_MAX,
+	.min_delta_ticks	= 1,
+	.max_delta_ticks	= KTIME_MAX,
+	.mult			= 1,
+	.shift			= 0,
+	.cpumask		= cpu_all_mask,
+};
+
+static enum hrtimer_restart bc_handler(struct hrtimer *t)
+{
+	ce_broadcast_hrtimer.event_handler(&ce_broadcast_hrtimer);
+
+	if (ce_broadcast_hrtimer.next_event.tv64 == KTIME_MAX)
+		return HRTIMER_NORESTART;
+
+	return HRTIMER_RESTART;
+}
+
+void tick_setup_hrtimer_broadcast(void)
+{
+	hrtimer_init(&bctimer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
+	bctimer.function = bc_handler;
+	clockevents_register_device(&ce_broadcast_hrtimer);
+}
diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index ddf2ac2..1e46e62f 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -630,6 +630,42 @@ again:
 	raw_spin_unlock(&tick_broadcast_lock);
 }
 
+static int broadcast_needs_cpu(struct clock_event_device *bc, int cpu)
+{
+	if (!(bc->features & CLOCK_EVT_FEAT_HRTIMER))
+		return 0;
+	if (bc->next_event.tv64 == KTIME_MAX)
+		return 0;
+	return bc->bound_on == cpu ? -EBUSY : 0;
+}
+
+static void broadcast_shutdown_local(struct clock_event_device *bc,
+				     struct clock_event_device *dev)
+{
+	/*
+	 * For hrtimer based broadcasting we cannot shutdown the cpu
+	 * local device if our own event is the first one to expire or
+	 * if we own the broadcast timer.
+	 */
+	if (bc->features & CLOCK_EVT_FEAT_HRTIMER) {
+		if (broadcast_needs_cpu(bc, smp_processor_id()))
+			return;
+		if (dev->next_event.tv64 < bc->next_event.tv64)
+			return;
+	}
+	clockevents_set_mode(dev, CLOCK_EVT_MODE_SHUTDOWN);
+}
+
+static void broadcast_move_bc(int deadcpu)
+{
+	struct clock_event_device *bc = tick_broadcast_device.evtdev;
+
+	if (!bc || !broadcast_needs_cpu(bc, deadcpu))
+		return;
+	/* This moves the broadcast assignment to this cpu */
+	clockevents_program_event(bc, bc->next_event, 1);
+}
+
 /*
  * Powerstate information: The system enters/leaves a state, where
  * affected devices might stop
@@ -667,7 +703,7 @@ int tick_broadcast_oneshot_control(unsigned long reason)
 	if (reason == CLOCK_EVT_NOTIFY_BROADCAST_ENTER) {
 		if (!cpumask_test_and_set_cpu(cpu, tick_broadcast_oneshot_mask)) {
 			WARN_ON_ONCE(cpumask_test_cpu(cpu, tick_broadcast_pending_mask));
-			clockevents_set_mode(dev, CLOCK_EVT_MODE_SHUTDOWN);
+			broadcast_shutdown_local(bc, dev);
 			/*
 			 * We only reprogram the broadcast timer if we
 			 * did not mark ourself in the force mask and
@@ -680,6 +716,11 @@ int tick_broadcast_oneshot_control(unsigned long reason)
 			    dev->next_event.tv64 < bc->next_event.tv64)
 				tick_broadcast_set_event(bc, cpu, dev->next_event, 1);
 		}
+		/*
+		 * If the current CPU owns the hrtimer broadcast
+		 * mechanism, it cannot go deep idle.
+		 */
+		ret = broadcast_needs_cpu(bc, cpu);
 	} else {
 		if (cpumask_test_and_clear_cpu(cpu, tick_broadcast_oneshot_mask)) {
 			clockevents_set_mode(dev, CLOCK_EVT_MODE_ONESHOT);
@@ -853,6 +894,8 @@ void tick_shutdown_broadcast_oneshot(unsigned int *cpup)
 	cpumask_clear_cpu(cpu, tick_broadcast_pending_mask);
 	cpumask_clear_cpu(cpu, tick_broadcast_force_mask);
 
+	broadcast_move_bc(cpu);
+
 	raw_spin_unlock_irqrestore(&tick_broadcast_lock, flags);
 }
 

^ permalink raw reply related

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
From: Joonsoo Kim @ 2014-02-06  8:04 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Han Pingtian, mpm, penberg, linux-mm, paulus, Anton Blanchard,
	David Rientjes, Christoph Lameter, linuxppc-dev, Wanpeng Li
In-Reply-To: <20140206020757.GC5433@linux.vnet.ibm.com>

On Wed, Feb 05, 2014 at 06:07:57PM -0800, Nishanth Aravamudan wrote:
> On 24.01.2014 [16:25:58 -0800], David Rientjes wrote:
> > On Fri, 24 Jan 2014, Nishanth Aravamudan wrote:
> > 
> > > Thank you for clarifying and providing  a test patch. I ran with this on
> > > the system showing the original problem, configured to have 15GB of
> > > memory.
> > > 
> > > With your patch after boot:
> > > 
> > > MemTotal:       15604736 kB
> > > MemFree:         8768192 kB
> > > Slab:            3882560 kB
> > > SReclaimable:     105408 kB
> > > SUnreclaim:      3777152 kB
> > > 
> > > With Anton's patch after boot:
> > > 
> > > MemTotal:       15604736 kB
> > > MemFree:        11195008 kB
> > > Slab:            1427968 kB
> > > SReclaimable:     109184 kB
> > > SUnreclaim:      1318784 kB
> > > 
> > > 
> > > I know that's fairly unscientific, but the numbers are reproducible. 
> > > 
> > 
> > I don't think the goal of the discussion is to reduce the amount of slab 
> > allocated, but rather get the most local slab memory possible by use of 
> > kmalloc_node().  When a memoryless node is being passed to kmalloc_node(), 
> > which is probably cpu_to_node() for a cpu bound to a node without memory, 
> > my patch is allocating it on the most local node; Anton's patch is 
> > allocating it on whatever happened to be the cpu slab.
> > 
> > > > diff --git a/mm/slub.c b/mm/slub.c
> > > > --- a/mm/slub.c
> > > > +++ b/mm/slub.c
> > > > @@ -2278,10 +2278,14 @@ redo:
> > > > 
> > > >  	if (unlikely(!node_match(page, node))) {
> > > >  		stat(s, ALLOC_NODE_MISMATCH);
> > > > -		deactivate_slab(s, page, c->freelist);
> > > > -		c->page = NULL;
> > > > -		c->freelist = NULL;
> > > > -		goto new_slab;
> > > > +		if (unlikely(!node_present_pages(node)))
> > > > +			node = numa_mem_id();
> > > > +		if (!node_match(page, node)) {
> > > > +			deactivate_slab(s, page, c->freelist);
> > > > +			c->page = NULL;
> > > > +			c->freelist = NULL;
> > > > +			goto new_slab;
> > > > +		}
> > > 
> > > Semantically, and please correct me if I'm wrong, this patch is saying
> > > if we have a memoryless node, we expect the page's locality to be that
> > > of numa_mem_id(), and we still deactivate the slab if that isn't true.
> > > Just wanting to make sure I understand the intent.
> > > 
> > 
> > Yeah, the default policy should be to fallback to local memory if the node 
> > passed is memoryless.
> > 
> > > What I find odd is that there are only 2 nodes on this system, node 0
> > > (empty) and node 1. So won't numa_mem_id() always be 1? And every page
> > > should be coming from node 1 (thus node_match() should always be true?)
> > > 
> > 
> > The nice thing about slub is its debugging ability, what is 
> > /sys/kernel/slab/cache/objects showing in comparison between the two 
> > patches?
> 
> Ok, I finally got around to writing a script that compares the objects
> output from both kernels.
> 
> log1 is with CONFIG_HAVE_MEMORYLESS_NODES on, my kthread locality patch
> and Joonsoo's patch.
> 
> log2 is with CONFIG_HAVE_MEMORYLESS_NODES on, my kthread locality patch
> and Anton's patch.
> 
> slab                           objects    objects   percent
>                                log1       log2      change
> -----------------------------------------------------------
> :t-0000104                     71190      85680      20.353982 %
> UDP                            4352       3392       22.058824 %
> inode_cache                    54302      41923      22.796582 %
> fscache_cookie_jar             3276       2457       25.000000 %
> :t-0000896                     438        292        33.333333 %
> :t-0000080                     310401     195323     37.073978 %
> ext4_inode_cache               335        201        40.000000 %
> :t-0000192                     89408      128898     44.168307 %
> :t-0000184                     151300     81880      45.882353 %
> :t-0000512                     49698      73648      48.191074 %
> :at-0000192                    242867     120948     50.199904 %
> xfs_inode                      34350      15221      55.688501 %
> :t-0016384                     11005      17257      56.810541 %
> proc_inode_cache               103868     34717      66.575846 %
> tw_sock_TCP                    768        256        66.666667 %
> :t-0004096                     15240      25672      68.451444 %
> nfs_inode_cache                1008       315        68.750000 %
> :t-0001024                     14528      24720      70.154185 %
> :t-0032768                     655        1312       100.305344%
> :t-0002048                     14242      30720      115.700042%
> :t-0000640                     1020       2550       150.000000%
> :t-0008192                     10005      27905      178.910545%
> 
> FWIW, the configuration of this LPAR has slightly changed. It is now configured
> for maximally 400 CPUs, of which 200 are present. The result is that even with
> Joonsoo's patch (log1 above), we OOM pretty easily and Anton's slab usage
> script reports:
> 
> slab                                   mem     objs    slabs
>                                       used   active   active
> ------------------------------------------------------------
> kmalloc-512                        1182 MB    2.03%  100.00%
> kmalloc-192                        1182 MB    1.38%  100.00%
> kmalloc-16384                       966 MB   17.66%  100.00%
> kmalloc-4096                        353 MB   15.92%  100.00%
> kmalloc-8192                        259 MB   27.28%  100.00%
> kmalloc-32768                       207 MB    9.86%  100.00%
> 
> In comparison (log2 above):
> 
> slab                                   mem     objs    slabs
>                                       used   active   active
> ------------------------------------------------------------
> kmalloc-16384                       273 MB   98.76%  100.00%
> kmalloc-8192                        225 MB   98.67%  100.00%
> pgtable-2^11                        114 MB  100.00%  100.00%
> pgtable-2^12                        109 MB  100.00%  100.00%
> kmalloc-4096                        104 MB   98.59%  100.00%
> 
> I appreciate all the help so far, if anyone has any ideas how best to
> proceed further, or what they'd like debugged more, I'm happy to get
> this fixed. We're hitting this on a couple of different systems and I'd
> like to find a good resolution to the problem.

Hello,

I have no memoryless system, so, to debug it, I need your help. :)
First, please let me know node information on your system.

I'm preparing 3 another patches which are nearly same with previous patch,
but slightly different approach. Could you test them on your system?
I will send them soon.

And I think that same problem exists if CONFIG_SLAB is enabled. Could you
confirm that?

And, could you confirm that your system's numa_mem_id() is properly set?
And, could you confirm that node_present_pages() test works properly?
And, with my patches, could you give me more information on slub stat?
For this, you need to enable CONFIG_SLUB_STATS. Then please send me all the
slub stat on /proc/sys/kernel/debug/slab.

Sorry for too many request.
If it bothers you too much, please ignore it :)

Thanks.

^ permalink raw reply

* [RFC PATCH 3/3] slub: fallback to get_numa_mem() node if we want to allocate on memoryless node
From: Joonsoo Kim @ 2014-02-06  8:07 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Han Pingtian, mpm, penberg, linux-mm, paulus, Anton Blanchard,
	David Rientjes, Christoph Lameter, linuxppc-dev, Joonsoo Kim,
	Wanpeng Li
In-Reply-To: <1391674026-20092-1-git-send-email-iamjoonsoo.kim@lge.com>

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

diff --git a/mm/slub.c b/mm/slub.c
index cc1f995..c851f82 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1700,6 +1700,14 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
 	void *object;
 	int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
 
+	if (node == NUMA_NO_NODE)
+		searchnode = numa_mem_id();
+	else {
+		searchnode = node;
+		if (!node_present_pages(node))
+			searchnode = get_numa_mem(node);
+	}
+
 	object = get_partial_node(s, get_node(s, searchnode), c, flags);
 	if (object || node != NUMA_NO_NODE)
 		return object;
@@ -2277,11 +2285,18 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 redo:
 
 	if (unlikely(!node_match(page, node))) {
-		stat(s, ALLOC_NODE_MISMATCH);
-		deactivate_slab(s, page, c->freelist);
-		c->page = NULL;
-		c->freelist = NULL;
-		goto new_slab;
+		int searchnode = node;
+
+		if (node != NUMA_NO_NODE && !node_present_pages(node))
+			searchnode = get_numa_mem(node);
+
+		if (!node_match(page, searchnode)) {
+			stat(s, ALLOC_NODE_MISMATCH);
+			deactivate_slab(s, page, c->freelist);
+			c->page = NULL;
+			c->freelist = NULL;
+			goto new_slab;
+		}
 	}
 
 	/*
-- 
1.7.9.5

^ permalink raw reply related

* [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
From: Joonsoo Kim @ 2014-02-06  8:07 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Han Pingtian, mpm, penberg, linux-mm, paulus, Anton Blanchard,
	David Rientjes, Christoph Lameter, linuxppc-dev, Joonsoo Kim,
	Wanpeng Li
In-Reply-To: <1391674026-20092-1-git-send-email-iamjoonsoo.kim@lge.com>

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 12ae6ce..a6d5438 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -233,11 +233,20 @@ static inline int numa_node_id(void)
  * Use the accessor functions set_numa_mem(), numa_mem_id() and cpu_to_mem().
  */
 DECLARE_PER_CPU(int, _numa_mem_);
+int _node_numa_mem_[MAX_NUMNODES];
 
 #ifndef set_numa_mem
 static inline void set_numa_mem(int node)
 {
 	this_cpu_write(_numa_mem_, node);
+	_node_numa_mem_[numa_node_id()] = node;
+}
+#endif
+
+#ifndef get_numa_mem
+static inline int get_numa_mem(int node)
+{
+	return _node_numa_mem_[node];
 }
 #endif
 
@@ -260,6 +269,7 @@ static inline int cpu_to_mem(int cpu)
 static inline void set_cpu_numa_mem(int cpu, int node)
 {
 	per_cpu(_numa_mem_, cpu) = node;
+	_node_numa_mem_[numa_node_id()] = node;
 }
 #endif
 
@@ -273,6 +283,13 @@ static inline int numa_mem_id(void)
 }
 #endif
 
+#ifndef get_numa_mem
+static inline int get_numa_mem(int node)
+{
+	return node;
+}
+#endif
+
 #ifndef cpu_to_mem
 static inline int cpu_to_mem(int cpu)
 {
-- 
1.7.9.5

^ permalink raw reply related

* [RFC PATCH 1/3] slub: search partial list on numa_mem_id(), instead of numa_node_id()
From: Joonsoo Kim @ 2014-02-06  8:07 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Han Pingtian, mpm, penberg, linux-mm, paulus, Anton Blanchard,
	David Rientjes, Christoph Lameter, linuxppc-dev, Joonsoo Kim,
	Wanpeng Li
In-Reply-To: <20140206020757.GC5433@linux.vnet.ibm.com>

Currently, if allocation constraint to node is NUMA_NO_NODE, we search
a partial slab on numa_node_id() node. This doesn't work properly on the
system having memoryless node, since it can have no memory on that node and
there must be no partial slab on that node.

On that node, page allocation always fallback to numa_mem_id() first. So
searching a partial slab on numa_node_id() in that case is proper solution
for memoryless node case.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

diff --git a/mm/slub.c b/mm/slub.c
index 545a170..cc1f995 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1698,7 +1698,7 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
 		struct kmem_cache_cpu *c)
 {
 	void *object;
-	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
+	int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
 
 	object = get_partial_node(s, get_node(s, searchnode), c, flags);
 	if (object || node != NUMA_NO_NODE)
-- 
1.7.9.5

^ permalink raw reply related

* Re: [RFC PATCH 1/3] slub: search partial list on numa_mem_id(), instead of numa_node_id()
From: David Rientjes @ 2014-02-06  8:37 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Han Pingtian, Nishanth Aravamudan, penberg, linux-mm, paulus,
	Anton Blanchard, mpm, Christoph Lameter, linuxppc-dev, Wanpeng Li
In-Reply-To: <1391674026-20092-1-git-send-email-iamjoonsoo.kim@lge.com>

On Thu, 6 Feb 2014, Joonsoo Kim wrote:

> Currently, if allocation constraint to node is NUMA_NO_NODE, we search
> a partial slab on numa_node_id() node. This doesn't work properly on the
> system having memoryless node, since it can have no memory on that node and
> there must be no partial slab on that node.
> 
> On that node, page allocation always fallback to numa_mem_id() first. So
> searching a partial slab on numa_node_id() in that case is proper solution
> for memoryless node case.
> 
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> 

Acked-by: David Rientjes <rientjes@google.com>

I think you'll need to send these to Andrew since he appears to be picking 
up slub patches these days.

^ permalink raw reply

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
From: David Rientjes @ 2014-02-06  8:52 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Han Pingtian, Nishanth Aravamudan, penberg, linux-mm, paulus,
	Anton Blanchard, mpm, Christoph Lameter, linuxppc-dev, Wanpeng Li
In-Reply-To: <1391674026-20092-2-git-send-email-iamjoonsoo.kim@lge.com>

On Thu, 6 Feb 2014, Joonsoo Kim wrote:

> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> 

I may be misunderstanding this patch and there's no help because there's 
no changelog.

> diff --git a/include/linux/topology.h b/include/linux/topology.h
> index 12ae6ce..a6d5438 100644
> --- a/include/linux/topology.h
> +++ b/include/linux/topology.h
> @@ -233,11 +233,20 @@ static inline int numa_node_id(void)
>   * Use the accessor functions set_numa_mem(), numa_mem_id() and cpu_to_mem().
>   */
>  DECLARE_PER_CPU(int, _numa_mem_);
> +int _node_numa_mem_[MAX_NUMNODES];
>  
>  #ifndef set_numa_mem
>  static inline void set_numa_mem(int node)
>  {
>  	this_cpu_write(_numa_mem_, node);
> +	_node_numa_mem_[numa_node_id()] = node;
> +}
> +#endif
> +
> +#ifndef get_numa_mem
> +static inline int get_numa_mem(int node)
> +{
> +	return _node_numa_mem_[node];
>  }
>  #endif
>  
> @@ -260,6 +269,7 @@ static inline int cpu_to_mem(int cpu)
>  static inline void set_cpu_numa_mem(int cpu, int node)
>  {
>  	per_cpu(_numa_mem_, cpu) = node;
> +	_node_numa_mem_[numa_node_id()] = node;

The intention seems to be that _node_numa_mem_[X] for a node X will return 
a node Y with memory that has the nearest distance?  In other words, 
caching the value returned by local_memory_node(X)?

That doesn't seem to be what it's doing since numa_node_id() is the node 
of the cpu that current is running on so this ends up getting initialized 
to whatever local_memory_node(cpu_to_node(cpu)) is for the last bit set in 
cpu_possible_mask.

>  }
>  #endif
>  
> @@ -273,6 +283,13 @@ static inline int numa_mem_id(void)
>  }
>  #endif
>  
> +#ifndef get_numa_mem
> +static inline int get_numa_mem(int node)
> +{
> +	return node;
> +}
> +#endif
> +
>  #ifndef cpu_to_mem
>  static inline int cpu_to_mem(int cpu)
>  {

^ permalink raw reply

* Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
From: Gabriel Paubert @ 2014-02-06  8:26 UTC (permalink / raw)
  To: Stephen N Chivers; +Cc: Chris Proctor, linuxppc-dev
In-Reply-To: <OF784BA598.ACFE9DA4-ONCA257C76.00827472-CA257C77.000BCFAD@csc.com>

On Thu, Feb 06, 2014 at 12:09:00PM +1000, Stephen N Chivers wrote:
> I have a MPC8548e based board and an application that makes
> extensive use of floating point including numerous calls to cos.
> In the same program there is the use of an sqlite database.
> 
> The kernel is derived from 2.6.31 and is compiled with math emulation.
> 
> At some point after the reading of the SQLITE database, the
> return result from cos goes from an in range value to an out
> of range value.
> 
> This is as a result of the FP rounding mode mutating from "round to 
> nearest"
> to "round toward zero".
> 
> The cos in the glibc version being used is known to be sensitive to 
> rounding
> direction and Joseph Myers has previously fixed glibc.
> 
> The failure does not occur on a machine that has a hardware floating
> point unit (a MPC7410 processor).
> 
> I have traced the mutation to the following series of instructions:
> 
>         mffs            f0
>         mtfsb1          4*cr7+so
>         mtfsb0          4*cr7+eq
>         fadd            f13,f1,f2
>         mtfsf           1, f0
> 
> The instructions are part of the stream emitted by gcc for the conversion
> of a 128 bit floating point value into an integer in the sqlite database 
> read.
> 
> Immediately before the execution of the mffs instruction the "rounding
> mode" is "round to nearest".
> 
> On the MPC8548 board, the execution of the mtfsf instruction does not
> restore the rounding mode to "round to nearest".
> 
> I believe that the mask computation in mtfsf.c is incorrect and is 
> reversed.
> 
> In the latest version of the file (linux-3.14-rc1), the mask is computed 
> by:
> 
>                  mask = 0;
>                  if (FM & (1 << 0))
>                         mask |= 0x90000000;
>                  if (FM & (1 << 1))
>                         mask |= 0x0f000000;
>                  if (FM & (1 << 2))
>                         mask |= 0x00f00000;
>                  if (FM & (1 << 3))
>                         mask |= 0x000f0000;
>                  if (FM & (1 << 4))
>                         mask |= 0x0000f000;
>                  if (FM & (1 << 5))
>                         mask |= 0x00000f00;
>                  if (FM & (1 << 6))
>                         mask |= 0x000000f0;
>                  if (FM & (1 << 7))
>                         mask |= 0x0000000f;
> 
> I think it should be:
> 
>                 mask = 0;
>                 if (FM & (1 << 0))
>                         mask |= 0x0000000f;
>                 if (FM & (1 << 1))
>                         mask |= 0x000000f0;
>                 if (FM & (1 << 2))
>                         mask |= 0x00000f00;
>                 if (FM & (1 << 3))
>                         mask |= 0x0000f000;
>                 if (FM & (1 << 4))
>                         mask |= 0x000f0000;
>                 if (FM & (1 << 5))
>                         mask |= 0x00f00000;
>                 if (FM & (1 << 6))
>                         mask |= 0x0f000000;
>                 if (FM & (1 << 7))
>                         mask |= 0x90000000;
> 
> With the above mask computation I get consistent results for both the 
> MPC8548
> and MPC7410 boards.
> 
> Am I missing something subtle?

No I think you are correct. This said, this code may probably be optimized 
to eliminate a lot of the conditional branches. I think that:

mask = (FM & 1);
mask |= (FM << 3) & 0x10;
mask |= (FM << 6) & 0x100;
mask |= (FM << 9) & 0x1000;
mask |= (FM << 12) & 0x10000;
mask |= (FM << 15) & 0x100000;
mask |= (FM << 18) & 0x1000000;
mask |= (FM << 21) & 0x10000000;
mask *= 15;

should do the job, in less code space and without a single branch.

Each one of the "mask |=" lines should be translated into an
rlwinm instruction followed by an "or". Actually it should be possible
to transform each of these lines into a single "rlwimi" instruction
but I don't know how to coerce gcc to reach this level of optimization.

Another way of optomizing this could be:

mask = (FM & 0x0f) | ((FM << 12) & 0x000f0000);
mask = (mask & 0x00030003) | ((mask << 6) & 0x03030303);
mask = (mask & 0x01010101) | ((mask << 3) & 0x10101010);
mask *= 15;

It's not easy to see which of the solutions is faster, the second one
needs to create quite a few constants, but its dependency length is 
lower. It is very likely that the first solution is faster in cache-cold
case and the second in cache-hot. 

Regardless, the original code is rather naïve, larger and slower in all cases,
with timing variation depending on branch mispredictions.

	Regards,
	Gabriel

^ permalink raw reply

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
From: Joonsoo Kim @ 2014-02-06 10:29 UTC (permalink / raw)
  To: David Rientjes
  Cc: Han Pingtian, Nishanth Aravamudan, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	Matt Mackall, Joonsoo Kim, linuxppc-dev, Christoph Lameter,
	Wanpeng Li
In-Reply-To: <alpine.DEB.2.02.1402060041040.21148@chino.kir.corp.google.com>

2014-02-06 David Rientjes <rientjes@google.com>:
> On Thu, 6 Feb 2014, Joonsoo Kim wrote:
>
>> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>>
>
> I may be misunderstanding this patch and there's no help because there's
> no changelog.

Sorry about that.
I made this patch just for testing. :)
Thanks for looking this.

>> diff --git a/include/linux/topology.h b/include/linux/topology.h
>> index 12ae6ce..a6d5438 100644
>> --- a/include/linux/topology.h
>> +++ b/include/linux/topology.h
>> @@ -233,11 +233,20 @@ static inline int numa_node_id(void)
>>   * Use the accessor functions set_numa_mem(), numa_mem_id() and cpu_to_mem().
>>   */
>>  DECLARE_PER_CPU(int, _numa_mem_);
>> +int _node_numa_mem_[MAX_NUMNODES];
>>
>>  #ifndef set_numa_mem
>>  static inline void set_numa_mem(int node)
>>  {
>>       this_cpu_write(_numa_mem_, node);
>> +     _node_numa_mem_[numa_node_id()] = node;
>> +}
>> +#endif
>> +
>> +#ifndef get_numa_mem
>> +static inline int get_numa_mem(int node)
>> +{
>> +     return _node_numa_mem_[node];
>>  }
>>  #endif
>>
>> @@ -260,6 +269,7 @@ static inline int cpu_to_mem(int cpu)
>>  static inline void set_cpu_numa_mem(int cpu, int node)
>>  {
>>       per_cpu(_numa_mem_, cpu) = node;
>> +     _node_numa_mem_[numa_node_id()] = node;
>
> The intention seems to be that _node_numa_mem_[X] for a node X will return
> a node Y with memory that has the nearest distance?  In other words,
> caching the value returned by local_memory_node(X)?

Yes, you are right.

> That doesn't seem to be what it's doing since numa_node_id() is the node
> of the cpu that current is running on so this ends up getting initialized
> to whatever local_memory_node(cpu_to_node(cpu)) is for the last bit set in
> cpu_possible_mask.

Yes, I made a mistake.
Thanks for pointer.
I fix it and attach v2.
Now I'm out of office, so I'm not sure this second version is correct :(

Thanks.

----------8<--------------
>From bf691e7eb07f966e3aed251eaeb18f229ee32d1f Mon Sep 17 00:00:00 2001
From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Date: Thu, 6 Feb 2014 17:07:05 +0900
Subject: [RFC PATCH 2/3 v2] topology: support node_numa_mem() for
determining the
 fallback node

We need to determine the fallback node in slub allocator if the allocation
target node is memoryless node. Without it, the SLUB wrongly select
the node which has no memory and can't use a partial slab, because of node
mismatch. Introduced function, node_numa_mem(X), will return
a node Y with memory that has the nearest distance. If X is memoryless
node, it will return nearest distance node, but, if
X is normal node, it will return itself.

We will use this function in following patch to determine the fallback
node.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 12ae6ce..66b19b8 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -233,11 +233,20 @@ static inline int numa_node_id(void)
  * Use the accessor functions set_numa_mem(), numa_mem_id() and cpu_to_mem().
  */
 DECLARE_PER_CPU(int, _numa_mem_);
+int _node_numa_mem_[MAX_NUMNODES];

 #ifndef set_numa_mem
 static inline void set_numa_mem(int node)
 {
  this_cpu_write(_numa_mem_, node);
+ _node_numa_mem_[numa_node_id()] = node;
+}
+#endif
+
+#ifndef get_numa_mem
+static inline int get_numa_mem(int node)
+{
+ return _node_numa_mem_[node];
 }
 #endif

@@ -260,6 +269,7 @@ static inline int cpu_to_mem(int cpu)
 static inline void set_cpu_numa_mem(int cpu, int node)
 {
  per_cpu(_numa_mem_, cpu) = node;
+ _node_numa_mem_[cpu_to_node(cpu)] = node;
 }
 #endif

@@ -273,6 +283,13 @@ static inline int numa_mem_id(void)
 }
 #endif

+#ifndef get_numa_mem
+static inline int get_numa_mem(int node)
+{
+ return node;
+}
+#endif
+
 #ifndef cpu_to_mem
 static inline int cpu_to_mem(int cpu)
 {
-- 
1.7.9.5

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox