* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
From: Joonsoo Kim @ 2014-02-07 5:42 UTC
To: Nishanth Aravamudan
Cc: Han Pingtian, Matt Mackall, Pekka Enberg,
Linux Memory Management List, Paul Mackerras, Anton Blanchard,
David Rientjes, Christoph Lameter, linuxppc-dev, Wanpeng Li
In-Reply-To: <20140206191131.GB7845@linux.vnet.ibm.com>
On Thu, Feb 06, 2014 at 11:11:31AM -0800, Nishanth Aravamudan wrote:
> > diff --git a/include/linux/topology.h b/include/linux/topology.h
> > index 12ae6ce..66b19b8 100644
> > --- a/include/linux/topology.h
> > +++ b/include/linux/topology.h
> > @@ -233,11 +233,20 @@ static inline int numa_node_id(void)
> > * Use the accessor functions set_numa_mem(), numa_mem_id() and cpu_to_mem().
> > */
> > DECLARE_PER_CPU(int, _numa_mem_);
> > +int _node_numa_mem_[MAX_NUMNODES];
>
> Should be static, I think?
Yes, will update it.
Thanks.
* Re: [RFC PATCH 3/3] slub: fallback to get_numa_mem() node if we want to allocate on memoryless node
From: Joonsoo Kim @ 2014-02-07 5:41 UTC
To: Christoph Lameter
Cc: Han Pingtian, Nishanth Aravamudan, mpm, penberg, linux-mm, paulus,
Anton Blanchard, David Rientjes, linuxppc-dev, Wanpeng Li
In-Reply-To: <alpine.DEB.2.10.1402061127001.5348@nuc>
On Thu, Feb 06, 2014 at 11:30:20AM -0600, Christoph Lameter wrote:
> On Thu, 6 Feb 2014, Joonsoo Kim wrote:
>
> > diff --git a/mm/slub.c b/mm/slub.c
> > index cc1f995..c851f82 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -1700,6 +1700,14 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
> > void *object;
> > int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
> >
> > + if (node == NUMA_NO_NODE)
> > + searchnode = numa_mem_id();
> > + else {
> > + searchnode = node;
> > + if (!node_present_pages(node))
>
> This check would need to be something that checks for other contingencies
> in the page allocator as well. A simple solution would be to actually run
> a GFP_THIS_NODE alloc to see if you can grab a page from the proper node.
> If that fails then fallback. See how fallback_alloc() does it in slab.
>
Hello, Christoph.
The !node_present_pages() check ensures that an allocation on this node
cannot succeed, so we can directly use numa_mem_id() here.
> > + searchnode = get_numa_mem(node);
> > + }
>
> > @@ -2277,11 +2285,18 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> > redo:
> >
> > if (unlikely(!node_match(page, node))) {
> > - stat(s, ALLOC_NODE_MISMATCH);
> > - deactivate_slab(s, page, c->freelist);
> > - c->page = NULL;
> > - c->freelist = NULL;
> > - goto new_slab;
> > + int searchnode = node;
> > +
> > + if (node != NUMA_NO_NODE && !node_present_pages(node))
>
> Same issue here. I would suggest not deactivating the slab and first check
> if the node has no pages. If so then just take an object from the current
> cpu slab. If that is not available do an allocation from the indicated
> node and take whatever the page allocator gave you.
What I do here is avoid deactivating the slab. I first check whether the
node has no pages. Then, instead of unconditionally taking an object from
the current cpu slab, I check whether the current cpu slab comes from the
proper node, as returned by the newly introduced get_numa_mem().
I think this approach is better than just taking an object regardless of
which node was requested.
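As a rough illustration of the lookup described above, here is a minimal
user-space sketch, not the kernel code. The tables and their values are
made up for the example, and the helpers only borrow the names used in this
thread (node_present_pages(), get_numa_mem(), numa_mem_id()); the helper
pick_searchnode() is hypothetical.

```c
#include <assert.h>

#define MAX_NUMNODES 4
#define NUMA_NO_NODE (-1)

/* Hypothetical stand-ins for kernel state; all values are invented. */
static const unsigned long present_pages[MAX_NUMNODES] = { 0, 4096, 0, 8192 };
static const int fallback_node[MAX_NUMNODES] = { 1, 1, 3, 3 };
static const int local_mem_node = 1;

static unsigned long node_present_pages(int node)
{
	assert(node >= 0 && node < MAX_NUMNODES);
	return present_pages[node];
}

/* Nearest node with memory for @node (precomputed in this model). */
static int get_numa_mem(int node)
{
	assert(node >= 0 && node < MAX_NUMNODES);
	return fallback_node[node];
}

/* Nearest node with memory for the current CPU (fixed in this model). */
static int numa_mem_id(void)
{
	return local_mem_node;
}

/* Choose the node whose partial lists should be searched. */
static int pick_searchnode(int node)
{
	if (node == NUMA_NO_NODE)
		return numa_mem_id();
	if (!node_present_pages(node))
		return get_numa_mem(node);	/* memoryless: use the fallback */
	return node;
}
```

In this model a request for memoryless node 0 is redirected to node 1, while
a request for node 3 is served from node 3 itself.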
Thanks.
* Re: [PATCH 1/2] PPC: powernv: remove redundant cpuidle_idle_call()
From: Preeti U Murthy @ 2014-02-07 5:27 UTC
To: Nicolas Pitre
Cc: Lists linaro-kernel, linux-pm@vger.kernel.org, Peter Zijlstra,
Daniel Lezcano, Rafael J. Wysocki, LKML, Ingo Molnar,
Thomas Gleixner, linuxppc-dev, Linux ARM Kernel ML
In-Reply-To: <alpine.LFD.2.11.1402070105170.1906@knanqh.ubzr>
Hi Nicolas,
On 02/07/2014 06:47 AM, Nicolas Pitre wrote:
> On Thu, 6 Feb 2014, Preeti U Murthy wrote:
>
>> Hi Daniel,
>>
>> On 02/06/2014 09:55 PM, Daniel Lezcano wrote:
>>> Hi Nico,
>>>
>>>
>>> On 6 February 2014 14:16, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
>>>
>>>> The core idle loop now takes care of it.
>>>>
>>>> Signed-off-by: Nicolas Pitre <nico@linaro.org>
>>>> ---
>>>> arch/powerpc/platforms/powernv/setup.c | 13 +------------
>>>> 1 file changed, 1 insertion(+), 12 deletions(-)
>>>>
>>>> diff --git a/arch/powerpc/platforms/powernv/setup.c
>>>> b/arch/powerpc/platforms/powernv/setup.c
>>>> index 21166f65c9..a932feb290 100644
>>>> --- a/arch/powerpc/platforms/powernv/setup.c
>>>> +++ b/arch/powerpc/platforms/powernv/setup.c
>>>> @@ -26,7 +26,6 @@
>>>> #include <linux/of_fdt.h>
>>>> #include <linux/interrupt.h>
>>>> #include <linux/bug.h>
>>>> -#include <linux/cpuidle.h>
>>>>
>>>> #include <asm/machdep.h>
>>>> #include <asm/firmware.h>
>>>> @@ -217,16 +216,6 @@ static int __init pnv_probe(void)
>>>> return 1;
>>>> }
>>>>
>>>> -void powernv_idle(void)
>>>> -{
>>>> - /* Hook to cpuidle framework if available, else
>>>> - * call on default platform idle code
>>>> - */
>>>> - if (cpuidle_idle_call()) {
>>>> - power7_idle();
>>>> - }
>>>>
>>>
>>> The cpuidle_idle_call is called from arch_cpu_idle in
>>> arch/powerpc/kernel/idle.c between a ppc64_runlatch_off|on section.
>>> Shouldn't the cpuidle-powernv driver call these functions when entering
>>> idle ?
>>
>> Yes they should, I will send out a patch that does that on top of this.
>> There have been cpuidle driver cleanups for powernv and pseries in this
>> merge window. While no change would be required in the pseries cpuidle
>> driver as a result of Nicolas's cleanup, we would need to add the
>> ppc64_runlatch_on and off functions before and after the entry into the
>> powernv idle states.
>
> What about creating arch_cpu_idle_enter() and arch_cpu_idle_exit() in
> arch/powerpc/kernel/idle.c and calling ppc64_runlatch_off() and
> ppc64_runlatch_on() respectively from there instead? Would that work?
> That would make the idle consolidation much easier afterwards.
I would not suggest doing this. The ppc64_runlatch_*() routines need to
be called when we are sure that the cpu is about to enter or has exited an
idle state. Moving the ppc64_runlatch_on() routine to
arch_cpu_idle_enter() for instance is not a good idea because there are
places where the cpu can decide not to enter any idle state before the
call to cpuidle_idle_call() itself. In that case communicating
prematurely that we are in an idle state would not be a good idea.
So it's best to add the ppc64_runlatch_* calls in the powernv cpuidle
driver IMO. We could however create idle_loop_prologue/epilogue()
variants inside it so that in addition to the runlatch routines we could
potentially add more such similar routines that are powernv specific.
If there are cases where there is work to be done prior to and post an
entry into an idle state common to both pseries and powernv, we will
probably put them in arch_cpu_idle_enter/exit(). But the runlatch
routines are not suitable to be moved there as far as I can see.
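The ordering argument can be sketched in user space as follows. This is only
a model under stated assumptions: the ppc64_runlatch_*() functions below are
stubs that flip a flag (the real routines update hardware state), and
powernv_enter_state() and idle_once() are hypothetical names. Only
idle_loop_prologue()/idle_loop_epilogue() are names taken from the
suggestion above.

```c
#include <assert.h>
#include <stdbool.h>

/* Stub state: true means the CPU is advertised as busy. */
static bool runlatch = true;
static int runlatch_transitions;

static void ppc64_runlatch_off(void) { runlatch = false; runlatch_transitions++; }
static void ppc64_runlatch_on(void)  { runlatch = true;  runlatch_transitions++; }

/* Driver-local hooks; further powernv-specific work could be added here. */
static void idle_loop_prologue(void) { ppc64_runlatch_off(); }
static void idle_loop_epilogue(void) { ppc64_runlatch_on(); }

/* Hypothetical model of a cpuidle driver's state-enter callback. */
static int powernv_enter_state(int index)
{
	idle_loop_prologue();
	assert(!runlatch);	/* the CPU would sleep here */
	idle_loop_epilogue();
	return index;
}

/* One pass of a simplified idle loop; returns the state entered or -1. */
static int idle_once(bool need_resched, int index)
{
	if (need_resched)
		return -1;	/* bail out early: the runlatch is never touched */
	return powernv_enter_state(index);
}
```

Because the hooks sit inside the driver callback, a pass that bails out
before choosing an idle state never reports the CPU as idle.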
Thank you
Regards
Preeti U Murthy
>
>
> Nicolas
>
* Re: [PATCH v2] powerpc: Add cpu family documentation
From: James Yang @ 2014-02-07 2:00 UTC
To: Michael Ellerman; +Cc: linuxppc-dev, Stephen Rothwell
In-Reply-To: <1391229347-23026-1-git-send-email-mpe@ellerman.id.au>
On Sat, 1 Feb 2014, Michael Ellerman wrote:
> This patch adds some documentation on the different cpu families
> supported by arch/powerpc.
>
> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
> ---
> v2: Reworked formatting to avoid wrapping.
> Fixed up Freescale details.
>
>
> Documentation/powerpc/cpu_families.txt | 227 +++++++++++++++++++++++++++++++++
> 1 file changed, 227 insertions(+)
> create mode 100644 Documentation/powerpc/cpu_families.txt
>
> diff --git a/Documentation/powerpc/cpu_families.txt b/Documentation/powerpc/cpu_families.txt
> new file mode 100644
> index 0000000..fa4f159
> --- /dev/null
> +++ b/Documentation/powerpc/cpu_families.txt
> @@ -0,0 +1,227 @@
> +CPU Families
> +============
> +
> +This document tries to summarise some of the different cpu families that exist
> +and are supported by arch/powerpc.
I think there are more CPUs that exist(ed), but aren't supported by
arch/powerpc. 602, Exponential, and there were a few others. Did you
want to include those?
Do the arrows' endpoints have a particular meaning?
Does the direction of the arrows (left vs. down) have a meaning?
What's the meaning of the box?
Do you want to have all of the minor derivatives or just major
implementations? Some derivatives were just faster versions, others
have microarchitectural changes, and some have additional registers or
removed interfaces.
There are a lot of ways to delineate this.
> +
> +
> +Book3S (aka sPAPR)
> +------------------
> +
> + - Hash MMU
> + - Mix of 32 & 64 bit
> +
> + +--------------+ +----------------+
> + | Old POWER | ---------------------------> | RS64 (threads) |
> + +--------------+ +----------------+
> + |
> + |
> + v
> + +--------------+ +----------------+ +-------+
> + | 601 | ---------------------------> | 603 | -> | 740 |
> + +--------------+ +----------------+ +-------+
There was also at least 603e, 603ev, EC603e, and then offshoot
G2 core, then the offshoot of that e300c1/e300c2/e300c3/e300c4. I
might be missing one here.
740 is a 750 but without the L2, so it would probably fit
better off pointing out of the 750 box. Then there is the 755/745
which have more BATs, L2 features, etc. Missing from the IBM side is
750CXe, 750GX, 750GL, but I don't know the lineage of those. Do you
want the Nintendo stuff too? RAD750?
> + | |
> + | |
> + v v
> + +--------------+ +----------------+ +-------+
> + | 604 | | 750 (G3) | -> | 750CX |
> + +--------------+ +----------------+ +-------+
Also 604e, and I guess 604ev?
> + | | |
> + | | |
> + v v v
> + +--------------+ +----------------+ +-------+
> + | 620 (64 bit) | | 7400 | | 750CL |
> + +--------------+ +----------------+ +-------+
> + | | |
> + | | |
> + v v v
> + +--------------+ +----------------+ +-------+
> + | POWER3/630 | | 7410 | | 750FX |
> + +--------------+ +----------------+ +-------+
7400/7410 mostly the same thing
> + | |
> + | |
> + v v
> + +--------------+ +----------------+
> + | POWER3+ | | 7450 |
> + +--------------+ +----------------+
> + | |
> + | |
> + v v
> + +--------------+ +----------------+
> + | POWER4 | | 7455 |
> + +--------------+ +----------------+
> + | |
> + | |
> + v v
> + +--------------+ +-------+ +----------------+
> + | POWER4+ | ---------------> | 970 | | 7447 |
> + +--------------+ +-------+ +----------------+
> + | | |
> + | | |
> + v v v
> + +--------------+ +-------+ +-------+ +----------------+
> + | POWER5 | --> | Cell | | 970FX | | 7448 |
> + +--------------+ +-------+ +-------+ +----------------+
What about Xenon, how would it fit?
For 745x: I'd organize it:
column = major derivatives
row = minor derivatives
7450/7451/7455/7457 <-- L3
7441/7445/7447 /7448 <-- no L3
7447A/e600
> + | |
> + | |
> + v v
> + +--------------+ +-------+
> + | POWER5+ | | 970MP |
> + +--------------+ +-------+
> + |
> + |
> + v
> + +--------------+
> + | POWER5++ |
> + +--------------+
> + |
> + |
> + v
> + +--------------+
> + | POWER6 |
> + +--------------+
> + |
> + |
> + v
> + +--------------+
> + | POWER7 |
> + +--------------+
> + |
> + |
> + v
> + +--------------+
> + | POWER7+ |
> + +--------------+
> + |
> + |
> + v
> + +--------------+
> + | POWER8 |
> + +--------------+
> +
> +
> + +---------------+
> + | PA6T (64 bit) |
> + +---------------+
> +
> +
> +IBM BookE
> +---------
> +
> + - Software loaded TLB.
> + - All 32 bit
> +
> + +--------------+
> + | 401 |
> + +--------------+
> + |
> + |
> + v
> + +--------------+
> + | 403 |
> + +--------------+
> + |
> + |
> + v
> + +--------------+
> + | 405 |
> + +--------------+
> + |
> + |
> + v
> + +--------------+
> + | 440 |
> + +--------------+
> + |
> + |
> + v
> + +--------------+ +----------------+
> + | 450 | --> | BG/P |
> + +--------------+ +----------------+
> + |
> + |
> + v
> + +--------------+
> + | 460 |
> + +--------------+
would you want to fit 464/464FP here?
> + |
> + |
> + v
> + +--------------+
> + | 476 |
> + +--------------+
> +
Did Titan ever get arch/powerpc support?
> +
> +Motorola/Freescale 8xx
> +----------------------
> +
> + - Software loaded with hardware assist.
> + - All 32 bit
> +
> + +--------------+
> + | 8xx |
> + +--------------+
> + |
> + |
> + v
> + +--------------+
> + | 850 |
> + +--------------+
I don't know enough about the core in there to say that there is a
distinction for 850. My impression is that it's not a 603-derivative
so I agree this lineage of the 8xx/850 is separate from the others.
> +
> +
> +Freescale BookE
> +---------------
> +
> + - Software loaded TLB.
> + - e6500 adds HW loaded indirect TLB entries.
> + - Mix of 32 & 64 bit
> +
> + +--------------+
> + | e200 |
> + +--------------+
> +
> +
> + +--------------------------------+
> + | e500 |
> + +--------------------------------+
> + |
> + |
> + v
> + +--------------------------------+
> + | e500v2 |
> + +--------------------------------+
> + |
> + |
> + v
> + +--------------------------------+
> + | e500mc |
> + +--------------------------------+
> + |
> + |
> + v
> + +--------------------------------+
> + | e5500 (Book3e) (64 bit) |
> + +--------------------------------+
> + |
> + |
> + v
> + +--------------------------------+
> + | e6500 (HW TLB) (Multithreaded) |
> + +--------------------------------+
64-bit for e6500 as well
> +
> +
> +IBM A2 core
> +-----------
> +
> + - Book3E, software loaded TLB + HW loaded indirect TLB entries.
> + - 64 bit
> +
> + +--------------+ +----------------+
> + | A2 core | --> | WSP |
> + +--------------+ +----------------+
> + |
> + |
> + v
> + +--------------+
> + | BG/Q |
> + +--------------+
> --
> 1.8.3.2
>
>
>
* Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
From: Stephen N Chivers @ 2014-02-07 1:27 UTC
To: Gabriel Paubert; +Cc: Chris Proctor, linuxppc-dev
In-Reply-To: <20140206082635.GA7048@visitor2.iram.es>
Gabriel Paubert <paubert@iram.es> wrote on 02/06/2014 07:26:37 PM:
> From: Gabriel Paubert <paubert@iram.es>
> To: Stephen N Chivers <schivers@csc.com.au>
> Cc: linuxppc-dev@lists.ozlabs.org, Chris Proctor <cproctor@csc.com.au>
> Date: 02/06/2014 07:26 PM
> Subject: Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
>
> On Thu, Feb 06, 2014 at 12:09:00PM +1000, Stephen N Chivers wrote:
> > I have a MPC8548e based board and an application that makes
> > extensive use of floating point including numerous calls to cos.
> > In the same program there is the use of an sqlite database.
> >
> > The kernel is derived from 2.6.31 and is compiled with math emulation.
> >
> > At some point after the reading of the SQLITE database, the
> > return result from cos goes from an in range value to an out
> > of range value.
> >
> > This is as a result of the FP rounding mode mutating from "round to
> > nearest" to "round toward zero".
> >
> > The cos in the glibc version being used is known to be sensitive to
> > rounding direction and Joseph Myers has previously fixed glibc.
> >
> > The failure does not occur on a machine that has a hardware floating
> > point unit (a MPC7410 processor).
> >
> > I have traced the mutation to the following series of instructions:
> >
> > 	mffs f0
> > 	mtfsb1 4*cr7+so
> > 	mtfsb0 4*cr7+eq
> > 	fadd f13,f1,f2
> > 	mtfsf 1, f0
> >
> > The instructions are part of the stream emitted by gcc for the conversion
> > of a 128 bit floating point value into an integer in the sqlite database
> > read.
> >
> > Immediately before the execution of the mffs instruction the "rounding
> > mode" is "round to nearest".
> >
> > On the MPC8548 board, the execution of the mtfsf instruction does not
> > restore the rounding mode to "round to nearest".
> >
> > I believe that the mask computation in mtfsf.c is incorrect and is
> > reversed.
> >
> > In the latest version of the file (linux-3.14-rc1), the mask is computed
> > by:
> >
> > mask = 0;
> > if (FM & (1 << 0))
> > 	mask |= 0x90000000;
> > if (FM & (1 << 1))
> > 	mask |= 0x0f000000;
> > if (FM & (1 << 2))
> > 	mask |= 0x00f00000;
> > if (FM & (1 << 3))
> > 	mask |= 0x000f0000;
> > if (FM & (1 << 4))
> > 	mask |= 0x0000f000;
> > if (FM & (1 << 5))
> > 	mask |= 0x00000f00;
> > if (FM & (1 << 6))
> > 	mask |= 0x000000f0;
> > if (FM & (1 << 7))
> > 	mask |= 0x0000000f;
> >
> > I think it should be:
> >
> > mask = 0;
> > if (FM & (1 << 0))
> > 	mask |= 0x0000000f;
> > if (FM & (1 << 1))
> > 	mask |= 0x000000f0;
> > if (FM & (1 << 2))
> > 	mask |= 0x00000f00;
> > if (FM & (1 << 3))
> > 	mask |= 0x0000f000;
> > if (FM & (1 << 4))
> > 	mask |= 0x000f0000;
> > if (FM & (1 << 5))
> > 	mask |= 0x00f00000;
> > if (FM & (1 << 6))
> > 	mask |= 0x0f000000;
> > if (FM & (1 << 7))
> > 	mask |= 0x90000000;
> >
> > With the above mask computation I get consistent results for both the
> > MPC8548 and MPC7410 boards.
> >
> > Am I missing something subtle?
>
> No I think you are correct. This said, this code may probably be
> optimized to eliminate a lot of the conditional branches. I think that:
>
> mask = (FM & 1);
> mask |= (FM << 3) & 0x10;
> mask |= (FM << 6) & 0x100;
> mask |= (FM << 9) & 0x1000;
> mask |= (FM << 12) & 0x10000;
> mask |= (FM << 15) & 0x100000;
> mask |= (FM << 18) & 0x1000000;
> mask |= (FM << 21) & 0x10000000;
> mask *= 15;
>
> should do the job, in less code space and without a single branch.
>
> Each one of the "mask |=" lines should be translated into an
> rlwinm instruction followed by an "or". Actually it should be possible
> to transform each of these lines into a single "rlwimi" instruction
> but I don't know how to coerce gcc to reach this level of optimization.
>
> Another way of optimizing this could be:
>
> mask = (FM & 0x0f) | ((FM << 12) & 0x000f0000);
> mask = (mask & 0x00030003) | ((mask << 6) & 0x03030303);
> mask = (mask & 0x01010101) | ((mask << 3) & 0x10101010);
> mask *= 15;
>
> It's not easy to see which of the solutions is faster, the second one
> needs to create quite a few constants, but its dependency length is
> lower. It is very likely that the first solution is faster in cache-cold
> case and the second in cache-hot.
>
> Regardless, the original code is rather naïve, larger and slower in all
> cases, with timing variation depending on branch mispredictions.
Thanks for the response, it is appreciated.
I have tried simple test versions of the two suggestions above, both
produce the same results as the original formulation.
My toolchain is gcc-4.1.2 with binutils 2.17.
When compiled without optimization I get:

           instructions  memory accesses  branches
original        58             20            9
method1         53             27            0
method2         37             13            0

with optimization:

           instructions  memory accesses  branches
original        25              0            8
method1         18              0            0
method2         21              0            0
The memory accesses do not include "setup" such as moving FM to
a register.
The instruction counts for method1 and method2 include an extra
and operation to preserve the original behaviour wrt the sticky
FX bit (I think), although maybe that is also something else
that is wrong with the original implementation.
In my naivety I would go with method2 as it generates fewer
instructions when not optimized and isn't far from method1 when
optimized.
I will await more comments before providing a patch next week.
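For reference, the equivalence of the three formulations can be checked
exhaustively over all 256 FM values. The following is a minimal user-space
sketch, not the test program used for the counts above. The function names
are made up, the branchy version is written as a table lookup rather than
the literal if-chain, and the final "& 0x9fffffff" is an assumption about
what the extra and operation looks like: it reproduces the 0x9 (rather than
0xf) top nibble of the branchy version.

```c
#include <assert.h>
#include <stdint.h>

/* Corrected branchy mask: FM bit i selects nibble i, top nibble is 0x9. */
static uint32_t mask_branchy(uint32_t fm)
{
	static const uint32_t field[8] = {
		0x0000000f, 0x000000f0, 0x00000f00, 0x0000f000,
		0x000f0000, 0x00f00000, 0x0f000000, 0x90000000,
	};
	uint32_t mask = 0;

	assert(fm <= 0xff);	/* FM is an 8-bit field */
	for (int i = 0; i < 8; i++)
		if (fm & (1u << i))
			mask |= field[i];
	return mask;
}

/* Method 1: move each FM bit to the bottom of its nibble, then widen. */
static uint32_t mask_method1(uint32_t fm)
{
	uint32_t mask;

	assert(fm <= 0xff);
	mask = fm & 1;
	mask |= (fm << 3) & 0x10;
	mask |= (fm << 6) & 0x100;
	mask |= (fm << 9) & 0x1000;
	mask |= (fm << 12) & 0x10000;
	mask |= (fm << 15) & 0x100000;
	mask |= (fm << 18) & 0x1000000;
	mask |= (fm << 21) & 0x10000000;
	mask *= 15;
	return mask & 0x9fffffff;	/* assumed extra "and" step */
}

/* Method 2: spread the bits out in three doubling steps, then widen. */
static uint32_t mask_method2(uint32_t fm)
{
	uint32_t mask;

	assert(fm <= 0xff);
	mask = (fm & 0x0f) | ((fm << 12) & 0x000f0000);
	mask = (mask & 0x00030003) | ((mask << 6) & 0x03030303);
	mask = (mask & 0x01010101) | ((mask << 3) & 0x10101010);
	mask *= 15;
	return mask & 0x9fffffff;	/* assumed extra "and" step */
}

/* Number of FM values for which the three versions disagree. */
static int count_mismatches(void)
{
	int bad = 0;

	for (uint32_t fm = 0; fm <= 0xff; fm++) {
		uint32_t ref = mask_branchy(fm);

		if (mask_method1(fm) != ref || mask_method2(fm) != ref)
			bad++;
	}
	return bad;
}
```

Using uint32_t throughout keeps the multiply by 15 free of signed overflow
when the top nibble is selected.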
>
> Regards,
> Gabriel
Stephen Chivers,
CSC Australia Pty. Ltd.
* Re: [PATCH 1/2] PPC: powernv: remove redundant cpuidle_idle_call()
From: Nicolas Pitre @ 2014-02-07 1:17 UTC
To: Preeti U Murthy
Cc: Lists linaro-kernel, linux-pm@vger.kernel.org, Peter Zijlstra,
Daniel Lezcano, Rafael J. Wysocki, LKML, Ingo Molnar,
Thomas Gleixner, linuxppc-dev, Linux ARM Kernel ML
In-Reply-To: <52F3BCFE.3010703@linux.vnet.ibm.com>
On Thu, 6 Feb 2014, Preeti U Murthy wrote:
> Hi Daniel,
>
> On 02/06/2014 09:55 PM, Daniel Lezcano wrote:
> > Hi Nico,
> >
> >
> > On 6 February 2014 14:16, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> >
> >> The core idle loop now takes care of it.
> >>
> >> Signed-off-by: Nicolas Pitre <nico@linaro.org>
> >> ---
> >> arch/powerpc/platforms/powernv/setup.c | 13 +------------
> >> 1 file changed, 1 insertion(+), 12 deletions(-)
> >>
> >> diff --git a/arch/powerpc/platforms/powernv/setup.c
> >> b/arch/powerpc/platforms/powernv/setup.c
> >> index 21166f65c9..a932feb290 100644
> >> --- a/arch/powerpc/platforms/powernv/setup.c
> >> +++ b/arch/powerpc/platforms/powernv/setup.c
> >> @@ -26,7 +26,6 @@
> >> #include <linux/of_fdt.h>
> >> #include <linux/interrupt.h>
> >> #include <linux/bug.h>
> >> -#include <linux/cpuidle.h>
> >>
> >> #include <asm/machdep.h>
> >> #include <asm/firmware.h>
> >> @@ -217,16 +216,6 @@ static int __init pnv_probe(void)
> >> return 1;
> >> }
> >>
> >> -void powernv_idle(void)
> >> -{
> >> - /* Hook to cpuidle framework if available, else
> >> - * call on default platform idle code
> >> - */
> >> - if (cpuidle_idle_call()) {
> >> - power7_idle();
> >> - }
> >>
> >
> > The cpuidle_idle_call is called from arch_cpu_idle in
> > arch/powerpc/kernel/idle.c between a ppc64_runlatch_off|on section.
> > Shouldn't the cpuidle-powernv driver call these functions when entering
> > idle ?
>
> Yes they should, I will send out a patch that does that on top of this.
> There have been cpuidle driver cleanups for powernv and pseries in this
> merge window. While no change would be required in the pseries cpuidle
> driver as a result of Nicolas's cleanup, we would need to add the
> ppc64_runlatch_on and off functions before and after the entry into the
> powernv idle states.
What about creating arch_cpu_idle_enter() and arch_cpu_idle_exit() in
arch/powerpc/kernel/idle.c and calling ppc64_runlatch_off() and
ppc64_runlatch_on() respectively from there instead? Would that work?
That would make the idle consolidation much easier afterwards.
Nicolas
* Re: [GIT PULL] tree-wide: clean up no longer required #include <linux/init.h>
From: Stephen Rothwell @ 2014-02-06 22:25 UTC
To: torvalds
Cc: linux-arch, linux-mips, linux-m68k, rusty, linux-ia64, kvm,
linux-s390, netdev, x86, gregkh, Paul Gortmaker, linux-alpha,
sparclinux, akpm, linuxppc-dev, linux-arm-kernel
In-Reply-To: <1391547118-21967-1-git-send-email-paul.gortmaker@windriver.com>
Hi Linus,
On Tue, 4 Feb 2014 15:51:58 -0500 Paul Gortmaker <paul.gortmaker@windriver.com> wrote:
>
> We've had this in linux-next for 2+ weeks (thanks Stephen!) as a
> linux-stable like queue of patches, and as can be seen here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/paulg/init.git
>
> most of the changes in the last week have been trivial adding acks
> or dropping patches that maintainers decided to take themselves.
Any thoughts on merging this? I can feel it bitrotting :-(
--
Cheers,
Stephen Rothwell sfr@canb.auug.org.au
* Re: [PATCH V3 2/3] tick/cpuidle: Initialize hrtimer mode of broadcast
From: Thomas Gleixner @ 2014-02-06 20:52 UTC
To: Preeti U Murthy
Cc: deepthi, rafael.j.wysocki, linux-pm, peterz, fweisbec,
daniel.lezcano, linux-kernel, paulus, srivatsa.bhat, paulmck,
linuxppc-dev, mingo
In-Reply-To: <52F3C9BE.4050203@linux.vnet.ibm.com>
On Thu, 6 Feb 2014, Preeti U Murthy wrote:
> On 02/06/2014 09:33 PM, Thomas Gleixner wrote:
> > On Thu, 6 Feb 2014, Preeti U Murthy wrote:
> >
> > Compiler warnings are not so important, right?
> >
> > kernel/time/tick-broadcast.c: In function ‘tick_broadcast_oneshot_control’:
> > kernel/time/tick-broadcast.c:700:3: warning: ‘return’ with no value, in function returning non-void [-Wreturn-type]
> > kernel/time/tick-broadcast.c:711:3: warning: ‘return’ with no value, in function returning non-void [-Wreturn-type]
>
> My apologies for this, will make sure this will not repeat. On compilation I
> did not receive any warnings with the additional compile time flags too. I
> compiled it on powerpc. Let me look into why the warnings did not show up.
> Nevertheless I should have taken care of this even by simply looking at the
> code.
Huch, PPC seems to have an extra stupid version of gcc :)
> The cpuidle patch then is below. The trace_cpu_idle_rcuidle() functions have
> been moved around so that the broadcast CPU does not trace any idle event
> and that the symmetry between the trace functions and the call to the
> broadcast framework is maintained. Wow, it does become very simple :)
Indeed :)
Care to resend the whole lot with all fixes applied and perhaps
compile tested on x86 :)
Thanks,
tglx
* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
From: David Rientjes @ 2014-02-06 20:52 UTC
To: Joonsoo Kim
Cc: Han Pingtian, Nishanth Aravamudan, Pekka Enberg,
Linux Memory Management List, Paul Mackerras, Anton Blanchard,
Matt Mackall, Joonsoo Kim, linuxppc-dev, Christoph Lameter,
Wanpeng Li
In-Reply-To: <CAAmzW4PXkdpNi5pZ=4BzdXNvqTEAhcuw-x0pWidqrxzdePxXxA@mail.gmail.com>
On Thu, 6 Feb 2014, Joonsoo Kim wrote:
> From bf691e7eb07f966e3aed251eaeb18f229ee32d1f Mon Sep 17 00:00:00 2001
> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Date: Thu, 6 Feb 2014 17:07:05 +0900
> Subject: [RFC PATCH 2/3 v2] topology: support node_numa_mem() for
> determining the
> fallback node
>
> We need to determine the fallback node in slub allocator if the allocation
> target node is memoryless node. Without it, the SLUB wrongly select
> the node which has no memory and can't use a partial slab, because of node
> mismatch. Introduced function, node_numa_mem(X), will return
> a node Y with memory that has the nearest distance. If X is memoryless
> node, it will return nearest distance node, but, if
> X is normal node, it will return itself.
>
> We will use this function in following patch to determine the fallback
> node.
>
I like the approach and it may fix the problem today, but it may not be
sufficient in the future: nodes may not only be memoryless but they may
also be cpuless. It's possible that a node can only have I/O, networking,
or storage devices and we can define affinity for them that is remote from
every cpu and/or memory by the ACPI specification.
It seems like a better approach would be to do this when a node is brought
online and determine the fallback node based not on the zonelists as you
do here but rather on locality (such as through a SLIT if provided, see
node_distance()).
Also, the names aren't very descriptive: {get,set}_numa_mem() doesn't make
a lot of sense in generic code. I'd suggest something like
node_to_mem_node().
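A distance-based lookup along these lines might look like the following
user-space sketch. It is an illustration only: the distance table and the
has_memory[] flags are made-up data standing in for a SLIT and for the
kernel's node state, and the function computes the answer on demand, whereas
the suggestion above is to compute and cache it when a node is brought
online. The name node_to_mem_node() is the one proposed above.

```c
#include <assert.h>

#define MAX_NUMNODES 4

/* Made-up SLIT-style distance table; 10 is the usual local distance. */
static const int distance[MAX_NUMNODES][MAX_NUMNODES] = {
	{ 10, 20, 40, 40 },
	{ 20, 10, 40, 40 },
	{ 40, 40, 10, 20 },
	{ 40, 40, 20, 10 },
};

/* Made-up memory map: nodes 0 and 2 are memoryless. */
static const int has_memory[MAX_NUMNODES] = { 0, 1, 0, 1 };

/* Return @node if it has memory, else the closest node that does. */
static int node_to_mem_node(int node)
{
	int best = -1;

	assert(node >= 0 && node < MAX_NUMNODES);
	if (has_memory[node])
		return node;

	for (int n = 0; n < MAX_NUMNODES; n++) {
		if (!has_memory[n])
			continue;
		if (best < 0 || distance[node][n] < distance[node][best])
			best = n;
	}
	return best;	/* -1 if no node has memory at all */
}
```

With this data, memoryless node 0 maps to node 1 and memoryless node 2 maps
to node 3, independent of which cpu asks.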
* Re: [PATCH] Convert powerpc simple spinlocks into ticket locks
From: Scott Wood @ 2014-02-06 20:19 UTC
To: Torsten Duwe
Cc: Peter Zijlstra, linux-kernel, Paul Mackerras, Anton Blanchard,
Paul E. McKenney, linuxppc-dev, Ingo Molnar
In-Reply-To: <20140206173727.GA13048@lst.de>
On Thu, 2014-02-06 at 18:37 +0100, Torsten Duwe wrote:
> On Thu, Feb 06, 2014 at 05:38:37PM +0100, Peter Zijlstra wrote:
> > On Thu, Feb 06, 2014 at 11:37:37AM +0100, Torsten Duwe wrote:
> > > x86 has them, MIPS has them, ARM has them, even ia64 has them:
> > > ticket locks. They reduce memory bus and cache pressure especially
> > > for contended spinlocks, increasing performance.
> > >
> > > This patch is a port of the x86 spin locks, mostly written in C,
> > > to the powerpc, introducing inline asm where needed. The pSeries
> > > directed yield for vCPUs is taken care of by an additional "holder"
> > > field in the lock.
> > >
> >
> > A few questions; what's with the ppc64 holder thing? Not having a 32bit
> > spinlock_t is sad.
>
> I must admit that I haven't tested the patch on non-pseries ppc64 nor on
> ppc32. Only ppc64 has the ldarx and I tried to atomically replace the
> holder along with the locks. That might prove unnecessary.
Why is the functionality of holder only required on 64-bit? We have too
many 32/64 differences as is. Perhaps on 32-bit a lower max number of
CPUs could be assumed, to make it fit in one word.
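For readers unfamiliar with the scheme, a generic ticket lock with two
16-bit tickets can be sketched with C11 atomics as below. This is not the
code from the patch under discussion: it has no "holder" field, it uses two
separate atomic halfwords rather than one atomically updated word, and the
struct and function names are made up. The 16-bit width is an assumption
that caps the number of simultaneous waiters at 65535.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical layout: two 16-bit tickets side by side. */
struct ticket_lock {
	_Atomic uint16_t next;	/* tail: next ticket to hand out */
	_Atomic uint16_t owner;	/* head: ticket currently being served */
};

static void ticket_lock_init(struct ticket_lock *l)
{
	atomic_init(&l->next, 0);
	atomic_init(&l->owner, 0);
}

static int ticket_lock_is_locked(struct ticket_lock *l)
{
	return atomic_load_explicit(&l->next, memory_order_relaxed) !=
	       atomic_load_explicit(&l->owner, memory_order_relaxed);
}

static void ticket_lock_acquire(struct ticket_lock *l)
{
	/* Take a ticket, then wait until it is being served. */
	uint16_t me = atomic_fetch_add_explicit(&l->next, 1,
						memory_order_relaxed);

	while (atomic_load_explicit(&l->owner, memory_order_acquire) != me)
		;	/* spin; waiters are served in FIFO order */
}

static void ticket_lock_release(struct ticket_lock *l)
{
	uint16_t cur = atomic_load_explicit(&l->owner, memory_order_relaxed);

	assert(ticket_lock_is_locked(l));
	/* Only the holder writes owner, so a plain store suffices. */
	atomic_store_explicit(&l->owner, (uint16_t)(cur + 1),
			      memory_order_release);
}
```

Each waiter spins on a read of owner rather than retrying an atomic
read-modify-write, which is where the reduced bus and cache pressure
mentioned in the patch description comes from.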
> > Can you pair lwarx with sthcx ? I couldn't immediately find the answer
> > in the PowerISA doc. If so I think you can do better by being able to
> > atomically load both tickets but only storing the head without affecting
> > the tail.
>
> V2.06b, Book II, Chapter 3, "sthcx" says:
> | If a reservation exists and the length associated [...] is not 2 bytes,
> | it is undefined whether (RS)_48:63 are stored [...]
>
> That doesn't make me feel comfortable :(
Plus, sthcx doesn't exist on all PPC chips.
-Scott
* Re: [PATCH] Convert powerpc simple spinlocks into ticket locks
From: Tom Musta @ 2014-02-06 19:28 UTC
To: Peter Zijlstra, Torsten Duwe
Cc: linux-kernel, Paul Mackerras, Anton Blanchard, Paul E. McKenney,
linuxppc-dev, Ingo Molnar
In-Reply-To: <20140206180826.GI5002@laptop.programming.kicks-ass.net>
On 2/6/2014 12:08 PM, Peter Zijlstra wrote:
>>> Can you pair lwarx with sthcx ? I couldn't immediately find the answer
>>> > > in the PowerISA doc. If so I think you can do better by being able to
>>> > > atomically load both tickets but only storing the head without affecting
>>> > > the tail.
>> >
>> > V2.06b, Book II, Chapter 3, "sthcx" says:
>> > | If a reservation exists and the length associated [...] is not 2 bytes,
>> > | it is undefined whether (RS)_48:63 are stored [...]
>> >
>> > That doesn't make me feel comfortable :(
> That's on page 692, right? The way I read that is if the lharx/sthcx
> don't have the exact same address, storage is undefined. But I can't
> find mention of non-matching load and store size, although I can imagine
> it being the same undefined.
My read is consistent with Torsten's ... this looks like a bad idea.
Look at the RTL for sthcx. on page 692 (Power ISA V2.06) and you will see this:
if RESERVE then
if RESERVE_LENGTH = 2 then
...
else
undefined_case <- 1
else
...
A legal implementation might never perform the store.
* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
From: Nishanth Aravamudan @ 2014-02-06 19:28 UTC
To: Joonsoo Kim
Cc: Han Pingtian, mpm, penberg, linux-mm, paulus, Anton Blanchard,
David Rientjes, Christoph Lameter, linuxppc-dev, Wanpeng Li
In-Reply-To: <20140206185955.GA7845@linux.vnet.ibm.com>
On 06.02.2014 [10:59:55 -0800], Nishanth Aravamudan wrote:
> On 06.02.2014 [17:04:18 +0900], Joonsoo Kim wrote:
> > On Wed, Feb 05, 2014 at 06:07:57PM -0800, Nishanth Aravamudan wrote:
> > > On 24.01.2014 [16:25:58 -0800], David Rientjes wrote:
> > > > On Fri, 24 Jan 2014, Nishanth Aravamudan wrote:
> > > >
> > > > > Thank you for clarifying and providing a test patch. I ran with this on
> > > > > the system showing the original problem, configured to have 15GB of
> > > > > memory.
> > > > >
> > > > > With your patch after boot:
> > > > >
> > > > > MemTotal: 15604736 kB
> > > > > MemFree: 8768192 kB
> > > > > Slab: 3882560 kB
> > > > > SReclaimable: 105408 kB
> > > > > SUnreclaim: 3777152 kB
> > > > >
> > > > > With Anton's patch after boot:
> > > > >
> > > > > MemTotal: 15604736 kB
> > > > > MemFree: 11195008 kB
> > > > > Slab: 1427968 kB
> > > > > SReclaimable: 109184 kB
> > > > > SUnreclaim: 1318784 kB
> > > > >
> > > > >
> > > > > I know that's fairly unscientific, but the numbers are reproducible.
> > > > >
> > > >
> > > > I don't think the goal of the discussion is to reduce the amount of slab
> > > > allocated, but rather get the most local slab memory possible by use of
> > > > kmalloc_node(). When a memoryless node is being passed to kmalloc_node(),
> > > > which is probably cpu_to_node() for a cpu bound to a node without memory,
> > > > my patch is allocating it on the most local node; Anton's patch is
> > > > allocating it on whatever happened to be the cpu slab.
> > > >
> > > > > > diff --git a/mm/slub.c b/mm/slub.c
> > > > > > --- a/mm/slub.c
> > > > > > +++ b/mm/slub.c
> > > > > > @@ -2278,10 +2278,14 @@ redo:
> > > > > >
> > > > > > if (unlikely(!node_match(page, node))) {
> > > > > > stat(s, ALLOC_NODE_MISMATCH);
> > > > > > - deactivate_slab(s, page, c->freelist);
> > > > > > - c->page = NULL;
> > > > > > - c->freelist = NULL;
> > > > > > - goto new_slab;
> > > > > > + if (unlikely(!node_present_pages(node)))
> > > > > > + node = numa_mem_id();
> > > > > > + if (!node_match(page, node)) {
> > > > > > + deactivate_slab(s, page, c->freelist);
> > > > > > + c->page = NULL;
> > > > > > + c->freelist = NULL;
> > > > > > + goto new_slab;
> > > > > > + }
> > > > >
> > > > > Semantically, and please correct me if I'm wrong, this patch is saying
> > > > > if we have a memoryless node, we expect the page's locality to be that
> > > > > of numa_mem_id(), and we still deactivate the slab if that isn't true.
> > > > > Just wanting to make sure I understand the intent.
> > > > >
> > > >
> > > > Yeah, the default policy should be to fallback to local memory if the node
> > > > passed is memoryless.
> > > >
> > > > > What I find odd is that there are only 2 nodes on this system, node 0
> > > > > (empty) and node 1. So won't numa_mem_id() always be 1? And every page
> > > > > should be coming from node 1 (thus node_match() should always be true?)
> > > > >
> > > >
> > > > The nice thing about slub is its debugging ability, what is
> > > > /sys/kernel/slab/cache/objects showing in comparison between the two
> > > > patches?
> > >
> > > Ok, I finally got around to writing a script that compares the objects
> > > output from both kernels.
> > >
> > > log1 is with CONFIG_HAVE_MEMORYLESS_NODES on, my kthread locality patch
> > > and Joonsoo's patch.
> > >
> > > log2 is with CONFIG_HAVE_MEMORYLESS_NODES on, my kthread locality patch
> > > and Anton's patch.
> > >
> > > slab                 objects   objects     percent
> > >                         log1      log2      change
> > > -----------------------------------------------------------
> > > :t-0000104             71190     85680   20.353982 %
> > > UDP                     4352      3392   22.058824 %
> > > inode_cache            54302     41923   22.796582 %
> > > fscache_cookie_jar      3276      2457   25.000000 %
> > > :t-0000896               438       292   33.333333 %
> > > :t-0000080            310401    195323   37.073978 %
> > > ext4_inode_cache         335       201   40.000000 %
> > > :t-0000192             89408    128898   44.168307 %
> > > :t-0000184            151300     81880   45.882353 %
> > > :t-0000512             49698     73648   48.191074 %
> > > :at-0000192           242867    120948   50.199904 %
> > > xfs_inode              34350     15221   55.688501 %
> > > :t-0016384             11005     17257   56.810541 %
> > > proc_inode_cache      103868     34717   66.575846 %
> > > tw_sock_TCP              768       256   66.666667 %
> > > :t-0004096             15240     25672   68.451444 %
> > > nfs_inode_cache         1008       315   68.750000 %
> > > :t-0001024             14528     24720   70.154185 %
> > > :t-0032768               655      1312  100.305344 %
> > > :t-0002048             14242     30720  115.700042 %
> > > :t-0000640              1020      2550  150.000000 %
> > > :t-0008192             10005     27905  178.910545 %
> > >
> > > FWIW, the configuration of this LPAR has slightly changed. It is now configured
> > > for maximally 400 CPUs, of which 200 are present. The result is that even with
> > > Joonsoo's patch (log1 above), we OOM pretty easily and Anton's slab usage
> > > script reports:
> > >
> > > slab                 mem     objs    slabs
> > >                     used   active   active
> > > ------------------------------------------------------------
> > > kmalloc-512      1182 MB    2.03%  100.00%
> > > kmalloc-192      1182 MB    1.38%  100.00%
> > > kmalloc-16384     966 MB   17.66%  100.00%
> > > kmalloc-4096      353 MB   15.92%  100.00%
> > > kmalloc-8192      259 MB   27.28%  100.00%
> > > kmalloc-32768     207 MB    9.86%  100.00%
> > >
> > > In comparison (log2 above):
> > >
> > > slab                 mem     objs    slabs
> > >                     used   active   active
> > > ------------------------------------------------------------
> > > kmalloc-16384     273 MB   98.76%  100.00%
> > > kmalloc-8192      225 MB   98.67%  100.00%
> > > pgtable-2^11      114 MB  100.00%  100.00%
> > > pgtable-2^12      109 MB  100.00%  100.00%
> > > kmalloc-4096      104 MB   98.59%  100.00%
> > >
> > > I appreciate all the help so far, if anyone has any ideas how best to
> > > proceed further, or what they'd like debugged more, I'm happy to get
> > > this fixed. We're hitting this on a couple of different systems and I'd
> > > like to find a good resolution to the problem.
> >
> > Hello,
> >
> > I have no memoryless system, so, to debug it, I need your help. :)
> > First, please let me know node information on your system.
>
> [ 0.000000] Node 0 Memory:
> [ 0.000000] Node 1 Memory: 0x0-0x200000000
>
> [ 0.000000] On node 0 totalpages: 0
> [ 0.000000] On node 1 totalpages: 131072
> [ 0.000000] DMA zone: 112 pages used for memmap
> [ 0.000000] DMA zone: 0 pages reserved
> [ 0.000000] DMA zone: 131072 pages, LIFO batch:1
>
> [ 0.638391] Node 0 CPUs: 0-199
> [ 0.638394] Node 1 CPUs:
>
> Do you need anything else?
>
> > I'm preparing 3 other patches which are nearly the same as the previous patch,
> > but take a slightly different approach. Could you test them on your system?
> > I will send them soon.
>
> Test results are in the attached tarball [1].
>
> > And I think that same problem exists if CONFIG_SLAB is enabled. Could you
> > confirm that?
>
> I will test and let you know.
Ok, with your patches applied and CONFIG_SLAB enabled:
MemTotal: 8264640 kB
MemFree: 7119680 kB
Slab: 207232 kB
SReclaimable: 32896 kB
SUnreclaim: 174336 kB
For reference, same kernel with CONFIG_SLUB:
MemTotal: 8264640 kB
MemFree: 4264000 kB
Slab: 3065408 kB
SReclaimable: 104704 kB
SUnreclaim: 2960704 kB
So CONFIG_SLAB is much better in this case.
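For scale, the Slab line can be expressed as a share of MemTotal. The helper below is only an illustrative sketch (slab_share_percent() is a made-up name, not something from the kernel or from Anton's script):

```c
#include <assert.h>

/* Slab usage as a percentage of MemTotal; both arguments are in kB. */
static double slab_share_percent(unsigned long slab_kb,
				 unsigned long memtotal_kb)
{
	assert(memtotal_kb > 0);
	return 100.0 * (double)slab_kb / (double)memtotal_kb;
}
```

Fed the numbers above, that is 207232 / 8264640, or roughly 2.5% of memory in slab with CONFIG_SLAB, against 3065408 / 8264640, or roughly 37%, with CONFIG_SLUB.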
Without your patches (but still CONFIG_HAVE_MEMORYLESS_NODES, kthread
locality patch and two other unrelated bugfix patches):
3.13.0-slub:
MemTotal: 8264704 kB
MemFree: 4404288 kB
Slab: 2963648 kB
SReclaimable: 106816 kB
SUnreclaim: 2856832 kB
3.13.0-slab:
MemTotal: 8264640 kB
MemFree: 7263168 kB
Slab: 206144 kB
SReclaimable: 32576 kB
SUnreclaim: 173568 kB
In case it's helpful, I've attached /proc/slabinfo from both kernels.
Thanks,
Nish
[-- Attachment #2: slabusage.3.13.SLAB --]
[-- Type: text/plain, Size: 13115 bytes --]
slab mem objs slabs
used active active
------------------------------------------------------------
thread_info 34 MB 96.33% 100.00%
kmalloc-1024 22 MB 97.44% 100.00%
task_struct 19 MB 95.15% 100.00%
kmalloc-16384 9 MB 98.05% 100.00%
inode_cache 8 MB 97.74% 100.00%
kmalloc-512 7 MB 89.56% 100.00%
dentry 7 MB 98.89% 100.00%
kmalloc-8192 6 MB 98.64% 100.00%
proc_inode_cache 6 MB 90.20% 100.00%
idr_layer_cache 4 MB 94.76% 100.00%
sighand_cache 4 MB 94.69% 100.00%
pgtable-2^12 3 MB 72.58% 100.00%
xfs_inode 3 MB 98.89% 100.00%
sysfs_dir_cache 3 MB 98.29% 100.00%
radix_tree_node 2 MB 97.19% 100.00%
kmalloc-32768 2 MB 97.96% 100.00%
kmalloc-4096 2 MB 97.68% 100.00%
filp 2 MB 20.71% 100.00%
signal_cache 2 MB 72.35% 100.00%
pgtable-2^10 2 MB 52.81% 100.00%
kmalloc-256 2 MB 85.56% 100.00%
kmalloc-2048 1 MB 84.95% 100.00%
shmem_inode_cache 1 MB 89.59% 100.00%
dtl 1 MB 98.77% 100.00%
kmalloc-192 1 MB 77.89% 100.00%
vm_area_struct 1 MB 76.80% 100.00%
cred_jar 1 MB 36.80% 100.00%
kmem_cache 1 MB 97.69% 100.00%
kmalloc-65536 0 MB 100.00% 100.00%
kmalloc-128 0 MB 87.07% 100.00%
buffer_head 0 MB 92.52% 100.00%
kmalloc-32 0 MB 92.89% 100.00%
anon_vma_chain 0 MB 47.46% 100.00%
sock_inode_cache 0 MB 65.45% 100.00%
kmalloc-64 0 MB 94.98% 100.00%
files_cache 0 MB 60.85% 100.00%
names_cache 0 MB 85.83% 100.00%
mm_struct 0 MB 22.06% 100.00%
xfs_buf 0 MB 91.50% 100.00%
UNIX 0 MB 37.90% 100.00%
task_delay_info 0 MB 66.76% 100.00%
skbuff_head_cache 0 MB 50.33% 100.00%
pid 0 MB 62.63% 100.00%
RAW 0 MB 92.59% 100.00%
kmalloc-96 0 MB 63.71% 100.00%
anon_vma 0 MB 52.25% 100.00%
xfs_ifork 0 MB 88.60% 100.00%
biovec-256 0 MB 75.56% 100.00%
TCP 0 MB 19.66% 100.00%
ftrace_event_field 0 MB 63.17% 100.00%
fs_cache 0 MB 24.30% 100.00%
file_lock_cache 0 MB 5.24% 100.00%
eventpoll_epi 0 MB 13.21% 100.00%
cifs_request 0 MB 71.43% 100.00%
cfq_queue 0 MB 26.90% 100.00%
blkdev_queue 0 MB 48.39% 100.00%
UDP 0 MB 12.50% 100.00%
xfs_trans 0 MB 4.33% 100.00%
xfs_log_ticket 0 MB 3.45% 100.00%
xfs_log_item_desc 0 MB 2.42% 100.00%
xfs_ioend 0 MB 84.65% 100.00%
xfs_ili 0 MB 66.20% 100.00%
xfs_buf_item 0 MB 7.94% 100.00%
xfs_btree_cur 0 MB 1.94% 100.00%
uid_cache 0 MB 1.61% 100.00%
tcp_bind_bucket 0 MB 2.18% 100.00%
taskstats 0 MB 3.55% 100.00%
sigqueue 0 MB 0.75% 100.00%
sgpool-8 0 MB 1.59% 100.00%
sgpool-64 0 MB 6.45% 100.00%
sgpool-32 0 MB 3.17% 100.00%
sgpool-16 0 MB 1.57% 100.00%
sgpool-128 0 MB 13.33% 100.00%
sd_ext_cdb 0 MB 0.11% 100.00%
scsi_sense_cache 0 MB 0.60% 100.00%
scsi_cmd_cache 0 MB 1.19% 100.00%
rpc_tasks 0 MB 3.17% 100.00%
rpc_inode_cache 0 MB 31.68% 100.00%
rpc_buffers 0 MB 25.81% 100.00%
revoke_table 0 MB 0.12% 100.00%
pool_workqueue 0 MB 4.37% 100.00%
numa_policy 0 MB 46.75% 100.00%
nsproxy 0 MB 0.16% 100.00%
nfs_write_data 0 MB 50.79% 100.00%
nfs_inode_cache 0 MB 27.69% 100.00%
nfs_commit_data 0 MB 4.76% 100.00%
nf_conntrack_c000000000cc9900 0 MB 45.22% 100.00%
mqueue_inode_cache 0 MB 1.39% 100.00%
mnt_cache 0 MB 53.57% 100.00%
key_jar 0 MB 5.56% 100.00%
jbd2_revoke_table_s 0 MB 0.06% 100.00%
ip_fib_trie 0 MB 0.73% 100.00%
ip_fib_alias 0 MB 0.71% 100.00%
ip_dst_cache 0 MB 30.16% 100.00%
inotify_inode_mark 0 MB 17.23% 100.00%
inet_peer_cache 0 MB 3.97% 100.00%
hugetlbfs_inode_cache 0 MB 2.59% 100.00%
ftrace_event_file 0 MB 92.58% 100.00%
fsnotify_event 0 MB 0.18% 100.00%
ext4_inode_cache 0 MB 4.35% 100.00%
ext4_groupinfo_4k 0 MB 8.55% 100.00%
ext4_extent_status 0 MB 0.07% 100.00%
ext3_inode_cache 0 MB 4.76% 100.00%
eventpoll_pwq 0 MB 15.20% 100.00%
dnotify_struct 0 MB 0.60% 100.00%
dnotify_mark 0 MB 2.08% 100.00%
dm_io 0 MB 2.28% 100.00%
cifs_small_rq 0 MB 23.62% 100.00%
cifs_mpx_ids 0 MB 0.60% 100.00%
cfq_io_cq 0 MB 26.42% 100.00%
blkdev_requests 0 MB 10.56% 100.00%
blkdev_ioc 0 MB 19.31% 100.00%
biovec-16 0 MB 1.19% 100.00%
bio-1 0 MB 13.49% 100.00%
bio-0 0 MB 1.59% 100.00%
bdev_cache 0 MB 52.78% 100.00%
xfs_mru_cache_elem 0 MB 0.00% 0.00%
xfs_icr 0 MB 0.00% 0.00%
xfs_efi_item 0 MB 0.00% 0.00%
xfs_efd_item 0 MB 0.00% 0.00%
xfs_da_state 0 MB 0.00% 0.00%
xfs_bmap_free_item 0 MB 0.00% 0.00%
xfrm_dst_cache 0 MB 0.00% 0.00%
tw_sock_TCP 0 MB 0.00% 0.00%
skbuff_fclone_cache 0 MB 0.00% 0.00%
shared_policy_node 0 MB 0.00% 0.00%
secpath_cache 0 MB 0.00% 0.00%
scsi_data_buffer 0 MB 0.00% 0.00%
revoke_record 0 MB 0.00% 0.00%
request_sock_TCP 0 MB 0.00% 0.00%
reiser_inode_cache 0 MB 0.00% 0.00%
posix_timers_cache 0 MB 0.00% 0.00%
pid_namespace 0 MB 0.00% 0.00%
nfsd_drc 0 MB 0.00% 0.00%
nfsd4_stateids 0 MB 0.00% 0.00%
nfsd4_openowners 0 MB 0.00% 0.00%
nfsd4_lockowners 0 MB 0.00% 0.00%
nfsd4_files 0 MB 0.00% 0.00%
nfsd4_delegations 0 MB 0.00% 0.00%
nfs_read_data 0 MB 0.00% 0.00%
nfs_page 0 MB 0.00% 0.00%
nfs_direct_cache 0 MB 0.00% 0.00%
nf_conntrack_expect 0 MB 0.00% 0.00%
net_namespace 0 MB 0.00% 0.00%
kmalloc-8388608 0 MB 0.00% 0.00%
kmalloc-524288 0 MB 0.00% 0.00%
kmalloc-4194304 0 MB 0.00% 0.00%
kmalloc-262144 0 MB 0.00% 0.00%
kmalloc-2097152 0 MB 0.00% 0.00%
kmalloc-16777216 0 MB 0.00% 0.00%
kmalloc-131072 0 MB 0.00% 0.00%
kmalloc-1048576 0 MB 0.00% 0.00%
kioctx 0 MB 0.00% 0.00%
kiocb 0 MB 0.00% 0.00%
kcopyd_job 0 MB 0.00% 0.00%
journal_head 0 MB 0.00% 0.00%
journal_handle 0 MB 0.00% 0.00%
jbd2_transaction_s 0 MB 0.00% 0.00%
jbd2_revoke_record_s 0 MB 0.00% 0.00%
jbd2_journal_head 0 MB 0.00% 0.00%
jbd2_journal_handle 0 MB 0.00% 0.00%
jbd2_inode 0 MB 0.00% 0.00%
jbd2_4k 0 MB 0.00% 0.00%
isofs_inode_cache 0 MB 0.00% 0.00%
io 0 MB 0.00% 0.00%
inotify_event_private_data 0 MB 0.00% 0.00%
fstrm_item 0 MB 0.00% 0.00%
fsnotify_event_holder 0 MB 0.00% 0.00%
flow_cache 0 MB 0.00% 0.00%
fat_inode_cache 0 MB 0.00% 0.00%
fat_cache 0 MB 0.00% 0.00%
fasync_cache 0 MB 0.00% 0.00%
ext4_xattr 0 MB 0.00% 0.00%
ext4_system_zone 0 MB 0.00% 0.00%
ext4_prealloc_space 0 MB 0.00% 0.00%
ext4_io_end 0 MB 0.00% 0.00%
ext4_free_data 0 MB 0.00% 0.00%
ext4_allocation_context 0 MB 0.00% 0.00%
ext3_xattr 0 MB 0.00% 0.00%
ext2_xattr 0 MB 0.00% 0.00%
ext2_inode_cache 0 MB 0.00% 0.00%
dma-kmalloc-96 0 MB 0.00% 0.00%
dma-kmalloc-8388608 0 MB 0.00% 0.00%
dma-kmalloc-8192 0 MB 0.00% 0.00%
dma-kmalloc-65536 0 MB 0.00% 0.00%
dma-kmalloc-64 0 MB 0.00% 0.00%
dma-kmalloc-524288 0 MB 0.00% 0.00%
dma-kmalloc-512 0 MB 0.00% 0.00%
dma-kmalloc-4194304 0 MB 0.00% 0.00%
dma-kmalloc-4096 0 MB 0.00% 0.00%
dma-kmalloc-32768 0 MB 0.00% 0.00%
dma-kmalloc-32 0 MB 0.00% 0.00%
dma-kmalloc-262144 0 MB 0.00% 0.00%
dma-kmalloc-256 0 MB 0.00% 0.00%
dma-kmalloc-2097152 0 MB 0.00% 0.00%
dma-kmalloc-2048 0 MB 0.00% 0.00%
dma-kmalloc-192 0 MB 0.00% 0.00%
dma-kmalloc-16777216 0 MB 0.00% 0.00%
dma-kmalloc-16384 0 MB 0.00% 0.00%
dma-kmalloc-131072 0 MB 0.00% 0.00%
dma-kmalloc-128 0 MB 0.00% 0.00%
dma-kmalloc-1048576 0 MB 0.00% 0.00%
dma-kmalloc-1024 0 MB 0.00% 0.00%
dm_uevent 0 MB 0.00% 0.00%
dm_rq_target_io 0 MB 0.00% 0.00%
dio 0 MB 0.00% 0.00%
cifs_inode_cache 0 MB 0.00% 0.00%
bsg_cmd 0 MB 0.00% 0.00%
biovec-64 0 MB 0.00% 0.00%
biovec-128 0 MB 0.00% 0.00%
UDP-Lite 0 MB 0.00% 0.00%
PING 0 MB 0.00% 0.00%
[-- Attachment #3: slabusage.3.13.SLUB --]
[-- Type: text/plain, Size: 7076 bytes --]
slab mem objs slabs
used active active
------------------------------------------------------------
kmalloc-16384 1018 MB 14.09% 100.00%
task_struct 704 MB 17.20% 100.00%
pgtable-2^12 110 MB 100.00% 100.00%
kmalloc-8192 109 MB 49.21% 100.00%
pgtable-2^10 105 MB 100.00% 100.00%
kmalloc-65536 92 MB 100.00% 100.00%
kmalloc-512 83 MB 16.68% 100.00%
kmalloc-128 75 MB 17.55% 100.00%
kmalloc-4096 52 MB 97.30% 100.00%
kmalloc-16 38 MB 24.78% 100.00%
kmalloc-256 33 MB 99.09% 100.00%
kmalloc-1024 27 MB 60.45% 100.00%
sighand_cache 27 MB 100.00% 100.00%
idr_layer_cache 25 MB 100.00% 100.00%
kmalloc-2048 25 MB 97.59% 100.00%
dentry 23 MB 100.00% 100.00%
inode_cache 20 MB 100.00% 100.00%
proc_inode_cache 19 MB 100.00% 100.00%
sysfs_dir_cache 16 MB 100.00% 100.00%
vm_area_struct 14 MB 100.00% 100.00%
kmalloc-64 14 MB 97.79% 100.00%
kmalloc-192 13 MB 97.60% 100.00%
kmalloc-32 12 MB 97.56% 100.00%
anon_vma 12 MB 100.00% 100.00%
mm_struct 12 MB 100.00% 100.00%
sigqueue 12 MB 100.00% 100.00%
files_cache 12 MB 100.00% 100.00%
cfq_queue 11 MB 100.00% 100.00%
radix_tree_node 11 MB 100.00% 100.00%
kmalloc-96 10 MB 97.06% 100.00%
blkdev_requests 10 MB 100.00% 100.00%
xfs_inode 9 MB 100.00% 100.00%
shmem_inode_cache 9 MB 100.00% 100.00%
ext4_system_zone 9 MB 100.00% 100.00%
sock_inode_cache 9 MB 100.00% 100.00%
RAW 8 MB 100.00% 100.00%
kmalloc-8 8 MB 100.00% 100.00%
kmalloc-32768 8 MB 100.00% 100.00%
blkdev_ioc 7 MB 100.00% 100.00%
buffer_head 6 MB 100.00% 100.00%
xfs_da_state 6 MB 100.00% 100.00%
mnt_cache 6 MB 100.00% 100.00%
numa_policy 6 MB 100.00% 100.00%
dnotify_mark 4 MB 100.00% 100.00%
TCP 3 MB 100.00% 100.00%
cifs_request 3 MB 100.00% 100.00%
UDP 3 MB 100.00% 100.00%
xfs_ili 3 MB 100.00% 100.00%
xfs_btree_cur 3 MB 100.00% 100.00%
nf_conntrack_c000000000cb5480 2 MB 100.00% 100.00%
fsnotify_event_holder 1 MB 100.00% 100.00%
dm_rq_target_io 1 MB 100.00% 100.00%
bdev_cache 1 MB 100.00% 100.00%
kmem_cache 1 MB 89.09% 100.00%
blkdev_queue 0 MB 100.00% 100.00%
dio 0 MB 100.00% 100.00%
taskstats 0 MB 100.00% 100.00%
kmem_cache_node 0 MB 100.00% 100.00%
shared_policy_node 0 MB 100.00% 100.00%
rpc_inode_cache 0 MB 100.00% 100.00%
nfs_inode_cache 0 MB 100.00% 100.00%
revoke_table 0 MB 100.00% 100.00%
ip_fib_trie 0 MB 100.00% 100.00%
ext4_inode_cache 0 MB 100.00% 100.00%
hugetlbfs_inode_cache 0 MB 100.00% 100.00%
ext3_inode_cache 0 MB 100.00% 100.00%
tw_sock_TCP 0 MB 100.00% 100.00%
mqueue_inode_cache 0 MB 100.00% 100.00%
ext4_extent_status 0 MB 100.00% 100.00%
ext4_allocation_context 0 MB 100.00% 100.00%
xfs_icr 0 MB 0.00% 0.00%
revoke_record 0 MB 0.00% 0.00%
reiser_inode_cache 0 MB 0.00% 0.00%
posix_timers_cache 0 MB 0.00% 0.00%
pid_namespace 0 MB 0.00% 0.00%
nfsd4_openowners 0 MB 0.00% 0.00%
nfsd4_delegations 0 MB 0.00% 0.00%
nfs_direct_cache 0 MB 0.00% 0.00%
net_namespace 0 MB 0.00% 0.00%
kmalloc-131072 0 MB 0.00% 0.00%
kcopyd_job 0 MB 0.00% 0.00%
journal_head 0 MB 0.00% 0.00%
journal_handle 0 MB 0.00% 0.00%
jbd2_transaction_s 0 MB 0.00% 0.00%
jbd2_journal_handle 0 MB 0.00% 0.00%
isofs_inode_cache 0 MB 0.00% 0.00%
fat_inode_cache 0 MB 0.00% 0.00%
fat_cache 0 MB 0.00% 0.00%
ext4_io_end 0 MB 0.00% 0.00%
ext4_free_data 0 MB 0.00% 0.00%
ext3_xattr 0 MB 0.00% 0.00%
ext2_inode_cache 0 MB 0.00% 0.00%
dma-kmalloc-96 0 MB 0.00% 0.00%
dma-kmalloc-8192 0 MB 0.00% 0.00%
dma-kmalloc-8 0 MB 0.00% 0.00%
dma-kmalloc-65536 0 MB 0.00% 0.00%
dma-kmalloc-64 0 MB 0.00% 0.00%
dma-kmalloc-512 0 MB 0.00% 0.00%
dma-kmalloc-4096 0 MB 0.00% 0.00%
dma-kmalloc-32768 0 MB 0.00% 0.00%
dma-kmalloc-32 0 MB 0.00% 0.00%
dma-kmalloc-256 0 MB 0.00% 0.00%
dma-kmalloc-2048 0 MB 0.00% 0.00%
dma-kmalloc-192 0 MB 0.00% 0.00%
dma-kmalloc-16384 0 MB 0.00% 0.00%
dma-kmalloc-16 0 MB 0.00% 0.00%
dma-kmalloc-131072 0 MB 0.00% 0.00%
dma-kmalloc-128 0 MB 0.00% 0.00%
dma-kmalloc-1024 0 MB 0.00% 0.00%
dm_uevent 0 MB 0.00% 0.00%
cifs_inode_cache 0 MB 0.00% 0.00%
bsg_cmd 0 MB 0.00% 0.00%
UDP-Lite 0 MB 0.00% 0.00%
^ permalink raw reply
* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
From: Nishanth Aravamudan @ 2014-02-06 19:11 UTC (permalink / raw)
To: Joonsoo Kim
Cc: Han Pingtian, Matt Mackall, Pekka Enberg,
Linux Memory Management List, Paul Mackerras, Anton Blanchard,
David Rientjes, Joonsoo Kim, linuxppc-dev, Christoph Lameter,
Wanpeng Li
In-Reply-To: <CAAmzW4PXkdpNi5pZ=4BzdXNvqTEAhcuw-x0pWidqrxzdePxXxA@mail.gmail.com>
On 06.02.2014 [19:29:16 +0900], Joonsoo Kim wrote:
> 2014-02-06 David Rientjes <rientjes@google.com>:
> > On Thu, 6 Feb 2014, Joonsoo Kim wrote:
> >
> >> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> >>
> >
> > I may be misunderstanding this patch and there's no help because there's
> > no changelog.
>
> Sorry about that.
> I made this patch just for testing. :)
> Thanks for looking this.
>
> >> diff --git a/include/linux/topology.h b/include/linux/topology.h
> >> index 12ae6ce..a6d5438 100644
> >> --- a/include/linux/topology.h
> >> +++ b/include/linux/topology.h
> >> @@ -233,11 +233,20 @@ static inline int numa_node_id(void)
> >> * Use the accessor functions set_numa_mem(), numa_mem_id() and cpu_to_mem().
> >> */
> >> DECLARE_PER_CPU(int, _numa_mem_);
> >> +int _node_numa_mem_[MAX_NUMNODES];
> >>
> >> #ifndef set_numa_mem
> >> static inline void set_numa_mem(int node)
> >> {
> >> this_cpu_write(_numa_mem_, node);
> >> + _node_numa_mem_[numa_node_id()] = node;
> >> +}
> >> +#endif
> >> +
> >> +#ifndef get_numa_mem
> >> +static inline int get_numa_mem(int node)
> >> +{
> >> + return _node_numa_mem_[node];
> >> }
> >> #endif
> >>
> >> @@ -260,6 +269,7 @@ static inline int cpu_to_mem(int cpu)
> >> static inline void set_cpu_numa_mem(int cpu, int node)
> >> {
> >> per_cpu(_numa_mem_, cpu) = node;
> >> + _node_numa_mem_[numa_node_id()] = node;
> >
> > The intention seems to be that _node_numa_mem_[X] for a node X will return
> > a node Y with memory that has the nearest distance? In other words,
> > caching the value returned by local_memory_node(X)?
>
> Yes, you are right.
>
> > That doesn't seem to be what it's doing since numa_node_id() is the node
> > of the cpu that current is running on so this ends up getting initialized
> > to whatever local_memory_node(cpu_to_node(cpu)) is for the last bit set in
> > cpu_possible_mask.
>
> Yes, I made a mistake.
> Thanks for pointer.
> I fix it and attach v2.
> Now I'm out of office, so I'm not sure this second version is correct :(
>
> Thanks.
>
> ----------8<--------------
> From bf691e7eb07f966e3aed251eaeb18f229ee32d1f Mon Sep 17 00:00:00 2001
> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Date: Thu, 6 Feb 2014 17:07:05 +0900
> Subject: [RFC PATCH 2/3 v2] topology: support node_numa_mem() for
> determining the
> fallback node
>
> We need to determine the fallback node in the slub allocator if the allocation
> target node is a memoryless node. Without it, SLUB wrongly selects
> the node which has no memory and can't use a partial slab, because of node
> mismatch. The introduced function, node_numa_mem(X), will return
> a node Y with memory that has the nearest distance. If X is a memoryless
> node, it will return the nearest-distance node, but if
> X is a normal node, it will return itself.
>
> We will use this function in following patch to determine the fallback
> node.
>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>
> diff --git a/include/linux/topology.h b/include/linux/topology.h
> index 12ae6ce..66b19b8 100644
> --- a/include/linux/topology.h
> +++ b/include/linux/topology.h
> @@ -233,11 +233,20 @@ static inline int numa_node_id(void)
> * Use the accessor functions set_numa_mem(), numa_mem_id() and cpu_to_mem().
> */
> DECLARE_PER_CPU(int, _numa_mem_);
> +int _node_numa_mem_[MAX_NUMNODES];
Should be static, I think?
>
> #ifndef set_numa_mem
> static inline void set_numa_mem(int node)
> {
> this_cpu_write(_numa_mem_, node);
> + _node_numa_mem_[numa_node_id()] = node;
> +}
> +#endif
> +
> +#ifndef get_numa_mem
> +static inline int get_numa_mem(int node)
> +{
> + return _node_numa_mem_[node];
> }
> #endif
>
> @@ -260,6 +269,7 @@ static inline int cpu_to_mem(int cpu)
> static inline void set_cpu_numa_mem(int cpu, int node)
> {
> per_cpu(_numa_mem_, cpu) = node;
> + _node_numa_mem_[cpu_to_node(cpu)] = node;
> }
> #endif
>
> @@ -273,6 +283,13 @@ static inline int numa_mem_id(void)
> }
> #endif
>
> +#ifndef get_numa_mem
> +static inline int get_numa_mem(int node)
> +{
> + return node;
> +}
> +#endif
> +
> #ifndef cpu_to_mem
> static inline int cpu_to_mem(int cpu)
> {
> --
> 1.7.9.5
>
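For anyone following along, here is how I read the v2 lookup, as a small user-space model. Everything prefixed model_ is hypothetical and only mirrors the shape of the patch. The identity initialisation in model_init() is an assumption of this sketch, added so that a node no CPU maps to still resolves to itself; the patch as posted does not do that.

```c
#include <assert.h>

#define MODEL_MAX_NODES	2	/* stand-in for MAX_NUMNODES */
#define MODEL_MAX_CPUS	4	/* stand-in for the possible CPUs */

static int model_cpu_node[MODEL_MAX_CPUS];	/* cpu -> node it sits on */
static int model_node_mem[MODEL_MAX_NODES];	/* node -> nearest node with memory */

/* Assumption of this sketch: every node starts out pointing at itself. */
static void model_init(void)
{
	for (int node = 0; node < MODEL_MAX_NODES; node++)
		model_node_mem[node] = node;
}

static void model_set_cpu_node(int cpu, int node)
{
	assert(cpu >= 0 && cpu < MODEL_MAX_CPUS);
	assert(node >= 0 && node < MODEL_MAX_NODES);
	model_cpu_node[cpu] = node;
}

/* As in v2: index by the node of @cpu, not by the node we are running on. */
static void model_set_cpu_numa_mem(int cpu, int mem_node)
{
	assert(cpu >= 0 && cpu < MODEL_MAX_CPUS);
	assert(mem_node >= 0 && mem_node < MODEL_MAX_NODES);
	model_node_mem[model_cpu_node[cpu]] = mem_node;
}

static int model_get_numa_mem(int node)
{
	assert(node >= 0 && node < MODEL_MAX_NODES);
	return model_node_mem[node];
}
```

On the LPAR described in this thread (all CPUs on node 0, all memory on node 1), every CPU would record node 1 as its memory node, so a lookup for node 0 yields node 1.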
^ permalink raw reply
* Re: [PATCH] Convert powerpc simple spinlocks into ticket locks
From: Peter Zijlstra @ 2014-02-06 18:08 UTC (permalink / raw)
To: Torsten Duwe
Cc: linux-kernel, Paul Mackerras, Anton Blanchard, Paul E. McKenney,
linuxppc-dev, Ingo Molnar
In-Reply-To: <20140206173727.GA13048@lst.de>
On Thu, Feb 06, 2014 at 06:37:27PM +0100, Torsten Duwe wrote:
> On Thu, Feb 06, 2014 at 05:38:37PM +0100, Peter Zijlstra wrote:
> > On Thu, Feb 06, 2014 at 11:37:37AM +0100, Torsten Duwe wrote:
> > > x86 has them, MIPS has them, ARM has them, even ia64 has them:
> > > ticket locks. They reduce memory bus and cache pressure especially
> > > for contended spinlocks, increasing performance.
> > >
> > > This patch is a port of the x86 spin locks, mostly written in C,
> > > to the powerpc, introducing inline asm where needed. The pSeries
> > > directed yield for vCPUs is taken care of by an additional "holder"
> > > field in the lock.
> > >
> >
> > A few questions; what's with the ppc64 holder thing? Not having a 32bit
> > spinlock_t is sad.
>
> I must admit that I haven't tested the patch on non-pseries ppc64 nor on
> ppc32. Only ppc64 has the ldarx and I tried to atomically replace the
> holder along with the locks. That might prove unnecessary.
But what is the holder for? Can't we do away with that field?
> > Can you pair lwarx with sthcx ? I couldn't immediately find the answer
> > in the PowerISA doc. If so I think you can do better by being able to
> > atomically load both tickets but only storing the head without affecting
> > the tail.
>
> V2.06b, Book II, Chapter 3, "sthcx" says:
> | If a reservation exists and the length associated [...] is not 2 bytes,
> | it is undefined whether (RS)_48:63 are stored [...]
>
> That doesn't make me feel comfortable :(
That's on page 692, right? The way I read that is if the lharx/sthcx
don't have the exact same address, storage is undefined. But I can't
find mention of non-matching load and store size, although I can imagine
it being the same undefined.
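For reference, the head/tail split being discussed looks roughly like this in portable C11 atomics. This is only a sketch with made-up names (struct model_ticket_lock and friends); it uses separate halfword atomics rather than lwarx/sthcx, so it says nothing about whether the mixed-size pairing is legal on powerpc.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical ticket lock model; not the powerpc or x86 implementation. */
struct model_ticket_lock {
	_Atomic uint16_t head;	/* ticket currently being served */
	_Atomic uint16_t tail;	/* next ticket to hand out */
};

static void model_ticket_init(struct model_ticket_lock *l)
{
	atomic_init(&l->head, 0);
	atomic_init(&l->tail, 0);
}

static void model_ticket_acquire(struct model_ticket_lock *l)
{
	/* Take a ticket; only the tail changes here. */
	uint16_t me = atomic_fetch_add_explicit(&l->tail, 1,
						memory_order_relaxed);

	/* Spin until our number comes up. */
	while (atomic_load_explicit(&l->head, memory_order_acquire) != me)
		;
}

static void model_ticket_release(struct model_ticket_lock *l)
{
	/* Only the holder writes head, so load/add/store is enough. */
	uint16_t head = atomic_load_explicit(&l->head, memory_order_relaxed);

	/* The lock must be held: at least one ticket is outstanding. */
	assert(head != atomic_load_explicit(&l->tail, memory_order_relaxed));
	atomic_store_explicit(&l->head, (uint16_t)(head + 1),
			      memory_order_release);
}
```

The point of interest is the release path: it stores only the head and never touches the tail, which is what a halfword store to one half of the lock word would achieve.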
^ permalink raw reply
* Re: [PATCH V3 2/3] tick/cpuidle: Initialize hrtimer mode of broadcast
From: Preeti U Murthy @ 2014-02-06 17:43 UTC (permalink / raw)
To: Thomas Gleixner
Cc: deepthi, rafael.j.wysocki, linux-pm, peterz, fweisbec,
daniel.lezcano, linux-kernel, paulus, srivatsa.bhat, paulmck,
linuxppc-dev, mingo
In-Reply-To: <alpine.DEB.2.02.1402061655560.21991@ionos.tec.linutronix.de>
Hi Thomas,
On 02/06/2014 09:33 PM, Thomas Gleixner wrote:
> On Thu, 6 Feb 2014, Preeti U Murthy wrote:
>
> Compiler warnings are not so important, right?
>
> kernel/time/tick-broadcast.c: In function ‘tick_broadcast_oneshot_control’:
> kernel/time/tick-broadcast.c:700:3: warning: ‘return’ with no value, in function returning non-void [-Wreturn-type]
> kernel/time/tick-broadcast.c:711:3: warning: ‘return’ with no value, in function returning non-void [-Wreturn-type]
My apologies for this; I will make sure it does not happen again. I did not get
any warnings when compiling, even with the additional compile-time flags. I
compiled it on powerpc. Let me look into why the warnings did not show up.
Nevertheless, I should have caught this simply by looking at the
code.
>
>> + /*
>> + * If the current CPU owns the hrtimer broadcast
>> + * mechanism, it cannot go deep idle.
>> + */
>> + ret = broadcast_needs_cpu(bc, cpu);
>
> So we leave the CPU in the broadcast mask, just to force another call
> to the notify code right away to remove it again. Wouldn't it be more
> clever to clear the flag right away? That would make the changes to
> the cpuidle code simpler. Delta patch below.
You are right.
>
> Thanks,
>
> tglx
> ---
>
> --- tip.orig/kernel/time/tick-broadcast.c
> +++ tip/kernel/time/tick-broadcast.c
> @@ -697,7 +697,7 @@ int tick_broadcast_oneshot_control(unsig
> * states
> */
> if (tick_broadcast_device.mode == TICKDEV_MODE_PERIODIC)
> - return;
> + return 0;
>
> /*
> * We are called with preemtion disabled from the depth of the
> @@ -708,7 +708,7 @@ int tick_broadcast_oneshot_control(unsig
> dev = td->evtdev;
>
> if (!(dev->features & CLOCK_EVT_FEAT_C3STOP))
> - return;
> + return 0;
>
> bc = tick_broadcast_device.evtdev;
>
> @@ -731,9 +731,14 @@ int tick_broadcast_oneshot_control(unsig
> }
> /*
> * If the current CPU owns the hrtimer broadcast
> - * mechanism, it cannot go deep idle.
> + * mechanism, it cannot go deep idle and we remove the
> + * CPU from the broadcast mask. We don't have to go
> + * through the EXIT path as the local timer is not
> + * shutdown.
> */
> ret = broadcast_needs_cpu(bc, cpu);
> + if (ret)
> + cpumask_clear_cpu(cpu, tick_broadcast_oneshot_mask);
> } else {
> if (cpumask_test_and_clear_cpu(cpu, tick_broadcast_oneshot_mask)) {
> clockevents_set_mode(dev, CLOCK_EVT_MODE_ONESHOT);
>
>
The cpuidle patch then is below. The trace_cpu_idle_rcuidle() functions have
been moved around so that the broadcast CPU does not trace any idle event
and that the symmetry between the trace functions and the call to the
broadcast framework is maintained. Wow, it does become very simple :)
time/cpuidle: Handle failed call to BROADCAST_ENTER on archs with CPUIDLE_FLAG_TIMER_STOP set
From: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Some archs set the CPUIDLE_FLAG_TIMER_STOP flag for idle states in which the
local timers stop. cpuidle_idle_call() currently handles such idle states
by calling into the broadcast framework so as to wake up CPUs at their next
wakeup event. With the hrtimer mode of broadcast, the BROADCAST_ENTER call
into the broadcast framework can fail on archs that do not have an external
clock device to handle wakeups, in which case the CPU in question has to be made
the standby CPU. This patch handles such cases by failing the call into
cpuidle so that the arch can take some default action. The arch will certainly
not enter a similar idle state, because a failed cpuidle call also implicitly
indicates that the broadcast framework has not registered this CPU to be woken up.
Hence we are safe if we fail the cpuidle call.
In the process move the functions that trace idle statistics just before and
after the entry and exit into idle states respectively. In other
scenarios where the call to cpuidle fails, we end up not tracing idle
entry and exit since a decision on an idle state could not be taken. Similarly
when the call to broadcast framework fails, we skip tracing idle statistics
because we are in no further position to take a decision on an alternative
idle state to enter into.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
drivers/cpuidle/cpuidle.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index a55e68f..8beb0f02 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -140,12 +140,14 @@ int cpuidle_idle_call(void)
return 0;
}
- trace_cpu_idle_rcuidle(next_state, dev->cpu);
-
broadcast = !!(drv->states[next_state].flags & CPUIDLE_FLAG_TIMER_STOP);
- if (broadcast)
- clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu);
+ if (broadcast &&
+ clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu))
+ return -EBUSY;
+
+
+ trace_cpu_idle_rcuidle(next_state, dev->cpu);
if (cpuidle_state_is_coupled(dev, drv, next_state))
entered_state = cpuidle_enter_state_coupled(dev, drv,
@@ -153,11 +155,11 @@ int cpuidle_idle_call(void)
else
entered_state = cpuidle_enter_state(dev, drv, next_state);
+ trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);
+
if (broadcast)
clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &dev->cpu);
- trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);
-
/* give the governor an opportunity to reflect on the outcome */
if (cpuidle_curr_governor->reflect)
cpuidle_curr_governor->reflect(dev, entered_state);
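
The resulting ordering can be summarised with a small stand-alone model. This is only a sketch of the control flow in the diff above; struct model_idle and the model_ functions are made-up names and nothing here calls real kernel interfaces.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the state cpuidle_idle_call() looks at. */
struct model_idle {
	bool timer_stops;	/* state has CPUIDLE_FLAG_TIMER_STOP set */
	bool enter_refused;	/* BROADCAST_ENTER would fail for this CPU */
	int trace_events;	/* number of idle trace calls made */
	int broadcast_exits;	/* number of BROADCAST_EXIT notifications */
};

static int model_broadcast_enter(const struct model_idle *m)
{
	return m->enter_refused ? -EBUSY : 0;
}

static int model_idle_call(struct model_idle *m)
{
	assert(m);

	/* Fail early: nothing has been traced, nothing needs undoing. */
	if (m->timer_stops && model_broadcast_enter(m))
		return -EBUSY;

	m->trace_events++;	/* idle entry */
	/* ... the idle state would be entered here ... */
	m->trace_events++;	/* idle exit (PWR_EVENT_EXIT) */

	if (m->timer_stops)
		m->broadcast_exits++;
	return 0;
}
```

In this model a refused BROADCAST_ENTER leaves both counters at zero, while a successful one gives two trace events nested inside the enter/exit pair.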
Thank you
Regards
Preeti U Murthy
^ permalink raw reply related
* Re: [PATCH] Convert powerpc simple spinlocks into ticket locks
From: Torsten Duwe @ 2014-02-06 17:37 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, Paul Mackerras, Anton Blanchard, Paul E. McKenney,
linuxppc-dev, Ingo Molnar
In-Reply-To: <20140206163837.GT2936@laptop.programming.kicks-ass.net>
On Thu, Feb 06, 2014 at 05:38:37PM +0100, Peter Zijlstra wrote:
> On Thu, Feb 06, 2014 at 11:37:37AM +0100, Torsten Duwe wrote:
> > x86 has them, MIPS has them, ARM has them, even ia64 has them:
> > ticket locks. They reduce memory bus and cache pressure especially
> > for contended spinlocks, increasing performance.
> >
> > This patch is a port of the x86 spin locks, mostly written in C,
> > to the powerpc, introducing inline asm where needed. The pSeries
> > directed yield for vCPUs is taken care of by an additional "holder"
> > field in the lock.
> >
>
> A few questions; what's with the ppc64 holder thing? Not having a 32bit
> spinlock_t is sad.
I must admit that I haven't tested the patch on non-pseries ppc64 nor on
ppc32. Only ppc64 has the ldarx and I tried to atomically replace the
holder along with the locks. That might prove unnecessary.
> Can you pair lwarx with sthcx ? I couldn't immediately find the answer
> in the PowerISA doc. If so I think you can do better by being able to
> atomically load both tickets but only storing the head without affecting
> the tail.
V2.06b, Book II, Chapter 3, "sthcx" says:
| If a reservation exists and the length associated [...] is not 2 bytes,
| it is undefined whether (RS)_48:63 are stored [...]
That doesn't make me feel comfortable :(
Torsten
^ permalink raw reply
* Re: [RFC PATCH 1/3] slub: search partial list on numa_mem_id(), instead of numa_node_id()
From: Christoph Lameter @ 2014-02-06 17:31 UTC (permalink / raw)
To: David Rientjes
Cc: Han Pingtian, Nishanth Aravamudan, penberg, linux-mm, paulus,
Anton Blanchard, mpm, Joonsoo Kim, linuxppc-dev, Wanpeng Li
In-Reply-To: <alpine.DEB.2.02.1402060037210.21148@chino.kir.corp.google.com>
On Thu, 6 Feb 2014, David Rientjes wrote:
> I think you'll need to send these to Andrew since he appears to be picking
> up slub patches these days.
I can start managing merges again if Pekka no longer has the time.
^ permalink raw reply
* Re: [RFC PATCH 3/3] slub: fallback to get_numa_mem() node if we want to allocate on memoryless node
From: Christoph Lameter @ 2014-02-06 17:30 UTC (permalink / raw)
To: Joonsoo Kim
Cc: Han Pingtian, Nishanth Aravamudan, mpm, penberg, linux-mm, paulus,
Anton Blanchard, David Rientjes, linuxppc-dev, Wanpeng Li
In-Reply-To: <1391674026-20092-3-git-send-email-iamjoonsoo.kim@lge.com>
On Thu, 6 Feb 2014, Joonsoo Kim wrote:
> diff --git a/mm/slub.c b/mm/slub.c
> index cc1f995..c851f82 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1700,6 +1700,14 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
> void *object;
> int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
>
> + if (node == NUMA_NO_NODE)
> + searchnode = numa_mem_id();
> + else {
> + searchnode = node;
> + if (!node_present_pages(node))
This check would need to be something that checks for other contingencies
in the page allocator as well. A simple solution would be to actually run
a GFP_THIS_NODE alloc to see if you can grab a page from the proper node.
If that fails then fallback. See how fallback_alloc() does it in slab.
> + searchnode = get_numa_mem(node);
> + }
> @@ -2277,11 +2285,18 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> redo:
>
> if (unlikely(!node_match(page, node))) {
> - stat(s, ALLOC_NODE_MISMATCH);
> - deactivate_slab(s, page, c->freelist);
> - c->page = NULL;
> - c->freelist = NULL;
> - goto new_slab;
> + int searchnode = node;
> +
> + if (node != NUMA_NO_NODE && !node_present_pages(node))
Same issue here. I would suggest not deactivating the slab and first check
if the node has no pages. If so then just take an object from the current
cpu slab. If that is not available do an allocation from the indicated
node and take whatever the page allocator gave you.
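
The fallback order suggested above can be sketched in user space as follows. This is only a model under stated assumptions: struct node_model, alloc_this_node() and alloc_with_fallback() are hypothetical names, and the per-node page counts stand in for the page allocator. The point it illustrates is that attempting a node-restricted allocation first (GFP_THISNODE in the kernel) catches a node that is temporarily out of pages as well as one that is memoryless, which a node_present_pages() check alone would not.

```c
#include <stddef.h>

#define NR_NODES 4

/* Hypothetical model: free pages per node and each node's fallback node. */
struct node_model {
	int free_pages[NR_NODES];
	int fallback[NR_NODES];
};

/*
 * Models a node-restricted allocation: succeeds only if the node itself
 * has a free page. Returns the node used, or -1 on failure.
 */
static int alloc_this_node(struct node_model *m, int node)
{
	if (m->free_pages[node] <= 0)
		return -1;
	m->free_pages[node]--;
	return node;
}

/*
 * Try the requested node first; only if that attempt fails, retry on the
 * fallback node. No separate "is this node memoryless?" test is needed.
 */
static int alloc_with_fallback(struct node_model *m, int node)
{
	int got = alloc_this_node(m, node);

	if (got >= 0)
		return got;
	return alloc_this_node(m, m->fallback[node]);
}
```

In this model a node with zero free pages behaves the same whether it never had memory or has simply run out, so one code path covers both cases.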
^ permalink raw reply
* Re: [RFC PATCH 1/3] slub: search partial list on numa_mem_id(), instead of numa_node_id()
From: Christoph Lameter @ 2014-02-06 17:26 UTC (permalink / raw)
To: Joonsoo Kim
Cc: Han Pingtian, Nishanth Aravamudan, mpm, penberg, linux-mm, paulus,
Anton Blanchard, David Rientjes, linuxppc-dev, Wanpeng Li
In-Reply-To: <1391674026-20092-1-git-send-email-iamjoonsoo.kim@lge.com>
On Thu, 6 Feb 2014, Joonsoo Kim wrote:
> Currently, if allocation constraint to node is NUMA_NO_NODE, we search
> a partial slab on numa_node_id() node. This doesn't work properly on the
> system having memoryless node, since it can have no memory on that node and
> there must be no partial slab on that node.
>
> On that node, page allocation always fallback to numa_mem_id() first. So
> searching a partial slab on numa_node_id() in that case is proper solution
> for memoryless node case.
Acked-by: Christoph Lameter <cl@linux.com>
^ permalink raw reply
* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
From: Christoph Lameter @ 2014-02-06 17:25 UTC (permalink / raw)
To: Nishanth Aravamudan
Cc: Han Pingtian, David Rientjes, penberg, linux-mm, paulus,
Anton Blanchard, mpm, Joonsoo Kim, linuxppc-dev, Wanpeng Li
In-Reply-To: <20140206020833.GD5433@linux.vnet.ibm.com>
On Wed, 5 Feb 2014, Nishanth Aravamudan wrote:
> > Right so if we are ignoring the node then the simplest thing to do is to
> > not deactivate the current cpu slab but to take an object from it.
>
> Ok, that's what Anton's patch does, I believe. Are you ok with that
> patch as it is?
No. Again his patch only works if the node is memoryless not if there are
other issues that prevent allocation from that node.
^ permalink raw reply
* Re: [PATCH 1/2] PPC: powernv: remove redundant cpuidle_idle_call()
From: Preeti U Murthy @ 2014-02-06 16:49 UTC (permalink / raw)
To: Daniel Lezcano
Cc: Nicolas Pitre, Lists linaro-kernel, linux-pm@vger.kernel.org,
Peter Zijlstra, Rafael J. Wysocki, LKML, Ingo Molnar,
Thomas Gleixner, linuxppc-dev, Linux ARM Kernel ML
In-Reply-To: <CAKnoXLxCAqKaniGOwegNcOarS4FpoNKDY5PxOhe6j4wdtUAkLQ@mail.gmail.com>
Hi Daniel,
On 02/06/2014 09:55 PM, Daniel Lezcano wrote:
> Hi Nico,
>
>
> On 6 February 2014 14:16, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
>
>> The core idle loop now takes care of it.
>>
>> Signed-off-by: Nicolas Pitre <nico@linaro.org>
>> ---
>> arch/powerpc/platforms/powernv/setup.c | 13 +------------
>> 1 file changed, 1 insertion(+), 12 deletions(-)
>>
>> diff --git a/arch/powerpc/platforms/powernv/setup.c
>> b/arch/powerpc/platforms/powernv/setup.c
>> index 21166f65c9..a932feb290 100644
>> --- a/arch/powerpc/platforms/powernv/setup.c
>> +++ b/arch/powerpc/platforms/powernv/setup.c
>> @@ -26,7 +26,6 @@
>> #include <linux/of_fdt.h>
>> #include <linux/interrupt.h>
>> #include <linux/bug.h>
>> -#include <linux/cpuidle.h>
>>
>> #include <asm/machdep.h>
>> #include <asm/firmware.h>
>> @@ -217,16 +216,6 @@ static int __init pnv_probe(void)
>> return 1;
>> }
>>
>> -void powernv_idle(void)
>> -{
>> - /* Hook to cpuidle framework if available, else
>> - * call on default platform idle code
>> - */
>> - if (cpuidle_idle_call()) {
>> - power7_idle();
>> - }
>>
>
> The cpuidle_idle_call is called from arch_cpu_idle in
> arch/powerpc/kernel/idle.c between a ppc64_runlatch_off|on section.
> Shouldn't the cpuidle-powernv driver call these functions when entering
> idle ?
Yes, they should. I will send out a patch that does that on top of this.
There have been cpuidle driver cleanups for powernv and pseries in this
merge window. While no change would be required in the pseries cpuidle
driver as a result of Nicolas's cleanup, we would need to add the
ppc64_runlatch_on and off functions before and after the entry into the
powernv idle states.
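
As a rough sketch of the ordering described above, assuming nothing about the eventual patch: the stub_* functions below are hypothetical stand-ins for ppc64_runlatch_off(), ppc64_runlatch_on() and the driver's idle-state entry, and they only record the order in which they are called. idle_enter_with_runlatch() is likewise an invented name.

```c
#include <string.h>

/* Records the order of calls so the bracketing can be inspected. */
static char trace[64];

static void record(const char *step)
{
	strncat(trace, step, sizeof(trace) - strlen(trace) - 1);
}

/* Hypothetical stand-ins; the real helpers act on the CPU, not a string. */
static void stub_runlatch_off(void) { record("off;"); }
static void stub_runlatch_on(void)  { record("on;"); }

static int stub_enter_idle_state(int index)
{
	record("idle;");
	return index;
}

/* Clear the run latch before entering idle and set it again on wakeup. */
static int idle_enter_with_runlatch(int index)
{
	int ret;

	stub_runlatch_off();
	ret = stub_enter_idle_state(index);
	stub_runlatch_on();
	return ret;
}
```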
Thanks
Regards
Preeti U Murthy
>
> -- Daniel
>
>
>> -}
>> -
>> define_machine(powernv) {
>> .name = "PowerNV",
>> .probe = pnv_probe,
>> @@ -236,7 +225,7 @@ define_machine(powernv) {
>> .show_cpuinfo = pnv_show_cpuinfo,
>> .progress = pnv_progress,
>> .machine_shutdown = pnv_shutdown,
>> - .power_save = powernv_idle,
>> + .power_save = power7_idle,
>> .calibrate_decr = generic_calibrate_decr,
>> #ifdef CONFIG_KEXEC
>> .kexec_cpu_down = pnv_kexec_cpu_down,
>> --
>> 1.8.4.108.g55ea5f6
>>
>>
>
^ permalink raw reply
* Re: [PATCH v2 6/6] cpu/idle.c: move to sched/idle.c
From: Peter Zijlstra @ 2014-02-06 16:43 UTC (permalink / raw)
To: Nicolas Pitre
Cc: linaro-kernel, Russell King, linux-sh, linux-pm, Daniel Lezcano,
Rafael J. Wysocki, linux-kernel, Paul Mundt, Preeti U Murthy,
Thomas Gleixner, linuxppc-dev, Ingo Molnar, linux-arm-kernel
In-Reply-To: <alpine.LFD.2.11.1402061152550.1906@knanqh.ubzr>
On Thu, Feb 06, 2014 at 02:09:59PM +0000, Nicolas Pitre wrote:
> Hi Peter,
>
> Did you merge those patches in your tree?
tree, tree, what's in a word. It's in my patch stack, yes. I should get
some of that into tip I suppose, been side-tracked a bit this week.
Sorry for the delay.
> If so, is it published somewhere?
http://programming.kicks-ass.net/sekrit/patches.tar.bz2
It is of varying quality at best.
> That would be a good idea if that could appear in linux-next
> so to prevent people from adding more calls to cpuidle_idle_call() from
> architecture code. I'm sending you 2 additional patches right away to
> remove those that appeared in v3.14-rc1.
Yeah, once they land in tip they'll end up in -next automagically. I'll
try and get that sorted tomorrow somewhere.
^ permalink raw reply
* Re: [PATCH] Convert powerpc simple spinlocks into ticket locks
From: Peter Zijlstra @ 2014-02-06 16:38 UTC (permalink / raw)
To: Torsten Duwe
Cc: linux-kernel, Paul Mackerras, Anton Blanchard, Paul E. McKenney,
linuxppc-dev, Ingo Molnar
In-Reply-To: <20140206103736.GA18054@lst.de>
On Thu, Feb 06, 2014 at 11:37:37AM +0100, Torsten Duwe wrote:
> x86 has them, MIPS has them, ARM has them, even ia64 has them:
> ticket locks. They reduce memory bus and cache pressure especially
> for contended spinlocks, increasing performance.
>
> This patch is a port of the x86 spin locks, mostly written in C,
> to the powerpc, introducing inline asm where needed. The pSeries
> directed yield for vCPUs is taken care of by an additional "holder"
> field in the lock.
>
A few questions; what's with the ppc64 holder thing? Not having a 32bit
spinlock_t is sad.
Can you pair lwarx with sthcx ? I couldn't immediately find the answer
in the PowerISA doc. If so I think you can do better by being able to
atomically load both tickets but only storing the head without affecting
the tail.
In that case you can avoid the ll/sc on unlock, because only the lock
owner can modify the tail, so you can use a single half-word store.
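
To make the unlock argument concrete, here is a minimal user-space sketch of a ticket lock using C11 atomics. It is not the code from the patch: the struct layout, the field names next and owner, and the function names are assumptions chosen for illustration, and the pSeries holder field is left out entirely. It only shows that, because the owner field is written solely by the current lock holder, release can be a plain half-word store rather than a read-modify-write loop.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical layout: two 16-bit tickets, 32 bits in total. */
struct ticket_lock {
	_Atomic uint16_t owner;	/* ticket currently being served */
	_Atomic uint16_t next;	/* next ticket to hand out */
};

static void ticket_lock_init(struct ticket_lock *l)
{
	atomic_init(&l->owner, 0);
	atomic_init(&l->next, 0);
}

static void ticket_lock_acquire(struct ticket_lock *l)
{
	/* Take a ticket; this is the only atomic read-modify-write. */
	uint16_t me = atomic_fetch_add_explicit(&l->next, 1,
						memory_order_relaxed);

	/* Spin until our ticket is being served. */
	while (atomic_load_explicit(&l->owner, memory_order_acquire) != me)
		;
}

static void ticket_lock_release(struct ticket_lock *l)
{
	/*
	 * Only the lock holder writes owner, so a load followed by a
	 * release store is enough; no compare-and-swap or ll/sc loop.
	 */
	uint16_t served = atomic_load_explicit(&l->owner,
					       memory_order_relaxed);

	atomic_store_explicit(&l->owner, (uint16_t)(served + 1),
			      memory_order_release);
}
```

Waiters are served in ticket order, and each waiter only reads owner while spinning, which is where the reduced cache traffic under contention comes from.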
^ permalink raw reply
* Re: [PATCH 1/2] PPC: powernv: remove redundant cpuidle_idle_call()
From: Daniel Lezcano @ 2014-02-06 16:25 UTC (permalink / raw)
To: Nicolas Pitre
Cc: Lists linaro-kernel, linux-pm@vger.kernel.org, Peter Zijlstra,
Rafael J. Wysocki, LKML, Ingo Molnar, Preeti U Murthy,
Thomas Gleixner, linuxppc-dev, Linux ARM Kernel ML
In-Reply-To: <1391696188-14540-1-git-send-email-nicolas.pitre@linaro.org>
Hi Nico,
On 6 February 2014 14:16, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> The core idle loop now takes care of it.
>
> Signed-off-by: Nicolas Pitre <nico@linaro.org>
> ---
> arch/powerpc/platforms/powernv/setup.c | 13 +------------
> 1 file changed, 1 insertion(+), 12 deletions(-)
>
> diff --git a/arch/powerpc/platforms/powernv/setup.c
> b/arch/powerpc/platforms/powernv/setup.c
> index 21166f65c9..a932feb290 100644
> --- a/arch/powerpc/platforms/powernv/setup.c
> +++ b/arch/powerpc/platforms/powernv/setup.c
> @@ -26,7 +26,6 @@
> #include <linux/of_fdt.h>
> #include <linux/interrupt.h>
> #include <linux/bug.h>
> -#include <linux/cpuidle.h>
>
> #include <asm/machdep.h>
> #include <asm/firmware.h>
> @@ -217,16 +216,6 @@ static int __init pnv_probe(void)
> return 1;
> }
>
> -void powernv_idle(void)
> -{
> - /* Hook to cpuidle framework if available, else
> - * call on default platform idle code
> - */
> - if (cpuidle_idle_call()) {
> - power7_idle();
> - }
>
The cpuidle_idle_call is called from arch_cpu_idle in
arch/powerpc/kernel/idle.c between a ppc64_runlatch_off|on section.
Shouldn't the cpuidle-powernv driver call these functions when entering
idle ?
-- Daniel
> -}
> -
> define_machine(powernv) {
> .name = "PowerNV",
> .probe = pnv_probe,
> @@ -236,7 +225,7 @@ define_machine(powernv) {
> .show_cpuinfo = pnv_show_cpuinfo,
> .progress = pnv_progress,
> .machine_shutdown = pnv_shutdown,
> - .power_save = powernv_idle,
> + .power_save = power7_idle,
> .calibrate_decr = generic_calibrate_decr,
> #ifdef CONFIG_KEXEC
> .kexec_cpu_down = pnv_kexec_cpu_down,
> --
> 1.8.4.108.g55ea5f6
>
>
^ permalink raw reply
* Re: [PATCH 2/2] ARM64: powernv: remove redundant cpuidle_idle_call()
From: Daniel Lezcano @ 2014-02-06 16:20 UTC (permalink / raw)
To: Nicolas Pitre
Cc: Lists linaro-kernel, linux-pm@vger.kernel.org, Peter Zijlstra,
Rafael J. Wysocki, LKML, Ingo Molnar, Preeti U Murthy,
Thomas Gleixner, linuxppc-dev, Linux ARM Kernel ML
In-Reply-To: <1391696188-14540-2-git-send-email-nicolas.pitre@linaro.org>
On 6 February 2014 14:16, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> The core idle loop now takes care of it.
>
> Signed-off-by: Nicolas Pitre <nico@linaro.org>
>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
^ permalink raw reply