LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
From: Nishanth Aravamudan @ 2014-02-18 17:28 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Han Pingtian, Matt Mackall, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	David Rientjes, Joonsoo Kim, linuxppc-dev, Wanpeng Li
In-Reply-To: <alpine.DEB.2.10.1402181051560.1291@nuc>

On 18.02.2014 [10:57:09 -0600], Christoph Lameter wrote:
> On Mon, 17 Feb 2014, Joonsoo Kim wrote:
> 
> > On Wed, Feb 12, 2014 at 10:51:37PM -0800, Nishanth Aravamudan wrote:
> > > Hi Joonsoo,
> > > Also, given that only ia64 and (hopefuly soon) ppc64 can set
> > > CONFIG_HAVE_MEMORYLESS_NODES, does that mean x86_64 can't have
> > > memoryless nodes present? Even with fakenuma? Just curious.
> 
> x86_64 currently does not support memoryless nodes otherwise it would
> have set CONFIG_HAVE_MEMORYLESS_NODES in the kconfig. Memoryless nodes are
> a bit strange given that the NUMA paradigm is to have NUMA nodes (meaning
> memory) with processors. MEMORYLESS nodes means that we have a fake NUMA
> node without memory but just processors. Not very efficient. Not sure why
> people use these configurations.

Well, on powerpc, with the hypervisor providing the resources and the
topology, you can have cpuless and memoryless nodes. I'm not sure how
"fake" the NUMA is -- as I think since the resources are virtualized to
be one system, it's logically possible that the actual topology of the
resources can be CPUs from physical node 0 and memory from physical node
2. I would think with KVM on a sufficiently large (physically NUMA
x86_64) and loaded system, one could cause the same sort of
configuration to occur for a guest?

In any case, these configurations happen fairly often on long-running
(not rebooted) systems as LPARs are created/destroyed, resources are
DLPAR'd in and out of LPARs, etc.

> > I don't know, because I'm not expert on NUMA system :)
> > At first glance, fakenuma can't be used for testing
> > CONFIG_HAVE_MEMORYLESS_NODES. Maybe some modification is needed.
> 
> Well yeah. You'd have to do some mods to enable that testing.

I might look into it, as it might have sped up testing these changes.

Thanks,
Nish

^ permalink raw reply

* Re: Anyone using SysRQ key sequences on console serial port ?
From: John Donnelly @ 2014-02-18 19:47 UTC (permalink / raw)
  To: Scott Wood; +Cc: Paul Gortmaker, linuxppc-dev
In-Reply-To: <1392665821.6733.630.camel@snotra.buserror.net>

[-- Attachment #1: Type: text/plain, Size: 1900 bytes --]

I am enable to get one keyboard sequence responded to with the noted change
in the dts .

 for instance:

   SysRQ ( Break)   c

  Panics .. Which is a good response, and since it doesn't require a return
to user mode ( I suspect )  it appears to work.

Any other requests fail to report any information :

 SysRQ (break )  l - list active processes
                         m - list memory

  Any additional SysRQ are ignored., and the system appears hung.

 On an reference Intel platform, multiple SyqRQ can be issued and the
system remains healthy .

On Mon, Feb 17, 2014 at 1:37 PM, Scott Wood <scottwood@freescale.com> wrote:

> On Sun, 2014-02-16 at 10:56 -0500, Paul Gortmaker wrote:
> > On Fri, Feb 14, 2014 at 3:42 PM, John Donnelly <john.d@servergy.com>
> wrote:
> > > Hi,
> > >
> > >  I tried using the SysRq hotkey sequence on a serial console -
> > > 3.11.0-5-powerpc-e500mc system, by issuing a " break " and the system
> > > immediately wedges after displaying "SysRQ : HELP : " using both
> "Putty" and
> > > "Teraterm"  terminal emulators. I know the system is dead because my
> ssh
> > > sessions stopped too.
> >
> > Yes it does work -- or at least it _did_ work.  Make sure your dts has
> an entry
> >
> >                         compatible = "fsl,ns16550", "ns16550";
> >
> > since that enables a workaround I'd added for a hardware errata relating
> > to sending breaks over the serial console.   What you describe above
> > makes me think you aren't getting the workaround enabled.
>
> Also make sure CONFIG_SERIAL_8250_FSL is enabled.
>
> -Scott
>
>
>

-- 

*Regards,*
* John.*

*--*

*o* Energy-efficiency is #1 reason data centers look to expand.  -- Digital
Realty Trust
*o* Green Data Centers spending to increase 300% worldwide by 2016.  --
Pike Research
*o *Data Centers have become as vital to the functioni
ng of society as power stations.  -- The Economist

[-- Attachment #2: Type: text/html, Size: 4606 bytes --]

^ permalink raw reply

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
From: Christoph Lameter @ 2014-02-18 19:58 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Han Pingtian, Matt Mackall, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	David Rientjes, Joonsoo Kim, linuxppc-dev, Wanpeng Li
In-Reply-To: <20140218172832.GD31998@linux.vnet.ibm.com>

On Tue, 18 Feb 2014, Nishanth Aravamudan wrote:

>
> Well, on powerpc, with the hypervisor providing the resources and the
> topology, you can have cpuless and memoryless nodes. I'm not sure how
> "fake" the NUMA is -- as I think since the resources are virtualized to
> be one system, it's logically possible that the actual topology of the
> resources can be CPUs from physical node 0 and memory from physical node
> 2. I would think with KVM on a sufficiently large (physically NUMA
> x86_64) and loaded system, one could cause the same sort of
> configuration to occur for a guest?

Ok but since you have a virtualized environment: Why not provide a fake
home node with fake memory that could be anywhere? This would avoid the
whole problem of supporting such a config at the kernel level.

Do not have a fake node that has no memory.

> In any case, these configurations happen fairly often on long-running
> (not rebooted) systems as LPARs are created/destroyed, resources are
> DLPAR'd in and out of LPARs, etc.

Ok then also move the memory of the local node somewhere?

> I might look into it, as it might have sped up testing these changes.

I guess that will be necessary in order to support the memoryless nodes
long term.

^ permalink raw reply

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
From: Nishanth Aravamudan @ 2014-02-18 21:09 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Han Pingtian, Matt Mackall, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	David Rientjes, Joonsoo Kim, linuxppc-dev, Wanpeng Li
In-Reply-To: <alpine.DEB.2.10.1402181356120.2910@nuc>

On 18.02.2014 [13:58:20 -0600], Christoph Lameter wrote:
> On Tue, 18 Feb 2014, Nishanth Aravamudan wrote:
> 
> >
> > Well, on powerpc, with the hypervisor providing the resources and the
> > topology, you can have cpuless and memoryless nodes. I'm not sure how
> > "fake" the NUMA is -- as I think since the resources are virtualized to
> > be one system, it's logically possible that the actual topology of the
> > resources can be CPUs from physical node 0 and memory from physical node
> > 2. I would think with KVM on a sufficiently large (physically NUMA
> > x86_64) and loaded system, one could cause the same sort of
> > configuration to occur for a guest?
> 
> Ok but since you have a virtualized environment: Why not provide a fake
> home node with fake memory that could be anywhere? This would avoid the
> whole problem of supporting such a config at the kernel level.

We use the topology provided by the hypervisor, it does actually reflect
where CPUs and memory are, and their corresponding performance/NUMA
characteristics.

> Do not have a fake node that has no memory.
> 
> > In any case, these configurations happen fairly often on long-running
> > (not rebooted) systems as LPARs are created/destroyed, resources are
> > DLPAR'd in and out of LPARs, etc.
> 
> Ok then also move the memory of the local node somewhere?

This happens below the OS, we don't control the hypervisor's decisions.
I'm not sure if that's what you are suggesting.

Thanks,
Nish

^ permalink raw reply

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
From: Christoph Lameter @ 2014-02-18 21:49 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Han Pingtian, Matt Mackall, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	David Rientjes, Joonsoo Kim, linuxppc-dev, Wanpeng Li
In-Reply-To: <20140218210923.GA28170@linux.vnet.ibm.com>

On Tue, 18 Feb 2014, Nishanth Aravamudan wrote:

> We use the topology provided by the hypervisor, it does actually reflect
> where CPUs and memory are, and their corresponding performance/NUMA
> characteristics.

And so there are actually nodes without memory that have processors?
Can the hypervisor or the linux arch code be convinced to ignore nodes
without memory or assign a sane default node to processors?

> > Ok then also move the memory of the local node somewhere?
>
> This happens below the OS, we don't control the hypervisor's decisions.
> I'm not sure if that's what you are suggesting.

You could also do this from the powerpc arch code by sanitizing the
processor / node information that is then used by Linux.

^ permalink raw reply

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
From: Nishanth Aravamudan @ 2014-02-18 22:22 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Han Pingtian, Matt Mackall, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	David Rientjes, Joonsoo Kim, linuxppc-dev, Wanpeng Li
In-Reply-To: <alpine.DEB.2.10.1402181547210.3973@nuc>

On 18.02.2014 [15:49:22 -0600], Christoph Lameter wrote:
> On Tue, 18 Feb 2014, Nishanth Aravamudan wrote:
> 
> > We use the topology provided by the hypervisor, it does actually reflect
> > where CPUs and memory are, and their corresponding performance/NUMA
> > characteristics.
> 
> And so there are actually nodes without memory that have processors?

Virtually (topologically as indicated to Linux), yes. Physically, I
don't think they are, but they might be exhausted, which is we get sort
of odd-appearing NUMA configurations.

> Can the hypervisor or the linux arch code be convinced to ignore nodes
> without memory or assign a sane default node to processors?

I think this happens quite often, so I don't know that we want to ignore
the performance impact of the underlying NUMA configuration. I guess we
could special-case memoryless/cpuless configurations somewhat, but I
don't think there's any reason to do that if we can make memoryless-node
support work in-kernel?

> > > Ok then also move the memory of the local node somewhere?
> >
> > This happens below the OS, we don't control the hypervisor's decisions.
> > I'm not sure if that's what you are suggesting.
> 
> You could also do this from the powerpc arch code by sanitizing the
> processor / node information that is then used by Linux.

I see what you're saying now, thanks!

-Nish

^ permalink raw reply

* Re: ppc: RECLAIM_DISTANCE 10?
From: David Rientjes @ 2014-02-18 22:27 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linuxppc-dev, Anton Blanchard, LKML, linux-mm
In-Reply-To: <20140218090658.GA28130@dhcp22.suse.cz>

On Tue, 18 Feb 2014, Michal Hocko wrote:

> Hi,
> I have just noticed that ppc has RECLAIM_DISTANCE reduced to 10 set by
> 56608209d34b (powerpc/numa: Set a smaller value for RECLAIM_DISTANCE to
> enable zone reclaim). The commit message suggests that the zone reclaim
> is desirable for all NUMA configurations.
> 
> History has shown that the zone reclaim is more often harmful than
> helpful and leads to performance problems. The default RECLAIM_DISTANCE
> for generic case has been increased from 20 to 30 around 3.0
> (32e45ff43eaf mm: increase RECLAIM_DISTANCE to 30).
> 
> I strongly suspect that the patch is incorrect and it should be
> reverted. Before I will send a revert I would like to understand what
> led to the patch in the first place. I do not see why would PPC use only
> LOCAL_DISTANCE and REMOTE_DISTANCE distances and in fact machines I have
> seen use different values.
> 

I strongly suspect that the patch is correct since powerpc node distances 
are different than the architectures you're talking about and get doubled 
for every NUMA domain that the hardware supports.

^ permalink raw reply

* Re: [PATCH v3 3/3] powerpc/pseries: Report in kernel device tree update to drmgr
From: Tyrel Datwyler @ 2014-02-18 23:28 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Tyrel Datwyler; +Cc: nfont, linuxppc-dev
In-Reply-To: <1392596560.24061.12.camel@pasglop>

On 02/16/2014 04:22 PM, Benjamin Herrenschmidt wrote:
> On Fri, 2014-01-31 at 15:58 -0800, Tyrel Datwyler wrote:
>> Traditionally it has been drmgr's responsibilty to update the device tree
>> through the /proc/ppc64/ofdt interface after a suspend/resume operation.
>> This patchset however has modified suspend/resume ops to preform that update
>> entirely in the kernel during the resume. Therefore, a mechanism is required
>> for drmgr to determine who is responsible for the update. This patch adds a
>> show function to the "hibernate" attribute that returns 1 if the kernel
>> updates the device tree after the resume and 0 if drmgr is responsible.
> 
> This is not right.
> 
> We cannot change the kernel behaviour in a way that is incompatible with
> existing userspace, and unless you can explain me why that is not the
> case, it feels like this patch set does just that ....
> 
> What happens if you have an older drmgr which still does the update
> itself ?
> 

In this case we get a double update which I clearly neglected to mention
in the patch. The first patch in this series actually removes an
unnecessary double update from the existing kernel implementation. The
same information is returned by the update-nodes/properties call the
second time around and aside from being a waste of a few cycles is harmless.

Tyrel

> You can't just change user visible behaviours and interface without at
> least explaining why it's ok to do so at the very least (and ensuring
> that of course) without breaking existing userspace.
> 
> Now it's possible that the double update caused by this series is
> harmless, in which case please explain it in the changeset, I certainly
> can't assume it and if it's not, you'll need some other way for drmgr to
> signal the kernel to indicate it supports the new behaviour before you
> enable it.
> 
> Ben.
> 
>> Signed-off-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>
>> ---
>>  arch/powerpc/platforms/pseries/suspend.c | 25 ++++++++++++++++++++++++-
>>  1 file changed, 24 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/platforms/pseries/suspend.c b/arch/powerpc/platforms/pseries/suspend.c
>> index 1d9c580..b87b978 100644
>> --- a/arch/powerpc/platforms/pseries/suspend.c
>> +++ b/arch/powerpc/platforms/pseries/suspend.c
>> @@ -192,7 +192,30 @@ out:
>>  	return rc;
>>  }
>>  
>> -static DEVICE_ATTR(hibernate, S_IWUSR, NULL, store_hibernate);
>> +#define USER_DT_UPDATE	0
>> +#define KERN_DT_UPDATE	1
>> +
>> +/**
>> + * show_hibernate - Report device tree update responsibilty
>> + * @dev:		subsys root device
>> + * @attr:		device attribute struct
>> + * @buf:		buffer
>> + *
>> + * Report whether a device tree update is performed by the kernel after a
>> + * resume, or if drmgr must coordinate the update from user space.
>> + *
>> + * Return value:
>> + *	0 if drmgr is to initiate update, and 1 otherwise
>> + **/
>> +static ssize_t show_hibernate(struct device *dev,
>> +			      struct device_attribute *attr,
>> +			      char *buf)
>> +{
>> +	return sprintf(buf, "%d\n", KERN_DT_UPDATE);
>> +}
>> +
>> +static DEVICE_ATTR(hibernate, S_IWUSR | S_IRUGO,
>> +		   show_hibernate, store_hibernate);
>>  
>>  static struct bus_type suspend_subsys = {
>>  	.name = "power",
> 
> 
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
> 

^ permalink raw reply

* Re: ppc: RECLAIM_DISTANCE 10?
From: Nishanth Aravamudan @ 2014-02-18 23:34 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, linuxppc-dev, Anton Blanchard, LKML
In-Reply-To: <20140218090658.GA28130@dhcp22.suse.cz>

Hi Michal,

On 18.02.2014 [10:06:58 +0100], Michal Hocko wrote:
> Hi,
> I have just noticed that ppc has RECLAIM_DISTANCE reduced to 10 set by
> 56608209d34b (powerpc/numa: Set a smaller value for RECLAIM_DISTANCE to
> enable zone reclaim). The commit message suggests that the zone reclaim
> is desirable for all NUMA configurations.
> 
> History has shown that the zone reclaim is more often harmful than
> helpful and leads to performance problems. The default RECLAIM_DISTANCE
> for generic case has been increased from 20 to 30 around 3.0
> (32e45ff43eaf mm: increase RECLAIM_DISTANCE to 30).

Interesting.

> I strongly suspect that the patch is incorrect and it should be
> reverted. Before I will send a revert I would like to understand what
> led to the patch in the first place. I do not see why would PPC use only
> LOCAL_DISTANCE and REMOTE_DISTANCE distances and in fact machines I have
> seen use different values.
> 
> Anton, could you comment please?

I'll let Anton comment here, but in looking into this issue in working
on CONFIG_HAVE_MEMORYLESS_NODE support, I realized that any LPAR with
memoryless nodes will set zone_reclaim_mode to 1. I think we want to
ignore memoryless nodes when we set up the reclaim mode like the
following? I'll send it as a proper patch if you agree?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5de4337..4f6ff6f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1853,8 +1853,9 @@ static void __paginginit init_zone_allows_reclaim(int nid)
 {
        int i;
 
-       for_each_online_node(i)
-               if (node_distance(nid, i) <= RECLAIM_DISTANCE)
+       for_each_online_node(i) {
+               if (node_distance(nid, i) <= RECLAIM_DISTANCE ||
+                                       local_memory_node(nid) != nid)
                        node_set(i, NODE_DATA(nid)->reclaim_nodes);
                else
                        zone_reclaim_mode = 1;

Note, this won't actually do anything if CONFIG_HAVE_MEMORYLESS_NODES is
not set, but if it is, I think semantically it will indicate that
memoryless nodes *have* to reclaim remotely.

And actually the above won't work, because the callpath is

start_kernel -> setup_arch -> paging_init [-> free_area_init_nodes ->
free_area_init_node -> init_zone_allows_reclaim] which is called before
build_all_zonelists. This is a similar ordering problem as I'm having
with the MEMORYLESS_NODE support, will work on it.

Thanks,
Nish

^ permalink raw reply related

* Re: [PATCH v3 3/3] powerpc/pseries: Report in kernel device tree update to drmgr
From: Benjamin Herrenschmidt @ 2014-02-18 23:47 UTC (permalink / raw)
  To: Tyrel Datwyler; +Cc: nfont, linuxppc-dev, Tyrel Datwyler
In-Reply-To: <5303EC9E.7070507@gmail.com>

On Tue, 2014-02-18 at 15:28 -0800, Tyrel Datwyler wrote:
> In this case we get a double update which I clearly neglected to
> mention
> in the patch. The first patch in this series actually removes an
> unnecessary double update from the existing kernel implementation. The
> same information is returned by the update-nodes/properties call the
> second time around and aside from being a waste of a few cycles is
> harmless.

Thanks. Can you resend the patches with updated descriptions ? I will
try to send them to Linus by end of week.

Cheers,
Ben.

^ permalink raw reply

* Re: ppc: RECLAIM_DISTANCE 10?
From: Nishanth Aravamudan @ 2014-02-18 23:58 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, linuxppc-dev, Anton Blanchard, LKML
In-Reply-To: <20140218233404.GB10844@linux.vnet.ibm.com>

On 18.02.2014 [15:34:05 -0800], Nishanth Aravamudan wrote:
> Hi Michal,
> 
> On 18.02.2014 [10:06:58 +0100], Michal Hocko wrote:
> > Hi,
> > I have just noticed that ppc has RECLAIM_DISTANCE reduced to 10 set by
> > 56608209d34b (powerpc/numa: Set a smaller value for RECLAIM_DISTANCE to
> > enable zone reclaim). The commit message suggests that the zone reclaim
> > is desirable for all NUMA configurations.
> > 
> > History has shown that the zone reclaim is more often harmful than
> > helpful and leads to performance problems. The default RECLAIM_DISTANCE
> > for generic case has been increased from 20 to 30 around 3.0
> > (32e45ff43eaf mm: increase RECLAIM_DISTANCE to 30).
> 
> Interesting.
> 
> > I strongly suspect that the patch is incorrect and it should be
> > reverted. Before I will send a revert I would like to understand what
> > led to the patch in the first place. I do not see why would PPC use only
> > LOCAL_DISTANCE and REMOTE_DISTANCE distances and in fact machines I have
> > seen use different values.
> > 
> > Anton, could you comment please?
> 
> I'll let Anton comment here, but in looking into this issue in working
> on CONFIG_HAVE_MEMORYLESS_NODE support, I realized that any LPAR with
> memoryless nodes will set zone_reclaim_mode to 1. I think we want to
> ignore memoryless nodes when we set up the reclaim mode like the
> following? I'll send it as a proper patch if you agree?
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5de4337..4f6ff6f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1853,8 +1853,9 @@ static void __paginginit init_zone_allows_reclaim(int nid)
>  {
>         int i;
> 
> -       for_each_online_node(i)
> -               if (node_distance(nid, i) <= RECLAIM_DISTANCE)
> +       for_each_online_node(i) {
> +               if (node_distance(nid, i) <= RECLAIM_DISTANCE ||
> +                                       local_memory_node(nid) != nid)
>                         node_set(i, NODE_DATA(nid)->reclaim_nodes);
>                 else
>                         zone_reclaim_mode = 1;
> 
> Note, this won't actually do anything if CONFIG_HAVE_MEMORYLESS_NODES is
> not set, but if it is, I think semantically it will indicate that
> memoryless nodes *have* to reclaim remotely.
> 
> And actually the above won't work, because the callpath is
> 
> start_kernel -> setup_arch -> paging_init [-> free_area_init_nodes ->
> free_area_init_node -> init_zone_allows_reclaim] which is called before
> build_all_zonelists. This is a similar ordering problem as I'm having
> with the MEMORYLESS_NODE support, will work on it.

How about the following?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5de4337..1a0eced 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1854,7 +1854,8 @@ static void __paginginit init_zone_allows_reclaim(int nid)
        int i;
 
        for_each_online_node(i)
-               if (node_distance(nid, i) <= RECLAIM_DISTANCE)
+               if (node_distance(nid, i) <= RECLAIM_DISTANCE ||
+                                       !NODE_DATA(nid)->node_present_pages)
                        node_set(i, NODE_DATA(nid)->reclaim_nodes);
                else
                        zone_reclaim_mode = 1;
@@ -4901,13 +4902,13 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
 
        pgdat->node_id = nid;
        pgdat->node_start_pfn = node_start_pfn;
-       init_zone_allows_reclaim(nid);
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
        get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
 #endif
        calculate_node_totalpages(pgdat, start_pfn, end_pfn,
                                  zones_size, zholes_size);
 
+       init_zone_allows_reclaim(nid);
        alloc_node_mem_map(pgdat);
 #ifdef CONFIG_FLAT_NODE_MEM_MAP
        printk(KERN_DEBUG "free_area_init_node: node %d, pgdat %08lx, node_mem_map %08lx\n",

I think it's safe to move init_zone_allows_reclaim, because I don't
think any allocates are occurring here that could cause us to reclaim
anyways, right? Moving it allows us to safely reference
node_present_pages.

Thanks,
Nish

^ permalink raw reply related

* Re: [PATCH][v2] powerpc/fsl: Add/update miscellaneous missing bindings
From: Scott Wood @ 2014-02-19  0:20 UTC (permalink / raw)
  To: Harninder Rai; +Cc: devicetree, linuxppc-dev
In-Reply-To: <1392276554-10368-1-git-send-email-harninder.rai@freescale.com>

On Thu, 2014-02-13 at 12:59 +0530, Harninder Rai wrote:
> Missing bindings were found on running checkpatch.pl on bsc9132
> device tree. This patch add/update the following
> 
> - Add bindings for L2 cache controller
> - Add bindings for memory controller
> - Update bindings for USB controller
> 
> Signed-off-by: Harninder Rai <harninder.rai@freescale.com>
> ---
> Changes since base version:
> Incorporated Scott's comments
> - Rename l2cc.txt to l2cache.txt
> - Add information about ePAPR compliance
> - Add missing "cache" in compatible
> - Miscellaneous minors
> 
>  .../devicetree/bindings/powerpc/fsl/l2cache.txt    |   26 ++++++++++++++++++++
>  .../devicetree/bindings/powerpc/fsl/mem-ctrlr.txt  |   16 ++++++++++++
>  Documentation/devicetree/bindings/usb/fsl-usb.txt  |    2 +
>  3 files changed, 44 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/devicetree/bindings/powerpc/fsl/l2cache.txt
>  create mode 100644 Documentation/devicetree/bindings/powerpc/fsl/mem-ctrlr.txt
> 
> diff --git a/Documentation/devicetree/bindings/powerpc/fsl/l2cache.txt b/Documentation/devicetree/bindings/powerpc/fsl/l2cache.txt
> new file mode 100644
> index 0000000..79ef4a1
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/powerpc/fsl/l2cache.txt
> @@ -0,0 +1,26 @@
> +Freescale L2 Cache Controller
> +
> +L2 cache is present in Freescale's QorIQ and QorIQ Qonverge platforms.
> +The cache bindings explained below are ePAPR compliant
> +
> +Required Properties:
> +
> +- compatible	: Should include "fsl,chip-l2-cache-controller" and "cache"
> +		  where chip is the processor (bsc9132, npc8572 etc.)
> +- reg		: Address and size of L2 cache controller registers
> +- cache-size	: Size of the entire L2 cache
> +- interrupts	: Error interrupt of L2 controller
> +
> +Optional Properties:
> +
> +- cache-line-size : Size of L2 cache lines

cache-line-size is required as per ePAPR.

> diff --git a/Documentation/devicetree/bindings/powerpc/fsl/mem-ctrlr.txt b/Documentation/devicetree/bindings/powerpc/fsl/mem-ctrlr.txt
> new file mode 100644
> index 0000000..70b42bb
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/powerpc/fsl/mem-ctrlr.txt
> @@ -0,0 +1,16 @@
> +Freescale DDR memory controller
> +
> +Properties:
> +
> +- compatible	: Should include "fsl,chip-memory-controller" where
> +		  chip is the processor (bsc9132, mpc8572 etc.)

Please also cover newer device trees that use
"fsl,qoriq-memory-controller" and don't have
"fsl,CHIP-memory-controller" (they also use a variant with the block
version, but since the block version is readable in a register I don't
think it's necessary to specify that here).

> diff --git a/Documentation/devicetree/bindings/usb/fsl-usb.txt b/Documentation/devicetree/bindings/usb/fsl-usb.txt
> index bd5723f..afa5809 100644
> --- a/Documentation/devicetree/bindings/usb/fsl-usb.txt
> +++ b/Documentation/devicetree/bindings/usb/fsl-usb.txt
> @@ -9,6 +9,8 @@ Required properties :
>   - compatible : Should be "fsl-usb2-mph" for multi port host USB
>     controllers, or "fsl-usb2-dr" for dual role USB controllers
>     or "fsl,mpc5121-usb2-dr" for dual role USB controllers of MPC5121
> +   Wherever applicable, the IP version of the USB controller should
> +   also be mentioned (for eg. fsl-usb2-dr-v2.2 for bsc9132).

Please terminate the previous sentence with a period.

-Scott

^ permalink raw reply

* Re: ppc: RECLAIM_DISTANCE 10?
From: Nishanth Aravamudan @ 2014-02-19  0:40 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, linuxppc-dev, Anton Blanchard, LKML
In-Reply-To: <20140218235800.GC10844@linux.vnet.ibm.com>

On 18.02.2014 [15:58:00 -0800], Nishanth Aravamudan wrote:
> On 18.02.2014 [15:34:05 -0800], Nishanth Aravamudan wrote:
> > Hi Michal,
> > 
> > On 18.02.2014 [10:06:58 +0100], Michal Hocko wrote:
> > > Hi,
> > > I have just noticed that ppc has RECLAIM_DISTANCE reduced to 10 set by
> > > 56608209d34b (powerpc/numa: Set a smaller value for RECLAIM_DISTANCE to
> > > enable zone reclaim). The commit message suggests that the zone reclaim
> > > is desirable for all NUMA configurations.
> > > 
> > > History has shown that the zone reclaim is more often harmful than
> > > helpful and leads to performance problems. The default RECLAIM_DISTANCE
> > > for generic case has been increased from 20 to 30 around 3.0
> > > (32e45ff43eaf mm: increase RECLAIM_DISTANCE to 30).
> > 
> > Interesting.
> > 
> > > I strongly suspect that the patch is incorrect and it should be
> > > reverted. Before I will send a revert I would like to understand what
> > > led to the patch in the first place. I do not see why would PPC use only
> > > LOCAL_DISTANCE and REMOTE_DISTANCE distances and in fact machines I have
> > > seen use different values.
> > > 
> > > Anton, could you comment please?
> > 
> > I'll let Anton comment here, but in looking into this issue in working
> > on CONFIG_HAVE_MEMORYLESS_NODE support, I realized that any LPAR with
> > memoryless nodes will set zone_reclaim_mode to 1. I think we want to
> > ignore memoryless nodes when we set up the reclaim mode like the
> > following? I'll send it as a proper patch if you agree?
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 5de4337..4f6ff6f 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1853,8 +1853,9 @@ static void __paginginit init_zone_allows_reclaim(int nid)
> >  {
> >         int i;
> > 
> > -       for_each_online_node(i)
> > -               if (node_distance(nid, i) <= RECLAIM_DISTANCE)
> > +       for_each_online_node(i) {
> > +               if (node_distance(nid, i) <= RECLAIM_DISTANCE ||
> > +                                       local_memory_node(nid) != nid)
> >                         node_set(i, NODE_DATA(nid)->reclaim_nodes);
> >                 else
> >                         zone_reclaim_mode = 1;
> > 
> > Note, this won't actually do anything if CONFIG_HAVE_MEMORYLESS_NODES is
> > not set, but if it is, I think semantically it will indicate that
> > memoryless nodes *have* to reclaim remotely.
> > 
> > And actually the above won't work, because the callpath is
> > 
> > start_kernel -> setup_arch -> paging_init [-> free_area_init_nodes ->
> > free_area_init_node -> init_zone_allows_reclaim] which is called before
> > build_all_zonelists. This is a similar ordering problem as I'm having
> > with the MEMORYLESS_NODE support, will work on it.
> 
> How about the following?
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5de4337..1a0eced 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1854,7 +1854,8 @@ static void __paginginit init_zone_allows_reclaim(int nid)
>         int i;
>  
>         for_each_online_node(i)
> -               if (node_distance(nid, i) <= RECLAIM_DISTANCE)
> +               if (node_distance(nid, i) <= RECLAIM_DISTANCE ||
> +                                       !NODE_DATA(nid)->node_present_pages)

err s/nid/i/ above.

-Nish

^ permalink raw reply

* Re: ppc: RECLAIM_DISTANCE 10?
From: David Rientjes @ 2014-02-19  1:43 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Michal Hocko, linux-mm, linuxppc-dev, Anton Blanchard, LKML
In-Reply-To: <20140218235800.GC10844@linux.vnet.ibm.com>

On Tue, 18 Feb 2014, Nishanth Aravamudan wrote:

> How about the following?
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5de4337..1a0eced 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1854,7 +1854,8 @@ static void __paginginit init_zone_allows_reclaim(int nid)
>         int i;
>  
>         for_each_online_node(i)
> -               if (node_distance(nid, i) <= RECLAIM_DISTANCE)
> +               if (node_distance(nid, i) <= RECLAIM_DISTANCE ||
> +                                       !NODE_DATA(i)->node_present_pages)
>                         node_set(i, NODE_DATA(nid)->reclaim_nodes);
>                 else
>                         zone_reclaim_mode = 1;

 [ I changed the above from NODE_DATA(nid) -> NODE_DATA(i) as you caught 
   so we're looking at the right code. ]

That can't be right, it would allow reclaiming from a memoryless node.  I 
think what you want is

	for_each_online_node(i) {
		if (!node_present_pages(i))
			continue;
		if (node_distance(nid, i) <= RECLAIM_DISTANCE) {
			node_set(i, NODE_DATA(nid)->reclaim_nodes);
			continue;
		}
		/* Always try to reclaim locally */
		zone_reclaim_mode = 1;
	}

but we really should be able to do for_each_node_state(i, N_MEMORY) here 
and memoryless nodes should already be excluded from that mask.

> @@ -4901,13 +4902,13 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
>  
>         pgdat->node_id = nid;
>         pgdat->node_start_pfn = node_start_pfn;
> -       init_zone_allows_reclaim(nid);
>  #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
>         get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
>  #endif
>         calculate_node_totalpages(pgdat, start_pfn, end_pfn,
>                                   zones_size, zholes_size);
>  
> +       init_zone_allows_reclaim(nid);
>         alloc_node_mem_map(pgdat);
>  #ifdef CONFIG_FLAT_NODE_MEM_MAP
>         printk(KERN_DEBUG "free_area_init_node: node %d, pgdat %08lx, node_mem_map %08lx\n",
> 
> I think it's safe to move init_zone_allows_reclaim, because I don't
> think any allocates are occurring here that could cause us to reclaim
> anyways, right? Moving it allows us to safely reference
> node_present_pages.
> 

Yeah, this is fine.

^ permalink raw reply

* Re: ppc: RECLAIM_DISTANCE 10?
From: Michal Hocko @ 2014-02-19  8:16 UTC (permalink / raw)
  To: David Rientjes; +Cc: linuxppc-dev, Anton Blanchard, LKML, linux-mm
In-Reply-To: <alpine.DEB.2.02.1402181424490.20772@chino.kir.corp.google.com>

On Tue 18-02-14 14:27:11, David Rientjes wrote:
> On Tue, 18 Feb 2014, Michal Hocko wrote:
> 
> > Hi,
> > I have just noticed that ppc has RECLAIM_DISTANCE reduced to 10 set by
> > 56608209d34b (powerpc/numa: Set a smaller value for RECLAIM_DISTANCE to
> > enable zone reclaim). The commit message suggests that the zone reclaim
> > is desirable for all NUMA configurations.
> > 
> > History has shown that the zone reclaim is more often harmful than
> > helpful and leads to performance problems. The default RECLAIM_DISTANCE
> > for generic case has been increased from 20 to 30 around 3.0
> > (32e45ff43eaf mm: increase RECLAIM_DISTANCE to 30).
> > 
> > I strongly suspect that the patch is incorrect and it should be
> > reverted. Before I will send a revert I would like to understand what
> > led to the patch in the first place. I do not see why would PPC use only
> > LOCAL_DISTANCE and REMOTE_DISTANCE distances and in fact machines I have
> > seen use different values.
> > 
> 
> I strongly suspect that the patch is correct since powerpc node distances 
> are different than the architectures you're talking about and get doubled 
> for every NUMA domain that the hardware supports.

Even if the units of the distance is different on PPC should every NUMA
machine have zone_reclaim enabled? That doesn't right to me.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: ppc: RECLAIM_DISTANCE 10?
From: David Rientjes @ 2014-02-19  8:20 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linuxppc-dev, Anton Blanchard, LKML, linux-mm
In-Reply-To: <20140219081644.GA14783@dhcp22.suse.cz>

On Wed, 19 Feb 2014, Michal Hocko wrote:

> > I strongly suspect that the patch is correct since powerpc node distances 
> > are different than the architectures you're talking about and get doubled 
> > for every NUMA domain that the hardware supports.
> 
> Even if the units of the distance is different on PPC should every NUMA
> machine have zone_reclaim enabled? That doesn't right to me.
> 

In my experience on powerpc it's very correct, there's typically a 
significant latency in remote access and we don't have the benefit of a 
SLIT that actually defines the locality between proximity domains like we 
do on other architectures.  We have had significant issues with thp, for 
example, being allocated remotely instead of pages locally, much more 
drastic than on our x86 machines, particularly AMD machines.

^ permalink raw reply

* Re: ppc: RECLAIM_DISTANCE 10?
From: Michal Hocko @ 2014-02-19  8:23 UTC (permalink / raw)
  To: Nishanth Aravamudan; +Cc: linux-mm, linuxppc-dev, Anton Blanchard, LKML
In-Reply-To: <20140218233404.GB10844@linux.vnet.ibm.com>

On Tue 18-02-14 15:34:05, Nishanth Aravamudan wrote:
> Hi Michal,
> 
> On 18.02.2014 [10:06:58 +0100], Michal Hocko wrote:
> > Hi,
> > I have just noticed that ppc has RECLAIM_DISTANCE reduced to 10 set by
> > 56608209d34b (powerpc/numa: Set a smaller value for RECLAIM_DISTANCE to
> > enable zone reclaim). The commit message suggests that the zone reclaim
> > is desirable for all NUMA configurations.
> > 
> > History has shown that the zone reclaim is more often harmful than
> > helpful and leads to performance problems. The default RECLAIM_DISTANCE
> > for generic case has been increased from 20 to 30 around 3.0
> > (32e45ff43eaf mm: increase RECLAIM_DISTANCE to 30).
> 
> Interesting.
> 
> > I strongly suspect that the patch is incorrect and it should be
> > reverted. Before I will send a revert I would like to understand what
> > led to the patch in the first place. I do not see why would PPC use only
> > LOCAL_DISTANCE and REMOTE_DISTANCE distances and in fact machines I have
> > seen use different values.
> > 
> > Anton, could you comment please?
> 
> I'll let Anton comment here, but in looking into this issue in working
> on CONFIG_HAVE_MEMORYLESS_NODE support, I realized that any LPAR with
> memoryless nodes will set zone_reclaim_mode to 1. I think we want to
> ignore memoryless nodes when we set up the reclaim mode like the
> following? I'll send it as a proper patch if you agree?

Funny enough, ppc memoryless node setup is what led me to this code.
We had a setup like this:
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 0 MB
node 0 free: 0 MB
node 2 cpus:
node 2 size: 7168 MB
node 2 free: 6019 MB
node distances:
node   0   2
0:  10  40
2:  40  10

Which ends up enabling zone_reclaim although there is only a single node
with memory. Not that RECLAIM_DISTANCE would make any difference here as
the distance is even above default RECLAIM_DISTANCE.

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5de4337..4f6ff6f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1853,8 +1853,9 @@ static void __paginginit init_zone_allows_reclaim(int nid)
>  {
>         int i;
>  
> -       for_each_online_node(i)
> -               if (node_distance(nid, i) <= RECLAIM_DISTANCE)
> +       for_each_online_node(i) {
> +               if (node_distance(nid, i) <= RECLAIM_DISTANCE ||
> +                                       local_memory_node(nid) != nid)
>                         node_set(i, NODE_DATA(nid)->reclaim_nodes);
>                 else
>                         zone_reclaim_mode = 1;
> 
> Note, this won't actually do anything if CONFIG_HAVE_MEMORYLESS_NODES is
> not set, but if it is, I think semantically it will indicate that
> memoryless nodes *have* to reclaim remotely.
> 
> And actually the above won't work, because the callpath is
> 
> start_kernel -> setup_arch -> paging_init [-> free_area_init_nodes ->
> free_area_init_node -> init_zone_allows_reclaim] which is called before
> build_all_zonelists. This is a similar ordering problem as I'm having
> with the MEMORYLESS_NODE support, will work on it.

I think you just want for_each_node_state(nid, N_MEMORY) and skip all
the memory less nodes, no?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: ppc: RECLAIM_DISTANCE 10?
From: Michal Hocko @ 2014-02-19  8:33 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, Nishanth Aravamudan, linuxppc-dev, Anton Blanchard,
	LKML
In-Reply-To: <alpine.DEB.2.02.1402181737530.17521@chino.kir.corp.google.com>

On Tue 18-02-14 17:43:38, David Rientjes wrote:
> On Tue, 18 Feb 2014, Nishanth Aravamudan wrote:
> 
> > How about the following?
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 5de4337..1a0eced 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1854,7 +1854,8 @@ static void __paginginit init_zone_allows_reclaim(int nid)
> >         int i;
> >  
> >         for_each_online_node(i)
> > -               if (node_distance(nid, i) <= RECLAIM_DISTANCE)
> > +               if (node_distance(nid, i) <= RECLAIM_DISTANCE ||
> > +                                       !NODE_DATA(i)->node_present_pages)
> >                         node_set(i, NODE_DATA(nid)->reclaim_nodes);
> >                 else
> >                         zone_reclaim_mode = 1;
> 
>  [ I changed the above from NODE_DATA(nid) -> NODE_DATA(i) as you caught 
>    so we're looking at the right code. ]
> 
> That can't be right, it would allow reclaiming from a memoryless node.  I 
> think what you want is
> 
> 	for_each_online_node(i) {
> 		if (!node_present_pages(i))
> 			continue;
> 		if (node_distance(nid, i) <= RECLAIM_DISTANCE) {
> 			node_set(i, NODE_DATA(nid)->reclaim_nodes);
> 			continue;
> 		}
> 		/* Always try to reclaim locally */
> 		zone_reclaim_mode = 1;
> 	}
> 
> but we really should be able to do for_each_node_state(i, N_MEMORY) here 
> and memoryless nodes should already be excluded from that mask.

Agreed! Actually the code I am currently interested in is based on 3.0
kernel where zone_reclaim_mode is set in build_zonelists which relies on
find_next_best_node which iterates only N_HIGH_MEMORY nodes which should
have non 0 present pages.
[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: ppc: RECLAIM_DISTANCE 10?
From: Michal Hocko @ 2014-02-19  9:19 UTC (permalink / raw)
  To: David Rientjes; +Cc: linuxppc-dev, Anton Blanchard, LKML, linux-mm
In-Reply-To: <alpine.DEB.2.02.1402190017480.7280@chino.kir.corp.google.com>

On Wed 19-02-14 00:20:21, David Rientjes wrote:
> On Wed, 19 Feb 2014, Michal Hocko wrote:
> 
> > > I strongly suspect that the patch is correct since powerpc node distances 
> > > are different than the architectures you're talking about and get doubled 
> > > for every NUMA domain that the hardware supports.
> > 
> > Even if the units of the distance is different on PPC should every NUMA
> > machine have zone_reclaim enabled? That doesn't right to me.
> > 
> 
> In my experience on powerpc it's very correct, there's typically a 
> significant latency in remote access and we don't have the benefit of a 
> SLIT that actually defines the locality between proximity domains like we 
> do on other architectures. 

Interesting. So is the PPC NUMA basically about local vs. very distant?
Should REMOTE_DISTANCE reflect that as well? Or can we have 
distance < REMOTE_DISTANCE and it would still make sense to have
zone_reclaim enabled?

[...]

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* [PATCH][v3] powerpc/fsl: Add/update miscellaneous missing binding
From: Harninder Rai @ 2014-02-19 12:41 UTC (permalink / raw)
  To: scottwood, devicetree; +Cc: Harninder Rai, linuxppc-dev

Missing bindings were found on running checkpatch.pl on bsc9132
device tree. This patch add/update the following

- Add bindings for L2 cache controller
- Add bindings for memory controller
- Update bindings for USB controller

Signed-off-by: Harninder Rai <harninder.rai@freescale.com>
---
* Changes in v3:
Incorporated Scott's comments
- Move cache-line-size property under required properties
- Update compatible for mem-ctrl to reflect "fsl,qoriq-memory-controller"
- One minor

* Changes since base version:
Incorporated Scott's comments
- Rename l2cc.txt to l2cache.txt
- Add information about ePAPR compliance
- Add missing "cache" in compatible
- Miscellaneous minors

 .../devicetree/bindings/powerpc/fsl/l2cache.txt    |   23 +++++++++++++++++
 .../devicetree/bindings/powerpc/fsl/mem-ctrlr.txt  |   27 ++++++++++++++++++++
 Documentation/devicetree/bindings/usb/fsl-usb.txt  |    4 ++-
 3 files changed, 53 insertions(+), 1 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/powerpc/fsl/l2cache.txt
 create mode 100644 Documentation/devicetree/bindings/powerpc/fsl/mem-ctrlr.txt

diff --git a/Documentation/devicetree/bindings/powerpc/fsl/l2cache.txt b/Documentation/devicetree/bindings/powerpc/fsl/l2cache.txt
new file mode 100644
index 0000000..c41b218
--- /dev/null
+++ b/Documentation/devicetree/bindings/powerpc/fsl/l2cache.txt
@@ -0,0 +1,23 @@
+Freescale L2 Cache Controller
+
+L2 cache is present in Freescale's QorIQ and QorIQ Qonverge platforms.
+The cache bindings explained below are ePAPR compliant
+
+Required Properties:
+
+- compatible	: Should include "fsl,chip-l2-cache-controller" and "cache"
+		  where chip is the processor (bsc9132, npc8572 etc.)
+- reg		: Address and size of L2 cache controller registers
+- cache-size	: Size of the entire L2 cache
+- interrupts	: Error interrupt of L2 controller
+- cache-line-size : Size of L2 cache lines
+
+Example:
+
+	L2: l2-cache-controller@20000 {
+		compatible = "fsl,bsc9132-l2-cache-controller", "cache";
+		reg = <0x20000 0x1000>;
+		cache-line-size = <32>; // 32 bytes
+		cache-size = <0x40000>; // L2,256K
+		interrupts = <16 2 1 0>;
+	};
diff --git a/Documentation/devicetree/bindings/powerpc/fsl/mem-ctrlr.txt b/Documentation/devicetree/bindings/powerpc/fsl/mem-ctrlr.txt
new file mode 100644
index 0000000..f87856f
--- /dev/null
+++ b/Documentation/devicetree/bindings/powerpc/fsl/mem-ctrlr.txt
@@ -0,0 +1,27 @@
+Freescale DDR memory controller
+
+Properties:
+
+- compatible	: Should include "fsl,chip-memory-controller" where
+		  chip is the processor (bsc9132, mpc8572 etc.), or
+		  "fsl,qoriq-memory-controller".
+- reg		: Address and size of DDR controller registers
+- interrupts	: Error interrupt of DDR controller
+
+Example 1:
+
+	memory-controller@2000 {
+		compatible = "fsl,bsc9132-memory-controller";
+		reg = <0x2000 0x1000>;
+		interrupts = <16 2 1 8>;
+	};
+
+
+Example 2:
+
+	ddr1: memory-controller@8000 {
+		compatible = "fsl,qoriq-memory-controller-v4.7",
+				"fsl,qoriq-memory-controller";
+		reg = <0x8000 0x1000>;
+		interrupts = <16 2 1 23>;
+	};
diff --git a/Documentation/devicetree/bindings/usb/fsl-usb.txt b/Documentation/devicetree/bindings/usb/fsl-usb.txt
index bd5723f..4779c02 100644
--- a/Documentation/devicetree/bindings/usb/fsl-usb.txt
+++ b/Documentation/devicetree/bindings/usb/fsl-usb.txt
@@ -8,7 +8,9 @@ and additions :
 Required properties :
  - compatible : Should be "fsl-usb2-mph" for multi port host USB
    controllers, or "fsl-usb2-dr" for dual role USB controllers
-   or "fsl,mpc5121-usb2-dr" for dual role USB controllers of MPC5121
+   or "fsl,mpc5121-usb2-dr" for dual role USB controllers of MPC5121.
+   Wherever applicable, the IP version of the USB controller should
+   also be mentioned (for eg. fsl-usb2-dr-v2.2 for bsc9132).
  - phy_type : For multi port host USB controllers, should be one of
    "ulpi", or "serial". For dual role USB controllers, should be
    one of "ulpi", "utmi", "utmi_wide", or "serial".
-- 
1.7.6.GIT

^ permalink raw reply related

* Re: [PATCH RFC v7 0/6] MPC512x DMA slave s/g support, OF DMA lookup
From: Alexander Popov @ 2014-02-19 13:17 UTC (permalink / raw)
  To: Gerhard Sittig
  Cc: devicetree, Lars-Peter Clausen, Arnd Bergmann, Vinod Koul,
	Dan Williams, Anatolij Gustschin, linuxppc-dev
In-Reply-To: <20140213003253.GL4524@book.gsilab.sittig.org>

Hello, Gerhard

2014-02-13 4:32 GMT+04:00 Gerhard Sittig <gsi@denx.de>:
> For some reason you have kept the DMA maintainers, but dropped
> the dmaengine ML from Cc: -- was this intentional, given that the
> series is specifically about DMA and you want to get feedback?
No, it was not done by intention, I'll fix that in the new version of the
series.

> And you may want to help DT people by not sending purely Linux
> implementation related stuff to them (they already are drinking
> from the firehose).  DT reviewers are foremost interested in
> bindings and policy and remaining OS agnostic, and leave
> mechanical .dts file updates to subsystem maintainers.
Do you mean I should better Cc to devicetree@vger.kernel.org
only parts 3, 4 and 5 of this series? How about cover letter?

Thank you.

Alexander

^ permalink raw reply

* Re: [PATCH RFC v7 2/6] dma: mpc512x: add support for peripheral transfers
From: Alexander Popov @ 2014-02-19 14:45 UTC (permalink / raw)
  To: Gerhard Sittig
  Cc: Lars-Peter Clausen, Arnd Bergmann, Vinod Koul, dmaengine,
	Dan Williams, Anatolij Gustschin, linuxppc-dev
In-Reply-To: <20140213000743.GI4524@book.gsilab.sittig.org>

[ adding dmaengine ML to Cc: ]

Thanks for your feedback, Gerhard

2014-02-13 4:07 GMT+04:00 Gerhard Sittig <gsi@denx.de>:
> On Wed, Feb 12, 2014 at 17:25 +0400, Alexander Popov wrote:
>>  /*
>> - * This is initial version of MPC5121 DMA driver. Only memory to memory
>> - * transfers are supported (tested using dmatest module).
>> + * MPC512x and MPC8308 DMA driver. It supports
>> + * memory to memory data transfers (tested using dmatest module) and
>> + * data transfers between memory and peripheral I/O memory
>> + * by means of slave s/g with these limitations:
>> + * - chunked transfers (transfers with more than one part) are refused
>> + * as long as proper support for scatter/gather is missing;
>> + * - transfers on MPC8308 always start from software as this SoC appears
>> + * not to have external request lines for peripheral flow control;
>> + * - minimal memory <-> I/O memory transfer size is 4 bytes.
>>   */
>
> Often I assume people would notice themselves, and apparently I'm
> wrong. :)  Can you adjust the formatting such (here and
> elsewhere) that the bullet list is clearly visible as such?
> Flowing text like above obfuscates the fact that the content may
> have a structure ...
Ok, thanks :)

> There are known limitations which are not listed here, "minimal
> transfer size" is incomplete.  It appears that you assume
> constraints on start addresses as well as sizes/lengths.  Can you
> update the documentation to match the implementation?
Ok, I see. How about that?
* - minimal memory <-> I/O memory transfer chunk is 4 bytes and consequently
*     source and destination addresses must be 4-byte aligned
*     and transfer size must be aligned on (4 * maxburst) boundary;

>> +             /* Grab either several mem-to-mem transfer descriptors
>> +              * or one peripheral transfer descriptor,
>> +              * don't mix mem-to-mem and peripheral transfer descriptors
>> +              * within the same 'active' list. */
>> +             if (mdesc->will_access_peripheral) {
>> +                     if (list_empty(&mchan->active))
>> +                             list_move_tail(&mdesc->node, &mchan->active);
>> +                     break;
>> +             } else
>> +                     list_move_tail(&mdesc->node, &mchan->active);
>> +     }
> There are style issues.  Both in multi line comments, and in the
> braces of the if/else block.
Ah, thanks! I'll fix that.

>> +     struct mpc_dma *mdma = dma_chan_to_mpc_dma(chan);
>> +     struct mpc_dma_chan *mchan = dma_chan_to_mpc_dma_chan(chan);
>> +     struct mpc_dma_desc *mdesc = NULL;
>> +     dma_addr_t per_paddr;
>> +     u32 tcd_nunits;
>> +     struct mpc_dma_tcd *tcd;
>> +     unsigned long iflags;
>> +     struct scatterlist *sg;
>> +     size_t len;
>> +     int iter, i;
> Personally I much dislike this style of mixing declarations and
> instructions.  But others may disagree, and strongly so.
Excuse me, I would like to keep it similar to other parts of this driver
and not to change that style.

>> +             mdesc = list_first_entry(&mchan->free, struct mpc_dma_desc,
>> +                                                                     node);
> style (continuation and indentation)
Thanks! I'll fix that.

>
>> +             if (!mdesc) {
>> +                     spin_unlock_irqrestore(&mchan->lock, iflags);
>> +                     /* try to free completed descriptors */
>> +                     mpc_dma_process_completed(mdma);
>> +                     return NULL;
>> +             }
>> +
>> +             list_del(&mdesc->node);
>> +
>> +             per_paddr = mchan->per_paddr;
>> +             tcd_nunits = mchan->tcd_nunits;
>> +
>> +             spin_unlock_irqrestore(&mchan->lock, iflags);
>> +
>> +             if (per_paddr == 0 || tcd_nunits == 0)
>> +                     goto err_prep;
>> +
>> +             mdesc->error = 0;
>> +             mdesc->will_access_peripheral = 1;
>> +             tcd = mdesc->tcd;
>> +
>> +             /* Prepare Transfer Control Descriptor for this transaction */
>> +
>> +             memset(tcd, 0, sizeof(struct mpc_dma_tcd));
>> +
>> +             if (!IS_ALIGNED(sg_dma_address(sg), 4))
>> +                     goto err_prep;
>
> You found multiple ways of encoding the "4 byte alignment", using
> both the fixed number as well as (several) symbolic identifiers.
> Can you look into making them use the same condition if the same
> motivation is behind the test?
Gerhard, I don't see checks which can be kind of "merged"
since they have different motivation behind:
1) src_addr_width and dst_addr_width have type "enum dma_slave_buswidth"
and are compared with DMA_SLAVE_BUSWIDTH_4_BYTES which has
the same type;
2) source and destination addresses are checked to be 4-byte aligned;
3) transfer size is checked to be aligned on nbytes boundary
which is (4 * maxburst).

I think the code provides good readability. What do you think?

Thank you!

Best regards,
Alexander

^ permalink raw reply

* [PATCH RFC/RFT v3 6/9] powerpc: move cacheinfo sysfs to generic cacheinfo infrastructure
From: Sudeep Holla @ 2014-02-19 16:06 UTC (permalink / raw)
  To: linux-kernel; +Cc: Paul Mackerras, linuxppc-dev, sudeep.holla
In-Reply-To: <1392825976-17633-1-git-send-email-sudeep.holla@arm.com>

From: Sudeep Holla <sudeep.holla@arm.com>

This patch removes the redundant sysfs cacheinfo code by making use of
the newly introduced generic cacheinfo infrastructure.

Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/kernel/cacheinfo.c | 831 ++++++----------------------------------
 arch/powerpc/kernel/cacheinfo.h |   8 -
 arch/powerpc/kernel/sysfs.c     |   4 -
 3 files changed, 109 insertions(+), 734 deletions(-)
 delete mode 100644 arch/powerpc/kernel/cacheinfo.h

diff --git a/arch/powerpc/kernel/cacheinfo.c b/arch/powerpc/kernel/cacheinfo.c
index 2912b87..05b7580 100644
--- a/arch/powerpc/kernel/cacheinfo.c
+++ b/arch/powerpc/kernel/cacheinfo.c
@@ -10,38 +10,10 @@
  * 2 as published by the Free Software Foundation.
  */
 
+#include <linux/cacheinfo.h>
 #include <linux/cpu.h>
-#include <linux/cpumask.h>
 #include <linux/kernel.h>
-#include <linux/kobject.h>
-#include <linux/list.h>
-#include <linux/notifier.h>
 #include <linux/of.h>
-#include <linux/percpu.h>
-#include <linux/slab.h>
-#include <asm/prom.h>
-
-#include "cacheinfo.h"
-
-/* per-cpu object for tracking:
- * - a "cache" kobject for the top-level directory
- * - a list of "index" objects representing the cpu's local cache hierarchy
- */
-struct cache_dir {
-	struct kobject *kobj; /* bare (not embedded) kobject for cache
-			       * directory */
-	struct cache_index_dir *index; /* list of index objects */
-};
-
-/* "index" object: each cpu's cache directory has an index
- * subdirectory corresponding to a cache object associated with the
- * cpu.  This object's lifetime is managed via the embedded kobject.
- */
-struct cache_index_dir {
-	struct kobject kobj;
-	struct cache_index_dir *next; /* next index in parent directory */
-	struct cache *cache;
-};
 
 /* Template for determining which OF properties to query for a given
  * cache type */
@@ -60,11 +32,6 @@ struct cache_type_info {
 	const char *nr_sets_prop;
 };
 
-/* These are used to index the cache_type_info array. */
-#define CACHE_TYPE_UNIFIED     0
-#define CACHE_TYPE_INSTRUCTION 1
-#define CACHE_TYPE_DATA        2
-
 static const struct cache_type_info cache_type_info[] = {
 	{
 		/* PowerPC Processor binding says the [di]-cache-*
@@ -77,246 +44,115 @@ static const struct cache_type_info cache_type_info[] = {
 		.nr_sets_prop    = "d-cache-sets",
 	},
 	{
-		.name            = "Instruction",
-		.size_prop       = "i-cache-size",
-		.line_size_props = { "i-cache-line-size",
-				     "i-cache-block-size", },
-		.nr_sets_prop    = "i-cache-sets",
-	},
-	{
 		.name            = "Data",
 		.size_prop       = "d-cache-size",
 		.line_size_props = { "d-cache-line-size",
 				     "d-cache-block-size", },
 		.nr_sets_prop    = "d-cache-sets",
 	},
+	{
+		.name            = "Instruction",
+		.size_prop       = "i-cache-size",
+		.line_size_props = { "i-cache-line-size",
+				     "i-cache-block-size", },
+		.nr_sets_prop    = "i-cache-sets",
+	},
 };
 
-/* Cache object: each instance of this corresponds to a distinct cache
- * in the system.  There are separate objects for Harvard caches: one
- * each for instruction and data, and each refers to the same OF node.
- * The refcount of the OF node is elevated for the lifetime of the
- * cache object.  A cache object is released when its shared_cpu_map
- * is cleared (see cache_cpu_clear).
- *
- * A cache object is on two lists: an unsorted global list
- * (cache_list) of cache objects; and a singly-linked list
- * representing the local cache hierarchy, which is ordered by level
- * (e.g. L1d -> L1i -> L2 -> L3).
- */
-struct cache {
-	struct device_node *ofnode;    /* OF node for this cache, may be cpu */
-	struct cpumask shared_cpu_map; /* online CPUs using this cache */
-	int type;                      /* split cache disambiguation */
-	int level;                     /* level not explicit in device tree */
-	struct list_head list;         /* global list of cache objects */
-	struct cache *next_local;      /* next cache of >= level */
-};
-
-static DEFINE_PER_CPU(struct cache_dir *, cache_dir_pcpu);
-
-/* traversal/modification of this list occurs only at cpu hotplug time;
- * access is serialized by cpu hotplug locking
- */
-static LIST_HEAD(cache_list);
-
-static struct cache_index_dir *kobj_to_cache_index_dir(struct kobject *k)
-{
-	return container_of(k, struct cache_index_dir, kobj);
-}
-
-static const char *cache_type_string(const struct cache *cache)
+static inline int get_cacheinfo_idx(enum cache_type type)
 {
-	return cache_type_info[cache->type].name;
-}
-
-static void cache_init(struct cache *cache, int type, int level,
-		       struct device_node *ofnode)
-{
-	cache->type = type;
-	cache->level = level;
-	cache->ofnode = of_node_get(ofnode);
-	INIT_LIST_HEAD(&cache->list);
-	list_add(&cache->list, &cache_list);
-}
-
-static struct cache *new_cache(int type, int level, struct device_node *ofnode)
-{
-	struct cache *cache;
-
-	cache = kzalloc(sizeof(*cache), GFP_KERNEL);
-	if (cache)
-		cache_init(cache, type, level, ofnode);
-
-	return cache;
-}
-
-static void release_cache_debugcheck(struct cache *cache)
-{
-	struct cache *iter;
-
-	list_for_each_entry(iter, &cache_list, list)
-		WARN_ONCE(iter->next_local == cache,
-			  "cache for %s(%s) refers to cache for %s(%s)\n",
-			  iter->ofnode->full_name,
-			  cache_type_string(iter),
-			  cache->ofnode->full_name,
-			  cache_type_string(cache));
-}
-
-static void release_cache(struct cache *cache)
-{
-	if (!cache)
-		return;
-
-	pr_debug("freeing L%d %s cache for %s\n", cache->level,
-		 cache_type_string(cache), cache->ofnode->full_name);
-
-	release_cache_debugcheck(cache);
-	list_del(&cache->list);
-	of_node_put(cache->ofnode);
-	kfree(cache);
-}
-
-static void cache_cpu_set(struct cache *cache, int cpu)
-{
-	struct cache *next = cache;
-
-	while (next) {
-		WARN_ONCE(cpumask_test_cpu(cpu, &next->shared_cpu_map),
-			  "CPU %i already accounted in %s(%s)\n",
-			  cpu, next->ofnode->full_name,
-			  cache_type_string(next));
-		cpumask_set_cpu(cpu, &next->shared_cpu_map);
-		next = next->next_local;
-	}
+	if (type == CACHE_TYPE_UNIFIED)
+		return 0;
+	else
+		return type;
 }
 
-static int cache_size(const struct cache *cache, unsigned int *ret)
+static int cache_size(struct cache_info *this_leaf)
 {
 	const char *propname;
 	const __be32 *cache_size;
+	int ct_idx;
 
-	propname = cache_type_info[cache->type].size_prop;
-
-	cache_size = of_get_property(cache->ofnode, propname, NULL);
-	if (!cache_size)
-		return -ENODEV;
-
-	*ret = of_read_number(cache_size, 1);
-	return 0;
-}
-
-static int cache_size_kb(const struct cache *cache, unsigned int *ret)
-{
-	unsigned int size;
+	ct_idx = get_cacheinfo_idx(this_leaf->type);
+	propname = cache_type_info[ct_idx].size_prop;
 
-	if (cache_size(cache, &size))
+	cache_size = of_get_property(this_leaf->of_node, propname, NULL);
+	if (!cache_size) {
+		this_leaf->size = 0;
 		return -ENODEV;
-
-	*ret = size / 1024;
-	return 0;
+	} else {
+		this_leaf->size = of_read_number(cache_size, 1);
+		return 0;
+	}
 }
 
 /* not cache_line_size() because that's a macro in include/linux/cache.h */
-static int cache_get_line_size(const struct cache *cache, unsigned int *ret)
+static int cache_get_line_size(struct cache_info *this_leaf)
 {
 	const __be32 *line_size;
-	int i, lim;
+	int i, lim, ct_idx;
 
-	lim = ARRAY_SIZE(cache_type_info[cache->type].line_size_props);
+	ct_idx = get_cacheinfo_idx(this_leaf->type);
+	lim = ARRAY_SIZE(cache_type_info[ct_idx].line_size_props);
 
 	for (i = 0; i < lim; i++) {
 		const char *propname;
 
-		propname = cache_type_info[cache->type].line_size_props[i];
-		line_size = of_get_property(cache->ofnode, propname, NULL);
+		propname = cache_type_info[ct_idx].line_size_props[i];
+		line_size = of_get_property(this_leaf->of_node, propname, NULL);
 		if (line_size)
 			break;
 	}
 
-	if (!line_size)
+	if (!line_size) {
+		this_leaf->coherency_line_size = 0;
 		return -ENODEV;
-
-	*ret = of_read_number(line_size, 1);
-	return 0;
+	} else {
+		this_leaf->coherency_line_size = of_read_number(line_size, 1);
+		return 0;
+	}
 }
 
-static int cache_nr_sets(const struct cache *cache, unsigned int *ret)
+static int cache_nr_sets(struct cache_info *this_leaf)
 {
 	const char *propname;
 	const __be32 *nr_sets;
+	int ct_idx;
 
-	propname = cache_type_info[cache->type].nr_sets_prop;
+	ct_idx = get_cacheinfo_idx(this_leaf->type);
+	propname = cache_type_info[ct_idx].nr_sets_prop;
 
-	nr_sets = of_get_property(cache->ofnode, propname, NULL);
-	if (!nr_sets)
+	nr_sets = of_get_property(this_leaf->of_node, propname, NULL);
+	if (!nr_sets) {
+		this_leaf->number_of_sets = 0;
 		return -ENODEV;
-
-	*ret = of_read_number(nr_sets, 1);
-	return 0;
+	} else {
+		this_leaf->number_of_sets = of_read_number(nr_sets, 1);
+		return 0;
+	}
 }
 
-static int cache_associativity(const struct cache *cache, unsigned int *ret)
+static int cache_associativity(struct cache_info *this_leaf)
 {
-	unsigned int line_size;
-	unsigned int nr_sets;
-	unsigned int size;
-
-	if (cache_nr_sets(cache, &nr_sets))
-		goto err;
+	unsigned int line_size = this_leaf->coherency_line_size;
+	unsigned int nr_sets = this_leaf->number_of_sets;
+	unsigned int size = this_leaf->size;
 
 	/* If the cache is fully associative, there is no need to
 	 * check the other properties.
 	 */
 	if (nr_sets == 1) {
-		*ret = 0;
+		this_leaf->ways_of_associativity = 0;
 		return 0;
 	}
 
-	if (cache_get_line_size(cache, &line_size))
-		goto err;
-	if (cache_size(cache, &size))
-		goto err;
-
-	if (!(nr_sets > 0 && size > 0 && line_size > 0))
-		goto err;
-
-	*ret = (size / nr_sets) / line_size;
-	return 0;
-err:
-	return -ENODEV;
-}
-
-/* helper for dealing with split caches */
-static struct cache *cache_find_first_sibling(struct cache *cache)
-{
-	struct cache *iter;
-
-	if (cache->type == CACHE_TYPE_UNIFIED)
-		return cache;
-
-	list_for_each_entry(iter, &cache_list, list)
-		if (iter->ofnode == cache->ofnode && iter->next_local == cache)
-			return iter;
-
-	return cache;
-}
-
-/* return the first cache on a local list matching node */
-static struct cache *cache_lookup_by_node(const struct device_node *node)
-{
-	struct cache *cache = NULL;
-	struct cache *iter;
-
-	list_for_each_entry(iter, &cache_list, list) {
-		if (iter->ofnode != node)
-			continue;
-		cache = cache_find_first_sibling(iter);
-		break;
+	if (!(nr_sets > 0 && size > 0 && line_size > 0)) {
+		this_leaf->ways_of_associativity = 0;
+		return -ENODEV;
+	} else {
+		this_leaf->ways_of_associativity = (size / nr_sets) / line_size;
+		return 0;
 	}
-
-	return cache;
 }
 
 static bool cache_node_is_unified(const struct device_node *np)
@@ -324,523 +160,74 @@ static bool cache_node_is_unified(const struct device_node *np)
 	return of_get_property(np, "cache-unified", NULL);
 }
 
-static struct cache *cache_do_one_devnode_unified(struct device_node *node,
-						  int level)
-{
-	struct cache *cache;
-
-	pr_debug("creating L%d ucache for %s\n", level, node->full_name);
-
-	cache = new_cache(CACHE_TYPE_UNIFIED, level, node);
-
-	return cache;
-}
-
-static struct cache *cache_do_one_devnode_split(struct device_node *node,
-						int level)
-{
-	struct cache *dcache, *icache;
-
-	pr_debug("creating L%d dcache and icache for %s\n", level,
-		 node->full_name);
-
-	dcache = new_cache(CACHE_TYPE_DATA, level, node);
-	icache = new_cache(CACHE_TYPE_INSTRUCTION, level, node);
-
-	if (!dcache || !icache)
-		goto err;
-
-	dcache->next_local = icache;
-
-	return dcache;
-err:
-	release_cache(dcache);
-	release_cache(icache);
-	return NULL;
-}
-
-static struct cache *cache_do_one_devnode(struct device_node *node, int level)
-{
-	struct cache *cache;
-
-	if (cache_node_is_unified(node))
-		cache = cache_do_one_devnode_unified(node, level);
-	else
-		cache = cache_do_one_devnode_split(node, level);
-
-	return cache;
-}
-
-static struct cache *cache_lookup_or_instantiate(struct device_node *node,
-						 int level)
-{
-	struct cache *cache;
-
-	cache = cache_lookup_by_node(node);
-
-	WARN_ONCE(cache && cache->level != level,
-		  "cache level mismatch on lookup (got %d, expected %d)\n",
-		  cache->level, level);
-
-	if (!cache)
-		cache = cache_do_one_devnode(node, level);
-
-	return cache;
-}
-
-static void link_cache_lists(struct cache *smaller, struct cache *bigger)
-{
-	while (smaller->next_local) {
-		if (smaller->next_local == bigger)
-			return; /* already linked */
-		smaller = smaller->next_local;
-	}
-
-	smaller->next_local = bigger;
-}
-
-static void do_subsidiary_caches_debugcheck(struct cache *cache)
-{
-	WARN_ON_ONCE(cache->level != 1);
-	WARN_ON_ONCE(strcmp(cache->ofnode->type, "cpu"));
-}
-
-static void do_subsidiary_caches(struct cache *cache)
-{
-	struct device_node *subcache_node;
-	int level = cache->level;
-
-	do_subsidiary_caches_debugcheck(cache);
-
-	while ((subcache_node = of_find_next_cache_node(cache->ofnode))) {
-		struct cache *subcache;
-
-		level++;
-		subcache = cache_lookup_or_instantiate(subcache_node, level);
-		of_node_put(subcache_node);
-		if (!subcache)
-			break;
-
-		link_cache_lists(cache, subcache);
-		cache = subcache;
-	}
-}
-
-static struct cache *cache_chain_instantiate(unsigned int cpu_id)
-{
-	struct device_node *cpu_node;
-	struct cache *cpu_cache = NULL;
-
-	pr_debug("creating cache object(s) for CPU %i\n", cpu_id);
-
-	cpu_node = of_get_cpu_node(cpu_id, NULL);
-	WARN_ONCE(!cpu_node, "no OF node found for CPU %i\n", cpu_id);
-	if (!cpu_node)
-		goto out;
-
-	cpu_cache = cache_lookup_or_instantiate(cpu_node, 1);
-	if (!cpu_cache)
-		goto out;
-
-	do_subsidiary_caches(cpu_cache);
-
-	cache_cpu_set(cpu_cache, cpu_id);
-out:
-	of_node_put(cpu_node);
-
-	return cpu_cache;
-}
-
-static struct cache_dir *cacheinfo_create_cache_dir(unsigned int cpu_id)
-{
-	struct cache_dir *cache_dir;
-	struct device *dev;
-	struct kobject *kobj = NULL;
-
-	dev = get_cpu_device(cpu_id);
-	WARN_ONCE(!dev, "no dev for CPU %i\n", cpu_id);
-	if (!dev)
-		goto err;
-
-	kobj = kobject_create_and_add("cache", &dev->kobj);
-	if (!kobj)
-		goto err;
-
-	cache_dir = kzalloc(sizeof(*cache_dir), GFP_KERNEL);
-	if (!cache_dir)
-		goto err;
-
-	cache_dir->kobj = kobj;
-
-	WARN_ON_ONCE(per_cpu(cache_dir_pcpu, cpu_id) != NULL);
-
-	per_cpu(cache_dir_pcpu, cpu_id) = cache_dir;
-
-	return cache_dir;
-err:
-	kobject_put(kobj);
-	return NULL;
-}
-
-static void cache_index_release(struct kobject *kobj)
-{
-	struct cache_index_dir *index;
-
-	index = kobj_to_cache_index_dir(kobj);
-
-	pr_debug("freeing index directory for L%d %s cache\n",
-		 index->cache->level, cache_type_string(index->cache));
-
-	kfree(index);
-}
-
-static ssize_t cache_index_show(struct kobject *k, struct attribute *attr, char *buf)
+static void ci_leaf_init(struct cache_info *this_leaf,
+				enum cache_type type, unsigned int level)
 {
-	struct kobj_attribute *kobj_attr;
-
-	kobj_attr = container_of(attr, struct kobj_attribute, attr);
-
-	return kobj_attr->show(k, kobj_attr, buf);
-}
-
-static struct cache *index_kobj_to_cache(struct kobject *k)
-{
-	struct cache_index_dir *index;
-
-	index = kobj_to_cache_index_dir(k);
-
-	return index->cache;
+	this_leaf->level = level;
+	this_leaf->type = type;
+	cache_size(this_leaf);
+	cache_get_line_size(this_leaf);
+	cache_nr_sets(this_leaf);
+	cache_associativity(this_leaf);
 }
 
-static ssize_t size_show(struct kobject *k, struct kobj_attribute *attr, char *buf)
+int init_cache_level(unsigned int cpu)
 {
-	unsigned int size_kb;
-	struct cache *cache;
+	struct device_node *np;
+	struct device *cpu_dev = get_cpu_device(cpu);
+	struct cpu_cacheinfo *this_cpu_ci = get_cpu_cacheinfo(cpu);
+	unsigned int level = 0, leaves = 0;
 
-	cache = index_kobj_to_cache(k);
-
-	if (cache_size_kb(cache, &size_kb))
+	if (!cpu_dev) {
+		pr_err("No cpu device for CPU %d\n", cpu);
 		return -ENODEV;
-
-	return sprintf(buf, "%uK\n", size_kb);
-}
-
-static struct kobj_attribute cache_size_attr =
-	__ATTR(size, 0444, size_show, NULL);
-
-
-static ssize_t line_size_show(struct kobject *k, struct kobj_attribute *attr, char *buf)
-{
-	unsigned int line_size;
-	struct cache *cache;
-
-	cache = index_kobj_to_cache(k);
-
-	if (cache_get_line_size(cache, &line_size))
-		return -ENODEV;
-
-	return sprintf(buf, "%u\n", line_size);
-}
-
-static struct kobj_attribute cache_line_size_attr =
-	__ATTR(coherency_line_size, 0444, line_size_show, NULL);
-
-static ssize_t nr_sets_show(struct kobject *k, struct kobj_attribute *attr, char *buf)
-{
-	unsigned int nr_sets;
-	struct cache *cache;
-
-	cache = index_kobj_to_cache(k);
-
-	if (cache_nr_sets(cache, &nr_sets))
-		return -ENODEV;
-
-	return sprintf(buf, "%u\n", nr_sets);
-}
-
-static struct kobj_attribute cache_nr_sets_attr =
-	__ATTR(number_of_sets, 0444, nr_sets_show, NULL);
-
-static ssize_t associativity_show(struct kobject *k, struct kobj_attribute *attr, char *buf)
-{
-	unsigned int associativity;
-	struct cache *cache;
-
-	cache = index_kobj_to_cache(k);
-
-	if (cache_associativity(cache, &associativity))
-		return -ENODEV;
-
-	return sprintf(buf, "%u\n", associativity);
-}
-
-static struct kobj_attribute cache_assoc_attr =
-	__ATTR(ways_of_associativity, 0444, associativity_show, NULL);
-
-static ssize_t type_show(struct kobject *k, struct kobj_attribute *attr, char *buf)
-{
-	struct cache *cache;
-
-	cache = index_kobj_to_cache(k);
-
-	return sprintf(buf, "%s\n", cache_type_string(cache));
-}
-
-static struct kobj_attribute cache_type_attr =
-	__ATTR(type, 0444, type_show, NULL);
-
-static ssize_t level_show(struct kobject *k, struct kobj_attribute *attr, char *buf)
-{
-	struct cache_index_dir *index;
-	struct cache *cache;
-
-	index = kobj_to_cache_index_dir(k);
-	cache = index->cache;
-
-	return sprintf(buf, "%d\n", cache->level);
-}
-
-static struct kobj_attribute cache_level_attr =
-	__ATTR(level, 0444, level_show, NULL);
-
-static ssize_t shared_cpu_map_show(struct kobject *k, struct kobj_attribute *attr, char *buf)
-{
-	struct cache_index_dir *index;
-	struct cache *cache;
-	int len;
-	int n = 0;
-
-	index = kobj_to_cache_index_dir(k);
-	cache = index->cache;
-	len = PAGE_SIZE - 2;
-
-	if (len > 1) {
-		n = cpumask_scnprintf(buf, len, &cache->shared_cpu_map);
-		buf[n++] = '\n';
-		buf[n] = '\0';
 	}
-	return n;
-}
-
-static struct kobj_attribute cache_shared_cpu_map_attr =
-	__ATTR(shared_cpu_map, 0444, shared_cpu_map_show, NULL);
-
-/* Attributes which should always be created -- the kobject/sysfs core
- * does this automatically via kobj_type->default_attrs.  This is the
- * minimum data required to uniquely identify a cache.
- */
-static struct attribute *cache_index_default_attrs[] = {
-	&cache_type_attr.attr,
-	&cache_level_attr.attr,
-	&cache_shared_cpu_map_attr.attr,
-	NULL,
-};
-
-/* Attributes which should be created if the cache device node has the
- * right properties -- see cacheinfo_create_index_opt_attrs
- */
-static struct kobj_attribute *cache_index_opt_attrs[] = {
-	&cache_size_attr,
-	&cache_line_size_attr,
-	&cache_nr_sets_attr,
-	&cache_assoc_attr,
-};
-
-static const struct sysfs_ops cache_index_ops = {
-	.show = cache_index_show,
-};
-
-static struct kobj_type cache_index_type = {
-	.release = cache_index_release,
-	.sysfs_ops = &cache_index_ops,
-	.default_attrs = cache_index_default_attrs,
-};
-
-static void cacheinfo_create_index_opt_attrs(struct cache_index_dir *dir)
-{
-	const char *cache_name;
-	const char *cache_type;
-	struct cache *cache;
-	char *buf;
-	int i;
-
-	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
-	if (!buf)
-		return;
-
-	cache = dir->cache;
-	cache_name = cache->ofnode->full_name;
-	cache_type = cache_type_string(cache);
-
-	/* We don't want to create an attribute that can't provide a
-	 * meaningful value.  Check the return value of each optional
-	 * attribute's ->show method before registering the
-	 * attribute.
-	 */
-	for (i = 0; i < ARRAY_SIZE(cache_index_opt_attrs); i++) {
-		struct kobj_attribute *attr;
-		ssize_t rc;
-
-		attr = cache_index_opt_attrs[i];
-
-		rc = attr->show(&dir->kobj, attr, buf);
-		if (rc <= 0) {
-			pr_debug("not creating %s attribute for "
-				 "%s(%s) (rc = %zd)\n",
-				 attr->attr.name, cache_name,
-				 cache_type, rc);
-			continue;
-		}
-		if (sysfs_create_file(&dir->kobj, &attr->attr))
-			pr_debug("could not create %s attribute for %s(%s)\n",
-				 attr->attr.name, cache_name, cache_type);
+	np = cpu_dev->of_node;
+	if (!np) {
+		pr_err("Failed to find cpu%d device node\n", cpu);
+		return -ENOENT;
 	}
 
-	kfree(buf);
-}
-
-static void cacheinfo_create_index_dir(struct cache *cache, int index,
-				       struct cache_dir *cache_dir)
-{
-	struct cache_index_dir *index_dir;
-	int rc;
-
-	index_dir = kzalloc(sizeof(*index_dir), GFP_KERNEL);
-	if (!index_dir)
-		goto err;
-
-	index_dir->cache = cache;
-
-	rc = kobject_init_and_add(&index_dir->kobj, &cache_index_type,
-				  cache_dir->kobj, "index%d", index);
-	if (rc)
-		goto err;
-
-	index_dir->next = cache_dir->index;
-	cache_dir->index = index_dir;
-
-	cacheinfo_create_index_opt_attrs(index_dir);
-
-	return;
-err:
-	kfree(index_dir);
-}
-
-static void cacheinfo_sysfs_populate(unsigned int cpu_id,
-				     struct cache *cache_list)
-{
-	struct cache_dir *cache_dir;
-	struct cache *cache;
-	int index = 0;
-
-	cache_dir = cacheinfo_create_cache_dir(cpu_id);
-	if (!cache_dir)
-		return;
-
-	cache = cache_list;
-	while (cache) {
-		cacheinfo_create_index_dir(cache, index, cache_dir);
-		index++;
-		cache = cache->next_local;
+	while (np) {
+		leaves += cache_node_is_unified(np) ? 1 : 2;
+		level++;
+		of_node_put(np);
+		np = of_find_next_cache_node(np);
 	}
-}
-
-void cacheinfo_cpu_online(unsigned int cpu_id)
-{
-	struct cache *cache;
-
-	cache = cache_chain_instantiate(cpu_id);
-	if (!cache)
-		return;
-
-	cacheinfo_sysfs_populate(cpu_id, cache);
-}
-
-#ifdef CONFIG_HOTPLUG_CPU /* functions needed for cpu offline */
-
-static struct cache *cache_lookup_by_cpu(unsigned int cpu_id)
-{
-	struct device_node *cpu_node;
-	struct cache *cache;
-
-	cpu_node = of_get_cpu_node(cpu_id, NULL);
-	WARN_ONCE(!cpu_node, "no OF node found for CPU %i\n", cpu_id);
-	if (!cpu_node)
-		return NULL;
-
-	cache = cache_lookup_by_node(cpu_node);
-	of_node_put(cpu_node);
+	this_cpu_ci->num_levels = level;
+	this_cpu_ci->num_leaves = leaves;
 
-	return cache;
+	return 0;
 }
 
-static void remove_index_dirs(struct cache_dir *cache_dir)
+int populate_cache_leaves(unsigned int cpu)
 {
-	struct cache_index_dir *index;
-
-	index = cache_dir->index;
+	struct cpu_cacheinfo *this_cpu_ci = get_cpu_cacheinfo(cpu);
+	struct cache_info *this_leaf = this_cpu_ci->info_list;
+	struct device *cpu_dev = get_cpu_device(cpu);
+	struct device_node *np;
+	unsigned int level, idx;
 
-	while (index) {
-		struct cache_index_dir *next;
-
-		next = index->next;
-		kobject_put(&index->kobj);
-		index = next;
+	np = of_node_get(cpu_dev->of_node);
+	if (!np) {
+		pr_err("Failed to find cpu%d device node\n", cpu);
+		return -ENOENT;
 	}
-}
-
-static void remove_cache_dir(struct cache_dir *cache_dir)
-{
-	remove_index_dirs(cache_dir);
 
-	/* Remove cache dir from sysfs */
-	kobject_del(cache_dir->kobj);
-
-	kobject_put(cache_dir->kobj);
-
-	kfree(cache_dir);
-}
-
-static void cache_cpu_clear(struct cache *cache, int cpu)
-{
-	while (cache) {
-		struct cache *next = cache->next_local;
-
-		WARN_ONCE(!cpumask_test_cpu(cpu, &cache->shared_cpu_map),
-			  "CPU %i not accounted in %s(%s)\n",
-			  cpu, cache->ofnode->full_name,
-			  cache_type_string(cache));
-
-		cpumask_clear_cpu(cpu, &cache->shared_cpu_map);
-
-		/* Release the cache object if all the cpus using it
-		 * are offline */
-		if (cpumask_empty(&cache->shared_cpu_map))
-			release_cache(cache);
-
-		cache = next;
+	for (idx = 0, level = 1; level <= this_cpu_ci->num_levels &&
+			idx < this_cpu_ci->num_leaves; idx++, level++) {
+		if (!this_leaf)
+			return -EINVAL;
+
+		this_leaf->of_node = np;
+		if (cache_node_is_unified(np)) {
+			ci_leaf_init(this_leaf++, CACHE_TYPE_UNIFIED, level);
+		} else {
+			ci_leaf_init(this_leaf++, CACHE_TYPE_DATA, level);
+			ci_leaf_init(this_leaf++, CACHE_TYPE_INST, level);
+		}
+		np = of_find_next_cache_node(np);
 	}
+	return 0;
 }
 
-void cacheinfo_cpu_offline(unsigned int cpu_id)
-{
-	struct cache_dir *cache_dir;
-	struct cache *cache;
-
-	/* Prevent userspace from seeing inconsistent state - remove
-	 * the sysfs hierarchy first */
-	cache_dir = per_cpu(cache_dir_pcpu, cpu_id);
-
-	/* careful, sysfs population may have failed */
-	if (cache_dir)
-		remove_cache_dir(cache_dir);
-
-	per_cpu(cache_dir_pcpu, cpu_id) = NULL;
-
-	/* clear the CPU's bit in its cache chain, possibly freeing
-	 * cache objects */
-	cache = cache_lookup_by_cpu(cpu_id);
-	if (cache)
-		cache_cpu_clear(cache, cpu_id);
-}
-#endif /* CONFIG_HOTPLUG_CPU */
diff --git a/arch/powerpc/kernel/cacheinfo.h b/arch/powerpc/kernel/cacheinfo.h
deleted file mode 100644
index a7b74d3..0000000
--- a/arch/powerpc/kernel/cacheinfo.h
+++ /dev/null
@@ -1,8 +0,0 @@
-#ifndef _PPC_CACHEINFO_H
-#define _PPC_CACHEINFO_H
-
-/* These are just hooks for sysfs.c to use. */
-extern void cacheinfo_cpu_online(unsigned int cpu_id);
-extern void cacheinfo_cpu_offline(unsigned int cpu_id);
-
-#endif /* _PPC_CACHEINFO_H */
diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index 97e1dc9..2bc61d3 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -19,8 +19,6 @@
 #include <asm/pmc.h>
 #include <asm/firmware.h>
 
-#include "cacheinfo.h"
-
 #ifdef CONFIG_PPC64
 #include <asm/paca.h>
 #include <asm/lppaca.h>
@@ -730,7 +728,6 @@ static void register_cpu_online(unsigned int cpu)
 		device_create_file(s, &dev_attr_altivec_idle_wait_time);
 	}
 #endif
-	cacheinfo_cpu_online(cpu);
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
@@ -811,7 +808,6 @@ static void unregister_cpu_online(unsigned int cpu)
 		device_remove_file(s, &dev_attr_altivec_idle_wait_time);
 	}
 #endif
-	cacheinfo_cpu_offline(cpu);
 }
 
 #ifdef CONFIG_ARCH_CPU_PROBE_RELEASE
-- 
1.8.3.2

^ permalink raw reply related

* [PATCH RFC/RFT v3 0/9] drivers: cacheinfo support
From: Sudeep Holla @ 2014-02-19 16:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-s390, linux-ia64, Greg Kroah-Hartman, x86, sudeep.holla,
	linux390, linuxppc-dev, linux-arm-kernel

From: Sudeep Holla <sudeep.holla@arm.com>

Hi,

This series adds a generic cacheinfo support similar to topology. The
implementation is based on x86 cacheinfo support. Currently x86, powerpc,
ia64 and s390 have their own implementations. While adding similar support
to ARM and ARM64, here is the attempt to make it generic quite similar to
topology info support. It also adds the missing ABI documentation for
the cacheinfo sysfs which is already being used.

It moves all the existing different implementations on x86, ia64, powerpc
and s390 to use the generic cacheinfo infrastructure introduced here.
These changes on non-ARM platforms are only compile tested and hence
the request for testing too.

This series also adds support for ARM and ARM64 architectures based on
the generic support.

Changes v2[2]->v3:
	- Added new class "cpu" to group all cpu devices
	- Converted all "raw" kobjects used in cacheinfo to device_attr
	  by creating cache index devices
	- Added back s390 show_cacheinfo for /proc/cpuinfo
	- Added disable_sysfs to cache_info for preventing a cache node
	  to be exposed through sysfs if required(used on s390)

Changes v1[1]->v2[2]:
	- Extended the generic cacheinfo support to accomodate all
	  the existing implementations
	- Moved all the existing implementations to use this new
	  generic infrastructure
	- Added missing ABI documentation as suggested by Greg KH
	- Added support for unimplemented CTR on pre-ARMv6 implementations
	  as suggested by Russell. However the ctr_info_list is not yet
	  populated
	- not yet changed to device_attr as suggested by Greg KH,
	  registering cache as device won't eliminate the need of kobject
	  unless each index of cache is registered as a device which don't
	  seem to be good idea, but now it's unified it can be done easily
	  in one place if needed

[1] https://lkml.org/lkml/2014/1/8/523
[2] https://lkml.org/lkml/2014/2/7/654

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: linux-ia64@vger.kernel.org
Cc: linux390@de.ibm.com
Cc: linux-s390@vger.kernel.org
Cc: x86@kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-arm-kernel@lists.infradead.org
---
Sudeep Holla (9):
  drivers: base: add new class "cpu" to group cpu devices
  drivers: base: support cpu cache information interface to userspace
    via sysfs
  ia64: move cacheinfo sysfs to generic cacheinfo infrastructure
  s390: move cacheinfo sysfs to generic cacheinfo infrastructure
  x86: move cacheinfo sysfs to generic cacheinfo infrastructure
  powerpc: move cacheinfo sysfs to generic cacheinfo infrastructure
  ARM64: kernel: add support for cpu cache information
  ARM: kernel: add support for cpu cache information
  ARM: kernel: add outer cache support for cacheinfo implementation

 Documentation/ABI/testing/sysfs-devices-system-cpu |  40 +
 arch/arm/include/asm/outercache.h                  |  13 +
 arch/arm/kernel/Makefile                           |   1 +
 arch/arm/kernel/cacheinfo.c                        | 248 ++++++
 arch/arm/mm/Kconfig                                |  13 +
 arch/arm/mm/cache-l2x0.c                           |  14 +
 arch/arm/mm/cache-tauros2.c                        |  35 +
 arch/arm/mm/cache-xsc3l2.c                         |  15 +
 arch/arm64/kernel/Makefile                         |   2 +-
 arch/arm64/kernel/cacheinfo.c                      | 134 ++++
 arch/ia64/kernel/topology.c                        | 399 ++--------
 arch/powerpc/kernel/cacheinfo.c                    | 831 +++------------------
 arch/powerpc/kernel/cacheinfo.h                    |   8 -
 arch/powerpc/kernel/sysfs.c                        |   4 -
 arch/s390/kernel/cache.c                           | 388 +++-------
 arch/x86/kernel/cpu/intel_cacheinfo.c              | 647 ++++------------
 drivers/base/Makefile                              |   2 +-
 drivers/base/cacheinfo.c                           | 485 ++++++++++++
 drivers/base/core.c                                |  35 +-
 drivers/base/cpu.c                                 |   7 +
 include/linux/cacheinfo.h                          |  55 ++
 include/linux/cpu.h                                |   2 +
 22 files changed, 1523 insertions(+), 1855 deletions(-)
 create mode 100644 arch/arm/kernel/cacheinfo.c
 create mode 100644 arch/arm64/kernel/cacheinfo.c
 delete mode 100644 arch/powerpc/kernel/cacheinfo.h
 create mode 100644 drivers/base/cacheinfo.c
 create mode 100644 include/linux/cacheinfo.h

-- 
1.8.3.2

^ permalink raw reply

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
From: Christoph Lameter @ 2014-02-19 16:11 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Han Pingtian, Matt Mackall, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	David Rientjes, Joonsoo Kim, linuxppc-dev, Wanpeng Li
In-Reply-To: <20140218222242.GA10844@linux.vnet.ibm.com>

On Tue, 18 Feb 2014, Nishanth Aravamudan wrote:

> the performance impact of the underlying NUMA configuration. I guess we
> could special-case memoryless/cpuless configurations somewhat, but I
> don't think there's any reason to do that if we can make memoryless-node
> support work in-kernel?

Well we can make it work in-kernel but it always has been a bit wacky (as
is the idea of numa "memory" nodes without memory).

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox