LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH v2 1/2] cpu: Offline state Framework.
From: Andrew Morton @ 2009-09-02  4:49 UTC (permalink / raw)
  To: Gautham R Shenoy
  Cc: Peter Zijlstra, Venkatesh Pallipadi, linux-kernel, linuxppc-dev,
	Darrick J. Wong
In-Reply-To: <20090828100016.10641.62621.stgit@sofia.in.ibm.com>

On Fri, 28 Aug 2009 15:30:16 +0530 Gautham R Shenoy <ego@in.ibm.com> wrote:

> Provide an interface by which the system administrator can decide what state
> should the CPU go to when it is offlined.
> 
> To query the hotplug states, on needs to perform a read on the sysfs tunable:
> 	/sys/devices/system/cpu/cpu<number>/available_hotplug_states
> 
> To query or set the current state for a particular CPU, one needs to
> use the sysfs interface:
> 	/sys/devices/system/cpu/cpu<number>/current_state
> 
> This patch implements the architecture independent bits of the
> cpu-offline-state framework.
> 
> Architectures which want to expose the multiple offline-states to the
> userspace are expected to write a driver which can register
> with this framework.
> 
> Such a driver should:
> - Implement the callbacks defined in the structure struct cpu_offline_driver
>   which can be called into by this framework when the corresponding
>   sysfs interfaces are read or written into.
> 
> - Ensure that the following operation puts the CPU in the same state
>   as it did in the absence of the driver.
> 	echo 0 > /sys/devices/system/cpu/cpu<number>/online
> 
> This framework also serializes the writes to the "current_state"
> with respect to with the writes to the "online" sysfs tunable.
> 

It would be nice to document this new userspace interface somewhere.


> +struct cpu_offline_driver *cpu_offline_driver;
> +static DEFINE_MUTEX(cpu_offline_driver_lock);
> +
> +ssize_t show_available_states(struct sys_device *dev,
> +			struct sysdev_attribute *attr, char *buf)
> +{
> +	struct cpu *cpu = container_of(dev, struct cpu, sysdev);
> +	int cpu_num = cpu->sysdev.id;
> +	ssize_t ret;
> +
> +	mutex_lock(&cpu_offline_driver_lock);
> +	if (!cpu_offline_driver) {
> +		ret = -EEXIST;
> +		goto out_unlock;
> +	}
> +
> +	ret = cpu_offline_driver->read_available_states(cpu_num, buf);
> +
> +out_unlock:
> +	mutex_unlock(&cpu_offline_driver_lock);
> +
> +	return ret;
> +
> +}

The patch adds boatloads of global symbols which do not have names
which are appropriate for global symbols.

> +ssize_t show_current_state(struct sys_device *dev,
> +			struct sysdev_attribute *attr, char *buf)

Like that.

> +ssize_t store_current_state(struct sys_device *dev,
> +			struct sysdev_attribute *attr,
> +			const char *buf, size_t count)

And that.

> +
> +static SYSDEV_ATTR(available_hotplug_states, 0444, show_available_states,
> +								NULL);
> +static SYSDEV_ATTR(current_state, 0644, show_current_state,
> +						store_current_state);
> +
> +/* Should be called with cpu_offline_driver_lock held */
> +void cpu_offline_driver_add_cpu(struct sys_device *cpu_sys_dev)
> +{
> +	if (!cpu_offline_driver || !cpu_sys_dev)
> +		return;
> +
> +	sysdev_create_file(cpu_sys_dev, &attr_available_hotplug_states);
> +	sysdev_create_file(cpu_sys_dev, &attr_current_state);
> +}
> +
> +/* Should be called with cpu_offline_driver_lock held */
> +void cpu_offline_driver_remove_cpu(struct sys_device *cpu_sys_dev)
> +{
> +	if (!cpu_offline_driver || !cpu_sys_dev)
> +		return;
> +
> +	sysdev_remove_file(cpu_sys_dev, &attr_available_hotplug_states);
> +	sysdev_remove_file(cpu_sys_dev, &attr_current_state);
> +
> +}

Please don't just ignore possible error returns.

> +int register_cpu_offline_driver(struct cpu_offline_driver *arch_cpu_driver)
> +{
> +	int ret = 0;
> +	int cpu;
> +	mutex_lock(&cpu_offline_driver_lock);
> +

The blank line goes after end-of-locals and before start-of-code.

> +	if (cpu_offline_driver != NULL) {
> +		ret = -EEXIST;
> +		goto out_unlock;
> +	}
> +
> +	if (!(arch_cpu_driver->read_available_states &&
> +	      arch_cpu_driver->read_current_state &&
> +	      arch_cpu_driver->write_current_state)) {
> +		ret = -EINVAL;
> +		goto out_unlock;

This seems pretty pointless.  Just let the code oops - the developer
will notice fairly quickly.

> +	}
> +
> +	cpu_offline_driver = arch_cpu_driver;
> +
> +	for_each_possible_cpu(cpu)
> +		cpu_offline_driver_add_cpu(get_cpu_sysdev(cpu));
> +
> +out_unlock:
> +	mutex_unlock(&cpu_offline_driver_lock);
> +	return ret;
> +}
> +
> +void unregister_cpu_offline_driver(struct cpu_offline_driver *arch_cpu_driver)
> +{
> +	int cpu;
> +	mutex_lock(&cpu_offline_driver_lock);
> +
> +	if (!cpu_offline_driver) {
> +		WARN_ON(1);

	if (WARN_ON(!cpu_offline_driver)) {

> +		mutex_unlock(&cpu_offline_driver_lock);
> +		return;
> +	}
> +
> +	for_each_possible_cpu(cpu)
> +		cpu_offline_driver_remove_cpu(get_cpu_sysdev(cpu));
> +
> +	cpu_offline_driver = NULL;
> +	mutex_unlock(&cpu_offline_driver_lock);
> +}
> +
> +

^ permalink raw reply

* Re: [v4 PATCH 1/5]: cpuidle: Cleanup drivers/cpuidle/cpuidle.c
From: Arun R Bharadwaj @ 2009-09-02  5:21 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Peter Zijlstra, Gautham R Shenoy, linux-kernel, Paul Mackerras,
	Arun Bharadwaj, Ingo Molnar, linuxppc-dev
In-Reply-To: <20090901172825.GA6780@balbir.in.ibm.com>

* Balbir Singh <balbir@linux.vnet.ibm.com> [2009-09-01 22:58:25]:

> * Arun R B <arun@linux.vnet.ibm.com> [2009-09-01 17:08:40]:
> 
> > * Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-09-01 17:07:04]:
> > 
> > Cleanup drivers/cpuidle/cpuidle.c
> > 
> > Cpuidle maintains a pm_idle_old void pointer because, currently in x86
> > there is no clean way of registering and unregistering a idle function.
> >
> > So remove pm_idle_old and leave the responsibility of maintaining the
> > list of registered idle loops to the architecture specific code. If the
> > architecture registers cpuidle_idle_call as its idle loop, only then
> > this loop is called.
> > 
> 
> It sounds as if there is a side-effect of this
> patch on x86 (am I reading it incorrectly), which can be fixed, but
> it will need a patch or so to get back the old behaviour on x86.
> 

Hi Balbir,

Yes, your understanding is correct. Currently, x86 exports pm_idle and
this pm_idle is set to cpuidle_idle_call inside cpuidle.c

So instead of that x86 should just export a function called
set_arch_idle() which will be called from within
register_idle_function() and set pm_idle to the idle handler which is
currently being registered.

I have implemented this for pseries, and in the process of doing it
for x86 too.

> > Also remove unwanted functions cpuidle_[un]install_idle_handler,
> > cpuidle_kick_cpus()
> >
> > Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
> > ---
> >  drivers/cpuidle/cpuidle.c  |   51 +++++++++++++++------------------------------
> >  drivers/cpuidle/governor.c |    3 --
> >  2 files changed, 17 insertions(+), 37 deletions(-)
> > 
> > Index: linux.trees.git/drivers/cpuidle/cpuidle.c
> > ===================================================================
> > --- linux.trees.git.orig/drivers/cpuidle/cpuidle.c
> > +++ linux.trees.git/drivers/cpuidle/cpuidle.c
> > @@ -24,9 +24,14 @@ DEFINE_PER_CPU(struct cpuidle_device *, 
> > 
> >  DEFINE_MUTEX(cpuidle_lock);
> >  LIST_HEAD(cpuidle_detected_devices);
> > -static void (*pm_idle_old)(void);
> > 
> >  static int enabled_devices;
> > +static int idle_function_registered;
> > +
> > +struct idle_function_desc cpuidle_idle_desc = {
> > +	.name           =       "cpuidle_loop",
> > +	.idle_func      =       cpuidle_idle_call,
> > +};
> > 
> >  #if defined(CONFIG_ARCH_HAS_CPU_IDLE_WAIT)
> >  static void cpuidle_kick_cpus(void)
> > @@ -54,13 +59,10 @@ static void cpuidle_idle_call(void)
> > 
> >  	/* check if the device is ready */
> >  	if (!dev || !dev->enabled) {
> > -		if (pm_idle_old)
> > -			pm_idle_old();
> > -		else
> >  #if defined(CONFIG_ARCH_HAS_DEFAULT_IDLE)
> > -			default_idle();
> > +		default_idle();
> >  #else
> > -			local_irq_enable();
> > +		local_irq_enable();
> >  #endif
> >  		return;
> >  	}
> > @@ -94,35 +96,11 @@ static void cpuidle_idle_call(void)
> >  }
> > 
> >  /**
> > - * cpuidle_install_idle_handler - installs the cpuidle idle loop handler
> > - */
> > -void cpuidle_install_idle_handler(void)
> > -{
> > -	if (enabled_devices && (pm_idle != cpuidle_idle_call)) {
> > -		/* Make sure all changes finished before we switch to new idle */
> > -		smp_wmb();
> > -		pm_idle = cpuidle_idle_call;
> > -	}
> > -}
> > -
> > -/**
> > - * cpuidle_uninstall_idle_handler - uninstalls the cpuidle idle loop handler
> > - */
> > -void cpuidle_uninstall_idle_handler(void)
> > -{
> > -	if (enabled_devices && pm_idle_old && (pm_idle != pm_idle_old)) {
> > -		pm_idle = pm_idle_old;
> > -		cpuidle_kick_cpus();
> > -	}
> > -}
> > -
> > -/**
> >   * cpuidle_pause_and_lock - temporarily disables CPUIDLE
> >   */
> >  void cpuidle_pause_and_lock(void)
> >  {
> >  	mutex_lock(&cpuidle_lock);
> > -	cpuidle_uninstall_idle_handler();
> >  }
> > 
> >  EXPORT_SYMBOL_GPL(cpuidle_pause_and_lock);
> > @@ -132,7 +110,6 @@ EXPORT_SYMBOL_GPL(cpuidle_pause_and_lock
> >   */
> >  void cpuidle_resume_and_unlock(void)
> >  {
> > -	cpuidle_install_idle_handler();
> >  	mutex_unlock(&cpuidle_lock);
> >  }
> > 
> 
> What does this mean for users of cpuidle_pause_and_lock/unlock?
> Should we be calling register/unregister_idle_function here?
>

Yes, you are right. I have missed out on this part.
register/unregister_idle_function should replace
install/uninstall_idle_handler at those places. Thanks.

> 
> > @@ -287,6 +264,12 @@ static int __cpuidle_register_device(str
> >  	return 0;
> >  }
> > 
> > +static void register_cpuidle_idle_function(void)
> > +{
> > +	register_idle_function(&cpuidle_idle_desc);
> > +
> > +	idle_function_registered = 1;
> 
> Use booleans if possible, unless you intend to extend the meaning of
> registered someday.
>

I don't intend to extend the meaning of idle_function_registered.
Will use boolean here.

> > +}
> >  /**
> >   * cpuidle_register_device - registers a CPU's idle PM feature
> >   * @dev: the cpu
> > @@ -303,7 +286,9 @@ int cpuidle_register_device(struct cpuid
> >  	}
> > 
> >  	cpuidle_enable_device(dev);
> > -	cpuidle_install_idle_handler();
> > +
> > +	if (!idle_function_registered)
> > +		register_cpuidle_idle_function();
> > 
> >  	mutex_unlock(&cpuidle_lock);
> > 
> > @@ -382,8 +367,6 @@ static int __init cpuidle_init(void)
> >  {
> >  	int ret;
> > 
> > -	pm_idle_old = pm_idle;
> > -
> >  	ret = cpuidle_add_class_sysfs(&cpu_sysdev_class);
> >  	if (ret)
> >  		return ret;
> > Index: linux.trees.git/drivers/cpuidle/governor.c
> > ===================================================================
> > --- linux.trees.git.orig/drivers/cpuidle/governor.c
> > +++ linux.trees.git/drivers/cpuidle/governor.c
> > @@ -48,8 +48,6 @@ int cpuidle_switch_governor(struct cpuid
> >  	if (gov == cpuidle_curr_governor)
> >  		return 0;
> > 
> > -	cpuidle_uninstall_idle_handler();
> > -
> >  	if (cpuidle_curr_governor) {
> >  		list_for_each_entry(dev, &cpuidle_detected_devices, device_list)
> >  			cpuidle_disable_device(dev);
> > @@ -63,7 +61,6 @@ int cpuidle_switch_governor(struct cpuid
> >  			return -EINVAL;
> >  		list_for_each_entry(dev, &cpuidle_detected_devices, device_list)
> >  			cpuidle_enable_device(dev);
> > -		cpuidle_install_idle_handler();
> >  		printk(KERN_INFO "cpuidle: using governor %s\n", gov->name);
> >  	}
> > 
> 
> -- 
> 	Balbir

Thanks for the review!
--arun

^ permalink raw reply

* Re: [PATCH] Fix fake numa on ppc
From: Ankita Garg @ 2009-09-02  5:36 UTC (permalink / raw)
  To: Balbir Singh; +Cc: linuxppc-dev, linux-kernel
In-Reply-To: <20090901142729.GA5022@balbir.in.ibm.com>

Hi Balbir,

On Tue, Sep 01, 2009 at 07:57:29PM +0530, Balbir Singh wrote:
> * Ankita Garg <ankita@in.ibm.com> [2009-09-01 14:54:07]:
> 
> > Hi Balbir,
> > 
> > On Tue, Sep 01, 2009 at 11:27:53AM +0530, Balbir Singh wrote:
> > > * Ankita Garg <ankita@in.ibm.com> [2009-09-01 10:33:16]:
> > > 
> > > > Hello,
> > > > 
> > > > Below is a patch to fix a couple of issues with fake numa node creation
> > > > on ppc:
> > > > 
> > > > 1) Presently, fake nodes could be created such that real numa node
> > > > boundaries are not respected. So a node could have lmbs that belong to
> > > > different real nodes.
> > > > 
> > > > 2) The cpu association is broken. On a JS22 blade for example, which is
> > > > a 2-node numa machine, I get the following:
> > > > 
> > > > # cat /proc/cmdline
> > > > root=/dev/sda6  numa=fake=2G,4G,,6G,8G,10G,12G,14G,16G
> > > > # cat /sys/devices/system/node/node0/cpulist
> > > > 0-3
> > > > # cat /sys/devices/system/node/node1/cpulist
> > > > 4-7
> > > > # cat /sys/devices/system/node/node4/cpulist
> > > > 
> > > > #
> > > > 
> > > > So, though the cpus 4-7 should have been associated with node4, they
> > > > still belong to node1. The patch works by recording a real numa node
> > > > boundary and incrementing the fake node count. At the same time, a
> > > > mapping is stored from the real numa node to the first fake node that
> > > > gets created on it.
> > > >
> > > 
> > > Some details on how you tested it and results before and after would
> > > be nice. Please see git commit 1daa6d08d1257aa61f376c3cc4795660877fb9e3
> > > for example
> > > 
> > >
> > 
> > Thanks for the quick review of the patch. Here is some information on
> > the testing:
> > 
> > Tested the patch with the following commandlines:
> > numa=fake=2G,4G,6G,8G,10G,12G,14G,16G
> > numa=fake=3G,6G,10G,16G
> > numa=fake=4G
> > numa=fake=
> > 
> > For testing if the fake nodes respect the real node boundaries, I added
> > some debug printks in the node creation path. Without the patch, for the
> > commandline numa=fake=2G,4G,6G,8G,10G,12G,14G,16G, this is what I got:
> > 
> > fake id: 1 nid: 0
> > fake id: 1 nid: 0
> > ...
> > fake id: 2 nid: 0
> > fake id: 2 nid: 0
> > ...
> > fake id: 2 nid: 0
> > created new fake_node with id 3
> > fake id: 3 nid: 0
> > fake id: 3 nid: 0
> > ...
> > fake id: 3 nid: 0
> > fake id: 3 nid: 0
> > fake id: 3 nid: 1
> > fake id: 3 nid: 1
> > ...
> > created new fake_node with id 4
> > fake id: 4 nid: 1
> > fake id: 4 nid: 1
> > ...
> > 
> > and so on. So, fake node 3 encompasses real node 0 & 1. Also,
> > 
> > # cat /sys/devices/system/node/node3/meminfo
> > Node 0 MemTotal:        2097152 kB
> > ...
> > # # cat /sys/devices/system/node/node4/meminfo
> > Node 0 MemTotal:        2097152 kB
> > ...
> > 
> > 
> > With the patch, I get:
> > 
> > fake id: 1 nid: 0
> > fake id: 1 nid: 0
> > ...
> > fake id: 2 nid: 0
> > fake id: 2 nid: 0
> > ...
> > fake id: 2 nid: 0
> > created new fake_node with id 3
> > fake id: 3 nid: 0
> > fake id: 3 nid: 0
> > ...
> > fake id: 3 nid: 0
> > fake id: 3 nid: 0
> > created new fake_node with id 4
> > fake id: 4 nid: 1
> > fake id: 4 nid: 1
> > ...
> > 
> > and so on. With the patch, the fake node sizes are slightly different
> > from that specified by the user.
> > 
> > # cat /sys/devices/system/node/node3/meminfo
> > Node 3 MemTotal:        1638400 kB
> > ...
> > # cat /sys/devices/system/node/node4/meminfo
> > Node 4 MemTotal:         458752 kB
> > ...
> > 
> > CPU association was tested as mentioned in the previous mail:
> > 
> > Without the patch,
> > 
> > # cat /proc/cmdline
> > root=/dev/sda6  numa=fake=2G,4G,,6G,8G,10G,12G,14G,16G
> > # cat /sys/devices/system/node/node0/cpulist
> > 0-3
> > # cat /sys/devices/system/node/node1/cpulist
> > 4-7
> > # cat /sys/devices/system/node/node4/cpulist
> > 
> > #
> > 
> > With the patch,
> > 
> > # cat /proc/cmdline
> > root=/dev/sda6  numa=fake=2G,4G,,6G,8G,10G,12G,14G,16G
> > # cat /sys/devices/system/node/node0/cpulist
> > 0-3
> > # cat /sys/devices/system/node/node1/cpulist
> > 
> 
> Oh! interesting.. cpuless nodes :) I think we need to fix this in the
> longer run and distribute cpus between fake numa nodes of a real node
> using some acceptable heuristic.
>

True. Presently this is broken on both x86 and ppc systems. It would be
interesting to find a way to map, for example, 4 cpus to >4 number of
fake nodes created from a single real numa node!
 
> > # cat /sys/devices/system/node/node4/cpulist
> > 4-7
> > 
> > > > 
> > > > Signed-off-by: Ankita Garg <ankita@in.ibm.com>
> > > > 
> > > > Index: linux-2.6.31-rc5/arch/powerpc/mm/numa.c
> > > > ===================================================================
> > > > --- linux-2.6.31-rc5.orig/arch/powerpc/mm/numa.c
> > > > +++ linux-2.6.31-rc5/arch/powerpc/mm/numa.c
> > > > @@ -26,6 +26,11 @@
> > > >  #include <asm/smp.h>
> > > > 
> > > >  static int numa_enabled = 1;
> > > > +static int fake_enabled = 1;
> > > > +
> > > > +/* The array maps a real numa node to the first fake node that gets
> > > > +created on it */
> > > 
> > > Coding style is broken
> > > 
> > 
> > Fixed.
> > 
> > > > +int fake_numa_node_mapping[MAX_NUMNODES];
> > > > 
> > > >  static char *cmdline __initdata;
> > > > 
> > > > @@ -49,14 +54,24 @@ static int __cpuinit fake_numa_create_ne
> > > >  	unsigned long long mem;
> > > >  	char *p = cmdline;
> > > >  	static unsigned int fake_nid;
> > > > +	static unsigned int orig_nid = 0;
> > > 
> > > Should we call this prev_nid?
> > > 
> > 
> > Yes, makes sense.
> > > >  	static unsigned long long curr_boundary;
> > > > 
> > > >  	/*
> > > >  	 * Modify node id, iff we started creating NUMA nodes
> > > >  	 * We want to continue from where we left of the last time
> > > >  	 */
> > > > -	if (fake_nid)
> > > > +	if (fake_nid) {
> > > > +		if (orig_nid != *nid) {
> > > 
> > > OK, so this is called when the real NUMA node changes - comments would
> > > be nice
> > >
> > 
> > Thanks, have added the comment.
> > 
> > > > +			fake_nid++;
> > > > +			fake_numa_node_mapping[*nid] = fake_nid;
> > > > +			orig_nid = *nid;
> > > > +			*nid = fake_nid;
> > > > +			return 0;
> > > > +		}
> > > >  		*nid = fake_nid;
> > > > +	}
> > > > +
> > > >  	/*
> > > >  	 * In case there are no more arguments to parse, the
> > > >  	 * node_id should be the same as the last fake node id
> > > > @@ -440,7 +455,7 @@ static int of_drconf_to_nid_single(struc
> > > >   */
> > > >  static int __cpuinit numa_setup_cpu(unsigned long lcpu)
> > > >  {
> > > > -	int nid = 0;
> > > > +	int nid = 0, new_nid;
> > > >  	struct device_node *cpu = of_get_cpu_node(lcpu, NULL);
> > > > 
> > > >  	if (!cpu) {
> > > > @@ -450,8 +465,15 @@ static int __cpuinit numa_setup_cpu(unsi
> > > > 
> > > >  	nid = of_node_to_nid_single(cpu);
> > > > 
> > > > +	if (fake_enabled && nid) {
> > > > +		new_nid = fake_numa_node_mapping[nid];
> > > > +		if (new_nid > 0)
> > > > +			nid = new_nid;
> > > > +	}
> > > > +
> > > >  	if (nid < 0 || !node_online(nid))
> > > >  		nid = any_online_node(NODE_MASK_ALL);
> > > > +
> > > >  out:
> > > >  	map_cpu_to_node(lcpu, nid);
> > > > 
> > > > @@ -1005,8 +1027,11 @@ static int __init early_numa(char *p)
> > > >  		numa_debug = 1;
> > > > 
> > > >  	p = strstr(p, "fake=");
> > > > -	if (p)
> > > > +	if (p) {
> > > >  		cmdline = p + strlen("fake=");
> > > > +		if (numa_enabled)
> > > > +			fake_enabled = 1;
> > > 
> > > Have you tried passing just numa=fake= without any commandline?
> > > That should enable fake_enabled, but I wonder if that negatively
> > > impacts numa_setup_cpu(). I wonder if you should look at cmdline
> > > to decide on fake_enabled.
> > >
> > 
> > fake_enabled does get set even for numa=fake=. However, it does not
> > impact numa_setup_cpu, since fake_numa_node_mapping array would have no
> > mapping stored and there is a condition there already to check for the
> > value of the mapping. I confirmed this by booting with the above
> > parameter as well.
> > 
> > > > +	}
> > > > 
> > > >  	return 0;
> > > >  }
> > > >
> > > 
> > > Overall, I think this is the right thing to do, we need to move in
> > > this direction. 
> > > 
> > 
> > Heres the updated patch:
> > 
> > Signed-off-by: Ankita Garg <ankita@in.ibm.com> 
> > 
> > Index: linux-2.6.31-rc5/arch/powerpc/mm/numa.c
> > ===================================================================
> > --- linux-2.6.31-rc5.orig/arch/powerpc/mm/numa.c
> > +++ linux-2.6.31-rc5/arch/powerpc/mm/numa.c
> > @@ -26,6 +26,13 @@
> >  #include <asm/smp.h>
> > 
> >  static int numa_enabled = 1;
> > +static int fake_enabled = 1;
> > +
> > +/*
> > + * The array maps a real numa node to the first fake node that gets
> > + * created on it
> > + */
> > +int fake_numa_node_mapping[MAX_NUMNODES];
> > 
> >  static char *cmdline __initdata;
> > 
> > @@ -49,14 +56,29 @@ static int __cpuinit fake_numa_create_ne
> >  	unsigned long long mem;
> >  	char *p = cmdline;
> >  	static unsigned int fake_nid;
> > +	static unsigned int prev_nid = 0;
> >  	static unsigned long long curr_boundary;
> > 
> >  	/*
> >  	 * Modify node id, iff we started creating NUMA nodes
> >  	 * We want to continue from where we left of the last time
> >  	 */
> > -	if (fake_nid)
> > +	if (fake_nid) {
> > +		/*
> > +		 * Moved over to the next real numa node, increment fake
> > +		 * node number and store the mapping of the real node to
> > +		 * the fake node
> > +		 */
> > +		if (prev_nid != *nid) {
> > +			fake_nid++;
> > +			fake_numa_node_mapping[*nid] = fake_nid;
> > +			prev_nid = *nid;
> > +			*nid = fake_nid;
> > +			return 0;
> > +		}
> >  		*nid = fake_nid;
> > +	}
> > +
> >  	/*
> >  	 * In case there are no more arguments to parse, the
> >  	 * node_id should be the same as the last fake node id
> > @@ -440,7 +462,7 @@ static int of_drconf_to_nid_single(struc
> >   */
> >  static int __cpuinit numa_setup_cpu(unsigned long lcpu)
> >  {
> > -	int nid = 0;
> > +	int nid = 0, new_nid;
> >  	struct device_node *cpu = of_get_cpu_node(lcpu, NULL);
> > 
> >  	if (!cpu) {
> > @@ -450,8 +472,15 @@ static int __cpuinit numa_setup_cpu(unsi
> > 
> >  	nid = of_node_to_nid_single(cpu);
> > 
> > +	if (fake_enabled && nid) {
> > +		new_nid = fake_numa_node_mapping[nid];
> > +		if (new_nid > 0)
> > +			nid = new_nid;
> > +	}
> > +
> >  	if (nid < 0 || !node_online(nid))
> >  		nid = any_online_node(NODE_MASK_ALL);
> > +
> >  out:
> >  	map_cpu_to_node(lcpu, nid);
> > 
> > @@ -1005,8 +1034,12 @@ static int __init early_numa(char *p)
> >  		numa_debug = 1;
> > 
> >  	p = strstr(p, "fake=");
> > -	if (p)
> > +	if (p) {
> >  		cmdline = p + strlen("fake=");
> > +		if (numa_enabled) {
> > +			fake_enabled = 1;
> > +		}
> > +	}
> > 
> >  	return 0;
> >  }
> >
> 
> 
> Looks good to me
> 
> 
> Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> 
> 
> 
> -- 
> 	Balbir

-- 
Regards,
Ankita Garg (ankita@in.ibm.com)
Linux Technology Center
IBM India Systems & Technology Labs, 
Bangalore, India   

^ permalink raw reply

* Re: [PATCH v2 0/2] cpu: pseries: Offline state framework.
From: Peter Zijlstra @ 2009-09-02  5:33 UTC (permalink / raw)
  To: Gautham R Shenoy
  Cc: linux-kernel, Venkatesh Pallipadi, linuxppc-dev, Darrick J. Wong
In-Reply-To: <20090828095741.10641.32053.stgit@sofia.in.ibm.com>

On Fri, 2009-08-28 at 15:30 +0530, Gautham R Shenoy wrote:
> Hi,
> 
> This is the version 2 of the patch series to provide a cpu-offline framework
> that enables the administrators choose the state the offline CPU must be put
> into when multiple such states are exposed by the underlying architecture.
> 
> Version 1 of the Patch can be found here:
> http://lkml.org/lkml/2009/8/6/236
> 
> The patch-series exposes the following sysfs tunables to
> allow the system-adminstrator to choose the state of a CPU:
> 
> To query the available hotplug states, one needs to read the sysfs tunable:
> 	/sys/devices/system/cpu/cpu<number>/available_hotplug_states
> To query or set the current state, on needs to read/write the sysfs tunable:
> 	/sys/devices/system/cpu/cpu<number>/current_states
> 
> The patchset ensures that the writes to the "current_state" sysfs file are
> serialized against the writes to the "online" file.
> 
> This patchset also contains the offline state driver implemented for
> pSeries. For pSeries, we define three available_hotplug_states. They are:
> 
> 	online: The processor is online.
> 
> 	deallocate: This is the the default behaviour when the cpu is offlined
> 	even in the absense of this driver. The CPU would call make an
> 	rtas_stop_self() call and hand over the CPU back to the resource pool,
> 	thereby effectively deallocating that vCPU from the LPAR.
> 	NOTE: This would result in a configuration change to the LPAR
> 	which is visible to the outside world.
> 
> 	deactivate: This cedes the vCPU to the hypervisor which
> 	in turn can put the vCPU time to the best use.
> 	NOTE: This option DOES NOT result in a configuration change
> 	and the vCPU would be still entitled to the LPAR to which it earlier
> 	belong to.
> 
> Awaiting your feedback.

I'm still thinking this is a bad idea.

The OS should only know about online/offline.

Use the hypervisor interface to deal with the cpu once its offline.

That is, I think this interface you propose is a layering violation.

^ permalink raw reply

* Re: [v4 PATCH 1/5]: cpuidle: Cleanup drivers/cpuidle/cpuidle.c
From: Arun R Bharadwaj @ 2009-09-02  5:45 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Peter Zijlstra, Gautham R Shenoy, aun, linux-kernel,
	Paul Mackerras, Ingo Molnar, linuxppc-dev
In-Reply-To: <20090901172825.GA6780@balbir.in.ibm.com>

* Balbir Singh <balbir@linux.vnet.ibm.com> [2009-09-01 22:58:25]:

> * Arun R B <arun@linux.vnet.ibm.com> [2009-09-01 17:08:40]:
> 
> > * Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-09-01 17:07:04]:
> > 
> > Cleanup drivers/cpuidle/cpuidle.c
> > 
> > Cpuidle maintains a pm_idle_old void pointer because, currently in x86
> > there is no clean way of registering and unregistering a idle function.
> >
> > So remove pm_idle_old and leave the responsibility of maintaining the
> > list of registered idle loops to the architecture specific code. If the
> > architecture registers cpuidle_idle_call as its idle loop, only then
> > this loop is called.
> > 
> 
> It sounds as if there is a side-effect of this
> patch on x86 (am I reading it incorrectly), which can be fixed, but
> it will need a patch or so to get back the old behaviour on x86.
> 
> > Also remove unwanted functions cpuidle_[un]install_idle_handler,
> > cpuidle_kick_cpus()
> >
> > Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
> > ---
> >  drivers/cpuidle/cpuidle.c  |   51 +++++++++++++++------------------------------
> >  drivers/cpuidle/governor.c |    3 --
> >  2 files changed, 17 insertions(+), 37 deletions(-)
> > 
> > Index: linux.trees.git/drivers/cpuidle/cpuidle.c
> > ===================================================================
> > --- linux.trees.git.orig/drivers/cpuidle/cpuidle.c
> > +++ linux.trees.git/drivers/cpuidle/cpuidle.c
> > @@ -24,9 +24,14 @@ DEFINE_PER_CPU(struct cpuidle_device *, 
> > 
> >  DEFINE_MUTEX(cpuidle_lock);
> >  LIST_HEAD(cpuidle_detected_devices);
> > -static void (*pm_idle_old)(void);
> > 
> >  static int enabled_devices;
> > +static int idle_function_registered;
> > +
> > +struct idle_function_desc cpuidle_idle_desc = {
> > +	.name           =       "cpuidle_loop",
> > +	.idle_func      =       cpuidle_idle_call,
> > +};
> > 
> >  #if defined(CONFIG_ARCH_HAS_CPU_IDLE_WAIT)
> >  static void cpuidle_kick_cpus(void)
> > @@ -54,13 +59,10 @@ static void cpuidle_idle_call(void)
> > 
> >  	/* check if the device is ready */
> >  	if (!dev || !dev->enabled) {
> > -		if (pm_idle_old)
> > -			pm_idle_old();
> > -		else
> >  #if defined(CONFIG_ARCH_HAS_DEFAULT_IDLE)
> > -			default_idle();
> > +		default_idle();
> >  #else
> > -			local_irq_enable();
> > +		local_irq_enable();
> >  #endif
> >  		return;
> >  	}
> > @@ -94,35 +96,11 @@ static void cpuidle_idle_call(void)
> >  }
> > 
> >  /**
> > - * cpuidle_install_idle_handler - installs the cpuidle idle loop handler
> > - */
> > -void cpuidle_install_idle_handler(void)
> > -{
> > -	if (enabled_devices && (pm_idle != cpuidle_idle_call)) {
> > -		/* Make sure all changes finished before we switch to new idle */
> > -		smp_wmb();
> > -		pm_idle = cpuidle_idle_call;
> > -	}
> > -}
> > -
> > -/**
> > - * cpuidle_uninstall_idle_handler - uninstalls the cpuidle idle loop handler
> > - */
> > -void cpuidle_uninstall_idle_handler(void)
> > -{
> > -	if (enabled_devices && pm_idle_old && (pm_idle != pm_idle_old)) {
> > -		pm_idle = pm_idle_old;
> > -		cpuidle_kick_cpus();
> > -	}
> > -}
> > -
> > -/**
> >   * cpuidle_pause_and_lock - temporarily disables CPUIDLE
> >   */
> >  void cpuidle_pause_and_lock(void)
> >  {
> >  	mutex_lock(&cpuidle_lock);
> > -	cpuidle_uninstall_idle_handler();
> >  }
> > 
> >  EXPORT_SYMBOL_GPL(cpuidle_pause_and_lock);
> > @@ -132,7 +110,6 @@ EXPORT_SYMBOL_GPL(cpuidle_pause_and_lock
> >   */
> >  void cpuidle_resume_and_unlock(void)
> >  {
> > -	cpuidle_install_idle_handler();
> >  	mutex_unlock(&cpuidle_lock);
> >  }
> > 
> 
> What does this mean for users of cpuidle_pause_and_lock/unlock?
> Should we be calling register/unregister_idle_function here?
> 

Just observed the use case for cpuidle_pause_and_lock/unlock.
It is not clear as to why we need to switch back to the old idle
handler and then again back to cpuidle's idle handler. Wouldn't it
make more sense to just register the idle handler when the first
cpuidle device is being registered and unregister the idle handler
when the last cpuidle device is unregistered?

--arun

> 
> > @@ -287,6 +264,12 @@ static int __cpuidle_register_device(str
> >  	return 0;
> >  }
> > 
> > +static void register_cpuidle_idle_function(void)
> > +{
> > +	register_idle_function(&cpuidle_idle_desc);
> > +
> > +	idle_function_registered = 1;
> 
> Use booleans if possible, unless you intend to extend the meaning of
> registered someday.
> 
> > +}
> >  /**
> >   * cpuidle_register_device - registers a CPU's idle PM feature
> >   * @dev: the cpu
> > @@ -303,7 +286,9 @@ int cpuidle_register_device(struct cpuid
> >  	}
> > 
> >  	cpuidle_enable_device(dev);
> > -	cpuidle_install_idle_handler();
> > +
> > +	if (!idle_function_registered)
> > +		register_cpuidle_idle_function();
> > 
> >  	mutex_unlock(&cpuidle_lock);
> > 
> > @@ -382,8 +367,6 @@ static int __init cpuidle_init(void)
> >  {
> >  	int ret;
> > 
> > -	pm_idle_old = pm_idle;
> > -
> >  	ret = cpuidle_add_class_sysfs(&cpu_sysdev_class);
> >  	if (ret)
> >  		return ret;
> > Index: linux.trees.git/drivers/cpuidle/governor.c
> > ===================================================================
> > --- linux.trees.git.orig/drivers/cpuidle/governor.c
> > +++ linux.trees.git/drivers/cpuidle/governor.c
> > @@ -48,8 +48,6 @@ int cpuidle_switch_governor(struct cpuid
> >  	if (gov == cpuidle_curr_governor)
> >  		return 0;
> > 
> > -	cpuidle_uninstall_idle_handler();
> > -
> >  	if (cpuidle_curr_governor) {
> >  		list_for_each_entry(dev, &cpuidle_detected_devices, device_list)
> >  			cpuidle_disable_device(dev);
> > @@ -63,7 +61,6 @@ int cpuidle_switch_governor(struct cpuid
> >  			return -EINVAL;
> >  		list_for_each_entry(dev, &cpuidle_detected_devices, device_list)
> >  			cpuidle_enable_device(dev);
> > -		cpuidle_install_idle_handler();
> >  		printk(KERN_INFO "cpuidle: using governor %s\n", gov->name);
> >  	}
> > 
> 
> -- 
> 	Balbir

^ permalink raw reply

* Re: [v4 PATCH 1/5]: cpuidle: Cleanup drivers/cpuidle/cpuidle.c
From: Peter Zijlstra @ 2009-09-02  5:42 UTC (permalink / raw)
  To: arun
  Cc: Gautham R Shenoy, linux-kernel, Paul Mackerras, Ingo Molnar,
	linuxppc-dev
In-Reply-To: <20090901113840.GH7599@linux.vnet.ibm.com>

On Tue, 2009-09-01 at 17:08 +0530, Arun R Bharadwaj wrote:
> * Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-09-01 17:07:04]:
> 
> Cleanup drivers/cpuidle/cpuidle.c
> 
> Cpuidle maintains a pm_idle_old void pointer because, currently in x86
> there is no clean way of registering and unregistering a idle function.

Right, and instead of fixing that, they build this cpuidle crap on top,
instead of replacing the current crap with it.

> So remove pm_idle_old and leave the responsibility of maintaining the
> list of registered idle loops to the architecture specific code. If the
> architecture registers cpuidle_idle_call as its idle loop, only then
> this loop is called.

OK, that's a start I guess. Best would be to replace all of pm_idle with
cpuidle, which is what should have been done from the very start.

If cpuidle cannot fully replace the pm_idle functionality, then it needs
to fix that. But having two layers of idle functions is just silly.

Looking at patch 2 and 3, you're making the same mistake on power, after
those patches there are multiple ways of registering idle functions, one
through some native interface and one through cpuidle, this strikes me
as undesirable.

If cpuidle is a good idle function manager, then it should be good
enough to be the sole one, if its not, then why bother with it at all.

^ permalink raw reply

* Re: [PATCH] powerpc/fsl-booke: Use HW PTE format if CONFIG_PTE_64BIT
From: Benjamin Herrenschmidt @ 2009-09-02  5:48 UTC (permalink / raw)
  To: Kumar Gala; +Cc: linuxppc-dev
In-Reply-To: <1251856122-24560-1-git-send-email-galak@kernel.crashing.org>

On Tue, 2009-09-01 at 20:48 -0500, Kumar Gala wrote:
> Switch to using the Power ISA defined PTE format when we have a 64-bit
> PTE.  This makes the code handling between fsl-booke and book3e-64
> similiar for TLB faults.
> 
> Additionally this lets use take advantage of the page size encodings and
> full permissions that the HW PTE defines.
> 
> Also defined _PMD_PRESENT, _PMD_PRESENT_MASK, and _PMD_BAD since the
> 32-bit ppc arch code expects them.

No immediate problem with the patch, though I can't test it so I
assume you did :-)

Is this 2.6.32 material ? I'm going to stick it into my test branch
for now regardless.

Cheers,
Ben.

> Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
> ---
>  arch/powerpc/include/asm/pgtable-ppc32.h |    2 +
>  arch/powerpc/include/asm/pte-book3e.h    |    3 ++
>  arch/powerpc/include/asm/pte-fsl-booke.h |    7 -----
>  arch/powerpc/kernel/head_fsl_booke.S     |   36 ++++++++++++++++++++---------
>  4 files changed, 30 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/pgtable-ppc32.h b/arch/powerpc/include/asm/pgtable-ppc32.h
> index f2c52e2..55646ad 100644
> --- a/arch/powerpc/include/asm/pgtable-ppc32.h
> +++ b/arch/powerpc/include/asm/pgtable-ppc32.h
> @@ -111,6 +111,8 @@ extern int icache_44x_need_flush;
>  #include <asm/pte-40x.h>
>  #elif defined(CONFIG_44x)
>  #include <asm/pte-44x.h>
> +#elif defined(CONFIG_FSL_BOOKE) && defined(CONFIG_PTE_64BIT)
> +#include <asm/pte-book3e.h>
>  #elif defined(CONFIG_FSL_BOOKE)
>  #include <asm/pte-fsl-booke.h>
>  #elif defined(CONFIG_8xx)
> diff --git a/arch/powerpc/include/asm/pte-book3e.h b/arch/powerpc/include/asm/pte-book3e.h
> index b82b9dc..082d515 100644
> --- a/arch/powerpc/include/asm/pte-book3e.h
> +++ b/arch/powerpc/include/asm/pte-book3e.h
> @@ -75,6 +75,9 @@
>  /* On 32-bit, we never clear the top part of the PTE */
>  #ifdef CONFIG_PPC32
>  #define _PTE_NONE_MASK	0xffffffff00000000ULL
> +#define _PMD_PRESENT	0
> +#define _PMD_PRESENT_MASK (PAGE_MASK)
> +#define _PMD_BAD	(~PAGE_MASK)
>  #endif
>  
>  #endif /* __KERNEL__ */
> diff --git a/arch/powerpc/include/asm/pte-fsl-booke.h b/arch/powerpc/include/asm/pte-fsl-booke.h
> index ce8a9e9..2c12be5 100644
> --- a/arch/powerpc/include/asm/pte-fsl-booke.h
> +++ b/arch/powerpc/include/asm/pte-fsl-booke.h
> @@ -33,13 +33,6 @@
>  #define _PAGE_WRITETHRU	0x00400	/* H: W bit */
>  #define _PAGE_SPECIAL	0x00800 /* S: Special page */
>  
> -#ifdef CONFIG_PTE_64BIT
> -/* ERPN in a PTE never gets cleared, ignore it */
> -#define _PTE_NONE_MASK	0xffffffffffff0000ULL
> -/* We extend the size of the PTE flags area when using 64-bit PTEs */
> -#define PTE_RPN_SHIFT	(PAGE_SHIFT + 8)
> -#endif
> -
>  #define _PMD_PRESENT	0
>  #define _PMD_PRESENT_MASK (PAGE_MASK)
>  #define _PMD_BAD	(~PAGE_MASK)
> diff --git a/arch/powerpc/kernel/head_fsl_booke.S b/arch/powerpc/kernel/head_fsl_booke.S
> index 2c5af52..975788c 100644
> --- a/arch/powerpc/kernel/head_fsl_booke.S
> +++ b/arch/powerpc/kernel/head_fsl_booke.S
> @@ -575,7 +575,12 @@ interrupt_base:
>  	 *       place or can we save a couple of instructions here ?
>  	 */
>  	mfspr	r12,SPRN_ESR
> +#ifdef CONFIG_PTE_64BIT
> +	li	r13,_PAGE_PRESENT
> +	oris	r13,r13,_PAGE_ACCESSED@h
> +#else
>  	li	r13,_PAGE_PRESENT|_PAGE_ACCESSED
> +#endif
>  	rlwimi	r13,r12,11,29,29
>  
>  	FIND_PTE
> @@ -643,7 +648,12 @@ interrupt_base:
>  
>  4:
>  	/* Make up the required permissions */
> +#ifdef CONFIG_PTE_64BIT
> +	li	r13,_PAGE_PRESENT | _PAGE_EXEC
> +	oris	r13,r13,_PAGE_ACCESSED@h
> +#else
>  	li	r13,_PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_EXEC
> +#endif
>  
>  	FIND_PTE
>  	andc.	r13,r13,r11		/* Check permission */
> @@ -733,7 +743,7 @@ finish_tlb_load:
>  
>  	mfspr	r12, SPRN_MAS2
>  #ifdef CONFIG_PTE_64BIT
> -	rlwimi	r12, r11, 26, 24, 31	/* extract ...WIMGE from pte */
> +	rlwimi	r12, r11, 32-19, 27, 31	/* extract WIMGE from pte */
>  #else
>  	rlwimi	r12, r11, 26, 27, 31	/* extract WIMGE from pte */
>  #endif
> @@ -742,6 +752,20 @@ finish_tlb_load:
>  #endif
>  	mtspr	SPRN_MAS2, r12
>  
> +#ifdef CONFIG_PTE_64BIT
> +	rlwinm	r12, r11, 32-2, 26, 31	/* Move in perm bits */
> +	andi.	r10, r11, _PAGE_DIRTY
> +	bne	1f
> +	li	r10, MAS3_SW | MAS3_UW
> +	andc	r12, r12, r10
> +1:	rlwimi	r12, r13, 20, 0, 11	/* grab RPN[32:43] */
> +	rlwimi	r12, r11, 20, 12, 19	/* grab RPN[44:51] */
> +	mtspr	SPRN_MAS3, r12
> +BEGIN_MMU_FTR_SECTION
> +	srwi	r10, r13, 12		/* grab RPN[12:31] */
> +	mtspr	SPRN_MAS7, r10
> +END_MMU_FTR_SECTION_IFSET(MMU_FTR_BIG_PHYS)
> +#else
>  	li	r10, (_PAGE_EXEC | _PAGE_PRESENT)
>  	rlwimi	r10, r11, 31, 29, 29	/* extract _PAGE_DIRTY into SW */
>  	and	r12, r11, r10
> @@ -749,16 +773,6 @@ finish_tlb_load:
>  	slwi	r10, r12, 1
>  	or	r10, r10, r12
>  	iseleq	r12, r12, r10
> -	
> -#ifdef CONFIG_PTE_64BIT
> -	rlwimi	r12, r13, 24, 0, 7	/* grab RPN[32:39] */
> -	rlwimi	r12, r11, 24, 8, 19	/* grab RPN[40:51] */
> -	mtspr	SPRN_MAS3, r12
> -BEGIN_MMU_FTR_SECTION
> -	srwi	r10, r13, 8		/* grab RPN[8:31] */
> -	mtspr	SPRN_MAS7, r10
> -END_MMU_FTR_SECTION_IFSET(MMU_FTR_BIG_PHYS)
> -#else
>  	rlwimi	r11, r12, 0, 20, 31	/* Extract RPN from PTE and merge with perms */
>  	mtspr	SPRN_MAS3, r11
>  #endif

^ permalink raw reply

* Re: [PATCH] Fix fake numa on ppc
From: Benjamin Herrenschmidt @ 2009-09-02  5:52 UTC (permalink / raw)
  To: balbir; +Cc: linuxppc-dev, Ankita Garg, linux-kernel
In-Reply-To: <20090901142729.GA5022@balbir.in.ibm.com>

On Tue, 2009-09-01 at 19:57 +0530, Balbir Singh wrote:

> > 
> > Heres the updated patch:
> > 
> > Signed-off-by: Ankita Garg <ankita@in.ibm.com> 

The thread is too messy. Please Ankita can you re-submit in proper
form (ie. subject, cset, signed-off & patch) please ?

In general, don't mix reply to comments and new patch submission,
or patchwork gets really confuse. Post a new name, with something
like [PATCH v2] <name of the patch> in the subject.

Thanks.

Cheers,
Ben.

> > Index: linux-2.6.31-rc5/arch/powerpc/mm/numa.c
> > ===================================================================
> > --- linux-2.6.31-rc5.orig/arch/powerpc/mm/numa.c
> > +++ linux-2.6.31-rc5/arch/powerpc/mm/numa.c
> > @@ -26,6 +26,13 @@
> >  #include <asm/smp.h>
> > 
> >  static int numa_enabled = 1;
> > +static int fake_enabled = 1;
> > +
> > +/*
> > + * The array maps a real numa node to the first fake node that gets
> > + * created on it
> > + */
> > +int fake_numa_node_mapping[MAX_NUMNODES];
> > 
> >  static char *cmdline __initdata;
> > 
> > @@ -49,14 +56,29 @@ static int __cpuinit fake_numa_create_ne
> >  	unsigned long long mem;
> >  	char *p = cmdline;
> >  	static unsigned int fake_nid;
> > +	static unsigned int prev_nid = 0;
> >  	static unsigned long long curr_boundary;
> > 
> >  	/*
> >  	 * Modify node id, iff we started creating NUMA nodes
> >  	 * We want to continue from where we left of the last time
> >  	 */
> > -	if (fake_nid)
> > +	if (fake_nid) {
> > +		/*
> > +		 * Moved over to the next real numa node, increment fake
> > +		 * node number and store the mapping of the real node to
> > +		 * the fake node
> > +		 */
> > +		if (prev_nid != *nid) {
> > +			fake_nid++;
> > +			fake_numa_node_mapping[*nid] = fake_nid;
> > +			prev_nid = *nid;
> > +			*nid = fake_nid;
> > +			return 0;
> > +		}
> >  		*nid = fake_nid;
> > +	}
> > +
> >  	/*
> >  	 * In case there are no more arguments to parse, the
> >  	 * node_id should be the same as the last fake node id
> > @@ -440,7 +462,7 @@ static int of_drconf_to_nid_single(struc
> >   */
> >  static int __cpuinit numa_setup_cpu(unsigned long lcpu)
> >  {
> > -	int nid = 0;
> > +	int nid = 0, new_nid;
> >  	struct device_node *cpu = of_get_cpu_node(lcpu, NULL);
> > 
> >  	if (!cpu) {
> > @@ -450,8 +472,15 @@ static int __cpuinit numa_setup_cpu(unsi
> > 
> >  	nid = of_node_to_nid_single(cpu);
> > 
> > +	if (fake_enabled && nid) {
> > +		new_nid = fake_numa_node_mapping[nid];
> > +		if (new_nid > 0)
> > +			nid = new_nid;
> > +	}
> > +
> >  	if (nid < 0 || !node_online(nid))
> >  		nid = any_online_node(NODE_MASK_ALL);
> > +
> >  out:
> >  	map_cpu_to_node(lcpu, nid);
> > 
> > @@ -1005,8 +1034,12 @@ static int __init early_numa(char *p)
> >  		numa_debug = 1;
> > 
> >  	p = strstr(p, "fake=");
> > -	if (p)
> > +	if (p) {
> >  		cmdline = p + strlen("fake=");
> > +		if (numa_enabled) {
> > +			fake_enabled = 1;
> > +		}
> > +	}
> > 
> >  	return 0;
> >  }
> >
> 
> 
> Looks good to me
> 
> 
> Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com>
>  
> 
> 

^ permalink raw reply

* Re: [PATCH] Fix fake numa on ppc
From: Benjamin Herrenschmidt @ 2009-09-02  5:53 UTC (permalink / raw)
  To: Ankita Garg; +Cc: linuxppc-dev, linux-kernel, Balbir Singh
In-Reply-To: <20090902053653.GA3806@in.ibm.com>

On Wed, 2009-09-02 at 11:06 +0530, Ankita Garg wrote:
> > Oh! interesting.. cpuless nodes :) I think we need to fix this in the
> > longer run and distribute cpus between fake numa nodes of a real node
> > using some acceptable heuristic.
> >
> 
> True. Presently this is broken on both x86 and ppc systems. It would be
> interesting to find a way to map, for example, 4 cpus to >4 number of
> fake nodes created from a single real numa node!

Since I'm pretty sure there could be CPU less nodes just like there
could be memory-less nodes, it would be good if fake numa could
simulate them too :-)

Cheers,
Ben.

^ permalink raw reply

* Re: [PATCH] powerpc/fsl-booke: Use HW PTE format if CONFIG_PTE_64BIT
From: Kumar Gala @ 2009-09-02  6:02 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linuxppc-dev
In-Reply-To: <1251870487.14675.416.camel@pasglop>


On Sep 2, 2009, at 12:48 AM, Benjamin Herrenschmidt wrote:

> On Tue, 2009-09-01 at 20:48 -0500, Kumar Gala wrote:
>> Switch to using the Power ISA defined PTE format when we have a 64- 
>> bit
>> PTE.  This makes the code handling between fsl-booke and book3e-64
>> similiar for TLB faults.
>>
>> Additionally this lets use take advantage of the page size  
>> encodings and
>> full permissions that the HW PTE defines.
>>
>> Also defined _PMD_PRESENT, _PMD_PRESENT_MASK, and _PMD_BAD since the
>> 32-bit ppc arch code expects them.
>
> No immediate problem with the patch, though I can't test it so I
> assume you did :-)

Yes, I tested it.

> Is this 2.6.32 material ? I'm going to stick it into my test branch
> for now regardless.

Yes this is .32 material

- k

^ permalink raw reply

* Re: [PATCH] Fix fake numa on ppc
From: David Rientjes @ 2009-09-02  5:58 UTC (permalink / raw)
  To: Ankita Garg; +Cc: linuxppc-dev, linux-kernel, Balbir Singh
In-Reply-To: <20090902053653.GA3806@in.ibm.com>

On Wed, 2 Sep 2009, Ankita Garg wrote:

> > > With the patch,
> > > 
> > > # cat /proc/cmdline
> > > root=/dev/sda6  numa=fake=2G,4G,,6G,8G,10G,12G,14G,16G
> > > # cat /sys/devices/system/node/node0/cpulist
> > > 0-3
> > > # cat /sys/devices/system/node/node1/cpulist
> > > 
> > 
> > Oh! interesting.. cpuless nodes :) I think we need to fix this in the
> > longer run and distribute cpus between fake numa nodes of a real node
> > using some acceptable heuristic.
> >
> 
> True. Presently this is broken on both x86 and ppc systems. It would be
> interesting to find a way to map, for example, 4 cpus to >4 number of
> fake nodes created from a single real numa node!
>  

We've done it for years on x86_64.  It's quite trivial to map all fake 
nodes within a physical node to the cpus to which they have affinity both 
via node_to_cpumask_map() and cpu_to_node_map().  There should be no 
kernel space dependencies on a cpu appearing in only a single node's 
cpumask and if you map each fake node to its physical node's pxm, you can 
index into the slit and generate local NUMA distances amongst fake nodes.

So if you map the apicids and pxms appropriately depending on the 
physical topology of the machine, that is the only emulation necessary on 
x86_64 for the page allocator zonelist ordering, task migration, etc.  (If 
you use CONFIG_SLAB, you'll need to avoid the exponential growth of alien 
caches, but that's an implementation detail and isn't really within the 
scope of numa=fake's purpose to modify.)

^ permalink raw reply

* [PATCH v2] Fix fake numa on ppc
From: Ankita Garg @ 2009-09-02  6:09 UTC (permalink / raw)
  To: LKML, linuxppc-dev, Benjamin Herrenschmidt, Balbir Singh; +Cc: ankita

Hi,

Below is a patch to fix a couple of issues with fake numa node creation
on ppc:

1) Presently, fake nodes could be created such that real numa node
boundaries are not respected. So a node could have lmbs that belong to
different real nodes.

2) The cpu association is broken. On a JS22 blade for example, which is
a 2-node numa machine, I get the following:

# cat /proc/cmdline
root=/dev/sda6  numa=fake=2G,4G,,6G,8G,10G,12G,14G,16G
# cat /sys/devices/system/node/node0/cpulist
0-3
# cat /sys/devices/system/node/node1/cpulist
4-7
# cat /sys/devices/system/node/node4/cpulist

#

So, though the cpus 4-7 should have been associated with node4, they
still belong to node1. The patch works by recording a real numa node
boundary and incrementing the fake node count. At the same time, a
mapping is stored from the real numa node to the first fake node that
gets created on it.

Tested the patch with the following commandlines:
numa=fake=2G,4G,6G,8G,10G,12G,14G,16G
numa=fake=3G,6G,10G,16G
numa=fake=4G
numa=fake=

For testing if the fake nodes respect the real node boundaries, I added
some debug printks in the node creation path. Without the patch, for the
commandline numa=fake=2G,4G,6G,8G,10G,12G,14G,16G, this is what I got:

fake id: 1 nid: 0
fake id: 1 nid: 0
...
fake id: 2 nid: 0
fake id: 2 nid: 0
...
fake id: 2 nid: 0
created new fake_node with id 3
fake id: 3 nid: 0
fake id: 3 nid: 0
...
fake id: 3 nid: 0
fake id: 3 nid: 0
fake id: 3 nid: 1
fake id: 3 nid: 1
...
created new fake_node with id 4
fake id: 4 nid: 1
fake id: 4 nid: 1
...

and so on. So, fake node 3 encompasses real node 0 & 1. Also,

# cat /sys/devices/system/node/node3/meminfo
Node 0 MemTotal:        2097152 kB
...
# # cat /sys/devices/system/node/node4/meminfo
Node 0 MemTotal:        2097152 kB
...


With the patch, I get:

fake id: 1 nid: 0
fake id: 1 nid: 0
...
fake id: 2 nid: 0
fake id: 2 nid: 0
...
fake id: 2 nid: 0
created new fake_node with id 3
fake id: 3 nid: 0
fake id: 3 nid: 0
...
fake id: 3 nid: 0
fake id: 3 nid: 0
created new fake_node with id 4
fake id: 4 nid: 1
fake id: 4 nid: 1
...

and so on. With the patch, the fake node sizes are slightly different
from that specified by the user.

# cat /sys/devices/system/node/node3/meminfo
Node 3 MemTotal:        1638400 kB
...
# cat /sys/devices/system/node/node4/meminfo
Node 4 MemTotal:         458752 kB
...

CPU association was tested as mentioned in the previous mail:

Without the patch,

# cat /proc/cmdline
root=/dev/sda6  numa=fake=2G,4G,,6G,8G,10G,12G,14G,16G
# cat /sys/devices/system/node/node0/cpulist
0-3
# cat /sys/devices/system/node/node1/cpulist
4-7
# cat /sys/devices/system/node/node4/cpulist

#

With the patch,

# cat /proc/cmdline
root=/dev/sda6  numa=fake=2G,4G,,6G,8G,10G,12G,14G,16G
# cat /sys/devices/system/node/node0/cpulist
0-3
# cat /sys/devices/system/node/node1/cpulist

# cat /sys/devices/system/node/node4/cpulist
4-7

Signed-off-by: Ankita Garg <ankita@in.ibm.com> 
Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com>

Index: linux-2.6.31-rc5/arch/powerpc/mm/numa.c
===================================================================
--- linux-2.6.31-rc5.orig/arch/powerpc/mm/numa.c
+++ linux-2.6.31-rc5/arch/powerpc/mm/numa.c
@@ -26,6 +26,13 @@
 #include <asm/smp.h>
 
 static int numa_enabled = 1;
+static int fake_enabled = 1;
+
+/*
+ * The array maps a real numa node to the first fake node that gets
+ * created on it
+ */
+int fake_numa_node_mapping[MAX_NUMNODES];
 
 static char *cmdline __initdata;
 
@@ -49,14 +56,29 @@ static int __cpuinit fake_numa_create_ne
 	unsigned long long mem;
 	char *p = cmdline;
 	static unsigned int fake_nid;
+	static unsigned int prev_nid = 0;
 	static unsigned long long curr_boundary;
 
 	/*
 	 * Modify node id, iff we started creating NUMA nodes
 	 * We want to continue from where we left of the last time
 	 */
-	if (fake_nid)
+	if (fake_nid) {
+		/*
+		 * Moved over to the next real numa node, increment fake
+		 * node number and store the mapping of the real node to
+		 * the fake node
+		 */
+		if (prev_nid != *nid) {
+			fake_nid++;
+			fake_numa_node_mapping[*nid] = fake_nid;
+			prev_nid = *nid;
+			*nid = fake_nid;
+			return 0;
+		}
 		*nid = fake_nid;
+	}
+
 	/*
 	 * In case there are no more arguments to parse, the
 	 * node_id should be the same as the last fake node id
@@ -440,7 +462,7 @@ static int of_drconf_to_nid_single(struc
  */
 static int __cpuinit numa_setup_cpu(unsigned long lcpu)
 {
-	int nid = 0;
+	int nid = 0, new_nid;
 	struct device_node *cpu = of_get_cpu_node(lcpu, NULL);
 
 	if (!cpu) {
@@ -450,8 +472,15 @@ static int __cpuinit numa_setup_cpu(unsi
 
 	nid = of_node_to_nid_single(cpu);
 
+	if (fake_enabled && nid) {
+		new_nid = fake_numa_node_mapping[nid];
+		if (new_nid > 0)
+			nid = new_nid;
+	}
+
 	if (nid < 0 || !node_online(nid))
 		nid = any_online_node(NODE_MASK_ALL);
+
 out:
 	map_cpu_to_node(lcpu, nid);
 
@@ -1005,8 +1034,12 @@ static int __init early_numa(char *p)
 		numa_debug = 1;
 
 	p = strstr(p, "fake=");
-	if (p)
+	if (p) {
 		cmdline = p + strlen("fake=");
+		if (numa_enabled) {
+			fake_enabled = 1;
+		}
+	}
 
 	return 0;
 }


-- 
Regards,
Ankita Garg (ankita@in.ibm.com)
Linux Technology Center
IBM India Systems & Technology Labs, 
Bangalore, India   

^ permalink raw reply

* Re: [PATCH] Fix fake numa on ppc
From: David Rientjes @ 2009-09-02  6:03 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linuxppc-dev, Ankita Garg, linux-kernel, Balbir Singh
In-Reply-To: <1251870823.14675.420.camel@pasglop>

On Wed, 2 Sep 2009, Benjamin Herrenschmidt wrote:

> Since I'm pretty sure there could be CPU less nodes just like there
> could be memory-less nodes, it would be good if fake numa could
> simulate them too :-)
> 

You don't want to simulate cpu less nodes since they do have affinity to 
ranges of memory, you want to map each fake node to a cpumask including 
all cpus with affinity to its memory, map each cpu to one fake node (with 
memory) that it has physical affinity to, and then give all fake nodes 
local NUMA distance to those on the same physical node.  Memoryless nodes 
take care of themselves since they rely purely on node_distance(), so the 
index into the slit for all fake nodes to those without memory will be the 
same.

^ permalink raw reply

* Re: [PATCH 1/1] powerpc: Fix to handle slb resize across migration
From: Benjamin Herrenschmidt @ 2009-09-02  6:21 UTC (permalink / raw)
  To: Brian King; +Cc: linuxppc-dev
In-Reply-To: <200908282206.n7SM6UK1011114@d03av01.boulder.ibm.com>

On Fri, 2009-08-28 at 17:06 -0500, Brian King wrote:
> The SLB can change sizes across a live migration, which was not
> being handled, resulting in possible machine crashes during
> migration if migrating to a machine which has a smaller max SLB
> size than the source machine. Fix this by first reducing the
> SLB size to the minimum possible value, which is 32, prior to
> migration. Then during the device tree update which occurs after
> migration, we make the call to ensure the SLB gets updated. Also
> add the slb_size to the lparcfg output so that the migration
> tools can check to make sure the kernel has this capability
> before allowing migration in scenarios where the SLB size will change.

The patch causes a build error on 32-bit hash in rtas.c due to this:

> diff -puN arch/powerpc/kernel/rtas.c~powerpc_slb_resize arch/powerpc/kernel/rtas.c
> --- linux-2.6/arch/powerpc/kernel/rtas.c~powerpc_slb_resize	2009-08-21 16:14:41.000000000 -0500
> +++ linux-2.6-bjking1/arch/powerpc/kernel/rtas.c	2009-08-21 16:14:41.000000000 -0500
> @@ -39,6 +39,7 @@
>  #include <asm/smp.h>
>  #include <asm/atomic.h>
>  #include <asm/time.h>
> +#include <asm/mmu-hash64.h>

This should just be asm/mmu.h

This is true of all occurences, ie, mmu-hash64.h isn't meant to be
directly included (though it causes no breakage in the other cases)

I'm going to commit a fixed version to powerpc-next today, you may want
to update the version you are sending to distros.

Cheers,
Ben.

^ permalink raw reply

* why do we need reloc_offset ??
From: HongWoo Lee @ 2009-09-02  6:33 UTC (permalink / raw)
  To: linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 1191 bytes --]

Hi everyone~

In ther linux kernel code, I found the reloc_offset.

{{{
// file : misc.S
/* Returns (address we are running at) - (address we were linked at)
 * for use before the text and data are mapped to KERNELBASE.
 */
_GLOBAL(reloc_offset)
}}}

I couldn't understand the comment saying "Returns (address we are running
at) - (address we were linked at)".
For now, I'm studying each instruction.

And below is best comment I can explain for each instruction.

_GLOBAL(reloc_offset)
        mflr    r0                // move from link register, save the
return address
        bl      1f                 // bl 1f
1:     mflr    r3                // move from link register, r3 is just
return address pointing itself
        LOAD_REG_IMMEDIATE(r4,1b)    // get the 1b address, r4 is the
address
        subf    r3,r4,r3        // r3 = r3 – r4
        mtlr    r0                // restore return address
        blr

After this, I still don't know why "r3-r4" is the offset.
And what does it mean ??

Please explain to me the reason why we need reloc_offset and the code.
Or let me know any helpful reference about this.

Thank in advance.

HongWoo.

[-- Attachment #2: Type: text/html, Size: 1324 bytes --]

^ permalink raw reply

* Re: [PATCH v2] Fix fake numa on ppc
From: David Rientjes @ 2009-09-02  6:37 UTC (permalink / raw)
  To: Ankita Garg; +Cc: linuxppc-dev, LKML
In-Reply-To: <20090902060911.GA5728@in.ibm.com>

On Wed, 2 Sep 2009, Ankita Garg wrote:

> Hi,
> 
> Below is a patch to fix a couple of issues with fake numa node creation
> on ppc:
> 
> 1) Presently, fake nodes could be created such that real numa node
> boundaries are not respected. So a node could have lmbs that belong to
> different real nodes.
> 

On x86_64, we can use numa=off to completely disable NUMA so that all 
memory and all cpus are mapped to a single node 0.  That's an extreme 
example of the above and is totally permissible.

> 2) The cpu association is broken. On a JS22 blade for example, which is
> a 2-node numa machine, I get the following:
> 
> # cat /proc/cmdline
> root=/dev/sda6  numa=fake=2G,4G,,6G,8G,10G,12G,14G,16G
> # cat /sys/devices/system/node/node0/cpulist
> 0-3
> # cat /sys/devices/system/node/node1/cpulist
> 4-7
> # cat /sys/devices/system/node/node4/cpulist
> 
> #
> 

This doesn't show what the true NUMA topology of the machine is, could you 
please post the output of

	$ cat /sys/devices/system/node/node*/cpulist
	$ cat /sys/devices/system/node/node*/distance
	$ ls -d /sys/devices/system/node/node*/cpu[0-8]

from a normal boot without any numa=fake?

> So, though the cpus 4-7 should have been associated with node4, they
> still belong to node1. The patch works by recording a real numa node
> boundary and incrementing the fake node count. At the same time, a
> mapping is stored from the real numa node to the first fake node that
> gets created on it.
> 

If there are multiple fake nodes on a real physical node, all cpus in that 
node should appear in the cpulist for each fake node for which it has 
local distance.

^ permalink raw reply

* Re: [PATCH v2] Fix fake numa on ppc
From: Ankita Garg @ 2009-09-02  8:03 UTC (permalink / raw)
  To: David Rientjes; +Cc: linuxppc-dev, LKML
In-Reply-To: <alpine.DEB.1.00.0909012326290.31814@chino.kir.corp.google.com>

Hi David,

On Tue, Sep 01, 2009 at 11:37:05PM -0700, David Rientjes wrote:
> On Wed, 2 Sep 2009, Ankita Garg wrote:
> 
> > Hi,
> > 
> > Below is a patch to fix a couple of issues with fake numa node creation
> > on ppc:
> > 
> > 1) Presently, fake nodes could be created such that real numa node
> > boundaries are not respected. So a node could have lmbs that belong to
> > different real nodes.
> > 
> 
> On x86_64, we can use numa=off to completely disable NUMA so that all 
> memory and all cpus are mapped to a single node 0.  That's an extreme 
> example of the above and is totally permissible.
> 
> > 2) The cpu association is broken. On a JS22 blade for example, which is
> > a 2-node numa machine, I get the following:
> > 
> > # cat /proc/cmdline
> > root=/dev/sda6  numa=fake=2G,4G,,6G,8G,10G,12G,14G,16G
> > # cat /sys/devices/system/node/node0/cpulist
> > 0-3
> > # cat /sys/devices/system/node/node1/cpulist
> > 4-7
> > # cat /sys/devices/system/node/node4/cpulist
> > 
> > #
> > 
> 
> This doesn't show what the true NUMA topology of the machine is, could you 
> please post the output of
> 
> 	$ cat /sys/devices/system/node/node*/cpulist
> 	$ cat /sys/devices/system/node/node*/distance
> 	$ ls -d /sys/devices/system/node/node*/cpu[0-8]
> 
> from a normal boot without any numa=fake?
>

Heres the output as requested by you:

# ls /sys/devices/system/node/
has_cpu  has_normal_memory  node0  node1  online  possible
# cat /sys/devices/system/node/node*/cpulist
0-3
4-7
# cat /sys/devices/system/node/node*/distance
10 20
20 10
# ls -d /sys/devices/system/node/node*/cpu[0-8]
/sys/devices/system/node/node0/cpu0  /sys/devices/system/node/node0/cpu3
/sys/devices/system/node/node1/cpu6
/sys/devices/system/node/node0/cpu1  /sys/devices/system/node/node1/cpu4
/sys/devices/system/node/node1/cpu7
/sys/devices/system/node/node0/cpu2  /sys/devices/system/node/node1/cpu5

 
> > So, though the cpus 4-7 should have been associated with node4, they
> > still belong to node1. The patch works by recording a real numa node
> > boundary and incrementing the fake node count. At the same time, a
> > mapping is stored from the real numa node to the first fake node that
> > gets created on it.
> > 
> 
> If there are multiple fake nodes on a real physical node, all cpus in that 
> node should appear in the cpulist for each fake node for which it has 
> local distance.

Currently, the behavior of fake numa is not so on x86 as well? Below is
a sample output from a single node x86 system booted with numa=fake=8:

# cat node0/cpulist

# cat node1/cpulist

...
# cat node6/cpulist

# cat node7/cpulist
0-7

Presently, just fixing the cpu association issue with ppc, as explained
in my previous mail.

-- 
Regards,
Ankita Garg (ankita@in.ibm.com)
Linux Technology Center
IBM India Systems & Technology Labs, 
Bangalore, India   

^ permalink raw reply

* Re: [PATCH] Fix fake numa on ppc
From: Benjamin Herrenschmidt @ 2009-09-02  8:51 UTC (permalink / raw)
  To: David Rientjes; +Cc: linuxppc-dev, Ankita Garg, linux-kernel, Balbir Singh
In-Reply-To: <alpine.DEB.1.00.0909012259050.26930@chino.kir.corp.google.com>

On Tue, 2009-09-01 at 23:03 -0700, David Rientjes wrote:
> On Wed, 2 Sep 2009, Benjamin Herrenschmidt wrote:
> 
> > Since I'm pretty sure there could be CPU less nodes just like there
> > could be memory-less nodes, it would be good if fake numa could
> > simulate them too :-)
> > 
> 
> You don't want to simulate cpu less nodes since they do have affinity to 
> ranges of memory, you want to map each fake node to a cpumask including 
> all cpus with affinity to its memory, map each cpu to one fake node (with 
> memory) that it has physical affinity to, and then give all fake nodes 
> local NUMA distance to those on the same physical node.  Memoryless nodes 
> take care of themselves since they rely purely on node_distance(), so the 
> index into the slit for all fake nodes to those without memory will be the 
> same.

Ok, makes sense, thanks.

Cheers,
Ben.

^ permalink raw reply

* mpc512x USB non-working after kexec
From: Sebastian Andrzej Siewior @ 2009-09-02  9:44 UTC (permalink / raw)
  To: John Rigby; +Cc: linuxppc-dev

Hi John,

on my mpc512x custom board I have non-working USB after the kexec
syscall. I've traced it down to the ehci_turn_off_all_ports() which is
called from within ehci_shutdown(). If I omit ehci_turn_off_all_ports()
it works fine. The content of the portstatus register is the same after
the setup prior the usb-stack takes over so I don't see any difference
here.
Do you see such a behaviour on your board? Unfortunately I don't have
any other "official"/FSL board where I could cross-check and say it is
the HW.

Sebastian

^ permalink raw reply

* [RFC] net/fs_enet: send a reset request to the PHY on init
From: Sebastian Andrzej Siewior @ 2009-09-02 11:04 UTC (permalink / raw)
  To: Pantelis Antoniou, Vitaly Bordug; +Cc: linuxppc-dev, netdev

Usually u-boot sends a phy request in its network init routine. An uboot
without network support doesn't do it and I endup without working
network. I still can switch between 10/100Mbit (according to the LED on
the hub and phy registers) but I can't send or receive any data.

At this point I'm not sure if the PowerON Reset takes the PHY a few
nsecs too early out of reset or if this reset is required and everyone
relies on U-boot performing this reset.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
This is done on a custom mpc512x board. Unfortunately I don't have other
boards to check. The PHY is a AMD Am79C874, phylib uses the generic one.

 drivers/net/fs_enet/fs_enet-main.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/net/fs_enet/fs_enet-main.c b/drivers/net/fs_enet/fs_enet-main.c
index ee15402..a3c962b 100644
--- a/drivers/net/fs_enet/fs_enet-main.c
+++ b/drivers/net/fs_enet/fs_enet-main.c
@@ -823,7 +823,8 @@ static int fs_init_phy(struct net_device *dev)
 	}
 
 	fep->phydev = phydev;
-
+	phy_write(phydev, MII_BMCR, BMCR_RESET);
+	udelay(1);
 	return 0;
 }
 
-- 
1.6.4.GIT

^ permalink raw reply related

* Re: [RFC][POWERPC] WDT: added support for the WDT Chain driver.
From: Detlev Zundel @ 2009-09-02 11:32 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: linux-kernel
In-Reply-To: <20090902004928.132974c9@vitb-lp>

Hello Vitaly,

> From: Heiko Schocher <hs@denx.de>
>
> [POWERPC] WDT: added support for the WDT Chain driver.
>
>     This new driver implements a character device with major number 10
>     and minor number 130.  It is a software abstraction of the hardware
>     watchdog with two different APIs.  While the driver periodically
>     triggers the hardware watchdog, the software can setup independent
>     timeout periods.
>
>     More info in Documentation/watchdog/wdt_chain.txt
>
>     Signed-off-by: Heiko Schocher <hs@denx.de>
>     Signed-off-by: Vitaly Bordug <vitb@kernel.crashing.org>
> ---
> This code was (and is) originally residing in DENX public git repo. I
> think it would be useful upstream, to prevent reinventing the same
> thing. 

Thanks a lot for taking an interest in this piece of code.  I would
suggest however that you coordinate with Heiko as he has worked some
more on this driver in the meantime.  This new code however is not in
our repository.  We should start the RFC discussion with current code.

For the casual reader this means, don't invest time into reviewing this
but wait for a cleaned up version instead.

Thanks
  Detlev

--
DENX Software Engineering GmbH,      MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich,  Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-40 Fax: (+49)-8142-66989-80 Email: dzu@denx.de

^ permalink raw reply

* RE: [PATCH] [V3] net: add Xilinx emac lite device driver
From: John Linn @ 2009-09-02 13:33 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Michal Simek, netdev, Sadanand Mutyala, linuxppc-dev, jgarzik,
	davem, John Williams
In-Reply-To: <20090820094914.46f1db9c@nehalam>

> -----Original Message-----
> From: Stephen Hemminger [mailto:shemminger@vyatta.com]
> Sent: Thursday, August 20, 2009 5:49 PM
> To: John Linn
> Cc: netdev@vger.kernel.org; linuxppc-dev@ozlabs.org;
davem@davemloft.net; jgarzik@pobox.com; John
> Linn; grant.likely@secretlab.ca; Josh Boyer; John Williams; Michal
Simek; Sadanand Mutyala
> Subject: Re: [PATCH] [V3] net: add Xilinx emac lite device driver
> =

> On Thu, 20 Aug 2009 03:49:51 -0600
> John Linn <john.linn@xilinx.com> wrote:
> =

> > +/**
> > + * xemaclite_ioctl - Perform IO Control operations on the network
device
> > + * @dev:	Pointer to the network device
> > + * @rq:		Pointer to the interface request structure
> > + * @cmd:	IOCTL command
> > + *
> > + * The only IOCTL operation supported by this function is setting
the MAC
> > + * address. An error is reported if any other operations are
requested.
> > + *
> > + * Return:	0 to indicate success, or a negative error for failure.
> > + */
> > +static int xemaclite_ioctl(struct net_device *dev, struct ifreq
*rq, int cmd)
> > +{
> > +	struct net_local *lp =3D (struct net_local *) netdev_priv(dev);
> > +	struct hw_addr_data *hw_addr =3D (struct hw_addr_data *)
&rq->ifr_hwaddr;
> > +
> > +	switch (cmd) {
> > +	case SIOCETHTOOL:
> > +		return -EIO;
> > +
> > +	case SIOCSIFHWADDR:
> > +		dev_err(&lp->ndev->dev, "SIOCSIFHWADDR\n");
> > +
> > +		/* Copy MAC address in from user space */
> > +		copy_from_user((void __force *) dev->dev_addr,
> > +			       (void __user __force *) hw_addr,
> > +			       IFHWADDRLEN);
> > +		xemaclite_set_mac_address(lp, dev->dev_addr);
> > +		break;
> > +	default:
> > +		return -EOPNOTSUPP;
> > +	}
> > +
> > +	return 0;
> > +}
> =

> Do you really need this? I doubt the SIOCSIFHWADDR even reaches
driver!
> =

> The normal call path for setting hardware address is:
>   dev_ifsioc
>          dev_set_mac_address
>              ops->ndo_set_mac_address -->
> =

> The driver should be:
>    1. defining new code to do ndo_set_mac_address
>    2. remove existing xmaclite_ioctl - all ioctl's handled by upper
layers
> =

> FYI - the only ioctl's that make it to network device ndo_ioctl
>  are listed in dev_ifsioc
>    SIOCDEVPRIVATE ... SIOCDEVPRIVATE + 15
>    SIOCBOND*
>    SIOCMII*
>    SIOCBR*
>    SIOCHWTSTAMP
>    SIOCWANDEV
> =

> =


We agree and will be updating the driver for this change. Sorry about
the delay as I was on vacation.

Thanks,
John


This email and any attachments are intended for the sole use of the named r=
ecipient(s) and contain(s) confidential information that may be proprietary=
, privileged or copyrighted under applicable law. If you are not the intend=
ed recipient, do not read, copy, or forward this email message or any attac=
hments. Delete this email message and any attachments immediately.

^ permalink raw reply

* Re: [PATCH v2] Fix fake numa on ppc
From: David Rientjes @ 2009-09-02 19:36 UTC (permalink / raw)
  To: Ankita Garg; +Cc: linuxppc-dev, LKML
In-Reply-To: <20090902080346.GB3806@in.ibm.com>

On Wed, 2 Sep 2009, Ankita Garg wrote:

> Currently, the behavior of fake numa is not so on x86 as well? Below is
> a sample output from a single node x86 system booted with numa=fake=8:
> 
> # cat node0/cpulist
> 
> # cat node1/cpulist
> 
> ...
> # cat node6/cpulist
> 
> # cat node7/cpulist
> 0-7
> 
> Presently, just fixing the cpu association issue with ppc, as explained
> in my previous mail.
> 

Right, I'm proposing an alternate mapping scheme (which we've used for 
years) for both platforms such that a cpu is bound (and is set in 
cpumask_of_node()) to each fake node with which it has physical affinity.  
That is the only way for zonelist ordering in node order, task migration 
from offlined cpus, correct sched domains, etc.  I can propose a patchset 
for x86_64 to do exactly this if there aren't any objections and I hope 
you'll help do ppc.

^ permalink raw reply

* Re: [PATCH v2 0/2] cpu: pseries: Offline state framework.
From: Pavel Machek @ 2009-09-02 20:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Gautham R Shenoy, Venkatesh Pallipadi, linux-kernel, linuxppc-dev,
	Darrick J. Wong
In-Reply-To: <1251869611.7547.38.camel@twins>

On Wed 2009-09-02 07:33:31, Peter Zijlstra wrote:
> On Fri, 2009-08-28 at 15:30 +0530, Gautham R Shenoy wrote:
> > Hi,
> > 
> > This is the version 2 of the patch series to provide a cpu-offline framework
> > that enables the administrators choose the state the offline CPU must be put
> > into when multiple such states are exposed by the underlying architecture.
> > 
> > Version 1 of the Patch can be found here:
> > http://lkml.org/lkml/2009/8/6/236
> > 
> > The patch-series exposes the following sysfs tunables to
> > allow the system-adminstrator to choose the state of a CPU:
> > 
> > To query the available hotplug states, one needs to read the sysfs tunable:
> > 	/sys/devices/system/cpu/cpu<number>/available_hotplug_states
> > To query or set the current state, on needs to read/write the sysfs tunable:
> > 	/sys/devices/system/cpu/cpu<number>/current_states
> > 
> > The patchset ensures that the writes to the "current_state" sysfs file are
> > serialized against the writes to the "online" file.
> > 
> > This patchset also contains the offline state driver implemented for
> > pSeries. For pSeries, we define three available_hotplug_states. They are:
> > 
> > 	online: The processor is online.
> > 
> > 	deallocate: This is the the default behaviour when the cpu is offlined
> > 	even in the absense of this driver. The CPU would call make an
> > 	rtas_stop_self() call and hand over the CPU back to the resource pool,
> > 	thereby effectively deallocating that vCPU from the LPAR.
> > 	NOTE: This would result in a configuration change to the LPAR
> > 	which is visible to the outside world.
> > 
> > 	deactivate: This cedes the vCPU to the hypervisor which
> > 	in turn can put the vCPU time to the best use.
> > 	NOTE: This option DOES NOT result in a configuration change
> > 	and the vCPU would be still entitled to the LPAR to which it earlier
> > 	belong to.
> > 
> > Awaiting your feedback.
> 
> I'm still thinking this is a bad idea.
> 
> The OS should only know about online/offline.
> 
> Use the hypervisor interface to deal with the cpu once its offline.
> 
> That is, I think this interface you propose is a layering violation.

Agreed. Plus having interface like 'go to this state during offliine'
then 'go offline' is strange/stupid. For hypervisor case, you might
want to change 'state' of cpu that is already offline.
								Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply

* Re: [PATCH v2] Fix fake numa on ppc
From: Balbir Singh @ 2009-09-02 19:56 UTC (permalink / raw)
  To: David Rientjes; +Cc: linuxppc-dev, Ankita Garg, LKML
In-Reply-To: <alpine.DEB.1.00.0909021226160.10279@chino.kir.corp.google.com>

On Thu, Sep 3, 2009 at 1:06 AM, David Rientjes<rientjes@google.com> wrote:
> On Wed, 2 Sep 2009, Ankita Garg wrote:
>
>> Currently, the behavior of fake numa is not so on x86 as well? Below is
>> a sample output from a single node x86 system booted with numa=3Dfake=3D=
8:
>>
>> # cat node0/cpulist
>>
>> # cat node1/cpulist
>>
>> ...
>> # cat node6/cpulist
>>
>> # cat node7/cpulist
>> 0-7
>>
>> Presently, just fixing the cpu association issue with ppc, as explained
>> in my previous mail.
>>
>
> Right, I'm proposing an alternate mapping scheme (which we've used for
> years) for both platforms such that a cpu is bound (and is set in
> cpumask_of_node()) to each fake node with which it has physical affinity.
> That is the only way for zonelist ordering in node order, task migration
> from offlined cpus, correct sched domains, etc. =A0I can propose a patchs=
et
> for x86_64 to do exactly this if there aren't any objections and I hope
> you'll help do ppc.

Sounds interesting, I'd definitely be interested in seeing your
proposal, but I would think of that as additional development on top
of this patch

Balbir Singh.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox