netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 1/5] try parent numa_node at first before using default
       [not found] <200707101641.17672.yinghai.lu@sun.com>
@ 2007-07-10 23:52 ` Yinghai Lu
  2007-07-11 10:54   ` Stefan Richter
  2007-07-10 23:52 ` [PATCH 4/5] net: show numa_node for net_device in /sys Yinghai Lu
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 22+ messages in thread
From: Yinghai Lu @ 2007-07-10 23:52 UTC (permalink / raw)
  To: Andrew Morton, Andi Kleen, Greg KH, rientjes, Christoph Lameter,
	Christoph Hellwig, David Miller, Stefan Richter
  Cc: Linux Kernel Mailing List, netdev

[PATCH 1/5] try parent numa_node at first before using default

For pci_device, pcibios_scan_root and pci_scan_root will call pci_device_add.
pci_device_add will call device_initialize and set_dev_node(&dev->dev,
pcibus_to_node(bus)).
other device such as netdev, and usb_device, set_dev_node is never be
used. So that field numa_node always is -1.

So for netdev, it will need to use dev->parent to get pci_device to
use it's numa_node. esp in netdev_alloc_skb()
not sure how other device such as infiniband do that.

Actually before patch
[PATCH 1/2] x86_64: get mp_bus_to_node as early
there is a bug about squence of bus->sysdata and using pcibus_to_node.
the numa_node of pci_dev->dev is never set correctly...always 0.

So some device have to use pcibus_to_node(to_pci_dev(dev)->bus) directly
such as dma_alloc_pages in arch/x86_64/kernel/pci-dma.c.
or hwif_to_node in include/linux/ide.h

with this patch, we could use device->numa_node direclty for all device.

Signed-off-by: Yinghai Lu <yinghai.lu@sun.com>

diff --git a/drivers/base/core.c b/drivers/base/core.c
index dd40d78..c344d82 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -547,6 +547,8 @@ static void klist_children_put(struct klist_node *n)
 
 void device_initialize(struct device *dev)
 {
+	int node;
+
 	kobj_set_kset_s(dev, devices_subsys);
 	kobject_init(&dev->kobj);
 	klist_init(&dev->klist_children, klist_children_get,
@@ -557,7 +559,9 @@ void device_initialize(struct device *dev)
 	spin_lock_init(&dev->devres_lock);
 	INIT_LIST_HEAD(&dev->devres_head);
 	device_init_wakeup(dev, 0);
-	set_dev_node(dev, -1);
+
+	node = dev->parent ? dev_to_node(dev->parent) : -1;
+	set_dev_node(dev, node);
 }
 
 #ifdef CONFIG_SYSFS_DEPRECATED

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 4/5] net: show numa_node for net_device  in /sys
       [not found] <200707101641.17672.yinghai.lu@sun.com>
  2007-07-10 23:52 ` [PATCH 1/5] try parent numa_node at first before using default Yinghai Lu
@ 2007-07-10 23:52 ` Yinghai Lu
  2007-07-10 23:53 ` [PATCH 5/5] dma: use dev_to_node to get node for device in dma_alloc_pages Yinghai Lu
  2007-07-11  0:05 ` [PATCH 2/5] net: use numa_node in net_devcice->dev instead of parent Yinghai Lu
  3 siblings, 0 replies; 22+ messages in thread
From: Yinghai Lu @ 2007-07-10 23:52 UTC (permalink / raw)
  To: Andrew Morton, Andi Kleen, Greg KH, rientjes, Christoph Lameter,
	Christoph Hellwig, David Miller, Stefan Richter
  Cc: Linux Kernel Mailing List, netdev

[PATCH 4/5] net: show numa_node for net_device  in /sys

Signed-off-by: Yinghai Lu <yinghai.lu@sun.com>

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 5c19b06..45b87a5 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -230,6 +230,14 @@ static ssize_t store_weight(struct device *dev, struct device_attribute *attr,
 	return netdev_store(dev, attr, buf, len, change_weight);
 }
 
+#ifdef CONFIG_NUMA
+static ssize_t
+numa_node_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+        return sprintf (buf, "%d\n", dev->numa_node);
+}
+#endif
+
 static struct device_attribute net_class_attributes[] = {
 	__ATTR(addr_len, S_IRUGO, show_addr_len, NULL),
 	__ATTR(iflink, S_IRUGO, show_iflink, NULL),
@@ -247,6 +255,9 @@ static struct device_attribute net_class_attributes[] = {
 	__ATTR(tx_queue_len, S_IRUGO | S_IWUSR, show_tx_queue_len,
 	       store_tx_queue_len),
 	__ATTR(weight, S_IRUGO | S_IWUSR, show_weight, store_weight),
+#ifdef CONFIG_NUMA
+        __ATTR_RO(numa_node),
+#endif
 	{}
 };
 

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 5/5] dma: use dev_to_node to get node for device in dma_alloc_pages
       [not found] <200707101641.17672.yinghai.lu@sun.com>
  2007-07-10 23:52 ` [PATCH 1/5] try parent numa_node at first before using default Yinghai Lu
  2007-07-10 23:52 ` [PATCH 4/5] net: show numa_node for net_device in /sys Yinghai Lu
@ 2007-07-10 23:53 ` Yinghai Lu
  2007-07-23 19:30   ` Christoph Lameter
  2007-07-11  0:05 ` [PATCH 2/5] net: use numa_node in net_devcice->dev instead of parent Yinghai Lu
  3 siblings, 1 reply; 22+ messages in thread
From: Yinghai Lu @ 2007-07-10 23:53 UTC (permalink / raw)
  To: Andrew Morton, Andi Kleen, Greg KH, rientjes, Christoph Lameter,
	Christoph Hellwig, David Miller, Stefan Richter
  Cc: Linux Kernel Mailing List, netdev

[PATCH 5/5] dma: use dev_to_node to get node for device in dma_alloc_pages

Signed-off-by: Yinghai Lu <yinghai.lu@sun.com>

diff --git a/arch/x86_64/kernel/pci-dma.c b/arch/x86_64/kernel/pci-dma.c
index 9f80aad..6dbf1c9 100644
--- a/arch/x86_64/kernel/pci-dma.c
+++ b/arch/x86_64/kernel/pci-dma.c
@@ -52,11 +52,9 @@ dma_alloc_pages(struct device *dev, gfp_t gfp, unsigned order)
 {
 	struct page *page;
 	int node;
-#ifdef CONFIG_PCI
-	if (dev->bus == &pci_bus_type)
-		node = pcibus_to_node(to_pci_dev(dev)->bus);
-	else
-#endif
+
+	node = dev_to_node(dev);
+	if (node == -1)
 		node = numa_node_id();
 
 	if (node < first_node(node_online_map))

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 2/5] net: use numa_node in net_devcice->dev instead of parent
       [not found] <200707101641.17672.yinghai.lu@sun.com>
                   ` (2 preceding siblings ...)
  2007-07-10 23:53 ` [PATCH 5/5] dma: use dev_to_node to get node for device in dma_alloc_pages Yinghai Lu
@ 2007-07-11  0:05 ` Yinghai Lu
  3 siblings, 0 replies; 22+ messages in thread
From: Yinghai Lu @ 2007-07-11  0:05 UTC (permalink / raw)
  To: Andrew Morton, Andi Kleen, Greg KH, rientjes, Christoph Lameter,
	Christoph Hellwig, David Miller, Stefan Richter
  Cc: Linux Kernel Mailing List, netdev

[PATCH 2/5] net: use numa_node in net_devcice->dev instead of parent

Signed-off-by: Yinghai Lu <yinghai.lu@sun.com>
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 27cfe5f..005cc1c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -217,7 +217,7 @@ nodata:
 struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
 		unsigned int length, gfp_t gfp_mask)
 {
-	int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
+	int node = dev_to_node(&dev->dev);
 	struct sk_buff *skb;
 
 	skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, node);

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/5] try parent numa_node at first before using default
  2007-07-10 23:52 ` [PATCH 1/5] try parent numa_node at first before using default Yinghai Lu
@ 2007-07-11 10:54   ` Stefan Richter
  2007-07-11 11:03     ` Stefan Richter
  2007-07-11 21:08     ` Greg KH
  0 siblings, 2 replies; 22+ messages in thread
From: Stefan Richter @ 2007-07-11 10:54 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Andrew Morton, Andi Kleen, Greg KH, rientjes, Christoph Lameter,
	Christoph Hellwig, David Miller, Linux Kernel Mailing List,
	netdev

Yinghai Lu wrote:
> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -547,6 +547,8 @@ static void klist_children_put(struct klist_node *n)
>  
>  void device_initialize(struct device *dev)
>  {
> +	int node;
> +
>  	kobj_set_kset_s(dev, devices_subsys);
>  	kobject_init(&dev->kobj);
>  	klist_init(&dev->klist_children, klist_children_get,
> @@ -557,7 +559,9 @@ void device_initialize(struct device *dev)
>  	spin_lock_init(&dev->devres_lock);
>  	INIT_LIST_HEAD(&dev->devres_head);
>  	device_init_wakeup(dev, 0);
> -	set_dev_node(dev, -1);
> +
> +	node = dev->parent ? dev_to_node(dev->parent) : -1;
> +	set_dev_node(dev, node);
>  }

Two remarks:

  - device_add() is perhaps a better place to do this.  Otherwise you
    had to change code like drivers/input/gameport/gameport.c::
    gameport_init_port() which sets the parent device *after* the
    call to device_initialize().

  - device_move() should update the device.node.
-- 
Stefan Richter
-=====-=-=== -=== -=-==
http://arcgraph.de/sr/

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/5] try parent numa_node at first before using default
  2007-07-11 10:54   ` Stefan Richter
@ 2007-07-11 11:03     ` Stefan Richter
  2007-07-11 21:08     ` Greg KH
  1 sibling, 0 replies; 22+ messages in thread
From: Stefan Richter @ 2007-07-11 11:03 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Andrew Morton, Andi Kleen, Greg KH, rientjes, Christoph Lameter,
	Christoph Hellwig, David Miller, Linux Kernel Mailing List,
	netdev

I wrote:
>   - device_move() should update the device.node.

device.numa_node of course.
-- 
Stefan Richter
-=====-=-=== -=== -=-==
http://arcgraph.de/sr/

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/5] try parent numa_node at first before using default
  2007-07-11 10:54   ` Stefan Richter
  2007-07-11 11:03     ` Stefan Richter
@ 2007-07-11 21:08     ` Greg KH
  2007-07-11 21:28       ` Yinghai Lu
  1 sibling, 1 reply; 22+ messages in thread
From: Greg KH @ 2007-07-11 21:08 UTC (permalink / raw)
  To: Stefan Richter
  Cc: Yinghai Lu, Andrew Morton, Andi Kleen, rientjes,
	Christoph Lameter, Christoph Hellwig, David Miller,
	Linux Kernel Mailing List, netdev

On Wed, Jul 11, 2007 at 12:54:58PM +0200, Stefan Richter wrote:
> Yinghai Lu wrote:
> > --- a/drivers/base/core.c
> > +++ b/drivers/base/core.c
> > @@ -547,6 +547,8 @@ static void klist_children_put(struct klist_node *n)
> >  
> >  void device_initialize(struct device *dev)
> >  {
> > +	int node;
> > +
> >  	kobj_set_kset_s(dev, devices_subsys);
> >  	kobject_init(&dev->kobj);
> >  	klist_init(&dev->klist_children, klist_children_get,
> > @@ -557,7 +559,9 @@ void device_initialize(struct device *dev)
> >  	spin_lock_init(&dev->devres_lock);
> >  	INIT_LIST_HEAD(&dev->devres_head);
> >  	device_init_wakeup(dev, 0);
> > -	set_dev_node(dev, -1);
> > +
> > +	node = dev->parent ? dev_to_node(dev->parent) : -1;
> > +	set_dev_node(dev, node);
> >  }
> 
> Two remarks:
> 
>   - device_add() is perhaps a better place to do this.  Otherwise you
>     had to change code like drivers/input/gameport/gameport.c::
>     gameport_init_port() which sets the parent device *after* the
>     call to device_initialize().

I agree, lots of code sets up the parent pointer after initialize and
before add.  One such example is the whole USB subsystem.

Which makes me wonder how this code was really tested at all to show
that it actually had an affect...

So consider this patch series rejected from my side for now.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/5] try parent numa_node at first before using default
  2007-07-11 21:08     ` Greg KH
@ 2007-07-11 21:28       ` Yinghai Lu
  2007-07-12  2:47         ` Stefan Richter
  0 siblings, 1 reply; 22+ messages in thread
From: Yinghai Lu @ 2007-07-11 21:28 UTC (permalink / raw)
  To: Greg KH
  Cc: Stefan Richter, Andrew Morton, Andi Kleen, rientjes,
	Christoph Lameter, Christoph Hellwig, David Miller,
	Linux Kernel Mailing List, netdev

Greg KH wrote:
> On Wed, Jul 11, 2007 at 12:54:58PM +0200, Stefan Richter wrote:
>> Yinghai Lu wrote:
>>> --- a/drivers/base/core.c
>>> +++ b/drivers/base/core.c
>>> @@ -547,6 +547,8 @@ static void klist_children_put(struct klist_node *n)
>>>  
>>>  void device_initialize(struct device *dev)
>>>  {
>>> +	int node;
>>> +
>>>  	kobj_set_kset_s(dev, devices_subsys);
>>>  	kobject_init(&dev->kobj);
>>>  	klist_init(&dev->klist_children, klist_children_get,
>>> @@ -557,7 +559,9 @@ void device_initialize(struct device *dev)
>>>  	spin_lock_init(&dev->devres_lock);
>>>  	INIT_LIST_HEAD(&dev->devres_head);
>>>  	device_init_wakeup(dev, 0);
>>> -	set_dev_node(dev, -1);
>>> +
>>> +	node = dev->parent ? dev_to_node(dev->parent) : -1;
>>> +	set_dev_node(dev, node);
>>>  }
>> Two remarks:
>>
>>   - device_add() is perhaps a better place to do this.  Otherwise you
>>     had to change code like drivers/input/gameport/gameport.c::
>>     gameport_init_port() which sets the parent device *after* the
>>     call to device_initialize().
if other device overwrite that, it is OK.
even for all pci_dev, pci_device_add will call set_dev_node(&dev->dev, pcibus_to_node(bus));
to overwrite it.
but for netdev under pci_dev, it will get node from pci_dev directly.

> 
> I agree, lots of code sets up the parent pointer after initialize and
> before add.  One such example is the whole USB subsystem.
> 
> Which makes me wonder how this code was really tested at all to show
> that it actually had an affect...

original default is -1, and this patch just try to use parent's node as default.

YH

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/5] try parent numa_node at first before using default
  2007-07-11 21:28       ` Yinghai Lu
@ 2007-07-12  2:47         ` Stefan Richter
  2007-07-12  3:01           ` Yinghai Lu
  0 siblings, 1 reply; 22+ messages in thread
From: Stefan Richter @ 2007-07-12  2:47 UTC (permalink / raw)
  To: Yinghai.Lu
  Cc: Greg KH, Andrew Morton, Andi Kleen, rientjes, Christoph Lameter,
	Christoph Hellwig, David Miller, Linux Kernel Mailing List,
	netdev

Yinghai Lu wrote:
> original default is -1, and this patch just try to use parent's node as
> default.

But in many cases, the patch does so at a time when the parent is not
yet known.
-- 
Stefan Richter
-=====-=-=== -=== -==--
http://arcgraph.de/sr/

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/5] try parent numa_node at first before using default
  2007-07-12  2:47         ` Stefan Richter
@ 2007-07-12  3:01           ` Yinghai Lu
  2007-07-12  5:47             ` Stefan Richter
  0 siblings, 1 reply; 22+ messages in thread
From: Yinghai Lu @ 2007-07-12  3:01 UTC (permalink / raw)
  To: Stefan Richter
  Cc: Greg KH, Andrew Morton, Andi Kleen, rientjes, Christoph Lameter,
	Christoph Hellwig, David Miller, Linux Kernel Mailing List,
	netdev

Stefan Richter wrote:
> Yinghai Lu wrote:
>> original default is -1, and this patch just try to use parent's node as
>> default.
> 
> But in many cases, the patch does so at a time when the parent is not
> yet known.
then it will use -1.

YH

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/5] try parent numa_node at first before using default
  2007-07-12  3:01           ` Yinghai Lu
@ 2007-07-12  5:47             ` Stefan Richter
  2007-07-12  7:15               ` Cornelia Huck
  0 siblings, 1 reply; 22+ messages in thread
From: Stefan Richter @ 2007-07-12  5:47 UTC (permalink / raw)
  To: Yinghai.Lu
  Cc: Greg KH, Andrew Morton, Andi Kleen, rientjes, Christoph Lameter,
	Christoph Hellwig, David Miller, Linux Kernel Mailing List,
	netdev

Yinghai Lu wrote:
> Stefan Richter wrote:
>> Yinghai Lu wrote:
>>> original default is -1, and this patch just try to use parent's node as
>>> default.
>>
>> But in many cases, the patch does so at a time when the parent is not
>> yet known.
> then it will use -1.

Yes.

The patch does nothing for all subsystems which do

	device_initialize(&dev);
	dev->parent = pd;
	device_add(&dev);

Let's avoid to add infrastructure which does nothing, or only does
something by accident.

The alternatives are:

  - Change all subsystems to set dev->parent before device_initialize().
    *Document* that the device_initialize() API has this requirement.
    This is counter-intuitive, amounts to some work across the kernel,
    and could be gotten wrong again in future code because it's a
    counter-intuitive API.

  - Move your code from device_initialize() to device_add().  One minor
    drawback is that node-specific allocations based on the device's
    numa_node would not be optimized before device_add(), but there is
    probably no need for this.  Driver probes come after device_add().

  - Let subsystems explicitly call set_dev_node() on their own.

Also keep in mind that either device_move() should update the numa_node,
or the subsystems which call device_move() should explicitly update it
on their own.   (Unless they know that their devices will always stay at
the same NUMA node even when switching parents.)
-- 
Stefan Richter
-=====-=-=== -=== -==--
http://arcgraph.de/sr/

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/5] try parent numa_node at first before using default
  2007-07-12  5:47             ` Stefan Richter
@ 2007-07-12  7:15               ` Cornelia Huck
  2007-07-12 11:30                 ` Stefan Richter
  0 siblings, 1 reply; 22+ messages in thread
From: Cornelia Huck @ 2007-07-12  7:15 UTC (permalink / raw)
  To: Stefan Richter
  Cc: Yinghai.Lu, Greg KH, Andrew Morton, Andi Kleen, rientjes,
	Christoph Lameter, Christoph Hellwig, David Miller,
	Linux Kernel Mailing List, netdev

On Thu, 12 Jul 2007 07:47:52 +0200,
Stefan Richter <stefanr@s5r6.in-berlin.de> wrote:

> The patch does nothing for all subsystems which do
> 
> 	device_initialize(&dev);
> 	dev->parent = pd;
> 	device_add(&dev);
> 
> Let's avoid to add infrastructure which does nothing, or only does
> something by accident.

I agree. dev->parent now is only expected to be set in device_add(),
not in device_initialize(), so the sequence above is perfectly fine.
(Why should you need it, anyway? parent information becomes necessary
only when something is added to the tree.)
> 
> The alternatives are:
> 
>   - Change all subsystems to set dev->parent before device_initialize().
>     *Document* that the device_initialize() API has this requirement.
>     This is counter-intuitive, amounts to some work across the kernel,
>     and could be gotten wrong again in future code because it's a
>     counter-intuitive API.

Yes. We shouldn't do that.
> 
>   - Move your code from device_initialize() to device_add().  One minor
>     drawback is that node-specific allocations based on the device's
>     numa_node would not be optimized before device_add(), but there is
>     probably no need for this.  Driver probes come after device_add().

I'd expect most allocations to be done when probing, so this shouldn't
hurt much.
> 
>   - Let subsystems explicitly call set_dev_node() on their own.
> 
> 
> Also keep in mind that either device_move() should update the numa_node,
> or the subsystems which call device_move() should explicitly update it
> on their own.   (Unless they know that their devices will always stay at
> the same NUMA node even when switching parents.)

I'd trust the subsystems to know best whether something regarding NUMA
changed.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/5] try parent numa_node at first before using default
  2007-07-12  7:15               ` Cornelia Huck
@ 2007-07-12 11:30                 ` Stefan Richter
  2007-07-12 15:23                   ` Cornelia Huck
  0 siblings, 1 reply; 22+ messages in thread
From: Stefan Richter @ 2007-07-12 11:30 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Yinghai.Lu, Greg KH, Andrew Morton, Andi Kleen, rientjes,
	Christoph Lameter, Christoph Hellwig, David Miller,
	Linux Kernel Mailing List, netdev

Cornelia Huck wrote:
> On Thu, 12 Jul 2007 07:47:52 +0200,
> Stefan Richter <stefanr@s5r6.in-berlin.de> wrote:
>> Also keep in mind that either device_move() should update the numa_node,
>> or the subsystems which call device_move() should explicitly update it
>> on their own.   (Unless they know that their devices will always stay at
>> the same NUMA node even when switching parents.)
> 
> I'd trust the subsystems to know best whether something regarding NUMA
> changed.

Generally,

  - the *subsystems* know about the topology (parent devices, NUMA
    nodes).  The assumption that the child device belongs per default to
    the same NUMA node as its parent device is usually correct, but it
    really is only an assumption.

  - The *drivers* incorporate as much knowledge about topology from the
    subsystem as is necessary e.g. for correct DMA mapping.  E.g. a
    device X on bus B which is attached to controller Y needs DMA
    mapping for device Y, not for device X.  Same, NUMA-optimized
    allocations have to be done for the NUMA node of DMA device Y.  An
    association of device X with a NUMA node is not interesting to the
    driver.  It becomes a bit more difficult if the driver sits between
    two subsystems and that upper-layer subsystem has no flexible API to
    pass up the correct NUMA node.

So, since figuring the correct DMA device out is done by drivers
themselves, they usually can figure out the correct NUMA node as well.
The only precondition is that each DMA device has the correct NUMA node
set.  This is the job of subsystems like PCI, but it may be reasonable
to have simple and broadly applicable helpers for this in the driver core.

So, _is_ the dev->numa_node = dev->parent.numa_node assumption, if
implemented in device_add() and device_move(), as simple and as broadly
applicable as we want driver core's API to be?  Perhaps yes.
-- 
Stefan Richter
-=====-=-=== -=== -==--
http://arcgraph.de/sr/

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/5] try parent numa_node at first before using default
  2007-07-12 11:30                 ` Stefan Richter
@ 2007-07-12 15:23                   ` Cornelia Huck
  2007-07-12 17:59                     ` [PATCH] " Yinghai Lu
  0 siblings, 1 reply; 22+ messages in thread
From: Cornelia Huck @ 2007-07-12 15:23 UTC (permalink / raw)
  To: Stefan Richter
  Cc: Yinghai.Lu, Greg KH, Andrew Morton, Andi Kleen, rientjes,
	Christoph Lameter, Christoph Hellwig, David Miller,
	Linux Kernel Mailing List, netdev

On Thu, 12 Jul 2007 13:30:28 +0200,
Stefan Richter <stefanr@s5r6.in-berlin.de> wrote:

> So, since figuring the correct DMA device out is done by drivers
> themselves, they usually can figure out the correct NUMA node as well.
> The only precondition is that each DMA device has the correct NUMA node
> set.  This is the job of subsystems like PCI, but it may be reasonable
> to have simple and broadly applicable helpers for this in the driver core.
> 
> So, _is_ the dev->numa_node = dev->parent.numa_node assumption, if
> implemented in device_add() and device_move(), as simple and as broadly
> applicable as we want driver core's API to be?  Perhaps yes.

If making the general assumption that the device will inherit the
numa_node through its parent(s) from the responsible DMA device is
reasonable, then device_move() should certainly update the numa_node
from the new parent. (For special cases, the caller could still call
set_dev_node() itself.)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH] try parent numa_node at first before using default
  2007-07-12 15:23                   ` Cornelia Huck
@ 2007-07-12 17:59                     ` Yinghai Lu
  2007-07-12 18:31                       ` Greg KH
  2007-07-13  5:48                       ` Cornelia Huck
  0 siblings, 2 replies; 22+ messages in thread
From: Yinghai Lu @ 2007-07-12 17:59 UTC (permalink / raw)
  To: Cornelia Huck, Stefan Richter, Greg KH, Andrew Morton, Andi Kleen,
	rientjes, Christoph Lameter, Christoph Hellwig, David Miller
  Cc: Linux Kernel Mailing List, netdev

[PATCH] try parent numa_node at first before using default

For pci_device, pcibios_scan_root and pci_scan_root will call pci_device_add.
pci_device_add will call device_initialize and set_dev_node(&dev->dev,
pcibus_to_node(bus)).
other device such as netdev, and usb_device, set_dev_node is never be
used. So that field numa_node always is -1.

So for netdev, it will need to use dev->parent to get pci_device to
use it's numa_node. esp in netdev_alloc_skb()
not sure how other device such as infiniband do that.

Actually before patch
[PATCH 1/2] x86_64: get mp_bus_to_node as early
there is a bug about squence of bus->sysdata and using pcibus_to_node.
the numa_node of pci_dev->dev is never set correctly...always 0.

So some device have to use pcibus_to_node(to_pci_dev(dev)->bus) directly
such as dma_alloc_pages in arch/x86_64/kernel/pci-dma.c.
or hwif_to_node in include/linux/ide.h

According to Stefan Richter
  - Change all subsystems to set dev->parent before device_initialize().
    *Document* that the device_initialize() API has this requirement.
    This is counter-intuitive, amounts to some work across the kernel,
    and could be gotten wrong again in future code because it's a
    counter-intuitive API.

  - Move your code from device_initialize() to device_add().  One minor
    drawback is that node-specific allocations based on the device's
    numa_node would not be optimized before device_add(), but there is
    probably no need for this.  Driver probes come after device_add().

  - Let subsystems explicitly call set_dev_node() on their own.

this patch is using second method.

Also we don't need call set_dev_node in pci_device_add anymore. but need to
make sure every pci root bus's bridge device numa is set.

with this patch, we could use device->numa_node direclty for all device.

Signed-off-by: Yinghai Lu <yinghai.lu@sun.com>

diff --git a/drivers/base/core.c b/drivers/base/core.c
index dd40d78..091c2b1 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -666,6 +666,11 @@ int device_add(struct device *dev)
 	if (error)
 		goto Error;
 
+	/* use parent numa_node */
+	if (parent) {
+		set_dev_node(dev, dev_to_node(dev->parent));
+	}
+
 	/* first, register with generic layer. */
 	kobject_set_name(&dev->kobj, "%s", dev->bus_id);
 	error = kobject_add(&dev->kobj);
@@ -1285,8 +1290,11 @@ int device_move(struct device *dev, struct device *new_parent)
 	dev->parent = new_parent;
 	if (old_parent)
 		klist_remove(&dev->knode_parent);
-	if (new_parent)
+	if (new_parent) {
 		klist_add_tail(&dev->knode_parent, &new_parent->klist_children);
+		set_dev_node(dev, dev_to_node(new_parent));
+	}
+
 	if (!dev->class)
 		goto out_put;
 	error = device_move_class_links(dev, old_parent, new_parent);
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index e48fcf0..c029ffc 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -935,7 +938,6 @@ void pci_device_add(struct pci_dev *dev, struct pci_bus *bus)
 	dev->dev.release = pci_release_dev;
 	pci_dev_get(dev);
 
-	set_dev_node(&dev->dev, pcibus_to_node(bus));
 	dev->dev.dma_mask = &dev->dma_mask;
 	dev->dev.coherent_dma_mask = 0xffffffffull;
 
@@ -1096,6 +1098,9 @@ struct pci_bus * pci_create_bus(struct device *parent,
 		goto dev_reg_err;
 	b->bridge = get_device(dev);
 
+	if (!parent)
+		set_dev_node(b->bridge, pcibus_to_node(b));
+
 	b->class_dev.class = &pcibus_class;
 	sprintf(b->class_dev.class_id, "%04x:%02x", pci_domain_nr(b), bus);
 	error = class_device_register(&b->class_dev);

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH] try parent numa_node at first before using default
  2007-07-12 17:59                     ` [PATCH] " Yinghai Lu
@ 2007-07-12 18:31                       ` Greg KH
  2007-07-12 19:06                         ` Yinghai Lu
  2007-07-13  5:48                       ` Cornelia Huck
  1 sibling, 1 reply; 22+ messages in thread
From: Greg KH @ 2007-07-12 18:31 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Cornelia Huck, Stefan Richter, Andrew Morton, Andi Kleen,
	rientjes, Christoph Lameter, Christoph Hellwig, David Miller,
	Linux Kernel Mailing List, netdev

On Thu, Jul 12, 2007 at 10:59:53AM -0700, Yinghai Lu wrote:
> [PATCH] try parent numa_node at first before using default
> 
> For pci_device, pcibios_scan_root and pci_scan_root will call pci_device_add.
> pci_device_add will call device_initialize and set_dev_node(&dev->dev,
> pcibus_to_node(bus)).
> other device such as netdev, and usb_device, set_dev_node is never be
> used. So that field numa_node always is -1.
> 
> So for netdev, it will need to use dev->parent to get pci_device to
> use it's numa_node. esp in netdev_alloc_skb()
> not sure how other device such as infiniband do that.
> 
> Actually before patch
> [PATCH 1/2] x86_64: get mp_bus_to_node as early
> there is a bug about squence of bus->sysdata and using pcibus_to_node.
> the numa_node of pci_dev->dev is never set correctly...always 0.
> 
> So some device have to use pcibus_to_node(to_pci_dev(dev)->bus) directly
> such as dma_alloc_pages in arch/x86_64/kernel/pci-dma.c.
> or hwif_to_node in include/linux/ide.h
> 
> According to Stefan Richter
>   - Change all subsystems to set dev->parent before device_initialize().
>     *Document* that the device_initialize() API has this requirement.
>     This is counter-intuitive, amounts to some work across the kernel,
>     and could be gotten wrong again in future code because it's a
>     counter-intuitive API.
> 
>   - Move your code from device_initialize() to device_add().  One minor
>     drawback is that node-specific allocations based on the device's
>     numa_node would not be optimized before device_add(), but there is
>     probably no need for this.  Driver probes come after device_add().
> 
>   - Let subsystems explicitly call set_dev_node() on their own.
> 
> this patch is using second method.
> 
> Also we don't need call set_dev_node in pci_device_add anymore. but need to
> make sure every pci root bus's bridge device numa is set.
> 
> with this patch, we could use device->numa_node direclty for all device.

Please split this into two separate patches, as they are doing two
different things.  One for the driver core, and one for pci devices.

Also, what kind of performance issues have you seen with this patch in
place?  As your previous patches were incorrect, I am guessing that you
are not able to test this?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] try parent numa_node at first before using default
  2007-07-12 18:31                       ` Greg KH
@ 2007-07-12 19:06                         ` Yinghai Lu
  2007-07-13  3:16                           ` Greg KH
  0 siblings, 1 reply; 22+ messages in thread
From: Yinghai Lu @ 2007-07-12 19:06 UTC (permalink / raw)
  To: Greg KH
  Cc: Cornelia Huck, Stefan Richter, Andrew Morton, Andi Kleen,
	rientjes, Christoph Lameter, Christoph Hellwig, David Miller,
	Linux Kernel Mailing List, netdev

On 7/12/07, Greg KH <greg@kroah.com> wrote:
> On Thu, Jul 12, 2007 at 10:59:53AM -0700, Yinghai Lu wrote:
> Please split this into two separate patches, as they are doing two
> different things.  One for the driver core, and one for pci devices.

they need to applied together, otherwise all pci_dev numa_node will be
change to device_add to -1 after pci_device_add.

>
> Also, what kind of performance issues have you seen with this patch in
> place?  As your previous patches were incorrect, I am guessing that you
> are not able to test this?
forcedeth on second node. without this patch in addition to
get_mpbus_to_node patch, the numa_node is always 0.


LBsuse91AMD64:/sys/devices/pci0000:80/0000:80:08.0 # cat numa_node
1
LBsuse91AMD64:/sys/devices/pci0000:80/0000:80:08.0 # cat net/eth2/numa_node
1

and after initialization

LBsuse91AMD64:~ # tail -10 /var/log/messages
Jul 13 03:05:16 LBsuse91AMD64 kernel: forcedeth 0000:00:08.0: YHLU
gart_map_single - !need_iommu: ffff81011d33d010, 11d33d010, 670
Jul 13 03:05:26 LBsuse91AMD64 kernel: forcedeth 0000:00:08.0: YHLU
gart_map_single - !need_iommu: ffff81011d322010, 11d322010, 670
Jul 13 03:05:29 LBsuse91AMD64 kernel: forcedeth 0000:00:08.0: YHLU
gart_map_single - !need_iommu: ffff81011d33a810, 11d33a810, 670
Jul 13 03:05:31 LBsuse91AMD64 kernel: forcedeth 0000:00:08.0: YHLU
gart_map_single - !need_iommu: ffff81011d33e810, 11d33e810, 670
Jul 13 03:05:31 LBsuse91AMD64 kernel: forcedeth 0000:00:08.0: YHLU
gart_map_single - !need_iommu: ffff81011d358810, 11d358810, 670
Jul 13 03:05:31 LBsuse91AMD64 kernel: forcedeth 0000:00:08.0: YHLU
gart_map_single - !need_iommu: ffff81011cfe1010, 11cfe1010, 670
Jul 13 03:05:31 LBsuse91AMD64 kernel: forcedeth 0000:00:08.0: YHLU
gart_map_single - !need_iommu: ffff81011cfe1810, 11cfe1810, 670
Jul 13 03:05:31 LBsuse91AMD64 kernel: forcedeth 0000:00:08.0: YHLU
gart_map_single - !need_iommu: ffff81011c7f9010, 11c7f9010, 670
Jul 13 03:05:31 LBsuse91AMD64 kernel: forcedeth 0000:00:08.0: YHLU
gart_map_single - !need_iommu: ffff81011c7f9810, 11c7f9810, 670
Jul 13 03:05:34 LBsuse91AMD64 kernel: forcedeth 0000:00:08.0: YHLU
gart_map_single - !need_iommu: ffff81011c2a7010, 11c2a7010, 670

gart_map_single is used all the time via netdev_alloc_skb for forcedth.

YH

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] try parent numa_node at first before using default
  2007-07-12 19:06                         ` Yinghai Lu
@ 2007-07-13  3:16                           ` Greg KH
  2007-07-13  4:42                             ` Yinghai Lu
  0 siblings, 1 reply; 22+ messages in thread
From: Greg KH @ 2007-07-13  3:16 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Cornelia Huck, Stefan Richter, Andrew Morton, Andi Kleen,
	rientjes, Christoph Lameter, Christoph Hellwig, David Miller,
	Linux Kernel Mailing List, netdev

On Thu, Jul 12, 2007 at 12:06:00PM -0700, Yinghai Lu wrote:
>  On 7/12/07, Greg KH <greg@kroah.com> wrote:
> > On Thu, Jul 12, 2007 at 10:59:53AM -0700, Yinghai Lu wrote:
> > Please split this into two separate patches, as they are doing two
> > different things.  One for the driver core, and one for pci devices.
> 
>  they need to applied together, otherwise all pci_dev numa_node will be
>  change to device_add to -1 after pci_device_add.

Hm, good point.

> > Also, what kind of performance issues have you seen with this patch in
> > place?  As your previous patches were incorrect, I am guessing that you
> > are not able to test this?
>  forcedeth on second node. without this patch in addition to
>  get_mpbus_to_node patch, the numa_node is always 0.

But you were not able to test USB, right?  Does that work properly now
with this patch?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] try parent numa_node at first before using default
  2007-07-13  3:16                           ` Greg KH
@ 2007-07-13  4:42                             ` Yinghai Lu
  0 siblings, 0 replies; 22+ messages in thread
From: Yinghai Lu @ 2007-07-13  4:42 UTC (permalink / raw)
  To: Greg KH
  Cc: Cornelia Huck, Stefan Richter, Andrew Morton, Andi Kleen,
	rientjes, Christoph Lameter, Christoph Hellwig, David Miller,
	Linux Kernel Mailing List, netdev

On 7/12/07, Greg KH <greg@kroah.com> wrote:
>
> But you were not able to test USB, right?  Does that work properly now
> with this patch?

Will test it later with USB cdrom by big file reading.

YH

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] try parent numa_node at first before using default
  2007-07-12 17:59                     ` [PATCH] " Yinghai Lu
  2007-07-12 18:31                       ` Greg KH
@ 2007-07-13  5:48                       ` Cornelia Huck
  2007-07-13 19:27                         ` [PATCH] try parent numa_node at first before using default v2 Yinghai Lu
  1 sibling, 1 reply; 22+ messages in thread
From: Cornelia Huck @ 2007-07-13  5:48 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Stefan Richter, Greg KH, Andrew Morton, Andi Kleen, rientjes,
	Christoph Lameter, Christoph Hellwig, David Miller,
	Linux Kernel Mailing List, netdev

On Thu, 12 Jul 2007 10:59:53 -0700,
Yinghai Lu <Yinghai.Lu@Sun.COM> wrote:

> @@ -1285,8 +1290,11 @@ int device_move(struct device *dev, struct device *new_parent)
>  	dev->parent = new_parent;
>  	if (old_parent)
>  		klist_remove(&dev->knode_parent);
> -	if (new_parent)
> +	if (new_parent) {
>  		klist_add_tail(&dev->knode_parent, &new_parent->klist_children);
> +		set_dev_node(dev, dev_to_node(new_parent));
> +	}
> +
>  	if (!dev->class)
>  		goto out_put;
>  	error = device_move_class_links(dev, old_parent, new_parent);

You're not correctly undoing the changes if the last function fails.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH] try parent numa_node at first before using default v2
  2007-07-13  5:48                       ` Cornelia Huck
@ 2007-07-13 19:27                         ` Yinghai Lu
  0 siblings, 0 replies; 22+ messages in thread
From: Yinghai Lu @ 2007-07-13 19:27 UTC (permalink / raw)
  To: Cornelia Huck, Stefan Richter, Greg KH, Andrew Morton, Andi Kleen,
	rientjes, Christoph Lameter, Christoph Hellwig, David Miller
  Cc: Linux Kernel Mailing List, netdev

[PATCH] try parent numa_node at first before using default v2

For pci_device, pcibios_scan_root and pci_scan_root will call pci_device_add.
pci_device_add will call device_initialize and set_dev_node(&dev->dev,
pcibus_to_node(bus)).
other device such as netdev, and usb_device, set_dev_node is never be
used. So that field numa_node always is -1.

So for netdev, it will need to use dev->parent to get pci_device to
use it's numa_node. esp in netdev_alloc_skb()
not sure how other device such as infiniband do that.

Actually before patch
[PATCH 1/2] x86_64: get mp_bus_to_node as early
there is a bug about squence of bus->sysdata and using pcibus_to_node.
the numa_node of pci_dev->dev is never set correctly...always 0.

So some device have to use pcibus_to_node(to_pci_dev(dev)->bus) directly
such as dma_alloc_pages in arch/x86_64/kernel/pci-dma.c.
or hwif_to_node in include/linux/ide.h

According to Stefan Richter
  - Change all subsystems to set dev->parent before device_initialize().
    *Document* that the device_initialize() API has this requirement.
    This is counter-intuitive, amounts to some work across the kernel,
    and could be gotten wrong again in future code because it's a
    counter-intuitive API.

  - Move your code from device_initialize() to device_add().  One minor
    drawback is that node-specific allocations based on the device's
    numa_node would not be optimized before device_add(), but there is
    probably no need for this.  Driver probes come after device_add().

  - Let subsystems explicitly call set_dev_node() on their own.

this patch is using second method.

Also we don't need call set_dev_node in pci_device_add anymore. but need to
make sure every pci root bus's bridge device numa is set.

with this patch, we could use device->numa_node direclty for all device.

Signed-off-by: Yinghai Lu <yinghai.lu@sun.com>

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index e48fcf0..c029ffc 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -935,7 +938,6 @@ void pci_device_add(struct pci_dev *dev, struct pci_bus *bus)
 	dev->dev.release = pci_release_dev;
 	pci_dev_get(dev);
 
-	set_dev_node(&dev->dev, pcibus_to_node(bus));
 	dev->dev.dma_mask = &dev->dma_mask;
 	dev->dev.coherent_dma_mask = 0xffffffffull;
 
@@ -1096,6 +1098,9 @@ struct pci_bus * pci_create_bus(struct device *parent,
 		goto dev_reg_err;
 	b->bridge = get_device(dev);
 
+	if (!parent)
+		set_dev_node(b->bridge, pcibus_to_node(b));
+
 	b->class_dev.class = &pcibus_class;
 	sprintf(b->class_dev.class_id, "%04x:%02x", pci_domain_nr(b), bus);
 	error = class_device_register(&b->class_dev);
diff --git a/drivers/base/core.c b/drivers/base/core.c
index 0455aa7..d8b063b 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -672,6 +672,11 @@ int device_add(struct device *dev)
 	if (error)
 		goto Error;
 
+	/* use parent numa_node */
+	if (parent) {
+		set_dev_node(dev, dev_to_node(parent));
+	}
+
 	/* first, register with generic layer. */
 	kobject_set_name(&dev->kobj, "%s", dev->bus_id);
 	error = kobject_add(&dev->kobj);
@@ -1269,8 +1274,11 @@ int device_move(struct device *dev, struct device *new_parent)
 	dev->parent = new_parent;
 	if (old_parent)
 		klist_remove(&dev->knode_parent);
-	if (new_parent)
+	if (new_parent) {
 		klist_add_tail(&dev->knode_parent, &new_parent->klist_children);
+		set_dev_node(dev, dev_to_node(new_parent));
+	}
+
 	if (!dev->class)
 		goto out_put;
 	error = device_move_class_links(dev, old_parent, new_parent);
@@ -1280,9 +1288,12 @@ int device_move(struct device *dev, struct device *new_parent)
 		if (!kobject_move(&dev->kobj, &old_parent->kobj)) {
 			if (new_parent)
 				klist_remove(&dev->knode_parent);
-			if (old_parent)
+			dev->parent = old_parent;
+			if (old_parent) {
 				klist_add_tail(&dev->knode_parent,
 					       &old_parent->klist_children);
+				set_dev_node(dev, dev_to_node(old_parent));
+			}
 		}
 		put_device(new_parent);
 		goto out;

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 5/5] dma: use dev_to_node to get node for device in dma_alloc_pages
  2007-07-10 23:53 ` [PATCH 5/5] dma: use dev_to_node to get node for device in dma_alloc_pages Yinghai Lu
@ 2007-07-23 19:30   ` Christoph Lameter
  0 siblings, 0 replies; 22+ messages in thread
From: Christoph Lameter @ 2007-07-23 19:30 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Andrew Morton, Andi Kleen, Greg KH, rientjes, Christoph Hellwig,
	David Miller, Stefan Richter, Linux Kernel Mailing List, netdev

On Tue, 10 Jul 2007 16:53:09 -0700
Yinghai Lu <Yinghai.Lu@Sun.COM> wrote:

> [PATCH 5/5] dma: use dev_to_node to get node for device in
> dma_alloc_pages

Acked-by: Christoph Lameter <clameter@sgi.com>

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2007-07-23 19:30 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <200707101641.17672.yinghai.lu@sun.com>
2007-07-10 23:52 ` [PATCH 1/5] try parent numa_node at first before using default Yinghai Lu
2007-07-11 10:54   ` Stefan Richter
2007-07-11 11:03     ` Stefan Richter
2007-07-11 21:08     ` Greg KH
2007-07-11 21:28       ` Yinghai Lu
2007-07-12  2:47         ` Stefan Richter
2007-07-12  3:01           ` Yinghai Lu
2007-07-12  5:47             ` Stefan Richter
2007-07-12  7:15               ` Cornelia Huck
2007-07-12 11:30                 ` Stefan Richter
2007-07-12 15:23                   ` Cornelia Huck
2007-07-12 17:59                     ` [PATCH] " Yinghai Lu
2007-07-12 18:31                       ` Greg KH
2007-07-12 19:06                         ` Yinghai Lu
2007-07-13  3:16                           ` Greg KH
2007-07-13  4:42                             ` Yinghai Lu
2007-07-13  5:48                       ` Cornelia Huck
2007-07-13 19:27                         ` [PATCH] try parent numa_node at first before using default v2 Yinghai Lu
2007-07-10 23:52 ` [PATCH 4/5] net: show numa_node for net_device in /sys Yinghai Lu
2007-07-10 23:53 ` [PATCH 5/5] dma: use dev_to_node to get node for device in dma_alloc_pages Yinghai Lu
2007-07-23 19:30   ` Christoph Lameter
2007-07-11  0:05 ` [PATCH 2/5] net: use numa_node in net_devcice->dev instead of parent Yinghai Lu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).