Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] sysctl: fix min/max handling in __do_proc_doulongvec_minmax()
From: Andrew Morton @ 2010-10-07 19:18 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric W. Biederman, Américo Wang, Robin Holt, linux-kernel,
	Willy Tarreau, David S. Miller, netdev, James Morris,
	Pekka Savola (ipv6), Patrick McHardy, Alexey Kuznetsov
In-Reply-To: <1286470743.2912.276.camel@edumazet-laptop>

On Thu, 07 Oct 2010 18:59:03 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Le jeudi 07 octobre 2010 __ 09:37 -0700, Eric W. Biederman a __crit :
> 
> > The difference between long handling and int handling is a
> > usability issue.  I don't expect we will be exporting new
> > vectors via sysctl, so the conversion of a handful of vectors
> > from int to long is where this is most likely to be used.
> > 
> > I skimmed through all of what I presume are the current users
> > aka linux-2.6.36-rcX and there don't appear to be any users
> > of proc_dounlongvec_minmax that use it's vector properties there.
> > 
> > Which doubly tells me that incrementing the min and max pointers
> > is not what we want to do.
> > 
> 
> Thats fine by me, thanks Eric.
> 
> Andrew, please remove previous patch from your tree and replace it by
> following one :
> 
> [PATCH v2] sysctl: fix min/max handling in __do_proc_doulongvec_minmax()
> 
> When proc_doulongvec_minmax() is used with an array of longs,
> and no min/max check requested (.extra1 or .extra2 being NULL), we
> dereference a NULL pointer for the second element of the array.
> 
> Noticed while doing some changes in network stack for the "16TB problem"
> 
> Fix is to not change min & max pointers in
> __do_proc_doulongvec_minmax(), so that all elements of the vector share
> an unique min/max limit, like proc_dointvec_minmax().
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
>  kernel/sysctl.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index f88552c..8e45451 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -2485,7 +2485,7 @@ static int __do_proc_doulongvec_minmax(void *data, struct ctl_table *table, int
>  		kbuf[left] = 0;
>  	}
>  
> -	for (; left && vleft--; i++, min++, max++, first=0) {
> +	for (; left && vleft--; i++, first=0) {
>  		unsigned long val;
>  
>  		if (write) {

Did we check to see whether any present callers are passing in pointers
to arrays of min/max values?

I wonder if there's any documentation for this interface which just
became wrong.

^ permalink raw reply

* pull request: wireless-2.6 2010-10-07
From: John W. Linville @ 2010-10-07 18:58 UTC (permalink / raw)
  To: davem; +Cc: linux-wireless, netdev, linux-kernel

Dave,

Here are a few more more-or-less-one-liner fixes intended for 2.6.36.
The one from Felix fixes a regression introduced in the 2.6.36 cycle.
The one from Johannes fixes a crash reported by Ben Greear (as
documented in the changelog).  The one from me reverts and earlier patch
from me that can result in stuff in dmesg from not calling
netif_receive_skb in the proper context.

Please let me know if there are problems!

Thanks,

John

---

The following changes since commit fb3dbece264a50ab4373f3af0bbbd9175d3ad4d7:

  Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/padovan/bluetooth-2.6 (2010-10-07 00:59:39 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6.git master

Felix Fietkau (1):
      ath9k_hw: fix regression in ANI listen time calculation

Johannes Berg (1):
      mac80211: delete AddBA response timer

John W. Linville (1):
      Revert "mac80211: use netif_receive_skb in ieee80211_tx_status callpath"

 drivers/net/wireless/ath/ath9k/ani.c |    2 +-
 net/mac80211/agg-tx.c                |    2 ++
 net/mac80211/status.c                |    4 ++--
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/net/wireless/ath/ath9k/ani.c b/drivers/net/wireless/ath/ath9k/ani.c
index cc648b6..a3d95cc 100644
--- a/drivers/net/wireless/ath/ath9k/ani.c
+++ b/drivers/net/wireless/ath/ath9k/ani.c
@@ -543,7 +543,7 @@ static u8 ath9k_hw_chan_2_clockrate_mhz(struct ath_hw *ah)
 	if (conf_is_ht40(conf))
 		return clockrate * 2;
 
-	return clockrate * 2;
+	return clockrate;
 }
 
 static int32_t ath9k_hw_ani_get_listen_time(struct ath_hw *ah)
diff --git a/net/mac80211/agg-tx.c b/net/mac80211/agg-tx.c
index c893f23..8f23401 100644
--- a/net/mac80211/agg-tx.c
+++ b/net/mac80211/agg-tx.c
@@ -175,6 +175,8 @@ int ___ieee80211_stop_tx_ba_session(struct sta_info *sta, u16 tid,
 
 	set_bit(HT_AGG_STATE_STOPPING, &tid_tx->state);
 
+	del_timer_sync(&tid_tx->addba_resp_timer);
+
 	/*
 	 * After this packets are no longer handed right through
 	 * to the driver but are put onto tid_tx->pending instead,
diff --git a/net/mac80211/status.c b/net/mac80211/status.c
index 10caec5..34da679 100644
--- a/net/mac80211/status.c
+++ b/net/mac80211/status.c
@@ -377,7 +377,7 @@ void ieee80211_tx_status(struct ieee80211_hw *hw, struct sk_buff *skb)
 				skb2 = skb_clone(skb, GFP_ATOMIC);
 				if (skb2) {
 					skb2->dev = prev_dev;
-					netif_receive_skb(skb2);
+					netif_rx(skb2);
 				}
 			}
 
@@ -386,7 +386,7 @@ void ieee80211_tx_status(struct ieee80211_hw *hw, struct sk_buff *skb)
 	}
 	if (prev_dev) {
 		skb->dev = prev_dev;
-		netif_receive_skb(skb);
+		netif_rx(skb);
 		skb = NULL;
 	}
 	rcu_read_unlock();
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply related

* Re: [PATCH V2] Use firmware provided index to register a network interface
From: Narendra_K @ 2010-10-07 18:44 UTC (permalink / raw)
  To: kay.sievers
  Cc: greg, Matt_Domsch, netdev, linux-hotplug, linux-pci,
	Jordan_Hargrave, Vijay_Nijhawan, Charles_Rose
In-Reply-To: <AANLkTinP5WPDPvk+kq8vsyP=xC9qcoe+c=1EBp0XJNPk@mail.gmail.com>

On Thu, Oct 07, 2010 at 10:35:14PM +0530, Kay Sievers wrote:
>    On Thu, Oct 7, 2010 at 18:48, Greg KH <greg@kroah.com> wrote:
>    > On Thu, Oct 07, 2010 at 11:31:13AM -0500, Matt Domsch wrote:
>    >> 1) SMBIOS type 41 method.  Windows does not use this today, and I
>    >>    can't speak to their future plans.  Narendra's kernel patch does,
>    >>    as has biosdevname, the udev helper we first wrote for this
>    >>    purpose, for several years.
>    >
>    > Then stick with that udev helper please :)
> 
>    What about just exporting this information in sysfs, and not touch the
>    naming?
> 
>    Anyway, I'm pretty sure all of this naming of onboard devices should
>    happen only at install time, or from a system management tool and not
>    at hotplug time.
> 
>    We should not get confused by the way the (very simple)
>    automatic-rule-creater for persistent netdev naming in udev works.
>    This is really just a tool for the common case, and works fine for the
>    majority of people.

Right. It works as the automatic rule creator saves the snapshot of the
registered network interfaces at run time. 

> 
>    I'm not sure, if we should put all these special use cases in the
>    hotplug path. I mean it's not that people add and remove 4 port
>    network cards with special BIOS all the time, and expect proper naming
>    on the first bootup, right? The installer, or the system management
>    tool could just create/edit udev rules to provide proper device naming
>    on whatever property is available at a specific hardware, be it the
>    MAC address or some other persistent match?
> 

The proposal made is not expecting deterministic naming when an add-in
card with 'N' ports is plugged in/out. It is specific to onboard devices
only. Expectation is onboard devices have deterministic naming at first
bootup. And no special BIOS required as SMBIOS tables are in use for
sometime now.

I did explore using rules based on the exported attribute ATTRS{index}
on a system with 4 Onboard devices and two add-in devices.(where add-in
device becomes eth0 and eth1)

# PCI device 0x14e4:0x164c (bnx2) (custom name provided by external
# tool) SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*",
# ATTRS{index}=="1", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"

(and similary for eth1..eth3)

And for add-in devices.

# PCI device 0x8086:0x10c9 (igb) (custom name provided by external tool)
# SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*",
# ATTR{address}=="00:1b:21:54:33:3c", ATTR{type}=="1", KERNEL=="eth*",
# NAME="eth4" (and similar for eth5 as they do not have an index)

This works as i edited the file manually.

If this has to be done on a large number of systems where an image
based deployment is preferred, getting onboard device names as expected
is an issue and is important.

-- 
With regards,
Narendra K

^ permalink raw reply

* Re: [PATCH net-next] net: percpu net_device refcount
From: Eric Dumazet @ 2010-10-07 18:06 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, netdev
In-Reply-To: <20101007103051.63b5177c@nehalam>

Le jeudi 07 octobre 2010 à 10:30 -0700, Stephen Hemminger a écrit :

> It makes sense, but what about 256 cores and 1024 Vlans?
> That adds up to 4M of memory which is might be noticeable.
> 
> 

Well, 256 cores and 1024 Vlan -> 1 Mbyte of memory, not 4 ;)

This seems reasonable to me.

Eventually we could use a fallback, if percpu allocation failed -> use a
static field in net_device.




^ permalink raw reply

* [PATCH net-next-2.6] sfc: Don't try to set filters with search depths we know won't work
From: Ben Hutchings @ 2010-10-07 17:50 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers

The filter engine will time-out and ignore filters beyond
200-something hops.  We also need to avoid infinite loops in
efx_filter_search() when the table is full.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/sfc/filter.c |   11 ++++++++++-
 1 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/drivers/net/sfc/filter.c b/drivers/net/sfc/filter.c
index abc884d..52cb608 100644
--- a/drivers/net/sfc/filter.c
+++ b/drivers/net/sfc/filter.c
@@ -20,6 +20,12 @@
 #define FILTER_CTL_SRCH_FUDGE_WILD 3
 #define FILTER_CTL_SRCH_FUDGE_FULL 1
 
+/* Hard maximum hop limit.  Hardware will time-out beyond 200-something.
+ * We also need to avoid infinite loops in efx_filter_search() when the
+ * table is full.
+ */
+#define FILTER_CTL_SRCH_MAX 200
+
 struct efx_filter_table {
 	u32		offset;		/* address of table relative to BAR */
 	unsigned	size;		/* number of entries */
@@ -183,7 +189,8 @@ static int efx_filter_search(struct efx_filter_table *table,
 	incr = efx_filter_increment(key);
 
 	for (depth = 1, filter_idx = hash & (table->size - 1);
-	     test_bit(filter_idx, table->used_bitmap);
+	     depth <= FILTER_CTL_SRCH_MAX &&
+		     test_bit(filter_idx, table->used_bitmap);
 	     ++depth) {
 		cmp = &table->spec[filter_idx];
 		if (efx_filter_equal(spec, cmp))
@@ -192,6 +199,8 @@ static int efx_filter_search(struct efx_filter_table *table,
 	}
 	if (!for_insert)
 		return -ENOENT;
+	if (depth > FILTER_CTL_SRCH_MAX)
+		return -EBUSY;
 found:
 	*depth_required = depth;
 	return filter_idx;
-- 
1.7.2.1


-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply related

* Re: [PATCH V2] Use firmware provided index to register a network interface
From: Narendra_K @ 2010-10-07 17:44 UTC (permalink / raw)
  To: greg
  Cc: kay.sievers, Matt_Domsch, netdev, linux-hotplug, linux-pci,
	Jordan_Hargrave, Vijay_Nijhawan, Charles_Rose
In-Reply-To: <20101007171532.GA29857@kroah.com>

On Thu, Oct 07, 2010 at 10:45:32PM +0530, Greg KH wrote:
>    On Thu, Oct 07, 2010 at 07:05:14PM +0200, Kay Sievers wrote:
>    > On Thu, Oct 7, 2010 at 18:48, Greg KH <greg@kroah.com> wrote:
>    > > On Thu, Oct 07, 2010 at 11:31:13AM -0500, Matt Domsch wrote:
>    > >> 1) SMBIOS type 41 method.  Windows does not use this today, and I
>    > >>    can't speak to their future plans.  Narendra's kernel patch does,
>    > >>    as has biosdevname, the udev helper we first wrote for this
>    > >>    purpose, for several years.
>    > >
>    > > Then stick with that udev helper please :)
>    >
>    > What about just exporting this information in sysfs, and not touch the
>    naming?
> 
>    I think this is now exported in userspace, right Narendra?
> 

Right. It is in the mainline kernel.
( commit 911e1c9b05a8e3559a7aa89083930700a0b9e7ee)

-- 
With regards,
Narendra K

^ permalink raw reply

* Re: [PATCH net-next] net: percpu net_device refcount
From: Stephen Hemminger @ 2010-10-07 17:30 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1286471555.2912.291.camel@edumazet-laptop>

On Thu, 07 Oct 2010 19:12:35 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> We tried very hard to remove all possible dev_hold()/dev_put() pairs in
> network stack, using RCU conversions.
> 
> There is still an unavoidable device refcount change for every dst we
> create/destroy, and this can slow down some workloads (routers or some
> app servers)
> 
> We can switch to a percpu refcount implementation, now dynamic per_cpu
> infrastructure is mature. On a 64 cpus machine, this consumes 256 bytes
> per device.
> 

It makes sense, but what about 256 cores and 1024 Vlans?
That adds up to 4M of memory which is might be noticeable.


-- 

^ permalink raw reply

* Re: [PATCH V2] Use firmware provided index to register a network interface
From: Greg KH @ 2010-10-07 17:15 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Matt Domsch, Narendra_K, netdev, linux-hotplug, linux-pci,
	Jordan_Hargrave, Vijay_Nijhawan, Charles_Rose
In-Reply-To: <AANLkTinP5WPDPvk+kq8vsyP=xC9qcoe+c=1EBp0XJNPk@mail.gmail.com>

On Thu, Oct 07, 2010 at 07:05:14PM +0200, Kay Sievers wrote:
> On Thu, Oct 7, 2010 at 18:48, Greg KH <greg@kroah.com> wrote:
> > On Thu, Oct 07, 2010 at 11:31:13AM -0500, Matt Domsch wrote:
> >> 1) SMBIOS type 41 method.  Windows does not use this today, and I
> >>    can't speak to their future plans.  Narendra's kernel patch does,
> >>    as has biosdevname, the udev helper we first wrote for this
> >>    purpose, for several years.
> >
> > Then stick with that udev helper please :)
> 
> What about just exporting this information in sysfs, and not touch the naming?

I think this is now exported in userspace, right Narendra?

> Anyway, I'm pretty sure all of this naming of onboard devices should
> happen only at install time, or from a system management tool and not
> at hotplug time.
> 
> We should not get confused by the way the (very simple)
> automatic-rule-creater for persistent netdev naming in udev works.
> This is really just a tool for the common case, and works fine for the
> majority of people.
> 
> I'm not sure, if we should put all these special use cases in the
> hotplug path. I mean it's not that people add and remove 4 port
> network cards with special BIOS all the time, and expect proper naming
> on the first bootup, right? The installer, or the system management
> tool could just create/edit udev rules to provide proper device naming
> on whatever property is available at a specific hardware, be it the
> MAC address or some other persistent match?

I totally agree.

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH V2] Use firmware provided index to register a network interface
From: Kay Sievers @ 2010-10-07 17:13 UTC (permalink / raw)
  To: David Lamparter
  Cc: Matt Domsch, Greg KH, Narendra_K, netdev, linux-hotplug,
	linux-pci, Jordan_Hargrave, Vijay_Nijhawan, Charles_Rose
In-Reply-To: <20101007164906.GB122295@jupiter.n2.diac24.net>

On Thu, Oct 7, 2010 at 18:49, David Lamparter <equinox@diac24.net> wrote:
> On Thu, Oct 07, 2010 at 11:31:13AM -0500, Matt Domsch wrote:
>> At most, they'll have to deal with the same "eth0_rename"
>
> Getting that "eth0_rename" stuff is not something unchangeably imposed
> by the kernel. I think this should be considered an udev bug most of all
> things, since if the udev scripts did it right, this wouldn't happen. A
> "right" way to do this would for example be to first rename all devices
> to "tmpwhatever0", then rename them back to their proper ethX names.

Whats the difference between eth0_rename and tmpeth0? I don't see any. :)

> Now I don't know even a single bit about udev scripting, so maybe this
> isn't possible with the current udev infrastructure, but there's
> certainly no kernel part stopping a proper implementation.
>
> Btw, what happened to nameif(8)? It's totally outdated (and deprecated I
> guess?), but it got the job done...

It can't swap already taken device names on hotplug, and that job
might get complicated very easily. Also stuff needs to be renamed
before the device is announced to userspace, otherwise the interface
gets configured and can't be renamed anymore.

Kay

^ permalink raw reply

* [PATCH net-next] net: percpu net_device refcount
From: Eric Dumazet @ 2010-10-07 17:12 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

We tried very hard to remove all possible dev_hold()/dev_put() pairs in
network stack, using RCU conversions.

There is still an unavoidable device refcount change for every dst we
create/destroy, and this can slow down some workloads (routers or some
app servers)

We can switch to a percpu refcount implementation, now dynamic per_cpu
infrastructure is mature. On a 64 cpus machine, this consumes 256 bytes
per device.

On x86, dev_hold(dev) code :

before
	lock	incl 0x280(%ebx)
after:
	movl	0x260(%ebx),%eax
	incl	fs:(%eax)

Stress bench :

(Sending 160.000.000 UDP frames,
IP route cache disabled, dual E5540 @2.53GHz,
32bit kernel, FIB_TRIE)

Before:

real	1m1.662s
user	0m14.373s
sys	12m55.960s

After:

real	0m51.179s
user	0m15.329s
sys	10m15.942s

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/netdevice.h |    6 ++--
 net/core/dev.c            |   47 ++++++++++++++++++++++++++++++------
 2 files changed, 43 insertions(+), 10 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 6abcef6..fa1d88d 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1020,7 +1020,7 @@ struct net_device {
 	struct timer_list	watchdog_timer;
 
 	/* Number of references to this device */
-	atomic_t		refcnt ____cacheline_aligned_in_smp;
+	int __percpu		*pcpu_refcnt;
 
 	/* delayed register/unregister */
 	struct list_head	todo_list;
@@ -1792,7 +1792,7 @@ extern void netdev_run_todo(void);
  */
 static inline void dev_put(struct net_device *dev)
 {
-	atomic_dec(&dev->refcnt);
+	irqsafe_cpu_dec(*dev->pcpu_refcnt);
 }
 
 /**
@@ -1803,7 +1803,7 @@ static inline void dev_put(struct net_device *dev)
  */
 static inline void dev_hold(struct net_device *dev)
 {
-	atomic_inc(&dev->refcnt);
+	irqsafe_cpu_inc(*dev->pcpu_refcnt);
 }
 
 /* Carrier loss detection, dial on demand. The functions netif_carrier_on
diff --git a/net/core/dev.c b/net/core/dev.c
index 7d14955..2186e1e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5195,9 +5195,6 @@ int init_dummy_netdev(struct net_device *dev)
 	 */
 	dev->reg_state = NETREG_DUMMY;
 
-	/* initialize the ref count */
-	atomic_set(&dev->refcnt, 1);
-
 	/* NAPI wants this */
 	INIT_LIST_HEAD(&dev->napi_list);
 
@@ -5205,6 +5202,19 @@ int init_dummy_netdev(struct net_device *dev)
 	set_bit(__LINK_STATE_PRESENT, &dev->state);
 	set_bit(__LINK_STATE_START, &dev->state);
 
+#if 0
+	/* We dont allocate pcpu_refcnt for dummy devices,
+	 * because users of this 'device' dont need to change
+	 * its refcount
+	 */
+	dev->pcpu_refcnt = alloc_percpu(int);
+	if (!dev->pcpu_refcnt)
+		return -ENOMEM;
+
+	/* initialize the ref count */
+	dev_hold(dev);
+#endif
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(init_dummy_netdev);
@@ -5246,6 +5256,15 @@ out:
 }
 EXPORT_SYMBOL(register_netdev);
 
+static int netdev_refcnt_read(const struct net_device *dev)
+{
+	int i, refcnt = 0;
+
+	for_each_possible_cpu(i)
+		refcnt += *per_cpu_ptr(dev->pcpu_refcnt, i);
+	return refcnt;
+}
+
 /*
  * netdev_wait_allrefs - wait until all references are gone.
  *
@@ -5260,11 +5279,14 @@ EXPORT_SYMBOL(register_netdev);
 static void netdev_wait_allrefs(struct net_device *dev)
 {
 	unsigned long rebroadcast_time, warning_time;
+	int refcnt;
 
 	linkwatch_forget_dev(dev);
 
 	rebroadcast_time = warning_time = jiffies;
-	while (atomic_read(&dev->refcnt) != 0) {
+	refcnt = netdev_refcnt_read(dev);
+
+	while (refcnt != 0) {
 		if (time_after(jiffies, rebroadcast_time + 1 * HZ)) {
 			rtnl_lock();
 
@@ -5291,11 +5313,13 @@ static void netdev_wait_allrefs(struct net_device *dev)
 
 		msleep(250);
 
+		refcnt = netdev_refcnt_read(dev);
+
 		if (time_after(jiffies, warning_time + 10 * HZ)) {
 			printk(KERN_EMERG "unregister_netdevice: "
 			       "waiting for %s to become free. Usage "
 			       "count = %d\n",
-			       dev->name, atomic_read(&dev->refcnt));
+			       dev->name, refcnt);
 			warning_time = jiffies;
 		}
 	}
@@ -5353,7 +5377,7 @@ void netdev_run_todo(void)
 		netdev_wait_allrefs(dev);
 
 		/* paranoia */
-		BUG_ON(atomic_read(&dev->refcnt));
+		BUG_ON(netdev_refcnt_read(dev));
 		WARN_ON(rcu_dereference_raw(dev->ip_ptr));
 		WARN_ON(dev->ip6_ptr);
 		WARN_ON(dev->dn_ptr);
@@ -5523,9 +5547,13 @@ struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
 	dev = PTR_ALIGN(p, NETDEV_ALIGN);
 	dev->padded = (char *)dev - (char *)p;
 
-	if (dev_addr_init(dev))
+	dev->pcpu_refcnt = alloc_percpu(int);
+	if (!dev->pcpu_refcnt)
 		goto free_tx;
 
+	if (dev_addr_init(dev))
+		goto free_pcpu;
+
 	dev_mc_init(dev);
 	dev_uc_init(dev);
 
@@ -5556,6 +5584,8 @@ struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
 
 free_tx:
 	kfree(tx);
+free_pcpu:
+	free_percpu(dev->pcpu_refcnt);
 free_p:
 	kfree(p);
 	return NULL;
@@ -5589,6 +5619,9 @@ void free_netdev(struct net_device *dev)
 	list_for_each_entry_safe(p, n, &dev->napi_list, dev_list)
 		netif_napi_del(p);
 
+	free_percpu(dev->pcpu_refcnt);
+	dev->pcpu_refcnt = NULL;
+
 	/*  Compatibility with error handling in drivers */
 	if (dev->reg_state == NETREG_UNINITIALIZED) {
 		kfree((char *)dev - dev->padded);



^ permalink raw reply related

* Re: [PATCH V2] Use firmware provided index to register a network interface
From: Kay Sievers @ 2010-10-07 17:05 UTC (permalink / raw)
  To: Greg KH
  Cc: Matt Domsch, Narendra_K, netdev, linux-hotplug, linux-pci,
	Jordan_Hargrave, Vijay_Nijhawan, Charles_Rose
In-Reply-To: <20101007164835.GA27339@kroah.com>

On Thu, Oct 7, 2010 at 18:48, Greg KH <greg@kroah.com> wrote:
> On Thu, Oct 07, 2010 at 11:31:13AM -0500, Matt Domsch wrote:
>> 1) SMBIOS type 41 method.  Windows does not use this today, and I
>>    can't speak to their future plans.  Narendra's kernel patch does,
>>    as has biosdevname, the udev helper we first wrote for this
>>    purpose, for several years.
>
> Then stick with that udev helper please :)

What about just exporting this information in sysfs, and not touch the naming?

Anyway, I'm pretty sure all of this naming of onboard devices should
happen only at install time, or from a system management tool and not
at hotplug time.

We should not get confused by the way the (very simple)
automatic-rule-creater for persistent netdev naming in udev works.
This is really just a tool for the common case, and works fine for the
majority of people.

I'm not sure, if we should put all these special use cases in the
hotplug path. I mean it's not that people add and remove 4 port
network cards with special BIOS all the time, and expect proper naming
on the first bootup, right? The installer, or the system management
tool could just create/edit udev rules to provide proper device naming
on whatever property is available at a specific hardware, be it the
MAC address or some other persistent match?

Kay

^ permalink raw reply

* Re: [PATCH] sysctl: fix min/max handling in __do_proc_doulongvec_minmax()
From: Eric Dumazet @ 2010-10-07 16:59 UTC (permalink / raw)
  To: Eric W. Biederman, Andrew Morton
  Cc: Américo Wang, Robin Holt, linux-kernel, Willy Tarreau,
	David S. Miller, netdev, James Morris, Pekka Savola (ipv6),
	Patrick McHardy, Alexey Kuznetsov
In-Reply-To: <m1iq1e3qnn.fsf@fess.ebiederm.org>

Le jeudi 07 octobre 2010 à 09:37 -0700, Eric W. Biederman a écrit :

> The difference between long handling and int handling is a
> usability issue.  I don't expect we will be exporting new
> vectors via sysctl, so the conversion of a handful of vectors
> from int to long is where this is most likely to be used.
> 
> I skimmed through all of what I presume are the current users
> aka linux-2.6.36-rcX and there don't appear to be any users
> of proc_dounlongvec_minmax that use it's vector properties there.
> 
> Which doubly tells me that incrementing the min and max pointers
> is not what we want to do.
> 

Thats fine by me, thanks Eric.

Andrew, please remove previous patch from your tree and replace it by
following one :

[PATCH v2] sysctl: fix min/max handling in __do_proc_doulongvec_minmax()

When proc_doulongvec_minmax() is used with an array of longs,
and no min/max check requested (.extra1 or .extra2 being NULL), we
dereference a NULL pointer for the second element of the array.

Noticed while doing some changes in network stack for the "16TB problem"

Fix is to not change min & max pointers in
__do_proc_doulongvec_minmax(), so that all elements of the vector share
an unique min/max limit, like proc_dointvec_minmax().

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 kernel/sysctl.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index f88552c..8e45451 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2485,7 +2485,7 @@ static int __do_proc_doulongvec_minmax(void *data, struct ctl_table *table, int
 		kbuf[left] = 0;
 	}
 
-	for (; left && vleft--; i++, min++, max++, first=0) {
+	for (; left && vleft--; i++, first=0) {
 		unsigned long val;
 
 		if (write) {

^ permalink raw reply related

* Re: [PATCH V2] Use firmware provided index to register a network interface
From: David Lamparter @ 2010-10-07 16:49 UTC (permalink / raw)
  To: Matt Domsch
  Cc: Greg KH, Narendra_K, netdev, linux-hotplug, linux-pci,
	Jordan_Hargrave, Vijay_Nijhawan, Charles_Rose
In-Reply-To: <20101007163113.GA14260@auslistsprd01.us.dell.com>

Hi Matt,

On Thu, Oct 07, 2010 at 11:31:13AM -0500, Matt Domsch wrote:
> At most, they'll have to deal with the same "eth0_rename"

Getting that "eth0_rename" stuff is not something unchangeably imposed
by the kernel. I think this should be considered an udev bug most of all
things, since if the udev scripts did it right, this wouldn't happen. A
"right" way to do this would for example be to first rename all devices
to "tmpwhatever0", then rename them back to their proper ethX names.

Now I don't know even a single bit about udev scripting, so maybe this
isn't possible with the current udev infrastructure, but there's
certainly no kernel part stopping a proper implementation.

Btw, what happened to nameif(8)? It's totally outdated (and deprecated I
guess?), but it got the job done...

-David

^ permalink raw reply

* Re: [PATCH V2] Use firmware provided index to register a network interface
From: Greg KH @ 2010-10-07 16:48 UTC (permalink / raw)
  To: Matt Domsch
  Cc: Narendra_K, netdev, linux-hotplug, linux-pci, Jordan_Hargrave,
	Vijay_Nijhawan, Charles_Rose
In-Reply-To: <20101007163113.GA14260@auslistsprd01.us.dell.com>

On Thu, Oct 07, 2010 at 11:31:13AM -0500, Matt Domsch wrote:
> On Thu, Oct 07, 2010 at 08:11:34AM -0700, Greg KH wrote:
> > On Thu, Oct 07, 2010 at 07:23:35AM -0700, Narendra_K@Dell.com wrote:
> > > Hello,
> > > 
> > > V1 -> V2:
> > > 
> > > This patch addresses the scenario of buggy firmware/BIOS tables. The
> > > patch introduces a command line parameter 'no_netfwindex', passing which
> > > firmware provided index will not be used to derive 'eth' names. By
> > > default, firmware index will be used and the parameter can be used to
> > > work around buggy firmware/BIOS tables.
> > > 
> > > Please find the patch below.
> > > 
> > > From: Narendra K <narendra_k@dell.com>
> > > Subject: [PATCH] Use firmware provided index to register a network device
> > > 
> > > This patch uses the firmware provided index to derive the ethN name.
> > > If the firmware provides an index for the corresponding pdev, the N
> > > is derived from the index.
> > 
> > No, this has the very big chance to reorder existing network names and
> > should all be done in userspace with the firmware information exported
> > in sysfs, if the user so desires.
> 
> Existing names that use 70-persistent-net.rules won't change.

But users that don't use it would cause their names to change, and you
can't break that, no matter how naturally fragile those types of systems
are.  Sorry.

> The kernel patch has the advantage of not requiring users to change
> their scripts, by reserving the first N values for onboard devices as
> BIOS describes them.

So you feel that updating a kernel is easier than getting a user to
update their userspace scripts?  While I might also feel it is that way,
in reality, it isn't, sorry.

> Userspace udev rules require us to change user behavior, and likely
> their scripts, to use a new namespace such as ethlom1, in order to get
> deterministic naming.

Then use that, don't put this in the kernel please.

> > > It took some time to find out the details asked above. Right,
> > > windows does not use SMBIOS type 41 record to derive names. But as
> > > a datapoint, windows also has the same problem.
> 
> > If windows does not use this field, then you can guarantee it will
> > show up incorrectly in BIOSes and never be tested by manufacturers
> > before they ship their machines.
> 
> There are 2 methods of exporting the information that have gotten
> confusing here.
> 
> 1) SMBIOS type 41 method.  Windows does not use this today, and I
>    can't speak to their future plans.  Narendra's kernel patch does,
>    as has biosdevname, the udev helper we first wrote for this
>    purpose, for several years.

Then stick with that udev helper please :)

> 2) soon-to-be-released PCIe Firmware Spec, exporting label and index
>    as an ACPI _DSM().  It is anticipated that Windows will use this
>    information in some fashion at some point, per our conversations
>    with Microsoft as part of the PCI SIG proposal process.  We've held
>    off submitting a kernel patch until the spec is public.

Let's worry about that _when_ it is public, and when there is a Windows
version that supports it please.  Until then, there will not be support
for this in BIOSes in any usable way.

> Cases:
> 1) BIOS doesn't provide this information (the common case today for
>    all but Dell 10G and newer servers), nothing is reserved, and
>    therefore the existing 70-persistent-net.rules take effect to name
>    devices.  No change in behavior for existing systems.  New installs
>    will have NICs named non-deterministically, and recorded in
>    70-persistent-net.rules.  Users desiring consistent naming must
>    adjust 70-persistent-net.rules after install, and to avoid
>    eth0_rename collisions, must rename into another namespace.

That's fine and is how things work today, right?

> 2) BIOS provides this information correctly (Dell 10G and newer servers, or
>    anyone else implementing the spec).  The first N values for onboard
>    devices are reserved and assigned by the kernel to onboard
>    devices.

And you just broke people's machines that don't use udev for network
naming.  Sorry, that's not going to work.

>    70-persistent-net.rules may still try to rename the
>    ports, and may fail with eth0_rename collisions.

So you just broke their machines as well, when it was working, again,
not acceptable.

> 3) BIOS provides this information incorrectly (none known today).

Only because we have no way of testing for this :)

> What I can't find here is a solution where the user gets determistic
> naming, without being forced to change the namespace and adjust their
> scripts.  Can anyone?

No, and they shoudn't.

You should only have "deterministic" naming if you want it, and you set
it up to do so, from userspace.  The kernel is not responsible for this,
sorry.

If you want to do it in userspace, you have all the tools, and the data
exported from the kernel to do so.

> I'm not opposed to "do it all in userspace" - really, I'm not.  But
> the 'where' is only part of the problem.  No userspace solution has
> yet been proposed that doesn't require a namespace change, and I'm
> still hoping to avoid that.

Just use a new script for people who want this, they will have to know
they want it anyway, right?

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH] sysctl: fix min/max handling in __do_proc_doulongvec_minmax()
From: Eric W. Biederman @ 2010-10-07 16:37 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Américo Wang, Robin Holt, Andrew Morton, linux-kernel,
	Willy Tarreau, David S. Miller, netdev, James Morris,
	Pekka Savola (ipv6), Patrick McHardy, Alexey Kuznetsov
In-Reply-To: <1286445081.2912.15.camel@edumazet-laptop>

Eric Dumazet <eric.dumazet@gmail.com> writes:

> Le jeudi 07 octobre 2010 à 17:25 +0800, Américo Wang a écrit :
>> >>
>> >
>> >Here is the final one.
>> 
>> Oops, that one is not correct. Hopefully this one
>> is correct.
>> 
>> --------------->
>> 
>> Eric D. noticed that we may trigger an OOPS if we leave ->extra{1,2}
>> to NULL when we use proc_doulongvec_minmax().
>> 
>> Actually, we don't need to store min/max values in a vector,
>> because all the elements in the vector should share the same min/max
>> value, like what proc_dointvec_minmax() does.
>> 
>
> If we assert same min/max limits are to be applied to all elements,
> a much simpler fix than yours would be :
>
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index f88552c..8e45451 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -2485,7 +2485,7 @@ static int __do_proc_doulongvec_minmax(void *data, struct ctl_table *table, int
>  		kbuf[left] = 0;
>  	}
>  
> -	for (; left && vleft--; i++, min++, max++, first=0) {
> +	for (; left && vleft--; i++, first=0) {
>  		unsigned long val;
>  
>  		if (write) {
>
>
> Please dont send huge patches like this to 'fix' a bug,
> especially on slow path.
>
> First we fix the bug, _then_ we can try to make code more 
> efficient or more pretty or shorter.
>
> So the _real_ question is :
>
> Should the min/max limits should be a single pair,
> shared by all elements, or a vector of limits.

The difference between long handling and int handling is a
usability issue.  I don't expect we will be exporting new
vectors via sysctl, so the conversion of a handful of vectors
from int to long is where this is most likely to be used.

I skimmed through all of what I presume are the current users
aka linux-2.6.36-rcX and there don't appear to be any users
of proc_dounlongvec_minmax that use it's vector properties there.

Which doubly tells me that incrementing the min and max pointers
is not what we want to do.

Eric

^ permalink raw reply

* Re: [PATCH V2] Use firmware provided index to register a network interface
From: Matt Domsch @ 2010-10-07 16:31 UTC (permalink / raw)
  To: Greg KH
  Cc: Narendra_K, netdev, linux-hotplug, linux-pci, Jordan_Hargrave,
	Vijay_Nijhawan, Charles_Rose
In-Reply-To: <20101007151134.GA25713@kroah.com>

On Thu, Oct 07, 2010 at 08:11:34AM -0700, Greg KH wrote:
> On Thu, Oct 07, 2010 at 07:23:35AM -0700, Narendra_K@Dell.com wrote:
> > Hello,
> > 
> > V1 -> V2:
> > 
> > This patch addresses the scenario of buggy firmware/BIOS tables. The
> > patch introduces a command line parameter 'no_netfwindex', passing which
> > firmware provided index will not be used to derive 'eth' names. By
> > default, firmware index will be used and the parameter can be used to
> > work around buggy firmware/BIOS tables.
> > 
> > Please find the patch below.
> > 
> > From: Narendra K <narendra_k@dell.com>
> > Subject: [PATCH] Use firmware provided index to register a network device
> > 
> > This patch uses the firmware provided index to derive the ethN name.
> > If the firmware provides an index for the corresponding pdev, the N
> > is derived from the index.
> 
> No, this has the very big chance to reorder existing network names and
> should all be done in userspace with the firmware information exported
> in sysfs, if the user so desires.

Existing names that use 70-persistent-net.rules won't change.  They
were non-deterministically assigned and recorded in that file, so
they'll persist.  At most, they'll have to deal with the same
"eth0_rename" garbage that anyone wanting to get a consistent name
from other than MAC address would get.  The only reason the
auto-created 70-persistent-net.rules works is that most people don't
change their NIC configuration after install, so it has no renaming to do.

The kernel patch has the advantage of not requiring users to change
their scripts, by reserving the first N values for onboard devices as
BIOS describes them.

Userspace udev rules require us to change user behavior, and likely
their scripts, to use a new namespace such as ethlom1, in order to get
deterministic naming.

> > It took some time to find out the details asked above. Right,
> > windows does not use SMBIOS type 41 record to derive names. But as
> > a datapoint, windows also has the same problem.

> If windows does not use this field, then you can guarantee it will
> show up incorrectly in BIOSes and never be tested by manufacturers
> before they ship their machines.

There are 2 methods of exporting the information that have gotten
confusing here.

1) SMBIOS type 41 method.  Windows does not use this today, and I
   can't speak to their future plans.  Narendra's kernel patch does,
   as has biosdevname, the udev helper we first wrote for this purpose, for several years.

2) soon-to-be-released PCIe Firmware Spec, exporting label and index
   as an ACPI _DSM().  It is anticipated that Windows will use this
   information in some fashion at some point, per our conversations
   with Microsoft as part of the PCI SIG proposal process.  We've held
   off submitting a kernel patch until the spec is public.

Cases:
1) BIOS doesn't provide this information (the common case today for
   all but Dell 10G and newer servers), nothing is reserved, and
   therefore the existing 70-persistent-net.rules take effect to name
   devices.  No change in behavior for existing systems.  New installs
   will have NICs named non-deterministically, and recorded in
   70-persistent-net.rules.  Users desiring consistent naming must
   adjust 70-persistent-net.rules after install, and to avoid
   eth0_rename collisions, must rename into another namespace.

2) BIOS provides this information correctly (Dell 10G and newer servers, or
   anyone else implementing the spec).  The first N values for onboard
   devices are reserved and assigned by the kernel to onboard
   devices.  70-persistent-net.rules may still try to rename the
   ports, and may fail with eth0_rename collisions.  Users would have
   to adjust 70-persistent-net.rules to accommodate, but they do not
   have to change the namespace from ethX to something else. New
   installs will have onboard NICs named deterministically, and recorded in
   70-persistent-net.rules.

3) BIOS provides this information incorrectly (none known today).
   Some number of N values may be reserved, more or fewer than
   actually present.  In either case, it's possible that
   70-persistent-net.rules would have to be adjusted to accommodate,
   but they do not have to change the namespace from ethX to something
   else. New installs will have onboard NICs named deterministically,
   if not ideally due to buggy BIOS, and recorded in
   70-persistent-net.rules.

What I can't find here is a solution where the user gets determistic
naming, without being forced to change the namespace and adjust their
scripts.  Can anyone?  I'm not opposed to "do it all in userspace" -
really, I'm not.  But the 'where' is only part of the problem.  No
userspace solution has yet been proposed that doesn't require a
namespace change, and I'm still hoping to avoid that.

Thanks,
Matt

-- 
Matt Domsch
Technology Strategist
Dell | Office of the CTO

^ permalink raw reply

* Re: [PATCH V2] Use firmware provided index to register a network interface
From: Stephen Hemminger @ 2010-10-07 16:14 UTC (permalink / raw)
  To: Narendra_K
  Cc: netdev, linux-hotplug, linux-pci, Matt_Domsch, Jordan_Hargrave,
	Vijay_Nijhawan, Charles_Rose
In-Reply-To: <20101007142319.GB2641@libnet-test.oslab.blr.amer.dell.com>

On Thu, 7 Oct 2010 07:23:35 -0700
<Narendra_K@Dell.com> wrote:

> Hello,
> 
> V1 -> V2:
> 
> This patch addresses the scenario of buggy firmware/BIOS tables. The
> patch introduces a command line parameter 'no_netfwindex', passing which
> firmware provided index will not be used to derive 'eth' names. By
> default, firmware index will be used and the parameter can be used to
> work around buggy firmware/BIOS tables.
> 
> Please find the patch below.
> 
> From: Narendra K <narendra_k@dell.com>
> Subject: [PATCH] Use firmware provided index to register a network device
> 
> This patch uses the firmware provided index to derive the ethN name.
> If the firmware provides an index for the corresponding pdev, the N
> is derived from the index.
> 
> As an example, consider a PowerEdge R710 which has 4 BCM5709
> Lan-On-Motherboard ports,1 Intel 82572EI port and 4 82575GB ports.
> The system firmware communicates the order of the 4 Lan-On-Motherboard
> ports by assigning indexes to each one of them. This is available to
> the OS as the SMBIOS type 41 record(for onboard devices), in the field
> 'device type index'. It looks like below -

I agree with Greg. Provide the firmware index in sysfs for use
in udev userspace and let udev do the policy on the naming.
The firmware index could be missing, wrong, or interesting on only
some devices.

^ permalink raw reply

* Re: [PATCH V2] Use firmware provided index to register a network interface
From: Greg KH @ 2010-10-07 15:11 UTC (permalink / raw)
  To: Narendra_K
  Cc: netdev, linux-hotplug, linux-pci, Matt_Domsch, Jordan_Hargrave,
	Vijay_Nijhawan, Charles_Rose
In-Reply-To: <20101007142319.GB2641@libnet-test.oslab.blr.amer.dell.com>

On Thu, Oct 07, 2010 at 07:23:35AM -0700, Narendra_K@Dell.com wrote:
> Hello,
> 
> V1 -> V2:
> 
> This patch addresses the scenario of buggy firmware/BIOS tables. The
> patch introduces a command line parameter 'no_netfwindex', passing which
> firmware provided index will not be used to derive 'eth' names. By
> default, firmware index will be used and the parameter can be used to
> work around buggy firmware/BIOS tables.
> 
> Please find the patch below.
> 
> From: Narendra K <narendra_k@dell.com>
> Subject: [PATCH] Use firmware provided index to register a network device
> 
> This patch uses the firmware provided index to derive the ethN name.
> If the firmware provides an index for the corresponding pdev, the N
> is derived from the index.

No, this has the very big chance to reorder existing network names and
should all be done in userspace with the firmware information exported
in sysfs, if the user so desires.

So consider this patch rejected from me.

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH] SIW: iWARP Protocol headers
From: Bernard Metzler @ 2010-10-07 14:55 UTC (permalink / raw)
  To: David Dillow
  Cc: Bart Van Assche, Jason Gunthorpe,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1286393504.26136.31.camel-FqX9LgGZnHWDB2HL1qBt2PIbXMQ5te18@public.gmane.org>



David Dillow <dave-i1Mk8JYDVaaSihdK6806/g@public.gmane.org> wrote on 10/06/2010 09:31:44 PM:

> On Wed, 2010-10-06 at 12:37 -0600, Jason Gunthorpe wrote:
> > On Wed, Oct 06, 2010 at 02:22:55PM -0400, David Dillow wrote:
> > > On Wed, 2010-10-06 at 11:25 -0600, Jason Gunthorpe wrote:
> > > > On Wed, Oct 06, 2010 at 02:52:46PM +0200, Bernard Metzler wrote:
> > > > It is actually a little more complicated than just this. I assume
you
> > > > are casting the structures over packet payloads? In this case you
> > > > have to guarentee alignment (or used packed everywhere). Does iwarp
> > > > have provisions for alignment? If so you can construct your
bitfields
> > > > using the alignment type, ie if iWarp guarantees 4 bytes then the
> > > > biggest type you can use is u32 - then you can avoid using packed.
> > > >
> > > > Mind you, I'm not sure how to guarentee alignment when you consider
> > > > all the possible sources of unalignment in the stack: TCP, IP,L2
stack ?
> > >
> > > You don't have to. The TCP stack, IIRC, requires the architecture to
fix
> > > up misaligned accesses, or at least it used to. It will try to keep
> > > things aligned if possible -- see the comment above NET_IP_ALIGN in
> > > include/linux/skbuff.h.
> >
> > I was under the impression that this was just to align the IP
> > header.
>
> It is, and it assumes an ethernet header of 14 bytes. As you note, VLANs
> and everything else can mess with that, which is why the TCP stack
> requires the architecture to handle unaligned accesses. It's a bonus if
> iWarp can guarantee an alignment WRT the IP or TCP header, but not
> required.
>
> > > The structures in Bernard's header have all fields naturally
> > > aligned, so no need for the packed attribute and its baggage.
> >
> > No, they arent:
> >
> > > +struct iwarp_rdma_write {
> > > +   struct iwarp_ctrl   ctrl;
> > > +   __u32         sink_stag;
> > > +   __u64         sink_to;
> > > +} __attribute__((__packed__));
> >
> > For instance has sink_to unaligned if the maximum aligment of the
> > payload is only guarenteed to be 4. The proper thing to do in this
> > case is probably __attribute__((__packed__, __aligned__(4))) if that
> > works the way I think.. :)
>
> struct iwarp_ctrl consists of 2 u16s, so sink_stag is aligned to 4 bytes
> from the start of the struct iwarp_rdma_write, and sink_to is at 8 bytes
> -- so they are naturally aligned with respect to the structure. The
> compiler will not add padding, so no need for the packed attribute.
>
> Note I'm not talking about alignment in memory -- that'll depend on
> where the data starts, and as I mentioned, the stack doesn't make any
> guarantees about that, relying on the architecture to fixup any
> misaligned accesses.

agreed. all n-bytes are starting on a multipe of n-byte offset
relative to the start of the structure.

many thanks,
bernard.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] Use firmware provided index to register a network interface
From: Greg KH @ 2010-10-07 14:31 UTC (permalink / raw)
  To: Narendra_K
  Cc: netdev, linux-hotplug, linux-pci, Matt_Domsch, Charles_Rose,
	Jordan_Hargrave, Vijay_Nijhawan
In-Reply-To: <20101007141433.GA2641@libnet-test.oslab.blr.amer.dell.com>

On Thu, Oct 07, 2010 at 07:44:57PM +0530, Narendra_K@Dell.com wrote:
> On Thu, Sep 23, 2010 at 10:03:36PM +0530, Greg KH wrote:
> >    On Thu, Sep 23, 2010 at 09:20:57PM +0530, Narendra_K@Dell.com wrote:
> >    > > Now trying to change the kernel namespace itself seems like a bad hack
> >    > > around this fact.
> >    > >
> >    >
> >    > The patch does not change the existing kernel name space.
> > 
> >    You are "reordering it", right?
> > 
> >    > It adheres to the existing ethN name space with IFNAMSIZ length
> >    > requirements. The patch only makes sure that eth0 always corresponds
> >    > to what is labeled as 'Gb1' on server chassis, on systems where SMBIOS
> >    > type 41 record is available. And removes the need for any renames for
> >    > the interfaces which have a firmware index assigned by system
> >    > firmware, as the very first name assigned by the kernel will be as
> >    > expected and deterministic.
> > 
> >    And on systems with buggy firmware, what is going to happen here?  And
> >    yes, there will be buggy BIOS tables, we can guarantee that, as this is
> >    not something that Windows supports, right?
> > 
> 
> It took some time to find out the details asked above. Right, windows
> does not use SMBIOS type 41 record to derive names. But as a datapoint,
> windows also has the same problem.

If windows does not use this field, then you can guarantee it will show
up incorrectly in BIOSes and never be tested by manufacturers before
they ship their machines.

> Yes, firmware and BIOS tables can be buggy. How about a command line
> parameter 'no_netfwindex', passing which firmware index will not be
> used to derive ethN names ? That would handle the scenario of buggy
> firmware and names will be derived in the now existing way.

I'd prefer the option never be added in the first place as you are
placing a big burden on people whose machines were working properly on
old kernels, to suddenly have to add a command line option to get their
box to now work again.

And again, as this is a "simple" rename type operation, I still don't
see why you can't just do this from a udev rule, especially if the
firmware information is exported to userspace.  That is where such
"policy" should be added, not within the kernel itself.

thanks,

greg k-h

^ permalink raw reply

* [PATCH V2] Use firmware provided index to register a network interface
From: Narendra_K @ 2010-10-07 14:23 UTC (permalink / raw)
  To: netdev, linux-hotplug, linux-pci
  Cc: Matt_Domsch, Jordan_Hargrave, Vijay_Nijhawan, Charles_Rose

Hello,

V1 -> V2:

This patch addresses the scenario of buggy firmware/BIOS tables. The
patch introduces a command line parameter 'no_netfwindex', passing which
firmware provided index will not be used to derive 'eth' names. By
default, firmware index will be used and the parameter can be used to
work around buggy firmware/BIOS tables.

Please find the patch below.

From: Narendra K <narendra_k@dell.com>
Subject: [PATCH] Use firmware provided index to register a network device

This patch uses the firmware provided index to derive the ethN name.
If the firmware provides an index for the corresponding pdev, the N
is derived from the index.

As an example, consider a PowerEdge R710 which has 4 BCM5709
Lan-On-Motherboard ports,1 Intel 82572EI port and 4 82575GB ports.
The system firmware communicates the order of the 4 Lan-On-Motherboard
ports by assigning indexes to each one of them. This is available to
the OS as the SMBIOS type 41 record(for onboard devices), in the field
'device type index'. It looks like below -

Handle 0x2900, DMI type 41, 11 bytes
Onboard Device
	Reference Designation: Embedded NIC 1
	Type: Ethernet
	Status: Enabled
	Type Instance: 1
	Bus Address: 0000:01:00.0

Handle 0x2901, DMI type 41, 11 bytes
Onboard Device
	Reference Designation: Embedded NIC 2
	Type: Ethernet
	Status: Enabled
	Type Instance: 2
	Bus Address: 0000:01:00.1

Handle 0x2902, DMI type 41, 11 bytes
Onboard Device
	Reference Designation: Embedded NIC 3
	Type: Ethernet
	Status: Enabled
	Type Instance: 3
	Bus Address: 0000:02:00.0

Handle 0x2903, DMI type 41, 11 bytes
Onboard Device
	Reference Designation: Embedded NIC 4
	Type: Ethernet
	Status: Enabled
	Type Instance: 4
	Bus Address: 0000:02:00.1

The OS can use this index to name the network interfaces as below.

Onboard devices -

Interface		Fwindex		Driver
Name
eth[fwindex - 1] =eth0	1		bnx2
eth[fwindex - 1] =eth1	2		bnx2
eth[fwindex - 1] =eth2	3		bnx2
eth[fwindex - 1] =eth3	4		bnx2

The add-in devices do not get any index and they will get names from
eth4 onwards.

Add-in interfaces -

eth4			e1000e
eth5			igb
eth6			igb
eth7			igb
eth8			igb

With this patch,

1. This patch adheres to the established ABI of ethN namespace with
IFNAMSIZ length and ensures that onboard network interfaces get
expected names at the first instance itself and avoids any renaming
later.

2. The 'eth0' of the OS always corresponds to the 'Gb1' as labeled on
the system chassis. There is determinism in the way Lan-On-Motherboard
ports get named.

3. The add-in devices will always be named from beyond what the
Lan-On-Motherboard names as show above. But there is no determinism
as to which add-in interface gets what ethN name.

Passing 'no_netfwindex' command line parameter would result in
firmware index not being used to derive the names as described above.

Signed-off-by: Narendra K <narendra_k@dell.com>
---
 Documentation/kernel-parameters.txt |    6 +++++
 drivers/pci/pci-label.c             |    1 +
 drivers/pci/pci-sysfs.c             |    5 ++++
 include/linux/netdevice.h           |    2 +
 include/linux/pci.h                 |    1 +
 net/core/dev.c                      |   42 ++++++++++++++++++++++++++++++-----
 6 files changed, 51 insertions(+), 6 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 3cdb4d8..73edbc0 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1529,6 +1529,12 @@ and is between 256 and 4096 characters. It is defined in the file
 			This usage is only documented in each driver source
 			file if at all.
 
+	no_netfwindex	[NET] Do not use firmware index to derive ethN names
+			Names for onboard network interfaces are derived from
+			the firmware provided index for these devices. Using
+			this parameter would result in firmware index not being
+			used to derive ethN names.
+
 	nf_conntrack.acct=
 			[NETFILTER] Enable connection tracking flow accounting
 			0 to disable accounting
diff --git a/drivers/pci/pci-label.c b/drivers/pci/pci-label.c
index 90c0a72..8086268 100644
--- a/drivers/pci/pci-label.c
+++ b/drivers/pci/pci-label.c
@@ -55,6 +55,7 @@ find_smbios_instance_string(struct pci_dev *pdev, char *buf,
 							 "%s\n",
 							 dmi->name);
 			}
+			pdev->firmware_index = donboard->instance;
 			return strlen(dmi->name);
 		}
 	}
diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index b5a7d9b..448ed9d 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -28,6 +28,7 @@
 #include "pci.h"
 
 static int sysfs_initialized;	/* = 0 */
+int pci_netdevs_with_fwindex;
 
 /* show configuration fields */
 #define pci_config_attr(field, format_string)				\
@@ -1167,6 +1168,10 @@ int __must_check pci_create_sysfs_dev_files (struct pci_dev *pdev)
 
 	pci_create_firmware_label_files(pdev);
 
+	if (pdev->firmware_index && (pdev->class >> 16) ==
+						PCI_BASE_CLASS_NETWORK)
+		pci_netdevs_with_fwindex++;
+
 	return 0;
 
 err_vga_file:
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 46c36ff..4398dcf 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1080,6 +1080,8 @@ struct net_device {
 
 #define	NETDEV_ALIGN		32
 
+extern int pci_netdevs_with_fwindex;
+
 static inline
 struct netdev_queue *netdev_get_tx_queue(const struct net_device *dev,
 					 unsigned int index)
diff --git a/include/linux/pci.h b/include/linux/pci.h
index b1d1795..90113bb 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -243,6 +243,7 @@ struct pci_dev {
 	unsigned short	subsystem_vendor;
 	unsigned short	subsystem_device;
 	unsigned int	class;		/* 3 bytes: (base,sub,prog-if) */
+	unsigned int	firmware_index; /* Firmware provided index */
 	u8		revision;	/* PCI revision, low byte of class word */
 	u8		hdr_type;	/* PCI header type (`multi' flag masked out) */
 	u8		pcie_cap;	/* PCI-E capability offset */
diff --git a/net/core/dev.c b/net/core/dev.c
index 1ae6543..f7982c4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -853,9 +853,18 @@ int dev_valid_name(const char *name)
 }
 EXPORT_SYMBOL(dev_valid_name);
 
+int netdev_use_fwindex = 1;
+
+static int __init netdev_use_fwindex_to_register(char *str)
+{
+	netdev_use_fwindex = 0;
+	return 0;
+}
+early_param("no_netfwindex", netdev_use_fwindex_to_register);
+
 /**
  *	__dev_alloc_name - allocate a name for a device
- *	@net: network namespace to allocate the device name in
+ *	@dev:  device
  *	@name: name format string
  *	@buf:  scratch buffer and result name string
  *
@@ -868,13 +877,15 @@ EXPORT_SYMBOL(dev_valid_name);
  *	Returns the number of the unit assigned or a negative errno code.
  */
 
-static int __dev_alloc_name(struct net *net, const char *name, char *buf)
+static int __dev_alloc_name(struct net_device *dev, const char *name, char *buf)
 {
 	int i = 0;
 	const char *p;
 	const int max_netdevices = 8*PAGE_SIZE;
 	unsigned long *inuse;
 	struct net_device *d;
+	struct net *net;
+	struct pci_dev *pdev;
 
 	p = strnchr(name, IFNAMSIZ-1, '%');
 	if (p) {
@@ -886,15 +897,36 @@ static int __dev_alloc_name(struct net *net, const char *name, char *buf)
 		if (p[1] != 'd' || strchr(p + 2, '%'))
 			return -EINVAL;
 
+		if (likely(netdev_use_fwindex)) {
+			pdev = to_pci_dev(dev->dev.parent);
+			if (pdev && pdev->firmware_index) {
+				snprintf(buf, IFNAMSIZ, name,
+					 pdev->firmware_index - 1);
+				return pdev->firmware_index - 1;
+			}
+		}
+
 		/* Use one page as a bit array of possible slots */
 		inuse = (unsigned long *) get_zeroed_page(GFP_ATOMIC);
 		if (!inuse)
 			return -ENOMEM;
 
+		/* Reserve 0 to < pci_netdevs_with_fwindex for integrated
+		 * ports with fwindex and allocate from pci_netdevs_with_fwindex
+		 * onwards for add-in devices
+		 */
+		if (likely(netdev_use_fwindex)) {
+			for (i = 0; i < pci_netdevs_with_fwindex; i++)
+				set_bit(i, inuse);
+		} else
+			pci_netdevs_with_fwindex = 0;
+
+		net = dev_net(dev);
+
 		for_each_netdev(net, d) {
 			if (!sscanf(d->name, name, &i))
 				continue;
-			if (i < 0 || i >= max_netdevices)
+			if (i < pci_netdevs_with_fwindex || i >= max_netdevices)
 				continue;
 
 			/*  avoid cases where sscanf is not exact inverse of printf */
@@ -936,12 +968,10 @@ static int __dev_alloc_name(struct net *net, const char *name, char *buf)
 int dev_alloc_name(struct net_device *dev, const char *name)
 {
 	char buf[IFNAMSIZ];
-	struct net *net;
 	int ret;
 
 	BUG_ON(!dev_net(dev));
-	net = dev_net(dev);
-	ret = __dev_alloc_name(net, name, buf);
+	ret = __dev_alloc_name(dev, name, buf);
 	if (ret >= 0)
 		strlcpy(dev->name, buf, IFNAMSIZ);
 	return ret;
-- 
1.7.0.1

-- 
With regards,
Narendra K

^ permalink raw reply related

* Re: [PATCH] Use firmware provided index to register a network interface
From: Tim Small @ 2010-10-07 14:27 UTC (permalink / raw)
  To: Narendra_K
  Cc: greg, netdev, linux-hotplug, linux-pci, Matt_Domsch, Charles_Rose,
	Jordan_Hargrave, Vijay_Nijhawan
In-Reply-To: <20101007141433.GA2641@libnet-test.oslab.blr.amer.dell.com>

On 07/10/10 15:14, Narendra_K@Dell.com wrote:
> Yes, firmware and BIOS tables can be buggy. How about a command line
> parameter 'no_netfwindex', passing which firmware index will not be
> used to derive ethN names ? That would handle the scenario of buggy
> firmware and names will be derived in the now existing way.
>
> I will submit a patch shortly implementing this.
>    

What was the reason for not doing this in user space again?  You stated 
that you got races when doing renames like eth0 -> eth2, but the 
solution looked like renaming into a different name space such as eth0 
-> ethlom2  etc. so that it wouldn't race with the names handed out by 
the kernel?

Tim.

^ permalink raw reply

* Re: [PATCH] Use firmware provided index to register a network interface
From: Narendra_K @ 2010-10-07 14:14 UTC (permalink / raw)
  To: greg
  Cc: netdev, linux-hotplug, linux-pci, Matt_Domsch, Charles_Rose,
	Jordan_Hargrave, Vijay_Nijhawan
In-Reply-To: <20100923163336.GA15603@kroah.com>

On Thu, Sep 23, 2010 at 10:03:36PM +0530, Greg KH wrote:
>    On Thu, Sep 23, 2010 at 09:20:57PM +0530, Narendra_K@Dell.com wrote:
>    > > Now trying to change the kernel namespace itself seems like a bad hack
>    > > around this fact.
>    > >
>    >
>    > The patch does not change the existing kernel name space.
> 
>    You are "reordering it", right?
> 
>    > It adheres to the existing ethN name space with IFNAMSIZ length
>    > requirements. The patch only makes sure that eth0 always corresponds
>    > to what is labeled as 'Gb1' on server chassis, on systems where SMBIOS
>    > type 41 record is available. And removes the need for any renames for
>    > the interfaces which have a firmware index assigned by system
>    > firmware, as the very first name assigned by the kernel will be as
>    > expected and deterministic.
> 
>    And on systems with buggy firmware, what is going to happen here?  And
>    yes, there will be buggy BIOS tables, we can guarantee that, as this is
>    not something that Windows supports, right?
> 

It took some time to find out the details asked above. Right, windows
does not use SMBIOS type 41 record to derive names. But as a datapoint,
windows also has the same problem.

Yes, firmware and BIOS tables can be buggy. How about a command line
parameter 'no_netfwindex', passing which firmware index will not be
used to derive ethN names ? That would handle the scenario of buggy
firmware and names will be derived in the now existing way.

I will submit a patch shortly implementing this.

-- 
With regards,
Narendra K

^ permalink raw reply

* Re: [RFC] net: allow FEC driver to not have attached PHY
From: Simon Farnsworth @ 2010-10-07 13:28 UTC (permalink / raw)
  To: netdev
In-Reply-To: <201010070350.o973oGFE026910@goober.internal.moreton.com.au>

Greg Ungerer wrote:

> 
> Hi All,
> 
> I have a board with a ColdFire SoC on it with the built-in FEC
> ethernet module. On this hardware the FEC eth output is directly
> attached to a RTL8305 4-port 10/100 switch. There is no conventional
> PHY, the FEC output is direct into the uplink port of the switch
> chip.
> 
> This setup doesn't work after the FEC code was switch to using
> phylib. The driver used to have code to bypass phy detection/setup
> for this particular board. The phylib probe finds nothing, and of
> course sets a no-link condition.
> 
> So, what is the cleanest way to support this?
> 
Is there anything that stops you using the fixed MDIO support in phylib 
(CONFIG_FIXED_PHY)? It seems to me to be intended for this sort of 
situation.

-- 
Simon


^ permalink raw reply

* Re: [RFC] net: allow FEC driver to not have attached PHY
From: Florian Fainelli @ 2010-10-07 13:24 UTC (permalink / raw)
  To: Greg Ungerer; +Cc: netdev, gerg
In-Reply-To: <201010070350.o973oGFE026910@goober.internal.moreton.com.au>

Hi,

On Thursday 07 October 2010 05:50:16 Greg Ungerer wrote:
> Hi All,
> 
> I have a board with a ColdFire SoC on it with the built-in FEC
> ethernet module. On this hardware the FEC eth output is directly
> attached to a RTL8305 4-port 10/100 switch. There is no conventional
> PHY, the FEC output is direct into the uplink port of the switch
> chip.
> 
> This setup doesn't work after the FEC code was switch to using
> phylib. The driver used to have code to bypass phy detection/setup
> for this particular board. The phylib probe finds nothing, and of
> course sets a no-link condition.
> 
> So, what is the cleanest way to support this?

If phy detection fails and you cannot attach to a known PHY driver, you could 
register a fixed-PHY driver wich will report the link to be up. I had to do 
something like this for the cpmac driver[1] where various switches and 
external PHY configurations are supported.

[1]: 
https://dev.openwrt.org/browser/trunk/target/linux/ar7/patches-2.6.35/972-
cpmac_multi_probe.patch

> 
> The attached patch adds a config option to do this sort of generically
> for the FEC driver. But I am wondering if there isn't a better way?
> 
> Regards
> Greg
> 
> 
> ---
> 
> [RFC] net: allow FEC driver to not have attached PHY
> 
> At least one board using the FEC driver does not have a conventional
> PHY attached to it, it is directly connected to a somewhat simple
> ethernet switch (the board is the SnapGear/LITE, and the attached
> 4-port ethernet switch is a RealTek RTL8305). This switch does not
> present the usual register interface of a PHY, it presents nothing.
> So a PHY scan will find nothing.
> 
> After the FEC driver was changed to use phylib for supporting phys
> it no longer works on this particular board/switch setup.
> 
> Add a config option to allow configuring the FEC driver to not expect
> a PHY to be present.
> 
> Signed-off-by: Greg Ungerer <gerg@uclinux.org>
> ---
>  drivers/net/Kconfig |    9 +++++++++
>  drivers/net/fec.c   |    7 +++++++
>  2 files changed, 16 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
> index 93494e2..ee44728 100644
> --- a/drivers/net/Kconfig
> +++ b/drivers/net/Kconfig
> @@ -1934,6 +1934,15 @@ config FEC2
>  	  Say Y here if you want to use the second built-in 10/100 Fast
>  	  ethernet controller on some Motorola ColdFire processors.
> 
> +config FEC_NOPHY
> +	bool "FEC has no attached PHY"
> +	depends on FEC
> +	help
> +	  Some boards using the FEC driver may not have a PHY directly
> +	  attached to it. Typically in this scenario the FEC output is
> +	  directly connected to the input of an ethernet switch or hub.
> +	  Say Y here if your hardware is like this.
> +
>  config FEC_MPC52xx
>  	tristate "MPC52xx FEC driver"
>  	depends on PPC_MPC52xx && PPC_BESTCOMM
> diff --git a/drivers/net/fec.c b/drivers/net/fec.c
> index 768b840..3637f89 100644
> --- a/drivers/net/fec.c
> +++ b/drivers/net/fec.c
> @@ -910,6 +910,11 @@ fec_enet_open(struct net_device *dev)
>  	if (ret)
>  		return ret;
> 
> +#ifdef CONFIG_FEC_NOPHY
> +	/* No PHY connected, assume link is always up */
> +	fep->link = 1;
> +	fec_restart(dev, 0);
> +#else
>  	/* Probe and connect to PHY when open the interface */
>  	ret = fec_enet_mii_probe(dev);
>  	if (ret) {
> @@ -917,6 +922,8 @@ fec_enet_open(struct net_device *dev)
>  		return ret;
>  	}
>  	phy_start(fep->phy_dev);
> +#endif
> +
>  	netif_start_queue(dev);
>  	fep->opened = 1;
>  	return 0;
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox