Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCHv4 net-next-2.6 1/5] XFRM,IPv6: Remove xfrm_spi_hash() dependency on destination address
From: Arnaud Ebalard @ 2010-10-07 20:13 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David S. Miller, Eric Dumazet, Hideaki YOSHIFUJI, netdev
In-Reply-To: <20101005041707.GA26458@gondor.apana.org.au>

Hi,

Herbert Xu <herbert@gondor.apana.org.au> writes:

> On Tue, Oct 05, 2010 at 10:11:14AM +0800, Herbert Xu wrote:
>>
>> I'm thinking about the case where each remote end (or one remote
>> end with many IP addresses) chooses to use a single SPI which then
>> all gets hashed to the same bucket.
>> 
>> Outbound SAs are hashed into the same SPI hash table as inbound SAs.
>
> Another solution would be to create a hash table for inbound SAs
> only.  Unfortunately I don't think we have anything in our current
> user-interface to indicate whether an SA is inbound or outbound.
>
> So to do this you'll need to use a heuristic such as doing a
> lookup on the source/destination address at insertion time.

I spent some time scratching my head trying to find a good solution. It
would indeed be perfect to have a specific hash table for inbound
SA. But as you point, this would only be via a heuristic at insertion
time and there are various cases which would not work: a SA can be
installed w/o any of the addresses being configured on the system.

I think I will try the following alternative approach based on your
comments and proposals:

 - drop my patch to change spi hash computation
 - handle destination address remapping during input upon failure of
   xfrm_state_lookup()
 - handle source address remapping as it is currently done in the patch,
   i.e. by comparing received one against x->props.saddr once the
   state found and do 

To support the destination address remapping, I will have to reverse the
logic I currently have for destination remapping states, to allow the
lookup to be done based on the on-wire address (CoA) instead of the
address in the SA (HoA). If a remapping state is found for the on-wire
address, then a new lookup is done using the associated HoA this time.

I think it would make the feature easier less intrusive for the IPsec
stack.

Thanks for your support and patience, Herbert.

a+

^ permalink raw reply

* [PATCH net-next-2.6] net: Fix rxq ref counting
From: Tom Herbert @ 2010-10-07 20:09 UTC (permalink / raw)
  To: davem, netdev

The rx->count reference is used to track reference counts to the
number of rx-queue kobjects created for the device.  This patch
eliminates initialization of the counter in netif_alloc_rx_queues
and instead increments the counter each time a kobject is created.
This is now symmetric with the decrement that is done when an object is
released.

Signed-off-by: Tom Herbert <therbert@google.com>
---
diff --git a/net/core/dev.c b/net/core/dev.c
index 7d14955..58b31d1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5026,7 +5026,6 @@ static int netif_alloc_rx_queues(struct net_device *dev)
 			return -ENOMEM;
 		}
 		dev->_rx = rx;
-		atomic_set(&rx->count, count);
 
 		/*
 		 * Set a pointer to first element in the array which holds the
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index fa81fd0..b143173 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -726,6 +726,7 @@ static struct kobj_type rx_queue_ktype = {
 static int rx_queue_add_kobject(struct net_device *net, int index)
 {
 	struct netdev_rx_queue *queue = net->_rx + index;
+	struct netdev_rx_queue *first = queue->first;
 	struct kobject *kobj = &queue->kobj;
 	int error = 0;
 
@@ -738,6 +739,7 @@ static int rx_queue_add_kobject(struct net_device *net, int index)
 	}
 
 	kobject_uevent(kobj, KOBJ_ADD);
+	atomic_inc(&first->count);
 
 	return error;
 }

^ permalink raw reply related

* [PATCH] net: clear heap allocation for ETHTOOL_GRXCLSRLALL
From: Kees Cook @ 2010-10-07 20:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: David S. Miller, Ben Hutchings, Jeff Garzik, Jeff Kirsher,
	Peter P Waskiewicz Jr, netdev

Calling ETHTOOL_GRXCLSRLALL with a large rule_cnt will allocate kernel
heap without clearing it. For the one driver (niu) that implements it,
it will leave the unused portion of heap unchanged and copy the full
contents back to userspace.

Cc: stable@kernel.org
Signed-off-by: Kees Cook <kees.cook@canonical.com>
---
 net/core/ethtool.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 7a85367..4016ac6 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -348,7 +348,7 @@ static noinline_for_stack int ethtool_get_rxnfc(struct net_device *dev,
 	if (info.cmd == ETHTOOL_GRXCLSRLALL) {
 		if (info.rule_cnt > 0) {
 			if (info.rule_cnt <= KMALLOC_MAX_SIZE / sizeof(u32))
-				rule_buf = kmalloc(info.rule_cnt * sizeof(u32),
+				rule_buf = kzalloc(info.rule_cnt * sizeof(u32),
 						   GFP_USER);
 			if (!rule_buf)
 				return -ENOMEM;
-- 
1.7.1

-- 
Kees Cook
Ubuntu Security Team

^ permalink raw reply related

* Re: ixgbe: normalize frag_list usage
From: Alexander Duyck @ 2010-10-07 19:59 UTC (permalink / raw)
  To: David Miller
  Cc: Kirsher, Jeffrey T, Brandeburg, Jesse, Allan, Bruce W,
	netdev@vger.kernel.org
In-Reply-To: <20101006.235837.241424681.davem@davemloft.net>

On 10/6/2010 11:58 PM, David Miller wrote:
> From: "Duyck, Alexander H"<alexander.h.duyck@intel.com>
> Date: Tue, 5 Oct 2010 15:45:32 -0700
>
>> The patch below is kind of what I had in mind for a way to do RSC and maintain
>> the pointer scheme you are looking for.  Consider this patch an RFC for now
>> since I based this off of Jeff's internal testing tree and so it would need
>> some modification to apply cleanly to net-next.
>
> Can you really not remember the head somewhere?

I can track it in the RSC_CB if that works for you.  Right now though I 
guess I am not seeing the difference between tracking this in 
skb->frag_next vs IXGBE_RSC_CB(skb)->frag_head.  I think it might help 
if you were to provide some functions that demonstrate exactly what you 
had in mind for frag list handling.  Specifically if you were to add a 
function for merging a frag into the frag list, and for how you want to 
approach cleaning up the skb->prev/frag_tail_tracker pointer when you 
are cleaning up an active frag_list.

> What I wanted is for everyone to build their frag list SKBs from head to tail,
> always.  So that I could, as I mentioned in my original posting, do something
> like:

My change is doing just that.  It is building from head to tail.  The 
only difference is that tail is tracking head since it has to be updated 
after the tail is added, and then I can drop next/frag_next back to NULL.

> struct sk_buff {
> 	union {
> 		struct list_head list;
> 		struct {
> 			struct sk_buff	*frag_next;
> 			struct sk_buff	*frag_tail_tracker;
> 		};
> 	};
> };
>
> The ->frag_tail_tracker is only used in the head SKB to maintain where the
> last SKB in the frag list is.

That is what I was doing in the head skb.

> You're tracking the head from the inner SKBs, such that my intended
> conventions are not being followed.

What I am doing is tracking the head from the tail skb.  The reason for 
this is that I need some way to update the len, data_len, and truesize 
after I have placed data in the tail via skb_put.  All of the other SKBs 
in the frag list are just tracking the next like any other SKB in the 
frag_list.

Now that I think about it I guess the issue is that I am adding the SKB 
to the frag_list before it is complete.  I will send another patch out 
in the next couple of hours that should address most of the issues.

Thanks,

Alex

^ permalink raw reply

* Re: [PATCH] sysctl: fix min/max handling in __do_proc_doulongvec_minmax()
From: Eric W. Biederman @ 2010-10-07 19:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Eric Dumazet, Américo Wang, Robin Holt, linux-kernel,
	Willy Tarreau, David S. Miller, netdev, James Morris,
	Pekka Savola (ipv6), Patrick McHardy, Alexey Kuznetsov
In-Reply-To: <20101007121840.ca49e2ac.akpm@linux-foundation.org>

Andrew Morton <akpm@linux-foundation.org> writes:

> On Thu, 07 Oct 2010 18:59:03 +0200
> Eric Dumazet <eric.dumazet@gmail.com> wrote:

>> Thats fine by me, thanks Eric.
>> 
>> Andrew, please remove previous patch from your tree and replace it by
>> following one :
>> 
>> [PATCH v2] sysctl: fix min/max handling in __do_proc_doulongvec_minmax()
>> 
>> When proc_doulongvec_minmax() is used with an array of longs,
>> and no min/max check requested (.extra1 or .extra2 being NULL), we
>> dereference a NULL pointer for the second element of the array.
>> 
>> Noticed while doing some changes in network stack for the "16TB problem"
>> 
>> Fix is to not change min & max pointers in
>> __do_proc_doulongvec_minmax(), so that all elements of the vector share
>> an unique min/max limit, like proc_dointvec_minmax().
>> 
>> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
>> ---
>>  kernel/sysctl.c |    2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
>> index f88552c..8e45451 100644
>> --- a/kernel/sysctl.c
>> +++ b/kernel/sysctl.c
>> @@ -2485,7 +2485,7 @@ static int __do_proc_doulongvec_minmax(void *data, struct ctl_table *table, int
>>  		kbuf[left] = 0;
>>  	}
>>  
>> -	for (; left && vleft--; i++, min++, max++, first=0) {
>> +	for (; left && vleft--; i++, first=0) {
>>  		unsigned long val;
>>  
>>  		if (write) {
>
> Did we check to see whether any present callers are passing in pointers
> to arrays of min/max values?

In 2.6.36 there are not any callers that pass in a vector of anything, I
don't know about linux-next.  It looks to me like incrementing min and
max was simply a bug.

> I wonder if there's any documentation for this interface which just
> became wrong.

Or it just became right.  Clearly no one has been expecting min
and max to be vectors.

Eric

^ permalink raw reply

* Re: [PATCH] net: Fix sk_dst_check() to reset the obsolete dst_entry of a socket.
From: Chung-Yih Wang (王崇懿) @ 2010-10-07 19:37 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-kernel, timo.teras
In-Reply-To: <AANLkTimF7piLEy7BUP4-BnmiiGT4NimZw0gE+kmYQd-c@mail.gmail.com>

As I am testing the l2tp/ipsec client(it is working fine on 2.6.32 but
failed on 2.6.35 with the same client). Please see the following log
dump for sk_dst_check().

<2>[  201.390289] ==== sk_dst_check sk=C7485800 dst=C717AC60
obsolete=FFFFFFFF cookie=0 check=C0296510
<2>[  211.247467] ==== sk_dst_check sk=C7485000 dst=C717AC60
obsolete=FFFFFFFF cookie=0 check=C0296510

[Basically, the ipsec tunnel is built and then free the dst_entry for
this l2tp socket. Therefore, the obsolete entry should be reset in
sk_dst_check(). However, the kernel 2.6.35 will do nothing here since
the ipv4_dst_check still return the obsolete entry even if it is
obsolete(dst->obsolete=2)]

<2>[  216.571350] ==== sk_dst_check sk=C7485400 dst=C6F670E0
obsolete=00000002 cookie=0 check=C0296510
<6>[  218.069396] alg: No test for authenc(hmac(sha1),cbc(des3_ede))
(authenc(hmac(sha1-generic),cbc(des3_ede-generic)))
<6>[  218.164764] batt:  96%, 4114 mV, 0 mA (-6 avg), 27.2 C, 1305 mAh
<2>[  218.575561] ==== sk_dst_check sk=C7485400 dst=C6F670E0
obsolete=00000002 cookie=0 check=C0296510
<2>[  220.580535] ==== sk_dst_check sk=C7485400 dst=C6F670E0
obsolete=00000002 cookie=0 check=C0296510
<2>[  222.585754] ==== sk_dst_check sk=C7485400 dst=C6F670E0
obsolete=00000002 cookie=0 check=C0296510
<2>[  224.591522] ==== sk_dst_check sk=C7485400 dst=C6F670E0
obsolete=00000002 cookie=0 check=C0296510
<2>[  226.599212] ==== sk_dst_check sk=C7485400 dst=C6F670E0
obsolete=00000002 cookie=0 check=C0296510
<2>[  228.602600] ==== sk_dst_check sk=C7485400 dst=C6F670E0
obsolete=00000002 cookie=0 check=C0296510
<2>[  230.608062] ==== sk_dst_check sk=C7485400 dst=C6F670E0
obsolete=00000002 cookie=0 check=C0296510
<2>[  232.613464] ==== sk_dst_check sk=C7485400 dst=C6F670E0
obsolete=00000002 cookie=0 check=C0296510
<2>[  234.618896] ==== sk_dst_check sk=C7485400 dst=C6F670E0
obsolete=00000002 cookie=0 check=C0296510
<2>[  236.623504] ==== sk_dst_check sk=C7485400 dst=C6F670E0
obsolete=00000002 cookie=0 check=C0296510
<2>[  238.628936] ==== sk_dst_check sk=C7485400 dst=C6F670E0
obsolete=00000002 cookie=0 check=C0296510
<2>[  240.634338] ==== sk_dst_check sk=C7485400 dst=C6F670E0
obsolete=00000002 cookie=0 check=C0296510
<2>[  242.639709] ==== sk_dst_check sk=C7485400 dst=C6F670E0
obsolete=00000002 cookie=0 check=C0296510
<2>[  244.645111] ==== sk_dst_check sk=C7485400 dst=C6F670E0
obsolete=00000002 cookie=0 check=C0296510
<2>[  246.648864] ==== sk_dst_check sk=C7485400 dst=C6F670E0
obsolete=00000002 cookie=0 check=C0296510
<2>[  248.654693] ==== sk_dst_check sk=C7485400 dst=C6F670E0
obsolete=00000002 cookie=0 check=C0296510
<2>[  250.660125] ==== sk_dst_check sk=C7485400 dst=C6F670E0
obsolete=00000002 cookie=0 check=C0296510
<2>[  252.665527] ==== sk_dst_check sk=C7485400 dst=C6F670E0
obsolete=00000002 cookie=0 check=C0296510

^ permalink raw reply

* Re: [PATCH] sysctl: fix min/max handling in __do_proc_doulongvec_minmax()
From: Andrew Morton @ 2010-10-07 19:18 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric W. Biederman, Américo Wang, Robin Holt, linux-kernel,
	Willy Tarreau, David S. Miller, netdev, James Morris,
	Pekka Savola (ipv6), Patrick McHardy, Alexey Kuznetsov
In-Reply-To: <1286470743.2912.276.camel@edumazet-laptop>

On Thu, 07 Oct 2010 18:59:03 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Le jeudi 07 octobre 2010 __ 09:37 -0700, Eric W. Biederman a __crit :
> 
> > The difference between long handling and int handling is a
> > usability issue.  I don't expect we will be exporting new
> > vectors via sysctl, so the conversion of a handful of vectors
> > from int to long is where this is most likely to be used.
> > 
> > I skimmed through all of what I presume are the current users
> > aka linux-2.6.36-rcX and there don't appear to be any users
> > of proc_dounlongvec_minmax that use it's vector properties there.
> > 
> > Which doubly tells me that incrementing the min and max pointers
> > is not what we want to do.
> > 
> 
> Thats fine by me, thanks Eric.
> 
> Andrew, please remove previous patch from your tree and replace it by
> following one :
> 
> [PATCH v2] sysctl: fix min/max handling in __do_proc_doulongvec_minmax()
> 
> When proc_doulongvec_minmax() is used with an array of longs,
> and no min/max check requested (.extra1 or .extra2 being NULL), we
> dereference a NULL pointer for the second element of the array.
> 
> Noticed while doing some changes in network stack for the "16TB problem"
> 
> Fix is to not change min & max pointers in
> __do_proc_doulongvec_minmax(), so that all elements of the vector share
> an unique min/max limit, like proc_dointvec_minmax().
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
>  kernel/sysctl.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index f88552c..8e45451 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -2485,7 +2485,7 @@ static int __do_proc_doulongvec_minmax(void *data, struct ctl_table *table, int
>  		kbuf[left] = 0;
>  	}
>  
> -	for (; left && vleft--; i++, min++, max++, first=0) {
> +	for (; left && vleft--; i++, first=0) {
>  		unsigned long val;
>  
>  		if (write) {

Did we check to see whether any present callers are passing in pointers
to arrays of min/max values?

I wonder if there's any documentation for this interface which just
became wrong.

^ permalink raw reply

* pull request: wireless-2.6 2010-10-07
From: John W. Linville @ 2010-10-07 18:58 UTC (permalink / raw)
  To: davem; +Cc: linux-wireless, netdev, linux-kernel

Dave,

Here are a few more more-or-less-one-liner fixes intended for 2.6.36.
The one from Felix fixes a regression introduced in the 2.6.36 cycle.
The one from Johannes fixes a crash reported by Ben Greear (as
documented in the changelog).  The one from me reverts and earlier patch
from me that can result in stuff in dmesg from not calling
netif_receive_skb in the proper context.

Please let me know if there are problems!

Thanks,

John

---

The following changes since commit fb3dbece264a50ab4373f3af0bbbd9175d3ad4d7:

  Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/padovan/bluetooth-2.6 (2010-10-07 00:59:39 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6.git master

Felix Fietkau (1):
      ath9k_hw: fix regression in ANI listen time calculation

Johannes Berg (1):
      mac80211: delete AddBA response timer

John W. Linville (1):
      Revert "mac80211: use netif_receive_skb in ieee80211_tx_status callpath"

 drivers/net/wireless/ath/ath9k/ani.c |    2 +-
 net/mac80211/agg-tx.c                |    2 ++
 net/mac80211/status.c                |    4 ++--
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/net/wireless/ath/ath9k/ani.c b/drivers/net/wireless/ath/ath9k/ani.c
index cc648b6..a3d95cc 100644
--- a/drivers/net/wireless/ath/ath9k/ani.c
+++ b/drivers/net/wireless/ath/ath9k/ani.c
@@ -543,7 +543,7 @@ static u8 ath9k_hw_chan_2_clockrate_mhz(struct ath_hw *ah)
 	if (conf_is_ht40(conf))
 		return clockrate * 2;
 
-	return clockrate * 2;
+	return clockrate;
 }
 
 static int32_t ath9k_hw_ani_get_listen_time(struct ath_hw *ah)
diff --git a/net/mac80211/agg-tx.c b/net/mac80211/agg-tx.c
index c893f23..8f23401 100644
--- a/net/mac80211/agg-tx.c
+++ b/net/mac80211/agg-tx.c
@@ -175,6 +175,8 @@ int ___ieee80211_stop_tx_ba_session(struct sta_info *sta, u16 tid,
 
 	set_bit(HT_AGG_STATE_STOPPING, &tid_tx->state);
 
+	del_timer_sync(&tid_tx->addba_resp_timer);
+
 	/*
 	 * After this packets are no longer handed right through
 	 * to the driver but are put onto tid_tx->pending instead,
diff --git a/net/mac80211/status.c b/net/mac80211/status.c
index 10caec5..34da679 100644
--- a/net/mac80211/status.c
+++ b/net/mac80211/status.c
@@ -377,7 +377,7 @@ void ieee80211_tx_status(struct ieee80211_hw *hw, struct sk_buff *skb)
 				skb2 = skb_clone(skb, GFP_ATOMIC);
 				if (skb2) {
 					skb2->dev = prev_dev;
-					netif_receive_skb(skb2);
+					netif_rx(skb2);
 				}
 			}
 
@@ -386,7 +386,7 @@ void ieee80211_tx_status(struct ieee80211_hw *hw, struct sk_buff *skb)
 	}
 	if (prev_dev) {
 		skb->dev = prev_dev;
-		netif_receive_skb(skb);
+		netif_rx(skb);
 		skb = NULL;
 	}
 	rcu_read_unlock();
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply related

* Re: [PATCH V2] Use firmware provided index to register a network interface
From: Narendra_K @ 2010-10-07 18:44 UTC (permalink / raw)
  To: kay.sievers
  Cc: greg, Matt_Domsch, netdev, linux-hotplug, linux-pci,
	Jordan_Hargrave, Vijay_Nijhawan, Charles_Rose
In-Reply-To: <AANLkTinP5WPDPvk+kq8vsyP=xC9qcoe+c=1EBp0XJNPk@mail.gmail.com>

On Thu, Oct 07, 2010 at 10:35:14PM +0530, Kay Sievers wrote:
>    On Thu, Oct 7, 2010 at 18:48, Greg KH <greg@kroah.com> wrote:
>    > On Thu, Oct 07, 2010 at 11:31:13AM -0500, Matt Domsch wrote:
>    >> 1) SMBIOS type 41 method.  Windows does not use this today, and I
>    >>    can't speak to their future plans.  Narendra's kernel patch does,
>    >>    as has biosdevname, the udev helper we first wrote for this
>    >>    purpose, for several years.
>    >
>    > Then stick with that udev helper please :)
> 
>    What about just exporting this information in sysfs, and not touch the
>    naming?
> 
>    Anyway, I'm pretty sure all of this naming of onboard devices should
>    happen only at install time, or from a system management tool and not
>    at hotplug time.
> 
>    We should not get confused by the way the (very simple)
>    automatic-rule-creater for persistent netdev naming in udev works.
>    This is really just a tool for the common case, and works fine for the
>    majority of people.

Right. It works as the automatic rule creator saves the snapshot of the
registered network interfaces at run time. 

> 
>    I'm not sure, if we should put all these special use cases in the
>    hotplug path. I mean it's not that people add and remove 4 port
>    network cards with special BIOS all the time, and expect proper naming
>    on the first bootup, right? The installer, or the system management
>    tool could just create/edit udev rules to provide proper device naming
>    on whatever property is available at a specific hardware, be it the
>    MAC address or some other persistent match?
> 

The proposal made is not expecting deterministic naming when an add-in
card with 'N' ports is plugged in/out. It is specific to onboard devices
only. Expectation is onboard devices have deterministic naming at first
bootup. And no special BIOS required as SMBIOS tables are in use for
sometime now.

I did explore using rules based on the exported attribute ATTRS{index}
on a system with 4 Onboard devices and two add-in devices.(where add-in
device becomes eth0 and eth1)

# PCI device 0x14e4:0x164c (bnx2) (custom name provided by external
# tool) SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*",
# ATTRS{index}=="1", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"

(and similary for eth1..eth3)

And for add-in devices.

# PCI device 0x8086:0x10c9 (igb) (custom name provided by external tool)
# SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*",
# ATTR{address}=="00:1b:21:54:33:3c", ATTR{type}=="1", KERNEL=="eth*",
# NAME="eth4" (and similar for eth5 as they do not have an index)

This works as i edited the file manually.

If this has to be done on a large number of systems where an image
based deployment is preferred, getting onboard device names as expected
is an issue and is important.

-- 
With regards,
Narendra K

^ permalink raw reply

* Re: [PATCH net-next] net: percpu net_device refcount
From: Eric Dumazet @ 2010-10-07 18:06 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, netdev
In-Reply-To: <20101007103051.63b5177c@nehalam>

Le jeudi 07 octobre 2010 à 10:30 -0700, Stephen Hemminger a écrit :

> It makes sense, but what about 256 cores and 1024 Vlans?
> That adds up to 4M of memory which is might be noticeable.
> 
> 

Well, 256 cores and 1024 Vlan -> 1 Mbyte of memory, not 4 ;)

This seems reasonable to me.

Eventually we could use a fallback, if percpu allocation failed -> use a
static field in net_device.




^ permalink raw reply

* [PATCH net-next-2.6] sfc: Don't try to set filters with search depths we know won't work
From: Ben Hutchings @ 2010-10-07 17:50 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers

The filter engine will time-out and ignore filters beyond
200-something hops.  We also need to avoid infinite loops in
efx_filter_search() when the table is full.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/sfc/filter.c |   11 ++++++++++-
 1 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/drivers/net/sfc/filter.c b/drivers/net/sfc/filter.c
index abc884d..52cb608 100644
--- a/drivers/net/sfc/filter.c
+++ b/drivers/net/sfc/filter.c
@@ -20,6 +20,12 @@
 #define FILTER_CTL_SRCH_FUDGE_WILD 3
 #define FILTER_CTL_SRCH_FUDGE_FULL 1
 
+/* Hard maximum hop limit.  Hardware will time-out beyond 200-something.
+ * We also need to avoid infinite loops in efx_filter_search() when the
+ * table is full.
+ */
+#define FILTER_CTL_SRCH_MAX 200
+
 struct efx_filter_table {
 	u32		offset;		/* address of table relative to BAR */
 	unsigned	size;		/* number of entries */
@@ -183,7 +189,8 @@ static int efx_filter_search(struct efx_filter_table *table,
 	incr = efx_filter_increment(key);
 
 	for (depth = 1, filter_idx = hash & (table->size - 1);
-	     test_bit(filter_idx, table->used_bitmap);
+	     depth <= FILTER_CTL_SRCH_MAX &&
+		     test_bit(filter_idx, table->used_bitmap);
 	     ++depth) {
 		cmp = &table->spec[filter_idx];
 		if (efx_filter_equal(spec, cmp))
@@ -192,6 +199,8 @@ static int efx_filter_search(struct efx_filter_table *table,
 	}
 	if (!for_insert)
 		return -ENOENT;
+	if (depth > FILTER_CTL_SRCH_MAX)
+		return -EBUSY;
 found:
 	*depth_required = depth;
 	return filter_idx;
-- 
1.7.2.1


-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply related

* Re: [PATCH V2] Use firmware provided index to register a network interface
From: Narendra_K @ 2010-10-07 17:44 UTC (permalink / raw)
  To: greg
  Cc: kay.sievers, Matt_Domsch, netdev, linux-hotplug, linux-pci,
	Jordan_Hargrave, Vijay_Nijhawan, Charles_Rose
In-Reply-To: <20101007171532.GA29857@kroah.com>

On Thu, Oct 07, 2010 at 10:45:32PM +0530, Greg KH wrote:
>    On Thu, Oct 07, 2010 at 07:05:14PM +0200, Kay Sievers wrote:
>    > On Thu, Oct 7, 2010 at 18:48, Greg KH <greg@kroah.com> wrote:
>    > > On Thu, Oct 07, 2010 at 11:31:13AM -0500, Matt Domsch wrote:
>    > >> 1) SMBIOS type 41 method.  Windows does not use this today, and I
>    > >>    can't speak to their future plans.  Narendra's kernel patch does,
>    > >>    as has biosdevname, the udev helper we first wrote for this
>    > >>    purpose, for several years.
>    > >
>    > > Then stick with that udev helper please :)
>    >
>    > What about just exporting this information in sysfs, and not touch the
>    naming?
> 
>    I think this is now exported in userspace, right Narendra?
> 

Right. It is in the mainline kernel.
( commit 911e1c9b05a8e3559a7aa89083930700a0b9e7ee)

-- 
With regards,
Narendra K

^ permalink raw reply

* Re: [PATCH net-next] net: percpu net_device refcount
From: Stephen Hemminger @ 2010-10-07 17:30 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1286471555.2912.291.camel@edumazet-laptop>

On Thu, 07 Oct 2010 19:12:35 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> We tried very hard to remove all possible dev_hold()/dev_put() pairs in
> network stack, using RCU conversions.
> 
> There is still an unavoidable device refcount change for every dst we
> create/destroy, and this can slow down some workloads (routers or some
> app servers)
> 
> We can switch to a percpu refcount implementation, now dynamic per_cpu
> infrastructure is mature. On a 64 cpus machine, this consumes 256 bytes
> per device.
> 

It makes sense, but what about 256 cores and 1024 Vlans?
That adds up to 4M of memory which is might be noticeable.


-- 

^ permalink raw reply

* Re: [PATCH V2] Use firmware provided index to register a network interface
From: Greg KH @ 2010-10-07 17:15 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Matt Domsch, Narendra_K, netdev, linux-hotplug, linux-pci,
	Jordan_Hargrave, Vijay_Nijhawan, Charles_Rose
In-Reply-To: <AANLkTinP5WPDPvk+kq8vsyP=xC9qcoe+c=1EBp0XJNPk@mail.gmail.com>

On Thu, Oct 07, 2010 at 07:05:14PM +0200, Kay Sievers wrote:
> On Thu, Oct 7, 2010 at 18:48, Greg KH <greg@kroah.com> wrote:
> > On Thu, Oct 07, 2010 at 11:31:13AM -0500, Matt Domsch wrote:
> >> 1) SMBIOS type 41 method.  Windows does not use this today, and I
> >>    can't speak to their future plans.  Narendra's kernel patch does,
> >>    as has biosdevname, the udev helper we first wrote for this
> >>    purpose, for several years.
> >
> > Then stick with that udev helper please :)
> 
> What about just exporting this information in sysfs, and not touch the naming?

I think this is now exported in userspace, right Narendra?

> Anyway, I'm pretty sure all of this naming of onboard devices should
> happen only at install time, or from a system management tool and not
> at hotplug time.
> 
> We should not get confused by the way the (very simple)
> automatic-rule-creater for persistent netdev naming in udev works.
> This is really just a tool for the common case, and works fine for the
> majority of people.
> 
> I'm not sure, if we should put all these special use cases in the
> hotplug path. I mean it's not that people add and remove 4 port
> network cards with special BIOS all the time, and expect proper naming
> on the first bootup, right? The installer, or the system management
> tool could just create/edit udev rules to provide proper device naming
> on whatever property is available at a specific hardware, be it the
> MAC address or some other persistent match?

I totally agree.

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH V2] Use firmware provided index to register a network interface
From: Kay Sievers @ 2010-10-07 17:13 UTC (permalink / raw)
  To: David Lamparter
  Cc: Matt Domsch, Greg KH, Narendra_K, netdev, linux-hotplug,
	linux-pci, Jordan_Hargrave, Vijay_Nijhawan, Charles_Rose
In-Reply-To: <20101007164906.GB122295@jupiter.n2.diac24.net>

On Thu, Oct 7, 2010 at 18:49, David Lamparter <equinox@diac24.net> wrote:
> On Thu, Oct 07, 2010 at 11:31:13AM -0500, Matt Domsch wrote:
>> At most, they'll have to deal with the same "eth0_rename"
>
> Getting that "eth0_rename" stuff is not something unchangeably imposed
> by the kernel. I think this should be considered an udev bug most of all
> things, since if the udev scripts did it right, this wouldn't happen. A
> "right" way to do this would for example be to first rename all devices
> to "tmpwhatever0", then rename them back to their proper ethX names.

Whats the difference between eth0_rename and tmpeth0? I don't see any. :)

> Now I don't know even a single bit about udev scripting, so maybe this
> isn't possible with the current udev infrastructure, but there's
> certainly no kernel part stopping a proper implementation.
>
> Btw, what happened to nameif(8)? It's totally outdated (and deprecated I
> guess?), but it got the job done...

It can't swap already taken device names on hotplug, and that job
might get complicated very easily. Also stuff needs to be renamed
before the device is announced to userspace, otherwise the interface
gets configured and can't be renamed anymore.

Kay

^ permalink raw reply

* [PATCH net-next] net: percpu net_device refcount
From: Eric Dumazet @ 2010-10-07 17:12 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

We tried very hard to remove all possible dev_hold()/dev_put() pairs in
network stack, using RCU conversions.

There is still an unavoidable device refcount change for every dst we
create/destroy, and this can slow down some workloads (routers or some
app servers)

We can switch to a percpu refcount implementation, now dynamic per_cpu
infrastructure is mature. On a 64 cpus machine, this consumes 256 bytes
per device.

On x86, dev_hold(dev) code :

before
	lock	incl 0x280(%ebx)
after:
	movl	0x260(%ebx),%eax
	incl	fs:(%eax)

Stress bench :

(Sending 160.000.000 UDP frames,
IP route cache disabled, dual E5540 @2.53GHz,
32bit kernel, FIB_TRIE)

Before:

real	1m1.662s
user	0m14.373s
sys	12m55.960s

After:

real	0m51.179s
user	0m15.329s
sys	10m15.942s

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/netdevice.h |    6 ++--
 net/core/dev.c            |   47 ++++++++++++++++++++++++++++++------
 2 files changed, 43 insertions(+), 10 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 6abcef6..fa1d88d 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1020,7 +1020,7 @@ struct net_device {
 	struct timer_list	watchdog_timer;
 
 	/* Number of references to this device */
-	atomic_t		refcnt ____cacheline_aligned_in_smp;
+	int __percpu		*pcpu_refcnt;
 
 	/* delayed register/unregister */
 	struct list_head	todo_list;
@@ -1792,7 +1792,7 @@ extern void netdev_run_todo(void);
  */
 static inline void dev_put(struct net_device *dev)
 {
-	atomic_dec(&dev->refcnt);
+	irqsafe_cpu_dec(*dev->pcpu_refcnt);
 }
 
 /**
@@ -1803,7 +1803,7 @@ static inline void dev_put(struct net_device *dev)
  */
 static inline void dev_hold(struct net_device *dev)
 {
-	atomic_inc(&dev->refcnt);
+	irqsafe_cpu_inc(*dev->pcpu_refcnt);
 }
 
 /* Carrier loss detection, dial on demand. The functions netif_carrier_on
diff --git a/net/core/dev.c b/net/core/dev.c
index 7d14955..2186e1e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5195,9 +5195,6 @@ int init_dummy_netdev(struct net_device *dev)
 	 */
 	dev->reg_state = NETREG_DUMMY;
 
-	/* initialize the ref count */
-	atomic_set(&dev->refcnt, 1);
-
 	/* NAPI wants this */
 	INIT_LIST_HEAD(&dev->napi_list);
 
@@ -5205,6 +5202,19 @@ int init_dummy_netdev(struct net_device *dev)
 	set_bit(__LINK_STATE_PRESENT, &dev->state);
 	set_bit(__LINK_STATE_START, &dev->state);
 
+#if 0
+	/* We dont allocate pcpu_refcnt for dummy devices,
+	 * because users of this 'device' dont need to change
+	 * its refcount
+	 */
+	dev->pcpu_refcnt = alloc_percpu(int);
+	if (!dev->pcpu_refcnt)
+		return -ENOMEM;
+
+	/* initialize the ref count */
+	dev_hold(dev);
+#endif
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(init_dummy_netdev);
@@ -5246,6 +5256,15 @@ out:
 }
 EXPORT_SYMBOL(register_netdev);
 
+static int netdev_refcnt_read(const struct net_device *dev)
+{
+	int i, refcnt = 0;
+
+	for_each_possible_cpu(i)
+		refcnt += *per_cpu_ptr(dev->pcpu_refcnt, i);
+	return refcnt;
+}
+
 /*
  * netdev_wait_allrefs - wait until all references are gone.
  *
@@ -5260,11 +5279,14 @@ EXPORT_SYMBOL(register_netdev);
 static void netdev_wait_allrefs(struct net_device *dev)
 {
 	unsigned long rebroadcast_time, warning_time;
+	int refcnt;
 
 	linkwatch_forget_dev(dev);
 
 	rebroadcast_time = warning_time = jiffies;
-	while (atomic_read(&dev->refcnt) != 0) {
+	refcnt = netdev_refcnt_read(dev);
+
+	while (refcnt != 0) {
 		if (time_after(jiffies, rebroadcast_time + 1 * HZ)) {
 			rtnl_lock();
 
@@ -5291,11 +5313,13 @@ static void netdev_wait_allrefs(struct net_device *dev)
 
 		msleep(250);
 
+		refcnt = netdev_refcnt_read(dev);
+
 		if (time_after(jiffies, warning_time + 10 * HZ)) {
 			printk(KERN_EMERG "unregister_netdevice: "
 			       "waiting for %s to become free. Usage "
 			       "count = %d\n",
-			       dev->name, atomic_read(&dev->refcnt));
+			       dev->name, refcnt);
 			warning_time = jiffies;
 		}
 	}
@@ -5353,7 +5377,7 @@ void netdev_run_todo(void)
 		netdev_wait_allrefs(dev);
 
 		/* paranoia */
-		BUG_ON(atomic_read(&dev->refcnt));
+		BUG_ON(netdev_refcnt_read(dev));
 		WARN_ON(rcu_dereference_raw(dev->ip_ptr));
 		WARN_ON(dev->ip6_ptr);
 		WARN_ON(dev->dn_ptr);
@@ -5523,9 +5547,13 @@ struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
 	dev = PTR_ALIGN(p, NETDEV_ALIGN);
 	dev->padded = (char *)dev - (char *)p;
 
-	if (dev_addr_init(dev))
+	dev->pcpu_refcnt = alloc_percpu(int);
+	if (!dev->pcpu_refcnt)
 		goto free_tx;
 
+	if (dev_addr_init(dev))
+		goto free_pcpu;
+
 	dev_mc_init(dev);
 	dev_uc_init(dev);
 
@@ -5556,6 +5584,8 @@ struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
 
 free_tx:
 	kfree(tx);
+free_pcpu:
+	free_percpu(dev->pcpu_refcnt);
 free_p:
 	kfree(p);
 	return NULL;
@@ -5589,6 +5619,9 @@ void free_netdev(struct net_device *dev)
 	list_for_each_entry_safe(p, n, &dev->napi_list, dev_list)
 		netif_napi_del(p);
 
+	free_percpu(dev->pcpu_refcnt);
+	dev->pcpu_refcnt = NULL;
+
 	/*  Compatibility with error handling in drivers */
 	if (dev->reg_state == NETREG_UNINITIALIZED) {
 		kfree((char *)dev - dev->padded);



^ permalink raw reply related

* Re: [PATCH V2] Use firmware provided index to register a network interface
From: Kay Sievers @ 2010-10-07 17:05 UTC (permalink / raw)
  To: Greg KH
  Cc: Matt Domsch, Narendra_K, netdev, linux-hotplug, linux-pci,
	Jordan_Hargrave, Vijay_Nijhawan, Charles_Rose
In-Reply-To: <20101007164835.GA27339@kroah.com>

On Thu, Oct 7, 2010 at 18:48, Greg KH <greg@kroah.com> wrote:
> On Thu, Oct 07, 2010 at 11:31:13AM -0500, Matt Domsch wrote:
>> 1) SMBIOS type 41 method.  Windows does not use this today, and I
>>    can't speak to their future plans.  Narendra's kernel patch does,
>>    as has biosdevname, the udev helper we first wrote for this
>>    purpose, for several years.
>
> Then stick with that udev helper please :)

What about just exporting this information in sysfs, and not touch the naming?

Anyway, I'm pretty sure all of this naming of onboard devices should
happen only at install time, or from a system management tool and not
at hotplug time.

We should not get confused by the way the (very simple)
automatic-rule-creater for persistent netdev naming in udev works.
This is really just a tool for the common case, and works fine for the
majority of people.

I'm not sure, if we should put all these special use cases in the
hotplug path. I mean it's not that people add and remove 4 port
network cards with special BIOS all the time, and expect proper naming
on the first bootup, right? The installer, or the system management
tool could just create/edit udev rules to provide proper device naming
on whatever property is available at a specific hardware, be it the
MAC address or some other persistent match?

Kay

^ permalink raw reply

* Re: [PATCH] sysctl: fix min/max handling in __do_proc_doulongvec_minmax()
From: Eric Dumazet @ 2010-10-07 16:59 UTC (permalink / raw)
  To: Eric W. Biederman, Andrew Morton
  Cc: Américo Wang, Robin Holt, linux-kernel, Willy Tarreau,
	David S. Miller, netdev, James Morris, Pekka Savola (ipv6),
	Patrick McHardy, Alexey Kuznetsov
In-Reply-To: <m1iq1e3qnn.fsf@fess.ebiederm.org>

Le jeudi 07 octobre 2010 à 09:37 -0700, Eric W. Biederman a écrit :

> The difference between long handling and int handling is a
> usability issue.  I don't expect we will be exporting new
> vectors via sysctl, so the conversion of a handful of vectors
> from int to long is where this is most likely to be used.
> 
> I skimmed through all of what I presume are the current users
> aka linux-2.6.36-rcX and there don't appear to be any users
> of proc_dounlongvec_minmax that use it's vector properties there.
> 
> Which doubly tells me that incrementing the min and max pointers
> is not what we want to do.
> 

Thats fine by me, thanks Eric.

Andrew, please remove previous patch from your tree and replace it by
following one :

[PATCH v2] sysctl: fix min/max handling in __do_proc_doulongvec_minmax()

When proc_doulongvec_minmax() is used with an array of longs,
and no min/max check requested (.extra1 or .extra2 being NULL), we
dereference a NULL pointer for the second element of the array.

Noticed while doing some changes in network stack for the "16TB problem"

Fix is to not change min & max pointers in
__do_proc_doulongvec_minmax(), so that all elements of the vector share
an unique min/max limit, like proc_dointvec_minmax().

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 kernel/sysctl.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index f88552c..8e45451 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2485,7 +2485,7 @@ static int __do_proc_doulongvec_minmax(void *data, struct ctl_table *table, int
 		kbuf[left] = 0;
 	}
 
-	for (; left && vleft--; i++, min++, max++, first=0) {
+	for (; left && vleft--; i++, first=0) {
 		unsigned long val;
 
 		if (write) {

^ permalink raw reply related

* Re: [PATCH V2] Use firmware provided index to register a network interface
From: David Lamparter @ 2010-10-07 16:49 UTC (permalink / raw)
  To: Matt Domsch
  Cc: Greg KH, Narendra_K, netdev, linux-hotplug, linux-pci,
	Jordan_Hargrave, Vijay_Nijhawan, Charles_Rose
In-Reply-To: <20101007163113.GA14260@auslistsprd01.us.dell.com>

Hi Matt,

On Thu, Oct 07, 2010 at 11:31:13AM -0500, Matt Domsch wrote:
> At most, they'll have to deal with the same "eth0_rename"

Getting that "eth0_rename" stuff is not something unchangeably imposed
by the kernel. I think this should be considered an udev bug most of all
things, since if the udev scripts did it right, this wouldn't happen. A
"right" way to do this would for example be to first rename all devices
to "tmpwhatever0", then rename them back to their proper ethX names.

Now I don't know even a single bit about udev scripting, so maybe this
isn't possible with the current udev infrastructure, but there's
certainly no kernel part stopping a proper implementation.

Btw, what happened to nameif(8)? It's totally outdated (and deprecated I
guess?), but it got the job done...

-David

^ permalink raw reply

* Re: [PATCH V2] Use firmware provided index to register a network interface
From: Greg KH @ 2010-10-07 16:48 UTC (permalink / raw)
  To: Matt Domsch
  Cc: Narendra_K, netdev, linux-hotplug, linux-pci, Jordan_Hargrave,
	Vijay_Nijhawan, Charles_Rose
In-Reply-To: <20101007163113.GA14260@auslistsprd01.us.dell.com>

On Thu, Oct 07, 2010 at 11:31:13AM -0500, Matt Domsch wrote:
> On Thu, Oct 07, 2010 at 08:11:34AM -0700, Greg KH wrote:
> > On Thu, Oct 07, 2010 at 07:23:35AM -0700, Narendra_K@Dell.com wrote:
> > > Hello,
> > > 
> > > V1 -> V2:
> > > 
> > > This patch addresses the scenario of buggy firmware/BIOS tables. The
> > > patch introduces a command line parameter 'no_netfwindex', passing which
> > > firmware provided index will not be used to derive 'eth' names. By
> > > default, firmware index will be used and the parameter can be used to
> > > work around buggy firmware/BIOS tables.
> > > 
> > > Please find the patch below.
> > > 
> > > From: Narendra K <narendra_k@dell.com>
> > > Subject: [PATCH] Use firmware provided index to register a network device
> > > 
> > > This patch uses the firmware provided index to derive the ethN name.
> > > If the firmware provides an index for the corresponding pdev, the N
> > > is derived from the index.
> > 
> > No, this has the very big chance to reorder existing network names and
> > should all be done in userspace with the firmware information exported
> > in sysfs, if the user so desires.
> 
> Existing names that use 70-persistent-net.rules won't change.

But users that don't use it would cause their names to change, and you
can't break that, no matter how naturally fragile those types of systems
are.  Sorry.

> The kernel patch has the advantage of not requiring users to change
> their scripts, by reserving the first N values for onboard devices as
> BIOS describes them.

So you feel that updating a kernel is easier than getting a user to
update their userspace scripts?  While I might also feel it is that way,
in reality, it isn't, sorry.

> Userspace udev rules require us to change user behavior, and likely
> their scripts, to use a new namespace such as ethlom1, in order to get
> deterministic naming.

Then use that, don't put this in the kernel please.

> > > It took some time to find out the details asked above. Right,
> > > windows does not use SMBIOS type 41 record to derive names. But as
> > > a datapoint, windows also has the same problem.
> 
> > If windows does not use this field, then you can guarantee it will
> > show up incorrectly in BIOSes and never be tested by manufacturers
> > before they ship their machines.
> 
> There are 2 methods of exporting the information that have gotten
> confusing here.
> 
> 1) SMBIOS type 41 method.  Windows does not use this today, and I
>    can't speak to their future plans.  Narendra's kernel patch does,
>    as has biosdevname, the udev helper we first wrote for this
>    purpose, for several years.

Then stick with that udev helper please :)

> 2) soon-to-be-released PCIe Firmware Spec, exporting label and index
>    as an ACPI _DSM().  It is anticipated that Windows will use this
>    information in some fashion at some point, per our conversations
>    with Microsoft as part of the PCI SIG proposal process.  We've held
>    off submitting a kernel patch until the spec is public.

Let's worry about that _when_ it is public, and when there is a Windows
version that supports it please.  Until then, there will not be support
for this in BIOSes in any usable way.

> Cases:
> 1) BIOS doesn't provide this information (the common case today for
>    all but Dell 10G and newer servers), nothing is reserved, and
>    therefore the existing 70-persistent-net.rules take effect to name
>    devices.  No change in behavior for existing systems.  New installs
>    will have NICs named non-deterministically, and recorded in
>    70-persistent-net.rules.  Users desiring consistent naming must
>    adjust 70-persistent-net.rules after install, and to avoid
>    eth0_rename collisions, must rename into another namespace.

That's fine and is how things work today, right?

> 2) BIOS provides this information correctly (Dell 10G and newer servers, or
>    anyone else implementing the spec).  The first N values for onboard
>    devices are reserved and assigned by the kernel to onboard
>    devices.

And you just broke people's machines that don't use udev for network
naming.  Sorry, that's not going to work.

>    70-persistent-net.rules may still try to rename the
>    ports, and may fail with eth0_rename collisions.

So you just broke their machines as well, when it was working, again,
not acceptable.

> 3) BIOS provides this information incorrectly (none known today).

Only because we have no way of testing for this :)

> What I can't find here is a solution where the user gets determistic
> naming, without being forced to change the namespace and adjust their
> scripts.  Can anyone?

No, and they shoudn't.

You should only have "deterministic" naming if you want it, and you set
it up to do so, from userspace.  The kernel is not responsible for this,
sorry.

If you want to do it in userspace, you have all the tools, and the data
exported from the kernel to do so.

> I'm not opposed to "do it all in userspace" - really, I'm not.  But
> the 'where' is only part of the problem.  No userspace solution has
> yet been proposed that doesn't require a namespace change, and I'm
> still hoping to avoid that.

Just use a new script for people who want this, they will have to know
they want it anyway, right?

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH] sysctl: fix min/max handling in __do_proc_doulongvec_minmax()
From: Eric W. Biederman @ 2010-10-07 16:37 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Américo Wang, Robin Holt, Andrew Morton, linux-kernel,
	Willy Tarreau, David S. Miller, netdev, James Morris,
	Pekka Savola (ipv6), Patrick McHardy, Alexey Kuznetsov
In-Reply-To: <1286445081.2912.15.camel@edumazet-laptop>

Eric Dumazet <eric.dumazet@gmail.com> writes:

> Le jeudi 07 octobre 2010 à 17:25 +0800, Américo Wang a écrit :
>> >>
>> >
>> >Here is the final one.
>> 
>> Oops, that one is not correct. Hopefully this one
>> is correct.
>> 
>> --------------->
>> 
>> Eric D. noticed that we may trigger an OOPS if we leave ->extra{1,2}
>> to NULL when we use proc_doulongvec_minmax().
>> 
>> Actually, we don't need to store min/max values in a vector,
>> because all the elements in the vector should share the same min/max
>> value, like what proc_dointvec_minmax() does.
>> 
>
> If we assert same min/max limits are to be applied to all elements,
> a much simpler fix than yours would be :
>
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index f88552c..8e45451 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -2485,7 +2485,7 @@ static int __do_proc_doulongvec_minmax(void *data, struct ctl_table *table, int
>  		kbuf[left] = 0;
>  	}
>  
> -	for (; left && vleft--; i++, min++, max++, first=0) {
> +	for (; left && vleft--; i++, first=0) {
>  		unsigned long val;
>  
>  		if (write) {
>
>
> Please dont send huge patches like this to 'fix' a bug,
> especially on slow path.
>
> First we fix the bug, _then_ we can try to make code more 
> efficient or more pretty or shorter.
>
> So the _real_ question is :
>
> Should the min/max limits should be a single pair,
> shared by all elements, or a vector of limits.

The difference between long handling and int handling is a
usability issue.  I don't expect we will be exporting new
vectors via sysctl, so the conversion of a handful of vectors
from int to long is where this is most likely to be used.

I skimmed through all of what I presume are the current users
aka linux-2.6.36-rcX and there don't appear to be any users
of proc_dounlongvec_minmax that use it's vector properties there.

Which doubly tells me that incrementing the min and max pointers
is not what we want to do.

Eric

^ permalink raw reply

* Re: [PATCH V2] Use firmware provided index to register a network interface
From: Matt Domsch @ 2010-10-07 16:31 UTC (permalink / raw)
  To: Greg KH
  Cc: Narendra_K, netdev, linux-hotplug, linux-pci, Jordan_Hargrave,
	Vijay_Nijhawan, Charles_Rose
In-Reply-To: <20101007151134.GA25713@kroah.com>

On Thu, Oct 07, 2010 at 08:11:34AM -0700, Greg KH wrote:
> On Thu, Oct 07, 2010 at 07:23:35AM -0700, Narendra_K@Dell.com wrote:
> > Hello,
> > 
> > V1 -> V2:
> > 
> > This patch addresses the scenario of buggy firmware/BIOS tables. The
> > patch introduces a command line parameter 'no_netfwindex', passing which
> > firmware provided index will not be used to derive 'eth' names. By
> > default, firmware index will be used and the parameter can be used to
> > work around buggy firmware/BIOS tables.
> > 
> > Please find the patch below.
> > 
> > From: Narendra K <narendra_k@dell.com>
> > Subject: [PATCH] Use firmware provided index to register a network device
> > 
> > This patch uses the firmware provided index to derive the ethN name.
> > If the firmware provides an index for the corresponding pdev, the N
> > is derived from the index.
> 
> No, this has the very big chance to reorder existing network names and
> should all be done in userspace with the firmware information exported
> in sysfs, if the user so desires.

Existing names that use 70-persistent-net.rules won't change.  They
were non-deterministically assigned and recorded in that file, so
they'll persist.  At most, they'll have to deal with the same
"eth0_rename" garbage that anyone wanting to get a consistent name
from other than MAC address would get.  The only reason the
auto-created 70-persistent-net.rules works is that most people don't
change their NIC configuration after install, so it has no renaming to do.

The kernel patch has the advantage of not requiring users to change
their scripts, by reserving the first N values for onboard devices as
BIOS describes them.

Userspace udev rules require us to change user behavior, and likely
their scripts, to use a new namespace such as ethlom1, in order to get
deterministic naming.

> > It took some time to find out the details asked above. Right,
> > windows does not use SMBIOS type 41 record to derive names. But as
> > a datapoint, windows also has the same problem.

> If windows does not use this field, then you can guarantee it will
> show up incorrectly in BIOSes and never be tested by manufacturers
> before they ship their machines.

There are 2 methods of exporting the information that have gotten
confusing here.

1) SMBIOS type 41 method.  Windows does not use this today, and I
   can't speak to their future plans.  Narendra's kernel patch does,
   as has biosdevname, the udev helper we first wrote for this purpose, for several years.

2) soon-to-be-released PCIe Firmware Spec, exporting label and index
   as an ACPI _DSM().  It is anticipated that Windows will use this
   information in some fashion at some point, per our conversations
   with Microsoft as part of the PCI SIG proposal process.  We've held
   off submitting a kernel patch until the spec is public.

Cases:
1) BIOS doesn't provide this information (the common case today for
   all but Dell 10G and newer servers), nothing is reserved, and
   therefore the existing 70-persistent-net.rules take effect to name
   devices.  No change in behavior for existing systems.  New installs
   will have NICs named non-deterministically, and recorded in
   70-persistent-net.rules.  Users desiring consistent naming must
   adjust 70-persistent-net.rules after install, and to avoid
   eth0_rename collisions, must rename into another namespace.

2) BIOS provides this information correctly (Dell 10G and newer servers, or
   anyone else implementing the spec).  The first N values for onboard
   devices are reserved and assigned by the kernel to onboard
   devices.  70-persistent-net.rules may still try to rename the
   ports, and may fail with eth0_rename collisions.  Users would have
   to adjust 70-persistent-net.rules to accommodate, but they do not
   have to change the namespace from ethX to something else. New
   installs will have onboard NICs named deterministically, and recorded in
   70-persistent-net.rules.

3) BIOS provides this information incorrectly (none known today).
   Some number of N values may be reserved, more or fewer than
   actually present.  In either case, it's possible that
   70-persistent-net.rules would have to be adjusted to accommodate,
   but they do not have to change the namespace from ethX to something
   else. New installs will have onboard NICs named deterministically,
   if not ideally due to buggy BIOS, and recorded in
   70-persistent-net.rules.

What I can't find here is a solution where the user gets determistic
naming, without being forced to change the namespace and adjust their
scripts.  Can anyone?  I'm not opposed to "do it all in userspace" -
really, I'm not.  But the 'where' is only part of the problem.  No
userspace solution has yet been proposed that doesn't require a
namespace change, and I'm still hoping to avoid that.

Thanks,
Matt

-- 
Matt Domsch
Technology Strategist
Dell | Office of the CTO

^ permalink raw reply

* Re: [PATCH V2] Use firmware provided index to register a network interface
From: Stephen Hemminger @ 2010-10-07 16:14 UTC (permalink / raw)
  To: Narendra_K
  Cc: netdev, linux-hotplug, linux-pci, Matt_Domsch, Jordan_Hargrave,
	Vijay_Nijhawan, Charles_Rose
In-Reply-To: <20101007142319.GB2641@libnet-test.oslab.blr.amer.dell.com>

On Thu, 7 Oct 2010 07:23:35 -0700
<Narendra_K@Dell.com> wrote:

> Hello,
> 
> V1 -> V2:
> 
> This patch addresses the scenario of buggy firmware/BIOS tables. The
> patch introduces a command line parameter 'no_netfwindex', passing which
> firmware provided index will not be used to derive 'eth' names. By
> default, firmware index will be used and the parameter can be used to
> work around buggy firmware/BIOS tables.
> 
> Please find the patch below.
> 
> From: Narendra K <narendra_k@dell.com>
> Subject: [PATCH] Use firmware provided index to register a network device
> 
> This patch uses the firmware provided index to derive the ethN name.
> If the firmware provides an index for the corresponding pdev, the N
> is derived from the index.
> 
> As an example, consider a PowerEdge R710 which has 4 BCM5709
> Lan-On-Motherboard ports,1 Intel 82572EI port and 4 82575GB ports.
> The system firmware communicates the order of the 4 Lan-On-Motherboard
> ports by assigning indexes to each one of them. This is available to
> the OS as the SMBIOS type 41 record(for onboard devices), in the field
> 'device type index'. It looks like below -

I agree with Greg. Provide the firmware index in sysfs for use
in udev userspace and let udev do the policy on the naming.
The firmware index could be missing, wrong, or interesting on only
some devices.

^ permalink raw reply

* Re: [PATCH V2] Use firmware provided index to register a network interface
From: Greg KH @ 2010-10-07 15:11 UTC (permalink / raw)
  To: Narendra_K
  Cc: netdev, linux-hotplug, linux-pci, Matt_Domsch, Jordan_Hargrave,
	Vijay_Nijhawan, Charles_Rose
In-Reply-To: <20101007142319.GB2641@libnet-test.oslab.blr.amer.dell.com>

On Thu, Oct 07, 2010 at 07:23:35AM -0700, Narendra_K@Dell.com wrote:
> Hello,
> 
> V1 -> V2:
> 
> This patch addresses the scenario of buggy firmware/BIOS tables. The
> patch introduces a command line parameter 'no_netfwindex', passing which
> firmware provided index will not be used to derive 'eth' names. By
> default, firmware index will be used and the parameter can be used to
> work around buggy firmware/BIOS tables.
> 
> Please find the patch below.
> 
> From: Narendra K <narendra_k@dell.com>
> Subject: [PATCH] Use firmware provided index to register a network device
> 
> This patch uses the firmware provided index to derive the ethN name.
> If the firmware provides an index for the corresponding pdev, the N
> is derived from the index.

No, this has the very big chance to reorder existing network names and
should all be done in userspace with the firmware information exported
in sysfs, if the user so desires.

So consider this patch rejected from me.

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH] SIW: iWARP Protocol headers
From: Bernard Metzler @ 2010-10-07 14:55 UTC (permalink / raw)
  To: David Dillow
  Cc: Bart Van Assche, Jason Gunthorpe,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1286393504.26136.31.camel-FqX9LgGZnHWDB2HL1qBt2PIbXMQ5te18@public.gmane.org>



David Dillow <dave-i1Mk8JYDVaaSihdK6806/g@public.gmane.org> wrote on 10/06/2010 09:31:44 PM:

> On Wed, 2010-10-06 at 12:37 -0600, Jason Gunthorpe wrote:
> > On Wed, Oct 06, 2010 at 02:22:55PM -0400, David Dillow wrote:
> > > On Wed, 2010-10-06 at 11:25 -0600, Jason Gunthorpe wrote:
> > > > On Wed, Oct 06, 2010 at 02:52:46PM +0200, Bernard Metzler wrote:
> > > > It is actually a little more complicated than just this. I assume
you
> > > > are casting the structures over packet payloads? In this case you
> > > > have to guarentee alignment (or used packed everywhere). Does iwarp
> > > > have provisions for alignment? If so you can construct your
bitfields
> > > > using the alignment type, ie if iWarp guarantees 4 bytes then the
> > > > biggest type you can use is u32 - then you can avoid using packed.
> > > >
> > > > Mind you, I'm not sure how to guarentee alignment when you consider
> > > > all the possible sources of unalignment in the stack: TCP, IP,L2
stack ?
> > >
> > > You don't have to. The TCP stack, IIRC, requires the architecture to
fix
> > > up misaligned accesses, or at least it used to. It will try to keep
> > > things aligned if possible -- see the comment above NET_IP_ALIGN in
> > > include/linux/skbuff.h.
> >
> > I was under the impression that this was just to align the IP
> > header.
>
> It is, and it assumes an ethernet header of 14 bytes. As you note, VLANs
> and everything else can mess with that, which is why the TCP stack
> requires the architecture to handle unaligned accesses. It's a bonus if
> iWarp can guarantee an alignment WRT the IP or TCP header, but not
> required.
>
> > > The structures in Bernard's header have all fields naturally
> > > aligned, so no need for the packed attribute and its baggage.
> >
> > No, they arent:
> >
> > > +struct iwarp_rdma_write {
> > > +   struct iwarp_ctrl   ctrl;
> > > +   __u32         sink_stag;
> > > +   __u64         sink_to;
> > > +} __attribute__((__packed__));
> >
> > For instance has sink_to unaligned if the maximum aligment of the
> > payload is only guarenteed to be 4. The proper thing to do in this
> > case is probably __attribute__((__packed__, __aligned__(4))) if that
> > works the way I think.. :)
>
> struct iwarp_ctrl consists of 2 u16s, so sink_stag is aligned to 4 bytes
> from the start of the struct iwarp_rdma_write, and sink_to is at 8 bytes
> -- so they are naturally aligned with respect to the structure. The
> compiler will not add padding, so no need for the packed attribute.
>
> Note I'm not talking about alignment in memory -- that'll depend on
> where the data starts, and as I mentioned, the stack doesn't make any
> guarantees about that, relying on the architecture to fixup any
> misaligned accesses.

agreed. all n-bytes are starting on a multipe of n-byte offset
relative to the start of the structure.

many thanks,
bernard.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox