Netdev List
 help / color / mirror / Atom feed
* Re: [RFC] SPD basic actions per netdev
From: Timo Teräs @ 2010-04-01  4:52 UTC (permalink / raw)
  To: Herbert Xu; +Cc: jamal, David S. Miller, Patrick McHardy, netdev
In-Reply-To: <20100401025247.GA19994@gondor.apana.org.au>

Herbert Xu wrote:
> On Wed, Mar 31, 2010 at 10:35:23PM -0400, jamal wrote:
>> On Thu, 2010-04-01 at 08:33 +0800, Herbert Xu wrote:
>>
>>> If we're going to change this then we should just add a second
>>> interface field to the selector, rather than trying to overload
>>> the existing one.
>> Do you mean to have a selector->iif/oif? Sure that makes sense - but is
>> a much larger surgery and user space will need to be taught.
>>
>> Did you look at the patch i sent? i tried to retain current behavior
>> except for the input check path. output path was working in classifying
>> with specific netdevs. 
> 
> OK, I guess the chances of an existing app breaking is slim.
> 
> BTW, you should treat FLOW_DIR_FWD as FLOW_DIR_IN.

I think we need iif and oif. The separation is clear in in/fwd policies,
as each is matched properly. But 'out' policies are used for both:
locally generated, and forwarded traffic.

Basically it goes like:
 in - for policy_check for traffic that is received locally
 fwd - for policy_check for traffic that is forwarded
 out - all (local and forwarded) traffic that goes out of box

IMHO, it's slightly confusing that in/fwd is split, but out is not.
But that's the way it works. If you now override the how interface
is checked for 'out' policy, it'll break current behaviour.

- Timo


^ permalink raw reply

* linux-next: build failure after merge of the slabh tree
From: Stephen Rothwell @ 2010-04-01  5:41 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-next, linux-kernel, Sjur Braendeland, David Miller, netdev

Hi Tejun,

After merging the slabh tree, today's linux-next build (x86_64
allmodconfig) failed like this:

net/caif/cfcnfg.c: In function 'cfcnfg_create':
net/caif/cfcnfg.c:68: error: implicit declaration of function 'kmalloc'
net/caif/cfcnfg.c:68: warning: assignment makes pointer from integer without a cast
net/caif/cfcnfg.c:100: error: implicit declaration of function 'kfree'

Caused by commit 15c9ac0c80e390df09ce5730a7b08b13e07a8dd5 ("net-caif: add
CAIF generic caif support functions") from the net interacting with
commit de380b55f92986c1a84198149cb71b7228d15fbd ("percpu: don't
implicitly include slab.h from percpu.h") from the slabh tree.

I have applied the following patch for today.  Dave, could you apply this
to the net tree please?

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Thu, 1 Apr 2010 16:31:50 +1100
Subject: [PATCH] net-caif: using kmalloc/kfree requires the include of slab.h

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
---
 net/caif/cfcnfg.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/net/caif/cfcnfg.c b/net/caif/cfcnfg.c
index 70a733d..c873e3d 100644
--- a/net/caif/cfcnfg.c
+++ b/net/caif/cfcnfg.c
@@ -5,6 +5,7 @@
  */
 #include <linux/kernel.h>
 #include <linux/stddef.h>
+#include <linux/slab.h>
 #include <net/caif/caif_layer.h>
 #include <net/caif/cfpkt.h>
 #include <net/caif/cfcnfg.h>
-- 
1.7.0.3

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

^ permalink raw reply related

* Re: Undefined behaviour of connect(fd, NULL, 0);
From: Changli Gao @ 2010-04-01  5:50 UTC (permalink / raw)
  To: Neil Brown; +Cc: David Miller, shemminger, netdev
In-Reply-To: <x2j412e6f7f1003312116rd3b3ba96t31267545efe7660f@mail.gmail.com>

On Thu, Apr 1, 2010 at 12:16 PM, Changli Gao <xiaosuo@gmail.com> wrote:
>
> I found this from man page for connect(2)
>
>       Generally, connection-based protocol sockets may successfully connect()
>       only once; connectionless protocol sockets may use  connect()  multiple
>       times to change their association.  Connectionless sockets may dissolve
>       the association by connecting to an address with the  sa_family  member
>       of sockaddr set to AF_UNSPEC (supported on Linux since kernel 2.2).
>

dissolving the association by connecting to an address with the
sa_family member of sockaddr set to AF_UNSEPC is broken too.

int ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
{
        struct inet_sock *inet = inet_sk(sk);
        struct sockaddr_in *usin = (struct sockaddr_in *) uaddr;
        struct rtable *rt;
        __be32 saddr;
        int oif;
        int err;


        if (addr_len < sizeof(*usin))
                return -EINVAL;

        if (usin->sin_family != AF_INET)
                return -EAFNOSUPPORT;

according to the man page, sin_family == AF_UNSPEC should be allowed.
And netlink's connect doesn't check the addr_len, so it behavior is
also undeterminedl

static int netlink_connect(struct socket *sock, struct sockaddr *addr,
                           int alen, int flags)
{
        int err = 0;
        struct sock *sk = sock->sk;
        struct netlink_sock *nlk = nlk_sk(sk);
        struct sockaddr_nl *nladdr = (struct sockaddr_nl *)addr;

        if (addr->sa_family == AF_UNSPEC) {
                sk->sk_state    = NETLINK_UNCONNECTED;
                nlk->dst_pid    = 0;
                nlk->dst_group  = 0;
                return 0;
        }

If this issues need to be fixed, I'll check all the protocols if their
connect() checkes the sizeof of socket address or not, and post a
patch.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* Re: [RFC] SPD basic actions per netdev
From: Herbert Xu @ 2010-04-01  6:01 UTC (permalink / raw)
  To: Timo Teräs; +Cc: jamal, David S. Miller, Patrick McHardy, netdev
In-Reply-To: <4BB42692.9010105@iki.fi>

On Thu, Apr 01, 2010 at 07:52:34AM +0300, Timo Teräs wrote:
>
> IMHO, it's slightly confusing that in/fwd is split, but out is not.
> But that's the way it works. If you now override the how interface
> is checked for 'out' policy, it'll break current behaviour.

Unless I've misunderstood what his patch is trying to do, it would
seem that out policies would be completely unchanged.

Forward policies are not used on output.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* [Patch V2] bonding: fix potential deadlock in bond_uninit()
From: Amerigo Wang @ 2010-04-01  6:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jiri Pirko, Stephen Hemminger, netdev, David S. Miller,
	Eric W. Biederman, Amerigo Wang, bonding-devel, Jay Vosburgh


bond_uninit() is invoked with rtnl_lock held, when it does destroy_workqueue()
which will potentially flush all works in this workqueue, if we hold rtnl_lock
again in the work function, it will deadlock.

So move destroy_workqueue() to destructor where rtnl_lock is not held any more,
suggested by Eric.

Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Jiri Pirko <jpirko@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>

---

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 5b92fbf..9f0aaa2 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -4450,6 +4450,14 @@ static const struct net_device_ops bond_netdev_ops = {
 	.ndo_vlan_rx_kill_vid	= bond_vlan_rx_kill_vid,
 };
 
+static void bond_destructor(struct net_device *bond_dev)
+{
+	struct bonding *bond = netdev_priv(bond_dev);
+	if (bond->wq)
+		destroy_workqueue(bond->wq);
+	free_netdev(bond_dev);
+}
+
 static void bond_setup(struct net_device *bond_dev)
 {
 	struct bonding *bond = netdev_priv(bond_dev);
@@ -4470,7 +4478,7 @@ static void bond_setup(struct net_device *bond_dev)
 	bond_dev->ethtool_ops = &bond_ethtool_ops;
 	bond_set_mode_ops(bond, bond->params.mode);
 
-	bond_dev->destructor = free_netdev;
+	bond_dev->destructor = bond_destructor;
 
 	/* Initialize the device options */
 	bond_dev->tx_queue_len = 0;
@@ -4542,9 +4550,6 @@ static void bond_uninit(struct net_device *bond_dev)
 
 	bond_remove_proc_entry(bond);
 
-	if (bond->wq)
-		destroy_workqueue(bond->wq);
-
 	netif_addr_lock_bh(bond_dev);
 	bond_mc_list_destroy(bond);
 	netif_addr_unlock_bh(bond_dev);

^ permalink raw reply related

* Re: [Patch V2] bonding: fix potential deadlock in bond_uninit()
From: Eric W. Biederman @ 2010-04-01  6:12 UTC (permalink / raw)
  To: Amerigo Wang
  Cc: linux-kernel, Jiri Pirko, Stephen Hemminger, netdev,
	David S. Miller, bonding-devel, Jay Vosburgh
In-Reply-To: <20100401061014.4815.7341.sendpatchset@localhost.localdomain>

Amerigo Wang <amwang@redhat.com> writes:

> bond_uninit() is invoked with rtnl_lock held, when it does destroy_workqueue()
> which will potentially flush all works in this workqueue, if we hold rtnl_lock
> again in the work function, it will deadlock.
>
> So move destroy_workqueue() to destructor where rtnl_lock is not held any more,
> suggested by Eric.

The error handling on creating a bond device needs to be updated as well.

Eric


> Signed-off-by: WANG Cong <amwang@redhat.com>
> Cc: Jay Vosburgh <fubar@us.ibm.com>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Stephen Hemminger <shemminger@vyatta.com>
> Cc: Jiri Pirko <jpirko@redhat.com>
> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
>
> ---
>
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index 5b92fbf..9f0aaa2 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -4450,6 +4450,14 @@ static const struct net_device_ops bond_netdev_ops = {
>  	.ndo_vlan_rx_kill_vid	= bond_vlan_rx_kill_vid,
>  };
>  
> +static void bond_destructor(struct net_device *bond_dev)
> +{
> +	struct bonding *bond = netdev_priv(bond_dev);
> +	if (bond->wq)
> +		destroy_workqueue(bond->wq);
> +	free_netdev(bond_dev);
> +}
> +
>  static void bond_setup(struct net_device *bond_dev)
>  {
>  	struct bonding *bond = netdev_priv(bond_dev);
> @@ -4470,7 +4478,7 @@ static void bond_setup(struct net_device *bond_dev)
>  	bond_dev->ethtool_ops = &bond_ethtool_ops;
>  	bond_set_mode_ops(bond, bond->params.mode);
>  
> -	bond_dev->destructor = free_netdev;
> +	bond_dev->destructor = bond_destructor;
>  
>  	/* Initialize the device options */
>  	bond_dev->tx_queue_len = 0;
> @@ -4542,9 +4550,6 @@ static void bond_uninit(struct net_device *bond_dev)
>  
>  	bond_remove_proc_entry(bond);
>  
> -	if (bond->wq)
> -		destroy_workqueue(bond->wq);
> -
>  	netif_addr_lock_bh(bond_dev);
>  	bond_mc_list_destroy(bond);
>  	netif_addr_unlock_bh(bond_dev);

^ permalink raw reply

* Re: [RFC] SPD basic actions per netdev
From: Timo Teräs @ 2010-04-01  6:20 UTC (permalink / raw)
  To: Herbert Xu; +Cc: jamal, David S. Miller, Patrick McHardy, netdev
In-Reply-To: <20100401060145.GB20865@gondor.apana.org.au>

Herbert Xu wrote:
> On Thu, Apr 01, 2010 at 07:52:34AM +0300, Timo Teräs wrote:
>> IMHO, it's slightly confusing that in/fwd is split, but out is not.
>> But that's the way it works. If you now override the how interface
>> is checked for 'out' policy, it'll break current behaviour.
> 
> Unless I've misunderstood what his patch is trying to do, it would
> seem that out policies would be completely unchanged.
> 
> Forward policies are not used on output.

Oh, that right.

But my statement still holds. If iif/oif is swapped, it's changing
current semantics and can end up breaking setups. Both are still
valid for 'in' and 'fwd' policies too, right? What if I'm using
'in' policy to make sure that all stuff arriving via 'eth0' is
encrypted, but 'eth1' is trusted and does not need xfrm. This
would break.

I do like the idea very much. In fact I remember asking this exact
feature long time ago. But I think it should be done by explicitly
allowing user to specify both iif and oif; even if it's more
intrusive.

^ permalink raw reply

* Re: [RFC] SPD basic actions per netdev
From: Herbert Xu @ 2010-04-01  6:28 UTC (permalink / raw)
  To: Timo Teräs; +Cc: jamal, David S. Miller, Patrick McHardy, netdev
In-Reply-To: <4BB43B38.1060004@iki.fi>

On Thu, Apr 01, 2010 at 09:20:40AM +0300, Timo Teräs wrote:
>
> But my statement still holds. If iif/oif is swapped, it's changing
> current semantics and can end up breaking setups. Both are still
> valid for 'in' and 'fwd' policies too, right? What if I'm using
> 'in' policy to make sure that all stuff arriving via 'eth0' is
> encrypted, but 'eth1' is trusted and does not need xfrm. This
> would break.

The thing is if you're currently specifying an ifindex in the
selector for inbound/forward, it probably just won't work as
it'll be matched against oif which is meaningless on inbound
and forward.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [RFC] SPD basic actions per netdev
From: Timo Teräs @ 2010-04-01  6:32 UTC (permalink / raw)
  To: Herbert Xu; +Cc: jamal, David S. Miller, Patrick McHardy, netdev
In-Reply-To: <20100401062840.GA21284@gondor.apana.org.au>

Herbert Xu wrote:
> On Thu, Apr 01, 2010 at 09:20:40AM +0300, Timo Teräs wrote:
>> But my statement still holds. If iif/oif is swapped, it's changing
>> current semantics and can end up breaking setups. Both are still
>> valid for 'in' and 'fwd' policies too, right? What if I'm using
>> 'in' policy to make sure that all stuff arriving via 'eth0' is
>> encrypted, but 'eth1' is trusted and does not need xfrm. This
>> would break.
> 
> The thing is if you're currently specifying an ifindex in the
> selector for inbound/forward, it probably just won't work as
> it'll be matched against oif which is meaningless on inbound
> and forward.

On inbound it's always loopback interface. Does the same hold
true on forward? I was under the impression that it would
reflect the actual destination interface.

^ permalink raw reply

* Re: [RFC] SPD basic actions per netdev
From: Herbert Xu @ 2010-04-01  6:39 UTC (permalink / raw)
  To: Timo Teräs; +Cc: jamal, David S. Miller, Patrick McHardy, netdev
In-Reply-To: <4BB43DE6.9060300@iki.fi>

On Thu, Apr 01, 2010 at 09:32:06AM +0300, Timo Teräs wrote:
>
> On inbound it's always loopback interface. Does the same hold
> true on forward? I was under the impression that it would
> reflect the actual destination interface.

That's a good point.  I suppose there might just be some crazy
people out there using it this way.

So Jamal, I think this is a good reason why we do need a new
field instead of just overloading the current one.  Otherwise
inbound selectors and forward selectors will have different
semantics which is needlessly confusing.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [Patch V2] bonding: fix potential deadlock in bond_uninit()
From: Cong Wang @ 2010-04-01  6:54 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Jiri Pirko, Stephen Hemminger, netdev,
	David S. Miller, bonding-devel, Jay Vosburgh
In-Reply-To: <m1r5mzsnuh.fsf@fess.ebiederm.org>

Eric W. Biederman wrote:
> Amerigo Wang <amwang@redhat.com> writes:
> 
>> bond_uninit() is invoked with rtnl_lock held, when it does destroy_workqueue()
>> which will potentially flush all works in this workqueue, if we hold rtnl_lock
>> again in the work function, it will deadlock.
>>
>> So move destroy_workqueue() to destructor where rtnl_lock is not held any more,
>> suggested by Eric.
> 
> The error handling on creating a bond device needs to be updated as well.
> 

You're right, I missed that part. Will update it soon.

Thanks.

^ permalink raw reply

* Re: [PATCH v3 04/12] l2tp: Add ppp device name to L2TP ppp session data
From: James Chapman @ 2010-04-01  7:19 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20100331160811.16351eb7@s6510>

Stephen Hemminger wrote:
> On Wed, 31 Mar 2010 08:43:09 +0100
> James Chapman <jchapman@katalix.com> wrote:
> 
>> Stephen Hemminger wrote:
>>> On Tue, 30 Mar 2010 17:17:46 +0100
>>> James Chapman <jchapman@katalix.com> wrote:
>>>
>>>> When dumping L2TP PPP sessions using /proc/net/l2tp, get
>>>> the assigned PPP device name from PPP using ppp_dev_name().
>>>>
>>>> Signed-off-by: James Chapman <jchapman@katalix.com>
>>>> Reviewed-by: Randy Dunlap <randy.dunlap@oracle.com>
>>>>
>>> Why is this a necessary API?
>>> Why not put it in debugfs if just a debugging tool?
>> With the original driver (merged in 2.6.23), some people use horrible
>> hacks in scripts to derive info about their L2TP connections from /proc.
>> So I was reluctant to move it to debugfs in the new driver. If it is ok
>> to move an existing /proc file to debugfs, I'm happy to do so. People
>> should obtain such info from their L2TP userspace daemon, or through
>> netlink anyway.
>>
>>
> Sounds like a good use of sysfs either with attribute or symlink
> back to underlying device
> 
There might be thousands of L2TP sessions in some setups. Populating
sysfs with a link for each of those sessions isn't practical. The
existing /proc file dumps its info as a single text file for this
reason. I'd also like to provide the device name in the session netlink
message, which is the interface used by l2tp userspace, so I need a
kernel API to retrieve the device name from ppp.

I like the suggestion of using debugfs for access to driver debug info
though. I propose leaving the /proc file for L2TPv2 only, removing the
L2TPv3 data that I added to the proc file in this patch series, to
retain compatibility with the existing driver. This would show only
L2TPv2 sessions and tunnels. For new driver functionality (L2TPv3 etc),
use debugfs. The debugfs files would dump lists in a similar form to the
current code, listing all tunnels (L2TPv2 and L2TPv3) in a single file.
Using debugfs gives more flexibility for adding additional info later,
as required. How does that sound?

-- 
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development


^ permalink raw reply

* Re: Undefined behaviour of connect(fd, NULL, 0);
From: David Miller @ 2010-04-01  7:23 UTC (permalink / raw)
  To: xiaosuo; +Cc: neilb, shemminger, netdev
In-Reply-To: <x2j412e6f7f1003312116rd3b3ba96t31267545efe7660f@mail.gmail.com>

From: Changli Gao <xiaosuo@gmail.com>
Date: Thu, 1 Apr 2010 12:16:43 +0800

> Someone may use connect() to check if the connection is established
> or not. But there is no spec about the addr and addr_len value when
> connect(2) is used this way. Since there is no limit of addr and
> addr_len, and we supports addr is NULL to check the status of socket
> (Although it is buggy). I think we should treat it like a feature,
> and the problem Neil reported is a bug.

This seems logical, but I believe it is wrong.

We already know for a fact that it is guarenteed to not work
reliably for every single kernel in existence in the world
right now.

Every system.  Ones that have been deployed for 10 years as
well as those built from GIT 10 seconds ago.

So you tell me, if you put this into an application that you
wish to deploy anywhere, are you not being completely stupid?

Therefore, if it's illogical to use this in an application, what value
is there in starting to support it now in the kernel?

I'll tell you, the value is absolutely zero.

Yes we need to add the length check, but the behavior we give to this
case as a result, is completely arbitrary.  And I would in fact argue
for a hard error in these cases.

Simply mark it as invalid to call connect() this way.

^ permalink raw reply

* [Patch V3] bonding: fix potential deadlock in bond_uninit()
From: Cong Wang @ 2010-04-01  7:30 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Jiri Pirko, Stephen Hemminger, netdev,
	David S. Miller, bonding-devel, Jay Vosburgh
In-Reply-To: <m1r5mzsnuh.fsf@fess.ebiederm.org>

[-- Attachment #1: Type: text/plain, Size: 482 bytes --]

Eric W. Biederman wrote:
> Amerigo Wang <amwang@redhat.com> writes:
> 
>> bond_uninit() is invoked with rtnl_lock held, when it does destroy_workqueue()
>> which will potentially flush all works in this workqueue, if we hold rtnl_lock
>> again in the work function, it will deadlock.
>>
>> So move destroy_workqueue() to destructor where rtnl_lock is not held any more,
>> suggested by Eric.
> 
> The error handling on creating a bond device needs to be updated as well.
> 

Done.


[-- Attachment #2: drivers-net-bonding-bond_main_c-fix-destroy_workqueue-deadlock.diff --]
[-- Type: text/x-patch, Size: 2517 bytes --]

V3: fix error handling path of bond_create()

bond_uninit() is invoked with rtnl_lock held, when it does destroy_workqueue()
which will potentially flush all works in this workqueue, if we hold rtnl_lock
again in the work function, it will deadlock.

So move destroy_workqueue() to destructor where rtnl_lock is not held any more,
suggested by Eric.

Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Jiri Pirko <jpirko@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>

---

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 5b92fbf..61f8c63 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -4450,6 +4450,14 @@ static const struct net_device_ops bond_netdev_ops = {
 	.ndo_vlan_rx_kill_vid	= bond_vlan_rx_kill_vid,
 };
 
+static void bond_destructor(struct net_device *bond_dev)
+{
+	struct bonding *bond = netdev_priv(bond_dev);
+	if (bond->wq)
+		destroy_workqueue(bond->wq);
+	free_netdev(bond_dev);
+}
+
 static void bond_setup(struct net_device *bond_dev)
 {
 	struct bonding *bond = netdev_priv(bond_dev);
@@ -4470,7 +4478,7 @@ static void bond_setup(struct net_device *bond_dev)
 	bond_dev->ethtool_ops = &bond_ethtool_ops;
 	bond_set_mode_ops(bond, bond->params.mode);
 
-	bond_dev->destructor = free_netdev;
+	bond_dev->destructor = bond_destructor;
 
 	/* Initialize the device options */
 	bond_dev->tx_queue_len = 0;
@@ -4542,9 +4550,6 @@ static void bond_uninit(struct net_device *bond_dev)
 
 	bond_remove_proc_entry(bond);
 
-	if (bond->wq)
-		destroy_workqueue(bond->wq);
-
 	netif_addr_lock_bh(bond_dev);
 	bond_mc_list_destroy(bond);
 	netif_addr_unlock_bh(bond_dev);
@@ -4956,8 +4961,8 @@ int bond_create(struct net *net, const char *name)
 				bond_setup);
 	if (!bond_dev) {
 		pr_err("%s: eek! can't alloc netdev!\n", name);
-		res = -ENOMEM;
-		goto out;
+		rtnl_unlock();
+		return -ENOMEM;
 	}
 
 	dev_net_set(bond_dev, net);
@@ -4966,19 +4971,16 @@ int bond_create(struct net *net, const char *name)
 	if (!name) {
 		res = dev_alloc_name(bond_dev, "bond%d");
 		if (res < 0)
-			goto out_netdev;
+			goto out;
 	}
 
 	res = register_netdevice(bond_dev);
-	if (res < 0)
-		goto out_netdev;
 
 out:
 	rtnl_unlock();
+	if (res < 0)
+		bond_destructor(bond_dev);
 	return res;
-out_netdev:
-	free_netdev(bond_dev);
-	goto out;
 }
 
 static int __net_init bond_net_init(struct net *net)

^ permalink raw reply related

* Re: linux-next: build failure after merge of the slabh tree
From: David Miller @ 2010-04-01  7:29 UTC (permalink / raw)
  To: sfr; +Cc: tj, linux-next, linux-kernel, sjur.brandeland, netdev
In-Reply-To: <20100401164110.80e3b18f.sfr@canb.auug.org.au>

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Thu, 1 Apr 2010 16:41:10 +1100

> I have applied the following patch for today.  Dave, could you apply this
> to the net tree please?

Done.

^ permalink raw reply

* Re: [PATCH v3 04/12] l2tp: Add ppp device name to L2TP ppp session data
From: Eric Dumazet @ 2010-04-01  7:30 UTC (permalink / raw)
  To: James Chapman; +Cc: Stephen Hemminger, netdev
In-Reply-To: <4BB4490F.8090406@katalix.com>

Le jeudi 01 avril 2010 à 08:19 +0100, James Chapman a écrit :

> There might be thousands of L2TP sessions in some setups. Populating
> sysfs with a link for each of those sessions isn't practical. The
> existing /proc file dumps its info as a single text file for this
> reason. I'd also like to provide the device name in the session netlink
> message, which is the interface used by l2tp userspace, so I need a
> kernel API to retrieve the device name from ppp.
> 
> I like the suggestion of using debugfs for access to driver debug info
> though. I propose leaving the /proc file for L2TPv2 only, removing the
> L2TPv3 data that I added to the proc file in this patch series, to
> retain compatibility with the existing driver. This would show only
> L2TPv2 sessions and tunnels. For new driver functionality (L2TPv3 etc),
> use debugfs. The debugfs files would dump lists in a similar form to the
> current code, listing all tunnels (L2TPv2 and L2TPv3) in a single file.
> Using debugfs gives more flexibility for adding additional info later,
> as required. How does that sound?
> 

debugfs ? I dont get it, sorry.

Why not using netlink, as most iproute2 utilities do ?



^ permalink raw reply

* Re: linux-next: build failure after merge of the slabh tree
From: Tejun Heo @ 2010-04-01  7:31 UTC (permalink / raw)
  To: David Miller; +Cc: sfr, linux-next, linux-kernel, sjur.brandeland, netdev
In-Reply-To: <20100401.002908.267643090.davem@davemloft.net>

On 04/01/2010 04:29 PM, David Miller wrote:
> From: Stephen Rothwell <sfr@canb.auug.org.au>
> Date: Thu, 1 Apr 2010 16:41:10 +1100
> 
>> I have applied the following patch for today.  Dave, could you apply this
>> to the net tree please?
> 
> Done.

Thanks!

-- 
tejun

^ permalink raw reply

* Re: [PATCH v3 04/12] l2tp: Add ppp device name to L2TP ppp session data
From: David Miller @ 2010-04-01  7:34 UTC (permalink / raw)
  To: jchapman; +Cc: shemminger, netdev
In-Reply-To: <4BB4490F.8090406@katalix.com>

From: James Chapman <jchapman@katalix.com>
Date: Thu, 01 Apr 2010 08:19:43 +0100

> There might be thousands of L2TP sessions in some setups. Populating
> sysfs with a link for each of those sessions isn't practical. The
> existing /proc file dumps its info as a single text file for this
> reason. I'd also like to provide the device name in the session netlink
> message, which is the interface used by l2tp userspace, so I need a
> kernel API to retrieve the device name from ppp.

Scalability concerns are also another reason _not_ to use
procfs.

Use netlink or similar, which can dump with filtering and
proper queueing.

^ permalink raw reply

* Re: linux-next: build failure after merge of the slabh tree
From: Stephen Rothwell @ 2010-04-01  7:40 UTC (permalink / raw)
  To: David Miller; +Cc: tj, linux-next, linux-kernel, sjur.brandeland, netdev
In-Reply-To: <20100401.002908.267643090.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 433 bytes --]

Hi Dave,

On Thu, 01 Apr 2010 00:29:08 -0700 (PDT) David Miller <davem@davemloft.net> wrote:
>
> From: Stephen Rothwell <sfr@canb.auug.org.au>
> Date: Thu, 1 Apr 2010 16:41:10 +1100
> 
> > I have applied the following patch for today.  Dave, could you apply this
> > to the net tree please?
> 
> Done.

Thanks.

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* [PATCH] stmmac: fix kconfig for crc32 build error
From: Giuseppe CAVALLARO @ 2010-04-01  7:44 UTC (permalink / raw)
  To: netdev; +Cc: Carmelo AMOROSO, Giuseppe Cavallaro

From: Carmelo AMOROSO <carmelo.amoroso@st.com>

stmmac uses crc32 functions so it needs to select CRC32.

Fixes build error:
drivers/built-in.o: In function `dwmac1000_set_filter':
dwmac1000_core.c:(.text+0x3c380): undefined reference to `crc32_le'
dwmac1000_core.c:(.text+0x3c384): undefined reference to `bitrev32'

Signed-off-by: Carmelo Amoroso <carmelo.amoroso@st.com>
Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
---
 drivers/net/stmmac/Kconfig |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/net/stmmac/Kconfig b/drivers/net/stmmac/Kconfig
index fb28764..eb63d44 100644
--- a/drivers/net/stmmac/Kconfig
+++ b/drivers/net/stmmac/Kconfig
@@ -2,6 +2,7 @@ config STMMAC_ETH
 	tristate "STMicroelectronics 10/100/1000 Ethernet driver"
 	select MII
 	select PHYLIB
+	select CRC32
 	depends on NETDEVICES && CPU_SUBTYPE_ST40
 	help
 	  This is the driver for the Ethernet IPs are built around a
-- 
1.6.0.4


^ permalink raw reply related

* [PATCH] stmmac: add documentation for the driver.
From: Giuseppe CAVALLARO @ 2010-04-01  7:44 UTC (permalink / raw)
  To: netdev; +Cc: Giuseppe Cavallaro
In-Reply-To: <1270107844-23457-1-git-send-email-peppe.cavallaro@st.com>

Add Documentation/networking/stmmac.txt for the
stmmac network driver.

Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
---
 Documentation/networking/stmmac.txt |  143 +++++++++++++++++++++++++++++++++++
 1 files changed, 143 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/networking/stmmac.txt

diff --git a/Documentation/networking/stmmac.txt b/Documentation/networking/stmmac.txt
new file mode 100644
index 0000000..7ee770b
--- /dev/null
+++ b/Documentation/networking/stmmac.txt
@@ -0,0 +1,143 @@
+       STMicroelectronics 10/100/1000 Synopsys Ethernet driver
+
+Copyright (C) 2007-2010  STMicroelectronics Ltd
+Author: Giuseppe Cavallaro <peppe.cavallaro@st.com>
+
+This is the driver for the MAC 10/100/1000 on-chip Ethernet controllers
+(Synopsys IP blocks); it has been fully tested on STLinux platforms.
+
+Currently this network device driver is for all STM embedded MAC/GMAC
+(7xxx SoCs).
+
+DWC Ether MAC 10/100/1000 Universal version 3.41a and DWC Ether MAC 10/100
+Universal version 4.0 have been used for developing the first code
+implementation.
+
+Please, for more information also visit: www.stlinux.com
+
+1) Kernel Configuration
+The kernel configuration option is STMMAC_ETH:
+ Device Drivers ---> Network device support ---> Ethernet (1000 Mbit) --->
+ STMicroelectronics 10/100/1000 Ethernet driver (STMMAC_ETH)
+
+2) Driver parameters list:
+	debug: message level (0: no output, 16: all);
+	phyaddr: to manually provide the physical address to the PHY device;
+	dma_rxsize: DMA rx ring size;
+	dma_txsize: DMA tx ring size;
+	buf_sz: DMA buffer size;
+	tc: control the HW FIFO threshold;
+	tx_coe: Enable/Disable Tx Checksum Offload engine;
+	watchdog: transmit timeout (in milliseconds);
+	flow_ctrl: Flow control ability [on/off];
+	pause: Flow Control Pause Time;
+	tmrate: timer period (only if timer optimisation is configured).
+
+3) Command line options
+Driver parameters can be also passed in command line by using:
+	stmmaceth=dma_rxsize:128,dma_txsize:512
+
+4) Driver information and notes
+
+4.1) Transmit process
+The xmit method is invoked when the kernel needs to transmit a packet; it sets
+the descriptors in the ring and informs the DMA engine that there is a packet
+ready to be transmitted.
+Once the controller has finished transmitting the packet, an interrupt is
+triggered; So the driver will be able to release the socket buffers.
+By default, the driver sets the NETIF_F_SG bit in the features field of the
+net_device structure enabling the scatter/gather feature.
+
+4.2) Receive process
+When one or more packets are received, an interrupt happens. The interrupts
+are not queued so the driver has to scan all the descriptors in the ring during
+the receive process.
+This is based on NAPI so the interrupt handler signals only if there is work to be
+done, and it exits.
+Then the poll method will be scheduled at some future point.
+The incoming packets are stored, by the DMA, in a list of pre-allocated socket
+buffers in order to avoid the memcpy (Zero-copy).
+
+4.3) Timer-Driver Interrupt
+Instead of having the device that asynchronously notifies the frame receptions, the
+driver configures a timer to generate an interrupt at regular intervals.
+Based on the granularity of the timer, the frames that are received by the device
+will experience different levels of latency. Some NICs have dedicated timer
+device to perform this task. STMMAC can use either the RTC device or the TMU
+channel 2  on STLinux platforms.
+The timers frequency can be passed to the driver as parameter; when change it,
+take care of both hardware capability and network stability/performance impact.
+Several performance tests on STM platforms showed this optimisation allows to spare
+the CPU while having the maximum throughput.
+
+4.4) WOL
+Wake up on Lan feature through Magic Frame is only supported for the GMAC
+core.
+
+4.5) DMA descriptors
+Driver handles both normal and enhanced descriptors. The latter has been only
+tested on DWC Ether MAC 10/100/1000 Universal version 3.41a.
+
+4.6) Ethtool support
+Ethtool is supported. Driver statistics and internal errors can be taken using:
+ethtool -S ethX command. It is possible to dump registers etc.
+
+4.7) Jumbo and Segmentation Offloading
+Jumbo frames are supported and tested for the GMAC.
+The GSO has been also added but it's performed in software.
+LRO is not supported.
+
+4.8) Physical
+The driver is compatible with PAL to work with PHY and GPHY devices.
+
+4.9) Platform information
+Several information came from the platform; please refer to the
+driver's Header file in include/linux directory.
+
+struct plat_stmmacenet_data {
+        int bus_id;
+        int pbl;
+        int has_gmac;
+        void (*fix_mac_speed)(void *priv, unsigned int speed);
+        void (*bus_setup)(unsigned long ioaddr);
+#ifdef CONFIG_STM_DRIVERS
+        struct stm_pad_config *pad_config;
+#endif
+        void *bsp_priv;
+};
+
+Where:
+- pbl (Programmable Burst Length) is maximum number of
+  beats to be transferred in one DMA transaction.
+  GMAC also enables the 4xPBL by default.
+- fix_mac_speed and bus_setup are used to configure internal target
+  registers (on STM platforms);
+- has_gmac: GMAC core is on board (get it at run-time in the next step);
+- bus_id: bus identifier.
+
+struct plat_stmmacphy_data {
+        int bus_id;
+        int phy_addr;
+        unsigned int phy_mask;
+        int interface;
+        int (*phy_reset)(void *priv);
+        void *priv;
+};
+
+Where:
+- bus_id: bus identifier;
+- phy_addr: physical address used for the attached phy device;
+            set it to -1 to get it at run-time;
+- interface: physical MII interface mode;
+- phy_reset: hook to reset HW function.
+
+TODO:
+- Continue to make the driver more generic and suitable for other Synopsys
+  Ethernet controllers used on other architectures (i.e. ARM).
+- 10G controllers are not supported.
+- MAC uses Normal descriptors and GMAC uses enhanced ones.
+  This is a limit that should be reviewed. MAC could want to
+  use the enhanced structure.
+- Checksumming: Rx/Tx csum is done in HW in case of GMAC only.
+- Review the timer optimisation code to use an embedded device that seems to be
+  available in new chip generations.
-- 
1.6.0.4


^ permalink raw reply related

* [PATCH 2/2] acenic: use the dma state API instead of the pci equivalents
From: FUJITA Tomonori @ 2010-04-01  8:13 UTC (permalink / raw)
  To: linux-acenic; +Cc: netdev, fujita.tomonori
In-Reply-To: <1270109586-9193-1-git-send-email-fujita.tomonori@lab.ntt.co.jp>

The DMA API is preferred.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
---
 drivers/net/acenic.c |   26 +++++++++++++-------------
 drivers/net/acenic.h |    6 +++---
 2 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/drivers/net/acenic.c b/drivers/net/acenic.c
index 02a4746..1328eb9 100644
--- a/drivers/net/acenic.c
+++ b/drivers/net/acenic.c
@@ -661,7 +661,7 @@ static void __devexit acenic_remove_one(struct pci_dev *pdev)
 			dma_addr_t mapping;
 
 			ringp = &ap->skb->rx_std_skbuff[i];
-			mapping = pci_unmap_addr(ringp, mapping);
+			mapping = dma_unmap_addr(ringp, mapping);
 			pci_unmap_page(ap->pdev, mapping,
 				       ACE_STD_BUFSIZE,
 				       PCI_DMA_FROMDEVICE);
@@ -681,7 +681,7 @@ static void __devexit acenic_remove_one(struct pci_dev *pdev)
 				dma_addr_t mapping;
 
 				ringp = &ap->skb->rx_mini_skbuff[i];
-				mapping = pci_unmap_addr(ringp,mapping);
+				mapping = dma_unmap_addr(ringp,mapping);
 				pci_unmap_page(ap->pdev, mapping,
 					       ACE_MINI_BUFSIZE,
 					       PCI_DMA_FROMDEVICE);
@@ -700,7 +700,7 @@ static void __devexit acenic_remove_one(struct pci_dev *pdev)
 			dma_addr_t mapping;
 
 			ringp = &ap->skb->rx_jumbo_skbuff[i];
-			mapping = pci_unmap_addr(ringp, mapping);
+			mapping = dma_unmap_addr(ringp, mapping);
 			pci_unmap_page(ap->pdev, mapping,
 				       ACE_JUMBO_BUFSIZE,
 				       PCI_DMA_FROMDEVICE);
@@ -1683,7 +1683,7 @@ static void ace_load_std_rx_ring(struct ace_private *ap, int nr_bufs)
 				       ACE_STD_BUFSIZE,
 				       PCI_DMA_FROMDEVICE);
 		ap->skb->rx_std_skbuff[idx].skb = skb;
-		pci_unmap_addr_set(&ap->skb->rx_std_skbuff[idx],
+		dma_unmap_addr_set(&ap->skb->rx_std_skbuff[idx],
 				   mapping, mapping);
 
 		rd = &ap->rx_std_ring[idx];
@@ -1744,7 +1744,7 @@ static void ace_load_mini_rx_ring(struct ace_private *ap, int nr_bufs)
 				       ACE_MINI_BUFSIZE,
 				       PCI_DMA_FROMDEVICE);
 		ap->skb->rx_mini_skbuff[idx].skb = skb;
-		pci_unmap_addr_set(&ap->skb->rx_mini_skbuff[idx],
+		dma_unmap_addr_set(&ap->skb->rx_mini_skbuff[idx],
 				   mapping, mapping);
 
 		rd = &ap->rx_mini_ring[idx];
@@ -1800,7 +1800,7 @@ static void ace_load_jumbo_rx_ring(struct ace_private *ap, int nr_bufs)
 				       ACE_JUMBO_BUFSIZE,
 				       PCI_DMA_FROMDEVICE);
 		ap->skb->rx_jumbo_skbuff[idx].skb = skb;
-		pci_unmap_addr_set(&ap->skb->rx_jumbo_skbuff[idx],
+		dma_unmap_addr_set(&ap->skb->rx_jumbo_skbuff[idx],
 				   mapping, mapping);
 
 		rd = &ap->rx_jumbo_ring[idx];
@@ -2013,7 +2013,7 @@ static void ace_rx_int(struct net_device *dev, u32 rxretprd, u32 rxretcsm)
 		skb = rip->skb;
 		rip->skb = NULL;
 		pci_unmap_page(ap->pdev,
-			       pci_unmap_addr(rip, mapping),
+			       dma_unmap_addr(rip, mapping),
 			       mapsize,
 			       PCI_DMA_FROMDEVICE);
 		skb_put(skb, retdesc->size);
@@ -2085,7 +2085,7 @@ static inline void ace_tx_int(struct net_device *dev,
 
 		if (dma_unmap_len(info, maplen)) {
 			pci_unmap_page(ap->pdev, dma_unmap_addr(info, mapping),
-				       pci_unmap_len(info, maplen),
+				       dma_unmap_len(info, maplen),
 				       PCI_DMA_TODEVICE);
 			dma_unmap_len_set(info, maplen, 0);
 		}
@@ -2392,7 +2392,7 @@ static int ace_close(struct net_device *dev)
 				memset(ap->tx_ring + i, 0,
 				       sizeof(struct tx_desc));
 			pci_unmap_page(ap->pdev, dma_unmap_addr(info, mapping),
-				       pci_unmap_len(info, maplen),
+				       dma_unmap_len(info, maplen),
 				       PCI_DMA_TODEVICE);
 			dma_unmap_len_set(info, maplen, 0);
 		}
@@ -2429,8 +2429,8 @@ ace_map_tx_skb(struct ace_private *ap, struct sk_buff *skb,
 
 	info = ap->skb->tx_skbuff + idx;
 	info->skb = tail;
-	pci_unmap_addr_set(info, mapping, mapping);
-	pci_unmap_len_set(info, maplen, skb->len);
+	dma_unmap_addr_set(info, mapping, mapping);
+	dma_unmap_len_set(info, maplen, skb->len);
 	return mapping;
 }
 
@@ -2549,8 +2549,8 @@ restart:
 			} else {
 				info->skb = NULL;
 			}
-			pci_unmap_addr_set(info, mapping, mapping);
-			pci_unmap_len_set(info, maplen, frag->size);
+			dma_unmap_addr_set(info, mapping, mapping);
+			dma_unmap_len_set(info, maplen, frag->size);
 			ace_load_tx_bd(ap, desc, mapping, flagsize, vlan_tag);
 		}
 	}
diff --git a/drivers/net/acenic.h b/drivers/net/acenic.h
index 17079b9..0681da7 100644
--- a/drivers/net/acenic.h
+++ b/drivers/net/acenic.h
@@ -589,7 +589,7 @@ struct ace_info {
 
 struct ring_info {
 	struct sk_buff		*skb;
-	DECLARE_PCI_UNMAP_ADDR(mapping)
+	DEFINE_DMA_UNMAP_ADDR(mapping);
 };
 
 
@@ -600,8 +600,8 @@ struct ring_info {
  */
 struct tx_ring_info {
 	struct sk_buff		*skb;
-	DECLARE_PCI_UNMAP_ADDR(mapping)
-	DECLARE_PCI_UNMAP_LEN(maplen)
+	DEFINE_DMA_UNMAP_ADDR(mapping);
+	DEFINE_DMA_UNMAP_LEN(maplen);
 };
 
 
-- 
1.7.0


^ permalink raw reply related

* [PATCH 1/2] acenic: fix the misusage of zero dma address
From: FUJITA Tomonori @ 2010-04-01  8:13 UTC (permalink / raw)
  To: linux-acenic; +Cc: netdev, fujita.tomonori

acenic wrongly assumes that zero is an invalid dma address (calls
dma_unmap_page for only non zero dma addresses). Zero is a valid dma
address on some architectures. The dma length can be used here.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
---
 drivers/net/acenic.c |   16 ++++++----------
 1 files changed, 6 insertions(+), 10 deletions(-)

diff --git a/drivers/net/acenic.c b/drivers/net/acenic.c
index 97a3dfd..02a4746 100644
--- a/drivers/net/acenic.c
+++ b/drivers/net/acenic.c
@@ -2078,18 +2078,16 @@ static inline void ace_tx_int(struct net_device *dev,
 
 	do {
 		struct sk_buff *skb;
-		dma_addr_t mapping;
 		struct tx_ring_info *info;
 
 		info = ap->skb->tx_skbuff + idx;
 		skb = info->skb;
-		mapping = pci_unmap_addr(info, mapping);
 
-		if (mapping) {
-			pci_unmap_page(ap->pdev, mapping,
+		if (dma_unmap_len(info, maplen)) {
+			pci_unmap_page(ap->pdev, dma_unmap_addr(info, mapping),
 				       pci_unmap_len(info, maplen),
 				       PCI_DMA_TODEVICE);
-			pci_unmap_addr_set(info, mapping, 0);
+			dma_unmap_len_set(info, maplen, 0);
 		}
 
 		if (skb) {
@@ -2377,14 +2375,12 @@ static int ace_close(struct net_device *dev)
 
 	for (i = 0; i < ACE_TX_RING_ENTRIES(ap); i++) {
 		struct sk_buff *skb;
-		dma_addr_t mapping;
 		struct tx_ring_info *info;
 
 		info = ap->skb->tx_skbuff + i;
 		skb = info->skb;
-		mapping = pci_unmap_addr(info, mapping);
 
-		if (mapping) {
+		if (dma_unmap_len(info, maplen)) {
 			if (ACE_IS_TIGON_I(ap)) {
 				/* NB: TIGON_1 is special, tx_ring is in io space */
 				struct tx_desc __iomem *tx;
@@ -2395,10 +2391,10 @@ static int ace_close(struct net_device *dev)
 			} else
 				memset(ap->tx_ring + i, 0,
 				       sizeof(struct tx_desc));
-			pci_unmap_page(ap->pdev, mapping,
+			pci_unmap_page(ap->pdev, dma_unmap_addr(info, mapping),
 				       pci_unmap_len(info, maplen),
 				       PCI_DMA_TODEVICE);
-			pci_unmap_addr_set(info, mapping, 0);
+			dma_unmap_len_set(info, maplen, 0);
 		}
 		if (skb) {
 			dev_kfree_skb(skb);
-- 
1.7.0


^ permalink raw reply related

* [PATCH net-next-2.6] ipv6 fib: Make ip6_fib{} more cache-line aware.
From: YOSHIFUJI Hideaki @ 2010-04-01  8:18 UTC (permalink / raw)
  To: davem; +Cc: yoshfuji, netdev

Because elements at the end of dst_entry{} are frequently
updated, it is not good to put frequently-used static
elements, such as rt6i_idev, rt6i_dst or rt6i_flags in the
same cache line.

On the other hand, fib6_table, rt6i_node or rt6i_gateway are
rarely used, so it is okay to stay in the same cache line.

Let's rearrange ip6_fib{}.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
---
 include/net/ip6_fib.h |   29 ++++++++++++++++-------------
 1 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 86f46c4..4b1dc11 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -88,34 +88,37 @@ struct rt6_info {
 		struct dst_entry	dst;
 	} u;
 
-	struct inet6_dev		*rt6i_idev;
-
 #define rt6i_dev			u.dst.dev
 #define rt6i_nexthop			u.dst.neighbour
 #define rt6i_expires			u.dst.expires
 
+	/*
+	 * Tail elements of dst_entry (__refcnt etc.)
+	 * and these elements (rarely used in hot path) are in
+	 * the same cache line.
+	 */
+	struct fib6_table		*rt6i_table;
 	struct fib6_node		*rt6i_node;
 
 	struct in6_addr			rt6i_gateway;
-	
-	u32				rt6i_flags;
-	u32				rt6i_metric;
-	atomic_t			rt6i_ref;
 
-	/* more non-fragment space at head required */
-	unsigned short			rt6i_nfheader_len;
-
-	u8				rt6i_protocol;
+	atomic_t			rt6i_ref;
 
-	struct fib6_table		*rt6i_table;
+	/* These are in a separate cache line. */
+	struct rt6key			rt6i_dst ____cacheline_aligned_in_smp;
+	u32				rt6i_flags;
+	struct rt6key			rt6i_src;
+	u32				rt6i_metric;
 
-	struct rt6key			rt6i_dst;
+	struct inet6_dev		*rt6i_idev;
 
 #ifdef CONFIG_XFRM
 	u32				rt6i_flow_cache_genid;
 #endif
+	/* more non-fragment space at head required */
+	unsigned short			rt6i_nfheader_len;
 
-	struct rt6key			rt6i_src;
+	u8				rt6i_protocol;
 };
 
 static inline struct inet6_dev *ip6_dst_idev(struct dst_entry *dst)
-- 
1.5.6.5


^ permalink raw reply related

* [PATCH net-next-2.6] ipv6 fib: Make rt6_info{} more cache-line aware.
From: YOSHIFUJI Hideaki @ 2010-04-01  8:24 UTC (permalink / raw)
  To: davem; +Cc: yoshfuji, netdev

The head element of rt6_info{} is dst_entry{}, and
IPv6 specific elements follow.

Because elements at the end of dst_entry{} are frequently
updated, it is not good to put frequently-used static
elements, such as rt6i_idev, rt6i_dst or rt6i_flags in the
same cache line.

On the other hand, fib6_table, rt6i_node or rt6i_gateway are
rarely used, so it is okay to stay in the same cache line.

Let's rearrange rt6_info{}.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
---
 include/net/ip6_fib.h |   29 ++++++++++++++++-------------
 1 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 86f46c4..4b1dc11 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -88,34 +88,37 @@ struct rt6_info {
 		struct dst_entry	dst;
 	} u;
 
-	struct inet6_dev		*rt6i_idev;
-
 #define rt6i_dev			u.dst.dev
 #define rt6i_nexthop			u.dst.neighbour
 #define rt6i_expires			u.dst.expires
 
+	/*
+	 * Tail elements of dst_entry (__refcnt etc.)
+	 * and these elements (rarely used in hot path) are in
+	 * the same cache line.
+	 */
+	struct fib6_table		*rt6i_table;
 	struct fib6_node		*rt6i_node;
 
 	struct in6_addr			rt6i_gateway;
-	
-	u32				rt6i_flags;
-	u32				rt6i_metric;
-	atomic_t			rt6i_ref;
 
-	/* more non-fragment space at head required */
-	unsigned short			rt6i_nfheader_len;
-
-	u8				rt6i_protocol;
+	atomic_t			rt6i_ref;
 
-	struct fib6_table		*rt6i_table;
+	/* These are in a separate cache line. */
+	struct rt6key			rt6i_dst ____cacheline_aligned_in_smp;
+	u32				rt6i_flags;
+	struct rt6key			rt6i_src;
+	u32				rt6i_metric;
 
-	struct rt6key			rt6i_dst;
+	struct inet6_dev		*rt6i_idev;
 
 #ifdef CONFIG_XFRM
 	u32				rt6i_flow_cache_genid;
 #endif
+	/* more non-fragment space at head required */
+	unsigned short			rt6i_nfheader_len;
 
-	struct rt6key			rt6i_src;
+	u8				rt6i_protocol;
 };
 
 static inline struct inet6_dev *ip6_dst_idev(struct dst_entry *dst)
-- 
1.5.6.5

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox