Netdev List


* Re: [PATCH net-next V2 1/3] iwmc3200top: Add Intel Wireless MultiCom 3200 top driver.
From: Tomas Winkler @ 2009-10-20 11:53 UTC (permalink / raw)
  To: David Miller, Marcel Holtmann
  Cc: linville, netdev, linux-wireless, linux-mmc, yi.zhu,
	inaky.perez-gonzalez, cindy.h.kao, guy.cohen, ron.rindjunsky
In-Reply-To: <20091019.215457.43252934.davem@davemloft.net>

On Tue, Oct 20, 2009 at 6:54 AM, David Miller <davem@davemloft.net> wrote:
> From: Tomas Winkler <tomas.winkler@intel.com>
> Date: Sat, 17 Oct 2009 21:09:34 +0200
>
>> This patch adds Intel Wireless MultiCom 3200 top driver.
>> IWMC3200 is 4Wireless Com CHIP (GPS/BT/WiFi/WiMAX).
>> Top driver is responsible for device initialization and firmware download.
>> Firmware handled by top is responsible for top itself
>> as well as the Bluetooth and GPS coms. (WiFi and WiMAX provide their own firmware.)
>> In addition, the top driver is used to retrieve firmware logs
>> and supports other debugging features.
>>
>> Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
>
> Applied to net-next-2.6

Thanks Dave

Marcel,
I want to send out the BT driver now. Would you like the patch against
bluetooth-next-2.6.git, in which case I will wait until you sync, or can you
pick it up from net-next if Dave is OK with that?

Thanks
Tomas


* [PATCH] ifb: should not use __dev_get_by_index() without locks
From: Eric Dumazet @ 2009-10-20 12:35 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20091019.212855.179405364.davem@davemloft.net>

David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 20 Oct 2009 06:23:54 +0200
> 
>> I wonder if the whole thing could use RCU somehow, since some
>> workloads hit this dev_base_lock rwlock pretty hard...
> 
> True, but for now we'll put your fix in :-)

Here is another vulnerable spot, which needs the following patch.

Thanks

[PATCH] ifb: should not use __dev_get_by_index() without locks

At this point (ri_tasklet()), neither RTNL nor dev_base_lock is held,
so we must use dev_get_by_index() instead of __dev_get_by_index().

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---

diff --git a/drivers/net/ifb.c b/drivers/net/ifb.c
index 801f088..030913f 100644
--- a/drivers/net/ifb.c
+++ b/drivers/net/ifb.c
@@ -98,12 +98,13 @@ static void ri_tasklet(unsigned long dev)
 		stats->tx_packets++;
 		stats->tx_bytes +=skb->len;
 
-		skb->dev = __dev_get_by_index(&init_net, skb->iif);
+		skb->dev = dev_get_by_index(&init_net, skb->iif);
 		if (!skb->dev) {
 			dev_kfree_skb(skb);
 			stats->tx_dropped++;
 			break;
 		}
+		dev_put(skb->dev);
 		skb->iif = _dev->ifindex;
 
 		if (from & AT_EGRESS) {
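
The difference between the two lookups can be modelled in userspace. This is only a minimal sketch, not kernel code: `struct netdev`, `lookup_locked()`, `lookup_held()` and `netdev_put()` are hypothetical stand-ins for `struct net_device`, `__dev_get_by_index()`, `dev_get_by_index()` and `dev_put()`, and a pthread mutex stands in for dev_base_lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for struct net_device. */
struct netdev {
	int ifindex;
	int refcnt;
};

static struct netdev devices[] = { { 1, 0 }, { 2, 0 } };
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

/* Models __dev_get_by_index(): the caller must already hold table_lock. */
static struct netdev *lookup_locked(int ifindex)
{
	size_t i;

	for (i = 0; i < sizeof(devices) / sizeof(devices[0]); i++)
		if (devices[i].ifindex == ifindex)
			return &devices[i];
	return NULL;
}

/* Models dev_get_by_index(): takes the lock itself and returns a held reference. */
static struct netdev *lookup_held(int ifindex)
{
	struct netdev *dev;

	pthread_mutex_lock(&table_lock);
	dev = lookup_locked(ifindex);
	if (dev)
		dev->refcnt++;
	pthread_mutex_unlock(&table_lock);
	return dev;
}

/* Models dev_put(): drops the reference taken by lookup_held(). */
static void netdev_put(struct netdev *dev)
{
	pthread_mutex_lock(&table_lock);
	assert(dev->refcnt > 0);
	dev->refcnt--;
	pthread_mutex_unlock(&table_lock);
}
```

A caller without any lock uses `lookup_held()` and pairs every non-NULL result with `netdev_put()`, which is the pairing the patch introduces in ri_tasklet().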



* Policy routing + route "via" gives a strange behavior
From: Guido Trotter @ 2009-10-20 13:28 UTC (permalink / raw)
  To: netdev

Hi,

I'm seeing what I think might be a strange kernel behavior when setting up a
route "via" a gateway, with policy routing. When adding a route with a gateway,
the kernel accepts it only if the gateway is reachable via that device. For
example:

ip route add default dev eth1 via 192.168.5.254

is only accepted if there is a route like:
192.168.5.0/24 dev eth1 scope link

in the main routing table, which, of course, is OK: otherwise the kernel
wouldn't be able to reach 192.168.5.254 in the first place.

Now, when adding policy routing to the mix, if I do:
ip route add table 100 default dev eth1 via 192.168.5.254

This is also refused unless a route like the one before appears in the default
table, even though it does appear in table 100. Is this the right behavior, and
if so, why? It seems to me that it should be acceptable to have the network
route only in the separate routing table, since the "via" will only be
used by traffic hitting that table anyway.
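
The behaviour described above can be modelled as follows. This is only a toy sketch of what is being reported, not the kernel's implementation: `struct link_route`, `prefix_covers()` and `gateway_reachable()` are hypothetical names, and the only fact borrowed from the real system is that the main table has id 254.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TABLE_MAIN 254	/* id of the main routing table */
#define IP4(a, b, c, d) \
	(((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
	 ((uint32_t)(c) << 8) | (uint32_t)(d))

/* Hypothetical model of a "prefix dev X scope link" route. */
struct link_route {
	int table;
	uint32_t prefix;	/* host byte order */
	int prefix_len;
	const char *dev;
};

static bool prefix_covers(const struct link_route *r, uint32_t addr)
{
	uint32_t mask;

	assert(r->prefix_len >= 0 && r->prefix_len <= 32);
	mask = r->prefix_len ? ~UINT32_C(0) << (32 - r->prefix_len) : 0;
	return (addr & mask) == (r->prefix & mask);
}

/* True if some link route in lookup_table covers the gateway on dev. */
static bool gateway_reachable(const struct link_route *routes, size_t n,
			      int lookup_table, uint32_t gw, const char *dev)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (routes[i].table == lookup_table &&
		    strcmp(routes[i].dev, dev) == 0 &&
		    prefix_covers(&routes[i], gw))
			return true;
	return false;
}
```

With the link route present only in table 100, a check done against `TABLE_MAIN` fails (the reported behaviour), while the same check done against table 100 succeeds (the behaviour being asked for).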

Thanks a lot,

Guido


* Re: kernel panic in latest vanilla stable, while using nameif with "alive" pppoe interfaces
From: Cyrill Gorcunov @ 2009-10-20 13:42 UTC (permalink / raw)
  To: Denys Fedoryschenko
  Cc: Michal Ostrowski, Eric Dumazet, netdev, linux-ppp, paulus,
	mostrows
In-Reply-To: <200910201452.58789.denys@visp.net.lb>

[Denys Fedoryschenko - Tue, Oct 20, 2009 at 02:52:58PM +0300]
...
| > Thanks a lot! I'll back with new one in a couple of hours. Meanwhile i
| > suppose you may try Michal's patch as well.
| I did, it didn't help.
| Maybe I can enable some debugging options in the kernel?
| I can also add debug (printk) lines to the kernel if you want, to see where
| the bug appears.
| Note, I told Michal already, so I will say it here too: this PC is a
| hyperthreading P4, which as far as I know is very good at triggering various
| SMP race conditions.
| I can also try it with nosmp if you want.
| 

Thanks Denys, I'm preparing a new patch (I just got back from the office
and had no internet connection, which is why the reply is delayed, sorry).

	-- Cyrill


* Re: kernel panic in latest vanilla stable, while using nameif with "alive" pppoe interfaces
From: Denys Fedoryschenko @ 2009-10-20 13:50 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Michal Ostrowski, Eric Dumazet, netdev, linux-ppp, paulus,
	mostrows
In-Reply-To: <20091020134217.GA5181@lenovo>

On Tuesday 20 October 2009 16:42:17 Cyrill Gorcunov wrote:
> [Denys Fedoryschenko - Tue, Oct 20, 2009 at 02:52:58PM +0300]
> ...
>
> | > Thanks a lot! I'll back with new one in a couple of hours. Meanwhile i
> | > suppose you may try Michal's patch as well.
> |
> | I did, it didn't help.
> | Maybe I can enable some debugging options in the kernel?
> | I can also add debug (printk) lines to the kernel if you want, to see where
> | the bug appears.
> | Note, I told Michal already, so I will say it here too: this PC is a
> | hyperthreading P4, which as far as I know is very good at triggering
> | various SMP race conditions.
> | I can also try it with nosmp if you want.
>
> Thanks Denys, I'm preparing a new patch (I just got back from the office
> and had no internet connection, which is why the reply is delayed, sorry).
No problem at all.
The rename operation is only planned for the future and the host is redundant,
so I can run tests on it anytime.



* Re: [PATCHv2 1/4] First Patch on TFRC-SP. Copy base files from TFRC
From: Ivo Calado @ 2009-10-20 13:51 UTC (permalink / raw)
  To: Gerrit Renker, Ivo Calado, dccp, netdev
In-Reply-To: <20091019052153.GB3366@gerrit.erg.abdn.ac.uk>

On Mon, Oct 19, 2009 at 02:21, Gerrit Renker <gerrit@erg.abdn.ac.uk> wrote:
> | First Patch on TFRC-SP.
> Please find attached one edit that I made.
>
> I added unwinding the initialisation of tfrc_lib in the case where the
> initialisation of tfrc_sp_lib fails.  Unwinding is now done in the reverse
> order of the steps done during initialisation.
>

Agree.
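
For readers following along, the reverse-order unwinding described in the quoted text can be sketched like this. It is a minimal userspace illustration, not the actual patch: `lib_a_*` and `lib_b_*` are hypothetical stand-ins for the tfrc_lib and tfrc_sp_lib init/exit routines, and the `fail_b` parameter exists only to simulate a failure of the second step.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state flags for the two libraries. */
static int lib_a_up, lib_b_up;

static int lib_a_init(void)
{
	lib_a_up = 1;
	return 0;
}

static void lib_a_exit(void)
{
	lib_a_up = 0;
}

static int lib_b_init(bool fail)
{
	if (fail)
		return -1;
	lib_b_up = 1;
	return 0;
}

static void lib_b_exit(void)
{
	lib_b_up = 0;
}

/* Initialise A, then B; if B fails, undo A before returning the error. */
static int module_init_sketch(bool fail_b)
{
	int rc;

	rc = lib_a_init();
	if (rc)
		goto out;
	rc = lib_b_init(fail_b);
	if (rc)
		goto out_undo_a;
	return 0;

out_undo_a:
	lib_a_exit();
out:
	return rc;
}

/* Tear down in the reverse order of initialisation: B first, then A. */
static void module_exit_sketch(void)
{
	assert(lib_a_up && lib_b_up);
	lib_b_exit();
	lib_a_exit();
}
```

Each additional init step would add one more label above `out_undo_a`, so the labels read bottom-up in the same order as the init calls read top-down.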



-- 
Ivo Augusto Andrade Rocha Calado
MSc. Candidate
Embedded Systems and Pervasive Computing Lab - http://embedded.ufcg.edu.br
Systems and Computing Department - http://www.dsc.ufcg.edu.br
Electrical Engineering and Informatics Center - http://www.ceei.ufcg.edu.br
Federal University of Campina Grande - http://www.ufcg.edu.br

PGP: 0x03422935
Quidquid latine dictum sit, altum viditur.


* Re: kernel panic in latest vanilla stable, while using nameif with "alive" pppoe interfaces
From: Cyrill Gorcunov @ 2009-10-20 13:59 UTC (permalink / raw)
  To: Denys Fedoryschenko
  Cc: Michal Ostrowski, Eric Dumazet, netdev, linux-ppp, paulus,
	mostrows
In-Reply-To: <200910201650.10066.denys@visp.net.lb>

[Denys Fedoryschenko - Tue, Oct 20, 2009 at 04:50:09PM +0300]
|
...
| >
| > Thanks Denys, I'm preparing a new patch (I just got back from the office
| > and had no internet connection, which is why the reply is delayed, sorry).
| No problem at all.
| The rename operation is only planned for the future and the host is redundant,
| so I can run tests on it anytime.
| 

OK, here it is, please try it (it's still a draft version though)

	-- Cyrill
---
 drivers/net/pppoe.c |  106 +++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 81 insertions(+), 25 deletions(-)

Index: linux-2.6.git/drivers/net/pppoe.c
=====================================================================
--- linux-2.6.git.orig/drivers/net/pppoe.c
+++ linux-2.6.git/drivers/net/pppoe.c
@@ -313,8 +313,8 @@ static void pppoe_flush_dev(struct net_d
 			sk = sk_pppox(po);
 			spin_lock(&flush_lock);
 			po->pppoe_dev = NULL;
-			spin_unlock(&flush_lock);
 			dev_put(dev);
+			spin_unlock(&flush_lock);
 
 			/* We always grab the socket lock, followed by the
 			 * hash_lock, in that order.  Since we should
@@ -386,13 +386,21 @@ static struct notifier_block pppoe_notif
 static int pppoe_rcv_core(struct sock *sk, struct sk_buff *skb)
 {
 	struct pppox_sock *po = pppox_sk(sk);
-	struct pppox_sock *relay_po;
+	struct pppox_sock *relay_po = NULL;
+	struct net_device *dev = NULL;
 
 	if (sk->sk_state & PPPOX_BOUND) {
 		ppp_input(&po->chan, skb);
 	} else if (sk->sk_state & PPPOX_RELAY) {
-		relay_po = get_item_by_addr(dev_net(po->pppoe_dev),
-						&po->pppoe_relay);
+		struct pppoe_net *pn = pppoe_pernet(sock_net(sk));
+		read_lock_bh(&pn->hash_lock);
+		dev = po->pppoe_dev;
+		if (dev) {
+			dev_hold(dev);
+			relay_po = get_item_by_addr(dev_net(dev),
+					&po->pppoe_relay);
+		}
+		read_unlock_bh(&pn->hash_lock);
 		if (relay_po == NULL)
 			goto abort_kfree;
 
@@ -401,6 +409,7 @@ static int pppoe_rcv_core(struct sock *s
 
 		if (!__pppoe_xmit(sk_pppox(relay_po), skb))
 			goto abort_put;
+		dev_put(dev);
 	} else {
 		if (sock_queue_rcv_skb(sk, skb))
 			goto abort_kfree;
@@ -412,6 +421,8 @@ abort_put:
 	sock_put(sk_pppox(relay_po));
 
 abort_kfree:
+	if (dev)
+		dev_put(dev);
 	kfree_skb(skb);
 	return NET_RX_DROP;
 }
@@ -625,8 +636,8 @@ static int pppoe_connect(struct socket *
 	struct sock *sk = sock->sk;
 	struct sockaddr_pppox *sp = (struct sockaddr_pppox *)uservaddr;
 	struct pppox_sock *po = pppox_sk(sk);
-	struct net_device *dev;
-	struct pppoe_net *pn;
+	struct net_device *dev = NULL;
+	struct pppoe_net *pn = NULL;
 	int error;
 
 	lock_sock(sk);
@@ -652,12 +663,15 @@ static int pppoe_connect(struct socket *
 	/* Delete the old binding */
 	if (stage_session(po->pppoe_pa.sid)) {
 		pppox_unbind_sock(sk);
+		spin_lock(&flush_lock);
 		if (po->pppoe_dev) {
 			pn = pppoe_pernet(dev_net(po->pppoe_dev));
 			delete_item(pn, po->pppoe_pa.sid,
 				po->pppoe_pa.remote, po->pppoe_ifindex);
 			dev_put(po->pppoe_dev);
+			po->pppoe_dev = NULL;
 		}
+		spin_unlock(&flush_lock);
 		memset(sk_pppox(po) + 1, 0,
 		       sizeof(struct pppox_sock) - sizeof(struct sock));
 		sk->sk_state = PPPOX_NONE;
@@ -670,10 +684,11 @@ static int pppoe_connect(struct socket *
 		if (!dev)
 			goto end;
 
+		write_lock_bh(&pn->hash_lock);
+		dev_hold(dev);
 		po->pppoe_dev = dev;
 		po->pppoe_ifindex = dev->ifindex;
 		pn = pppoe_pernet(dev_net(dev));
-		write_lock_bh(&pn->hash_lock);
 		if (!(dev->flags & IFF_UP)) {
 			write_unlock_bh(&pn->hash_lock);
 			goto err_put;
@@ -700,6 +715,7 @@ static int pppoe_connect(struct socket *
 			goto err_put;
 
 		sk->sk_state = PPPOX_CONNECTED;
+		dev_put(dev);
 	}
 
 	po->num = sp->sa_addr.pppoe.sid;
@@ -708,10 +724,13 @@ end:
 	release_sock(sk);
 	return error;
 err_put:
+	dev_put(dev);
+	write_lock_bh(&pn->hash_lock);
 	if (po->pppoe_dev) {
 		dev_put(po->pppoe_dev);
 		po->pppoe_dev = NULL;
 	}
+	write_unlock_bh(&pn->hash_lock);
 	goto end;
 }
 
@@ -738,6 +757,8 @@ static int pppoe_ioctl(struct socket *so
 {
 	struct sock *sk = sock->sk;
 	struct pppox_sock *po = pppox_sk(sk);
+	struct pppoe_net *pn = pppoe_pernet(sock_net(sk));
+	unsigned int mtu = 0;
 	int val;
 	int err;
 
@@ -746,11 +767,17 @@ static int pppoe_ioctl(struct socket *so
 		err = -ENXIO;
 		if (!(sk->sk_state & PPPOX_CONNECTED))
 			break;
-
+		read_lock_bh(&pn->hash_lock);
+		err = -ENODEV;
+		if (po->pppoe_dev) {
+			mtu = po->pppoe_dev->mtu;
+			err = 0;
+		}
+		read_unlock_bh(&pn->hash_lock);
+		if (err)
+			break;
 		err = -EFAULT;
-		if (put_user(po->pppoe_dev->mtu -
-			     sizeof(struct pppoe_hdr) -
-			     PPP_HDRLEN,
+		if (put_user(mtu - sizeof(struct pppoe_hdr) - PPP_HDRLEN,
 			     (int __user *)arg))
 			break;
 		err = 0;
@@ -761,13 +788,21 @@ static int pppoe_ioctl(struct socket *so
 		if (!(sk->sk_state & PPPOX_CONNECTED))
 			break;
 
+		read_lock_bh(&pn->hash_lock);
+		err = -ENODEV;
+		if (po->pppoe_dev) {
+			mtu = po->pppoe_dev->mtu;
+			err = 0;
+		}
+		read_unlock_bh(&pn->hash_lock);
+		if (err)
+			break;
+
 		err = -EFAULT;
 		if (get_user(val, (int __user *)arg))
 			break;
 
-		if (val < (po->pppoe_dev->mtu
-			   - sizeof(struct pppoe_hdr)
-			   - PPP_HDRLEN))
+		if (val < (mtu - sizeof(struct pppoe_hdr) - PPP_HDRLEN))
 			err = 0;
 		else
 			err = -EINVAL;
@@ -839,10 +874,11 @@ static int pppoe_sendmsg(struct kiocb *i
 	struct sk_buff *skb;
 	struct sock *sk = sock->sk;
 	struct pppox_sock *po = pppox_sk(sk);
+	struct pppoe_net *pn = pppoe_pernet(sock_net(sk));
 	int error;
 	struct pppoe_hdr hdr;
 	struct pppoe_hdr *ph;
-	struct net_device *dev;
+	struct net_device *dev = NULL;
 	char *start;
 
 	lock_sock(sk);
@@ -856,18 +892,27 @@ static int pppoe_sendmsg(struct kiocb *i
 	hdr.code = 0;
 	hdr.sid = po->num;
 
-	dev = po->pppoe_dev;
+	read_lock_bh(&pn->hash_lock);
+	error = -ENODEV;
+	if (po->pppoe_dev) {
+		dev = po->pppoe_dev;
+		dev_hold(dev);
+		error = 0;
+	}
+	read_unlock_bh(&pn->hash_lock);
+	if (error)
+		goto end;
 
 	error = -EMSGSIZE;
 	if (total_len > (dev->mtu + dev->hard_header_len))
-		goto end;
+		goto end_put;
 
 
 	skb = sock_wmalloc(sk, total_len + dev->hard_header_len + 32,
 			   0, GFP_KERNEL);
 	if (!skb) {
 		error = -ENOMEM;
-		goto end;
+		goto end_put;
 	}
 
 	/* Reserve space for headers. */
@@ -885,7 +930,7 @@ static int pppoe_sendmsg(struct kiocb *i
 	error = memcpy_fromiovec(start, m->msg_iov, total_len);
 	if (error < 0) {
 		kfree_skb(skb);
-		goto end;
+		goto end_put;
 	}
 
 	error = total_len;
@@ -898,6 +943,8 @@ static int pppoe_sendmsg(struct kiocb *i
 
 	dev_queue_xmit(skb);
 
+end_put:
+	dev_put(dev);
 end:
 	release_sock(sk);
 	return error;
@@ -911,21 +958,28 @@ end:
 static int __pppoe_xmit(struct sock *sk, struct sk_buff *skb)
 {
 	struct pppox_sock *po = pppox_sk(sk);
-	struct net_device *dev = po->pppoe_dev;
+	struct pppoe_net *pn = pppoe_pernet(sock_net(sk));
+	struct net_device *dev;
 	struct pppoe_hdr *ph;
 	int data_len = skb->len;
 
-	if (sock_flag(sk, SOCK_DEAD) || !(sk->sk_state & PPPOX_CONNECTED))
+	read_lock_bh(&pn->hash_lock);
+	if (!po->pppoe_dev) {
+		read_unlock_bh(&pn->hash_lock);
 		goto abort;
+	}
+	dev = po->pppoe_dev;
+	dev_hold(dev);
+	read_unlock_bh(&pn->hash_lock);
 
-	if (!dev)
-		goto abort;
+	if (sock_flag(sk, SOCK_DEAD) || !(sk->sk_state & PPPOX_CONNECTED))
+		goto abort_put;
 
 	/* Copy the data if there is no space for the header or if it's
 	 * read-only.
 	 */
 	if (skb_cow_head(skb, sizeof(*ph) + dev->hard_header_len))
-		goto abort;
+		goto abort_put;
 
 	__skb_push(skb, sizeof(*ph));
 	skb_reset_network_header(skb);
@@ -944,9 +998,11 @@ static int __pppoe_xmit(struct sock *sk,
 			po->pppoe_pa.remote, NULL, data_len);
 
 	dev_queue_xmit(skb);
-
+	dev_put(dev);
 	return 1;
 
+abort_put:
+	dev_put(dev);
 abort:
 	kfree_skb(skb);
 	return 1;
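
The idea behind the pppoe_ioctl() hunks above (copy what is needed from the device while a lock keeps the pointer stable, then work on the copy) can be sketched in userspace. This is an illustration under stated assumptions, not the driver code: `struct dev_model`, `struct sock_model`, `get_ppp_mru()` and `flush_dev()` are hypothetical, a pthread mutex stands in for hash_lock, and the two header lengths are assumed values for `sizeof(struct pppoe_hdr)` and `PPP_HDRLEN`.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

#define PPPOE_HDR_LEN 6	/* assumption: sizeof(struct pppoe_hdr) without tags */
#define PPP_HDR_LEN   4	/* assumption: value of PPP_HDRLEN */

struct dev_model {
	unsigned int mtu;
};

struct sock_model {
	pthread_mutex_t lock;	/* stands in for pn->hash_lock */
	struct dev_model *dev;	/* stands in for po->pppoe_dev */
};

/* Snapshot the MTU under the lock, then compute on the private copy. */
static int get_ppp_mru(struct sock_model *po, int *mru)
{
	unsigned int mtu;

	assert(po && mru);
	pthread_mutex_lock(&po->lock);
	if (!po->dev) {
		pthread_mutex_unlock(&po->lock);
		return -ENODEV;
	}
	mtu = po->dev->mtu;
	pthread_mutex_unlock(&po->lock);

	*mru = (int)mtu - PPPOE_HDR_LEN - PPP_HDR_LEN;
	return 0;
}

/* Models the flush path: the pointer is cleared under the same lock. */
static void flush_dev(struct sock_model *po)
{
	pthread_mutex_lock(&po->lock);
	po->dev = NULL;
	pthread_mutex_unlock(&po->lock);
}
```

Because the reader never dereferences `po->dev` outside the lock, a concurrent `flush_dev()` can only make it return `-ENODEV`, never touch a stale pointer. Paths that need the device itself for longer, such as the sendmsg and xmit paths in the patch, take a reference under the lock instead of copying a field.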


* Re: kernel panic in latest vanilla stable, while using nameif with "alive" pppoe interfaces
From: Denys Fedoryschenko @ 2009-10-20 14:20 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Michal Ostrowski, Eric Dumazet, netdev, linux-ppp, paulus,
	mostrows
In-Reply-To: <20091020135920.GB5181@lenovo>

It panics almost immediately on boot (even on old operations that were stable;
it seems to happen on the first PPPoE customer login attempt). I will rebuild
the kernel and, if it is of interest, will try to capture the panic message.

On Tuesday 20 October 2009 16:59:20 Cyrill Gorcunov wrote:
> [Denys Fedoryschenko - Tue, Oct 20, 2009 at 04:50:09PM +0300]
>
> ...
>
> | > Thanks Denys, I'm preparing a new patch (I just got back from the office
> | > and had no internet connection, which is why the reply is delayed, sorry).
> |
> | No problem at all.
> | The rename operation is only planned for the future and the host is
> | redundant, so I can run tests on it anytime.
>
> OK, here it is, please try it (it's still a draft version though)
>
> 	-- Cyrill
> ...




* Re: kernel panic in latest vanilla stable, while using nameif with "alive" pppoe interfaces
From: Cyrill Gorcunov @ 2009-10-20 14:23 UTC (permalink / raw)
  To: Denys Fedoryschenko
  Cc: Michal Ostrowski, Eric Dumazet, netdev, linux-ppp, paulus,
	mostrows
In-Reply-To: <200910201720.00473.denys@visp.net.lb>

[Denys Fedoryschenko - Tue, Oct 20, 2009 at 05:20:00PM +0300]
|
| It panics almost immediately on boot (even on old operations that were stable;
| it seems to happen on the first PPPoE customer login attempt). I will rebuild
| the kernel and, if it is of interest, will try to capture the panic message.
| 
...

OK, thanks. I'll continue digging.


* [PATCH RFC] Per route TCP options
From: Gilad Ben-Yossef @ 2009-10-20 15:22 UTC (permalink / raw)
  To: netdev; +Cc: ori, Gilad Ben-Yossef

Turn the global sysctls that allow disabling TCP SACK, DSACK,
time stamp and window scale into per route entry feature options,
laying the groundwork for future removal of the relevant global sysctls.

You really only want to disable SACK, DSACK, time stamp or window
scale if you've got a piece of broken networking equipment somewhere,
as a stop gap until you can bring a big enough hammer to deal with
the broken network equipment. It doesn't make sense to "punish" all
the connections going through the machine to destinations not
related to the broken equipment.

This is doubly true when you're dealing with network containers
used to isolate several virtual domains.

The per route options are implemented in free bits in the features route
entry property, which in some cases were reserved by name for these
options, so this does not inflate any structure, and I expect that
when the appropriate global sysctls are removed the overall code
base will be smaller.
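
To illustrate the idea of a per route feature bit, here is a minimal userspace sketch. It is not taken from the patches: the `FEATURE_NO_*` names and bit values, `struct route_model`, `route_has_feature()` and `sack_allowed()` are all hypothetical and only show how a bit in a route's feature word can veto an option that the global setting still allows.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature bits, for illustration only. */
enum route_feature {
	FEATURE_NO_SACK   = 1u << 0,
	FEATURE_NO_TSTAMP = 1u << 1,
	FEATURE_NO_WSCALE = 1u << 2,
	FEATURE_NO_DSACK  = 1u << 3,
};

/* Hypothetical model of a route entry carrying a feature word. */
struct route_model {
	uint32_t features;
};

/* A missing route (NULL) has no per route restrictions. */
static bool route_has_feature(const struct route_model *rt, uint32_t feature)
{
	assert(feature != 0);
	return rt != NULL && (rt->features & feature) != 0;
}

/* SACK is used only if the global knob allows it and the route does not veto it. */
static bool sack_allowed(bool sysctl_sack, const struct route_model *rt)
{
	return sysctl_sack && !route_has_feature(rt, FEATURE_NO_SACK);
}
```

Only connections whose route carries the bit lose the option. Every other destination keeps it, which is the point of moving the switch from a global sysctl to the route.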

Tested on x86 using Qemu/KVM.  

Will send the matching patch to iproute2 if/when this is ACKed or
if someone wants to test this.

Patchset based on original work by Ori Finkelman and Yoni Amit 
from ComSleep Ltd.

Gilad Ben-Yossef (8):
  Only parse time stamp TCP option in time wait sock
  Allow tcp_parse_options to consult dst entry
  Infrastructure for querying route entry features
  Add the no SACK route option feature
  Allow disabling TCP timestamp options per route
  Allow to turn off TCP window scale opt per route
  Allow disabling of DSACK TCP option per route
  Document future removal of sysctl_tcp_* options

 Documentation/feature-removal-schedule.txt |   12 ++++++++++++
 include/linux/rtnetlink.h                  |    6 ++++--
 include/net/dst.h                          |    8 +++++++-
 include/net/tcp.h                          |    3 ++-
 net/ipv4/syncookies.c                      |   27 ++++++++++++++-------------
 net/ipv4/tcp_input.c                       |   26 ++++++++++++++++++--------
 net/ipv4/tcp_ipv4.c                        |   19 ++++++++++---------
 net/ipv4/tcp_minisocks.c                   |    8 +++++---
 net/ipv4/tcp_output.c                      |   18 +++++++++++++-----
 net/ipv6/syncookies.c                      |   28 +++++++++++++++-------------
 net/ipv6/tcp_ipv6.c                        |    3 ++-
 11 files changed, 102 insertions(+), 56 deletions(-)


* [PATCH RFC] Allow tcp_parse_options to consult dst entry
From: Gilad Ben-Yossef @ 2009-10-20 15:22 UTC (permalink / raw)
  To: netdev; +Cc: ori, Gilad Ben-Yossef
In-Reply-To: <1256052161-14156-2-git-send-email-gilad@codefidence.com>

We need tcp_parse_options to be aware of the dst_entry so that it can
take per dst_entry TCP option settings into account.

Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Signed-off-by: Ori Finkelman <ori@comsleep.com>
Signed-off-by: Yony Amit <yony@comsleep.com>

---
 include/net/tcp.h        |    3 ++-
 net/ipv4/syncookies.c    |   27 ++++++++++++++-------------
 net/ipv4/tcp_input.c     |    9 ++++++---
 net/ipv4/tcp_ipv4.c      |   19 ++++++++++---------
 net/ipv4/tcp_minisocks.c |    7 +++++--
 net/ipv6/syncookies.c    |   28 +++++++++++++++-------------
 net/ipv6/tcp_ipv6.c      |    3 ++-
 7 files changed, 54 insertions(+), 42 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 03a49c7..740d09b 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -409,7 +409,8 @@ extern int			tcp_recvmsg(struct kiocb *iocb, struct sock *sk,
 
 extern void			tcp_parse_options(struct sk_buff *skb,
 						  struct tcp_options_received *opt_rx,
-						  int estab);
+						  int estab,
+						  struct dst_entry *dst);
 
 extern u8			*tcp_parse_md5sig_option(struct tcphdr *th);
 
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index a6e0e07..4990dd4 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -276,13 +276,6 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
 
 	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_SYNCOOKIESRECV);
 
-	/* check for timestamp cookie support */
-	memset(&tcp_opt, 0, sizeof(tcp_opt));
-	tcp_parse_options(skb, &tcp_opt, 0);
-
-	if (tcp_opt.saw_tstamp)
-		cookie_check_timestamp(&tcp_opt);
-
 	ret = NULL;
 	req = inet_reqsk_alloc(&tcp_request_sock_ops); /* for safety */
 	if (!req)
@@ -298,12 +291,6 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
 	ireq->loc_addr		= ip_hdr(skb)->daddr;
 	ireq->rmt_addr		= ip_hdr(skb)->saddr;
 	ireq->ecn_ok		= 0;
-	ireq->snd_wscale	= tcp_opt.snd_wscale;
-	ireq->rcv_wscale	= tcp_opt.rcv_wscale;
-	ireq->sack_ok		= tcp_opt.sack_ok;
-	ireq->wscale_ok		= tcp_opt.wscale_ok;
-	ireq->tstamp_ok		= tcp_opt.saw_tstamp;
-	req->ts_recent		= tcp_opt.saw_tstamp ? tcp_opt.rcv_tsval : 0;
 
 	/* We throwed the options of the initial SYN away, so we hope
 	 * the ACK carries the same options again (see RFC1122 4.2.3.8)
@@ -351,6 +338,20 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
 		}
 	}
 
+	/* check for timestamp cookie support */
+	memset(&tcp_opt, 0, sizeof(tcp_opt));
+	tcp_parse_options(skb, &tcp_opt, 0, &rt->u.dst);
+
+	if (tcp_opt.saw_tstamp)
+		cookie_check_timestamp(&tcp_opt);
+
+	ireq->snd_wscale        = tcp_opt.snd_wscale;
+	ireq->rcv_wscale        = tcp_opt.rcv_wscale;
+	ireq->sack_ok           = tcp_opt.sack_ok;
+	ireq->wscale_ok         = tcp_opt.wscale_ok;
+	ireq->tstamp_ok         = tcp_opt.saw_tstamp;
+	req->ts_recent          = tcp_opt.saw_tstamp ? tcp_opt.rcv_tsval : 0;
+
 	/* Try to redo what tcp_v4_send_synack did. */
 	req->window_clamp = tp->window_clamp ? :dst_metric(&rt->u.dst, RTAX_WINDOW);
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index d86784b..d502f49 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3698,12 +3698,14 @@ old_ack:
  * the fast version below fails.
  */
 void tcp_parse_options(struct sk_buff *skb, struct tcp_options_received *opt_rx,
-		       int estab)
+		       int estab,  struct dst_entry *dst)
 {
 	unsigned char *ptr;
 	struct tcphdr *th = tcp_hdr(skb);
 	int length = (th->doff * 4) - sizeof(struct tcphdr);
 
+	BUG_ON(!estab && !dst);
+
 	ptr = (unsigned char *)(th + 1);
 	opt_rx->saw_tstamp = 0;
 
@@ -3820,7 +3822,7 @@ static int tcp_fast_parse_options(struct sk_buff *skb, struct tcphdr *th,
 		if (tcp_parse_aligned_timestamp(tp, th))
 			return 1;
 	}
-	tcp_parse_options(skb, &tp->rx_opt, 1);
+	tcp_parse_options(skb, &tp->rx_opt, 1, NULL);
 	return 1;
 }
 
@@ -5364,8 +5366,9 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	int saved_clamp = tp->rx_opt.mss_clamp;
+	struct dst_entry *dst = __sk_dst_get(sk);
 
-	tcp_parse_options(skb, &tp->rx_opt, 0);
+	tcp_parse_options(skb, &tp->rx_opt, 0, dst);
 
 	if (th->ack) {
 		/* rfc793:
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 7cda24b..1cb0ec4 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1256,11 +1256,18 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 	tcp_rsk(req)->af_specific = &tcp_request_sock_ipv4_ops;
 #endif
 
+	ireq = inet_rsk(req);
+	ireq->loc_addr = daddr;
+	ireq->rmt_addr = saddr;
+	ireq->no_srccheck = inet_sk(sk)->transparent;
+	ireq->opt = tcp_v4_save_options(sk, skb);
+
+	dst = inet_csk_route_req(sk, req);
 	tcp_clear_options(&tmp_opt);
 	tmp_opt.mss_clamp = 536;
 	tmp_opt.user_mss  = tcp_sk(sk)->rx_opt.user_mss;
 
-	tcp_parse_options(skb, &tmp_opt, 0);
+	tcp_parse_options(skb, &tmp_opt, 0, dst);
 
 	if (want_cookie && !tmp_opt.saw_tstamp)
 		tcp_clear_options(&tmp_opt);
@@ -1269,14 +1276,8 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 
 	tcp_openreq_init(req, &tmp_opt, skb);
 
-	ireq = inet_rsk(req);
-	ireq->loc_addr = daddr;
-	ireq->rmt_addr = saddr;
-	ireq->no_srccheck = inet_sk(sk)->transparent;
-	ireq->opt = tcp_v4_save_options(sk, skb);
-
 	if (security_inet_conn_request(sk, skb, req))
-		goto drop_and_free;
+		goto drop_and_release;
 
 	if (!want_cookie)
 		TCP_ECN_create_request(req, tcp_hdr(skb));
@@ -1301,7 +1302,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 		 */
 		if (tmp_opt.saw_tstamp &&
 		    tcp_death_row.sysctl_tw_recycle &&
-		    (dst = inet_csk_route_req(sk, req)) != NULL &&
+		    dst != NULL &&
 		    (peer = rt_get_peer((struct rtable *)dst)) != NULL &&
 		    peer->v4daddr == saddr) {
 			if (get_seconds() < peer->tcp_ts_stamp + TCP_PAWS_MSL &&
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index c49a550..70ff955 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -101,7 +101,7 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
 	int paws_reject = 0;
 
 	if (th->doff > (sizeof(*th) >> 2) && tcptw->tw_ts_recent_stamp) {
-		tcp_parse_options(skb, &tmp_opt, 1);
+		tcp_parse_options(skb, &tmp_opt, 1, NULL);
 
 		if (tmp_opt.saw_tstamp) {
 			tmp_opt.ts_recent	= tcptw->tw_ts_recent;
@@ -499,10 +499,11 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 	int paws_reject = 0;
 	struct tcp_options_received tmp_opt;
 	struct sock *child;
+	struct dst_entry *dst = inet_csk_route_req(sk, req);
 
 	tmp_opt.saw_tstamp = 0;
 	if (th->doff > (sizeof(struct tcphdr)>>2)) {
-		tcp_parse_options(skb, &tmp_opt, 0);
+		tcp_parse_options(skb, &tmp_opt, 0, dst);
 
 		if (tmp_opt.saw_tstamp) {
 			tmp_opt.ts_recent = req->ts_recent;
@@ -515,6 +516,8 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 		}
 	}
 
+	dst_release(dst);
+
 	/* Check for pure retransmitted SYN. */
 	if (TCP_SKB_CB(skb)->seq == tcp_rsk(req)->rcv_isn &&
 	    flg == TCP_FLAG_SYN &&
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index 6b6ae91..6ece408 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -184,13 +184,6 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 
 	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_SYNCOOKIESRECV);
 
-	/* check for timestamp cookie support */
-	memset(&tcp_opt, 0, sizeof(tcp_opt));
-	tcp_parse_options(skb, &tcp_opt, 0);
-
-	if (tcp_opt.saw_tstamp)
-		cookie_check_timestamp(&tcp_opt);
-
 	ret = NULL;
 	req = inet6_reqsk_alloc(&tcp6_request_sock_ops);
 	if (!req)
@@ -224,12 +217,6 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 	req->expires = 0UL;
 	req->retrans = 0;
 	ireq->ecn_ok		= 0;
-	ireq->snd_wscale	= tcp_opt.snd_wscale;
-	ireq->rcv_wscale	= tcp_opt.rcv_wscale;
-	ireq->sack_ok		= tcp_opt.sack_ok;
-	ireq->wscale_ok		= tcp_opt.wscale_ok;
-	ireq->tstamp_ok		= tcp_opt.saw_tstamp;
-	req->ts_recent		= tcp_opt.saw_tstamp ? tcp_opt.rcv_tsval : 0;
 	treq->rcv_isn = ntohl(th->seq) - 1;
 	treq->snt_isn = cookie;
 
@@ -264,6 +251,21 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 			goto out_free;
 	}
 
+	/* check for timestamp cookie support */
+	memset(&tcp_opt, 0, sizeof(tcp_opt));
+	tcp_parse_options(skb, &tcp_opt, 0, dst);
+
+	if (tcp_opt.saw_tstamp)
+		cookie_check_timestamp(&tcp_opt);
+
+	req->ts_recent          = tcp_opt.saw_tstamp ? tcp_opt.rcv_tsval : 0;
+
+	ireq->snd_wscale        = tcp_opt.snd_wscale;
+	ireq->rcv_wscale        = tcp_opt.rcv_wscale;
+	ireq->sack_ok           = tcp_opt.sack_ok;
+	ireq->wscale_ok         = tcp_opt.wscale_ok;
+	ireq->tstamp_ok         = tcp_opt.saw_tstamp;
+
 	req->window_clamp = tp->window_clamp ? :dst_metric(dst, RTAX_WINDOW);
 	tcp_select_initial_window(tcp_full_space(sk), req->mss,
 				  &req->rcv_wnd, &req->window_clamp,
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 21d100b..2eebab5 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1165,6 +1165,7 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct request_sock *req = NULL;
 	__u32 isn = TCP_SKB_CB(skb)->when;
+	struct dst_entry *dst = __sk_dst_get(sk);
 #ifdef CONFIG_SYN_COOKIES
 	int want_cookie = 0;
 #else
@@ -1203,7 +1204,7 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 	tmp_opt.mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) - sizeof(struct ipv6hdr);
 	tmp_opt.user_mss = tp->rx_opt.user_mss;
 
-	tcp_parse_options(skb, &tmp_opt, 0);
+	tcp_parse_options(skb, &tmp_opt, 0, dst);
 
 	if (want_cookie && !tmp_opt.saw_tstamp)
 		tcp_clear_options(&tmp_opt);
-- 
1.5.6.3


^ permalink raw reply related

* [PATCH RFC] Allow disabling of DSACK TCP option per route
From: Gilad Ben-Yossef @ 2009-10-20 15:22 UTC (permalink / raw)
  To: netdev; +Cc: ori, Gilad Ben-Yossef
In-Reply-To: <1256052161-14156-7-git-send-email-gilad@codefidence.com>

Add and use a no-DSACK bit in the features field.

Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Signed-off-by: Ori Finkelman <ori@comsleep.com>
Signed-off-by: Yony Amit <yony@comsleep.com>

---
 include/linux/rtnetlink.h |    1 +
 net/ipv4/tcp_input.c      |    8 ++++++--
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 6784b34..e78b60c 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -381,6 +381,7 @@ enum
 #define RTAX_FEATURE_NO_TSTAMP	0x00000004
 #define RTAX_FEATURE_ALLFRAG	0x00000008
 #define RTAX_FEATURE_NO_WSCALE	0x00000010
+#define RTAX_FEATURE_NO_DSACK	0x00000020
 
 struct rta_session
 {
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 4f5e914..4262da5 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4080,8 +4080,10 @@ static inline int tcp_sack_extend(struct tcp_sack_block *sp, u32 seq,
 static void tcp_dsack_set(struct sock *sk, u32 seq, u32 end_seq)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
+	struct dst_entry *dst = __sk_dst_get(sk);
 
-	if (tcp_is_sack(tp) && sysctl_tcp_dsack) {
+	if (tcp_is_sack(tp) && sysctl_tcp_dsack &&
+	    !dst_feature(dst, RTAX_FEATURE_NO_DSACK)) {
 		int mib_idx;
 
 		if (before(seq, tp->rcv_nxt))
@@ -4110,13 +4112,15 @@ static void tcp_dsack_extend(struct sock *sk, u32 seq, u32 end_seq)
 static void tcp_send_dupack(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
+	struct dst_entry *dst = __sk_dst_get(sk);
 
 	if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
 	    before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt)) {
 		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_DELAYEDACKLOST);
 		tcp_enter_quickack_mode(sk);
 
-		if (tcp_is_sack(tp) && sysctl_tcp_dsack) {
+		if (tcp_is_sack(tp) && sysctl_tcp_dsack &&
+		    !dst_feature(dst, RTAX_FEATURE_NO_DSACK)) {
 			u32 end_seq = TCP_SKB_CB(skb)->end_seq;
 
 			if (after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt))
-- 
1.5.6.3
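
A minimal userspace sketch of the three-way condition this patch puts in
front of DSACK generation: SACK negotiated, the global sysctl still on, and
the route not carrying the no-DSACK bit. The struct mock_conn type and the
dsack_allowed() helper are hypothetical names used only for illustration;
the bit value is the one from the rtnetlink.h hunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RTAX_FEATURE_NO_DSACK 0x00000020u	/* value from the rtnetlink.h hunk */

/* Hypothetical stand-in for the socket state the kernel code consults. */
struct mock_conn {
	bool sack_ok;			/* SACK was negotiated (tcp_is_sack) */
	uint32_t route_features;	/* RTAX_FEATURES bits of the route */
};

/* DSACK is used only if every one of the three switches permits it. */
static inline bool dsack_allowed(const struct mock_conn *c, bool sysctl_dsack)
{
	assert(c != NULL);
	return c->sack_ok && sysctl_dsack &&
	       !(c->route_features & RTAX_FEATURE_NO_DSACK);
}
```

The route bit can only turn DSACK off; it cannot re-enable it when the
global sysctl is 0.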


^ permalink raw reply related

* [PATCH RFC] Document future removal of sysctl_tcp_* options
From: Gilad Ben-Yossef @ 2009-10-20 15:22 UTC (permalink / raw)
  To: netdev; +Cc: ori, Gilad Ben-Yossef
In-Reply-To: <1256052161-14156-8-git-send-email-gilad@codefidence.com>

No need for global kill switches if we have per route entry controls.
Wait a year before removing in case someone is using this.

Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>

---
 Documentation/feature-removal-schedule.txt |   12 ++++++++++++
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index 89a47b5..60db855 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -6,6 +6,18 @@ be removed from this file.
 
 ---------------------------
 
+What:	sysctl_tcp_sack, sysctl_tcp_timestamps, sysctl_tcp_window_scaling,
+	sysctl_tcp_dsack
+When:	October 2010
+
+Why:	These options can now be set on a per route basis via the
+	RTAX_FEATURE_NO_SACK, RTAX_FEATURE_NO_TSTAMP, RTAX_FEATURE_NO_WSCALE,
+	and RTAX_FEATURE_NO_DSACK route feature options.
+
+Who:	Gilad Ben-Yossef <gilad@codefidence.com>
+
+---------------------------
+
 What:	PRISM54
 When:	2.6.34
 
-- 
1.5.6.3


^ permalink raw reply related

* [PATCH RFC] Only parse time stamp TCP option in time wait sock
From: Gilad Ben-Yossef @ 2009-10-20 15:22 UTC (permalink / raw)
  To: netdev; +Cc: ori, Gilad Ben-Yossef, Yony Amit
In-Reply-To: <1256052161-14156-1-git-send-email-gilad@codefidence.com>

A time-wait socket belongs to an already established connection, so we
already know whether the time stamp option was negotiated.

Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Signed-off-by: Ori Finkelman <ori@comsleep.com>
Signed-off-by: Yony Amit <yony@comsleep.com>

---
 net/ipv4/tcp_minisocks.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 624c3c9..c49a550 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -100,9 +100,8 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
 	struct tcp_options_received tmp_opt;
 	int paws_reject = 0;
 
-	tmp_opt.saw_tstamp = 0;
 	if (th->doff > (sizeof(*th) >> 2) && tcptw->tw_ts_recent_stamp) {
-		tcp_parse_options(skb, &tmp_opt, 0);
+		tcp_parse_options(skb, &tmp_opt, 1);
 
 		if (tmp_opt.saw_tstamp) {
 			tmp_opt.ts_recent	= tcptw->tw_ts_recent;
-- 
1.5.6.3


^ permalink raw reply related

* [PATCH RFC] Add the no SACK route option feature
From: Gilad Ben-Yossef @ 2009-10-20 15:22 UTC (permalink / raw)
  To: netdev; +Cc: ori, Gilad Ben-Yossef
In-Reply-To: <1256052161-14156-4-git-send-email-gilad@codefidence.com>

Implement querying and acting upon the no sack bit in the features
field.

Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Signed-off-by: Ori Finkelman <ori@comsleep.com>
Signed-off-by: Yony Amit <yony@comsleep.com>

---
 include/linux/rtnetlink.h |    2 +-
 net/ipv4/tcp_input.c      |    3 ++-
 net/ipv4/tcp_output.c     |    4 +++-
 3 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index adf2068..9c802a6 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -377,7 +377,7 @@ enum
 #define RTAX_MAX (__RTAX_MAX - 1)
 
 #define RTAX_FEATURE_ECN	0x00000001
-#define RTAX_FEATURE_SACK	0x00000002
+#define RTAX_FEATURE_NO_SACK	0x00000002
 #define RTAX_FEATURE_TIMESTAMP	0x00000004
 #define RTAX_FEATURE_ALLFRAG	0x00000008
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index d502f49..b14f780 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3763,7 +3763,8 @@ void tcp_parse_options(struct sk_buff *skb, struct tcp_options_received *opt_rx,
 				break;
 			case TCPOPT_SACK_PERM:
 				if (opsize == TCPOLEN_SACK_PERM && th->syn &&
-				    !estab && sysctl_tcp_sack) {
+				    !estab && sysctl_tcp_sack &&
+				    !dst_feature(dst, RTAX_FEATURE_NO_SACK)) {
 					opt_rx->sack_ok = 1;
 					tcp_sack_reset(opt_rx);
 				}
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index fcd278a..64db8dd 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -464,6 +464,7 @@ static unsigned tcp_syn_options(struct sock *sk, struct sk_buff *skb,
 				struct tcp_md5sig_key **md5) {
 	struct tcp_sock *tp = tcp_sk(sk);
 	unsigned size = 0;
+	struct dst_entry *dst = __sk_dst_get(sk);
 
 #ifdef CONFIG_TCP_MD5SIG
 	*md5 = tp->af_specific->md5_lookup(sk, sk);
@@ -498,7 +499,8 @@ static unsigned tcp_syn_options(struct sock *sk, struct sk_buff *skb,
 		opts->options |= OPTION_WSCALE;
 		size += TCPOLEN_WSCALE_ALIGNED;
 	}
-	if (likely(sysctl_tcp_sack)) {
+	if (likely(sysctl_tcp_sack &&
+		   !dst_feature(dst, RTAX_FEATURE_NO_SACK))) {
 		opts->options |= OPTION_SACK_ADVERTISE;
 		if (unlikely(!(OPTION_TS & opts->options)))
 			size += TCPOLEN_SACKPERM_ALIGNED;
-- 
1.5.6.3
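
To illustrate the receive side, here is a rough userspace sketch of walking
the option bytes of a SYN and accepting SACK-permitted only when neither the
global switch nor the route forbids it. It is not the kernel's
tcp_parse_options(); syn_sack_ok() is a hypothetical helper, and only the
option kinds and lengths (EOL 0, NOP 1, SACK-permitted 4 with length 2) and
the feature bit are taken as given.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TCPOPT_EOL		0
#define TCPOPT_NOP		1
#define TCPOPT_SACK_PERM	4
#define TCPOLEN_SACK_PERM	2
#define RTAX_FEATURE_NO_SACK	0x00000002u	/* value from the rtnetlink.h hunk */

/* Returns true if the SYN carries SACK-permitted and we may honour it. */
static inline bool syn_sack_ok(const uint8_t *opts, size_t len,
			       bool sysctl_sack, uint32_t route_features)
{
	size_t i = 0;

	assert(opts != NULL || len == 0);
	while (i < len) {
		uint8_t kind = opts[i];
		uint8_t opsize;

		if (kind == TCPOPT_EOL)
			break;
		if (kind == TCPOPT_NOP) {	/* single-byte padding */
			i++;
			continue;
		}
		if (i + 1 >= len)		/* no room for the length byte */
			break;
		opsize = opts[i + 1];
		if (opsize < 2 || i + opsize > len)	/* malformed option */
			break;
		if (kind == TCPOPT_SACK_PERM && opsize == TCPOLEN_SACK_PERM)
			return sysctl_sack &&
			       !(route_features & RTAX_FEATURE_NO_SACK);
		i += opsize;
	}
	return false;
}
```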


^ permalink raw reply related

* [PATCH RFC] Allow disabling TCP timestamp options per route
From: Gilad Ben-Yossef @ 2009-10-20 15:22 UTC (permalink / raw)
  To: netdev; +Cc: ori, Gilad Ben-Yossef
In-Reply-To: <1256052161-14156-5-git-send-email-gilad@codefidence.com>

Implement querying and acting upon the no timestamp bit in the feature 
field.

Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Signed-off-by: Ori Finkelman <ori@comsleep.com>
Signed-off-by: Yony Amit <yony@comsleep.com>

---
 include/linux/rtnetlink.h |    2 +-
 net/ipv4/tcp_input.c      |    3 ++-
 net/ipv4/tcp_output.c     |    8 ++++++--
 3 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 9c802a6..2ab8c75 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -378,7 +378,7 @@ enum
 
 #define RTAX_FEATURE_ECN	0x00000001
 #define RTAX_FEATURE_NO_SACK	0x00000002
-#define RTAX_FEATURE_TIMESTAMP	0x00000004
+#define RTAX_FEATURE_NO_TSTAMP	0x00000004
 #define RTAX_FEATURE_ALLFRAG	0x00000008
 
 struct rta_session
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index b14f780..d2f9742 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3755,7 +3755,8 @@ void tcp_parse_options(struct sk_buff *skb, struct tcp_options_received *opt_rx,
 			case TCPOPT_TIMESTAMP:
 				if ((opsize == TCPOLEN_TIMESTAMP) &&
 				    ((estab && opt_rx->tstamp_ok) ||
-				     (!estab && sysctl_tcp_timestamps))) {
+				     (!estab && sysctl_tcp_timestamps &&
+				      !dst_feature(dst, RTAX_FEATURE_NO_TSTAMP)))) {
 					opt_rx->saw_tstamp = 1;
 					opt_rx->rcv_tsval = get_unaligned_be32(ptr);
 					opt_rx->rcv_tsecr = get_unaligned_be32(ptr + 4);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 64db8dd..8f30c18 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -488,7 +488,9 @@ static unsigned tcp_syn_options(struct sock *sk, struct sk_buff *skb,
 	opts->mss = tcp_advertise_mss(sk);
 	size += TCPOLEN_MSS_ALIGNED;
 
-	if (likely(sysctl_tcp_timestamps && *md5 == NULL)) {
+	if (likely(sysctl_tcp_timestamps &&
+		   !dst_feature(dst, RTAX_FEATURE_NO_TSTAMP) &&
+		   *md5 == NULL)) {
 		opts->options |= OPTION_TS;
 		opts->tsval = TCP_SKB_CB(skb)->when;
 		opts->tsecr = tp->rx_opt.ts_recent;
@@ -2317,7 +2319,9 @@ static void tcp_connect_init(struct sock *sk)
 	 * See tcp_input.c:tcp_rcv_state_process case TCP_SYN_SENT.
 	 */
 	tp->tcp_header_len = sizeof(struct tcphdr) +
-		(sysctl_tcp_timestamps ? TCPOLEN_TSTAMP_ALIGNED : 0);
+		((sysctl_tcp_timestamps &&
+		  !dst_feature(dst, RTAX_FEATURE_NO_TSTAMP)) ?
+		  TCPOLEN_TSTAMP_ALIGNED : 0);
 
 #ifdef CONFIG_TCP_MD5SIG
 	if (tp->af_specific->md5_lookup(sk, sk) != NULL)
-- 
1.5.6.3
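
The tcp_connect_init() header-length expression is sensitive to how the
condition and the ternary are grouped: the ternary has to select on the
whole "sysctl on and route bit clear" condition, otherwise the && collapses
the option length to 0 or 1. A small userspace sketch of both groupings
follows; the helper names and TCP_BASE_HDR_LEN (standing in for
sizeof(struct tcphdr), 20 bytes) are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define TCP_BASE_HDR_LEN	20	/* stands in for sizeof(struct tcphdr) */
#define TCPOLEN_TSTAMP_ALIGNED	12
#define RTAX_FEATURE_NO_TSTAMP	0x00000004u	/* value from the rtnetlink.h hunk */

/* Intended grouping: (sysctl && !route_bit) ? 12 : 0 */
static inline int tstamp_hdr_len(int sysctl_timestamps, uint32_t route_features)
{
	int use_ts = sysctl_timestamps &&
		     !(route_features & RTAX_FEATURE_NO_TSTAMP);
	int len = TCP_BASE_HDR_LEN + (use_ts ? TCPOLEN_TSTAMP_ALIGNED : 0);

	assert(len <= 60);	/* a TCP header never exceeds 60 bytes */
	return len;
}

/* Misgrouped form: sysctl && (cond ? 12 : 0) yields 0 or 1, never 12. */
static inline int tstamp_hdr_len_misgrouped(int sysctl_timestamps,
					    uint32_t route_features)
{
	return TCP_BASE_HDR_LEN +
	       (sysctl_timestamps &&
		(!(route_features & RTAX_FEATURE_NO_TSTAMP) ?
		 TCPOLEN_TSTAMP_ALIGNED : 0));
}
```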


^ permalink raw reply related

* [PATCH RFC] Add dst_feature to query route entry features
From: Gilad Ben-Yossef @ 2009-10-20 15:22 UTC (permalink / raw)
  To: netdev; +Cc: ori, Gilad Ben-Yossef
In-Reply-To: <1256052161-14156-3-git-send-email-gilad@codefidence.com>

Add an accessor for the existing dst_entry features field and
refactor the only supported feature (allfrag) to use it.


Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Signed-off-by: Ori Finkelman <ori@comsleep.com>
Signed-off-by: Yony Amit <yony@comsleep.com>

---
 include/net/dst.h |    8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/include/net/dst.h b/include/net/dst.h
index 5a900dd..b562be3 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -111,6 +111,12 @@ dst_metric(const struct dst_entry *dst, int metric)
 	return dst->metrics[metric-1];
 }
 
+static inline u32
+dst_feature(const struct dst_entry *dst, u32 feature)
+{
+	return dst_metric(dst, RTAX_FEATURES) & feature;
+}
+
 static inline u32 dst_mtu(const struct dst_entry *dst)
 {
 	u32 mtu = dst_metric(dst, RTAX_MTU);
@@ -136,7 +142,7 @@ static inline void set_dst_metric_rtt(struct dst_entry *dst, int metric,
 static inline u32
 dst_allfrag(const struct dst_entry *dst)
 {
-	int ret = dst_metric(dst, RTAX_FEATURES) & RTAX_FEATURE_ALLFRAG;
+	int ret = dst_feature(dst, RTAX_FEATURE_ALLFRAG);
 	/* Yes, _exactly_. This is paranoia. */
 	barrier();
 	return ret;
-- 
1.5.6.3
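
For readers without a kernel tree at hand, the accessor can be tried out in
userspace roughly as below. struct mock_dst and the MOCK_* constants are
stand-ins invented for this sketch (the index 12 for the features metric is
an assumption about the RTAX_* enum); only the masking logic mirrors the
patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_RTAX_FEATURES	12	/* assumed position of RTAX_FEATURES */
#define MOCK_RTAX_MAX		13

#define RTAX_FEATURE_NO_SACK	0x00000002u
#define RTAX_FEATURE_ALLFRAG	0x00000008u

/* Hypothetical stand-in for struct dst_entry: just the metrics array. */
struct mock_dst {
	uint32_t metrics[MOCK_RTAX_MAX];
};

/* Metrics are numbered from 1, so metric N lives in slot N - 1. */
static inline uint32_t mock_dst_metric(const struct mock_dst *dst, int metric)
{
	assert(dst != NULL);
	assert(metric >= 1 && metric <= MOCK_RTAX_MAX);
	return dst->metrics[metric - 1];
}

/* Non-zero if any of the requested feature bits is set on the route. */
static inline uint32_t mock_dst_feature(const struct mock_dst *dst,
					uint32_t feature)
{
	return mock_dst_metric(dst, MOCK_RTAX_FEATURES) & feature;
}
```

Note that the return value is the masked bits, not a normalized 0/1, so
callers should test it for zero rather than compare it against 1.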


^ permalink raw reply related

* [PATCH RFC] Allow to turn off TCP window scale opt per route
From: Gilad Ben-Yossef @ 2009-10-20 15:22 UTC (permalink / raw)
  To: netdev; +Cc: ori, Gilad Ben-Yossef
In-Reply-To: <1256052161-14156-6-git-send-email-gilad@codefidence.com>

Add and use a no-window-scale bit in the features field.

Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Signed-off-by: Ori Finkelman <ori@comsleep.com>
Signed-off-by: Yony Amit <yony@comsleep.com>
---
 include/linux/rtnetlink.h |    1 +
 net/ipv4/tcp_input.c      |    3 ++-
 net/ipv4/tcp_output.c     |    6 ++++--
 3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 2ab8c75..6784b34 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -380,6 +380,7 @@ enum
 #define RTAX_FEATURE_NO_SACK	0x00000002
 #define RTAX_FEATURE_NO_TSTAMP	0x00000004
 #define RTAX_FEATURE_ALLFRAG	0x00000008
+#define RTAX_FEATURE_NO_WSCALE	0x00000010
 
 struct rta_session
 {
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index d2f9742..4f5e914 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3739,7 +3739,8 @@ void tcp_parse_options(struct sk_buff *skb, struct tcp_options_received *opt_rx,
 				break;
 			case TCPOPT_WINDOW:
 				if (opsize == TCPOLEN_WINDOW && th->syn &&
-				    !estab && sysctl_tcp_window_scaling) {
+				    !estab && sysctl_tcp_window_scaling &&
+				    !dst_feature(dst, RTAX_FEATURE_NO_WSCALE)) {
 					__u8 snd_wscale = *(__u8 *)ptr;
 					opt_rx->wscale_ok = 1;
 					if (snd_wscale > 14) {
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 8f30c18..ff60a21 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -496,7 +496,8 @@ static unsigned tcp_syn_options(struct sock *sk, struct sk_buff *skb,
 		opts->tsecr = tp->rx_opt.ts_recent;
 		size += TCPOLEN_TSTAMP_ALIGNED;
 	}
-	if (likely(sysctl_tcp_window_scaling)) {
+	if (likely(sysctl_tcp_window_scaling &&
+		   !dst_feature(dst, RTAX_FEATURE_NO_WSCALE))) {
 		opts->ws = tp->rx_opt.rcv_wscale;
 		opts->options |= OPTION_WSCALE;
 		size += TCPOLEN_WSCALE_ALIGNED;
@@ -2347,7 +2348,8 @@ static void tcp_connect_init(struct sock *sk)
 				  tp->advmss - (tp->rx_opt.ts_recent_stamp ? tp->tcp_header_len - sizeof(struct tcphdr) : 0),
 				  &tp->rcv_wnd,
 				  &tp->window_clamp,
-				  sysctl_tcp_window_scaling,
+				  (sysctl_tcp_window_scaling &&
+				   !dst_feature(dst, RTAX_FEATURE_NO_WSCALE)),
 				  &rcv_wscale);
 
 	tp->rx_opt.rcv_wscale = rcv_wscale;
-- 
1.5.6.3
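
With SACK, timestamp and window scale all gated per route, the effect on the
SYN option space can be sketched as follows. This is a simplified userspace
model of the size accounting visible in the tcp_syn_options() hunks, assuming
the global sysctls are all enabled and no MD5 key is in use;
syn_options_size() is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

#define TCPOLEN_MSS_ALIGNED		4
#define TCPOLEN_TSTAMP_ALIGNED		12
#define TCPOLEN_WSCALE_ALIGNED		4
#define TCPOLEN_SACKPERM_ALIGNED	4

#define RTAX_FEATURE_NO_SACK	0x00000002u
#define RTAX_FEATURE_NO_TSTAMP	0x00000004u
#define RTAX_FEATURE_NO_WSCALE	0x00000010u

/* Bytes of TCP options a SYN would carry for a route with these features. */
static inline unsigned int syn_options_size(uint32_t route_features)
{
	unsigned int size = TCPOLEN_MSS_ALIGNED;	/* MSS is always sent */
	int ts = !(route_features & RTAX_FEATURE_NO_TSTAMP);

	if (ts)
		size += TCPOLEN_TSTAMP_ALIGNED;
	if (!(route_features & RTAX_FEATURE_NO_WSCALE))
		size += TCPOLEN_WSCALE_ALIGNED;
	/* SACK-permitted fits into the timestamp padding when TS is sent. */
	if (!(route_features & RTAX_FEATURE_NO_SACK) && !ts)
		size += TCPOLEN_SACKPERM_ALIGNED;

	assert(size <= 40);	/* TCP option space is at most 40 bytes */
	return size;
}
```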


^ permalink raw reply related

* Re: [PATCH RFC] Per route TCP options
From: Eric Dumazet @ 2009-10-20 15:44 UTC (permalink / raw)
  To: Gilad Ben-Yossef; +Cc: netdev, ori
In-Reply-To: <1256052161-14156-1-git-send-email-gilad@codefidence.com>

Gilad Ben-Yossef wrote:
> Turn the global sysctls allowing disabling of TCP SACK, DSACK,
> time stamp and window scale into per route entry feature options,
> laying the ground to future removal of the relevant global sysctls.
> 
> You really only want to disable SACK, DSACK, time stamp or window
> scale if you've got a piece of broken networking equipment somewhere 
> as a stop gap until you can bring a big enough hammer to deal with
> the broken network equipment. It doesn't make sense to "punish" the
> entire connections going through the machine to destinations not 
> related to the broken equipment.
> 
> This is doubly true when you're dealing with network containers
> used to isolate several virtual domains.
> 
> Per route options implemented in free bits in the features route
> entry property, which in some cases were reserved by name for these
> options, so this does not inflate any structure and I expect that
> when the appropriate global sysctls will be removed the overall code
> base will be smaller.
> 
> Tested on x86 using Qemu/KVM.  
> 
> Will send the matching patch to iproute2 if/when this is ACKed or
> if someone wants to test this.
> 
> Patchset based on original work by Ori Finkelman and Yoni Amit 
> from ComSleep Ltd.
> 
> Gilad Ben-Yossef (8):
>   Only parse time stamp TCP option in time wait sock
>   Allow tcp_parse_options to consult dst entry
>   Infrastructure for querying route entry features
>   Add the no SACK route option feature
>   Allow disabling TCP timestamp options per route
>   Allow to turn off TCP window scale opt per route
>   Allow disabling of DSACK TCP option per route
>   Document future removal of sysctl_tcp_* options
> 

Interesting... But you should give numbers to your patches so that we know their order

You could also do the ECN part for consistency (ie RTAX_FEATURE_ECN -> RTAX_FEATURE_NO_ECN)

And please post iproute2 patches as well :)

Thanks

^ permalink raw reply

* Re: [PATCH RFC] Per route TCP options
From: Gilad Ben-Yossef @ 2009-10-20 16:11 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, ori
In-Reply-To: <4ADDDAFB.5040600@gmail.com>

Eric Dumazet wrote:

> Gilad Ben-Yossef wrote:
>   
>> Turn the global sysctls allowing disabling of TCP SACK, DSACK,
>> time stamp and window scale into per route entry feature options,
>> laying the ground to future removal of the relevant global sysctls.
>>
>> ...
> Interesting... But you should give numbers to your patches so that we know their order
>   
Will do.
> You could also do the ECN part for consistency (ie RTAX_FEATURE_ECN -> RTAX_FEATURE_NO_ECN)
>   
That and  sysctl_tcp_mtu_probing are on my todo list, assuming the 
general direction is accepted, of course.

Specifically, I couldn't understand why sysctl_tcp_ecn is documented to 
be a boolean value, but is initialized to 2 and queried with if 
(sysctl_tcp_ecn == 1) so I decided to let it be until I figure it out... 
;-)
> And please post iproute2 patches as well :)
>   
Will do. Note that the patch I have still has an ugly bit that needs
addressing: it sets the features by name, but right now "show" only
displays them as numeric values. This will of course be fixed, but it
does work for testing.

Thanks,
Gilad

-- 
Gilad Ben-Yossef
Chief Coffee Drinker & CTO
Codefidence Ltd.

Web:   http://codefidence.com
Cell:  +972-52-8260388
Skype: gilad_codefidence
Tel:   +972-8-9316883 ext. 201
Fax:   +972-8-9316884
Email: gilad@codefidence.com

Check out our Open Source technology and training blog - http://tuxology.net

	"Sorry cannot parse this, its too long to be true  :)"
	  -- Eric Dumazet on netdev mailing list


^ permalink raw reply

* [PATCH TESTING] Add support for configuring route entry features
From: Gilad Ben-Yossef @ 2009-10-20 16:22 UTC (permalink / raw)
  To: netdev; +Cc: ori, ilpo.jarvinen, Gilad Ben-Yossef

This is needed to test my previous per route entry TCP options patch.

Feature display is still numeric, so this is not the final version yet.

To use: add "features nows nots nosack nodsack" to a route entry.

Based loosely on original patch by Ilpo Järvinen.

Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>

CC: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>

---
 include/linux/rtnetlink.h |   10 ++++++----
 ip/iproute.c              |   45 ++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 50 insertions(+), 5 deletions(-)

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 63d1c69..35964d4 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -376,10 +376,12 @@ enum
 
 #define RTAX_MAX (__RTAX_MAX - 1)
 
-#define RTAX_FEATURE_ECN	0x00000001
-#define RTAX_FEATURE_SACK	0x00000002
-#define RTAX_FEATURE_TIMESTAMP	0x00000004
-#define RTAX_FEATURE_ALLFRAG	0x00000008
+#define RTAX_FEATURE_ECN        0x00000001
+#define RTAX_FEATURE_NO_SACK    0x00000002
+#define RTAX_FEATURE_NO_TSTAMP  0x00000004
+#define RTAX_FEATURE_ALLFRAG    0x00000008
+#define RTAX_FEATURE_NO_WSCALE  0x00000010
+#define RTAX_FEATURE_NO_DSACK   0x00000020
 
 struct rta_session
 {
diff --git a/ip/iproute.c b/ip/iproute.c
index aeea93d..9dcc0e0 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -54,6 +54,22 @@ static const char *mx_names[RTAX_MAX+1] = {
 	[RTAX_FEATURES] = "features",
 	[RTAX_RTO_MIN]	= "rto_min",
 };
+
+struct valname {
+	unsigned int	val;
+	const char	*name;
+};
+
+static const struct valname features[] = {
+	{ RTAX_FEATURE_NO_SACK, "nosack" },
+	{ RTAX_FEATURE_NO_TSTAMP, "notimestamps" },
+	{ RTAX_FEATURE_NO_TSTAMP, "nots" },
+	{ RTAX_FEATURE_NO_WSCALE, "nowindowscale" },
+	{ RTAX_FEATURE_NO_WSCALE, "nows" },
+	{ RTAX_FEATURE_NO_DSACK, "nodsack" },
+	
+};
+
 static void usage(void) __attribute__((noreturn));
 
 static void usage(void)
@@ -75,7 +91,8 @@ static void usage(void)
 	fprintf(stderr, "           [ rtt TIME ] [ rttvar TIME ] [reordering NUMBER ]\n");
 	fprintf(stderr, "           [ window NUMBER] [ cwnd NUMBER ] [ initcwnd NUMBER ]\n");
 	fprintf(stderr, "           [ ssthresh NUMBER ] [ realms REALM ] [ src ADDRESS ]\n");
-	fprintf(stderr, "           [ rto_min TIME ] [ hoplimit NUMBER ] \n");
+	fprintf(stderr, "           [ rto_min TIME ] [ hoplimit NUMBER ]\n"); 
+	fprintf(stderr, "           [ features DISABLED_FEATURES ]\n");
 	fprintf(stderr, "TYPE := [ unicast | local | broadcast | multicast | throw |\n");
 	fprintf(stderr, "          unreachable | prohibit | blackhole | nat ]\n");
 	fprintf(stderr, "TABLE_ID := [ local | main | default | all | NUMBER ]\n");
@@ -85,6 +102,8 @@ static void usage(void)
 	fprintf(stderr, "NHFLAGS := [ onlink | pervasive ]\n");
 	fprintf(stderr, "RTPROTO := [ kernel | boot | static | NUMBER ]\n");
 	fprintf(stderr, "TIME := NUMBER[s|ms|us|ns|j]\n");
+	fprintf(stderr, "DISABLED_FEATURES := [ nosack | notimestamps | nots | nowindowscale |\n");
+	fprintf(stderr, "                       nows | nodsack ] [ DISABLED_FEATURES ]\n");
 	exit(-1);
 }
 
@@ -877,6 +896,30 @@ int iproute_modify(int cmd, unsigned flags, int argc, char **argv)
 			if (get_unsigned(&win, *argv, 0))
 				invarg("\"ssthresh\" value is invalid\n", *argv);
 			rta_addattr32(mxrta, sizeof(mxbuf), RTAX_SSTHRESH, win);
+		} else if (matches(*argv, "features") == 0) {
+			int j;
+			unsigned int f = 0;
+			NEXT_ARG();
+			while (1) {
+				for (j = 0; j < ARRAY_SIZE(features); j++) {
+					if (strcmp(*argv, features[j].name) == 0) {
+						f |= features[j].val;
+						if (!NEXT_ARG_OK())
+							goto feat_out;
+						NEXT_ARG();
+						break;
+					}
+				}
+				if (j == ARRAY_SIZE(features)) {
+					if (f)
+						PREV_ARG();
+					break;
+				}
+			}
+feat_out:
+			if (!f)
+				invarg("\"features\" list is invalid\n", *argv);
+			rta_addattr32(mxrta, sizeof(mxbuf), RTAX_FEATURES, f);
 		} else if (matches(*argv, "realms") == 0) {
 			__u32 realm;
 			NEXT_ARG();
-- 
1.5.6.3
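
The name-to-bit lookup in the "features" branch can be exercised on its own
with a sketch like the one below. It keeps the valname table from the patch
but replaces the NEXT_ARG()/PREV_ARG() macro dance with an explicit count of
consumed arguments; parse_features() is a hypothetical helper, not something
that exists in iproute2.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RTAX_FEATURE_NO_SACK	0x00000002u
#define RTAX_FEATURE_NO_TSTAMP	0x00000004u
#define RTAX_FEATURE_NO_WSCALE	0x00000010u
#define RTAX_FEATURE_NO_DSACK	0x00000020u

struct valname {
	unsigned int	val;
	const char	*name;
};

static const struct valname features[] = {
	{ RTAX_FEATURE_NO_SACK,   "nosack" },
	{ RTAX_FEATURE_NO_TSTAMP, "notimestamps" },
	{ RTAX_FEATURE_NO_TSTAMP, "nots" },
	{ RTAX_FEATURE_NO_WSCALE, "nowindowscale" },
	{ RTAX_FEATURE_NO_WSCALE, "nows" },
	{ RTAX_FEATURE_NO_DSACK,  "nodsack" },
};

/*
 * OR together the bits of the leading feature names in argv and report how
 * many arguments were consumed; parsing stops at the first unknown word so
 * the caller can continue with the next route keyword.
 */
static inline unsigned int parse_features(int argc, char **argv, int *consumed)
{
	const size_t n = sizeof(features) / sizeof(features[0]);
	unsigned int f = 0;
	int i;

	assert(consumed != NULL);
	for (i = 0; i < argc; i++) {
		size_t j;

		for (j = 0; j < n; j++)
			if (strcmp(argv[i], features[j].name) == 0)
				break;
		if (j == n)
			break;		/* not a feature name */
		f |= features[j].val;
	}
	*consumed = i;
	return f;
}
```

A return value of 0 corresponds to the "features list is invalid" error
path in the patch.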


^ permalink raw reply related

* Re: [PATCH RFC] Per route TCP options
From: Rick Jones @ 2009-10-20 16:26 UTC (permalink / raw)
  To: Gilad Ben-Yossef; +Cc: netdev, ori
In-Reply-To: <1256052161-14156-1-git-send-email-gilad@codefidence.com>

Gilad Ben-Yossef wrote:
> Turn the global sysctls allowing disabling of TCP SACK, DSACK,
> time stamp and window scale into per route entry feature options,
> laying the ground to future removal of the relevant global sysctls.
> 
> You really only want to disable SACK, DSACK, time stamp or window
> scale if you've got a piece of broken networking equipment somewhere 
> as a stop gap until you can bring a big enough hammer to deal with
> the broken network equipment. It doesn't make sense to "punish" the
> entire connections going through the machine to destinations not 
> related to the broken equipment.

Is it really only the case that those options get disabled for broken networking 
equipment?  Does this presage making all TCP options per-route only?

rick jones

^ permalink raw reply

* Re: Policy routing + route "via" gives a strange behavior
From: Atis Elsts @ 2009-10-20 16:48 UTC (permalink / raw)
  To: Guido Trotter; +Cc: netdev
In-Reply-To: <20091020132820.GA3159@gg.studio.tixteam.net>

On Tuesday 20 October 2009 16:28:20 you wrote:
> This is also refused unless a route like the one before appears in the
> default table, even though it does appear in table 100. Is this the right
> behavior, and if yes, why? 

I guess what you describe is too infrequent a use case for anyone to really
care. Connected and link-scoped routes are usually not added to policy
routing tables :) Can you explain what kind of setup this is needed for?

This "issue" could be solved by using routing table in the FIB lookup done in 
fib_check_nh(). However, doing that would break a lot more setups than it 
would "fix".
For example, if you had these rules
  from all to 1.2.3.4 fwmark 0x64 lookup 100
  from all fwmark 0x64 unreachable
then adding policy route to table 100 would fail unless nexthop 1.2.3.4 was 
used...

Anyway, you can achieve what you wish by using the "onlink" option, e.g.:
  ip route add table 100 default dev eth1 via 192.168.5.254 onlink

Atis

^ permalink raw reply

* Re: Policy routing + route "via" gives a strange behavior
From: Guido Trotter @ 2009-10-20 17:23 UTC (permalink / raw)
  To: Atis Elsts; +Cc: netdev
In-Reply-To: <200910201948.39778.atis@mikrotik.com>

On Tue, Oct 20, 2009 at 07:48:39PM +0300, Atis Elsts wrote:

Hi,

Thanks for your explanation/help!

> > This is also refused unless a route like the one before appears in the
> > default table, even though it does appear in table 100. Is this the right
> > behavior, and if yes, why? 
> 
> I guess what you describe is too infrequent use case for anyone to really 
> care. Connected and link scoped routes are usually not added to policy 
> routing tables :) Can you explain more for what kind of setup this is needed?
> 

What I'm using it for is to force virtual machine traffic from a host to be
routed to a different interface (a gre interface, in my case). I set up a rule
that traffic from the guests' network (or from their interfaces) looks up a
different routing table, and in that table I set a default gateway so that any
traffic the instance would do is sent via that gateway.

> This "issue" could be solved by using routing table in the FIB lookup done in 
> fib_check_nh(). However, doing that would break a lot more setups than it 
> would "fix".
> For example, if you had these rules
>   from all to 1.2.3.4 fwmark 0x64 lookup 100
>   from all fwmark 0x64 unreachable
> then adding policy route to table 100 would fail unless nexthop 1.2.3.4 was 
> used...
> 
Not sure I follow you here, sorry! :/ How would it behave differently depending
on the rules used? If in table 100 I had something like:

ip route add 1.2.3.4 dev eth1 via 2.3.4.5

Then the traffic looking up 100 (which according to the rules is to 1.2.3.4, but
could be to something else as well) must be routed via 2.3.4.5.
Of course then in table 100 I need a route to 2.3.4.5, or I need one in the
default table (which is the only one that gets checked now).

> Anyway, you can achieve what you wish by using the "onlink" option, e.g.:
>   ip route add table 100 default dev eth1 via 192.168.5.254 onlink

I will try this, thanks! What I was doing now was something like:
ip route add 192.168.5.254/32 dev eth1 # no table 100
ip route add table 100 default dev eth1 via 192.168.5.254
ip route del 192.168.5.254/32 dev eth1

But it was kind of a nasty workaround. Although it was working :)

Thanks,

Guido

^ permalink raw reply

* Re: pktgen and spin_lock_bh in xmit path
From: Ben Greear @ 2009-10-20 17:37 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: NetDev, robert
In-Reply-To: <4ADD41F5.5080707@candelatech.com>

On 10/19/2009 09:52 PM, Ben Greear wrote:
> Eric Dumazet wrote:
>> Ben Greear a écrit :
>>> I'm having strange issues when running pktgen on 10G interfaces while
>>> also running
>>> pktgen on mac-vlans on that interface, when the mac-vlan pktgen threads
>>> are on a different
>>> CPU.

I think I found the problem.  First, lockdep was not the issue, and mac-vlans
were properly setting up the lockdep keys.  I would have expected lockdep to
figure out I was trying to lock a non-valid lock, but maybe something else
kept that from happening.

Second:  I think the problem can only happen on my code tree because I
added code to allow mac-vlans to return NETDEV_TX_BUSY
when a hacked variant of dev_queue_xmit decided it could not immediately
transmit a packet.  Without my change, a packet would have to be created fresh
in this scenario, so it would not hit the bug.

However, I think pktgen might still need a similar fix because other drivers or
logic might also change the skb tx-queue map.

Here is the problem, or at least one of them:

pktgen tries to xmit, but gets NETDEV_TX_BUSY.  During the xmit attempt, the
skb queue map was changed to that of the underlying device, which was 4.  Note
that mac-vlans have only a single tx queue.
pktgen will retry this skb, but it never resets the skb queue map back to 0.
This means that it will soon be accessing txq[4], which is out of bounds for a
single-queue device and corrupts memory.  Things rapidly decline from here!
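
The failure mode can be reduced to a small user-space sketch in C. This is a
toy model and not kernel code: struct toy_dev, struct toy_skb and
toy_pick_queue() are hypothetical names used only for illustration. It shows
the one idea that matters: a queue index left over from a lower device has to
be checked against the number of tx queues of the device actually being used,
and reset if it is out of range.

```c
/* Toy model only: every name here is hypothetical, not a kernel symbol. */
struct toy_dev {
	const char *name;
	unsigned int num_tx_queues;
};

struct toy_skb {
	unsigned int queue_mapping;
};

/*
 * Return a queue index that is safe to use on dev.  An index that is
 * out of range (for example one written by a lower device during a
 * failed transmit) is replaced by fallback before it is used.
 */
static unsigned int toy_pick_queue(const struct toy_dev *dev,
				   struct toy_skb *skb,
				   unsigned int fallback)
{
	if (skb->queue_mapping >= dev->num_tx_queues)
		skb->queue_mapping = fallback;
	return skb->queue_mapping;
}
```

With a single-queue device and an skb whose mapping was left at 4, the helper
returns 0 instead of letting the caller index a queue that does not exist; on
a device with enough queues the mapping is left alone.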

Here is a patch for comment, in case the pktgen folks would like to
apply something similar:

@@ -3991,11 +4001,27 @@ static void pktgen_xmit(struct pktgen_dev *pkt_dev, u64 now)
                 }
         }

-       if (!pkt_dev->skb) {
+       if ((!pkt_dev->skb) || (pkt_dev->clone_count <= 1)) {
+               /** If clone count is low, that might be because device is a layered
+                * virtual device, like mac-vlan.  In that case, the queue-map may be
+                * changed while transmitting out the lower levels, so we need to
+                * reset this here so we don't accidentally use a bogus queue.
+                */
+       reset_queue_map:
                 set_cur_queue_map(pkt_dev);
                 queue_map = pkt_dev->cur_queue_map;
         } else {
                 queue_map = skb_get_queue_mapping(pkt_dev->skb);
+               if (unlikely(queue_map >= odev->num_tx_queues)) {
+                       static int do_once = 1;
+                       if (do_once) {
+                               printk("pktgen ERROR:  queue_map range error, queue_map: %i  num_tx_queues: %i  iface: %s\n",
+                                      queue_map, odev->num_tx_queues, odev->name);
+                               WARN_ON(1);
+                               do_once = 0;
+                       }
+                       goto reset_queue_map;
+               }
         }

         txq = netdev_get_tx_queue(odev, queue_map);

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply
