Netdev List
 help / color / mirror / Atom feed
* [PATCHv4 0/7] Per route TCP options support kill switches
From: Gilad Ben-Yossef @ 2009-10-28 14:15 UTC (permalink / raw)
  To: netdev; +Cc: ori

Allow selectively turning off support for specific TCP options
on a per route basis.

One normally want to disable SACK, DSACK, time stamp or window
scale if one got a piece of broken networking equipment somewhere
as a stop gap until you can bring a big enough hammer to deal with
the broken network equipment. It doesn't make sense to "punish" the
entire connections going through the machine to destinations not
related to the broken equipment.

This is doubly true when one is dealing with network containers
used to isolate several virtual domains.

Per route options implemented in free bits in the features route
entry property, which in some cases were reserved by name for these
options, so this does not inflate any structure.

Global sysctls for these options are still preserved and retain 
the exact original meaning (e.g. you have to have both the global 
sysctl turned on and not turn off the TCP option parsing in the
specific route to have it proccessed).

It is not possible to turn off globally an option but turn it on
per route, so as to not subtly change the meaning of current
establish sysctls (and this is a rare need anyway).

Tested on x86 using Qemu/KVM.

Working but crude matching patch to iproute2 sent earlier to the list.

Patchset based on original work by Ori Finkelman and Yony Amit
from ComSleep Ltd.

The author wishes to thank Eric Dumazaet, William Allen Simpson, 
Bill Fink and Ilpo Jarvinen for their feedback.


Gilad Ben-Yossef (7):
  Only parse time stamp TCP option in time wait sock
  Allow tcp_parse_options to consult dst entry
  Add dst_feature to query route entry features
  Add the no SACK route option feature
  Allow disabling TCP timestamp options per route
  Allow to turn off TCP window scale opt per route
  Allow disabling of DSACK TCP option per route

 include/linux/rtnetlink.h |    6 ++++--
 include/net/dst.h         |    8 +++++++-
 include/net/tcp.h         |    3 ++-
 net/ipv4/syncookies.c     |   27 ++++++++++++++-------------
 net/ipv4/tcp_input.c      |   26 ++++++++++++++++++--------
 net/ipv4/tcp_ipv4.c       |   21 ++++++++++++---------
 net/ipv4/tcp_minisocks.c  |    9 ++++++---
 net/ipv4/tcp_output.c     |   18 +++++++++++++-----
 net/ipv6/syncookies.c     |   28 +++++++++++++++-------------
 net/ipv6/tcp_ipv6.c       |    3 ++-
 10 files changed, 93 insertions(+), 56 deletions(-)


^ permalink raw reply

* [PATCHv4 2/7] Allow tcp_parse_options to consult dst entry
From: Gilad Ben-Yossef @ 2009-10-28 14:15 UTC (permalink / raw)
  To: netdev; +Cc: ori
In-Reply-To: <1256739327-11576-1-git-send-email-gilad@codefidence.com>

We need tcp_parse_options to be aware of dst_entry to
take into account per dst_entry TCP options settings

Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Sigend-off-by: Ori Finkelman <ori@comsleep.com>
Sigend-off-by: Yony Amit <yony@comsleep.com>
---
 include/net/tcp.h        |    3 ++-
 net/ipv4/syncookies.c    |   27 ++++++++++++++-------------
 net/ipv4/tcp_input.c     |    9 ++++++---
 net/ipv4/tcp_ipv4.c      |   21 ++++++++++++---------
 net/ipv4/tcp_minisocks.c |    7 +++++--
 net/ipv6/syncookies.c    |   28 +++++++++++++++-------------
 net/ipv6/tcp_ipv6.c      |    3 ++-
 7 files changed, 56 insertions(+), 42 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 03a49c7..740d09b 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -409,7 +409,8 @@ extern int			tcp_recvmsg(struct kiocb *iocb, struct sock *sk,
 
 extern void			tcp_parse_options(struct sk_buff *skb,
 						  struct tcp_options_received *opt_rx,
-						  int estab);
+						  int estab,
+						  struct dst_entry *dst);
 
 extern u8			*tcp_parse_md5sig_option(struct tcphdr *th);
 
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index a6e0e07..4990dd4 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -276,13 +276,6 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
 
 	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_SYNCOOKIESRECV);
 
-	/* check for timestamp cookie support */
-	memset(&tcp_opt, 0, sizeof(tcp_opt));
-	tcp_parse_options(skb, &tcp_opt, 0);
-
-	if (tcp_opt.saw_tstamp)
-		cookie_check_timestamp(&tcp_opt);
-
 	ret = NULL;
 	req = inet_reqsk_alloc(&tcp_request_sock_ops); /* for safety */
 	if (!req)
@@ -298,12 +291,6 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
 	ireq->loc_addr		= ip_hdr(skb)->daddr;
 	ireq->rmt_addr		= ip_hdr(skb)->saddr;
 	ireq->ecn_ok		= 0;
-	ireq->snd_wscale	= tcp_opt.snd_wscale;
-	ireq->rcv_wscale	= tcp_opt.rcv_wscale;
-	ireq->sack_ok		= tcp_opt.sack_ok;
-	ireq->wscale_ok		= tcp_opt.wscale_ok;
-	ireq->tstamp_ok		= tcp_opt.saw_tstamp;
-	req->ts_recent		= tcp_opt.saw_tstamp ? tcp_opt.rcv_tsval : 0;
 
 	/* We throwed the options of the initial SYN away, so we hope
 	 * the ACK carries the same options again (see RFC1122 4.2.3.8)
@@ -351,6 +338,20 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
 		}
 	}
 
+	/* check for timestamp cookie support */
+	memset(&tcp_opt, 0, sizeof(tcp_opt));
+	tcp_parse_options(skb, &tcp_opt, 0, &rt->u.dst);
+
+	if (tcp_opt.saw_tstamp)
+		cookie_check_timestamp(&tcp_opt);
+
+	ireq->snd_wscale        = tcp_opt.snd_wscale;
+	ireq->rcv_wscale        = tcp_opt.rcv_wscale;
+	ireq->sack_ok           = tcp_opt.sack_ok;
+	ireq->wscale_ok         = tcp_opt.wscale_ok;
+	ireq->tstamp_ok         = tcp_opt.saw_tstamp;
+	req->ts_recent          = tcp_opt.saw_tstamp ? tcp_opt.rcv_tsval : 0;
+
 	/* Try to redo what tcp_v4_send_synack did. */
 	req->window_clamp = tp->window_clamp ? :dst_metric(&rt->u.dst, RTAX_WINDOW);
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index d86784b..d502f49 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3698,12 +3698,14 @@ old_ack:
  * the fast version below fails.
  */
 void tcp_parse_options(struct sk_buff *skb, struct tcp_options_received *opt_rx,
-		       int estab)
+		       int estab,  struct dst_entry *dst)
 {
 	unsigned char *ptr;
 	struct tcphdr *th = tcp_hdr(skb);
 	int length = (th->doff * 4) - sizeof(struct tcphdr);
 
+	BUG_ON(!estab && !dst);
+
 	ptr = (unsigned char *)(th + 1);
 	opt_rx->saw_tstamp = 0;
 
@@ -3820,7 +3822,7 @@ static int tcp_fast_parse_options(struct sk_buff *skb, struct tcphdr *th,
 		if (tcp_parse_aligned_timestamp(tp, th))
 			return 1;
 	}
-	tcp_parse_options(skb, &tp->rx_opt, 1);
+	tcp_parse_options(skb, &tp->rx_opt, 1, NULL);
 	return 1;
 }
 
@@ -5364,8 +5366,9 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	int saved_clamp = tp->rx_opt.mss_clamp;
+	struct dst_entry *dst = __sk_dst_get(sk);
 
-	tcp_parse_options(skb, &tp->rx_opt, 0);
+	tcp_parse_options(skb, &tp->rx_opt, 0, dst);
 
 	if (th->ack) {
 		/* rfc793:
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 7cda24b..1d611e3 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1256,11 +1256,21 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 	tcp_rsk(req)->af_specific = &tcp_request_sock_ipv4_ops;
 #endif
 
+	ireq = inet_rsk(req);
+	ireq->loc_addr = daddr;
+	ireq->rmt_addr = saddr;
+	ireq->no_srccheck = inet_sk(sk)->transparent;
+	ireq->opt = tcp_v4_save_options(sk, skb);
+
+	dst = inet_csk_route_req(sk, req);
+	if(!dst)
+		goto drop_and_free;
+
 	tcp_clear_options(&tmp_opt);
 	tmp_opt.mss_clamp = 536;
 	tmp_opt.user_mss  = tcp_sk(sk)->rx_opt.user_mss;
 
-	tcp_parse_options(skb, &tmp_opt, 0);
+	tcp_parse_options(skb, &tmp_opt, 0, dst);
 
 	if (want_cookie && !tmp_opt.saw_tstamp)
 		tcp_clear_options(&tmp_opt);
@@ -1269,14 +1279,8 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 
 	tcp_openreq_init(req, &tmp_opt, skb);
 
-	ireq = inet_rsk(req);
-	ireq->loc_addr = daddr;
-	ireq->rmt_addr = saddr;
-	ireq->no_srccheck = inet_sk(sk)->transparent;
-	ireq->opt = tcp_v4_save_options(sk, skb);
-
 	if (security_inet_conn_request(sk, skb, req))
-		goto drop_and_free;
+		goto drop_and_release;
 
 	if (!want_cookie)
 		TCP_ECN_create_request(req, tcp_hdr(skb));
@@ -1301,7 +1305,6 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 		 */
 		if (tmp_opt.saw_tstamp &&
 		    tcp_death_row.sysctl_tw_recycle &&
-		    (dst = inet_csk_route_req(sk, req)) != NULL &&
 		    (peer = rt_get_peer((struct rtable *)dst)) != NULL &&
 		    peer->v4daddr == saddr) {
 			if (get_seconds() < peer->tcp_ts_stamp + TCP_PAWS_MSL &&
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 8c8c6e6..8bb560d 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -102,7 +102,7 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
 
 	if (th->doff > (sizeof(*th) >> 2) && tcptw->tw_ts_recent_stamp) {
 		tmp_opt.tstamp_ok = 1;
-		tcp_parse_options(skb, &tmp_opt, 1);
+		tcp_parse_options(skb, &tmp_opt, 1, NULL);
 
 		if (tmp_opt.saw_tstamp) {
 			tmp_opt.ts_recent	= tcptw->tw_ts_recent;
@@ -500,10 +500,11 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 	int paws_reject = 0;
 	struct tcp_options_received tmp_opt;
 	struct sock *child;
+	struct dst_entry *dst = inet_csk_route_req(sk, req);
 
 	tmp_opt.saw_tstamp = 0;
 	if (th->doff > (sizeof(struct tcphdr)>>2)) {
-		tcp_parse_options(skb, &tmp_opt, 0);
+		tcp_parse_options(skb, &tmp_opt, 0, dst);
 
 		if (tmp_opt.saw_tstamp) {
 			tmp_opt.ts_recent = req->ts_recent;
@@ -516,6 +517,8 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 		}
 	}
 
+	dst_release(dst);
+
 	/* Check for pure retransmitted SYN. */
 	if (TCP_SKB_CB(skb)->seq == tcp_rsk(req)->rcv_isn &&
 	    flg == TCP_FLAG_SYN &&
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index 6b6ae91..6ece408 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -184,13 +184,6 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 
 	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_SYNCOOKIESRECV);
 
-	/* check for timestamp cookie support */
-	memset(&tcp_opt, 0, sizeof(tcp_opt));
-	tcp_parse_options(skb, &tcp_opt, 0);
-
-	if (tcp_opt.saw_tstamp)
-		cookie_check_timestamp(&tcp_opt);
-
 	ret = NULL;
 	req = inet6_reqsk_alloc(&tcp6_request_sock_ops);
 	if (!req)
@@ -224,12 +217,6 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 	req->expires = 0UL;
 	req->retrans = 0;
 	ireq->ecn_ok		= 0;
-	ireq->snd_wscale	= tcp_opt.snd_wscale;
-	ireq->rcv_wscale	= tcp_opt.rcv_wscale;
-	ireq->sack_ok		= tcp_opt.sack_ok;
-	ireq->wscale_ok		= tcp_opt.wscale_ok;
-	ireq->tstamp_ok		= tcp_opt.saw_tstamp;
-	req->ts_recent		= tcp_opt.saw_tstamp ? tcp_opt.rcv_tsval : 0;
 	treq->rcv_isn = ntohl(th->seq) - 1;
 	treq->snt_isn = cookie;
 
@@ -264,6 +251,21 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 			goto out_free;
 	}
 
+	/* check for timestamp cookie support */
+	memset(&tcp_opt, 0, sizeof(tcp_opt));
+	tcp_parse_options(skb, &tcp_opt, 0, dst);
+
+	if (tcp_opt.saw_tstamp)
+		cookie_check_timestamp(&tcp_opt);
+
+	req->ts_recent          = tcp_opt.saw_tstamp ? tcp_opt.rcv_tsval : 0;
+
+	ireq->snd_wscale        = tcp_opt.snd_wscale;
+	ireq->rcv_wscale        = tcp_opt.rcv_wscale;
+	ireq->sack_ok           = tcp_opt.sack_ok;
+	ireq->wscale_ok         = tcp_opt.wscale_ok;
+	ireq->tstamp_ok         = tcp_opt.saw_tstamp;
+
 	req->window_clamp = tp->window_clamp ? :dst_metric(dst, RTAX_WINDOW);
 	tcp_select_initial_window(tcp_full_space(sk), req->mss,
 				  &req->rcv_wnd, &req->window_clamp,
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 21d100b..2eebab5 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1165,6 +1165,7 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct request_sock *req = NULL;
 	__u32 isn = TCP_SKB_CB(skb)->when;
+	struct dst_entry *dst = __sk_dst_get(sk);
 #ifdef CONFIG_SYN_COOKIES
 	int want_cookie = 0;
 #else
@@ -1203,7 +1204,7 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 	tmp_opt.mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) - sizeof(struct ipv6hdr);
 	tmp_opt.user_mss = tp->rx_opt.user_mss;
 
-	tcp_parse_options(skb, &tmp_opt, 0);
+	tcp_parse_options(skb, &tmp_opt, 0, dst);
 
 	if (want_cookie && !tmp_opt.saw_tstamp)
 		tcp_clear_options(&tmp_opt);
-- 
1.5.6.3


^ permalink raw reply related

* [PATCHv4 6/7]  Allow to turn off TCP window scale opt per route
From: Gilad Ben-Yossef @ 2009-10-28 14:15 UTC (permalink / raw)
  To: netdev; +Cc: ori
In-Reply-To: <1256739327-11576-1-git-send-email-gilad@codefidence.com>

Add and use no window scale bit in the features field.

Note that this is not the same as setting a window scale of 0
as would happen with window limit on route.

Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Sigend-off-by: Ori Finkelman <ori@comsleep.com>
Sigend-off-by: Yony Amit <yony@comsleep.com>
---
 include/linux/rtnetlink.h |    1 +
 net/ipv4/tcp_input.c      |    3 ++-
 net/ipv4/tcp_output.c     |    6 ++++--
 3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 2ab8c75..6784b34 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -380,6 +380,7 @@ enum
 #define RTAX_FEATURE_NO_SACK	0x00000002
 #define RTAX_FEATURE_NO_TSTAMP	0x00000004
 #define RTAX_FEATURE_ALLFRAG	0x00000008
+#define RTAX_FEATURE_NO_WSCALE	0x00000010
 
 struct rta_session
 {
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index d2f9742..4f5e914 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3739,7 +3739,8 @@ void tcp_parse_options(struct sk_buff *skb, struct tcp_options_received *opt_rx,
 				break;
 			case TCPOPT_WINDOW:
 				if (opsize == TCPOLEN_WINDOW && th->syn &&
-				    !estab && sysctl_tcp_window_scaling) {
+				    !estab && sysctl_tcp_window_scaling &&
+				    !dst_feature(dst, RTAX_FEATURE_NO_WSCALE)) {
 					__u8 snd_wscale = *(__u8 *)ptr;
 					opt_rx->wscale_ok = 1;
 					if (snd_wscale > 14) {
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 8f30c18..ff60a21 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -496,7 +496,8 @@ static unsigned tcp_syn_options(struct sock *sk, struct sk_buff *skb,
 		opts->tsecr = tp->rx_opt.ts_recent;
 		size += TCPOLEN_TSTAMP_ALIGNED;
 	}
-	if (likely(sysctl_tcp_window_scaling)) {
+	if (likely(sysctl_tcp_window_scaling &&
+		   !dst_feature(dst, RTAX_FEATURE_NO_WSCALE))) {
 		opts->ws = tp->rx_opt.rcv_wscale;
 		opts->options |= OPTION_WSCALE;
 		size += TCPOLEN_WSCALE_ALIGNED;
@@ -2347,7 +2348,8 @@ static void tcp_connect_init(struct sock *sk)
 				  tp->advmss - (tp->rx_opt.ts_recent_stamp ? tp->tcp_header_len - sizeof(struct tcphdr) : 0),
 				  &tp->rcv_wnd,
 				  &tp->window_clamp,
-				  sysctl_tcp_window_scaling,
+				  (sysctl_tcp_window_scaling &&
+				   !dst_feature(dst, RTAX_FEATURE_NO_WSCALE)),
 				  &rcv_wscale);
 
 	tp->rx_opt.rcv_wscale = rcv_wscale;
-- 
1.5.6.3


^ permalink raw reply related

* [PATCHv4 4/7] Add the no SACK route option feature
From: Gilad Ben-Yossef @ 2009-10-28 14:15 UTC (permalink / raw)
  To: netdev; +Cc: ori
In-Reply-To: <1256739327-11576-1-git-send-email-gilad@codefidence.com>

Implement querying and acting upon the no sack bit in the features
field.

Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Sigend-off-by: Ori Finkelman <ori@comsleep.com>
Sigend-off-by: Yony Amit <yony@comsleep.com>
---
 include/linux/rtnetlink.h |    2 +-
 net/ipv4/tcp_input.c      |    3 ++-
 net/ipv4/tcp_output.c     |    4 +++-
 3 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index adf2068..9c802a6 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -377,7 +377,7 @@ enum
 #define RTAX_MAX (__RTAX_MAX - 1)
 
 #define RTAX_FEATURE_ECN	0x00000001
-#define RTAX_FEATURE_SACK	0x00000002
+#define RTAX_FEATURE_NO_SACK	0x00000002
 #define RTAX_FEATURE_TIMESTAMP	0x00000004
 #define RTAX_FEATURE_ALLFRAG	0x00000008
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index d502f49..b14f780 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3763,7 +3763,8 @@ void tcp_parse_options(struct sk_buff *skb, struct tcp_options_received *opt_rx,
 				break;
 			case TCPOPT_SACK_PERM:
 				if (opsize == TCPOLEN_SACK_PERM && th->syn &&
-				    !estab && sysctl_tcp_sack) {
+				    !estab && sysctl_tcp_sack &&
+				    !dst_feature(dst, RTAX_FEATURE_NO_SACK)) {
 					opt_rx->sack_ok = 1;
 					tcp_sack_reset(opt_rx);
 				}
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index fcd278a..64db8dd 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -464,6 +464,7 @@ static unsigned tcp_syn_options(struct sock *sk, struct sk_buff *skb,
 				struct tcp_md5sig_key **md5) {
 	struct tcp_sock *tp = tcp_sk(sk);
 	unsigned size = 0;
+	struct dst_entry *dst = __sk_dst_get(sk);
 
 #ifdef CONFIG_TCP_MD5SIG
 	*md5 = tp->af_specific->md5_lookup(sk, sk);
@@ -498,7 +499,8 @@ static unsigned tcp_syn_options(struct sock *sk, struct sk_buff *skb,
 		opts->options |= OPTION_WSCALE;
 		size += TCPOLEN_WSCALE_ALIGNED;
 	}
-	if (likely(sysctl_tcp_sack)) {
+	if (likely(sysctl_tcp_sack &&
+		   !dst_feature(dst, RTAX_FEATURE_NO_SACK))) {
 		opts->options |= OPTION_SACK_ADVERTISE;
 		if (unlikely(!(OPTION_TS & opts->options)))
 			size += TCPOLEN_SACKPERM_ALIGNED;
-- 
1.5.6.3


^ permalink raw reply related

* [PATCHv4 3/7] Add dst_feature to query route entry features
From: Gilad Ben-Yossef @ 2009-10-28 14:15 UTC (permalink / raw)
  To: netdev; +Cc: ori
In-Reply-To: <1256739327-11576-1-git-send-email-gilad@codefidence.com>

Adding an accessor to existing  dst_entry feautres field and
refactor the only supported feature (allfrag) to use it.

Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Sigend-off-by: Ori Finkelman <ori@comsleep.com>
Sigend-off-by: Yony Amit <yony@comsleep.com>
---
 include/net/dst.h |    8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/include/net/dst.h b/include/net/dst.h
index 5a900dd..b562be3 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -111,6 +111,12 @@ dst_metric(const struct dst_entry *dst, int metric)
 	return dst->metrics[metric-1];
 }
 
+static inline u32
+dst_feature(const struct dst_entry *dst, u32 feature)
+{
+	return dst_metric(dst, RTAX_FEATURES) & feature;
+}
+
 static inline u32 dst_mtu(const struct dst_entry *dst)
 {
 	u32 mtu = dst_metric(dst, RTAX_MTU);
@@ -136,7 +142,7 @@ static inline void set_dst_metric_rtt(struct dst_entry *dst, int metric,
 static inline u32
 dst_allfrag(const struct dst_entry *dst)
 {
-	int ret = dst_metric(dst, RTAX_FEATURES) & RTAX_FEATURE_ALLFRAG;
+	int ret = dst_feature(dst,  RTAX_FEATURE_ALLFRAG);
 	/* Yes, _exactly_. This is paranoia. */
 	barrier();
 	return ret;
-- 
1.5.6.3


^ permalink raw reply related

* [PATCHv4 5/7] Allow disabling TCP timestamp options per route
From: Gilad Ben-Yossef @ 2009-10-28 14:15 UTC (permalink / raw)
  To: netdev; +Cc: ori
In-Reply-To: <1256739327-11576-1-git-send-email-gilad@codefidence.com>

Implement querying and acting upon the no timestamp bit in the feature
field.

Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Sigend-off-by: Ori Finkelman <ori@comsleep.com>
Sigend-off-by: Yony Amit <yony@comsleep.com>
---
 include/linux/rtnetlink.h |    2 +-
 net/ipv4/tcp_input.c      |    3 ++-
 net/ipv4/tcp_output.c     |    8 ++++++--
 3 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 9c802a6..2ab8c75 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -378,7 +378,7 @@ enum
 
 #define RTAX_FEATURE_ECN	0x00000001
 #define RTAX_FEATURE_NO_SACK	0x00000002
-#define RTAX_FEATURE_TIMESTAMP	0x00000004
+#define RTAX_FEATURE_NO_TSTAMP	0x00000004
 #define RTAX_FEATURE_ALLFRAG	0x00000008
 
 struct rta_session
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index b14f780..d2f9742 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3755,7 +3755,8 @@ void tcp_parse_options(struct sk_buff *skb, struct tcp_options_received *opt_rx,
 			case TCPOPT_TIMESTAMP:
 				if ((opsize == TCPOLEN_TIMESTAMP) &&
 				    ((estab && opt_rx->tstamp_ok) ||
-				     (!estab && sysctl_tcp_timestamps))) {
+				     (!estab && sysctl_tcp_timestamps &&
+				      !dst_feature(dst, RTAX_FEATURE_NO_TSTAMP)))) {
 					opt_rx->saw_tstamp = 1;
 					opt_rx->rcv_tsval = get_unaligned_be32(ptr);
 					opt_rx->rcv_tsecr = get_unaligned_be32(ptr + 4);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 64db8dd..8f30c18 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -488,7 +488,9 @@ static unsigned tcp_syn_options(struct sock *sk, struct sk_buff *skb,
 	opts->mss = tcp_advertise_mss(sk);
 	size += TCPOLEN_MSS_ALIGNED;
 
-	if (likely(sysctl_tcp_timestamps && *md5 == NULL)) {
+	if (likely(sysctl_tcp_timestamps &&
+		   !dst_feature(dst, RTAX_FEATURE_NO_TSTAMP) &&
+		   *md5 == NULL)) {
 		opts->options |= OPTION_TS;
 		opts->tsval = TCP_SKB_CB(skb)->when;
 		opts->tsecr = tp->rx_opt.ts_recent;
@@ -2317,7 +2319,9 @@ static void tcp_connect_init(struct sock *sk)
 	 * See tcp_input.c:tcp_rcv_state_process case TCP_SYN_SENT.
 	 */
 	tp->tcp_header_len = sizeof(struct tcphdr) +
-		(sysctl_tcp_timestamps ? TCPOLEN_TSTAMP_ALIGNED : 0);
+		(sysctl_tcp_timestamps &&
+		(!dst_feature(dst, RTAX_FEATURE_NO_TSTAMP) ?
+		  TCPOLEN_TSTAMP_ALIGNED : 0));
 
 #ifdef CONFIG_TCP_MD5SIG
 	if (tp->af_specific->md5_lookup(sk, sk) != NULL)
-- 
1.5.6.3


^ permalink raw reply related

* [PATCHv4 7/7] Allow disabling of DSACK TCP option per route
From: Gilad Ben-Yossef @ 2009-10-28 14:15 UTC (permalink / raw)
  To: netdev; +Cc: ori
In-Reply-To: <1256739327-11576-1-git-send-email-gilad@codefidence.com>

Add and use no DSCAK bit in the features field.

Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Sigend-off-by: Ori Finkelman <ori@comsleep.com>
Sigend-off-by: Yony Amit <yony@comsleep.com>
---
 include/linux/rtnetlink.h |    1 +
 net/ipv4/tcp_input.c      |    8 ++++++--
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 6784b34..e78b60c 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -381,6 +381,7 @@ enum
 #define RTAX_FEATURE_NO_TSTAMP	0x00000004
 #define RTAX_FEATURE_ALLFRAG	0x00000008
 #define RTAX_FEATURE_NO_WSCALE	0x00000010
+#define RTAX_FEATURE_NO_DSACK	0x00000020
 
 struct rta_session
 {
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 4f5e914..4262da5 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4080,8 +4080,10 @@ static inline int tcp_sack_extend(struct tcp_sack_block *sp, u32 seq,
 static void tcp_dsack_set(struct sock *sk, u32 seq, u32 end_seq)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
+	struct dst_entry *dst = __sk_dst_get(sk);
 
-	if (tcp_is_sack(tp) && sysctl_tcp_dsack) {
+	if (tcp_is_sack(tp) && sysctl_tcp_dsack &&
+	    !dst_feature(dst, RTAX_FEATURE_NO_DSACK)) {
 		int mib_idx;
 
 		if (before(seq, tp->rcv_nxt))
@@ -4110,13 +4112,15 @@ static void tcp_dsack_extend(struct sock *sk, u32 seq, u32 end_seq)
 static void tcp_send_dupack(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
+	struct dst_entry *dst = __sk_dst_get(sk);
 
 	if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
 	    before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt)) {
 		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_DELAYEDACKLOST);
 		tcp_enter_quickack_mode(sk);
 
-		if (tcp_is_sack(tp) && sysctl_tcp_dsack) {
+		if (tcp_is_sack(tp) && sysctl_tcp_dsack &&
+		    !dst_feature(dst, RTAX_FEATURE_NO_DSACK)) {
 			u32 end_seq = TCP_SKB_CB(skb)->end_seq;
 
 			if (after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt))
-- 
1.5.6.3


^ permalink raw reply related

* [PATCHv4 1/7] Only parse time stamp TCP option in time wait sock
From: Gilad Ben-Yossef @ 2009-10-28 14:15 UTC (permalink / raw)
  To: netdev; +Cc: ori
In-Reply-To: <1256739327-11576-1-git-send-email-gilad@codefidence.com>

Since we only use tcp_parse_options here to check for the exietence
of TCP timestamp option in the header, it is better to call with
the "established" flag on.

Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Signed-off-by: Ori Finkelman <ori@comsleep.com>
Signed-off-by: Yony Amit <yony@comsleep.com>
---
 net/ipv4/tcp_minisocks.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 624c3c9..8c8c6e6 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -100,9 +100,9 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
 	struct tcp_options_received tmp_opt;
 	int paws_reject = 0;
 
-	tmp_opt.saw_tstamp = 0;
 	if (th->doff > (sizeof(*th) >> 2) && tcptw->tw_ts_recent_stamp) {
-		tcp_parse_options(skb, &tmp_opt, 0);
+		tmp_opt.tstamp_ok = 1;
+		tcp_parse_options(skb, &tmp_opt, 1);
 
 		if (tmp_opt.saw_tstamp) {
 			tmp_opt.ts_recent	= tcptw->tw_ts_recent;
-- 
1.5.6.3


^ permalink raw reply related

* Re: [net-next-2.6 PATCH v4 3/3] TCPCT part 1c: initial SYN exchange with SYNACK data
From: Eric Dumazet @ 2009-10-28 14:17 UTC (permalink / raw)
  To: William Allen Simpson; +Cc: Linux Kernel Network Developers
In-Reply-To: <4AE6E7C0.2050408@gmail.com>

William Allen Simpson a écrit :
> This is a significantly revised implementation of an earlier (year-old)
> patch that no longer applies cleanly, with permission of the original
> author (Adam Langley).  That patch was previously reviewed:
> 
>    http://thread.gmane.org/gmane.linux.network/102586
> 
> The principle difference is using a TCP option to carry the cookie nonce,
> instead of a user configured offset in the data.  This is more flexible and
> less subject to user configuration error.  Such a cookie option has been
> suggested for many years, and is also useful without SYN data, allowing
> several related concepts to use the same extension option.
> 
>    "Re: SYN floods (was: does history repeat itself?)", September 9, 1996.
>    http://www.merit.net/mail.archives/nanog/1996-09/msg00235.html

Sorry this link might be interesting to you, but I found nothing that explains
your patches.

> 
>    "Re: what a new TCP header might look like", May 12, 1998.
>    ftp://ftp.isi.edu/end2end/end2end-interest-1998.mail

Same here....

> 
> Data structures are carefully composed to require minimal additions.
> For example, the struct tcp_options_received cookie_plus variable fits
> between existing 16-bit and 8-bit variables, requiring no additional
> space (taking alignment into consideration).  There are no additions to
> tcp_request_sock, and only 1 pointer and 1 flag byte in tcp_sock.
> 
> Allocations have been rearranged to avoid requiring GFP_ATOMIC, with
> only one unavoidable exception in tcp_create_openreq_child(), where the
> tcp_sock itself is created GFP_ATOMIC.
> 
> These functions will also be used in subsequent patches that implement
> additional features.
> 
> Requires:
>   TCPCT part 1a: add request_values parameter for sending SYNACK
>   TCPCT part 1b: sysctl_tcp_cookie_size, socket option
> TCP_COOKIE_TRANSACTIONS, functions
> 
> Signed-off-by: William.Allen.Simpson@gmail.com

I tried to find an RFC or document about this stuff and failed.

Before reading implementation code, I like to have english text that describes
the new concept/design.

(BTW I found http://ttcplinux.sourceforge.net/theses/ETTCP.pdf and found it interesting,
I wonder what happened to this)


^ permalink raw reply

* Re: [PATCH 3/3] net: TCP thin dupack
From: Ilpo Järvinen @ 2009-10-28 14:17 UTC (permalink / raw)
  To: Andreas Petlund; +Cc: Netdev, LKML, shemminger, David Miller
In-Reply-To: <4AE7207D.8090402@simula.no>

On Tue, 27 Oct 2009, Andreas Petlund wrote:

> This patch enables fast retransmissions after one dupACK for TCP if the 
> stream is identified as thin. This will reduce latencies for thin 
> streams that are not able to trigger fast retransmissions due to high 
> packet interarrival time. This mechanism is only active if enabled by 
> iocontrol or syscontrol and the stream is identified as thin. 
> 
> 
> Signed-off-by: Andreas Petlund <apetlund@simula.no>
> ---
>  include/linux/tcp.h        |    4 +++-
>  include/net/tcp.h          |    1 +
>  net/ipv4/sysctl_net_ipv4.c |    8 ++++++++
>  net/ipv4/tcp.c             |    5 +++++
>  net/ipv4/tcp_input.c       |    8 ++++++++
>  5 files changed, 25 insertions(+), 1 deletions(-)
> 
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index e64368d..f4a05ff 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -97,6 +97,7 @@ enum {
>  #define TCP_CONGESTION		13	/* Congestion control algorithm */
>  #define TCP_MD5SIG		14	/* TCP MD5 Signature (RFC2385) */
>  #define TCP_THIN_RM_EXPB        15      /* Remove exp. backoff for thin streams*/
> +#define TCP_THIN_DUPACK         16      /* Fast retrans. after 1 dupack */
>  
>  #define TCPI_OPT_TIMESTAMPS	1
>  #define TCPI_OPT_SACK		2
> @@ -301,7 +302,8 @@ struct tcp_sock {
>  	u8	frto_counter;	/* Number of new acks after RTO */
>  	u8	nonagle;	/* Disable Nagle algorithm?             */
>  	u8      thin_rm_expb:1, /* Remove exp. backoff for thin streams */
> -		thin_undef : 7;
> +		thin_dupack : 1,/* Fast retransmit on first dupack      */
> +		thin_undef : 6;
>  
>  /* RTT measurement */
>  	u32	srtt;		/* smoothed round trip time << 3	*/
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 412c1bd..41f3a5e 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -238,6 +238,7 @@ extern int sysctl_tcp_workaround_signed_windows;
>  extern int sysctl_tcp_slow_start_after_idle;
>  extern int sysctl_tcp_max_ssthresh;
>  extern int sysctl_tcp_force_thin_rm_expb;
> +extern int sysctl_tcp_force_thin_dupack;
>  
>  extern atomic_t tcp_memory_allocated;
>  extern struct percpu_counter tcp_sockets_allocated;
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 7458f37..8653867 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -721,6 +721,14 @@ static struct ctl_table ipv4_table[] = {
>  		.proc_handler   = proc_dointvec
>  	},
>  	{
> +		.ctl_name       = CTL_UNNUMBERED,
> +		.procname       = "tcp_force_thin_dupack",
> +		.data           = &sysctl_tcp_force_thin_dupack,
> +		.maxlen         = sizeof(int),
> +		.mode           = 0644,
> +		.proc_handler   = proc_dointvec
> +	},
> +	{
>  		.ctl_name	= CTL_UNNUMBERED,
>  		.procname	= "udp_mem",
>  		.data		= &sysctl_udp_mem,
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index b4b0931..de190db 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2139,6 +2139,11 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
>  			tp->thin_rm_expb = 1;
>  		break;
>  
> +	case TCP_THIN_DUPACK:
> +		if (val)
> +			tp->thin_dupack = 1;
> +		break;
> +
>  	case TCP_CORK:
>  		/* When set indicates to always queue non-full frames.
>  		 * Later the user clears this option and we transmit
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index d86784b..b71eb89 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -89,6 +89,8 @@ int sysctl_tcp_frto __read_mostly = 2;
>  int sysctl_tcp_frto_response __read_mostly;
>  int sysctl_tcp_nometrics_save __read_mostly;
>  
> +int sysctl_tcp_force_thin_dupack __read_mostly;
> +
>  int sysctl_tcp_moderate_rcvbuf __read_mostly = 1;
>  int sysctl_tcp_abc __read_mostly;
>  
> @@ -2447,6 +2449,12 @@ static int tcp_time_to_recover(struct sock *sk)
>  		return 1;
>  	}
>  
> +	/* If a thin stream is detected, retransmit after first
> +	 * received dupack */
> +	if ((tp->thin_dupack || sysctl_tcp_force_thin_dupack) &&
> +	    tcp_dupack_heurestics(tp) > 1 && tcp_stream_is_thin(tp))
> +		return 1;
> +
>  	return 0;
>  }

Have you tested it? ...I doubt this will work like you say and retransmit 
something when the window is small. ...Besides, you should have built this 
patch on top of the function rename you submitted earlier as after DaveM 
applied that this will no longer even compile...

-- 
 i.

^ permalink raw reply

* Re: [PATCH 2/3] net: TCP thin linear timeouts
From: Ilpo Järvinen @ 2009-10-28 14:18 UTC (permalink / raw)
  To: Andreas Petlund; +Cc: Netdev, LKML, shemminger, David Miller
In-Reply-To: <4AE72079.4030504@simula.no>

On Tue, 27 Oct 2009, Andreas Petlund wrote:

> This patch will make TCP use only linear timeouts if the stream is thin. This will help to avoid the very high latencies that thin stream suffer because of exponential backoff. This mechanism is only active if enabled by iocontrol or syscontrol and the stream is identified as thin.
> 
> 
> Signed-off-by: Andreas Petlund <apetlund@simula.no>
> ---
>  include/linux/tcp.h        |    3 +++
>  include/net/tcp.h          |    1 +
>  net/ipv4/sysctl_net_ipv4.c |    8 ++++++++
>  net/ipv4/tcp.c             |    5 +++++
>  net/ipv4/tcp_timer.c       |   17 ++++++++++++++++-
>  5 files changed, 33 insertions(+), 1 deletions(-)
> 
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index 61723a7..e64368d 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -96,6 +96,7 @@ enum {
>  #define TCP_QUICKACK		12	/* Block/reenable quick acks */
>  #define TCP_CONGESTION		13	/* Congestion control algorithm */
>  #define TCP_MD5SIG		14	/* TCP MD5 Signature (RFC2385) */
> +#define TCP_THIN_RM_EXPB        15      /* Remove exp. backoff for thin streams*/
>  
>  #define TCPI_OPT_TIMESTAMPS	1
>  #define TCPI_OPT_SACK		2
> @@ -299,6 +300,8 @@ struct tcp_sock {
>  	u16	advmss;		/* Advertised MSS			*/
>  	u8	frto_counter;	/* Number of new acks after RTO */
>  	u8	nonagle;	/* Disable Nagle algorithm?             */
> +	u8      thin_rm_expb:1, /* Remove exp. backoff for thin streams */
> +		thin_undef : 7;
>  
>  /* RTT measurement */
>  	u32	srtt;		/* smoothed round trip time << 3	*/
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 7c4482f..412c1bd 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -237,6 +237,7 @@ extern int sysctl_tcp_base_mss;
>  extern int sysctl_tcp_workaround_signed_windows;
>  extern int sysctl_tcp_slow_start_after_idle;
>  extern int sysctl_tcp_max_ssthresh;
> +extern int sysctl_tcp_force_thin_rm_expb;
>  
>  extern atomic_t tcp_memory_allocated;
>  extern struct percpu_counter tcp_sockets_allocated;
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 2dcf04d..7458f37 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -713,6 +713,14 @@ static struct ctl_table ipv4_table[] = {
>  		.proc_handler	= proc_dointvec,
>  	},
>  	{
> +		.ctl_name       = CTL_UNNUMBERED,
> +		.procname       = "tcp_force_thin_rm_expb",
> +		.data           = &sysctl_tcp_force_thin_rm_expb,
> +		.maxlen         = sizeof(int),
> +		.mode           = 0644,
> +		.proc_handler   = proc_dointvec
> +	},
> +	{
>  		.ctl_name	= CTL_UNNUMBERED,
>  		.procname	= "udp_mem",
>  		.data		= &sysctl_udp_mem,
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 90b2e06..b4b0931 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2134,6 +2134,11 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
>  		}
>  		break;
>  
> +	case TCP_THIN_RM_EXPB:
> +		if (val)
> +			tp->thin_rm_expb = 1;
> +		break;
> +
>  	case TCP_CORK:
>  		/* When set indicates to always queue non-full frames.
>  		 * Later the user clears this option and we transmit
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index cdb2ca7..24d6dc3 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -29,6 +29,7 @@ int sysctl_tcp_keepalive_intvl __read_mostly = TCP_KEEPALIVE_INTVL;
>  int sysctl_tcp_retries1 __read_mostly = TCP_RETR1;
>  int sysctl_tcp_retries2 __read_mostly = TCP_RETR2;
>  int sysctl_tcp_orphan_retries __read_mostly;
> +int sysctl_tcp_force_thin_rm_expb __read_mostly;
>  
>  static void tcp_write_timer(unsigned long);
>  static void tcp_delack_timer(unsigned long);
> @@ -386,7 +387,21 @@ void tcp_retransmit_timer(struct sock *sk)
>  	icsk->icsk_retransmits++;
>  
>  out_reset_timer:
> -	icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
> +	if ((tp->thin_rm_expb || sysctl_tcp_force_thin_rm_expb) &&
> +	    tcp_stream_is_thin(tp) && sk->sk_state == TCP_ESTABLISHED) {
> +		/* If stream is thin, remove exponential backoff.
> +		 * Since 'icsk_backoff' is used to reset timer, set to 0
> +		 * Recalculate 'icsk_rto' as this might be increased if
> +		 * stream oscillates between thin and thick, thus the old
> +		 * value might already be too high compared to the value
> +		 * set by 'tcp_set_rto' in tcp_input.c which resets the
> +		 * rto without backoff. */
> +		icsk->icsk_backoff = 0;
> +		icsk->icsk_rto = min(((tp->srtt >> 3) + tp->rttvar), TCP_RTO_MAX);

The first part is nowadays done with __tcp_set_rto(tp).

-- 
 i.

^ permalink raw reply

* Re: [PATCHv4 0/7] Per route TCP options support kill switches
From: Eric Dumazet @ 2009-10-28 14:22 UTC (permalink / raw)
  To: Gilad Ben-Yossef; +Cc: netdev, ori
In-Reply-To: <1256739327-11576-1-git-send-email-gilad@codefidence.com>

Gilad Ben-Yossef a écrit :
> Allow selectively turning off support for specific TCP options
> on a per route basis.
> 
> One normally want to disable SACK, DSACK, time stamp or window
> scale if one got a piece of broken networking equipment somewhere
> as a stop gap until you can bring a big enough hammer to deal with
> the broken network equipment. It doesn't make sense to "punish" the
> entire connections going through the machine to destinations not
> related to the broken equipment.
> 
> This is doubly true when one is dealing with network containers
> used to isolate several virtual domains.
> 
> Per route options implemented in free bits in the features route
> entry property, which in some cases were reserved by name for these
> options, so this does not inflate any structure.
> 
> Global sysctls for these options are still preserved and retain 
> the exact original meaning (e.g. you have to have both the global 
> sysctl turned on and not turn off the TCP option parsing in the
> specific route to have it proccessed).
> 
> It is not possible to turn off globally an option but turn it on
> per route, so as to not subtly change the meaning of current
> establish sysctls (and this is a rare need anyway).
> 
> Tested on x86 using Qemu/KVM.
> 
> Working but crude matching patch to iproute2 sent earlier to the list.
> 
> Patchset based on original work by Ori Finkelman and Yony Amit
> from ComSleep Ltd.
> 
> The author wishes to thank Eric Dumazaet, William Allen Simpson, 
> Bill Fink and Ilpo Jarvinen for their feedback.
> 
> 
> Gilad Ben-Yossef (7):
>   Only parse time stamp TCP option in time wait sock
>   Allow tcp_parse_options to consult dst entry
>   Add dst_feature to query route entry features
>   Add the no SACK route option feature
>   Allow disabling TCP timestamp options per route
>   Allow to turn off TCP window scale opt per route
>   Allow disabling of DSACK TCP option per route
> 
>  include/linux/rtnetlink.h |    6 ++++--
>  include/net/dst.h         |    8 +++++++-
>  include/net/tcp.h         |    3 ++-
>  net/ipv4/syncookies.c     |   27 ++++++++++++++-------------
>  net/ipv4/tcp_input.c      |   26 ++++++++++++++++++--------
>  net/ipv4/tcp_ipv4.c       |   21 ++++++++++++---------
>  net/ipv4/tcp_minisocks.c  |    9 ++++++---
>  net/ipv4/tcp_output.c     |   18 +++++++++++++-----
>  net/ipv6/syncookies.c     |   28 +++++++++++++++-------------
>  net/ipv6/tcp_ipv6.c       |    3 ++-
>  10 files changed, 93 insertions(+), 56 deletions(-)
> 

I am a bit lost. What exactly changed in this new version, versus v3 ?


^ permalink raw reply

* Re: [PATCHv4 0/7] Per route TCP options support kill switches
From: Gilad Ben-Yossef @ 2009-10-28 14:31 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, ori
In-Reply-To: <4AE853A5.3060804@gmail.com>

Eric Dumazet wrote:

> Gilad Ben-Yossef a écrit :
>   
>> Allow selectively turning off support for specific TCP options
>> on a per route basis.
>>
>> One normally want to disable SACK, DSACK, time stamp or window
>> scale if one got a piece of broken networking equipment somewhere
>> as a stop gap until you can bring a big enough hammer to deal with
>> the broken network equipment. It doesn't make sense to "punish" the
>> entire connections going through the machine to destinations not
>> related to the broken equipment.
>>
>> This is doubly true when one is dealing with network containers
>> used to isolate several virtual domains.
>>
>> Per route options implemented in free bits in the features route
>> entry property, which in some cases were reserved by name for these
>> options, so this does not inflate any structure.
>>
>> Global sysctls for these options are still preserved and retain 
>> the exact original meaning (e.g. you have to have both the global 
>> sysctl turned on and not turn off the TCP option parsing in the
>> specific route to have it proccessed).
>>
>> It is not possible to turn off globally an option but turn it on
>> per route, so as to not subtly change the meaning of current
>> establish sysctls (and this is a rare need anyway).
>>
>> Tested on x86 using Qemu/KVM.
>>
>> Working but crude matching patch to iproute2 sent earlier to the list.
>>
>> Patchset based on original work by Ori Finkelman and Yony Amit
>> from ComSleep Ltd.
>>
>> The author wishes to thank Eric Dumazaet, William Allen Simpson, 
>> Bill Fink and Ilpo Jarvinen for their feedback.
>>
>>
>> Gilad Ben-Yossef (7):
>>   Only parse time stamp TCP option in time wait sock
>>   Allow tcp_parse_options to consult dst entry
>>   Add dst_feature to query route entry features
>>   Add the no SACK route option feature
>>   Allow disabling TCP timestamp options per route
>>   Allow to turn off TCP window scale opt per route
>>   Allow disabling of DSACK TCP option per route
>>
>>  include/linux/rtnetlink.h |    6 ++++--
>>  include/net/dst.h         |    8 +++++++-
>>  include/net/tcp.h         |    3 ++-
>>  net/ipv4/syncookies.c     |   27 ++++++++++++++-------------
>>  net/ipv4/tcp_input.c      |   26 ++++++++++++++++++--------
>>  net/ipv4/tcp_ipv4.c       |   21 ++++++++++++---------
>>  net/ipv4/tcp_minisocks.c  |    9 ++++++---
>>  net/ipv4/tcp_output.c     |   18 +++++++++++++-----
>>  net/ipv6/syncookies.c     |   28 +++++++++++++++-------------
>>  net/ipv6/tcp_ipv6.c       |    3 ++-
>>  10 files changed, 93 insertions(+), 56 deletions(-)
>>
>>     
>
> I am a bit lost. What exactly changed in this new version, versus v3 ?
>
>   


Code wise only the first patch - in tcp_timewait_state_process I
was calling tcp_parse_options() without actually initializing
properly the tstamp_ok field of the truct tcp_options_received
parameter to the that function.

I failed to noticed that since being a non initalized structure
on the stack it can actually produce the required behavior
depending on whatever random data was there.

I spotted it thanks to Williams A.S. review of my changes in that area.

Since I had the opportunity, I also edited the patch set description to
better document some of the things that seemed to bug some
of the reviewers (namely Bill F. and Williams A.S ).

Sorry for the confusion.

Thanks,
Gilad


-- 
Gilad Ben-Yossef
Chief Coffee Drinker & CTO
Codefidence Ltd.

Web:   http://codefidence.com
Cell:  +972-52-8260388
Skype: gilad_codefidence
Tel:   +972-8-9316883 ext. 201
Fax:   +972-8-9316884
Email: gilad@codefidence.com

Check out our Open Source technology and training blog - http://tuxology.net

	"The biggest risk you can take it is to take no risk."
		-- Mark Zuckerberg and probably others


^ permalink raw reply

* Re: [PATCH 2/3] net: TCP thin linear timeouts
From: Ilpo Järvinen @ 2009-10-28 14:31 UTC (permalink / raw)
  To: Arnd Hannemann
  Cc: Eric Dumazet, Andreas Petlund, Netdev, LKML, shemminger,
	David Miller
In-Reply-To: <4AE83FE4.1050309@nets.rwth-aachen.de>

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1469 bytes --]

On Wed, 28 Oct 2009, Arnd Hannemann wrote:

> Eric Dumazet schrieb:
> > Andreas Petlund a écrit :
> >> This patch will make TCP use only linear timeouts if the stream is 
> >> thin. This will help to avoid the very high latencies that thin 
> >> stream suffer because of exponential backoff. This mechanism is only 
> >> active if enabled by iocontrol or syscontrol and the stream is 
> >> identified as thin. 

...I don't see how high latency is in any connection to stream being 
"thin" or not btw. If all ACKs are lost it usually requires silence for 
the full RTT, which affects a stream regardless of its size. ...If not all 
ACKs are lost, then the dupACK approach in the other patch should cover 
it already.

> However, addressing the proposal:
> I wonder how one can seriously suggest to just skip congestion response 
> during timeout-based loss recovery? I believe that in a heavily 
> congested scenarios, this would lead to a goodput disaster... Not to 
> mention that in a heavily congested scenario, suddenly every flow will 
> become "thin", so this will even amplify the problems. Or did I miss 
> something?

Good point. I suppose such an under-provisioned network can certainly be
there. I have heard that at least some people who remove exponential back 
off apply it later on nth retransmission as very often there really isn't 
such a super heavy congestion scenario but something completely unrelated 
to congestion which causes the RTO.

-- 
 i.

^ permalink raw reply

* [PATCHv2 net-2.6] sfc: Really allow RX checksum offload to be disabled
From: Ben Hutchings @ 2009-10-28 14:34 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers

We have never checked the efx_nic::rx_checksum_enabled flag everywhere
we should, and since the switch to GRO we don't check it anywhere.
It's simplest to check it in the one place where we initialise the
per-packet checksummed flag.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Cc: stable@kernel.org
---
This version really is applicable to net-2.6 and 2.6.31.y, and to
2.6.27.y with fuzz 1.  I'm not sure whether this bug is serious enough
for a stable update but it is an obvious fix.

Ben.

 drivers/net/sfc/falcon.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/drivers/net/sfc/falcon.c b/drivers/net/sfc/falcon.c
index c049364..e75674e 100644
--- a/drivers/net/sfc/falcon.c
+++ b/drivers/net/sfc/falcon.c
@@ -884,7 +884,9 @@ static void falcon_handle_rx_event(struct efx_channel *channel,
 		/* If packet is marked as OK and packet type is TCP/IPv4 or
 		 * UDP/IPv4, then we can rely on the hardware checksum.
 		 */
-		checksummed = RX_EV_HDR_TYPE_HAS_CHECKSUMS(rx_ev_hdr_type);
+		checksummed =
+			efx->rx_checksum_enabled &&
+			RX_EV_HDR_TYPE_HAS_CHECKSUMS(rx_ev_hdr_type);
 	} else {
 		falcon_handle_rx_not_ok(rx_queue, event, &rx_ev_pkt_ok,
 					&discard);

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply related

* [PATCH net-next-2.6] ipv6 sit: Optimize multiple unregistration
From: Eric Dumazet @ 2009-10-28 14:37 UTC (permalink / raw)
  To: David S. Miller; +Cc: Linux Netdev List

Speedup module unloading by factorizing synchronize_rcu() calls

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv6/sit.c |   17 +++++++++++------
 1 files changed, 11 insertions(+), 6 deletions(-)


diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index b6b1626..2362a33 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -1145,16 +1145,19 @@ static struct xfrm_tunnel sit_handler = {
 	.priority	=	1,
 };
 
-static void sit_destroy_tunnels(struct sit_net *sitn)
+static void sit_destroy_tunnels(struct sit_net *sitn, struct list_head *head)
 {
 	int prio;
 
 	for (prio = 1; prio < 4; prio++) {
 		int h;
 		for (h = 0; h < HASH_SIZE; h++) {
-			struct ip_tunnel *t;
-			while ((t = sitn->tunnels[prio][h]) != NULL)
-				unregister_netdevice(t->dev);
+			struct ip_tunnel *t = sitn->tunnels[prio][h];
+
+			while (t != NULL) {
+				unregister_netdevice_queue(t->dev, head);
+				t = t->next;
+			}
 		}
 	}
 }
@@ -1208,11 +1211,13 @@ err_alloc:
 static void sit_exit_net(struct net *net)
 {
 	struct sit_net *sitn;
+	LIST_HEAD(list);
 
 	sitn = net_generic(net, sit_net_id);
 	rtnl_lock();
-	sit_destroy_tunnels(sitn);
-	unregister_netdevice(sitn->fb_tunnel_dev);
+	sit_destroy_tunnels(sitn, &list);
+	unregister_netdevice_queue(sitn->fb_tunnel_dev, &list);
+	unregister_netdevice_many(&list);
 	rtnl_unlock();
 	kfree(sitn);
 }

^ permalink raw reply related

* TG3, kvm, ipv6 & tso data corruption bug?
From: Rik van Riel @ 2009-10-28 14:46 UTC (permalink / raw)
  To: netdev; +Cc: Linux kernel Mailing List, Matt Carlson, Michael Chan, KVM list

I have been tracking down what I thought was a KVM related network
issue for a while, however it appears it could be a hardware issue.

The symptom is that data in network packets gets corrupted, before
the checksum is calculated.  This means the remote host can get
corrupted data, with no way to calculate it (except application
level checksums).  Luckily ssh has such checksums, so my rsync over
ssh backup script discovered this issue.

On a very regular basis, I got this message from ssh:

	Corrupted MAC on input.

I have played around a bit and narrowed it down to the following:

ipv4          => no problem
ipv6 w/o tso  => no problem
ipv6 with tso => occasional data corruption

Disabling tso with ethtool -K eth0 tso off makes the problem stop.

I am running Fedora 12's 2.6.31.1-56.fc12.x86_64 kernel, with the
following hardware:

05:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5761 
Gigabit Ethernet PCIe (rev 10)

I do not know enough about the network layer to know whether this is
fixable in software or whether TSO offloading for ipv6 should just
be disabled on this model.

-- 
All rights reversed.

^ permalink raw reply

* [PATCH net-next-2.6] ip6mr: Optimize multiple unregistration
From: Eric Dumazet @ 2009-10-28 14:48 UTC (permalink / raw)
  To: David S. Miller; +Cc: Linux Netdev List

Speedup module unloading by factorizing synchronize_rcu() calls

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv6/ip6mr.c |   15 ++++++++++-----
 1 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index 85849b4..52e0f74 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -477,7 +477,7 @@ failure:
  *	Delete a VIF entry
  */
 
-static int mif6_delete(struct net *net, int vifi)
+static int mif6_delete(struct net *net, int vifi, struct list_head *head)
 {
 	struct mif_device *v;
 	struct net_device *dev;
@@ -519,7 +519,7 @@ static int mif6_delete(struct net *net, int vifi)
 		in6_dev->cnf.mc_forwarding--;
 
 	if (v->flags & MIFF_REGISTER)
-		unregister_netdevice(dev);
+		unregister_netdevice_queue(dev, head);
 
 	dev_put(dev);
 	return 0;
@@ -976,6 +976,7 @@ static int ip6mr_device_event(struct notifier_block *this,
 	struct net *net = dev_net(dev);
 	struct mif_device *v;
 	int ct;
+	LIST_HEAD(list);
 
 	if (event != NETDEV_UNREGISTER)
 		return NOTIFY_DONE;
@@ -983,8 +984,10 @@ static int ip6mr_device_event(struct notifier_block *this,
 	v = &net->ipv6.vif6_table[0];
 	for (ct = 0; ct < net->ipv6.maxvif; ct++, v++) {
 		if (v->dev == dev)
-			mif6_delete(net, ct);
+			mif6_delete(net, ct, &list);
 	}
+	unregister_netdevice_many(&list);
+
 	return NOTIFY_DONE;
 }
 
@@ -1188,14 +1191,16 @@ static int ip6mr_mfc_add(struct net *net, struct mf6cctl *mfc, int mrtsock)
 static void mroute_clean_tables(struct net *net)
 {
 	int i;
+	LIST_HEAD(list);
 
 	/*
 	 *	Shut down all active vif entries
 	 */
 	for (i = 0; i < net->ipv6.maxvif; i++) {
 		if (!(net->ipv6.vif6_table[i].flags & VIFF_STATIC))
-			mif6_delete(net, i);
+			mif6_delete(net, i, &list);
 	}
+	unregister_netdevice_many(&list);
 
 	/*
 	 *	Wipe the cache
@@ -1325,7 +1330,7 @@ int ip6_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, uns
 		if (copy_from_user(&mifi, optval, sizeof(mifi_t)))
 			return -EFAULT;
 		rtnl_lock();
-		ret = mif6_delete(net, mifi);
+		ret = mif6_delete(net, mifi, NULL);
 		rtnl_unlock();
 		return ret;
 

^ permalink raw reply related

* Re: [PATCH] udev: create empty regular files to represent net interfaces
From: dann frazier @ 2009-10-28 15:09 UTC (permalink / raw)
  To: Matt Domsch
  Cc: Kay Sievers, linux-hotplug, Narendra_K, netdev, Jordan_Hargrave,
	Charles_Rose, Ben Hutchings
In-Reply-To: <20091028130308.GA24611@auslistsprd01.us.dell.com>

On Wed, Oct 28, 2009 at 08:03:08AM -0500, Matt Domsch wrote:
> On Wed, Oct 28, 2009 at 09:23:57AM +0100, Kay Sievers wrote:
[...]
> > That all sounds very much like something which will hit us back some
> > day. I'm not sure, if udev should publish such dead text files in
> > /dev, it does not seem to fit the usual APIs/assumptions where /sys
> > and /dev match, and libudev provides access to both. It all sounds
> > more like a database for a possible netdevname library, which does not
> > need to be public in /dev, right?
> 
> Right, it doesn't need to be in /dev.  We could have udev rules that
> simply call yet another program to maintain that database, in yet
> another way.

Or have udev maintain them in a private directory (e.g.,
/var/lib/udev/netalias). Personally, I like the approach of having
udev manage them as files - its an abstraction our users already get,
and they don't have to learn two mechanisms when aliasing disks and
nics (SYMLINK ftw). Plus there's obviously a lot of code reuse to be
had (most of my patch was moving code into a common section).

If we want to hide the file implementation - we could invent another
udev construct that basically aliases SYMLINK (e.g. NETALIAS) that
works iff the device is a netdevice. That would let us switch out
implementations in the future, but would obviously be much more
invasive.

-- 
dann frazier


^ permalink raw reply

* [PATCH net-next-2.6] ip6tnl: Optimize multiple unregistration
From: Eric Dumazet @ 2009-10-28 15:16 UTC (permalink / raw)
  To: David S. Miller; +Cc: Linux Netdev List

Speedup module unloading by factorizing synchronize_rcu() calls

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv6/ip6_tunnel.c |   11 ++++++++---
 1 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 670c291..6c1b5c9 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1393,14 +1393,19 @@ static void ip6_tnl_destroy_tunnels(struct ip6_tnl_net *ip6n)
 {
 	int h;
 	struct ip6_tnl *t;
+	LIST_HEAD(list);
 
 	for (h = 0; h < HASH_SIZE; h++) {
-		while ((t = ip6n->tnls_r_l[h]) != NULL)
-			unregister_netdevice(t->dev);
+		t = ip6n->tnls_r_l[h];
+		while (t != NULL) {
+			unregister_netdevice_queue(t->dev, &list);
+			t = t->next;
+		}
 	}
 
 	t = ip6n->tnls_wc[0];
-	unregister_netdevice(t->dev);
+	unregister_netdevice_queue(t->dev, &list);
+	unregister_netdevice_many(&list);
 }
 
 static int ip6_tnl_init_net(struct net *net)

^ permalink raw reply related

* [PATCH net-next-2.6] ipmr: Optimize multiple unregistration
From: Eric Dumazet @ 2009-10-28 15:21 UTC (permalink / raw)
  To: David S. Miller; +Cc: Linux Netdev List

Speedup module unloading by factorizing synchronize_rcu() calls

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv4/ipmr.c |   15 ++++++++++-----
 1 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 6949745..ef4ee45 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -275,7 +275,8 @@ failure:
  *	@notify: Set to 1, if the caller is a notifier_call
  */
 
-static int vif_delete(struct net *net, int vifi, int notify)
+static int vif_delete(struct net *net, int vifi, int notify,
+		      struct list_head *head)
 {
 	struct vif_device *v;
 	struct net_device *dev;
@@ -319,7 +320,7 @@ static int vif_delete(struct net *net, int vifi, int notify)
 	}
 
 	if (v->flags&(VIFF_TUNNEL|VIFF_REGISTER) && !notify)
-		unregister_netdevice(dev);
+		unregister_netdevice_queue(dev, head);
 
 	dev_put(dev);
 	return 0;
@@ -870,14 +871,16 @@ static int ipmr_mfc_add(struct net *net, struct mfcctl *mfc, int mrtsock)
 static void mroute_clean_tables(struct net *net)
 {
 	int i;
+	LIST_HEAD(list);
 
 	/*
 	 *	Shut down all active vif entries
 	 */
 	for (i = 0; i < net->ipv4.maxvif; i++) {
 		if (!(net->ipv4.vif_table[i].flags&VIFF_STATIC))
-			vif_delete(net, i, 0);
+			vif_delete(net, i, 0, &list);
 	}
+	unregister_netdevice_many(&list);
 
 	/*
 	 *	Wipe the cache
@@ -993,7 +996,7 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, unsi
 		if (optname == MRT_ADD_VIF) {
 			ret = vif_add(net, &vif, sk == net->ipv4.mroute_sk);
 		} else {
-			ret = vif_delete(net, vif.vifc_vifi, 0);
+			ret = vif_delete(net, vif.vifc_vifi, 0, NULL);
 		}
 		rtnl_unlock();
 		return ret;
@@ -1156,6 +1159,7 @@ static int ipmr_device_event(struct notifier_block *this, unsigned long event, v
 	struct net *net = dev_net(dev);
 	struct vif_device *v;
 	int ct;
+	LIST_HEAD(list);
 
 	if (!net_eq(dev_net(dev), net))
 		return NOTIFY_DONE;
@@ -1165,8 +1169,9 @@ static int ipmr_device_event(struct notifier_block *this, unsigned long event, v
 	v = &net->ipv4.vif_table[0];
 	for (ct = 0; ct < net->ipv4.maxvif; ct++, v++) {
 		if (v->dev == dev)
-			vif_delete(net, ct, 1);
+			vif_delete(net, ct, 1, &list);
 	}
+	unregister_netdevice_many(&list);
 	return NOTIFY_DONE;
 }
 

^ permalink raw reply related

* [PATCH kernel 2.6.32-rc5] [RESEND] netdev: usb: dm9601.c can drive a device not supported yet, add support for it
From: Janusz Krzysztofik @ 2009-10-28 15:34 UTC (permalink / raw)
  To: Peter Korsgaard; +Cc: netdev, David S. Miller
In-Reply-To: <87my3jnts0.fsf@macbook.be.48ers.dk>

I found that the current version of drivers/net/usb/dm9601.c can be used to
successfully drive a low-power, low-cost network adapter with USB ID
0a46:9000, based on a DM9000E chipset. As no device with this ID is yet
present in the kernel, I have created a patch that adds support for the device
to the dm9601 driver.

Created and tested against linux-2.6.32-rc5.

Signed-off-by: Janusz Krzysztofik <jkrzyszt@tis.icnet.pl>
Acked-by: Peter Korsgaard <jacmet@sunsite.dk>

---
Thursday 22 October 2009 20:54:55 Peter Korsgaard wrote:

> Acked-by: Peter Korsgaard <jacmet@sunsite.dk>

I'm not sure if there was sometning wrong with my initial submition, but since
the patch, unlike others, has not been applied since acked last week, neither
to net-2.6 nor net-next-2.6, I decided to resend it, with a slightly modified
subject, and CC: David this time.

Thanks,
Janusz

--- linux-2.6.32-rc5/drivers/net/usb/dm9601.c.orig	2009-10-22 20:14:00.000000000 +0200
+++ linux-2.6.32-rc5/drivers/net/usb/dm9601.c	2009-10-22 20:14:04.000000000 +0200
@@ -649,6 +649,10 @@ static const struct usb_device_id produc
 	USB_DEVICE(0x0fe6, 0x8101),	/* DM9601 USB to Fast Ethernet Adapter */
 	.driver_info = (unsigned long)&dm9601_info,
 	 },
+	{
+	 USB_DEVICE(0x0a46, 0x9000),	/* DM9000E */
+	 .driver_info = (unsigned long)&dm9601_info,
+	 },
 	{},			// END
 };
 


^ permalink raw reply

* [PATCH net-next-2.6] bridge: Optimize multiple unregistration
From: Eric Dumazet @ 2009-10-28 15:35 UTC (permalink / raw)
  To: David S. Miller; +Cc: Linux Netdev List

Speedup module unloading by factorizing synchronize_rcu() calls

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/bridge/br_if.c |   19 +++++++++----------
 1 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index b1b3b0f..2117e5b 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -154,7 +154,7 @@ static void del_nbp(struct net_bridge_port *p)
 }
 
 /* called with RTNL */
-static void del_br(struct net_bridge *br)
+static void del_br(struct net_bridge *br, struct list_head *head)
 {
 	struct net_bridge_port *p, *n;
 
@@ -165,7 +165,7 @@ static void del_br(struct net_bridge *br)
 	del_timer_sync(&br->gc_timer);
 
 	br_sysfs_delbr(br->dev);
-	unregister_netdevice(br->dev);
+	unregister_netdevice_queue(br->dev, head);
 }
 
 static struct net_device *new_bridge_dev(struct net *net, const char *name)
@@ -323,7 +323,7 @@ int br_del_bridge(struct net *net, const char *name)
 	}
 
 	else
-		del_br(netdev_priv(dev));
+		del_br(netdev_priv(dev), NULL);
 
 	rtnl_unlock();
 	return ret;
@@ -462,15 +462,14 @@ int br_del_if(struct net_bridge *br, struct net_device *dev)
 void br_net_exit(struct net *net)
 {
 	struct net_device *dev;
+	LIST_HEAD(list);
 
 	rtnl_lock();
-restart:
-	for_each_netdev(net, dev) {
-		if (dev->priv_flags & IFF_EBRIDGE) {
-			del_br(netdev_priv(dev));
-			goto restart;
-		}
-	}
+	for_each_netdev(net, dev)
+		if (dev->priv_flags & IFF_EBRIDGE)
+			del_br(netdev_priv(dev), &list);
+
+	unregister_netdevice_many(&list);
 	rtnl_unlock();
 
 }

^ permalink raw reply related

* Re: [net-next-2.6 PATCH 01/23] igb: add support for seperate tx-usecs setting in ethtool
From: Stephen Hemminger @ 2009-10-28 15:42 UTC (permalink / raw)
  To: David Miller; +Cc: jeffrey.t.kirsher, netdev, gospo, alexander.h.duyck
In-Reply-To: <20091028.033922.99548522.davem@davemloft.net>

On Wed, 28 Oct 2009 03:39:22 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:

> +	new_rx_count = max(new_rx_count, (u32)IGB_MIN_RXD);

new_rx_count = max_t(u32, new_rx_count, IGB_MIN_RXD)
  is slightly cleaner (hides cast)

-- 

^ permalink raw reply

* Re: [PATCH] net: fold network name hash (v2)
From: Stephen Hemminger @ 2009-10-28 15:57 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, linux-kernel, akpm, torvalds, opurdila,
	viro
In-Reply-To: <4AE7DF8E.3020607@gmail.com>

On Wed, 28 Oct 2009 07:07:10 +0100
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Stephen Hemminger a écrit :
> > The full_name_hash does not produce a value that is evenly distributed
> > over the lower 8 bits. This causes name hash to be unbalanced with large
> > number of names. There is a standard function to fold in upper bits
> > so use that.
> > 
> > This is independent of possible improvements to full_name_hash()
> > in future.
> 
> >  static inline struct hlist_head *dev_name_hash(struct net *net, const char *name)
> >  {
> >  	unsigned hash = full_name_hash(name, strnlen(name, IFNAMSIZ));
> > -	return &net->dev_name_head[hash & ((1 << NETDEV_HASHBITS) - 1)];
> > +	return &net->dev_name_head[hash_long(hash, NETDEV_HASHBITS)];
> >  }
> >  
> >  static inline struct hlist_head *dev_index_hash(struct net *net, int ifindex)
> 
> full_name_hash() returns an "unsigned int", which is guaranteed to be 32 bits
> 
> You should therefore use hash_32(hash, NETDEV_HASHBITS),
> not hash_long() that maps to hash_64() on 64 bit arches, which is
> slower and certainly not any better with a 32bits input.

OK, I was following precedent. Only a couple places use hash_32, most use
hash_long().

Using the upper bits does give better distribution.
With 100,000 network names:

               Time       Ratio       Max   StdDev
hash_32       0.002123     1.00       422  11.07
hash_64       0.002927     1.00       400   3.97

The time field is pretty meaningless for such a small sample

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox