Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next] net: implement tcp coalescing in tcp_queue_rcv()
From: Joe Perches @ 2012-05-02 20:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Alexander Duyck, Alexander Duyck, netdev,
	Neal Cardwell, Tom Herbert, Jeff Kirsher, Michael Chan,
	Matt Carlson, Herbert Xu, Ben Hutchings, Ilpo Järvinen,
	Maciej Żenczykowski
In-Reply-To: <1335990187.22133.636.camel@edumazet-glaptop>

On Wed, 2012-05-02 at 22:23 +0200, Eric Dumazet wrote:
> On Wed, 2012-05-02 at 13:11 -0700, Joe Perches wrote:
> > It might be useful to comment that this is/can be initialized to
> > false in tcp_try_coalesce via tcp_recv_queue.
> > Otherwise this looks like a possibly uninitialized test
> > of fragstolen.  Maybe there's a path where tail is null
> > in tcp_recv_queue and it's an uninitialized test anyway.
> 
> If tcp_queue_rcv() returns 1, fragstolen is initialized in
> tcp_try_coalesce().
> 
> If tcp_queue_rcv() returns 0, fragstolen content is undefined and we
> dont care.
> 
> If a compiler or static checker complains, its only their problem.

True, but it's code that's a bit fragile and
I think as such it could be improved for any
human reader by a descriptive comment.

^ permalink raw reply

* Re: [PATCH net-next] net: implement tcp coalescing in tcp_queue_rcv()
From: Eric Dumazet @ 2012-05-02 20:23 UTC (permalink / raw)
  To: Joe Perches
  Cc: David Miller, Alexander Duyck, Alexander Duyck, netdev,
	Neal Cardwell, Tom Herbert, Jeff Kirsher, Michael Chan,
	Matt Carlson, Herbert Xu, Ben Hutchings, Ilpo Järvinen,
	Maciej Żenczykowski
In-Reply-To: <1335989484.9611.11.camel@joe2Laptop>

On Wed, 2012-05-02 at 13:11 -0700, Joe Perches wrote:

> It might be useful to comment that this is/can be initialized to
> false in tcp_try_coalesce via tcp_recv_queue.


> Otherwise this looks like a possibly uninitialized test
> of fragstolen.  Maybe there's a path where tail is null
> in tcp_recv_queue and it's an uninitialized test anyway.

If tcp_queue_rcv() returns 1, fragstolen is initialized in
tcp_try_coalesce().

If tcp_queue_rcv() returns 0, fragstolen content is undefined and we
dont care.

If a compiler or static checker complains, its only their problem.

^ permalink raw reply

* bonding: don't increase rx_dropped after processing LACPDUs
From: Jiri Bohac @ 2012-05-02 20:23 UTC (permalink / raw)
  To: Jay Vosburgh, Andy Gospodarek, netdev

Since commit 3aba891d, bonding processes LACP frames (802.3ad
mode) with bond_handle_frame(). Currently a copy of the skb is
made and the original is left to be processed by other
rx_handlers and the rest of the network stack by returning
RX_HANDLER_ANOTHER.  As there is no protocol handler for
PKT_TYPE_LACPDU, the frame is dropped and dev->rx_dropped
increased.

Fix this by making bond_handle_frame() return RX_HANDLER_CONSUMED
if bonding has processed the LACP frame.  

Signed-off-by: Jiri Bohac <jbohac@suse.cz>


diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
--- a/drivers/net/bonding/bond_3ad.c
+++ b/drivers/net/bonding/bond_3ad.c
@@ -2173,9 +2173,10 @@ re_arm:
  * received frames (loopback). Since only the payload is given to this
  * function, it check for loopback.
  */
-static void bond_3ad_rx_indication(struct lacpdu *lacpdu, struct slave *slave, u16 length)
+static int bond_3ad_rx_indication(struct lacpdu *lacpdu, struct slave *slave, u16 length)
 {
 	struct port *port;
+	int ret = RX_HANDLER_ANOTHER;
 
 	if (length >= sizeof(struct lacpdu)) {
 
@@ -2184,11 +2185,12 @@ static void bond_3ad_rx_indication(struct lacpdu *lacpdu, struct slave *slave, u
 		if (!port->slave) {
 			pr_warning("%s: Warning: port of slave %s is uninitialized\n",
 				   slave->dev->name, slave->dev->master->name);
-			return;
+			return ret;
 		}
 
 		switch (lacpdu->subtype) {
 		case AD_TYPE_LACPDU:
+			ret = RX_HANDLER_CONSUMED;
 			pr_debug("Received LACPDU on port %d\n",
 				 port->actor_port_number);
 			/* Protect against concurrent state machines */
@@ -2198,6 +2200,7 @@ static void bond_3ad_rx_indication(struct lacpdu *lacpdu, struct slave *slave, u
 			break;
 
 		case AD_TYPE_MARKER:
+			ret = RX_HANDLER_CONSUMED;
 			// No need to convert fields to Little Endian since we don't use the marker's fields.
 
 			switch (((struct bond_marker *)lacpdu)->tlv_type) {
@@ -2219,6 +2222,7 @@ static void bond_3ad_rx_indication(struct lacpdu *lacpdu, struct slave *slave, u
 			}
 		}
 	}
+	return ret;
 }
 
 /**
@@ -2456,18 +2460,20 @@ out:
 	return NETDEV_TX_OK;
 }
 
-void bond_3ad_lacpdu_recv(struct sk_buff *skb, struct bonding *bond,
+int bond_3ad_lacpdu_recv(struct sk_buff *skb, struct bonding *bond,
 			  struct slave *slave)
 {
+	int ret = RX_HANDLER_ANOTHER;
 	if (skb->protocol != PKT_TYPE_LACPDU)
-		return;
+		return ret;
 
 	if (!pskb_may_pull(skb, sizeof(struct lacpdu)))
-		return;
+		return ret;
 
 	read_lock(&bond->lock);
-	bond_3ad_rx_indication((struct lacpdu *) skb->data, slave, skb->len);
+	ret = bond_3ad_rx_indication((struct lacpdu *) skb->data, slave, skb->len);
 	read_unlock(&bond->lock);
+	return ret;
 }
 
 /*
diff --git a/drivers/net/bonding/bond_3ad.h b/drivers/net/bonding/bond_3ad.h
index 235b2cc..5ee7e3c 100644
--- a/drivers/net/bonding/bond_3ad.h
+++ b/drivers/net/bonding/bond_3ad.h
@@ -274,7 +274,7 @@ void bond_3ad_adapter_duplex_changed(struct slave *slave);
 void bond_3ad_handle_link_change(struct slave *slave, char link);
 int  bond_3ad_get_active_agg_info(struct bonding *bond, struct ad_info *ad_info);
 int bond_3ad_xmit_xor(struct sk_buff *skb, struct net_device *dev);
-void bond_3ad_lacpdu_recv(struct sk_buff *skb, struct bonding *bond,
+int bond_3ad_lacpdu_recv(struct sk_buff *skb, struct bonding *bond,
 			  struct slave *slave);
 int bond_3ad_set_carrier(struct bonding *bond);
 void bond_3ad_update_lacp_rate(struct bonding *bond);
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 62d2409..30eace7 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1444,8 +1444,9 @@ static rx_handler_result_t bond_handle_frame(struct sk_buff **pskb)
 	struct sk_buff *skb = *pskb;
 	struct slave *slave;
 	struct bonding *bond;
-	void (*recv_probe)(struct sk_buff *, struct bonding *,
+	int (*recv_probe)(struct sk_buff *, struct bonding *,
 				struct slave *);
+	int ret = RX_HANDLER_ANOTHER;
 
 	skb = skb_share_check(skb, GFP_ATOMIC);
 	if (unlikely(!skb))
@@ -1464,8 +1465,10 @@ static rx_handler_result_t bond_handle_frame(struct sk_buff **pskb)
 		struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
 
 		if (likely(nskb)) {
-			recv_probe(nskb, bond, slave);
+			ret = recv_probe(nskb, bond, slave);
 			dev_kfree_skb(nskb);
+			if (ret == RX_HANDLER_CONSUMED)
+				kfree_skb(skb);
 		}
 	}
 
@@ -1487,7 +1490,7 @@ static rx_handler_result_t bond_handle_frame(struct sk_buff **pskb)
 		memcpy(eth_hdr(skb)->h_dest, bond->dev->dev_addr, ETH_ALEN);
 	}
 
-	return RX_HANDLER_ANOTHER;
+	return ret;
 }
 
 /* enslave device <slave> to bond device <master> */
@@ -2723,7 +2726,7 @@ static void bond_validate_arp(struct bonding *bond, struct slave *slave, __be32
 	}
 }
 
-static void bond_arp_rcv(struct sk_buff *skb, struct bonding *bond,
+static int bond_arp_rcv(struct sk_buff *skb, struct bonding *bond,
 			 struct slave *slave)
 {
 	struct arphdr *arp;
@@ -2731,7 +2734,7 @@ static void bond_arp_rcv(struct sk_buff *skb, struct bonding *bond,
 	__be32 sip, tip;
 
 	if (skb->protocol != __cpu_to_be16(ETH_P_ARP))
-		return;
+		return RX_HANDLER_ANOTHER;
 
 	read_lock(&bond->lock);
 
@@ -2776,6 +2779,7 @@ static void bond_arp_rcv(struct sk_buff *skb, struct bonding *bond,
 
 out_unlock:
 	read_unlock(&bond->lock);
+	return RX_HANDLER_ANOTHER;
 }
 
 /*


-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, SUSE CZ

^ permalink raw reply related

* Re: [PATCH net-next] net: implement tcp coalescing in tcp_queue_rcv()
From: Joe Perches @ 2012-05-02 20:11 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Alexander Duyck, Alexander Duyck, netdev,
	Neal Cardwell, Tom Herbert, Jeff Kirsher, Michael Chan,
	Matt Carlson, Herbert Xu, Ben Hutchings, Ilpo Järvinen,
	Maciej Żenczykowski
In-Reply-To: <1335988709.22133.632.camel@edumazet-glaptop>

On Wed, 2012-05-02 at 21:58 +0200, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> Extend tcp coalescing implementing it from tcp_queue_rcv(), the main
> receiver function when application is not blocked in recvmsg().
> 
> Function tcp_queue_rcv() is moved a bit to allow its call from
> tcp_data_queue()
> 
> This gives good results especially if GRO could not kick, and if skb
> head is a fragment.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Alexander Duyck <alexander.h.duyck@intel.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Tom Herbert <therbert@google.com>
> ---
> To be applied after "[PATCH v2 net-next] net: take care of cloned skbs
> in tcp_try_coalesce()"
> 
>  include/net/tcp.h    |    3 ++-
>  net/ipv4/tcp.c       |   10 +++++-----
>  net/ipv4/tcp_input.c |   40 +++++++++++++++++++++-------------------
>  3 files changed, 28 insertions(+), 25 deletions(-)
> 
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 0fb84de..a9d2fb8 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -438,7 +438,8 @@ extern int tcp_disconnect(struct sock *sk, int flags);
>  
>  void tcp_connect_init(struct sock *sk);
>  void tcp_finish_connect(struct sock *sk, struct sk_buff *skb);
> -void tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen);
> +int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb,
> +			       int hdrlen, bool *fragstolen);
>  
>  /* From syncookies.c */
>  extern __u32 syncookie_secret[2][16-4+SHA_DIGEST_WORDS];
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 9670af3..bd5deff 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -980,8 +980,8 @@ static inline int select_size(const struct sock *sk, bool sg)
>  static int tcp_send_rcvq(struct sock *sk, struct msghdr *msg, size_t size)
>  {
>  	struct sk_buff *skb;
> -	struct tcp_skb_cb *cb;
>  	struct tcphdr *th;
> +	bool fragstolen;

It might be useful to comment that this is/can be initialized to
false in tcp_try_coalesce via tcp_recv_queue.

>  	skb = alloc_skb(size + sizeof(*th), sk->sk_allocation);
>  	if (!skb)
> @@ -994,14 +994,14 @@ static int tcp_send_rcvq(struct sock *sk, struct msghdr *msg, size_t size)
>  	if (memcpy_fromiovec(skb_put(skb, size), msg->msg_iov, size))
>  		goto err_free;
>  
> -	cb = TCP_SKB_CB(skb);
> -
>  	TCP_SKB_CB(skb)->seq = tcp_sk(sk)->rcv_nxt;
>  	TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(skb)->seq + size;
>  	TCP_SKB_CB(skb)->ack_seq = tcp_sk(sk)->snd_una - 1;
>  
> -	tcp_queue_rcv(sk, skb, sizeof(*th));
> -
> +	if (tcp_queue_rcv(sk, skb, sizeof(*th), &fragstolen)) {
> +		WARN_ON_ONCE(fragstolen); /* should not happen */

Otherwise this looks like a possibly uninitialized test
of fragstolen.  Maybe there's a path where tail is null
in tcp_recv_queue and it's an uninitialized test anyway.

> +		__kfree_skb(skb);
> +	}
>  	return size;
>  
>  err_free:
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index f891a5e..2233468 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -4662,6 +4662,22 @@ end:
>  		skb_set_owner_r(skb, sk);
>  }
>  
> +int tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen,
> +		  bool *fragstolen)
> +{
> +	int eaten;
> +	struct sk_buff *tail = skb_peek_tail(&sk->sk_receive_queue);
> +
> +	__skb_pull(skb, hdrlen);
> +	eaten = (tail &&
> +		 tcp_try_coalesce(sk, tail, skb, fragstolen)) ? 1 : 0;
> +	tcp_sk(sk)->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
> +	if (!eaten) {
> +		__skb_queue_tail(&sk->sk_receive_queue, skb);
> +		skb_set_owner_r(skb, sk);
> +	}
> +	return eaten;
> +}

^ permalink raw reply

* [PATCH net-next] net: implement tcp coalescing in tcp_queue_rcv()
From: Eric Dumazet @ 2012-05-02 19:58 UTC (permalink / raw)
  To: David Miller
  Cc: Alexander Duyck, Alexander Duyck, netdev, Neal Cardwell,
	Tom Herbert, Jeff Kirsher, Michael Chan, Matt Carlson, Herbert Xu,
	Ben Hutchings, Ilpo Järvinen, Maciej Żenczykowski
In-Reply-To: <1335981358.22133.605.camel@edumazet-glaptop>

From: Eric Dumazet <edumazet@google.com>

Extend tcp coalescing implementing it from tcp_queue_rcv(), the main
receiver function when application is not blocked in recvmsg().

Function tcp_queue_rcv() is moved a bit to allow its call from
tcp_data_queue()

This gives good results especially if GRO could not kick, and if skb
head is a fragment.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
---
To be applied after "[PATCH v2 net-next] net: take care of cloned skbs
in tcp_try_coalesce()"

 include/net/tcp.h    |    3 ++-
 net/ipv4/tcp.c       |   10 +++++-----
 net/ipv4/tcp_input.c |   40 +++++++++++++++++++++-------------------
 3 files changed, 28 insertions(+), 25 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 0fb84de..a9d2fb8 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -438,7 +438,8 @@ extern int tcp_disconnect(struct sock *sk, int flags);
 
 void tcp_connect_init(struct sock *sk);
 void tcp_finish_connect(struct sock *sk, struct sk_buff *skb);
-void tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen);
+int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb,
+			       int hdrlen, bool *fragstolen);
 
 /* From syncookies.c */
 extern __u32 syncookie_secret[2][16-4+SHA_DIGEST_WORDS];
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 9670af3..bd5deff 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -980,8 +980,8 @@ static inline int select_size(const struct sock *sk, bool sg)
 static int tcp_send_rcvq(struct sock *sk, struct msghdr *msg, size_t size)
 {
 	struct sk_buff *skb;
-	struct tcp_skb_cb *cb;
 	struct tcphdr *th;
+	bool fragstolen;
 
 	skb = alloc_skb(size + sizeof(*th), sk->sk_allocation);
 	if (!skb)
@@ -994,14 +994,14 @@ static int tcp_send_rcvq(struct sock *sk, struct msghdr *msg, size_t size)
 	if (memcpy_fromiovec(skb_put(skb, size), msg->msg_iov, size))
 		goto err_free;
 
-	cb = TCP_SKB_CB(skb);
-
 	TCP_SKB_CB(skb)->seq = tcp_sk(sk)->rcv_nxt;
 	TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(skb)->seq + size;
 	TCP_SKB_CB(skb)->ack_seq = tcp_sk(sk)->snd_una - 1;
 
-	tcp_queue_rcv(sk, skb, sizeof(*th));
-
+	if (tcp_queue_rcv(sk, skb, sizeof(*th), &fragstolen)) {
+		WARN_ON_ONCE(fragstolen); /* should not happen */
+		__kfree_skb(skb);
+	}
 	return size;
 
 err_free:
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index f891a5e..2233468 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4662,6 +4662,22 @@ end:
 		skb_set_owner_r(skb, sk);
 }
 
+int tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen,
+		  bool *fragstolen)
+{
+	int eaten;
+	struct sk_buff *tail = skb_peek_tail(&sk->sk_receive_queue);
+
+	__skb_pull(skb, hdrlen);
+	eaten = (tail &&
+		 tcp_try_coalesce(sk, tail, skb, fragstolen)) ? 1 : 0;
+	tcp_sk(sk)->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
+	if (!eaten) {
+		__skb_queue_tail(&sk->sk_receive_queue, skb);
+		skb_set_owner_r(skb, sk);
+	}
+	return eaten;
+}
 
 static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 {
@@ -4708,20 +4724,12 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 		}
 
 		if (eaten <= 0) {
-			struct sk_buff *tail;
 queue_and_out:
 			if (eaten < 0 &&
 			    tcp_try_rmem_schedule(sk, skb->truesize))
 				goto drop;
 
-			tail = skb_peek_tail(&sk->sk_receive_queue);
-			eaten = (tail &&
-				 tcp_try_coalesce(sk, tail, skb,
-						  &fragstolen)) ? 1 : 0;
-			if (eaten <= 0) {
-				skb_set_owner_r(skb, sk);
-				__skb_queue_tail(&sk->sk_receive_queue, skb);
-			}
+			eaten = tcp_queue_rcv(sk, skb, 0, &fragstolen);
 		}
 		tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
 		if (skb->len)
@@ -5416,14 +5424,6 @@ discard:
 	return 0;
 }
 
-void tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen)
-{
-	__skb_pull(skb, hdrlen);
-	__skb_queue_tail(&sk->sk_receive_queue, skb);
-	skb_set_owner_r(skb, sk);
-	tcp_sk(sk)->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
-}
-
 /*
  *	TCP receive function for the ESTABLISHED state.
  *
@@ -5532,6 +5532,7 @@ int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
 		} else {
 			int eaten = 0;
 			int copied_early = 0;
+			bool fragstolen = false;
 
 			if (tp->copied_seq == tp->rcv_nxt &&
 			    len - tcp_header_len <= tp->ucopy.len) {
@@ -5589,7 +5590,8 @@ int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
 				NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPHPHITS);
 
 				/* Bulk data transfer: receiver */
-				tcp_queue_rcv(sk, skb, tcp_header_len);
+				eaten = tcp_queue_rcv(sk, skb, tcp_header_len,
+						      &fragstolen);
 			}
 
 			tcp_event_data_recv(sk, skb);
@@ -5611,7 +5613,7 @@ no_ack:
 			else
 #endif
 			if (eaten)
-				__kfree_skb(skb);
+				kfree_skb_partial(skb, fragstolen);
 			else
 				sk->sk_data_ready(sk, 0);
 			return 0;

^ permalink raw reply related

* Re: [PATCH] tcp: change tcp_adv_win_scale and tcp_rmem[2]
From: Neal Cardwell @ 2012-05-02 19:48 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Tom Herbert, Yuchung Cheng
In-Reply-To: <1335961721.22133.562.camel@edumazet-glaptop>

On Wed, May 2, 2012 at 8:28 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> tcp_adv_win_scale default value is 2, meaning we expect a good citizen
> skb to have skb->len / skb->truesize ratio of 75% (3/4)
>
> In 2.6 kernels we (mis)accounted for typical MSS=1460 frame :
> 1536 + 64 + 256 = 1856 'estimated truesize', and 1856 * 3/4 = 1392.
> So these skbs were considered as not bloated.
>
> With recent truesize fixes, a typical MSS=1460 frame truesize is now the
> more precise :
> 2048 + 256 = 2304. But 2304 * 3/4 = 1728.
> So these skb are not good citizen anymore, because 1460 < 1728
>
> (GRO can escape this problem because it build skbs with a too low
> truesize.)
>
> This also means tcp advertises a too optimistic window for a given
> allocated rcvspace : When receiving frames, sk_rmem_alloc can hit
> sk_rcvbuf limit and we call tcp_prune_queue()/tcp_collapse() too often,
> especially when application is slow to drain its receive queue or in
> case of losses (netperf is fast, scp is slow). This is a major latency
> source.
>
> We should adjust the len/truesize ratio to 50% instead of 75%
>
> This patch :
>
> 1) changes tcp_adv_win_scale default to 1 instead of 2
>
> 2) increase tcp_rmem[2] limit from 4MB to 6MB to take into account
> better truesize tracking and to allow autotuning tcp receive window to
> reach same value than before. Note that same amount of kernel memory is
> consumed compared to 2.6 kernels.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Tom Herbert <therbert@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>

Acked-by: Neal Cardwell <ncardwell@google.com>

neal

^ permalink raw reply

* Re: [V2 PATCH 0/9] vhost/macvtap zeropcopy fixes
From: Eric W. Biederman @ 2012-05-02 19:40 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David Miller, jasowang, eric.dumazet, netdev, linux-kernel
In-Reply-To: <20120502081141.GA30549@redhat.com>

Side question.  Are you aware that macvtap/vhost net is broken
in the presence of  vlan accelleration and that vlan accelleration
is not optional if you are using vlan headers?

Eric

^ permalink raw reply

* Re: [PATCH v3 3/3] tcp: early retransmit: delayed fast retransmit
From: Neal Cardwell @ 2012-05-02 19:34 UTC (permalink / raw)
  To: Yuchung Cheng; +Cc: davem, ilpo.jarvinen, nanditad, netdev
In-Reply-To: <1335984391-31340-3-git-send-email-ycheng@google.com>

On Wed, May 2, 2012 at 2:46 PM, Yuchung Cheng <ycheng@google.com> wrote:
> Implementing the advanced early retransmit (sysctl_tcp_early_retrans==2).
> Delays the fast retransmit by an interval of RTT/4. We borrow the
> RTO timer to implement the delay. If we receive another ACK or send
> a new packet, the timer is cancelled and restored to original RTO
> value offset by time elapsed.  When the delayed-ER timer fires,
> we enter fast recovery and perform fast retransmit.
>
> Signed-off-by: Yuchung Cheng <ycheng@google.com>
> ---
> ChangeLog in v2:
>  - Set sysctl_tcp_early_retrans default to 2
> ChangeLog in v3:
>  - use separate u8 for early retrans stats in tcp_sock
>  - disable ER if detects any reordering

After reading patch 3 of the series, I see that patch 3 incorporates
most of those suggestions from Monday. I think it would be quite a bit
cleaner to just have patch 2 of the series put the
tcp_disable_early_retrans() call and tcp_sock fields in the ultimately
desired place, rather than having patch 2 put them somewhere and patch
3 move them, but maybe that's just me.

When all the patches in the series are applied, the one issue I still
see is that frto_counter is in a new place, a bit further away from
frto_highmark than it used to be before the patch series. I think it
would be good  to keep frto_counter in its original location, back up
nearer to frto_highmark.

neal

^ permalink raw reply

* Re: WARNING: at net/sched/sch_generic.c:256 dev_watchdog+0x277/0x280()
From: Alex Villacís Lasso @ 2012-05-02 19:20 UTC (permalink / raw)
  To: netdev
In-Reply-To: <4F92EC8D.2010806@fiec.espol.edu.ec>

El 21/04/12 12:21, Alex Villacís Lasso escribió:
> I am getting the following WARNING while running 3.4-rc3 x86_64. Every time I get this, I am running my Bittorrent client (transmission-gtk).
>
> [ 1494.720011] ------------[ cut here ]------------
> [ 1494.720024] WARNING: at net/sched/sch_generic.c:256 dev_watchdog+0x277/0x280()
> [ 1494.720027] Hardware name: OEM
> [ 1494.720031] NETDEV WATCHDOG: p17p1 (r8169): transmit queue 0 timed out
> [ 1494.720034] Modules linked in: fuse vboxpci(O) vboxnetadp(O) vboxnetflt(O) lockd vboxdrv(O) nf_conntrack_ipv4 nf_defrag_ipv4 ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables snd_hda_codec_realtek 
> snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device ppdev snd_pcm uinput coretemp microcode pcspkr iTCO_wdt iTCO_vendor_support i2c_i801 parport_pc parport snd_timer snd soundcore r8169 mii snd_page_alloc sunrpc binfmt_misc floppy i915 
> drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: scsi_wait_scan]
> [ 1494.720100] Pid: 0, comm: swapper/1 Tainted: G           O 3.4.0-rc3 #35
> [ 1494.720103] Call Trace:
> [ 1494.720106] <IRQ>  [<ffffffff81059bbf>] warn_slowpath_common+0x7f/0xc0
> [ 1494.720120]  [<ffffffff81059cb6>] warn_slowpath_fmt+0x46/0x50
> [ 1494.720127]  [<ffffffff8151ab97>] dev_watchdog+0x277/0x280
> [ 1494.720134]  [<ffffffff81068f61>] run_timer_softirq+0x131/0x440
> [ 1494.720139]  [<ffffffff8151a920>] ? qdisc_reset+0x50/0x50
> [ 1494.720145]  [<ffffffff810611a8>] __do_softirq+0xb8/0x280
> [ 1494.720151]  [<ffffffff8101cd62>] ? native_sched_clock+0x22/0x80
> [ 1494.720157]  [<ffffffff81090895>] ? sched_clock_local+0x25/0x90
> [ 1494.720164]  [<ffffffff8161619c>] call_softirq+0x1c/0x30
> [ 1494.720170]  [<ffffffff810172e5>] do_softirq+0x65/0xa0
> [ 1494.720174]  [<ffffffff8106169e>] irq_exit+0x9e/0xc0
> [ 1494.720180]  [<ffffffff81616acb>] smp_apic_timer_interrupt+0x6b/0x98
> [ 1494.720186]  [<ffffffff8161584a>] apic_timer_interrupt+0x6a/0x70
> [ 1494.720190] <EOI>  [<ffffffff8101e253>] ? mwait_idle+0x93/0x320
> [ 1494.720199]  [<ffffffff8101e1fa>] ? mwait_idle+0x3a/0x320
> [ 1494.720204]  [<ffffffff8101ee19>] cpu_idle+0xd9/0x130
> [ 1494.720210]  [<ffffffff815fb7cf>] start_secondary+0x266/0x26d
> [ 1494.720216] ---[ end trace b0a1361739c69d32 ]---
> [ 1494.721181] r8169 0000:02:00.0: p17p1: link up
>
>
Still present in 3.4-rc5.

^ permalink raw reply

* Re: WARNING: at net/sched/sch_generic.c:256 dev_watchdog+0x277/0x280()
From: Alex Villacís Lasso @ 2012-05-02 19:26 UTC (permalink / raw)
  To: netdev
In-Reply-To: <4F92EC8D.2010806@fiec.espol.edu.ec>

El 21/04/12 12:21, Alex Villacís Lasso escribió:
> I am getting the following WARNING while running 3.4-rc3 x86_64. Every time I get this, I am running my Bittorrent client (transmission-gtk).
>
> [ 1494.720011] ------------[ cut here ]------------
> [ 1494.720024] WARNING: at net/sched/sch_generic.c:256 dev_watchdog+0x277/0x280()
> [ 1494.720027] Hardware name: OEM
> [ 1494.720031] NETDEV WATCHDOG: p17p1 (r8169): transmit queue 0 timed out
> [ 1494.720034] Modules linked in: fuse vboxpci(O) vboxnetadp(O) vboxnetflt(O) lockd vboxdrv(O) nf_conntrack_ipv4 nf_defrag_ipv4 ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables snd_hda_codec_realtek 
> snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device ppdev snd_pcm uinput coretemp microcode pcspkr iTCO_wdt iTCO_vendor_support i2c_i801 parport_pc parport snd_timer snd soundcore r8169 mii snd_page_alloc sunrpc binfmt_misc floppy i915 
> drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: scsi_wait_scan]
> [ 1494.720100] Pid: 0, comm: swapper/1 Tainted: G           O 3.4.0-rc3 #35
> [ 1494.720103] Call Trace:
> [ 1494.720106] <IRQ>  [<ffffffff81059bbf>] warn_slowpath_common+0x7f/0xc0
> [ 1494.720120]  [<ffffffff81059cb6>] warn_slowpath_fmt+0x46/0x50
> [ 1494.720127]  [<ffffffff8151ab97>] dev_watchdog+0x277/0x280
> [ 1494.720134]  [<ffffffff81068f61>] run_timer_softirq+0x131/0x440
> [ 1494.720139]  [<ffffffff8151a920>] ? qdisc_reset+0x50/0x50
> [ 1494.720145]  [<ffffffff810611a8>] __do_softirq+0xb8/0x280
> [ 1494.720151]  [<ffffffff8101cd62>] ? native_sched_clock+0x22/0x80
> [ 1494.720157]  [<ffffffff81090895>] ? sched_clock_local+0x25/0x90
> [ 1494.720164]  [<ffffffff8161619c>] call_softirq+0x1c/0x30
> [ 1494.720170]  [<ffffffff810172e5>] do_softirq+0x65/0xa0
> [ 1494.720174]  [<ffffffff8106169e>] irq_exit+0x9e/0xc0
> [ 1494.720180]  [<ffffffff81616acb>] smp_apic_timer_interrupt+0x6b/0x98
> [ 1494.720186]  [<ffffffff8161584a>] apic_timer_interrupt+0x6a/0x70
> [ 1494.720190] <EOI>  [<ffffffff8101e253>] ? mwait_idle+0x93/0x320
> [ 1494.720199]  [<ffffffff8101e1fa>] ? mwait_idle+0x3a/0x320
> [ 1494.720204]  [<ffffffff8101ee19>] cpu_idle+0xd9/0x130
> [ 1494.720210]  [<ffffffff815fb7cf>] start_secondary+0x266/0x26d
> [ 1494.720216] ---[ end trace b0a1361739c69d32 ]---
> [ 1494.721181] r8169 0000:02:00.0: p17p1: link up
>
>
Still present in 3.4-rc5 (2)

^ permalink raw reply

* Re: [PATCH v3 2/3] tcp: early retransmit
From: Neal Cardwell @ 2012-05-02 19:14 UTC (permalink / raw)
  To: Yuchung Cheng; +Cc: davem, ilpo.jarvinen, nanditad, netdev
In-Reply-To: <1335984391-31340-2-git-send-email-ycheng@google.com>

On Wed, May 2, 2012 at 2:46 PM, Yuchung Cheng <ycheng@google.com> wrote:

I see this version of the patch decided against going with these
suggestions below that I made on Monday (repasted below). Do you have
a sec to share why you wanted to stick with the existing behavior?

> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -365,12 +365,13 @@ struct tcp_sock {
>
>        u32     frto_highmark;  /* snd_nxt when RTO occurred */
>        u16     advmss;         /* Advertised MSS                       */
> -       u8      frto_counter;   /* Number of new acks after RTO */
> -       u8      nonagle     : 4,/* Disable Nagle algorithm?             */
> +       u16     nonagle     : 4,/* Disable Nagle algorithm?             */
>                thin_lto    : 1,/* Use linear timeouts for thin streams */
>                thin_dupack : 1,/* Fast retransmit on first dupack      */
>                repair      : 1,
> -               unused      : 1;
> +               do_early_retrans: 1;/* Enable RFC5827 early-retransmit  */
> +
> +       u8      frto_counter;   /* Number of new acks after RTO */
>        u8      repair_queue;

To keep the change minimal and reduce the risk of mysterious
performance regressions from cache effects, I'd suggest keeping the
frto_counter and nonagle u8 bytes as u8 bytes in their current
location, and add a new u8 for the two ER bits. Same amount of space
as the scheme in the patch, just less shuffling.

> @@ -987,6 +989,7 @@ static void tcp_update_reordering(struct sock *sk, const int metric,
>                       tp->undo_marker ? tp->undo_retrans : 0);
>  #endif
>                tcp_disable_fack(tp);
> +               tcp_disable_early_retrans(tp);
>        }
>  }

I think we should stick with the behavior where we disable early
retransmit any time tcp_update_reordering() is called with a non-zero
reordering metric. This is what we've tested and measured, and my
sense is that we could risk a significant number of spurious ER
firings if instead we relax this so that only reordering >3 causes us
to disable ER. I know the delayed ER should help avoid spurious ER
firings when there is a small degree of reordering, but my guess would
be that the max(RTT/4, 2ms) is perhaps not big enough if we're
allowing delayed ER for connections that have already witnessed small
degrees of reordering. So until we have more experimental data, I'd
recommend sticking with:

       if (metric > 0)
               tcp_disable_early_retrans(tp);

neal

^ permalink raw reply

* Re: [PATCH v3 1/3] tcp: early retransmit: tcp_enter_recovery()
From: Neal Cardwell @ 2012-05-02 19:03 UTC (permalink / raw)
  To: Yuchung Cheng; +Cc: davem, ilpo.jarvinen, nanditad, netdev
In-Reply-To: <1335984391-31340-1-git-send-email-ycheng@google.com>

On Wed, May 2, 2012 at 2:46 PM, Yuchung Cheng <ycheng@google.com> wrote:
> This a prepartion patch that refactors the code to enter recovery
> into a new function tcp_enter_recovery(). It's needed to implement
> the delayed fast retransmit in ER.
>
> Signed-off-by: Yuchung Cheng <ycheng@google.com>
> ---
> ChangeLog since v1:
>  - swaped with part 1 and part2
> ChangeLog since v2:
>  - removed RFC in commit message

Acked-by: Neal Cardwell <ncardwell@google.com>

neal

^ permalink raw reply

* Re: [PATCH 2/2] ss: implement -M option to get all memory information
From: Stephen Hemminger @ 2012-05-02 19:00 UTC (permalink / raw)
  To: Shan Wei; +Cc: xemul, NetDev
In-Reply-To: <4FA1021E.6030905@gmail.com>

On Wed, 02 May 2012 17:45:02 +0800
Shan Wei <shanwei88@gmail.com> wrote:

> Hi stephen:
> 
> Stephen Hemminger said, at 2012/4/28 1:21:
> 
> > Lots of options return more or different information based on kernel
> > version, probably the biggest example is how stats are processed.
> 
> 
> how about the following patch?
> 
> ----
> [PATCH] ss: use new INET_DIAG_SKMEMINFO option to get memory information for tcp socket
> 
> 
> Signed-off-by: Shan Wei <davidshan@tencent.com>
> ---
>  misc/ss.c |   16 ++++++++++++++--
>  1 files changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/misc/ss.c b/misc/ss.c
> index 5f70a26..3cfc9e8 100644
> --- a/misc/ss.c
> +++ b/misc/ss.c
> @@ -1336,7 +1336,17 @@ static void tcp_show_info(const struct nlmsghdr *nlh, struct inet_diag_msg *r)
>  	parse_rtattr(tb, INET_DIAG_MAX, (struct rtattr*)(r+1),
>  		     nlh->nlmsg_len - NLMSG_LENGTH(sizeof(*r)));
>  
> -	if (tb[INET_DIAG_MEMINFO]) {
> +	if (tb[INET_DIAG_SKMEMINFO]) {
> +		const unsigned int *skmeminfo =  RTA_DATA(tb[INET_DIAG_SKMEMINFO]);
> +		printf(" skmem:(r%u,rb%u,t%u,tb%u,f%u,w%u,o%u)",
> +			skmeminfo[SK_MEMINFO_RMEM_ALLOC],
> +			skmeminfo[SK_MEMINFO_RCVBUF],
> +			skmeminfo[SK_MEMINFO_WMEM_ALLOC],
> +			skmeminfo[SK_MEMINFO_SNDBUF],
> +			skmeminfo[SK_MEMINFO_FWD_ALLOC],
> +			skmeminfo[SK_MEMINFO_WMEM_QUEUED],
> +			skmeminfo[SK_MEMINFO_OPTMEM]);
> +	}else if (tb[INET_DIAG_MEMINFO]) {
>  		const struct inet_diag_meminfo *minfo
>  			= RTA_DATA(tb[INET_DIAG_MEMINFO]);
>  		printf(" mem:(r%u,w%u,f%u,t%u)",
> @@ -1505,8 +1515,10 @@ static int tcp_show_netlink(struct filter *f, FILE *dump_fp, int socktype)
>  	memset(&req.r, 0, sizeof(req.r));
>  	req.r.idiag_family = AF_INET;
>  	req.r.idiag_states = f->states;
> -	if (show_mem)
> +	if (show_mem) {
>  		req.r.idiag_ext |= (1<<(INET_DIAG_MEMINFO-1));
> +		req.r.idiag_ext |= (1<<(INET_DIAG_SKMEMINFO-1));
> +	}
>  
>  	if (show_tcpinfo) {
>  		req.r.idiag_ext |= (1<<(INET_DIAG_INFO-1));

This looks good, is the skmeminfo a superset of the old meminfo?
But your code is broken on 64 bit. skmeminfo in kernel is an array of __u32!

^ permalink raw reply

* Re: sky2 still badly broken
From: Stephen Hemminger @ 2012-05-02 18:56 UTC (permalink / raw)
  To: Niccolò Belli; +Cc: netdev
In-Reply-To: <4FA14EC6.6050207@linuxsystems.it>

On Wed, 02 May 2012 17:12:06 +0200
Niccolò Belli <darkbasic@linuxsystems.it> wrote:

> Il 30/04/2012 21:25, Stephen Hemminger ha scritto:
> > You are getting CRC and FIFO overrun errors. What laptop is this?
> > Everything works fine on my old Fuijitsu with same chip (but rev 14).
> > You could try taking out the status bit checks and see if the
> > packets are really okay and the Marvell chip is complaining about
> > bogus status.
> 
> I compiled 3.4-rc5 + both sky2 patches you recently published and I did 
> some more tests:
> 
> Point to point: works flawlessly.
> Attached to the switch: rx errors, even downloading a very small 89 KB 
> file :(
> 
> This is the dump using IPv4:
> http://files.linuxsystems.it/temp/2012-05/sky2_ipv4.pcap
> 
> dmesg (I did an rmmod -f sky2 before doing the test):
> [ 1147.885026] sky2 0000:06:00.0: eth0: disabling interface
> [ 1148.909548] sky2: driver version 1.30
> [ 1148.909764] sky2 0000:06:00.0: Yukon-2 EC Ultra chip revision 3
> [ 1148.914367] sky2 0000:06:00.0: irq 45 for MSI/MSI-X
> [ 1148.916310] sky2 0000:06:00.0: eth0: addr 00:13:77:b4:1b:fa
> [ 1148.942297] sky2 0000:06:00.0: eth0: enabling interface
> [ 1148.944225] ADDRCONF(NETDEV_UP): eth0: link is not ready
> [ 1151.496295] sky2 0000:06:00.0: eth0: Link is up at 1000 Mbps, full 
> duplex, flow control rx
> [ 1151.497614] ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> [ 1178.479437] device eth0 entered promiscuous mode
> [ 1179.541707] *sky2 0000:06:00.0: eth0: rx error, status 0x7ffc0001 
> length 1468*
> [ 1179.544110] *sky2 0000:06:00.0: eth0: rx error, status 0x7ffc0001 
> length 1468*
> [ 1181.642598] device eth0 left promiscuous mode
> 
> 
> 
> 
> This is the dump using IPv6:
> http://files.linuxsystems.it/temp/2012-05/sky2_ipv6.pcap
> 
> dmesg (I did an rmmod -f sky2 before doing the test):
> [ 1314.225572] sky2 0000:06:00.0: eth0: disabling interface
> [ 1315.248523] sky2: driver version 1.30
> [ 1315.248731] sky2 0000:06:00.0: Yukon-2 EC Ultra chip revision 3
> [ 1315.249333] sky2 0000:06:00.0: irq 45 for MSI/MSI-X
> [ 1315.250307] sky2 0000:06:00.0: eth0: addr 00:13:77:b4:1b:fa
> [ 1315.271364] sky2 0000:06:00.0: eth0: enabling interface
> [ 1315.273015] ADDRCONF(NETDEV_UP): eth0: link is not ready
> [ 1317.875985] sky2 0000:06:00.0: eth0: Link is up at 1000 Mbps, full 
> duplex, flow control rx
> [ 1317.877311] ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> [ 1345.946062] device eth0 entered promiscuous mode
> [ 1349.119848] device eth0 left promiscuous mode
> [ 1376.231698] device eth0 entered promiscuous mode
> [ 1377.369095] *sky2 0000:06:00.0: eth0: rx error, status 0x7ffc0001 
> length 1468*
> [ 1379.618257] device eth0 left promiscuous mode
> 
> 
> 
> Switch is a Netgear GS724Tv3 firmware 5.0.3.5
> 
> 
> Cheers,
> Niccolò

It could be that your switch doesn't do autonegotiation or flow
control. You are getting receive fifo overflow errors.

^ permalink raw reply

* Re: [PATCH v3 1/3] tcp: early retransmit: tcp_enter_recovery()
From: Eric Dumazet @ 2012-05-02 18:49 UTC (permalink / raw)
  To: Yuchung Cheng; +Cc: davem, ilpo.jarvinen, ncardwell, nanditad, netdev
In-Reply-To: <1335984391-31340-1-git-send-email-ycheng@google.com>

On Wed, 2012-05-02 at 11:46 -0700, Yuchung Cheng wrote:
> This a prepartion patch that refactors the code to enter recovery
> into a new function tcp_enter_recovery(). It's needed to implement
> the delayed fast retransmit in ER.
> 
> Signed-off-by: Yuchung Cheng <ycheng@google.com>
> ---
> ChangeLog since v1:
>  - swaped with part 1 and part2
> ChangeLog since v2:
>  - removed RFC in commit message
>  

Acked-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply

* [PATCH v3 3/3] tcp: early retransmit: delayed fast retransmit
From: Yuchung Cheng @ 2012-05-02 18:46 UTC (permalink / raw)
  To: davem, ilpo.jarvinen; +Cc: ncardwell, nanditad, netdev, Yuchung Cheng
In-Reply-To: <1335984391-31340-1-git-send-email-ycheng@google.com>

Implementing the advanced early retransmit (sysctl_tcp_early_retrans==2).
Delays the fast retransmit by an interval of RTT/4. We borrow the
RTO timer to implement the delay. If we receive another ACK or send
a new packet, the timer is cancelled and restored to original RTO
value offset by time elapsed.  When the delayed-ER timer fires,
we enter fast recovery and perform fast retransmit.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
---
ChangeLog in v2:
 - Set sysctl_tcp_early_retrans default to 2
ChangeLog in v3:
 - use separate u8 for early retrans stats in tcp_sock
 - disable ER if detects any reordering

 include/linux/tcp.h   |    8 +++---
 include/net/tcp.h     |    3 ++
 net/ipv4/tcp_input.c  |   72 +++++++++++++++++++++++++++++++++++++++++++-----
 net/ipv4/tcp_output.c |    5 +--
 net/ipv4/tcp_timer.c  |    5 +++
 5 files changed, 78 insertions(+), 15 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 7d08a79..f8e15b2 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -365,12 +365,12 @@ struct tcp_sock {
 
 	u32	frto_highmark;	/* snd_nxt when RTO occurred */
 	u16	advmss;		/* Advertised MSS			*/
-	u16	nonagle     : 4,/* Disable Nagle algorithm?             */
+	u8	nonagle     : 4,/* Disable Nagle algorithm?             */
 		thin_lto    : 1,/* Use linear timeouts for thin streams */
 		thin_dupack : 1,/* Fast retransmit on first dupack      */
-		repair      : 1,
-		do_early_retrans: 1;/* Enable RFC5827 early-retransmit  */
-
+		repair      : 1;
+	u8	do_early_retrans:1,/* Enable RFC5827 early retransmit (ER) */
+		early_retrans_delayed:1; /* Delayed ER timer installed */
 	u8	frto_counter;	/* Number of new acks after RTO */
 	u8	repair_queue;
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 685437a..5283aa4 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -500,6 +500,8 @@ extern void tcp_send_delayed_ack(struct sock *sk);
 
 /* tcp_input.c */
 extern void tcp_cwnd_application_limited(struct sock *sk);
+extern void tcp_resume_early_retransmit(struct sock *sk);
+extern void tcp_rearm_rto(struct sock *sk);
 
 /* tcp_timer.c */
 extern void tcp_init_xmit_timers(struct sock *);
@@ -805,6 +807,7 @@ static inline void tcp_enable_early_retrans(struct tcp_sock *tp)
 {
 	tp->do_early_retrans = sysctl_tcp_early_retrans &&
 		!sysctl_tcp_thin_dupack && sysctl_tcp_reordering == 3;
+	tp->early_retrans_delayed = 0;
 }
 
 static inline void tcp_disable_early_retrans(struct tcp_sock *tp)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 98c586d..9363a54 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -989,8 +989,9 @@ static void tcp_update_reordering(struct sock *sk, const int metric,
 		       tp->undo_marker ? tp->undo_retrans : 0);
 #endif
 		tcp_disable_fack(tp);
-		tcp_disable_early_retrans(tp);
 	}
+	if (metric > 0)
+		tcp_disable_early_retrans(tp);
 }
 
 /* This must be called before lost_out is incremented */
@@ -2342,6 +2343,27 @@ static inline int tcp_dupack_heuristics(const struct tcp_sock *tp)
 	return tcp_is_fack(tp) ? tp->fackets_out : tp->sacked_out + 1;
 }
 
+static bool tcp_pause_early_retransmit(struct sock *sk, int flag)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	unsigned long delay;
+
+	/* Delay early retransmit and entering fast recovery for
+	 * max(RTT/4, 2msec) unless ack has ECE mark, no RTT samples
+	 * available, or RTO is scheduled to fire first.
+	 */
+	if (sysctl_tcp_early_retrans < 2 || (flag & FLAG_ECE) || !tp->srtt)
+		return false;
+
+	delay = max_t(unsigned long, (tp->srtt >> 5), msecs_to_jiffies(2));
+	if (!time_after(inet_csk(sk)->icsk_timeout, (jiffies + delay)))
+		return false;
+
+	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, delay, TCP_RTO_MAX);
+	tp->early_retrans_delayed = 1;
+	return true;
+}
+
 static inline int tcp_skb_timedout(const struct sock *sk,
 				   const struct sk_buff *skb)
 {
@@ -2449,7 +2471,7 @@ static inline int tcp_head_timedout(const struct sock *sk)
  * Main question: may we further continue forward transmission
  * with the same cwnd?
  */
-static int tcp_time_to_recover(struct sock *sk)
+static int tcp_time_to_recover(struct sock *sk, int flag)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	__u32 packets_out;
@@ -2503,7 +2525,7 @@ static int tcp_time_to_recover(struct sock *sk)
 	if (tp->do_early_retrans && !tp->retrans_out && tp->sacked_out &&
 	    (tp->packets_out == (tp->sacked_out + 1) && tp->packets_out < 4) &&
 	    !tcp_may_send_now(sk))
-		return 1;
+		return !tcp_pause_early_retransmit(sk, flag);
 
 	return 0;
 }
@@ -3170,7 +3192,7 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
 		if (icsk->icsk_ca_state <= TCP_CA_Disorder)
 			tcp_try_undo_dsack(sk);
 
-		if (!tcp_time_to_recover(sk)) {
+		if (!tcp_time_to_recover(sk, flag)) {
 			tcp_try_to_open(sk, flag);
 			return;
 		}
@@ -3269,16 +3291,47 @@ static void tcp_cong_avoid(struct sock *sk, u32 ack, u32 in_flight)
 /* Restart timer after forward progress on connection.
  * RFC2988 recommends to restart timer to now+rto.
  */
-static void tcp_rearm_rto(struct sock *sk)
+void tcp_rearm_rto(struct sock *sk)
 {
-	const struct tcp_sock *tp = tcp_sk(sk);
+	struct tcp_sock *tp = tcp_sk(sk);
 
 	if (!tp->packets_out) {
 		inet_csk_clear_xmit_timer(sk, ICSK_TIME_RETRANS);
 	} else {
-		inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
-					  inet_csk(sk)->icsk_rto, TCP_RTO_MAX);
+		u32 rto = inet_csk(sk)->icsk_rto;
+		/* Offset the time elapsed after installing regular RTO */
+		if (tp->early_retrans_delayed) {
+			struct sk_buff *skb = tcp_write_queue_head(sk);
+			const u32 rto_time_stamp = TCP_SKB_CB(skb)->when + rto;
+			s32 delta = (s32)(rto_time_stamp - tcp_time_stamp);
+			/* delta may not be positive if the socket is locked
+			 * when the delayed ER timer fires and is rescheduled.
+			 */
+			if (delta > 0)
+				rto = delta;
+		}
+		inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, rto,
+					  TCP_RTO_MAX);
 	}
+	tp->early_retrans_delayed = 0;
+}
+
+/* This function is called when the delayed ER timer fires. TCP enters
+ * fast recovery and performs fast-retransmit.
+ */
+void tcp_resume_early_retransmit(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	tcp_rearm_rto(sk);
+
+	/* Stop if ER is disabled after the delayed ER timer is scheduled */
+	if (!tp->do_early_retrans)
+		return;
+
+	tcp_enter_recovery(sk, false);
+	tcp_update_scoreboard(sk, 1);
+	tcp_xmit_retransmit_queue(sk);
 }
 
 /* If we get here, the whole TSO packet has not been acked. */
@@ -3727,6 +3780,9 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	if (after(ack, tp->snd_nxt))
 		goto invalid_ack;
 
+	if (tp->early_retrans_delayed)
+		tcp_rearm_rto(sk);
+
 	if (after(ack, prior_snd_una))
 		flag |= FLAG_SND_UNA_ADVANCED;
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 834e89f..d947330 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -78,9 +78,8 @@ static void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
 		tp->frto_counter = 3;
 
 	tp->packets_out += tcp_skb_pcount(skb);
-	if (!prior_packets)
-		inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
-					  inet_csk(sk)->icsk_rto, TCP_RTO_MAX);
+	if (!prior_packets || tp->early_retrans_delayed)
+		tcp_rearm_rto(sk);
 }
 
 /* SND.NXT, if window was not shrunk.
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 34d4a02..e911e6c 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -319,6 +319,11 @@ void tcp_retransmit_timer(struct sock *sk)
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct inet_connection_sock *icsk = inet_csk(sk);
 
+	if (tp->early_retrans_delayed) {
+		tcp_resume_early_retransmit(sk);
+		return;
+	}
+
 	if (!tp->packets_out)
 		goto out;
 
-- 
1.7.7.3

^ permalink raw reply related

* [PATCH v3 2/3] tcp: early retransmit
From: Yuchung Cheng @ 2012-05-02 18:46 UTC (permalink / raw)
  To: davem, ilpo.jarvinen; +Cc: ncardwell, nanditad, netdev, Yuchung Cheng
In-Reply-To: <1335984391-31340-1-git-send-email-ycheng@google.com>

This patch implements RFC 5827 early retransmit (ER) for TCP.
It reduces DUPACK threshold (dupthresh) if outstanding packets are
less than 4 to recover losses by fast recovery instead of timeout.

While the algorithm is simple, small but frequent network reordering
makes this feature dangerous: the connection repeatedly enter
false recovery and degrade performance. Therefore we implement
a mitigation suggested in the appendix of the RFC that delays
entering fast recovery by a small interval, i.e., RTT/4. But when
the network reordering degree is too large, i.e., 3 packets,
ER is disabled to avoid false fast recoveries for the rest of
the connection. The performance impact of ER is summarized in
section 6 of the paper "Proportional Rate Reduction for TCP”,
IMC 2011. http://conferences.sigcomm.org/imc/2011/docs/p155.pdf

Note that Linux has a similar feature called THIN_DUPACK. The
differences are THIN_DUPACK do not mitigate reorderings and is only
used after slow start. Currently ER is disabled if THIN_DUPACK is
enabled. I would be happy to merge THIN_DUPACK feature with ER if
people think it's a good idea.

ER is enabled by sysctl_tcp_early_retrans:
  0: Disables ER

  1: Reduce dupthresh to packets_out - 1 when outstanding packets < 4.

  2: (Default) reduce dupthresh like mode 1. In addition, delay
     entering fast recovery by RTT/4.

Note: mode 2 is implemented in the third part of this patch series.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
---
ChangeLog in v2:
 - swapped part1 and part2
ChangeLog in v3:
 - trivial text fixes in documentation.

 Documentation/networking/ip-sysctl.txt |   14 ++++++++++++++
 include/linux/tcp.h                    |    7 ++++---
 include/net/tcp.h                      |   15 +++++++++++++++
 net/ipv4/sysctl_net_ipv4.c             |   10 ++++++++++
 net/ipv4/tcp.c                         |    3 +++
 net/ipv4/tcp_input.c                   |   13 +++++++++++++
 net/ipv4/tcp_minisocks.c               |    1 +
 7 files changed, 60 insertions(+), 3 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 9b569a2..34916e7 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -190,6 +190,20 @@ tcp_cookie_size - INTEGER
 tcp_dsack - BOOLEAN
 	Allows TCP to send "duplicate" SACKs.
 
+tcp_early_retrans - INTEGER
+	Enable Early Retransmit (ER), per RFC 5827. ER lowers the threshold
+	for triggering fast retransmit when the amount of outstanding data is
+	small and when no previously unsent data can be transmitted (such
+	that limited transmit could be used).
+	Possible values:
+		0 disables ER
+		1 enables ER
+		2 enables ER but delays fast recovery and fast retransmit
+		  by a fourth of RTT. This mitigates connection falsely
+		  recovers when network has a small degree of reordering
+		  (less than 3 packets).
+	Default: 2
+
 tcp_ecn - INTEGER
 	Enable Explicit Congestion Notification (ECN) in TCP. ECN is only
 	used when both ends of the TCP flow support it. It is useful to
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 278af9e..7d08a79 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -365,12 +365,13 @@ struct tcp_sock {
 
 	u32	frto_highmark;	/* snd_nxt when RTO occurred */
 	u16	advmss;		/* Advertised MSS			*/
-	u8	frto_counter;	/* Number of new acks after RTO */
-	u8	nonagle     : 4,/* Disable Nagle algorithm?             */
+	u16	nonagle     : 4,/* Disable Nagle algorithm?             */
 		thin_lto    : 1,/* Use linear timeouts for thin streams */
 		thin_dupack : 1,/* Fast retransmit on first dupack      */
 		repair      : 1,
-		unused      : 1;
+		do_early_retrans: 1;/* Enable RFC5827 early-retransmit  */
+
+	u8	frto_counter;	/* Number of new acks after RTO */
 	u8	repair_queue;
 
 /* RTT measurement */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 0fb84de..685437a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -252,6 +252,7 @@ extern int sysctl_tcp_max_ssthresh;
 extern int sysctl_tcp_cookie_size;
 extern int sysctl_tcp_thin_linear_timeouts;
 extern int sysctl_tcp_thin_dupack;
+extern int sysctl_tcp_early_retrans;
 
 extern atomic_long_t tcp_memory_allocated;
 extern struct percpu_counter tcp_sockets_allocated;
@@ -797,6 +798,20 @@ static inline void tcp_enable_fack(struct tcp_sock *tp)
 	tp->rx_opt.sack_ok |= TCP_FACK_ENABLED;
 }
 
+/* TCP early-retransmit (ER) is similar to but more conservative than
+ * the thin-dupack feature.  Enable ER only if thin-dupack is disabled.
+ */
+static inline void tcp_enable_early_retrans(struct tcp_sock *tp)
+{
+	tp->do_early_retrans = sysctl_tcp_early_retrans &&
+		!sysctl_tcp_thin_dupack && sysctl_tcp_reordering == 3;
+}
+
+static inline void tcp_disable_early_retrans(struct tcp_sock *tp)
+{
+	tp->do_early_retrans = 0;
+}
+
 static inline unsigned int tcp_left_out(const struct tcp_sock *tp)
 {
 	return tp->sacked_out + tp->lost_out;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 33417f8..ef32956 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -27,6 +27,7 @@
 #include <net/tcp_memcontrol.h>
 
 static int zero;
+static int two = 2;
 static int tcp_retr1_max = 255;
 static int ip_local_port_range_min[] = { 1, 1 };
 static int ip_local_port_range_max[] = { 65535, 65535 };
@@ -677,6 +678,15 @@ static struct ctl_table ipv4_table[] = {
 		.proc_handler   = proc_dointvec
 	},
 	{
+		.procname	= "tcp_early_retrans",
+		.data		= &sysctl_tcp_early_retrans,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &two,
+	},
+	{
 		.procname	= "udp_mem",
 		.data		= &sysctl_udp_mem,
 		.maxlen		= sizeof(sysctl_udp_mem),
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 9670af3..6802c89 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -395,6 +395,7 @@ void tcp_init_sock(struct sock *sk)
 	tp->mss_cache = TCP_MSS_DEFAULT;
 
 	tp->reordering = sysctl_tcp_reordering;
+	tcp_enable_early_retrans(tp);
 	icsk->icsk_ca_ops = &tcp_init_congestion_ops;
 
 	sk->sk_state = TCP_CLOSE;
@@ -2495,6 +2496,8 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 			err = -EINVAL;
 		else
 			tp->thin_dupack = val;
+			if (tp->thin_dupack)
+				tcp_disable_early_retrans(tp);
 		break;
 
 	case TCP_REPAIR:
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 22df826..98c586d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -99,6 +99,7 @@ int sysctl_tcp_thin_dupack __read_mostly;
 
 int sysctl_tcp_moderate_rcvbuf __read_mostly = 1;
 int sysctl_tcp_abc __read_mostly;
+int sysctl_tcp_early_retrans __read_mostly = 2;
 
 #define FLAG_DATA		0x01 /* Incoming frame contained data.		*/
 #define FLAG_WIN_UPDATE		0x02 /* Incoming ACK was a window update.	*/
@@ -906,6 +907,7 @@ static void tcp_init_metrics(struct sock *sk)
 	if (dst_metric(dst, RTAX_REORDERING) &&
 	    tp->reordering != dst_metric(dst, RTAX_REORDERING)) {
 		tcp_disable_fack(tp);
+		tcp_disable_early_retrans(tp);
 		tp->reordering = dst_metric(dst, RTAX_REORDERING);
 	}
 
@@ -987,6 +989,7 @@ static void tcp_update_reordering(struct sock *sk, const int metric,
 		       tp->undo_marker ? tp->undo_retrans : 0);
 #endif
 		tcp_disable_fack(tp);
+		tcp_disable_early_retrans(tp);
 	}
 }
 
@@ -2492,6 +2495,16 @@ static int tcp_time_to_recover(struct sock *sk)
 	    tcp_is_sack(tp) && !tcp_send_head(sk))
 		return 1;
 
+	/* Trick#6: TCP early retransmit, per RFC5827.  To avoid spurious
+	 * retransmissions due to small network reorderings, we implement
+	 * Mitigation A.3 in the RFC and delay the retransmission for a short
+	 * interval if appropriate.
+	 */
+	if (tp->do_early_retrans && !tp->retrans_out && tp->sacked_out &&
+	    (tp->packets_out == (tp->sacked_out + 1) && tp->packets_out < 4) &&
+	    !tcp_may_send_now(sk))
+		return 1;
+
 	return 0;
 }
 
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 3cabafb..6f6a918 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -482,6 +482,7 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
 		newtp->sacked_out = 0;
 		newtp->fackets_out = 0;
 		newtp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
+		tcp_enable_early_retrans(newtp);
 
 		/* So many TCP implementations out there (incorrectly) count the
 		 * initial SYN frame in their delayed-ACK and congestion control
-- 
1.7.7.3

^ permalink raw reply related

* [PATCH v3 1/3] tcp: early retransmit: tcp_enter_recovery()
From: Yuchung Cheng @ 2012-05-02 18:46 UTC (permalink / raw)
  To: davem, ilpo.jarvinen; +Cc: ncardwell, nanditad, netdev, Yuchung Cheng

This a prepartion patch that refactors the code to enter recovery
into a new function tcp_enter_recovery(). It's needed to implement
the delayed fast retransmit in ER.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
---
ChangeLog since v1:
 - swaped with part 1 and part2
ChangeLog since v2:
 - removed RFC in commit message
 
 net/ipv4/tcp_input.c |   61 +++++++++++++++++++++++++++----------------------
 1 files changed, 34 insertions(+), 27 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c93b0cb..22df826 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3022,6 +3022,38 @@ static void tcp_update_cwnd_in_recovery(struct sock *sk, int newly_acked_sacked,
 	tp->snd_cwnd = tcp_packets_in_flight(tp) + sndcnt;
 }
 
+static void tcp_enter_recovery(struct sock *sk, bool ece_ack)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	int mib_idx;
+
+	if (tcp_is_reno(tp))
+		mib_idx = LINUX_MIB_TCPRENORECOVERY;
+	else
+		mib_idx = LINUX_MIB_TCPSACKRECOVERY;
+
+	NET_INC_STATS_BH(sock_net(sk), mib_idx);
+
+	tp->high_seq = tp->snd_nxt;
+	tp->prior_ssthresh = 0;
+	tp->undo_marker = tp->snd_una;
+	tp->undo_retrans = tp->retrans_out;
+
+	if (inet_csk(sk)->icsk_ca_state < TCP_CA_CWR) {
+		if (!ece_ack)
+			tp->prior_ssthresh = tcp_current_ssthresh(sk);
+		tp->snd_ssthresh = inet_csk(sk)->icsk_ca_ops->ssthresh(sk);
+		TCP_ECN_queue_cwr(tp);
+	}
+
+	tp->bytes_acked = 0;
+	tp->snd_cwnd_cnt = 0;
+	tp->prior_cwnd = tp->snd_cwnd;
+	tp->prr_delivered = 0;
+	tp->prr_out = 0;
+	tcp_set_ca_state(sk, TCP_CA_Recovery);
+}
+
 /* Process an event, which can update packets-in-flight not trivially.
  * Main goal of this function is to calculate new estimate for left_out,
  * taking into account both packets sitting in receiver's buffer and
@@ -3041,7 +3073,7 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
 	struct tcp_sock *tp = tcp_sk(sk);
 	int do_lost = is_dupack || ((flag & FLAG_DATA_SACKED) &&
 				    (tcp_fackets_out(tp) > tp->reordering));
-	int fast_rexmit = 0, mib_idx;
+	int fast_rexmit = 0;
 
 	if (WARN_ON(!tp->packets_out && tp->sacked_out))
 		tp->sacked_out = 0;
@@ -3142,32 +3174,7 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
 		}
 
 		/* Otherwise enter Recovery state */
-
-		if (tcp_is_reno(tp))
-			mib_idx = LINUX_MIB_TCPRENORECOVERY;
-		else
-			mib_idx = LINUX_MIB_TCPSACKRECOVERY;
-
-		NET_INC_STATS_BH(sock_net(sk), mib_idx);
-
-		tp->high_seq = tp->snd_nxt;
-		tp->prior_ssthresh = 0;
-		tp->undo_marker = tp->snd_una;
-		tp->undo_retrans = tp->retrans_out;
-
-		if (icsk->icsk_ca_state < TCP_CA_CWR) {
-			if (!(flag & FLAG_ECE))
-				tp->prior_ssthresh = tcp_current_ssthresh(sk);
-			tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk);
-			TCP_ECN_queue_cwr(tp);
-		}
-
-		tp->bytes_acked = 0;
-		tp->snd_cwnd_cnt = 0;
-		tp->prior_cwnd = tp->snd_cwnd;
-		tp->prr_delivered = 0;
-		tp->prr_out = 0;
-		tcp_set_ca_state(sk, TCP_CA_Recovery);
+		tcp_enter_recovery(sk, (flag & FLAG_ECE));
 		fast_rexmit = 1;
 	}
 
-- 
1.7.7.3

^ permalink raw reply related

* Re: [PATCH] ehea: add alias entry for portN properties
From: Breno Leitao @ 2012-05-02 18:34 UTC (permalink / raw)
  To: Thadeu Lima De Souza Cascardo
  Cc: Jeff Mahoney, Breno Leitao, Network Development, Olaf Hering
In-Reply-To: <4FA17C4E.1030905@suse.com>

Thadeu,

Could you check this patch, please?

Thanks
On 05/02/2012 03:26 PM, Jeff Mahoney wrote:
> Use separate table for alias entries in the ehea module,
> otherwise the probe() function will operate on the separate ports
> instead of the lhea-"root" entry of the device-tree
> 
> Initially reported here:
> https://bugzilla.novell.com/show_bug.cgi?id=435215
> 
> Signed-off-by: Jeff Mahoney <jeffm@suse.com>
> Signed-off-by: Olaf Hering <ohering@suse.com>
> ---
>  drivers/net/ethernet/ibm/ehea/ehea_main.c |   14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> --- a/drivers/net/ethernet/ibm/ehea/ehea_main.c
> +++ b/drivers/net/ethernet/ibm/ehea/ehea_main.c
> @@ -103,6 +103,19 @@ static int __devinit ehea_probe_adapter(
> 
>  static int __devexit ehea_remove(struct platform_device *dev);
> 
> +static struct of_device_id ehea_module_device_table[] = {
> +	{
> +		.name = "lhea",
> +		.compatible = "IBM,lhea",
> +	},
> +	{
> +		.type = "network",
> +		.compatible = "IBM,lhea-ethernet",
> +	},
> +	{},
> +};
> +MODULE_DEVICE_TABLE(of, ehea_module_device_table);
> +
>  static struct of_device_id ehea_device_table[] = {
>  	{
>  		.name = "lhea",
> @@ -110,7 +123,6 @@ static struct of_device_id ehea_device_t
>  	},
>  	{},
>  };
> -MODULE_DEVICE_TABLE(of, ehea_device_table);
> 
>  static struct of_platform_driver ehea_driver = {
>  	.driver = {

^ permalink raw reply

* [PATCH] ehea: add alias entry for portN properties
From: Jeff Mahoney @ 2012-05-02 18:26 UTC (permalink / raw)
  To: Breno Leitao; +Cc: Network Development, Olaf Hering

Use separate table for alias entries in the ehea module,
otherwise the probe() function will operate on the separate ports
instead of the lhea-"root" entry of the device-tree

Initially reported here:
https://bugzilla.novell.com/show_bug.cgi?id=435215

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Olaf Hering <ohering@suse.com>
---
 drivers/net/ethernet/ibm/ehea/ehea_main.c |   14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

--- a/drivers/net/ethernet/ibm/ehea/ehea_main.c
+++ b/drivers/net/ethernet/ibm/ehea/ehea_main.c
@@ -103,6 +103,19 @@ static int __devinit ehea_probe_adapter(
 
 static int __devexit ehea_remove(struct platform_device *dev);
 
+static struct of_device_id ehea_module_device_table[] = {
+	{
+		.name = "lhea",
+		.compatible = "IBM,lhea",
+	},
+	{
+		.type = "network",
+		.compatible = "IBM,lhea-ethernet",
+	},
+	{},
+};
+MODULE_DEVICE_TABLE(of, ehea_module_device_table);
+
 static struct of_device_id ehea_device_table[] = {
 	{
 		.name = "lhea",
@@ -110,7 +123,6 @@ static struct of_device_id ehea_device_t
 	},
 	{},
 };
-MODULE_DEVICE_TABLE(of, ehea_device_table);
 
 static struct of_platform_driver ehea_driver = {
 	.driver = {
-- 
Jeff Mahoney
SUSE Labs

^ permalink raw reply

* Re: [PATCH net-next] net: take care of cloned skbs in tcp_try_coalesce()
From: Eric Dumazet @ 2012-05-02 18:15 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alexander Duyck, David Miller, netdev, Neal Cardwell, Tom Herbert,
	Jeff Kirsher, Michael Chan, Matt Carlson, Herbert Xu,
	Ben Hutchings, Ilpo Järvinen, Maciej Żenczykowski
In-Reply-To: <4FA17781.6080306@intel.com>

On Wed, 2012-05-02 at 11:05 -0700, Alexander Duyck wrote:

> You're correct about the fragstolen case, I actually was thinking of the
> first patch you sent, not this second one.
> 
> However we still have a problem.  What we end up with now is a case of
> sharing in which the clone skb no longer knows that it is sharing the
> head with another skb.  The dataref will drop to 1 when we call
> __kfree_skb.  This means that any other function out there that tries to
> see if the skb is shared would return false.  This could lead to issues
> if there is anything out there that manipulates the data in head based
> on the false assumption that it is not cloned.  What we would probably
> need to do in this case is tweak the logic for skb_cloned.  If you are
> using a head_frag you should probably add a check that returns true if
> cloned is true and page_count is greater than 1.  We should be safe in
> the case of skb_header_cloned since we already dropped are dataref when
> we stole the page and freed the skb.

I really dont understand this concern.

When skb is cloned, we copy in head_frag __skb_clone()

So both skbs have the bit set, and dataref = 2.

first skb is freed, dataref becomes 1 and nothing special  happen

>From this point, skb->head is not 'shared' anymore (taken your own
words). And we are free to do whatever we want.

second skb is freed, dataref becomes 0 and we call the right destructor.

^ permalink raw reply

* Re: [PATCH net-next] net: take care of cloned skbs in tcp_try_coalesce()
From: Alexander Duyck @ 2012-05-02 18:05 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Duyck, David Miller, netdev, Neal Cardwell, Tom Herbert,
	Jeff Kirsher, Michael Chan, Matt Carlson, Herbert Xu,
	Ben Hutchings, Ilpo Järvinen, Maciej Żenczykowski
In-Reply-To: <1335977179.22133.599.camel@edumazet-glaptop>

On 05/02/2012 09:46 AM, Eric Dumazet wrote:
> On Wed, 2012-05-02 at 09:27 -0700, Alexander Duyck wrote:
[...]
>>>>> @@ -4515,7 +4521,12 @@ copyfrags:
>>>>>  		offset = from->data - (unsigned char *)page_address(page);
>>>>>  		skb_fill_page_desc(to, skb_shinfo(to)->nr_frags,
>>>>>  				   page, offset, skb_headlen(from));
>>>>> -		*fragstolen = true;
>>>>> +
>>>>> +		if (skb_cloned(from))
>>>>> +			get_page(page);
>>>>> +		else
>>>>> +			*fragstolen = true;
>>>>> +
>>>>>  		delta = len; /* we dont know real truesize... */
>>>>>  		goto copyfrags;
>>>>>  	}
>>>>>
>>>>>
>>>> I don't see where we are now addressing the put_page call to release the
>>>> page afterwards.  By calling get_page you are incrementing the page
>>>> count by one, but where are you decrementing dataref in the shared
>>>> info?  Without that we are looking at a memory leak because __kfree_skb
>>>> will decrement the dataref but it will never reach 0 so it will never
>>>> call put_page on the head frag.
>>> really the dataref was already incremented at skb_clone() time
>>>
>>> It will be properly decremented since we call __kfree_skb()
>>>
>>> Only the last decrement will perform the put_page()
>>>
>>> Think about splice() is doing, its the same get_page() game.
>> I think you are missing the point.  So skb_clone will bump the dataref
>> to 2, calling get_page will bump the page count to 2.  After this
>> function you don't call __kfree_skb(skb) instead you call
>> kmem_cache_free(skbuff_head_cache, skb).  This will free the sk_buff,
>> but not decrement dataref leaving it at 2.  Eventually the raw socket
>> will call kfree_skb(skb) on the clone dropping the dataref to 1 and you
>> will call put_page dropping the page count to 1, but I don't see where
>> the last __kfree_skb call will come from that will drop dataref and the
>> page count to 0.
> No, you miss that _if_ we added one to page count, then we wont call
> kmem_cache_free(skbuff_head_cache, skb) but call __kfree_skb(skb)
> instead because fragstolen will be false.
>
> if (fragstolen)
> 	kmem_cache_free(...)
> else
> 	__kfree_skb(...)
>
> In future patch (addressing tcp coalescing in tcp_queue_rcv() as well),
> I'll add a helper to make this more clear.
You're correct about the fragstolen case, I actually was thinking of the
first patch you sent, not this second one.

However we still have a problem.  What we end up with now is a case of
sharing in which the clone skb no longer knows that it is sharing the
head with another skb.  The dataref will drop to 1 when we call
__kfree_skb.  This means that any other function out there that tries to
see if the skb is shared would return false.  This could lead to issues
if there is anything out there that manipulates the data in head based
on the false assumption that it is not cloned.  What we would probably
need to do in this case is tweak the logic for skb_cloned.  If you are
using a head_frag you should probably add a check that returns true if
cloned is true and page_count is greater than 1.  We should be safe in
the case of skb_header_cloned since we already dropped are dataref when
we stole the page and freed the skb.

Thanks,

Alex

^ permalink raw reply

* [PATCH v2 net-next] net: take care of cloned skbs in tcp_try_coalesce()
From: Eric Dumazet @ 2012-05-02 17:55 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alexander Duyck, David Miller, netdev, Neal Cardwell, Tom Herbert,
	Jeff Kirsher, Michael Chan, Matt Carlson, Herbert Xu,
	Ben Hutchings, Ilpo Järvinen, Maciej Żenczykowski
In-Reply-To: <1335977179.22133.599.camel@edumazet-glaptop>

From: Eric Dumazet <edumazet@google.com>

Before stealing fragments or skb head, we must make sure skbs are not
cloned.

Alexander was worried about destination skb being cloned : In bridge
setups, a driver could be fooled if skb->data_len would not match skb
nr_frags.

If source skb is cloned, we must take references on pages instead.

Bug happened using tcpdump (if not using mmap())

Introduce kfree_skb_partial() helper to cleanup code.

Reported-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_input.c |   42 +++++++++++++++++++++++++++--------------
 1 file changed, 28 insertions(+), 14 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 96a631d..f891a5e 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4455,6 +4455,7 @@ static inline int tcp_try_rmem_schedule(struct sock *sk, unsigned int size)
  * @sk: socket
  * @to: prior buffer
  * @from: buffer to add in queue
+ * @fragstolen: pointer to boolean
  *
  * Before queueing skb @from after @to, try to merge them
  * to reduce overall memory use and queue lengths, if cost is small.
@@ -4467,10 +4468,10 @@ static bool tcp_try_coalesce(struct sock *sk,
 			     struct sk_buff *from,
 			     bool *fragstolen)
 {
-	int delta, len = from->len;
+	int i, delta, len = from->len;
 
 	*fragstolen = false;
-	if (tcp_hdr(from)->fin)
+	if (tcp_hdr(from)->fin || skb_cloned(to))
 		return false;
 	if (len <= skb_tailroom(to)) {
 		BUG_ON(skb_copy_bits(from, 0, skb_put(to, len), len));
@@ -4497,7 +4498,13 @@ copyfrags:
 		       skb_shinfo(from)->frags,
 		       skb_shinfo(from)->nr_frags * sizeof(skb_frag_t));
 		skb_shinfo(to)->nr_frags += skb_shinfo(from)->nr_frags;
-		skb_shinfo(from)->nr_frags = 0;
+
+		if (skb_cloned(from))
+			for (i = 0; i < skb_shinfo(from)->nr_frags; i++)
+				skb_frag_ref(from, i);
+		else
+			skb_shinfo(from)->nr_frags = 0;
+
 		to->truesize += delta;
 		atomic_add(delta, &sk->sk_rmem_alloc);
 		sk_mem_charge(sk, delta);
@@ -4515,13 +4522,26 @@ copyfrags:
 		offset = from->data - (unsigned char *)page_address(page);
 		skb_fill_page_desc(to, skb_shinfo(to)->nr_frags,
 				   page, offset, skb_headlen(from));
-		*fragstolen = true;
+
+		if (skb_cloned(from))
+			get_page(page);
+		else
+			*fragstolen = true;
+
 		delta = len; /* we dont know real truesize... */
 		goto copyfrags;
 	}
 	return false;
 }
 
+static void kfree_skb_partial(struct sk_buff *skb, bool head_stolen)
+{
+	if (head_stolen)
+		kmem_cache_free(skbuff_head_cache, skb);
+	else
+		__kfree_skb(skb);
+}
+
 static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
@@ -4565,10 +4585,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 		if (!tcp_try_coalesce(sk, skb1, skb, &fragstolen)) {
 			__skb_queue_after(&tp->out_of_order_queue, skb1, skb);
 		} else {
-			if (fragstolen)
-				kmem_cache_free(skbuff_head_cache, skb);
-			else
-				__kfree_skb(skb);
+			kfree_skb_partial(skb, fragstolen);
 			skb = NULL;
 		}
 
@@ -4727,12 +4744,9 @@ queue_and_out:
 
 		tcp_fast_path_check(sk);
 
-		if (eaten > 0) {
-			if (fragstolen)
-				kmem_cache_free(skbuff_head_cache, skb);
-			else
-				__kfree_skb(skb);
-		} else if (!sock_flag(sk, SOCK_DEAD))
+		if (eaten > 0)
+			kfree_skb_partial(skb, fragstolen);
+		else if (!sock_flag(sk, SOCK_DEAD))
 			sk->sk_data_ready(sk, 0);
 		return;
 	}

^ permalink raw reply related

* Re: [v12 PATCH 2/3] NETFILTER module xt_hmark, new target for HASH based fwmark
From: Hans Schillstrom @ 2012-05-02 17:49 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: kaber@trash.net, jengelh@medozas.de,
	netfilter-devel@vger.kernel.org, netdev@vger.kernel.org,
	hans@schillstrom.com
In-Reply-To: <20120502080944.GA17393@1984>

[-- Attachment #1: Type: text/plain, Size: 2271 bytes --]

On Wednesday 02 May 2012 10:09:44 Pablo Neira Ayuso wrote:
> On Wed, May 02, 2012 at 09:55:00AM +0200, Hans Schillstrom wrote:
> > Hello Pablo
> > (Sorry for spamming some of you, kmail started to send HTML mail)
> > 
> > On Wednesday 02 May 2012 02:34:14 Pablo Neira Ayuso wrote:
> > > Hi Hans,
> > > 
> > > I have decided to take your patch and give it one spin today.
> > > 
> > > Please, find it attached. The main things I've done are:
> > > 
> > > * splitting the code into smaller functions, thus, it becomes more
> > >   maintainable.
> > > 
> > > * try to put common code into functions, eg. the layer 4 protocol
> > >   parsing to obtain the ports is the same for both IPv4 and IPv6.
> > > 
> > > * adding the hmark_tuple abstraction, cleaner than using several
> > >   variables to set the address, ports, and so on. Thus, we only pass
> > >   one single pointer to it.
> > > 
> > > * I have removed most of the comments, they bloat the file and most
> > >   information can be extracted by reading the code. I only left the
> > >   comments that clarify "strange" things.
> > > 
> > > Regarding ICMP traffic, I think we can use the ID field for the
> > > hashing as well. Thus, we handle ICMP like other protocols.
> > 
> > Yes why not, I can give it a try.
> > 

I think we wait with this one..

> > > 
> > > Please, I'd appreciate if you can test and spot issues after my
> > > rework. I have slightly tested here.
> > 
> > OK I found some minor things, I'll send an updated version back later today.
> > I will run all my tests it will take a couple of hours.
> 
> Please, go ahead.

Done, all my tests passed

[snip]
This is what I have done.

- I reduced the code size a little bit by combining the hmark_ct_set_htuple_ipvX into one func.
  by adding a hmark_addr6_mask() and hmark_addr_any_mask()
  Note that using "otuple->src.l3num" as param 1 in both src and dst is not a typo.
  (it's not set in the rtuple)
- Made the if (dst < src) swap() in the hmark_hash() since it should be used by every caller. 
- Moved the L3 check a little bit earlier.
- changed return values for fragments.
- Added nhoffs to: hmark_set_tuple_ports(skb, (ip->ihl * 4) + nhoff, t, info);
  to get icmp working


-- 
Regards
Hans Schillstrom <hans.schillstrom@ericsson.com>

[-- Attachment #2: 0001-netfilter-add-xt_hmark-target-for-hash-based-skb-mar.patch --]
[-- Type: text/x-patch, Size: 15382 bytes --]

From 55b47c7a3f7ab6a9d0430c6b753ccf3cc3cac7b8 Mon Sep 17 00:00:00 2001
From: Hans Schillstrom <hans.schillstrom@ericsson.com>
Date: Wed, 2 May 2012 18:59:28 +0200
Subject: [PATCH 1/1] netfilter: add xt_hmark target for hash-based skb marking

The target allows you to create rules in the "raw" and "mangle" tables
which set the skbuff mark by means of hash calculation within a given
range. The nfmark can influence the routing method (see "Use netfilter
MARK value as routing key") and can also be used by other subsystems to
change their behaviour.

Some examples:

* Default rule handles all TCP, UDP, SCTP, ESP & AH

 iptables -t mangle -A PREROUTING -m state --state NEW,ESTABLISHED,RELATED \
	-j HMARK --hmark-offset 10000 --hmark-mod 10

* Handle SCTP and hash dest port only and produce a nfmark between 100-119.

 iptables -t mangle -A PREROUTING -p SCTP -j HMARK --src-mask 0 --dst-mask 0 \
	--sp-mask 0 --offset 100 --mod 20

* Fragment safe Layer 3 only, that keep a class C network flow together

 iptables -t mangle -A PREROUTING -j HMARK --method L3 \
	--src-mask 24 --mod 20 --offset 100

[ A big part of this patch has been refactorized by Pablo Neira Ayuso ]

Signed-off-by: Hans Schillstrom <hans.schillstrom@ericsson.com>
---
 include/linux/netfilter/xt_HMARK.h |   62 ++++++
 net/netfilter/Kconfig              |   15 ++
 net/netfilter/Makefile             |    1 +
 net/netfilter/xt_HMARK.c           |  371 ++++++++++++++++++++++++++++++++++++
 4 files changed, 449 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/netfilter/xt_HMARK.h
 create mode 100644 net/netfilter/xt_HMARK.c

diff --git a/include/linux/netfilter/xt_HMARK.h b/include/linux/netfilter/xt_HMARK.h
new file mode 100644
index 0000000..cdf4a8f
--- /dev/null
+++ b/include/linux/netfilter/xt_HMARK.h
@@ -0,0 +1,62 @@
+#ifndef XT_HMARK_H_
+#define XT_HMARK_H_
+
+#include <linux/types.h>
+
+enum {
+	XT_HMARK_NONE,
+	XT_HMARK_SADR_AND,
+	XT_HMARK_DADR_AND,
+	XT_HMARK_SPI_AND,
+	XT_HMARK_SPI_OR,
+	XT_HMARK_SPORT_AND,
+	XT_HMARK_DPORT_AND,
+	XT_HMARK_SPORT_OR,
+	XT_HMARK_DPORT_OR,
+	XT_HMARK_PROTO_AND,
+	XT_HMARK_RND,
+	XT_HMARK_MODULUS,
+	XT_HMARK_OFFSET,
+	XT_HMARK_CT,
+	XT_HMARK_METHOD_L3,
+	XT_HMARK_METHOD_L3_4,
+	XT_F_HMARK_SADR_AND    = 1 << XT_HMARK_SADR_AND,
+	XT_F_HMARK_DADR_AND    = 1 << XT_HMARK_DADR_AND,
+	XT_F_HMARK_SPI_AND     = 1 << XT_HMARK_SPI_AND,
+	XT_F_HMARK_SPI_OR      = 1 << XT_HMARK_SPI_OR,
+	XT_F_HMARK_SPORT_AND   = 1 << XT_HMARK_SPORT_AND,
+	XT_F_HMARK_DPORT_AND   = 1 << XT_HMARK_DPORT_AND,
+	XT_F_HMARK_SPORT_OR    = 1 << XT_HMARK_SPORT_OR,
+	XT_F_HMARK_DPORT_OR    = 1 << XT_HMARK_DPORT_OR,
+	XT_F_HMARK_PROTO_AND   = 1 << XT_HMARK_PROTO_AND,
+	XT_F_HMARK_RND         = 1 << XT_HMARK_RND,
+	XT_F_HMARK_MODULUS     = 1 << XT_HMARK_MODULUS,
+	XT_F_HMARK_OFFSET      = 1 << XT_HMARK_OFFSET,
+	XT_F_HMARK_CT          = 1 << XT_HMARK_CT,
+	XT_F_HMARK_METHOD_L3   = 1 << XT_HMARK_METHOD_L3,
+	XT_F_HMARK_METHOD_L3_4 = 1 << XT_HMARK_METHOD_L3_4,
+};
+
+union hmark_ports {
+	struct {
+		__u16	src;
+		__u16	dst;
+	} p16;
+	__u32	v32;
+};
+
+struct xt_hmark_info {
+	union nf_inet_addr	src_mask;	/* Source address mask */
+	union nf_inet_addr	dst_mask;	/* Dest address mask */
+	union hmark_ports	port_mask;
+	union hmark_ports	port_set;
+	__u32			spi_mask;
+	__u32			spi_set;
+	__u32			flags;		/* Print out only */
+	__u16			proto_mask;	/* L4 Proto mask */
+	__u32			hashrnd;
+	__u32			hmodulus;	/* Modulus */
+	__u32			hoffset;	/* Offset */
+};
+
+#endif /* XT_HMARK_H_ */
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index 0c6f67e..209c1ed 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -509,6 +509,21 @@ config NETFILTER_XT_TARGET_HL
 	since you can easily create immortal packets that loop
 	forever on the network.
 
+config NETFILTER_XT_TARGET_HMARK
+	tristate '"HMARK" target support'
+	depends on (IP6_NF_IPTABLES || IP6_NF_IPTABLES=n)
+	depends on NETFILTER_ADVANCED
+	---help---
+	This option adds the "HMARK" target.
+
+	The target allows you to create rules in the "raw" and "mangle" tables
+	which set the skbuff mark by means of hash calculation within a given
+	range. The nfmark can influence the routing method (see "Use netfilter
+	MARK value as routing key") and can also be used by other subsystems to
+	change their behaviour.
+
+	To compile it as a module, choose M here. If unsure, say N.
+
 config NETFILTER_XT_TARGET_IDLETIMER
 	tristate  "IDLETIMER target support"
 	depends on NETFILTER_ADVANCED
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index ca36765..4e7960c 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -59,6 +59,7 @@ obj-$(CONFIG_NETFILTER_XT_TARGET_CONNSECMARK) += xt_CONNSECMARK.o
 obj-$(CONFIG_NETFILTER_XT_TARGET_CT) += xt_CT.o
 obj-$(CONFIG_NETFILTER_XT_TARGET_DSCP) += xt_DSCP.o
 obj-$(CONFIG_NETFILTER_XT_TARGET_HL) += xt_HL.o
+obj-$(CONFIG_NETFILTER_XT_TARGET_HMARK) += xt_HMARK.o
 obj-$(CONFIG_NETFILTER_XT_TARGET_LED) += xt_LED.o
 obj-$(CONFIG_NETFILTER_XT_TARGET_LOG) += xt_LOG.o
 obj-$(CONFIG_NETFILTER_XT_TARGET_NFLOG) += xt_NFLOG.o
diff --git a/net/netfilter/xt_HMARK.c b/net/netfilter/xt_HMARK.c
new file mode 100644
index 0000000..76a3fa7
--- /dev/null
+++ b/net/netfilter/xt_HMARK.c
@@ -0,0 +1,371 @@
+/*
+ * xt_HMARK - Netfilter module to set mark as hash value
+ *
+ * (C) 2012 by Hans Schillstrom <hans.schillstrom@ericsson.com>
+ * (C) 2012 by Pablo Neira Ayuso <pablo@netfilter.org>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ *
+ * Description:
+ *
+ * This module calculates a hash value that can be modified by modulus and an
+ * offset, i.e. it is possible to produce a skb->mark within a range The hash
+ * value is based on a direction independent five tuple: src & dst addr src &
+ * dst ports and protocol.
+ */
+
+#include <linux/module.h>
+#include <linux/skbuff.h>
+#include <linux/icmp.h>
+
+#include <linux/netfilter/x_tables.h>
+#include <linux/netfilter/xt_HMARK.h>
+
+#include <net/ip.h>
+#if IS_ENABLED(CONFIG_NF_CONNTRACK)
+#include <net/netfilter/nf_conntrack.h>
+#endif
+#if IS_ENABLED(CONFIG_IP6_NF_IPTABLES)
+#include <net/ipv6.h>
+#include <linux/netfilter_ipv6/ip6_tables.h>
+#endif
+
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Hans Schillstrom <hans.schillstrom@ericsson.com>");
+MODULE_DESCRIPTION("Xtables: packet marking using hash calculation");
+MODULE_ALIAS("ipt_HMARK");
+MODULE_ALIAS("ip6t_HMARK");
+
+struct hmark_tuple {
+	u32			src;
+	u32			dst;
+	union hmark_ports	uports;
+	uint8_t			proto;
+};
+
+static int
+hmark_ct_set_htuple(const struct sk_buff *skb, struct hmark_tuple *t,
+			 const struct xt_hmark_info *info);
+static inline u32
+hmark_hash(struct hmark_tuple *t, const struct xt_hmark_info *info)
+{
+	u32 hash;
+
+	if (t->dst < t->src)
+		swap(t->src, t->dst);
+
+	hash = jhash_3words(t->src, t->dst, t->uports.v32, info->hashrnd);
+	hash = hash ^ (t->proto & info->proto_mask);
+
+	return (hash % info->hmodulus) + info->hoffset;
+}
+
+static void
+hmark_set_tuple_ports(const struct sk_buff *skb, unsigned int nhoff,
+		      struct hmark_tuple *t, const struct xt_hmark_info *info)
+{
+	int protoff;
+
+	protoff = proto_ports_offset(t->proto);
+	if (protoff < 0)
+		return;
+
+	nhoff += protoff;
+	if (skb_copy_bits(skb, nhoff, &t->uports, sizeof(t->uports)) < 0)
+		return;
+
+	if (t->proto == IPPROTO_ESP || t->proto == IPPROTO_AH)
+		t->uports.v32 = (t->uports.v32 & info->spi_mask) |
+				info->spi_set;
+	else {
+		t->uports.v32 = (t->uports.v32 & info->port_mask.v32) |
+				info->port_set.v32;
+
+		if (t->uports.p16.dst < t->uports.p16.src)
+			swap(t->uports.p16.dst, t->uports.p16.src);
+	}
+}
+
+#if IS_ENABLED(CONFIG_IP6_NF_IPTABLES)
+static int get_inner6_hdr(const struct sk_buff *skb, int *offset)
+{
+	struct icmp6hdr *icmp6h, _ih6;
+
+	icmp6h = skb_header_pointer(skb, *offset, sizeof(_ih6), &_ih6);
+	if (icmp6h == NULL)
+		return 0;
+
+	if (icmp6h->icmp6_type && icmp6h->icmp6_type < 128) {
+		*offset += sizeof(struct icmp6hdr);
+		return 1;
+	}
+	return 0;
+}
+
+static inline u32 hmark_addr6_mask(const __u32 *addr32, const __u32 *mask)
+{
+	return  (addr32[0] & mask[0]) ^
+		(addr32[1] & mask[1]) ^
+		(addr32[2] & mask[2]) ^
+		(addr32[3] & mask[3]);
+}
+
+static int
+hmark_pkt_set_htuple_ipv6(const struct sk_buff *skb, struct hmark_tuple *t,
+			  const struct xt_hmark_info *info)
+{
+	struct ipv6hdr *ip6, _ip6;
+	int flag = IP6T_FH_F_AUTH; /* Ports offset, find_hdr flags */
+	unsigned int nhoff = 0;
+	u16 fragoff = 0;
+	int nexthdr;
+
+	ip6 = (struct ipv6hdr *) (skb->data + skb_network_offset(skb));
+	nexthdr = ipv6_find_hdr(skb, &nhoff, -1, &fragoff, &flag);
+	if (nexthdr < 0)
+		return 0;
+	/* No need to check for icmp errors on fragments */
+	if ((flag & IP6T_FH_F_FRAG) || (nexthdr != IPPROTO_ICMPV6))
+		goto noicmp;
+	/* if an icmp error, use the inner header */
+	if (get_inner6_hdr(skb, &nhoff)) {
+		ip6 = skb_header_pointer(skb, nhoff, sizeof(_ip6), &_ip6);
+		if (ip6 == NULL)
+			return -1;
+		/* Treat AH as ESP, use SPI nothing else. */
+		flag = IP6T_FH_F_AUTH;
+		nexthdr = ipv6_find_hdr(skb, &nhoff, -1, &fragoff, &flag);
+		if (nexthdr < 0)
+			return -1;
+	}
+noicmp:
+	t->src = hmark_addr6_mask(ip6->saddr.s6_addr32, info->src_mask.all);
+	t->dst = hmark_addr6_mask(ip6->daddr.s6_addr32, info->dst_mask.all);
+
+	if (info->flags & XT_F_HMARK_METHOD_L3)
+		return 0;
+
+	t->proto = nexthdr;
+
+	if (t->proto == IPPROTO_ICMPV6)
+		return 0;
+
+	if (flag & IP6T_FH_F_FRAG)
+		return -1;
+
+	hmark_set_tuple_ports(skb, nhoff, t, info);
+
+	return 0;
+}
+
+static unsigned int
+hmark_tg_v6(struct sk_buff *skb, const struct xt_action_param *par)
+{
+	const struct xt_hmark_info *info = par->targinfo;
+	struct hmark_tuple t;
+
+	memset(&t, 0, sizeof(struct hmark_tuple));
+
+	if (info->flags & XT_F_HMARK_CT) {
+		if (hmark_ct_set_htuple(skb, &t, info) < 0)
+			return XT_CONTINUE;
+	} else {
+		if (hmark_pkt_set_htuple_ipv6(skb, &t, info) < 0)
+			return XT_CONTINUE;
+	}
+
+	skb->mark = hmark_hash(&t, info);
+	return XT_CONTINUE;
+}
+
+static inline u32
+hmark_addr_any_mask(int l3num, const __u32 *addr32, const __u32 *mask)
+{
+	if (l3num == AF_INET)
+		return *addr32 & *mask;
+
+	return hmark_addr6_mask(addr32, mask);
+}
+#else
+static inline u32
+hmark_addr_any_mask(int l3num, const __u32 *addr32, const __u32 *mask)
+{
+	return *addr32 & *mask;
+}
+
+#endif
+
+static int
+hmark_ct_set_htuple(const struct sk_buff *skb, struct hmark_tuple *t,
+			 const struct xt_hmark_info *info)
+{
+#if IS_ENABLED(CONFIG_NF_CONNTRACK)
+	enum ip_conntrack_info ctinfo;
+	struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
+	struct nf_conntrack_tuple *otuple;
+	struct nf_conntrack_tuple *rtuple;
+
+	if (ct == NULL || nf_ct_is_untracked(ct))
+		return -1;
+
+	otuple = &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple;
+	rtuple = &ct->tuplehash[IP_CT_DIR_REPLY].tuple;
+
+	t->src = hmark_addr_any_mask(otuple->src.l3num, otuple->src.u3.all,
+				     info->src_mask.all);
+	t->dst = hmark_addr_any_mask(otuple->src.l3num, rtuple->src.u3.all,
+				     info->dst_mask.all);
+
+	if (info->flags & XT_F_HMARK_METHOD_L3)
+		return 0;
+
+	t->proto = nf_ct_protonum(ct);
+	if (t->proto != IPPROTO_ICMP) {
+		t->uports.p16.src = otuple->src.u.all;
+		t->uports.p16.dst = rtuple->src.u.all;
+		t->uports.v32 = (t->uports.v32 & info->port_mask.v32) |
+				info->port_set.v32;
+		if (t->uports.p16.dst < t->uports.p16.src)
+			swap(t->uports.p16.dst, t->uports.p16.src);
+	}
+
+	return 0;
+#else
+	return -1;
+#endif
+}
+
+static int get_inner_hdr(const struct sk_buff *skb, int iphsz, int *nhoff)
+{
+	const struct icmphdr *icmph;
+	struct icmphdr _ih;
+
+	/* Not enough header? */
+	icmph = skb_header_pointer(skb, *nhoff + iphsz, sizeof(_ih), &_ih);
+	if (icmph == NULL && icmph->type > NR_ICMP_TYPES)
+		return 0;
+
+	/* Error message? */
+	if (icmph->type != ICMP_DEST_UNREACH &&
+	    icmph->type != ICMP_SOURCE_QUENCH &&
+	    icmph->type != ICMP_TIME_EXCEEDED &&
+	    icmph->type != ICMP_PARAMETERPROB &&
+	    icmph->type != ICMP_REDIRECT)
+		return 0;
+
+	*nhoff += iphsz + sizeof(_ih);
+	return 1;
+}
+
+static int
+hmark_pkt_set_htuple_ipv4(const struct sk_buff *skb, struct hmark_tuple *t,
+			  const struct xt_hmark_info *info)
+{
+	struct iphdr *ip, _ip;
+	int nhoff = skb_network_offset(skb);
+
+	ip = (struct iphdr *) (skb->data + nhoff);
+	if (ip->protocol == IPPROTO_ICMP) {
+		/* use inner header in case of ICMP errors */
+		if (get_inner_hdr(skb, ip->ihl * 4, &nhoff)) {
+			ip = skb_header_pointer(skb, nhoff, sizeof(_ip), &_ip);
+			if (ip == NULL)
+				return -1;
+		}
+	}
+
+	t->src = (__force u32) ip->saddr;
+	t->dst = (__force u32) ip->daddr;
+
+	t->src &= info->src_mask.ip;
+	t->dst &= info->dst_mask.ip;
+
+	if (info->flags & XT_F_HMARK_METHOD_L3)
+		return 0;
+
+	t->proto = ip->protocol;
+
+	/* ICMP has no ports, skip */
+	if (t->proto == IPPROTO_ICMP)
+		return 0;
+
+	/* follow-up fragments don't contain ports, skip */
+	if (ip->frag_off & htons(IP_MF | IP_OFFSET))
+		return -1;
+
+	hmark_set_tuple_ports(skb, (ip->ihl * 4) + nhoff, t, info);
+
+	return 0;
+}
+
+static unsigned int
+hmark_tg_v4(struct sk_buff *skb, const struct xt_action_param *par)
+{
+	const struct xt_hmark_info *info = par->targinfo;
+	struct hmark_tuple t;
+
+	memset(&t, 0, sizeof(struct hmark_tuple));
+
+	if (info->flags & XT_F_HMARK_CT) {
+		if (hmark_ct_set_htuple(skb, &t, info) < 0)
+			return XT_CONTINUE;
+	} else {
+		if (hmark_pkt_set_htuple_ipv4(skb, &t, info) < 0)
+			return XT_CONTINUE;
+	}
+
+	skb->mark = hmark_hash(&t, info);
+	return XT_CONTINUE;
+}
+
+static int hmark_tg_check(const struct xt_tgchk_param *par)
+{
+	const struct xt_hmark_info *info = par->targinfo;
+
+	if (!info->hmodulus) {
+		pr_info("xt_HMARK: hash modulus can't be zero\n");
+		return -EINVAL;
+	}
+	if (info->proto_mask && (info->flags & XT_F_HMARK_METHOD_L3)) {
+		pr_info("xt_HMARK: proto mask must be zero with L3 mode\n");
+		return -EINVAL;
+	}
+	return 0;
+}
+
+static struct xt_target hmark_tg_reg[] __read_mostly = {
+	{
+		.name		= "HMARK",
+		.family		= NFPROTO_IPV4,
+		.target		= hmark_tg_v4,
+		.targetsize	= sizeof(struct xt_hmark_info),
+		.checkentry	= hmark_tg_check,
+		.me		= THIS_MODULE,
+	},
+#if IS_ENABLED(CONFIG_IP6_NF_IPTABLES)
+	{
+		.name		= "HMARK",
+		.family		= NFPROTO_IPV6,
+		.target		= hmark_tg_v6,
+		.targetsize	= sizeof(struct xt_hmark_info),
+		.checkentry	= hmark_tg_check,
+		.me		= THIS_MODULE,
+	},
+#endif
+};
+
+static int __init hmark_tg_init(void)
+{
+	return xt_register_targets(hmark_tg_reg, ARRAY_SIZE(hmark_tg_reg));
+}
+
+static void __exit hmark_tg_exit(void)
+{
+	xt_unregister_targets(hmark_tg_reg, ARRAY_SIZE(hmark_tg_reg));
+}
+
+module_init(hmark_tg_init);
+module_exit(hmark_tg_exit);
-- 
1.7.2.3


^ permalink raw reply related

* Re: [PATCH 3/4 v2 net-next] net: make GRO aware of skb->head_frag
From: Rick Jones @ 2012-05-02 17:16 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Duyck, Alexander Duyck, David Miller, netdev,
	Neal Cardwell, Tom Herbert, Jeff Kirsher, Michael Chan,
	Matt Carlson, Herbert Xu, Ben Hutchings, Ilpo Järvinen,
	Maciej Żenczykowski
In-Reply-To: <1335947084.22133.134.camel@edumazet-glaptop>

On 05/02/2012 01:24 AM, Eric Dumazet wrote:
> On Tue, 2012-05-01 at 12:45 -0700, Alexander Duyck wrote:
>
>> I have a hacked together ixgbe up and running now with the new build_skb
>> logic and RSC/LRO disabled.  It looks like it is giving me a 5%
>> performance boost for small packet routing, but I am using more CPU for
>> netperf TCP receive tests and I was wondering if you had seen anything
>> similar on the tg3 driver?
>
> Really hard to say, numbers are so small on Gb link :
>
> what do you use to make your numbers ?
>
> netperf -H 172.30.42.23 -t OMNI -C -c
> OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.30.42.23 (172.30.42.23) port 0 AF_INET
> Local       Local       Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
> Send Socket Send Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
> Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
> Final       Final                                             %     Method %      Method
> 1700840     1700840     16384  10.01   931.60     10^6bits/s  4.50  S      1.32   S      1.582   2.783   usec/KB

If there is so little CPU consumed, I'm a bit surprised the throughput 
wasn't 940 Gbit/s.

It might be a good idea to fix the local and remote socket buffer sizes 
for these sorts of A-B comparisons to take the variability of the 
autotuning out.

And then, to see if the small differences are "real" one can light-up 
the confidence intervals.  For example (using kernels unrelated to the 
patch discussion):

raj@tardy:~/netperf2_trunk/src$ ./netperf -H 192.168.1.3 -t omni -c -C 
-I 99,1 -i 30,3 -- -s 256K -S 256K -m 16K -O 
throughput,local_cpu_util,local_sd,remote_cpu_util,remote_sd,throughput_confid,local_cpu_confid,remote_cpu_confid,confidence_iteration
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.3 () 
port 0 AF_INET : +/-0.500% @ 99% conf.  : interval : demo
Throughput Local Local   Remote Remote  Throughput Local      Remote 
  Confidence
            CPU   Service CPU    Service Confidence CPU        CPU 
   Iterations
            Util  Demand  Util   Demand  Width (%)  Confidence 
Confidence Run
            %             %                         Width (%)  Width (%) 

941.36     8.70  3.030   45.36  7.895   0.006      18.836     0.209 
  30

In this instance, I asked to be 99% confident the throughput and CPU 
util were within +/- 0.5% of the "real" mean.  The confidence intervals 
were hit for throughput and remote CPU util, but not for local CPU util 
- netperf was running on my personal workstation, which also receives 
email etc.  Presumably a more isolated and idle system would have hit 
the confidence intervals.

Other sources of variation to consider eliminating when looking for 
small differences in CPU utilization might be the multiqueue support in 
the NIC.  I'll often just terminate irqbalance and set all the IRQs to a 
single CPU (when doing single stream tests).  Or, one can fully specify 
the four-tuple for the netperf data connection.

rick jones
of course there is also the whole question of the effect of HW threading 
on the meaningfulness of OS-determined utilization...

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox