linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/7] tcp: bugs and cleanup updated to 2.6.33-rc8 (again)
@ 2010-02-15 12:23 William Allen Simpson
  2010-02-15 12:25 ` [PATCH v3 1/7] net: tcp_header_len_th and tcp_option_len_th William Allen Simpson
                   ` (6 more replies)
  0 siblings, 7 replies; 16+ messages in thread
From: William Allen Simpson @ 2010-02-15 12:23 UTC (permalink / raw)
  To: Linux Kernel Developers
  Cc: Linux Kernel Network Developers, Andrew Morton, David Miller

Combination of patches reported in October, November, December, January
and February.  These patches fix perceived bugs, and other cleanup.

This code has had previous review and several months of limited testing.

Some portions were removed during the various TCPCT part 1 patch splits,
then were cut off by the sudden unexpected end of that merge window.
[03 Dec 2009]  I've restarted the sub-numbering (again).

Of particular interest are the TCPCT header extensions that already
appear in the next phase of testing with other platforms.  These patches
allow correct reception without data corruption.

The remainder of the original TCPCT part 2 will be merged with part 3.

These are patches against the current linux-2.6 tree.

[Parts 2f and 2g updated, fixing some more problems introduced by
David Miller in commit bb5b7c11263dbbe78253cd05945a6bf8f55add8e]


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v3 1/7] net: tcp_header_len_th and tcp_option_len_th
  2010-02-15 12:23 [PATCH 0/7] tcp: bugs and cleanup updated to 2.6.33-rc8 (again) William Allen Simpson
@ 2010-02-15 12:25 ` William Allen Simpson
  2010-02-15 12:27 ` [PATCH v3 2/7] net: remove old tcp_optlen function William Allen Simpson
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: William Allen Simpson @ 2010-02-15 12:25 UTC (permalink / raw)
  To: Linux Kernel Developers
  Cc: Linux Kernel Network Developers, Andrew Morton, David Miller

[-- Attachment #1: Type: text/plain, Size: 359 bytes --]

Redefine two TCP header functions to accept TCP header pointer.
When subtracting, return signed int to allow error checking.

These functions will be used in subsequent patches that implement
additional features.

Signed-off-by: William.Allen.Simpson@gmail.com
---
  include/linux/tcp.h |   12 ++++++++++++
  1 files changed, 12 insertions(+), 0 deletions(-)

[-- Attachment #2: len_th+2a3+2.6.33-rc8.patch --]
[-- Type: text/plain, Size: 719 bytes --]

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 7fee8a4..d0133cf 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -223,6 +223,18 @@ static inline unsigned int tcp_optlen(const struct sk_buff *skb)
 	return (tcp_hdr(skb)->doff - 5) * 4;
 }
 
+/* Length of fixed header plus standard options. */
+static inline unsigned int tcp_header_len_th(const struct tcphdr *th)
+{
+	return th->doff * 4;
+}
+
+/* Length of standard options only.  This could be negative. */
+static inline int tcp_option_len_th(const struct tcphdr *th)
+{
+	return (int)(th->doff * 4) - sizeof(*th);
+}
+
 /* This defines a selective acknowledgement block. */
 struct tcp_sack_block_wire {
 	__be32	start_seq;
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v3 2/7] net: remove old tcp_optlen function
  2010-02-15 12:23 [PATCH 0/7] tcp: bugs and cleanup updated to 2.6.33-rc8 (again) William Allen Simpson
  2010-02-15 12:25 ` [PATCH v3 1/7] net: tcp_header_len_th and tcp_option_len_th William Allen Simpson
@ 2010-02-15 12:27 ` William Allen Simpson
  2010-02-15 12:28 ` [PATCH v5 3/7] tcp: harmonize tcp_vx_rcv header length assumptions William Allen Simpson
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: William Allen Simpson @ 2010-02-15 12:27 UTC (permalink / raw)
  To: Linux Kernel Developers
  Cc: Linux Kernel Network Developers, Andrew Morton, David Miller,
	Michael Chan

[-- Attachment #1: Type: text/plain, Size: 1071 bytes --]

The tcp_optlen() function returns a potential *negative* unsigned.

In the only two existing files using the old tcp_optlen() function,
clean up confusing and inconsistent mixing of both byte and word
offsets, and other coding style issues.  Document assumptions.

Quoth David Miller:
    This is transmit, and the packets can only come from the Linux
    TCP stack, not some external entity.

    You're being way too anal here, and adding these checks to
    drivers would be just a lot of rediculious bloat. [sic]

Therefore, there are *no* checks for bad TCP and IP header sizes, nor
any semantic changes.  The drivers should function exactly as existing.

No response from testers in 17+ weeks.

Requires:
   net: tcp_header_len_th and tcp_option_len_th

Signed-off-by: William.Allen.Simpson@gmail.com
CC: Michael Chan <mchan@broadcom.com>
---
  drivers/net/bnx2.c  |   29 +++++++++++++-----------
  drivers/net/tg3.c   |   60 +++++++++++++++++++++++---------------------------
  include/linux/tcp.h |    5 ----
  3 files changed, 44 insertions(+), 50 deletions(-)


[-- Attachment #2: len_th+2b3+2.6.33-rc8.patch --]
[-- Type: text/plain, Size: 6936 bytes --]

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index 65df1de..45452c5 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -6352,6 +6352,8 @@ bnx2_vlan_rx_register(struct net_device *dev, struct vlan_group *vlgrp)
 /* Called with netif_tx_lock.
  * bnx2_tx_int() runs without netif_tx_lock unless it needs to call
  * netif_wake_queue().
+ *
+ * No TCP or IP length checking, per David Miller (see commit log).
  */
 static netdev_tx_t
 bnx2_start_xmit(struct sk_buff *skb, struct net_device *dev)
@@ -6396,19 +6398,19 @@ bnx2_start_xmit(struct sk_buff *skb, struct net_device *dev)
 			(TX_BD_FLAGS_VLAN_TAG | (vlan_tx_tag_get(skb) << 16));
 	}
 #endif
-	if ((mss = skb_shinfo(skb)->gso_size)) {
-		u32 tcp_opt_len;
-		struct iphdr *iph;
+	mss = skb_shinfo(skb)->gso_size;
+	if (mss != 0) {
+		struct tcphdr *th = tcp_hdr(skb);
+		int tcp_opt_words = th->doff - (sizeof(*th) >> 2);
+		/* assumes positive tcp_opt_words without checking */
 
 		vlan_tag_flags |= TX_BD_FLAGS_SW_LSO;
 
-		tcp_opt_len = tcp_optlen(skb);
-
 		if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV6) {
 			u32 tcp_off = skb_transport_offset(skb) -
 				      sizeof(struct ipv6hdr) - ETH_HLEN;
 
-			vlan_tag_flags |= ((tcp_opt_len >> 2) << 8) |
+			vlan_tag_flags |= (tcp_opt_words << 8) |
 					  TX_BD_FLAGS_SW_FLAGS;
 			if (likely(tcp_off == 0))
 				vlan_tag_flags &= ~TX_BD_FLAGS_TCP6_OFF0_MSK;
@@ -6421,14 +6423,15 @@ bnx2_start_xmit(struct sk_buff *skb, struct net_device *dev)
 				mss |= (tcp_off & 0xc) << TX_BD_TCP6_OFF2_SHL;
 			}
 		} else {
-			iph = ip_hdr(skb);
-			if (tcp_opt_len || (iph->ihl > 5)) {
-				vlan_tag_flags |= ((iph->ihl - 5) +
-						   (tcp_opt_len >> 2)) << 8;
-			}
+			struct iphdr *iph = ip_hdr(skb);
+			int ip_opt_words = iph->ihl - (sizeof(*iph) >> 2);
+			/* assumes positive ip_opt_words without checking */
+			int opt_words = ip_opt_words + tcp_opt_words;
+
+			if (opt_words > 0)
+				vlan_tag_flags |= opt_words << 8;
 		}
-	} else
-		mss = 0;
+	}
 
 	mapping = pci_map_single(bp->pdev, skb->data, len, PCI_DMA_TODEVICE);
 	if (pci_dma_mapping_error(bp->pdev, mapping)) {
diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
index 7f82b02..c20c800 100644
--- a/drivers/net/tg3.c
+++ b/drivers/net/tg3.c
@@ -5426,6 +5426,8 @@ static void tg3_set_txd(struct tg3_napi *tnapi, int entry,
 
 /* hard_start_xmit for devices that don't have any bugs and
  * support TG3_FLG2_HW_TSO_2 and TG3_FLG2_HW_TSO_3 only.
+ *
+ * No TCP or IP length checking, per David Miller (see commit log).
  */
 static netdev_tx_t tg3_start_xmit(struct sk_buff *skb,
 				  struct net_device *dev)
@@ -5461,9 +5463,9 @@ static netdev_tx_t tg3_start_xmit(struct sk_buff *skb,
 
 	entry = tnapi->tx_prod;
 	base_flags = 0;
-	mss = 0;
-	if ((mss = skb_shinfo(skb)->gso_size) != 0) {
-		int tcp_opt_len, ip_tcp_len;
+	mss = skb_shinfo(skb)->gso_size;
+	if (mss != 0) {
+		struct tcphdr *th;
 		u32 hdrlen;
 
 		if (skb_header_cloned(skb) &&
@@ -5471,18 +5473,16 @@ static netdev_tx_t tg3_start_xmit(struct sk_buff *skb,
 			dev_kfree_skb(skb);
 			goto out_unlock;
 		}
+		th = tcp_hdr(skb);
 
 		if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV6)
 			hdrlen = skb_headlen(skb) - ETH_HLEN;
 		else {
 			struct iphdr *iph = ip_hdr(skb);
 
-			tcp_opt_len = tcp_optlen(skb);
-			ip_tcp_len = ip_hdrlen(skb) + sizeof(struct tcphdr);
-
+			hdrlen = ip_hdrlen(skb) + tcp_header_len_th(th);
+			iph->tot_len = htons(mss + hdrlen);
 			iph->check = 0;
-			iph->tot_len = htons(mss + ip_tcp_len + tcp_opt_len);
-			hdrlen = ip_tcp_len + tcp_opt_len;
 		}
 
 		if (tp->tg3_flags2 & TG3_FLG2_HW_TSO_3) {
@@ -5496,7 +5496,7 @@ static netdev_tx_t tg3_start_xmit(struct sk_buff *skb,
 		base_flags |= (TXD_FLAG_CPU_PRE_DMA |
 			       TXD_FLAG_CPU_POST_DMA);
 
-		tcp_hdr(skb)->check = 0;
+		th->check = 0;
 
 	}
 	else if (skb->ip_summed == CHECKSUM_PARTIAL)
@@ -5629,6 +5629,8 @@ tg3_tso_bug_end:
 
 /* hard_start_xmit for devices that have the 4G bug and/or 40-bit bug and
  * support TG3_FLG2_HW_TSO_1 or firmware TSO only.
+ *
+ * No TCP or IP length checking, per David Miller (see commit log).
  */
 static netdev_tx_t tg3_start_xmit_dma_bug(struct sk_buff *skb,
 					  struct net_device *dev)
@@ -5668,20 +5670,21 @@ static netdev_tx_t tg3_start_xmit_dma_bug(struct sk_buff *skb,
 	if (skb->ip_summed == CHECKSUM_PARTIAL)
 		base_flags |= TXD_FLAG_TCPUDP_CSUM;
 
-	if ((mss = skb_shinfo(skb)->gso_size) != 0) {
+	mss = skb_shinfo(skb)->gso_size;
+	if (mss != 0) {
 		struct iphdr *iph;
-		u32 tcp_opt_len, ip_tcp_len, hdr_len;
+		struct tcphdr *th;
+		u32 hdr_len;
+		int opt_bytes;
 
 		if (skb_header_cloned(skb) &&
 		    pskb_expand_head(skb, 0, 0, GFP_ATOMIC)) {
 			dev_kfree_skb(skb);
 			goto out_unlock;
 		}
+		th = tcp_hdr(skb);
+		hdr_len = ip_hdrlen(skb) + tcp_header_len_th(th);
 
-		tcp_opt_len = tcp_optlen(skb);
-		ip_tcp_len = ip_hdrlen(skb) + sizeof(struct tcphdr);
-
-		hdr_len = ip_tcp_len + tcp_opt_len;
 		if (unlikely((ETH_HLEN + hdr_len) > 80) &&
 			     (tp->tg3_flags2 & TG3_FLG2_TSO_BUG))
 			return (tg3_tso_bug(tp, skb));
@@ -5693,13 +5696,14 @@ static netdev_tx_t tg3_start_xmit_dma_bug(struct sk_buff *skb,
 		iph->check = 0;
 		iph->tot_len = htons(mss + hdr_len);
 		if (tp->tg3_flags2 & TG3_FLG2_HW_TSO) {
-			tcp_hdr(skb)->check = 0;
+			th->check = 0;
 			base_flags &= ~TXD_FLAG_TCPUDP_CSUM;
 		} else
-			tcp_hdr(skb)->check = ~csum_tcpudp_magic(iph->saddr,
-								 iph->daddr, 0,
-								 IPPROTO_TCP,
-								 0);
+			th->check = ~csum_tcpudp_magic(iph->saddr, iph->daddr,
+						       0, IPPROTO_TCP, 0);
+
+		opt_bytes = hdr_len - sizeof(*iph) - sizeof(*th);
+		/* assumes positive opt_bytes without checking */
 
 		if (tp->tg3_flags2 & TG3_FLG2_HW_TSO_3) {
 			mss |= (hdr_len & 0xc) << 12;
@@ -5710,19 +5714,11 @@ static netdev_tx_t tg3_start_xmit_dma_bug(struct sk_buff *skb,
 			mss |= hdr_len << 9;
 		else if ((tp->tg3_flags2 & TG3_FLG2_HW_TSO_1) ||
 			 GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5705) {
-			if (tcp_opt_len || iph->ihl > 5) {
-				int tsflags;
-
-				tsflags = (iph->ihl - 5) + (tcp_opt_len >> 2);
-				mss |= (tsflags << 11);
-			}
+			if (opt_bytes > 0)
+				mss |= opt_bytes << (11 - 2);
 		} else {
-			if (tcp_opt_len || iph->ihl > 5) {
-				int tsflags;
-
-				tsflags = (iph->ihl - 5) + (tcp_opt_len >> 2);
-				base_flags |= tsflags << 12;
-			}
+			if (opt_bytes > 0)
+				base_flags |= opt_bytes << (12 - 2);
 		}
 	}
 #if TG3_VLAN_TAG_USED
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index d0133cf..74728f7 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -218,11 +218,6 @@ static inline unsigned int tcp_hdrlen(const struct sk_buff *skb)
 	return tcp_hdr(skb)->doff * 4;
 }
 
-static inline unsigned int tcp_optlen(const struct sk_buff *skb)
-{
-	return (tcp_hdr(skb)->doff - 5) * 4;
-}
-
 /* Length of fixed header plus standard options. */
 static inline unsigned int tcp_header_len_th(const struct tcphdr *th)
 {
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v5 3/7] tcp: harmonize tcp_vx_rcv header length assumptions
  2010-02-15 12:23 [PATCH 0/7] tcp: bugs and cleanup updated to 2.6.33-rc8 (again) William Allen Simpson
  2010-02-15 12:25 ` [PATCH v3 1/7] net: tcp_header_len_th and tcp_option_len_th William Allen Simpson
  2010-02-15 12:27 ` [PATCH v3 2/7] net: remove old tcp_optlen function William Allen Simpson
@ 2010-02-15 12:28 ` William Allen Simpson
  2010-02-15 12:48   ` Andi Kleen
  2010-02-15 12:31 ` [PATCH v4 4/7] tcp: input header length, prediction, and timestamp bugs William Allen Simpson
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 16+ messages in thread
From: William Allen Simpson @ 2010-02-15 12:28 UTC (permalink / raw)
  To: Linux Kernel Developers
  Cc: Linux Kernel Network Developers, Andrew Morton, David Miller,
	Andi Kleen

[-- Attachment #1: Type: text/plain, Size: 827 bytes --]

Harmonize tcp_v4_rcv() and tcp_v6_rcv() -- better document tcp doff
and header length assumptions, and carefully compare implementations.

Reduces multiply/shifts, marginally improving speed.

Removes redundant tcp header length checks before checksumming.

Instead, assumes (and documents) that any backlog processing and
transform policies will carefully preserve the header, and will
ensure the socket buffer length remains >= the header size.

Stand-alone patch, originally developed for TCPCT.

Signed-off-by: William.Allen.Simpson@gmail.com
CC: Andi Kleen <andi@firstfloor.org>
---
  include/net/xfrm.h  |    7 ++++++
  net/ipv4/tcp_ipv4.c |   45 +++++++++++++++++++++-----------------
  net/ipv6/tcp_ipv6.c |   59 ++++++++++++++++++++++++++++----------------------
  3 files changed, 65 insertions(+), 46 deletions(-)



[-- Attachment #2: len_th+2c5+2.6.33-rc8.patch --]
[-- Type: text/plain, Size: 8037 bytes --]

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 60c2770..81492a1 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -975,6 +975,13 @@ xfrm_state_addr_cmp(struct xfrm_tmpl *tmpl, struct xfrm_state *x, unsigned short
 }
 
 #ifdef CONFIG_XFRM
+/*
+ * For transport, the policy is checked before the presumed more expensive
+ * checksum. The transport header has already been checked for size, and is
+ * guaranteed to be contiguous. These policies must not alter the header or
+ * its position in the buffer, and should not shorten the buffer length
+ * without ensuring the length remains >= the header size.
+ */
 extern int __xfrm_policy_check(struct sock *, int dir, struct sk_buff *skb, unsigned short family);
 
 static inline int __xfrm_policy_check2(struct sock *sk, int dir,
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 65b8ebf..0a76e41 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1559,7 +1559,8 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
 		return 0;
 	}
 
-	if (skb->len < tcp_hdrlen(skb) || tcp_checksum_complete(skb))
+	/* Assumes header and options unchanged since checksum_init() */
+	if (tcp_checksum_complete(skb))
 		goto csum_err;
 
 	if (sk->sk_state == TCP_LISTEN) {
@@ -1601,14 +1602,13 @@ csum_err:
 }
 
 /*
- *	From tcp_input.c
+ *	Called by ip_input.c: ip_local_deliver_finish()
  */
-
 int tcp_v4_rcv(struct sk_buff *skb)
 {
-	const struct iphdr *iph;
 	struct tcphdr *th;
 	struct sock *sk;
+	int tcp_header_len;
 	int ret;
 	struct net *net = dev_net(skb->dev);
 
@@ -1618,31 +1618,33 @@ int tcp_v4_rcv(struct sk_buff *skb)
 	/* Count it even if it's bad */
 	TCP_INC_STATS_BH(net, TCP_MIB_INSEGS);
 
+	/* Check too short header */
 	if (!pskb_may_pull(skb, sizeof(struct tcphdr)))
 		goto discard_it;
 
-	th = tcp_hdr(skb);
-
-	if (th->doff < sizeof(struct tcphdr) / 4)
+	/* Check bad doff, compare doff directly to constant value */
+	tcp_header_len = tcp_hdr(skb)->doff;
+	if (tcp_header_len < (sizeof(struct tcphdr) / 4))
 		goto bad_packet;
-	if (!pskb_may_pull(skb, th->doff * 4))
+
+	/* Check too short header and options */
+	tcp_header_len *= 4;
+	if (!pskb_may_pull(skb, tcp_header_len))
 		goto discard_it;
 
-	/* An explanation is required here, I think.
-	 * Packet length and doff are validated by header prediction,
-	 * provided case of th->doff==0 is eliminated.
-	 * So, we defer the checks. */
+	/* Packet length and doff are validated by header prediction,
+	 * provided case of th->doff == 0 is eliminated (above).
+	 */
 	if (!skb_csum_unnecessary(skb) && tcp_v4_checksum_init(skb))
 		goto bad_packet;
 
 	th = tcp_hdr(skb);
-	iph = ip_hdr(skb);
 	TCP_SKB_CB(skb)->seq = ntohl(th->seq);
 	TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
-				    skb->len - th->doff * 4);
+				    skb->len - tcp_header_len);
 	TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
 	TCP_SKB_CB(skb)->when	 = 0;
-	TCP_SKB_CB(skb)->flags	 = iph->tos;
+	TCP_SKB_CB(skb)->flags	 = ip_hdr(skb)->tos;
 	TCP_SKB_CB(skb)->sacked	 = 0;
 
 	sk = __inet_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest);
@@ -1682,14 +1684,14 @@ process:
 	bh_unlock_sock(sk);
 
 	sock_put(sk);
-
 	return ret;
 
 no_tcp_socket:
 	if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb))
 		goto discard_it;
 
-	if (skb->len < (th->doff << 2) || tcp_checksum_complete(skb)) {
+	/* Assumes header and options unchanged since checksum_init() */
+	if (tcp_checksum_complete(skb)) {
 bad_packet:
 		TCP_INC_STATS_BH(net, TCP_MIB_INERRS);
 	} else {
@@ -1711,18 +1713,21 @@ do_time_wait:
 		goto discard_it;
 	}
 
-	if (skb->len < (th->doff << 2) || tcp_checksum_complete(skb)) {
+	/* Assumes header and options unchanged since checksum_init() */
+	if (tcp_checksum_complete(skb)) {
 		TCP_INC_STATS_BH(net, TCP_MIB_INERRS);
 		inet_twsk_put(inet_twsk(sk));
 		goto discard_it;
 	}
+
 	switch (tcp_timewait_state_process(inet_twsk(sk), skb, th)) {
 	case TCP_TW_SYN: {
 		struct sock *sk2 = inet_lookup_listener(dev_net(skb->dev),
 							&tcp_hashinfo,
-							iph->daddr, th->dest,
+							ip_hdr(skb)->daddr,
+							th->dest,
 							inet_iif(skb));
-		if (sk2) {
+		if (sk2 != NULL) {
 			inet_twsk_deschedule(inet_twsk(sk), &tcp_death_row);
 			inet_twsk_put(inet_twsk(sk));
 			sk = sk2;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index febfd59..b76939a 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1594,7 +1594,8 @@ static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
 		return 0;
 	}
 
-	if (skb->len < tcp_hdrlen(skb) || tcp_checksum_complete(skb))
+	/* Assumes header and options unchanged since checksum_init() */
+	if (tcp_checksum_complete(skb))
 		goto csum_err;
 
 	if (sk->sk_state == TCP_LISTEN) {
@@ -1664,38 +1665,47 @@ ipv6_pktoptions:
 	return 0;
 }
 
+/*
+ *	Called by ip6_input.c: ip6_input_finish()
+ */
 static int tcp_v6_rcv(struct sk_buff *skb)
 {
 	struct tcphdr *th;
 	struct sock *sk;
+	int tcp_header_len;
 	int ret;
 	struct net *net = dev_net(skb->dev);
 
 	if (skb->pkt_type != PACKET_HOST)
 		goto discard_it;
 
-	/*
-	 *	Count it even if it's bad.
-	 */
+	/* Count it even if it's bad */
 	TCP_INC_STATS_BH(net, TCP_MIB_INSEGS);
 
+	/* Check too short header */
 	if (!pskb_may_pull(skb, sizeof(struct tcphdr)))
 		goto discard_it;
 
-	th = tcp_hdr(skb);
-
-	if (th->doff < sizeof(struct tcphdr)/4)
+	/* Check bad doff, compare doff directly to constant value */
+	tcp_header_len = tcp_hdr(skb)->doff;
+	if (tcp_header_len < (sizeof(struct tcphdr) / 4))
 		goto bad_packet;
-	if (!pskb_may_pull(skb, th->doff*4))
+
+	/* Check too short header and options */
+	tcp_header_len *= 4;
+	if (!pskb_may_pull(skb, tcp_header_len))
 		goto discard_it;
 
+	/* Packet length and doff are validated by header prediction,
+	 * provided case of th->doff == 0 is eliminated (above).
+	 */
 	if (!skb_csum_unnecessary(skb) && tcp_v6_checksum_init(skb))
 		goto bad_packet;
 
 	th = tcp_hdr(skb);
 	TCP_SKB_CB(skb)->seq = ntohl(th->seq);
 	TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
-				    skb->len - th->doff*4);
+				    skb->len - tcp_header_len);
 	TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
 	TCP_SKB_CB(skb)->when = 0;
 	TCP_SKB_CB(skb)->flags = ipv6_get_dsfield(ipv6_hdr(skb));
@@ -1711,6 +1721,7 @@ process:
 
 	if (!xfrm6_policy_check(sk, XFRM_POLICY_IN, skb))
 		goto discard_and_relse;
+	/* nf_reset(skb); in ip6_input.c ip6_input_finish() */
 
 	if (sk_filter(sk, skb))
 		goto discard_and_relse;
@@ -1743,7 +1754,8 @@ no_tcp_socket:
 	if (!xfrm6_policy_check(NULL, XFRM_POLICY_IN, skb))
 		goto discard_it;
 
-	if (skb->len < (th->doff<<2) || tcp_checksum_complete(skb)) {
+	/* Assumes header and options unchanged since checksum_init() */
+	if (tcp_checksum_complete(skb)) {
 bad_packet:
 		TCP_INC_STATS_BH(net, TCP_MIB_INERRS);
 	} else {
@@ -1751,11 +1763,7 @@ bad_packet:
 	}
 
 discard_it:
-
-	/*
-	 *	Discard frame
-	 */
-
+	/* Discard frame. */
 	kfree_skb(skb);
 	return 0;
 
@@ -1769,24 +1777,23 @@ do_time_wait:
 		goto discard_it;
 	}
 
-	if (skb->len < (th->doff<<2) || tcp_checksum_complete(skb)) {
+	/* Assumes header and options unchanged since checksum_init() */
+	if (tcp_checksum_complete(skb)) {
 		TCP_INC_STATS_BH(net, TCP_MIB_INERRS);
 		inet_twsk_put(inet_twsk(sk));
 		goto discard_it;
 	}
 
 	switch (tcp_timewait_state_process(inet_twsk(sk), skb, th)) {
-	case TCP_TW_SYN:
-	{
-		struct sock *sk2;
-
-		sk2 = inet6_lookup_listener(dev_net(skb->dev), &tcp_hashinfo,
-					    &ipv6_hdr(skb)->daddr,
-					    ntohs(th->dest), inet6_iif(skb));
+	case TCP_TW_SYN: {
+		struct sock *sk2 = inet6_lookup_listener(dev_net(skb->dev),
+							 &tcp_hashinfo,
+							 &ipv6_hdr(skb)->daddr,
+							 ntohs(th->dest),
+							 inet6_iif(skb));
 		if (sk2 != NULL) {
-			struct inet_timewait_sock *tw = inet_twsk(sk);
-			inet_twsk_deschedule(tw, &tcp_death_row);
-			inet_twsk_put(tw);
+			inet_twsk_deschedule(inet_twsk(sk), &tcp_death_row);
+			inet_twsk_put(inet_twsk(sk));
 			sk = sk2;
 			goto process;
 		}
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v4 4/7] tcp: input header length, prediction, and timestamp bugs
  2010-02-15 12:23 [PATCH 0/7] tcp: bugs and cleanup updated to 2.6.33-rc8 (again) William Allen Simpson
                   ` (2 preceding siblings ...)
  2010-02-15 12:28 ` [PATCH v5 3/7] tcp: harmonize tcp_vx_rcv header length assumptions William Allen Simpson
@ 2010-02-15 12:31 ` William Allen Simpson
  2010-02-15 15:10   ` Andi Kleen
  2010-02-15 21:24   ` [PATCH v5 " William Allen Simpson
  2010-02-15 12:34 ` [PATCH v3 5/7] TCPCT part 2e: accept SYNACK data William Allen Simpson
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 16+ messages in thread
From: William Allen Simpson @ 2010-02-15 12:31 UTC (permalink / raw)
  To: Linux Kernel Developers
  Cc: Linux Kernel Network Developers, Andrew Morton, David Miller,
	Andi Kleen

[-- Attachment #1: Type: text/plain, Size: 1505 bytes --]

Fix incorrect header prediction flags documentation.

Relieve register pressure in (the i386) fast path by accessing skb->len
directly, instead of carrying a rarely used len parameter.

Eliminate unused len parameters in two other functions.

Don't use output calculated tp->tcp_header_len for input decisions.
While the output header is usually the same as the input (same options
in both directions), that's a poor assumption. In particular, Sack will
be different. Newer options are not guaranteed.

Moreover, in the fast path, that only saved a shift or two. The other
efficiencies in this patch more than make up the difference.

Instead, use tp->rx_opt.tstamp_ok to accurately predict header length.

Likewise, use tp->rx_opt.tstamp_ok for received MSS calculations.

Don't use "sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED" to guess that
the timestamp is present. This may have been OK in the days with fewer
possible options, but various combinations of newer options may yield
the same header length. (This bug is in 3 places.)

Instead, use tp->rx_opt.saw_tstamp to determine a timestamp is present.

There's no need to test buffer length against header length, already
checked by tcp_v[4,6]_rcv(). Straighten code for minor efficiency gain.

Stand-alone patch, originally developed for TCPCT.

Requires:
   net: tcp_header_len_th and tcp_option_len_th
   tcp: harmonize tcp_vx_rcv header length assumptions

Signed-off-by: William.Allen.Simpson@gmail.com
CC: Andi Kleen <andi@firstfloor.org>

[-- Attachment #2: len_th+2d4+2.6.33-rc8.patch --]
[-- Type: text/plain, Size: 9932 bytes --]

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 74728f7..2987ee8 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -301,7 +301,11 @@ struct tcp_sock {
 
 /*
  *	Header prediction flags
- *	0x5?10 << 16 + snd_wnd in net byte order
+ *	S << 28 + TCP_FLAG_ACK + snd_wnd, in net byte order
+ *		(PSH flag is ignored)
+ *		S is 5 (no options), or 8 (timestamp aligned)
+ *	otherwise, 0 to turn it off -- for instance, when there are
+ *	holes in receive space.
  */
 	__be32	pred_flags;
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 34f5cc2..6b0d7e9 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -310,13 +310,11 @@ extern int			tcp_ioctl(struct sock *sk,
 
 extern int			tcp_rcv_state_process(struct sock *sk, 
 						      struct sk_buff *skb,
-						      struct tcphdr *th,
-						      unsigned len);
+						      struct tcphdr *th);
 
 extern int			tcp_rcv_established(struct sock *sk, 
 						    struct sk_buff *skb,
-						    struct tcphdr *th, 
-						    unsigned len);
+						    struct tcphdr *th);
 
 extern void			tcp_rcv_space_adjust(struct sock *sk);
 
@@ -533,9 +531,16 @@ static inline u32 __tcp_set_rto(const struct tcp_sock *tp)
 	return (tp->srtt >> 3) + tp->rttvar;
 }
 
+static inline u16 __tcp_fast_path_header_length(const struct tcp_sock *tp)
+{
+	return tp->rx_opt.tstamp_ok
+		? sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED
+		: sizeof(struct tcphdr);
+}
+
 static inline void __tcp_fast_path_on(struct tcp_sock *tp, u32 snd_wnd)
 {
-	tp->pred_flags = htonl((tp->tcp_header_len << 26) |
+	tp->pred_flags = htonl((__tcp_fast_path_header_length(tp) << (28 - 2)) |
 			       ntohl(TCP_FLAG_ACK) |
 			       snd_wnd);
 }
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 28e0296..8e0f6ae 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -152,7 +152,7 @@ static void tcp_measure_rcv_mss(struct sock *sk, const struct sk_buff *skb)
 			 * tcp header plus fixed timestamp option length.
 			 * Resulting "len" is MSS free of SACK jitter.
 			 */
-			len -= tcp_sk(sk)->tcp_header_len;
+			len -= __tcp_fast_path_header_length(tcp_sk(sk));
 			icsk->icsk_ack.last_seg_size = len;
 			if (len == lss) {
 				icsk->icsk_ack.rcv_mss = len;
@@ -5206,7 +5206,7 @@ discard:
  *	tcp_data_queue when everything is OK.
  */
 int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
-			struct tcphdr *th, unsigned len)
+			struct tcphdr *th)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	int res;
@@ -5225,31 +5225,15 @@ int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
 	 *	extra cost of the net_bh soft interrupt processing...
 	 *	We do checksum and copy also but from device to kernel.
 	 */
-
-	tp->rx_opt.saw_tstamp = 0;
-
-	/*	pred_flags is 0xS?10 << 16 + snd_wnd
-	 *	if header_prediction is to be made
-	 *	'S' will always be tp->tcp_header_len >> 2
-	 *	'?' will be 0 for the fast path, otherwise pred_flags is 0 to
-	 *  turn it off	(when there are holes in the receive
-	 *	 space for instance)
-	 *	PSH flag is ignored.
-	 */
-
 	if ((tcp_flag_word(th) & TCP_HP_BITS) == tp->pred_flags &&
 	    TCP_SKB_CB(skb)->seq == tp->rcv_nxt &&
 	    !after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt)) {
-		int tcp_header_len = tp->tcp_header_len;
-
-		/* Timestamp header prediction: tcp_header_len
-		 * is automatically equal to th->doff*4 due to pred_flags
-		 * match.
-		 */
+		int tcp_header_len = tcp_header_len_th(th);
 
-		/* Check timestamp */
-		if (tcp_header_len == sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED) {
-			/* No? Slow path! */
+		/* Timestamp header prediction */
+		if (tcp_header_len != sizeof(*th) + TCPOLEN_TSTAMP_ALIGNED) {
+			tp->rx_opt.saw_tstamp = 0; /* false */
+		} else {
 			if (!tcp_parse_aligned_timestamp(tp, th))
 				goto slow_path;
 
@@ -5264,35 +5248,12 @@ int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
 			 */
 		}
 
-		if (len <= tcp_header_len) {
-			/* Bulk data transfer: sender */
-			if (len == tcp_header_len) {
-				/* Predicted packet is in window by definition.
-				 * seq == rcv_nxt and rcv_wup <= rcv_nxt.
-				 * Hence, check seq<=rcv_wup reduces to:
-				 */
-				if (tcp_header_len ==
-				    (sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED) &&
-				    tp->rcv_nxt == tp->rcv_wup)
-					tcp_store_ts_recent(tp);
-
-				/* We know that such packets are checksummed
-				 * on entry.
-				 */
-				tcp_ack(sk, skb, 0);
-				__kfree_skb(skb);
-				tcp_data_snd_check(sk);
-				return 0;
-			} else { /* Header too small */
-				TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_INERRS);
-				goto discard;
-			}
-		} else {
+		if (tcp_header_len < skb->len) {
 			int eaten = 0;
 			int copied_early = 0;
 
 			if (tp->copied_seq == tp->rcv_nxt &&
-			    len - tcp_header_len <= tp->ucopy.len) {
+			    skb->len - tcp_header_len <= tp->ucopy.len) {
 #ifdef CONFIG_NET_DMA
 				if (tcp_dma_try_early_copy(sk, skb, tcp_header_len)) {
 					copied_early = 1;
@@ -5311,9 +5272,7 @@ int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
 					 * seq == rcv_nxt and rcv_wup <= rcv_nxt.
 					 * Hence, check seq<=rcv_wup reduces to:
 					 */
-					if (tcp_header_len ==
-					    (sizeof(struct tcphdr) +
-					     TCPOLEN_TSTAMP_ALIGNED) &&
+					if (tp->rx_opt.saw_tstamp &&
 					    tp->rcv_nxt == tp->rcv_wup)
 						tcp_store_ts_recent(tp);
 
@@ -5334,8 +5293,7 @@ int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
 				 * seq == rcv_nxt and rcv_wup <= rcv_nxt.
 				 * Hence, check seq<=rcv_wup reduces to:
 				 */
-				if (tcp_header_len ==
-				    (sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED) &&
+				if (tp->rx_opt.saw_tstamp &&
 				    tp->rcv_nxt == tp->rcv_wup)
 					tcp_store_ts_recent(tp);
 
@@ -5376,11 +5334,33 @@ no_ack:
 			else
 				sk->sk_data_ready(sk, 0);
 			return 0;
+		} else {
+			/* Bulk data transfer: sender
+			 *
+			 * tcp_header_len > skb->len never happens,
+			 * already checked by tcp_v[4,6]_rcv()
+			 *
+			 * Predicted packet is in window by definition.
+			 * seq == rcv_nxt and rcv_wup <= rcv_nxt.
+			 * Hence, check seq<=rcv_wup reduces to:
+			 */
+			if (tp->rx_opt.saw_tstamp &&
+			    tp->rcv_nxt == tp->rcv_wup)
+				tcp_store_ts_recent(tp);
+
+			/* We know that such packets are checksummed
+			 * on entry.
+			 */
+			tcp_ack(sk, skb, 0);
+			__kfree_skb(skb);
+			tcp_data_snd_check(sk);
+			return 0;
 		}
 	}
 
 slow_path:
-	if (len < (th->doff << 2) || tcp_checksum_complete_user(sk, skb))
+	/* Assumes header and options unchanged since checksum_init() */
+	if (tcp_checksum_complete_user(sk, skb))
 		goto csum_error;
 
 	/*
@@ -5416,7 +5396,7 @@ discard:
 }
 
 static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
-					 struct tcphdr *th, unsigned len)
+					 struct tcphdr *th)
 {
 	u8 *hash_location;
 	struct inet_connection_sock *icsk = inet_csk(sk);
@@ -5693,7 +5673,7 @@ reset_and_undo:
  */
 
 int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
-			  struct tcphdr *th, unsigned len)
+			  struct tcphdr *th)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct inet_connection_sock *icsk = inet_csk(sk);
@@ -5740,7 +5720,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
 		goto discard;
 
 	case TCP_SYN_SENT:
-		queued = tcp_rcv_synsent_state_process(sk, skb, th, len);
+		queued = tcp_rcv_synsent_state_process(sk, skb, th);
 		if (queued >= 0)
 			return queued;
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 0a76e41..f999e06 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1551,7 +1551,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
 
 	if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */
 		TCP_CHECK_TIMER(sk);
-		if (tcp_rcv_established(sk, skb, tcp_hdr(skb), skb->len)) {
+		if (tcp_rcv_established(sk, skb, tcp_hdr(skb))) {
 			rsk = sk;
 			goto reset;
 		}
@@ -1578,7 +1578,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
 	}
 
 	TCP_CHECK_TIMER(sk);
-	if (tcp_rcv_state_process(sk, skb, tcp_hdr(skb), skb->len)) {
+	if (tcp_rcv_state_process(sk, skb, tcp_hdr(skb))) {
 		rsk = sk;
 		goto reset;
 	}
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index f206ee5..37b7536 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -718,8 +718,7 @@ int tcp_child_process(struct sock *parent, struct sock *child,
 	int state = child->sk_state;
 
 	if (!sock_owned_by_user(child)) {
-		ret = tcp_rcv_state_process(child, skb, tcp_hdr(skb),
-					    skb->len);
+		ret = tcp_rcv_state_process(child, skb, tcp_hdr(skb));
 		/* Wakeup parent, send SIGIO */
 		if (state == TCP_SYN_RECV && child->sk_state != state)
 			parent->sk_data_ready(parent, 0);
diff --git a/net/ipv4/tcp_probe.c b/net/ipv4/tcp_probe.c
index 9bc805d..de2a32e 100644
--- a/net/ipv4/tcp_probe.c
+++ b/net/ipv4/tcp_probe.c
@@ -88,7 +88,7 @@ static inline int tcp_probe_avail(void)
  * Note: arguments must match tcp_rcv_established()!
  */
 static int jtcp_rcv_established(struct sock *sk, struct sk_buff *skb,
-			       struct tcphdr *th, unsigned len)
+			       struct tcphdr *th)
 {
 	const struct tcp_sock *tp = tcp_sk(sk);
 	const struct inet_sock *inet = inet_sk(sk);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index b76939a..3d08a4d 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1586,7 +1586,7 @@ static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
 
 	if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */
 		TCP_CHECK_TIMER(sk);
-		if (tcp_rcv_established(sk, skb, tcp_hdr(skb), skb->len))
+		if (tcp_rcv_established(sk, skb, tcp_hdr(skb)))
 			goto reset;
 		TCP_CHECK_TIMER(sk);
 		if (opt_skb)
@@ -1618,7 +1618,7 @@ static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
 	}
 
 	TCP_CHECK_TIMER(sk);
-	if (tcp_rcv_state_process(sk, skb, tcp_hdr(skb), skb->len))
+	if (tcp_rcv_state_process(sk, skb, tcp_hdr(skb)))
 		goto reset;
 	TCP_CHECK_TIMER(sk);
 	if (opt_skb)
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v3 5/7] TCPCT part 2e: accept SYNACK data
  2010-02-15 12:23 [PATCH 0/7] tcp: bugs and cleanup updated to 2.6.33-rc8 (again) William Allen Simpson
                   ` (3 preceding siblings ...)
  2010-02-15 12:31 ` [PATCH v4 4/7] tcp: input header length, prediction, and timestamp bugs William Allen Simpson
@ 2010-02-15 12:34 ` William Allen Simpson
  2010-02-15 12:41 ` [PATCH v5 6/7] TCPCT part 2f: cleanup tcp_parse_options William Allen Simpson
  2010-02-15 12:45 ` [PATCH v7 7/7] TCPCT part 2g: parse cookie pair and 64-bit timestamp William Allen Simpson
  6 siblings, 0 replies; 16+ messages in thread
From: William Allen Simpson @ 2010-02-15 12:34 UTC (permalink / raw)
  To: Linux Kernel Developers
  Cc: Linux Kernel Network Developers, Andrew Morton, David Miller

[-- Attachment #1: Type: text/plain, Size: 710 bytes --]

When accompanied by cookie option, Initiator (client) queues incoming
SYNACK transaction data.

This is a straightforward re-implementation of an earlier (year-old)
patch that no longer applies cleanly, with permission of the original
author (Adam Langley).  The patch was previously reviewed:

    http://thread.gmane.org/gmane.linux.network/102586

This function will also be used in subsequent patches that implement
additional features.

Requires:
   TCPCT part 1g: Responder Cookie => Initiator
   net: tcp_header_len_th and tcp_option_len_th

Signed-off-by: William.Allen.Simpson@gmail.com
---
  net/ipv4/tcp_input.c |   26 +++++++++++++++++++++++++-
  1 files changed, 25 insertions(+), 1 deletions(-)


[-- Attachment #2: TCPCT+2e3+2.6.33-rc8.patch --]
[-- Type: text/plain, Size: 2057 bytes --]

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 8e0f6ae..165040e 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5395,6 +5395,12 @@ discard:
 	return 0;
 }
 
+/*
+ * Returns:
+ *	+1 on reset,
+ *	0 success and/or SYNACK data,
+ *	-1 on discard.
+ */
 static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 					 struct tcphdr *th)
 {
@@ -5403,6 +5409,7 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct tcp_cookie_values *cvp = tp->cookie_values;
 	int saved_clamp = tp->rx_opt.mss_clamp;
+	int queued = 0;
 
 	tcp_parse_options(skb, &tp->rx_opt, &hash_location, 0);
 
@@ -5509,6 +5516,7 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 					- TCPOLEN_COOKIE_BASE;
 			int cookie_pair_size = cookie_size
 					     + cvp->cookie_desired;
+			int tcp_header_len = tcp_header_len_th(th);
 
 			/* A cookie extension option was sent and returned.
 			 * Note that each incoming SYNACK replaces the
@@ -5524,6 +5532,19 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 				       hash_location, cookie_size);
 				cvp->cookie_pair_size = cookie_pair_size;
 			}
+
+			queued = skb->len - tcp_header_len;
+			if (queued > 0) {
+				/* Queue incoming transaction data. */
+				__skb_pull(skb, tcp_header_len);
+				__skb_queue_tail(&sk->sk_receive_queue, skb);
+				skb_set_owner_r(skb, sk);
+				sk->sk_data_ready(sk, 0);
+				cvp->s_data_in = 1; /* true */
+				tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
+				tp->rcv_wup = TCP_SKB_CB(skb)->end_seq;
+				tp->copied_seq = TCP_SKB_CB(skb)->seq + 1;
+			}
 		}
 
 		smp_mb();
@@ -5577,11 +5598,14 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 						  TCP_DELACK_MAX, TCP_RTO_MAX);
 
 discard:
-			__kfree_skb(skb);
+			if (queued <= 0)
+				__kfree_skb(skb);
 			return 0;
 		} else {
 			tcp_send_ack(sk);
 		}
+		if (queued > 0)
+			return 0;
 		return -1;
 	}
 
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v5 6/7] TCPCT part 2f: cleanup tcp_parse_options
  2010-02-15 12:23 [PATCH 0/7] tcp: bugs and cleanup updated to 2.6.33-rc8 (again) William Allen Simpson
                   ` (4 preceding siblings ...)
  2010-02-15 12:34 ` [PATCH v3 5/7] TCPCT part 2e: accept SYNACK data William Allen Simpson
@ 2010-02-15 12:41 ` William Allen Simpson
  2010-02-15 12:45 ` [PATCH v7 7/7] TCPCT part 2g: parse cookie pair and 64-bit timestamp William Allen Simpson
  6 siblings, 0 replies; 16+ messages in thread
From: William Allen Simpson @ 2010-02-15 12:41 UTC (permalink / raw)
  To: Linux Kernel Developers
  Cc: Linux Kernel Network Developers, Andrew Morton, David Miller

[-- Attachment #1: Type: text/plain, Size: 838 bytes --]

Split switch, shift cases to the left, fix most lines beyond column 80.

Prepare (future) error return.

Harmonize initialization in syncookies.
Fix initialization in tcp_minisocks.

Repair net/ipv4/tcp_ipv4.c errors with goto targets, overlooked by
David Miller in commit bb5b7c11263dbbe78253cd05945a6bf8f55add8e

Requires:
   TCPCT part 1g: Responder Cookie => Initiator
   net: tcp_header_len_th and tcp_option_len_th

Signed-off-by: William.Allen.Simpson@gmail.com
---
  include/net/tcp.h        |    3 +-
  net/ipv4/syncookies.c    |    9 ++-
  net/ipv4/tcp_input.c     |  223 ++++++++++++++++++++++++++--------------------
  net/ipv4/tcp_ipv4.c      |   10 ++-
  net/ipv4/tcp_minisocks.c |   26 ++++--
  net/ipv6/syncookies.c    |    9 ++-
  net/ipv6/tcp_ipv6.c      |    6 +-
  7 files changed, 172 insertions(+), 114 deletions(-)

[-- Attachment #2: TCPCT+2f5+2.6.33-rc8.patch --]
[-- Type: text/plain, Size: 15000 bytes --]

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6b0d7e9..420e872 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -403,7 +403,8 @@ extern int			tcp_recvmsg(struct kiocb *iocb, struct sock *sk,
 					    size_t len, int nonblock, 
 					    int flags, int *addr_len);
 
-extern void			tcp_parse_options(struct sk_buff *skb,
+extern int			tcp_parse_options(struct sk_buff *skb,
+						  const struct tcphdr *th,
 						  struct tcp_options_received *opt_rx,
 						  u8 **hvpp,
 						  int estab);
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 66fd80e..d1f45ad 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -255,15 +255,16 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
 {
 	struct tcp_options_received tcp_opt;
 	u8 *hash_location;
+	struct rtable *rt;
+	struct request_sock *req;
 	struct inet_request_sock *ireq;
 	struct tcp_request_sock *treq;
 	struct tcp_sock *tp = tcp_sk(sk);
 	const struct tcphdr *th = tcp_hdr(skb);
 	__u32 cookie = ntohl(th->ack_seq) - 1;
 	struct sock *ret = sk;
-	struct request_sock *req;
 	int mss;
-	struct rtable *rt;
+	int parsed;
 	__u8 rcv_wscale;
 
 	if (!sysctl_tcp_syncookies || !th->ack)
@@ -279,7 +280,9 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
 
 	/* check for timestamp cookie support */
 	memset(&tcp_opt, 0, sizeof(tcp_opt));
-	tcp_parse_options(skb, &tcp_opt, &hash_location, 0);
+	parsed = tcp_parse_options(skb, th, &tcp_opt, &hash_location, 0);
+	if (parsed < 0)
+		goto out;
 
 	if (tcp_opt.saw_tstamp)
 		cookie_check_timestamp(&tcp_opt);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 165040e..2a9ef6b 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3726,122 +3726,149 @@ old_ack:
  * But, this can also be called on packets in the established flow when
  * the fast version below fails.
  */
-void tcp_parse_options(struct sk_buff *skb, struct tcp_options_received *opt_rx,
-		       u8 **hvpp, int estab)
+int tcp_parse_options(struct sk_buff *skb, const struct tcphdr *th,
+		      struct tcp_options_received *opt_rx, u8 **hvpp, int estab)
 {
-	unsigned char *ptr;
-	struct tcphdr *th = tcp_hdr(skb);
-	int length = (th->doff * 4) - sizeof(struct tcphdr);
+	unsigned char *ptr = (unsigned char *)(th + 1);
+	int length = tcp_option_len_th(th);
+	bool syn = th->syn;
 
-	ptr = (unsigned char *)(th + 1);
-	opt_rx->saw_tstamp = 0;
+	opt_rx->cookie_plus = 0;
+	opt_rx->saw_tstamp = 0; /* false */
 
 	while (length > 0) {
-		int opcode = *ptr++;
 		int opsize;
+		int opcode = *ptr++;
 
 		switch (opcode) {
 		case TCPOPT_EOL:
-			return;
+			length = 0;
+			continue;
 		case TCPOPT_NOP:	/* Ref: RFC 793 section 3.1 */
 			length--;
 			continue;
 		default:
-			opsize = *ptr++;
-			if (opsize < 2) /* "silly options" */
-				return;
-			if (opsize > length)
-				return;	/* don't parse partial options */
-			switch (opcode) {
-			case TCPOPT_MSS:
-				if (opsize == TCPOLEN_MSS && th->syn && !estab) {
-					u16 in_mss = get_unaligned_be16(ptr);
-					if (in_mss) {
-						if (opt_rx->user_mss &&
-						    opt_rx->user_mss < in_mss)
-							in_mss = opt_rx->user_mss;
-						opt_rx->mss_clamp = in_mss;
-					}
-				}
-				break;
-			case TCPOPT_WINDOW:
-				if (opsize == TCPOLEN_WINDOW && th->syn &&
-				    !estab && sysctl_tcp_window_scaling) {
-					__u8 snd_wscale = *(__u8 *)ptr;
-					opt_rx->wscale_ok = 1;
-					if (snd_wscale > 14) {
-						if (net_ratelimit())
-							printk(KERN_INFO "tcp_parse_options: Illegal window "
-							       "scaling value %d >14 received.\n",
-							       snd_wscale);
-						snd_wscale = 14;
-					}
-					opt_rx->snd_wscale = snd_wscale;
-				}
-				break;
-			case TCPOPT_TIMESTAMP:
-				if ((opsize == TCPOLEN_TIMESTAMP) &&
-				    ((estab && opt_rx->tstamp_ok) ||
-				     (!estab && sysctl_tcp_timestamps))) {
-					opt_rx->saw_tstamp = 1;
-					opt_rx->rcv_tsval = get_unaligned_be32(ptr);
-					opt_rx->rcv_tsecr = get_unaligned_be32(ptr + 4);
-				}
-				break;
-			case TCPOPT_SACK_PERM:
-				if (opsize == TCPOLEN_SACK_PERM && th->syn &&
-				    !estab && sysctl_tcp_sack) {
-					opt_rx->sack_ok = 1;
-					tcp_sack_reset(opt_rx);
+			/* fallthru */
+			break;
+		};
+
+		opsize = *ptr++;
+		if (opsize < 2 || opsize > length) {
+			/* don't parse partial options */
+			break;
+		}
+
+		switch (opcode) {
+		case TCPOPT_MSS:
+			if (opsize == TCPOLEN_MSS && syn && !estab) {
+				u16 in_mss = get_unaligned_be16(ptr);
+				if (in_mss) {
+					if (opt_rx->user_mss &&
+					    opt_rx->user_mss < in_mss)
+						in_mss = opt_rx->user_mss;
+					opt_rx->mss_clamp = in_mss;
 				}
-				break;
+			}
+			break;
 
-			case TCPOPT_SACK:
-				if ((opsize >= (TCPOLEN_SACK_BASE + TCPOLEN_SACK_PERBLOCK)) &&
-				   !((opsize - TCPOLEN_SACK_BASE) % TCPOLEN_SACK_PERBLOCK) &&
-				   opt_rx->sack_ok) {
-					TCP_SKB_CB(skb)->sacked = (ptr - 2) - (unsigned char *)th;
+		case TCPOPT_WINDOW:
+			if (opsize == TCPOLEN_WINDOW && syn &&
+			    !estab && sysctl_tcp_window_scaling) {
+				__u8 snd_wscale = *(__u8 *)ptr;
+				opt_rx->wscale_ok = 1;
+				if (snd_wscale > 14) {
+					if (net_ratelimit())
+						printk(KERN_INFO
+						       "tcp_parse_options: "
+						       "window scaling value "
+						       "%d > 14 received.\n",
+						       snd_wscale);
+					snd_wscale = 14;
 				}
-				break;
+				opt_rx->snd_wscale = snd_wscale;
+			}
+			break;
+
+		case TCPOPT_SACK_PERM:
+			if (opsize == TCPOLEN_SACK_PERM && syn &&
+			    !estab && sysctl_tcp_sack) {
+				opt_rx->sack_ok = 1;
+				tcp_sack_reset(opt_rx);
+			}
+			break;
+
+		case TCPOPT_SACK:
+			if ((opsize >= (TCPOLEN_SACK_BASE + TCPOLEN_SACK_PERBLOCK)) &&
+			    !((opsize - TCPOLEN_SACK_BASE) % TCPOLEN_SACK_PERBLOCK) &&
+			    opt_rx->sack_ok) {
+				TCP_SKB_CB(skb)->sacked = (ptr - 2)
+							- (unsigned char *)th;
+			}
+			break;
+
+		case TCPOPT_TIMESTAMP:
+			if ((opsize == TCPOLEN_TIMESTAMP) &&
+			    ((estab && opt_rx->tstamp_ok) ||
+			     (!estab && sysctl_tcp_timestamps))) {
+				opt_rx->saw_tstamp = 1;
+				opt_rx->rcv_tsval = get_unaligned_be32(ptr);
+				opt_rx->rcv_tsecr = get_unaligned_be32(ptr + 4);
+			}
+			break;
 #ifdef CONFIG_TCP_MD5SIG
-			case TCPOPT_MD5SIG:
-				/*
-				 * The MD5 Hash has already been
-				 * checked (see tcp_v{4,6}_do_rcv()).
-				 */
-				break;
+		case TCPOPT_MD5SIG:
+			/*
+			 * The MD5 Hash has already been
+			 * checked (see tcp_v{4,6}_do_rcv()).
+			 */
+			break;
 #endif
-			case TCPOPT_COOKIE:
-				/* This option is variable length.
-				 */
-				switch (opsize) {
-				case TCPOLEN_COOKIE_BASE:
-					/* not yet implemented */
-					break;
-				case TCPOLEN_COOKIE_PAIR:
-					/* not yet implemented */
-					break;
-				case TCPOLEN_COOKIE_MIN+0:
-				case TCPOLEN_COOKIE_MIN+2:
-				case TCPOLEN_COOKIE_MIN+4:
-				case TCPOLEN_COOKIE_MIN+6:
-				case TCPOLEN_COOKIE_MAX:
-					/* 16-bit multiple */
+		case TCPOPT_COOKIE:
+			if (opt_rx->cookie_plus != 0) {
+				/* discard duplicate */
+				break;
+			}
+			switch (opsize) {
+			case TCPOLEN_COOKIE_BASE:
+				/* not yet implemented */
+				break;
+			case TCPOLEN_COOKIE_PAIR: {
+				/* not yet implemented */
+				break;
+			}
+			case TCPOLEN_COOKIE_MIN+0:
+			case TCPOLEN_COOKIE_MIN+2:
+			case TCPOLEN_COOKIE_MIN+4:
+			case TCPOLEN_COOKIE_MIN+6:
+			case TCPOLEN_COOKIE_MAX:
+				/* 16-bit multiple */
+				if (syn) {
 					opt_rx->cookie_plus = opsize;
 					*hvpp = ptr;
-				default:
-					/* ignore option */
-					break;
-				};
+				}
+				break;
+			default:
+				/* ignore option */
 				break;
 			};
+			break;
 
-			ptr += opsize-2;
-			length -= opsize;
-		}
+		default:
+			/* skip unrecognized options */
+			break;
+		};
+
+		ptr += opsize - 2;
+		length -= opsize;
 	}
+	return 0;
 }
 
+/*
+ * Returns:
+ *	1 on success
+ *	0 on failure
+ */
 static int tcp_parse_aligned_timestamp(struct tcp_sock *tp, struct tcphdr *th)
 {
 	__be32 *ptr = (__be32 *)(th + 1);
@@ -3875,8 +3902,7 @@ static int tcp_fast_parse_options(struct sk_buff *skb, struct tcphdr *th,
 		if (tcp_parse_aligned_timestamp(tp, th))
 			return 1;
 	}
-	tcp_parse_options(skb, &tp->rx_opt, hvpp, 1);
-	return 1;
+	return tcp_parse_options(skb, th, &tp->rx_opt, hvpp, 1);
 }
 
 #ifdef CONFIG_TCP_MD5SIG
@@ -5127,10 +5153,13 @@ static int tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
 {
 	u8 *hash_location;
 	struct tcp_sock *tp = tcp_sk(sk);
+	int parsed = tcp_fast_parse_options(skb, th, tp, &hash_location);
+
+	if (parsed < 0)
+		goto discard;
 
 	/* RFC1323: H1. Apply PAWS check first. */
-	if (tcp_fast_parse_options(skb, th, tp, &hash_location) &&
-	    tp->rx_opt.saw_tstamp &&
+	if (tp->rx_opt.saw_tstamp &&
 	    tcp_paws_discard(sk, skb)) {
 		if (!th->rst) {
 			NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSESTABREJECTED);
@@ -5410,8 +5439,10 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 	struct tcp_cookie_values *cvp = tp->cookie_values;
 	int saved_clamp = tp->rx_opt.mss_clamp;
 	int queued = 0;
+	int parsed = tcp_parse_options(skb, th, &tp->rx_opt, &hash_location, 0);
 
-	tcp_parse_options(skb, &tp->rx_opt, &hash_location, 0);
+	if (parsed < 0)
+		goto discard;
 
 	if (th->ack) {
 		/* rfc793:
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f999e06..3f0813f 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1215,6 +1215,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_extend_values tmp_ext;
 	struct tcp_options_received tmp_opt;
+	int parsed;
 	u8 *hash_location;
 	struct request_sock *req;
 	struct inet_request_sock *ireq;
@@ -1265,7 +1266,10 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 	tcp_clear_options(&tmp_opt);
 	tmp_opt.mss_clamp = TCP_MSS_DEFAULT;
 	tmp_opt.user_mss  = tp->rx_opt.user_mss;
-	tcp_parse_options(skb, &tmp_opt, &hash_location, 0);
+	parsed = tcp_parse_options(skb, tcp_hdr(skb), &tmp_opt, &hash_location,
+				   0);
+	if (parsed < 0)
+		goto drop_and_free;
 
 	if (tmp_opt.cookie_plus > 0 &&
 	    tmp_opt.saw_tstamp &&
@@ -1278,7 +1282,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 		int l = tmp_opt.cookie_plus - TCPOLEN_COOKIE_BASE;
 
 		if (tcp_cookie_generator(&tmp_ext.cookie_bakery[0]) != 0)
-			goto drop_and_release;
+			goto drop_and_free;
 
 		/* Secret recipe starts with IP addresses */
 		*mess++ ^= daddr;
@@ -1299,7 +1303,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 		tmp_ext.cookie_out_never = 1; /* true */
 		tmp_ext.cookie_plus = 0;
 	} else {
-		goto drop_and_release;
+		goto drop_and_free;
 	}
 	tmp_ext.cookie_in_always = tp->rx_opt.cookie_in_always;
 
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 37b7536..0f1b409 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -95,15 +95,21 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
 	struct tcp_timewait_sock *tcptw = tcp_twsk((struct sock *)tw);
 	int paws_reject = 0;
 
-	tmp_opt.saw_tstamp = 0;
+	/* Fast check for options, compare doff directly to constant value. */
 	if (th->doff > (sizeof(*th) >> 2) && tcptw->tw_ts_recent_stamp) {
-		tcp_parse_options(skb, &tmp_opt, &hash_location, 0);
+		int parsed = tcp_parse_options(skb, th, &tmp_opt,
+					       &hash_location, 0);
 
-		if (tmp_opt.saw_tstamp) {
+		if (parsed < 0) {
+			paws_reject = 1; /* true */
+		} else if (tmp_opt.saw_tstamp) {
 			tmp_opt.ts_recent	= tcptw->tw_ts_recent;
 			tmp_opt.ts_recent_stamp	= tcptw->tw_ts_recent_stamp;
 			paws_reject = tcp_paws_reject(&tmp_opt, th->rst);
 		}
+	} else {
+		/* otherwise initialized by tcp_parse_options() */
+		tmp_opt.saw_tstamp = 0; /* false */
 	}
 
 	if (tw->tw_substate == TCP_FIN_WAIT2) {
@@ -526,11 +532,14 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 	__be32 flg = tcp_flag_word(th) & (TCP_FLAG_RST|TCP_FLAG_SYN|TCP_FLAG_ACK);
 	int paws_reject = 0;
 
-	tmp_opt.saw_tstamp = 0;
-	if (th->doff > (sizeof(struct tcphdr)>>2)) {
-		tcp_parse_options(skb, &tmp_opt, &hash_location, 0);
+	/* Fast check for options, compare doff directly to constant value. */
+	if (th->doff > (sizeof(*th) >> 2)) {
+		int parsed = tcp_parse_options(skb, th, &tmp_opt,
+					       &hash_location, 0);
 
-		if (tmp_opt.saw_tstamp) {
+		if (parsed < 0) {
+			paws_reject = 1; /* true */
+		} else if (tmp_opt.saw_tstamp) {
 			tmp_opt.ts_recent = req->ts_recent;
 			/* We do not store true stamp, but it is not required,
 			 * it can be estimated (approximately)
@@ -539,6 +548,9 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 			tmp_opt.ts_recent_stamp = get_seconds() - ((TCP_TIMEOUT_INIT/HZ)<<req->retrans);
 			paws_reject = tcp_paws_reject(&tmp_opt, th->rst);
 		}
+	} else {
+		/* otherwise initialized by tcp_parse_options() */
+		tmp_opt.saw_tstamp = 0; /* false */
 	}
 
 	/* Check for pure retransmitted SYN. */
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index 7208a06..a0905b4 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -161,6 +161,8 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_options_received tcp_opt;
 	u8 *hash_location;
+	struct dst_entry *dst;
+	struct request_sock *req;
 	struct inet_request_sock *ireq;
 	struct inet6_request_sock *ireq6;
 	struct tcp_request_sock *treq;
@@ -169,9 +171,8 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 	const struct tcphdr *th = tcp_hdr(skb);
 	__u32 cookie = ntohl(th->ack_seq) - 1;
 	struct sock *ret = sk;
-	struct request_sock *req;
 	int mss;
-	struct dst_entry *dst;
+	int parsed;
 	__u8 rcv_wscale;
 
 	if (!sysctl_tcp_syncookies || !th->ack)
@@ -187,7 +188,9 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 
 	/* check for timestamp cookie support */
 	memset(&tcp_opt, 0, sizeof(tcp_opt));
-	tcp_parse_options(skb, &tcp_opt, &hash_location, 0);
+	parsed = tcp_parse_options(skb, th, &tcp_opt, &hash_location, 0);
+	if (parsed < 0)
+		goto out;
 
 	if (tcp_opt.saw_tstamp)
 		cookie_check_timestamp(&tcp_opt);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 3d08a4d..e15e4f6 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1164,6 +1164,7 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_extend_values tmp_ext;
 	struct tcp_options_received tmp_opt;
+	int parsed;
 	u8 *hash_location;
 	struct request_sock *req;
 	struct inet6_request_sock *treq;
@@ -1207,7 +1208,10 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 	tcp_clear_options(&tmp_opt);
 	tmp_opt.mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) - sizeof(struct ipv6hdr);
 	tmp_opt.user_mss = tp->rx_opt.user_mss;
-	tcp_parse_options(skb, &tmp_opt, &hash_location, 0);
+	parsed = tcp_parse_options(skb, tcp_hdr(skb), &tmp_opt, &hash_location,
+				   0);
+	if (parsed < 0)
+		goto drop_and_free;
 
 	if (tmp_opt.cookie_plus > 0 &&
 	    tmp_opt.saw_tstamp &&
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v7 7/7] TCPCT part 2g: parse cookie pair and 64-bit timestamp
  2010-02-15 12:23 [PATCH 0/7] tcp: bugs and cleanup updated to 2.6.33-rc8 (again) William Allen Simpson
                   ` (5 preceding siblings ...)
  2010-02-15 12:41 ` [PATCH v5 6/7] TCPCT part 2f: cleanup tcp_parse_options William Allen Simpson
@ 2010-02-15 12:45 ` William Allen Simpson
  6 siblings, 0 replies; 16+ messages in thread
From: William Allen Simpson @ 2010-02-15 12:45 UTC (permalink / raw)
  To: Linux Kernel Developers
  Cc: Linux Kernel Network Developers, Andrew Morton, David Miller

[-- Attachment #1: Type: text/plain, Size: 867 bytes --]

Every bit is sacred.  Use as few bits as possible in tcp_options_received.
Group related timestamp flag bits for cache line memory efficiency.

Fix #define spacing for TCP options.

Define and parse 64-bit timestamp extended option (and minor cleanup).
However, only 32-bits are used at this time (permitted by specification).

Parse cookie pair extended option (previously defined).

Handle header extension.

Fix initialization in tcp_minisocks.

Requires:
   net: tcp_header_len_th and tcp_option_len_th
   TCPCT part 2f: cleanup tcp_parse_options

Signed-off-by: William.Allen.Simpson@gmail.com
---
  include/linux/tcp.h      |   12 ++++-
  include/net/tcp.h        |   45 +++++++++--------
  net/ipv4/tcp_input.c     |  123 ++++++++++++++++++++++++++++++++++++++++++----
  net/ipv4/tcp_minisocks.c |    6 ++
  4 files changed, 152 insertions(+), 34 deletions(-)

[-- Attachment #2: TCPCT+2g7+2.6.33-rc8.patch --]
[-- Type: text/plain, Size: 10658 bytes --]

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 2987ee8..05fa9b2 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -247,7 +247,7 @@ struct tcp_options_received {
 	u32	ts_recent;	/* Time stamp to echo next		*/
 	u32	rcv_tsval;	/* Time stamp value             	*/
 	u32	rcv_tsecr;	/* Time stamp echo reply        	*/
-	u16 	saw_tstamp : 1,	/* Saw TIMESTAMP on last packet		*/
+	u16 	tstamp64_ok:1,	/* Verified with cookie pair		*/
 		tstamp_ok : 1,	/* TIMESTAMP seen on SYN packet		*/
 		dsack : 1,	/* D-SACK is scheduled			*/
 		wscale_ok : 1,	/* Wscale seen on SYN packet		*/
@@ -260,13 +260,21 @@ struct tcp_options_received {
 	u8	num_sacks;	/* Number of SACK blocks		*/
 	u16	user_mss;	/* mss requested by user in ioctl	*/
 	u16	mss_clamp;	/* Maximal mss, negotiated at connection setup */
+
+	/* When the options are extended beyond the standard 40 bytes,
+	 * this holds the additional data offset (up to 1,020 bytes).
+	 */
+	u8	extended;	/* in 32-bit words			*/
+	u8	saw_tstamp64:1,	/* 64-bit TIMESTAMP seen on last packet	*/
+		saw_tstamp:1,	/* TIMESTAMP seen on last packet	*/
+		__unused:6;
 };
 
 static inline void tcp_clear_options(struct tcp_options_received *rx_opt)
 {
+	rx_opt->tstamp64_ok = 0;
 	rx_opt->tstamp_ok = rx_opt->sack_ok = 0;
 	rx_opt->wscale_ok = rx_opt->snd_wscale = 0;
-	rx_opt->cookie_plus = 0;
 }
 
 /* This is the max number of SACKS that we'll generate and process. It's safe
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 420e872..4b560d2 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -156,9 +156,8 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
 /*
  *	TCP option
  */
- 
-#define TCPOPT_NOP		1	/* Padding */
 #define TCPOPT_EOL		0	/* End of options */
+#define TCPOPT_NOP		1	/* Padding */
 #define TCPOPT_MSS		2	/* Segment size negotiating */
 #define TCPOPT_WINDOW		3	/* Window scaling */
 #define TCPOPT_SACK_PERM        4       /* SACK Permitted */
@@ -166,30 +165,32 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
 #define TCPOPT_TIMESTAMP	8	/* Better RTT estimations/PAWS */
 #define TCPOPT_MD5SIG		19	/* MD5 Signature (RFC2385) */
 #define TCPOPT_COOKIE		253	/* Cookie extension (experimental) */
-
-/*
- *     TCP option lengths
- */
-
-#define TCPOLEN_MSS            4
-#define TCPOLEN_WINDOW         3
-#define TCPOLEN_SACK_PERM      2
-#define TCPOLEN_TIMESTAMP      10
-#define TCPOLEN_MD5SIG         18
-#define TCPOLEN_COOKIE_BASE    2	/* Cookie-less header extension */
-#define TCPOLEN_COOKIE_PAIR    3	/* Cookie pair header extension */
-#define TCPOLEN_COOKIE_MIN     (TCPOLEN_COOKIE_BASE+TCP_COOKIE_MIN)
-#define TCPOLEN_COOKIE_MAX     (TCPOLEN_COOKIE_BASE+TCP_COOKIE_MAX)
-
-/* But this is what stacks really send out. */
-#define TCPOLEN_TSTAMP_ALIGNED		12
+#define TCPOPT_TSTAMP64		254	/* 64-bit extension (experimental) */
+
+/*	TCP option lengths (same order as above) */
+#define TCPOLEN_MSS		4
+#define TCPOLEN_WINDOW		3
+#define TCPOLEN_SACK_PERM	2
+#define TCPOLEN_SACK_BASE	2
+#define TCPOLEN_SACK_PERBLOCK	8
+#define TCPOLEN_TIMESTAMP	10
+#define TCPOLEN_MD5SIG		18
+#define TCPOLEN_COOKIE_BASE	2	/* Cookie-less header extension */
+#define TCPOLEN_COOKIE_PAIR	4	/* Cookie pair header extension */
+#define TCPOLEN_COOKIE_MIN	(TCPOLEN_COOKIE_BASE+TCP_COOKIE_MIN)
+#define TCPOLEN_COOKIE_MAX	(TCPOLEN_COOKIE_BASE+TCP_COOKIE_MAX)
+#define TCPOLEN_TSTAMP64	3
+
+/*	TCP options 32-bit aligned (same order as above) */
+#define TCPOLEN_MSS_ALIGNED		4
 #define TCPOLEN_WSCALE_ALIGNED		4
 #define TCPOLEN_SACKPERM_ALIGNED	4
-#define TCPOLEN_SACK_BASE		2
 #define TCPOLEN_SACK_BASE_ALIGNED	4
-#define TCPOLEN_SACK_PERBLOCK		8
+#define TCPOLEN_TSTAMP_ALIGNED		12
 #define TCPOLEN_MD5SIG_ALIGNED		20
-#define TCPOLEN_MSS_ALIGNED		4
+
+/*	TCP option extensions (same order as above) */
+#define TCPOEXT_TSTAMP64		16
 
 /* Flags in tp->nonagle */
 #define TCP_NAGLE_OFF		1	/* Nagle's algo is disabled */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2a9ef6b..dcad45b 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3725,17 +3725,27 @@ old_ack:
 /* Look for tcp options. Normally only called on SYN and SYNACK packets.
  * But, this can also be called on packets in the established flow when
  * the fast version below fails.
+ *
+ * Returns:
+ *	0 on success
+ *	- on failure
  */
 int tcp_parse_options(struct sk_buff *skb, const struct tcphdr *th,
 		      struct tcp_options_received *opt_rx, u8 **hvpp, int estab)
 {
+	__be32 *xdp = (__be32 *)th + th->doff;
 	unsigned char *ptr = (unsigned char *)(th + 1);
+	int remaining = skb_headlen(skb) - tcp_header_len_th(th);
 	int length = tcp_option_len_th(th);
+	int extend = 0;
 	bool syn = th->syn;
 
 	opt_rx->cookie_plus = 0;
+	opt_rx->extended = 0;
+	opt_rx->saw_tstamp64 = 0; /* false */
 	opt_rx->saw_tstamp = 0; /* false */
 
+repeat:
 	while (length > 0) {
 		int opsize;
 		int opcode = *ptr++;
@@ -3833,26 +3843,104 @@ int tcp_parse_options(struct sk_buff *skb, const struct tcphdr *th,
 				/* not yet implemented */
 				break;
 			case TCPOLEN_COOKIE_PAIR: {
-				/* not yet implemented */
+				int words = ptr[1] & 0xf;
+
+				if (!syn &&
+				    *ptr >= words &&
+				    words >= (TCP_COOKIE_MIN / 4) &&
+				    words <= (TCP_COOKIE_MAX / 4) &&
+				    opt_rx->extended == 0) {
+					int bytes = words * 4;
+
+					opt_rx->extended = *ptr;
+					extend = *ptr * 4;
+
+					/* Adjust end_seq, set in
+					 * tcp_v[4,6]_rcv()
+					 */
+					TCP_SKB_CB(skb)->end_seq -= extend;
+					remaining -= extend;
+
+					if (unlikely(remaining < 0)) {
+						/* missing data!!! */
+						return remaining;
+					}
+					extend -= bytes;
+
+					opt_rx->cookie_plus = bytes
+						+ TCPOLEN_COOKIE_BASE;
+					*hvpp = (u8 *)xdp;
+					xdp += words;
+				}
 				break;
 			}
 			case TCPOLEN_COOKIE_MIN+0:
 			case TCPOLEN_COOKIE_MIN+2:
 			case TCPOLEN_COOKIE_MIN+4:
 			case TCPOLEN_COOKIE_MIN+6:
-			case TCPOLEN_COOKIE_MAX:
 				/* 16-bit multiple */
 				if (syn) {
 					opt_rx->cookie_plus = opsize;
 					*hvpp = ptr;
 				}
 				break;
+			case TCPOLEN_COOKIE_MAX+0:
+				/* either cookie or cookie pair */
+				if (syn || opt_rx->saw_tstamp64) {
+					opt_rx->cookie_plus = opsize;
+					*hvpp = ptr;
+				}
+				break;
+			case TCPOLEN_COOKIE_MAX+4:
+			case TCPOLEN_COOKIE_MAX+8:
+			case TCPOLEN_COOKIE_MAX+12:
+			case TCPOLEN_COOKIE_MAX+TCP_COOKIE_MAX:
+				/* 32-bit multiple (pair) */
+				if (opt_rx->saw_tstamp64) {
+					opt_rx->cookie_plus = opsize;
+					*hvpp = ptr;
+				}
+				break;
 			default:
 				/* ignore option */
 				break;
 			};
 			break;
 
+		case TCPOPT_TSTAMP64:
+			if (opsize == TCPOLEN_TSTAMP64) {
+				if (!syn &&
+				    *ptr >= (TCPOEXT_TSTAMP64 / 4) &&
+				    !opt_rx->saw_tstamp &&
+				    opt_rx->extended == 0) {
+					opt_rx->extended = *ptr;
+					extend = *ptr * 4;
+
+					/* Adjust end_seq, set in
+					 * tcp_v[4,6]_rcv()
+					 */
+					TCP_SKB_CB(skb)->end_seq -= extend;
+					remaining -= extend;
+
+					if (unlikely(remaining < 0)) {
+						/* missing data!!! */
+						return remaining;
+					}
+					extend -= TCPOEXT_TSTAMP64;
+
+					/* 64-bits not yet implemented */
+					xdp++;
+					opt_rx->rcv_tsval = ntohl(*xdp);
+					xdp += 2;
+					opt_rx->rcv_tsecr = ntohl(*xdp);
+					xdp++;
+
+					opt_rx->saw_tstamp64 = 1; /* true */
+					opt_rx->saw_tstamp = 1; /* true */
+				}
+			}
+			break;
+
 		default:
 			/* skip unrecognized options */
 			break;
@@ -3861,6 +3949,13 @@ int tcp_parse_options(struct sk_buff *skb, const struct tcphdr *th,
 		ptr += opsize - 2;
 		length -= opsize;
 	}
+
+	if (unlikely(extend > 0)) {
+		ptr = (unsigned char *)xdp;
+		length = extend;
+		extend = 0;
+		goto repeat;
+	}
 	return 0;
 }
 
@@ -3887,6 +3982,11 @@ static int tcp_parse_aligned_timestamp(struct tcp_sock *tp, struct tcphdr *th)
 
 /* Fast parse options. This hopes to only see timestamps.
  * If it is wrong it falls back on tcp_parse_options().
+ *
+ * Returns:
+ *	1 on success, fast
+ *	0 on success, slow
+ *	- on failure
  */
 static int tcp_fast_parse_options(struct sk_buff *skb, struct tcphdr *th,
 				  struct tcp_sock *tp, u8 **hvpp)
@@ -3896,11 +3996,14 @@ static int tcp_fast_parse_options(struct sk_buff *skb, struct tcphdr *th,
 	 */
 	if (th->doff == (sizeof(*th) / 4)) {
 		tp->rx_opt.saw_tstamp = 0;
+		tp->rx_opt.extended = 0;
 		return 0;
-	} else if (tp->rx_opt.tstamp_ok &&
-		   th->doff == ((sizeof(*th) + TCPOLEN_TSTAMP_ALIGNED) / 4)) {
-		if (tcp_parse_aligned_timestamp(tp, th))
-			return 1;
+	}
+	if (th->doff == ((sizeof(*th) + TCPOLEN_TSTAMP_ALIGNED) / 4) &&
+	    tp->rx_opt.tstamp_ok &&
+	    tcp_parse_aligned_timestamp(tp, th)) {
+		tp->rx_opt.extended = 0;
+		return 1;
 	}
 	return tcp_parse_options(skb, th, &tp->rx_opt, hvpp, 1);
 }
@@ -3911,8 +4014,8 @@ static int tcp_fast_parse_options(struct sk_buff *skb, struct tcphdr *th,
  */
 u8 *tcp_parse_md5sig_option(struct tcphdr *th)
 {
-	int length = (th->doff << 2) - sizeof (*th);
 	u8 *ptr = (u8*)(th + 1);
+	int length = tcp_option_len_th(th);
 
 	/* If the TCP option is too short, we can short cut */
 	if (length < TCPOLEN_MD5SIG)
@@ -4377,7 +4480,7 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 	if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq)
 		goto drop;
 
-	__skb_pull(skb, th->doff * 4);
+	__skb_pull(skb, (th->doff + tp->rx_opt.extended) * 4);
 
 	TCP_ECN_accept_cwr(tp, skb);
 
@@ -5038,8 +5141,8 @@ static void tcp_urg(struct sock *sk, struct sk_buff *skb, struct tcphdr *th)
 
 	/* Do we wait for any urgent data? - normally not... */
 	if (tp->urg_data == TCP_URG_NOTYET) {
-		u32 ptr = tp->urg_seq - ntohl(th->seq) + (th->doff * 4) -
-			  th->syn;
+		u32 ptr = ((th->doff + tp->rx_opt.extended) * 4)
+			+ tp->urg_seq - ntohl(th->seq) - th->syn;
 
 		/* Is the urgent pointer pointing into this packet? */
 		if (ptr < skb->len) {
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 0f1b409..2240012 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -109,6 +109,9 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
 		}
 	} else {
 		/* otherwise initialized by tcp_parse_options() */
+		tmp_opt.cookie_plus = 0;
+		tmp_opt.extended = 0;
+		tmp_opt.saw_tstamp64 = 0; /* false */
 		tmp_opt.saw_tstamp = 0; /* false */
 	}
 
@@ -550,6 +553,9 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 		}
 	} else {
 		/* otherwise initialized by tcp_parse_options() */
+		tmp_opt.cookie_plus = 0;
+		tmp_opt.extended = 0;
+		tmp_opt.saw_tstamp64 = 0; /* false */
 		tmp_opt.saw_tstamp = 0; /* false */
 	}
 
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 3/7] tcp: harmonize tcp_vx_rcv header length assumptions
  2010-02-15 12:28 ` [PATCH v5 3/7] tcp: harmonize tcp_vx_rcv header length assumptions William Allen Simpson
@ 2010-02-15 12:48   ` Andi Kleen
  2010-02-15 12:59     ` William Allen Simpson
  0 siblings, 1 reply; 16+ messages in thread
From: Andi Kleen @ 2010-02-15 12:48 UTC (permalink / raw)
  To: William Allen Simpson
  Cc: Linux Kernel Developers, Linux Kernel Network Developers,
	Andrew Morton, David Miller, Andi Kleen

On Mon, Feb 15, 2010 at 07:28:44AM -0500, William Allen Simpson wrote:
> Harmonize tcp_v4_rcv() and tcp_v6_rcv() -- better document tcp doff
> and header length assumptions, and carefully compare implementations.
> 
> Reduces multiply/shifts, marginally improving speed.
> 
> Removes redundant tcp header length checks before checksumming.
> 
> Instead, assumes (and documents) that any backlog processing and
> transform policies will carefully preserve the header, and will
> ensure the socket buffer length remains >= the header size.

I reviewed the patch and it looks ok to me. As far as I can
see it's mostly manual CSE, no real behaviour change.

Normally it's customary to separate formatting changes from
real changes, but it was only in a few places and not too bad.

I didn't fully understand that new comment:

/* nf_reset(skb); in ip6_input.c ip6_input_finish() */

Overall you can add a 

Reviewed-by: Andi Kleen <andi@firstfloor.org>

-Andi

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 3/7] tcp: harmonize tcp_vx_rcv header length assumptions
  2010-02-15 12:48   ` Andi Kleen
@ 2010-02-15 12:59     ` William Allen Simpson
  0 siblings, 0 replies; 16+ messages in thread
From: William Allen Simpson @ 2010-02-15 12:59 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linux Kernel Developers, Linux Kernel Network Developers,
	Andrew Morton, David Miller

Andi Kleen wrote:
> On Mon, Feb 15, 2010 at 07:28:44AM -0500, William Allen Simpson wrote:
>> Harmonize tcp_v4_rcv() and tcp_v6_rcv() -- better document tcp doff
>> and header length assumptions, and carefully compare implementations.
> 
> I didn't fully understand that new comment:
> 
> /* nf_reset(skb); in ip6_input.c ip6_input_finish() */
> 
That's part of the harmonization.  IPv4 has a nf_reset() in this code
position.  I asked on the list where IPv6 did the same thing, so that
the difference could be documented.  The information was provided by
Patrick McHardy.  If someday somebody actually finishes merging the
two functions, that's the only actual difference.


> Overall you can add a 
> 
> Reviewed-by: Andi Kleen <andi@firstfloor.org>
> 
Thanks, hopefully as applied.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 4/7] tcp: input header length, prediction, and timestamp bugs
  2010-02-15 12:31 ` [PATCH v4 4/7] tcp: input header length, prediction, and timestamp bugs William Allen Simpson
@ 2010-02-15 15:10   ` Andi Kleen
  2010-02-15 19:03     ` William Allen Simpson
  2010-02-15 21:24   ` [PATCH v5 " William Allen Simpson
  1 sibling, 1 reply; 16+ messages in thread
From: Andi Kleen @ 2010-02-15 15:10 UTC (permalink / raw)
  To: William Allen Simpson
  Cc: Linux Kernel Developers, Linux Kernel Network Developers,
	Andrew Morton, David Miller, Andi Kleen

On Mon, Feb 15, 2010 at 07:31:11AM -0500, William Allen Simpson wrote:
> Fix incorrect header prediction flags documentation.
> 
> Relieve register pressure in (the i386) fast path by accessing skb->len
> directly, instead of carrying a rarely used len parameter.
> 
> Eliminate unused len parameters in two other functions.
> 
> Don't use output calculated tp->tcp_header_len for input decisions.
> While the output header is usually the same as the input (same options
> in both directions), that's a poor assumption. In particular, Sack will
> be different. Newer options are not guaranteed.

Is this a bug fix? 

tcp_ack() is so bloated these days that I sometimes wonder
if the fast path still makes sense. It would be probably
an interesting study to actually count cycles for
the common cases.

On the other hand CPU cycles on modern systems are so cheap
that it might be simpler to simply ignore them and only
optimize for cache misses. I suppose a cache optimized
tcp_rcv_establish() would look quite different.

Anyways that's not directly related to the patch.

> There's no need to test buffer length against header length, already
> checked by tcp_v[4,6]_rcv(). Straighten code for minor efficiency gain.
> 
> Stand-alone patch, originally developed for TCPCT.

Normally it would be better to split this into smaller patches
that do one thing at a time (typically this requires getting
used to patch stack tools like "quilt")

But it's not too bad here.

>  static inline void __tcp_fast_path_on(struct tcp_sock *tp, u32 snd_wnd)
>  {
> -	tp->pred_flags = htonl((tp->tcp_header_len << 26) |
> +	tp->pred_flags = htonl((__tcp_fast_path_header_length(tp) << (28 - 2)) |

It would be better to use defines or sizeof for the magic numbers.

> -	tp->rx_opt.saw_tstamp = 0;
> -
> -	/*	pred_flags is 0xS?10 << 16 + snd_wnd
> -	 *	if header_prediction is to be made
> -	 *	'S' will always be tp->tcp_header_len >> 2
> -	 *	'?' will be 0 for the fast path, otherwise pred_flags is 0 to
> -	 *  turn it off	(when there are holes in the receive
> -	 *	 space for instance)
> -	 *	PSH flag is ignored.
> -	 */

I liked the comment at this place.


I did a quick review of the rest and it seems ok to me.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 4/7] tcp: input header length, prediction, and timestamp bugs
  2010-02-15 15:10   ` Andi Kleen
@ 2010-02-15 19:03     ` William Allen Simpson
  2010-02-15 19:15       ` Andi Kleen
  0 siblings, 1 reply; 16+ messages in thread
From: William Allen Simpson @ 2010-02-15 19:03 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linux Kernel Developers, Linux Kernel Network Developers,
	Andrew Morton, David Miller

Andi Kleen wrote:
> On Mon, Feb 15, 2010 at 07:31:11AM -0500, William Allen Simpson wrote:
>> Don't use output calculated tp->tcp_header_len for input decisions.
>> While the output header is usually the same as the input (same options
>> in both directions), that's a poor assumption. In particular, Sack will
>> be different. Newer options are not guaranteed.
> 
> Is this a bug fix? 
> 
Yes.  One of many, all inter-related.

I don't know how much description folks want in the patch "summary", so
simply used declarative statements that are one-to-one with the order of
the patch, but it took me a bit to grok this problem!

1) unknown options can be stripped out of the header in middleware, see
RFC 1122 section 4.2.2.5.

2) new options Cookie Pair and 64-bit Timestamps (defined in patch 7).

3) stripping them leaves a Sack covering 1 segment, which has the exact
same word count as 32-bit Timestamps.  Boom!  All the silly checks
against the size of the options field (instead of the proper saw_tstamp)
start setting fields based on completely useless data!

4) and of course, using the size of the previous output to predict the
expected input header size is a poor assumption (to be generous).

There are 29+ options these days, not 4 or 5.  There are options that
are only sent one way.  There are options that have different data in
different directions.

Yes, it was originally for TCPCT, but fixes a broad spectrum of bugs.


>> Stand-alone patch, originally developed for TCPCT.
> 
> Normally it would be better to split this into smaller patches
> that do one thing at a time (typically this requires getting
> used to patch stack tools like "quilt")
> 
> But it's not too bad here.
> 
There are small efficiency patches included, but it would be likely
impossible to split them from the bug fixes without re-writing the same
code over and over again.  And I'm doing these patch splits by hand....

I did recently learn how to maintain branches that are branches on top of
each other, so I've got tcpct1, tcpct2, and tcpct3 for the 3 parts.  But
it's a pain to keep updated with git fetch, and checkout, and rebase,
for each branch.

At first, I was keeping a master patch set, and trying to maintain it
over .31, .32, and net-next, and now .33 -- but I gave up.


>>  static inline void __tcp_fast_path_on(struct tcp_sock *tp, u32 snd_wnd)
>>  {
>> -	tp->pred_flags = htonl((tp->tcp_header_len << 26) |
>> +	tp->pred_flags = htonl((__tcp_fast_path_header_length(tp) << (28 - 2)) |
> 
> It would be better to use defines or sizeof for the magic numbers.
> 
I agree!  I was just following the existing coding style, trying to
improve understanding by splitting it into 28 (matches the shift
documented in the header file), and 2 (the doff field is actually the
number of 32-bit words).

These are field offsets in a 32-bit word, sizeof() wouldn't work.

It might be even better to have pred_flags be a union, but I didn't do
the original design for this code....

I'll add a nice block comment explaining the shift value here.


>> -	tp->rx_opt.saw_tstamp = 0;
>> -
>> -	/*	pred_flags is 0xS?10 << 16 + snd_wnd
>> -	 *	if header_prediction is to be made
>> -	 *	'S' will always be tp->tcp_header_len >> 2
>> -	 *	'?' will be 0 for the fast path, otherwise pred_flags is 0 to
>> -	 *  turn it off	(when there are holes in the receive
>> -	 *	 space for instance)
>> -	 *	PSH flag is ignored.
>> -	 */
> 
> I liked the comment at this place.
> 
The existing comment here didn't match the comment in the header file,
and both comments had errors.  Here, the 'S' is wrong.  Instead, it was
only '5' in the header file, among other errors.

It's easier to avoid bit-rot by defining in only one place, but I will
add a comment here saying "See linux/tcp.h for pred_flags details."


> 
> I did a quick review of the rest and it seems ok to me.
> 
> -Andi
> 
Thank you again.  As the fixes requested are merely adding comments,
I'll quickly re-spin this patch without reposting the entire patch set.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 4/7] tcp: input header length, prediction, and timestamp bugs
  2010-02-15 19:03     ` William Allen Simpson
@ 2010-02-15 19:15       ` Andi Kleen
  2010-02-15 21:24         ` William Allen Simpson
  0 siblings, 1 reply; 16+ messages in thread
From: Andi Kleen @ 2010-02-15 19:15 UTC (permalink / raw)
  To: William Allen Simpson
  Cc: Andi Kleen, Linux Kernel Developers,
	Linux Kernel Network Developers, Andrew Morton, David Miller

On Mon, Feb 15, 2010 at 02:03:34PM -0500, William Allen Simpson wrote:
> Andi Kleen wrote:
> >On Mon, Feb 15, 2010 at 07:31:11AM -0500, William Allen Simpson wrote:
> >>Don't use output calculated tp->tcp_header_len for input decisions.
> >>While the output header is usually the same as the input (same options
> >>in both directions), that's a poor assumption. In particular, Sack will
> >>be different. Newer options are not guaranteed.
> >
> >Is this a bug fix? 
> >
> Yes.  One of many, all inter-related.
> 
> I don't know how much description folks want in the patch "summary", so

A good way  to find out is to read some of the existing (newer) changelogs.

> simply used declarative statements that are one-to-one with the order of
> the patch, but it took me a bit to grok this problem!

It's useful to distingush bug fixes from cleanups from optimizations etc.

> 
> 1) unknown options can be stripped out of the header in middleware, see
> RFC 1122 section 4.2.2.5.
> 
> 2) new options Cookie Pair and 64-bit Timestamps (defined in patch 7).
> 
> 3) stripping them leaves a Sack covering 1 segment, which has the exact
> same word count as 32-bit Timestamps.  Boom!  All the silly checks

Boom == the code would use sack data as time stamp?

> against the size of the options field (instead of the proper saw_tstamp)
> start setting fields based on completely useless data!
> 
> 4) and of course, using the size of the previous output to predict the
> expected input header size is a poor assumption (to be generous).

Invalid fast path heuristics are probably not bugs in the strict sense,
as long as they do not lead to actually incorrect behaviour.

> 
> There are 29+ options these days, not 4 or 5.  There are options that
> are only sent one way.  There are options that have different data in
> different directions.
> 
> Yes, it was originally for TCPCT, but fixes a broad spectrum of bugs.

It would be good if you could extract the pure bug fixes from the rest.

Normally there's a "fast path" in patch submission for bug fixes
(e.g. might still go in for the current kernel). The other stuff
goes in slower.

> 
> 
> >>Stand-alone patch, originally developed for TCPCT.
> >
> >Normally it would be better to split this into smaller patches
> >that do one thing at a time (typically this requires getting
> >used to patch stack tools like "quilt")
> >
> >But it's not too bad here.
> >
> There are small efficiency patches included, but it would be likely
> impossible to split them from the bug fixes without re-writing the same
> code over and over again.  And I'm doing these patch splits by hand....

That's the problem. Get the right tools. Splitting patches by hand
is usually the wrong approach for anything non trivial.

> I did recently learn how to maintain branches that are branches on top of
> each other, so I've got tcpct1, tcpct2, and tcpct3 for the 3 parts.  But
> it's a pain to keep updated with git fetch, and checkout, and rebase,
> for each branch.

I personally use quilt for such patchkits, it's much easier.

For git I believe there are several extensions that implement quilt
like patch stacking.

But there are different ways to skin a cat and people do it differently.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 4/7] tcp: input header length, prediction, and timestamp bugs
  2010-02-15 19:15       ` Andi Kleen
@ 2010-02-15 21:24         ` William Allen Simpson
  0 siblings, 0 replies; 16+ messages in thread
From: William Allen Simpson @ 2010-02-15 21:24 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linux Kernel Developers, Linux Kernel Network Developers,
	Andrew Morton, David Miller

Andi Kleen wrote:
> On Mon, Feb 15, 2010 at 02:03:34PM -0500, William Allen Simpson wrote:
>> Andi Kleen wrote:
>>> Is this a bug fix? 
>>>
>> Yes.  One of many, all inter-related.
>>
> It's useful to distingush bug fixes from cleanups from optimizations etc.
> ...

The bug fixes depend on the cleanups.
  * 1/7 has the new functions.
  * 2/7 has bug fixes that require the new functions.
  * 3/7 fixes (and optimizes) the incoming code path, so that
  * 4/7 can fix the bugs.
  * 5/7 is needed to test the code from part 1.
  * 6/7 is the cleanup of parsing (also fixes some bugs introduced by
    davem's patch in December), so that
  * 7/7 can fix the bugs.

I originally tried submitting these as separate individual pairs of
patches (in November, December, January), but nobody paid any attention.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v5 4/7] tcp: input header length, prediction, and timestamp bugs
  2010-02-15 12:31 ` [PATCH v4 4/7] tcp: input header length, prediction, and timestamp bugs William Allen Simpson
  2010-02-15 15:10   ` Andi Kleen
@ 2010-02-15 21:24   ` William Allen Simpson
  1 sibling, 0 replies; 16+ messages in thread
From: William Allen Simpson @ 2010-02-15 21:24 UTC (permalink / raw)
  To: Linux Kernel Developers
  Cc: Linux Kernel Network Developers, Andrew Morton, David Miller,
	Andi Kleen

[-- Attachment #1: Type: text/plain, Size: 1883 bytes --]

Fix incorrect header prediction flags documentation.

Relieve register pressure in (the i386) fast path by accessing skb->len
directly, instead of carrying a rarely used len parameter.

Eliminate unused len parameters in two other functions.

Don't use output calculated tp->tcp_header_len for input decisions.
While the output header is usually the same as the input (same options
in both directions), that's a poor assumption. In particular, Sack will
be different. Newer options are not guaranteed.

Moreover, in the fast path, that only saved a shift or two. The other
efficiencies in this patch more than make up the difference.

Instead, use tp->rx_opt.tstamp_ok to accurately predict header length.

Likewise, use tp->rx_opt.tstamp_ok for received MSS calculations.

Don't use "sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED" to guess that
the timestamp is present. This may have been OK in the days with fewer
possible options, but various combinations of newer options may yield
the same header length. (This bug is in 3 places.)

Instead, use tp->rx_opt.saw_tstamp to determine a timestamp is present.

There's no need to test buffer length against header length, already
checked by tcp_v[4,6]_rcv(). Straighten code for minor efficiency gain.

Stand-alone patch, originally developed for TCPCT.

Requires:
   net: tcp_header_len_th and tcp_option_len_th
   tcp: harmonize tcp_vx_rcv header length assumptions

Signed-off-by: William.Allen.Simpson@gmail.com
Reviewed-by: Andi Kleen <andi@firstfloor.org>
---
  include/linux/tcp.h      |    6 ++-
  include/net/tcp.h        |   18 ++++++--
  net/ipv4/tcp_input.c     |   96 +++++++++++++++++++---------------------------
  net/ipv4/tcp_ipv4.c      |    4 +-
  net/ipv4/tcp_minisocks.c |    3 +-
  net/ipv4/tcp_probe.c     |    2 +-
  net/ipv6/tcp_ipv6.c      |    4 +-
  7 files changed, 63 insertions(+), 70 deletions(-)



[-- Attachment #2: len_th+2d5+2.6.33-rc8.patch --]
[-- Type: text/plain, Size: 10166 bytes --]

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 74728f7..2987ee8 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -301,7 +301,11 @@ struct tcp_sock {
 
 /*
  *	Header prediction flags
- *	0x5?10 << 16 + snd_wnd in net byte order
+ *	S << 28 + TCP_FLAG_ACK + snd_wnd, in net byte order
+ *		(PSH flag is ignored)
+ *		S is 5 (no options), or 8 (timestamp aligned)
+ *	otherwise, 0 to turn it off -- for instance, when there are
+ *	holes in receive space.
  */
 	__be32	pred_flags;
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 34f5cc2..52dd185 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -310,13 +310,11 @@ extern int			tcp_ioctl(struct sock *sk,
 
 extern int			tcp_rcv_state_process(struct sock *sk, 
 						      struct sk_buff *skb,
-						      struct tcphdr *th,
-						      unsigned len);
+						      struct tcphdr *th);
 
 extern int			tcp_rcv_established(struct sock *sk, 
 						    struct sk_buff *skb,
-						    struct tcphdr *th, 
-						    unsigned len);
+						    struct tcphdr *th);
 
 extern void			tcp_rcv_space_adjust(struct sock *sk);
 
@@ -533,9 +531,19 @@ static inline u32 __tcp_set_rto(const struct tcp_sock *tp)
 	return (tp->srtt >> 3) + tp->rttvar;
 }
 
+static inline u8 __tcp_fast_path_header_length(const struct tcp_sock *tp)
+{
+	return tp->rx_opt.tstamp_ok
+		? sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED
+		: sizeof(struct tcphdr);
+}
+
 static inline void __tcp_fast_path_on(struct tcp_sock *tp, u32 snd_wnd)
 {
-	tp->pred_flags = htonl((tp->tcp_header_len << 26) |
+	/* See linux/tcp.h for pred_flags details.
+	 * 'S' (doff) is 32-bit words, convert from bytes: 26 = 28 - 2.
+	 */
+	tp->pred_flags = htonl((__tcp_fast_path_header_length(tp) << 26) |
 			       ntohl(TCP_FLAG_ACK) |
 			       snd_wnd);
 }
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 28e0296..d314170 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -152,7 +152,7 @@ static void tcp_measure_rcv_mss(struct sock *sk, const struct sk_buff *skb)
 			 * tcp header plus fixed timestamp option length.
 			 * Resulting "len" is MSS free of SACK jitter.
 			 */
-			len -= tcp_sk(sk)->tcp_header_len;
+			len -= __tcp_fast_path_header_length(tcp_sk(sk));
 			icsk->icsk_ack.last_seg_size = len;
 			if (len == lss) {
 				icsk->icsk_ack.rcv_mss = len;
@@ -5206,7 +5206,7 @@ discard:
  *	tcp_data_queue when everything is OK.
  */
 int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
-			struct tcphdr *th, unsigned len)
+			struct tcphdr *th)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	int res;
@@ -5224,32 +5224,18 @@ int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
 	 *	Our current scheme is not silly either but we take the
 	 *	extra cost of the net_bh soft interrupt processing...
 	 *	We do checksum and copy also but from device to kernel.
+	 *
+	 *	See linux/tcp.h and net/tcp.h for pred_flags details.
 	 */
-
-	tp->rx_opt.saw_tstamp = 0;
-
-	/*	pred_flags is 0xS?10 << 16 + snd_wnd
-	 *	if header_prediction is to be made
-	 *	'S' will always be tp->tcp_header_len >> 2
-	 *	'?' will be 0 for the fast path, otherwise pred_flags is 0 to
-	 *  turn it off	(when there are holes in the receive
-	 *	 space for instance)
-	 *	PSH flag is ignored.
-	 */
-
 	if ((tcp_flag_word(th) & TCP_HP_BITS) == tp->pred_flags &&
 	    TCP_SKB_CB(skb)->seq == tp->rcv_nxt &&
 	    !after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt)) {
-		int tcp_header_len = tp->tcp_header_len;
-
-		/* Timestamp header prediction: tcp_header_len
-		 * is automatically equal to th->doff*4 due to pred_flags
-		 * match.
-		 */
+		int tcp_header_len = tcp_header_len_th(th);
 
-		/* Check timestamp */
-		if (tcp_header_len == sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED) {
-			/* No? Slow path! */
+		/* Timestamp header prediction */
+		if (tcp_header_len != sizeof(*th) + TCPOLEN_TSTAMP_ALIGNED) {
+			tp->rx_opt.saw_tstamp = 0; /* false */
+		} else {
 			if (!tcp_parse_aligned_timestamp(tp, th))
 				goto slow_path;
 
@@ -5264,35 +5250,12 @@ int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
 			 */
 		}
 
-		if (len <= tcp_header_len) {
-			/* Bulk data transfer: sender */
-			if (len == tcp_header_len) {
-				/* Predicted packet is in window by definition.
-				 * seq == rcv_nxt and rcv_wup <= rcv_nxt.
-				 * Hence, check seq<=rcv_wup reduces to:
-				 */
-				if (tcp_header_len ==
-				    (sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED) &&
-				    tp->rcv_nxt == tp->rcv_wup)
-					tcp_store_ts_recent(tp);
-
-				/* We know that such packets are checksummed
-				 * on entry.
-				 */
-				tcp_ack(sk, skb, 0);
-				__kfree_skb(skb);
-				tcp_data_snd_check(sk);
-				return 0;
-			} else { /* Header too small */
-				TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_INERRS);
-				goto discard;
-			}
-		} else {
+		if (tcp_header_len < skb->len) {
 			int eaten = 0;
 			int copied_early = 0;
 
 			if (tp->copied_seq == tp->rcv_nxt &&
-			    len - tcp_header_len <= tp->ucopy.len) {
+			    skb->len - tcp_header_len <= tp->ucopy.len) {
 #ifdef CONFIG_NET_DMA
 				if (tcp_dma_try_early_copy(sk, skb, tcp_header_len)) {
 					copied_early = 1;
@@ -5311,9 +5274,7 @@ int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
 					 * seq == rcv_nxt and rcv_wup <= rcv_nxt.
 					 * Hence, check seq<=rcv_wup reduces to:
 					 */
-					if (tcp_header_len ==
-					    (sizeof(struct tcphdr) +
-					     TCPOLEN_TSTAMP_ALIGNED) &&
+					if (tp->rx_opt.saw_tstamp &&
 					    tp->rcv_nxt == tp->rcv_wup)
 						tcp_store_ts_recent(tp);
 
@@ -5334,8 +5295,7 @@ int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
 				 * seq == rcv_nxt and rcv_wup <= rcv_nxt.
 				 * Hence, check seq<=rcv_wup reduces to:
 				 */
-				if (tcp_header_len ==
-				    (sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED) &&
+				if (tp->rx_opt.saw_tstamp &&
 				    tp->rcv_nxt == tp->rcv_wup)
 					tcp_store_ts_recent(tp);
 
@@ -5376,11 +5336,33 @@ no_ack:
 			else
 				sk->sk_data_ready(sk, 0);
 			return 0;
+		} else {
+			/* Bulk data transfer: sender
+			 *
+			 * tcp_header_len > skb->len never happens,
+			 * already checked by tcp_v[4,6]_rcv()
+			 *
+			 * Predicted packet is in window by definition.
+			 * seq == rcv_nxt and rcv_wup <= rcv_nxt.
+			 * Hence, check seq<=rcv_wup reduces to:
+			 */
+			if (tp->rx_opt.saw_tstamp &&
+			    tp->rcv_nxt == tp->rcv_wup)
+				tcp_store_ts_recent(tp);
+
+			/* We know that such packets are checksummed
+			 * on entry.
+			 */
+			tcp_ack(sk, skb, 0);
+			__kfree_skb(skb);
+			tcp_data_snd_check(sk);
+			return 0;
 		}
 	}
 
 slow_path:
-	if (len < (th->doff << 2) || tcp_checksum_complete_user(sk, skb))
+	/* Assumes header and options unchanged since checksum_init() */
+	if (tcp_checksum_complete_user(sk, skb))
 		goto csum_error;
 
 	/*
@@ -5416,7 +5398,7 @@ discard:
 }
 
 static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
-					 struct tcphdr *th, unsigned len)
+					 struct tcphdr *th)
 {
 	u8 *hash_location;
 	struct inet_connection_sock *icsk = inet_csk(sk);
@@ -5693,7 +5675,7 @@ reset_and_undo:
  */
 
 int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
-			  struct tcphdr *th, unsigned len)
+			  struct tcphdr *th)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct inet_connection_sock *icsk = inet_csk(sk);
@@ -5740,7 +5722,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
 		goto discard;
 
 	case TCP_SYN_SENT:
-		queued = tcp_rcv_synsent_state_process(sk, skb, th, len);
+		queued = tcp_rcv_synsent_state_process(sk, skb, th);
 		if (queued >= 0)
 			return queued;
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 0a76e41..f999e06 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1551,7 +1551,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
 
 	if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */
 		TCP_CHECK_TIMER(sk);
-		if (tcp_rcv_established(sk, skb, tcp_hdr(skb), skb->len)) {
+		if (tcp_rcv_established(sk, skb, tcp_hdr(skb))) {
 			rsk = sk;
 			goto reset;
 		}
@@ -1578,7 +1578,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
 	}
 
 	TCP_CHECK_TIMER(sk);
-	if (tcp_rcv_state_process(sk, skb, tcp_hdr(skb), skb->len)) {
+	if (tcp_rcv_state_process(sk, skb, tcp_hdr(skb))) {
 		rsk = sk;
 		goto reset;
 	}
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index f206ee5..37b7536 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -718,8 +718,7 @@ int tcp_child_process(struct sock *parent, struct sock *child,
 	int state = child->sk_state;
 
 	if (!sock_owned_by_user(child)) {
-		ret = tcp_rcv_state_process(child, skb, tcp_hdr(skb),
-					    skb->len);
+		ret = tcp_rcv_state_process(child, skb, tcp_hdr(skb));
 		/* Wakeup parent, send SIGIO */
 		if (state == TCP_SYN_RECV && child->sk_state != state)
 			parent->sk_data_ready(parent, 0);
diff --git a/net/ipv4/tcp_probe.c b/net/ipv4/tcp_probe.c
index 9bc805d..de2a32e 100644
--- a/net/ipv4/tcp_probe.c
+++ b/net/ipv4/tcp_probe.c
@@ -88,7 +88,7 @@ static inline int tcp_probe_avail(void)
  * Note: arguments must match tcp_rcv_established()!
  */
 static int jtcp_rcv_established(struct sock *sk, struct sk_buff *skb,
-			       struct tcphdr *th, unsigned len)
+			       struct tcphdr *th)
 {
 	const struct tcp_sock *tp = tcp_sk(sk);
 	const struct inet_sock *inet = inet_sk(sk);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index b76939a..3d08a4d 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1586,7 +1586,7 @@ static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
 
 	if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */
 		TCP_CHECK_TIMER(sk);
-		if (tcp_rcv_established(sk, skb, tcp_hdr(skb), skb->len))
+		if (tcp_rcv_established(sk, skb, tcp_hdr(skb)))
 			goto reset;
 		TCP_CHECK_TIMER(sk);
 		if (opt_skb)
@@ -1618,7 +1618,7 @@ static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
 	}
 
 	TCP_CHECK_TIMER(sk);
-	if (tcp_rcv_state_process(sk, skb, tcp_hdr(skb), skb->len))
+	if (tcp_rcv_state_process(sk, skb, tcp_hdr(skb)))
 		goto reset;
 	TCP_CHECK_TIMER(sk);
 	if (opt_skb)
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v5 4/7] tcp: input header length, prediction, and timestamp bugs
  2010-02-25 20:30 [PATCH 0/7] tcp: bugs and cleanup for 2.6.33 William Allen Simpson
@ 2010-02-25 20:49 ` William Allen Simpson
  0 siblings, 0 replies; 16+ messages in thread
From: William Allen Simpson @ 2010-02-25 20:49 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton
  Cc: Linux Kernel Developers, Linux Kernel Network Developers,
	David Miller, Andi Kleen

[-- Attachment #1: Type: text/plain, Size: 1883 bytes --]

Fix incorrect header prediction flags documentation.

Relieve register pressure in (the i386) fast path by accessing skb->len
directly, instead of carrying a rarely used len parameter.

Eliminate unused len parameters in two other functions.

Don't use output calculated tp->tcp_header_len for input decisions.
While the output header is usually the same as the input (same options
in both directions), that's a poor assumption. In particular, Sack will
be different. Newer options are not guaranteed.

Moreover, in the fast path, that only saved a shift or two. The other
efficiencies in this patch more than make up the difference.

Instead, use tp->rx_opt.tstamp_ok to accurately predict header length.

Likewise, use tp->rx_opt.tstamp_ok for received MSS calculations.

Don't use "sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED" to guess that
the timestamp is present. This may have been OK in the days with fewer
possible options, but various combinations of newer options may yield
the same header length. (This bug is in 3 places.)

Instead, use tp->rx_opt.saw_tstamp to determine a timestamp is present.

There's no need to test buffer length against header length, already
checked by tcp_v[4,6]_rcv(). Straighten code for minor efficiency gain.

Stand-alone patch, originally developed for TCPCT.

Requires:
   net: tcp_header_len_th and tcp_option_len_th
   tcp: harmonize tcp_vx_rcv header length assumptions

Signed-off-by: William.Allen.Simpson@gmail.com
Reviewed-by: Andi Kleen <andi@firstfloor.org>
---
  include/linux/tcp.h      |    6 ++-
  include/net/tcp.h        |   18 ++++++--
  net/ipv4/tcp_input.c     |   96 +++++++++++++++++++---------------------------
  net/ipv4/tcp_ipv4.c      |    4 +-
  net/ipv4/tcp_minisocks.c |    3 +-
  net/ipv4/tcp_probe.c     |    2 +-
  net/ipv6/tcp_ipv6.c      |    4 +-
  7 files changed, 63 insertions(+), 70 deletions(-)



[-- Attachment #2: len_th+2d5+2.6.33.patch --]
[-- Type: text/plain, Size: 10166 bytes --]

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 74728f7..2987ee8 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -301,7 +301,11 @@ struct tcp_sock {
 
 /*
  *	Header prediction flags
- *	0x5?10 << 16 + snd_wnd in net byte order
+ *	S << 28 + TCP_FLAG_ACK + snd_wnd, in net byte order
+ *		(PSH flag is ignored)
+ *		S is 5 (no options), or 8 (timestamp aligned)
+ *	otherwise, 0 to turn it off -- for instance, when there are
+ *	holes in receive space.
  */
 	__be32	pred_flags;
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 34f5cc2..52dd185 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -310,13 +310,11 @@ extern int			tcp_ioctl(struct sock *sk,
 
 extern int			tcp_rcv_state_process(struct sock *sk, 
 						      struct sk_buff *skb,
-						      struct tcphdr *th,
-						      unsigned len);
+						      struct tcphdr *th);
 
 extern int			tcp_rcv_established(struct sock *sk, 
 						    struct sk_buff *skb,
-						    struct tcphdr *th, 
-						    unsigned len);
+						    struct tcphdr *th);
 
 extern void			tcp_rcv_space_adjust(struct sock *sk);
 
@@ -533,9 +531,19 @@ static inline u32 __tcp_set_rto(const struct tcp_sock *tp)
 	return (tp->srtt >> 3) + tp->rttvar;
 }
 
+static inline u8 __tcp_fast_path_header_length(const struct tcp_sock *tp)
+{
+	return tp->rx_opt.tstamp_ok
+		? sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED
+		: sizeof(struct tcphdr);
+}
+
 static inline void __tcp_fast_path_on(struct tcp_sock *tp, u32 snd_wnd)
 {
-	tp->pred_flags = htonl((tp->tcp_header_len << 26) |
+	/* See linux/tcp.h for pred_flags details.
+	 * 'S' (doff) is 32-bit words, convert from bytes: 26 = 28 - 2.
+	 */
+	tp->pred_flags = htonl((__tcp_fast_path_header_length(tp) << 26) |
 			       ntohl(TCP_FLAG_ACK) |
 			       snd_wnd);
 }
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 3fddc69..e8f9e3b 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -152,7 +152,7 @@ static void tcp_measure_rcv_mss(struct sock *sk, const struct sk_buff *skb)
 			 * tcp header plus fixed timestamp option length.
 			 * Resulting "len" is MSS free of SACK jitter.
 			 */
-			len -= tcp_sk(sk)->tcp_header_len;
+			len -= __tcp_fast_path_header_length(tcp_sk(sk));
 			icsk->icsk_ack.last_seg_size = len;
 			if (len == lss) {
 				icsk->icsk_ack.rcv_mss = len;
@@ -5206,7 +5206,7 @@ discard:
  *	tcp_data_queue when everything is OK.
  */
 int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
-			struct tcphdr *th, unsigned len)
+			struct tcphdr *th)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	int res;
@@ -5224,32 +5224,18 @@ int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
 	 *	Our current scheme is not silly either but we take the
 	 *	extra cost of the net_bh soft interrupt processing...
 	 *	We do checksum and copy also but from device to kernel.
+	 *
+	 *	See linux/tcp.h and net/tcp.h for pred_flags details.
 	 */
-
-	tp->rx_opt.saw_tstamp = 0;
-
-	/*	pred_flags is 0xS?10 << 16 + snd_wnd
-	 *	if header_prediction is to be made
-	 *	'S' will always be tp->tcp_header_len >> 2
-	 *	'?' will be 0 for the fast path, otherwise pred_flags is 0 to
-	 *  turn it off	(when there are holes in the receive
-	 *	 space for instance)
-	 *	PSH flag is ignored.
-	 */
-
 	if ((tcp_flag_word(th) & TCP_HP_BITS) == tp->pred_flags &&
 	    TCP_SKB_CB(skb)->seq == tp->rcv_nxt &&
 	    !after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt)) {
-		int tcp_header_len = tp->tcp_header_len;
-
-		/* Timestamp header prediction: tcp_header_len
-		 * is automatically equal to th->doff*4 due to pred_flags
-		 * match.
-		 */
+		int tcp_header_len = tcp_header_len_th(th);
 
-		/* Check timestamp */
-		if (tcp_header_len == sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED) {
-			/* No? Slow path! */
+		/* Timestamp header prediction */
+		if (tcp_header_len != sizeof(*th) + TCPOLEN_TSTAMP_ALIGNED) {
+			tp->rx_opt.saw_tstamp = 0; /* false */
+		} else {
 			if (!tcp_parse_aligned_timestamp(tp, th))
 				goto slow_path;
 
@@ -5264,35 +5250,12 @@ int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
 			 */
 		}
 
-		if (len <= tcp_header_len) {
-			/* Bulk data transfer: sender */
-			if (len == tcp_header_len) {
-				/* Predicted packet is in window by definition.
-				 * seq == rcv_nxt and rcv_wup <= rcv_nxt.
-				 * Hence, check seq<=rcv_wup reduces to:
-				 */
-				if (tcp_header_len ==
-				    (sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED) &&
-				    tp->rcv_nxt == tp->rcv_wup)
-					tcp_store_ts_recent(tp);
-
-				/* We know that such packets are checksummed
-				 * on entry.
-				 */
-				tcp_ack(sk, skb, 0);
-				__kfree_skb(skb);
-				tcp_data_snd_check(sk);
-				return 0;
-			} else { /* Header too small */
-				TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_INERRS);
-				goto discard;
-			}
-		} else {
+		if (tcp_header_len < skb->len) {
 			int eaten = 0;
 			int copied_early = 0;
 
 			if (tp->copied_seq == tp->rcv_nxt &&
-			    len - tcp_header_len <= tp->ucopy.len) {
+			    skb->len - tcp_header_len <= tp->ucopy.len) {
 #ifdef CONFIG_NET_DMA
 				if (tcp_dma_try_early_copy(sk, skb, tcp_header_len)) {
 					copied_early = 1;
@@ -5311,9 +5274,7 @@ int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
 					 * seq == rcv_nxt and rcv_wup <= rcv_nxt.
 					 * Hence, check seq<=rcv_wup reduces to:
 					 */
-					if (tcp_header_len ==
-					    (sizeof(struct tcphdr) +
-					     TCPOLEN_TSTAMP_ALIGNED) &&
+					if (tp->rx_opt.saw_tstamp &&
 					    tp->rcv_nxt == tp->rcv_wup)
 						tcp_store_ts_recent(tp);
 
@@ -5334,8 +5295,7 @@ int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
 				 * seq == rcv_nxt and rcv_wup <= rcv_nxt.
 				 * Hence, check seq<=rcv_wup reduces to:
 				 */
-				if (tcp_header_len ==
-				    (sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED) &&
+				if (tp->rx_opt.saw_tstamp &&
 				    tp->rcv_nxt == tp->rcv_wup)
 					tcp_store_ts_recent(tp);
 
@@ -5376,11 +5336,33 @@ no_ack:
 			else
 				sk->sk_data_ready(sk, 0);
 			return 0;
+		} else {
+			/* Bulk data transfer: sender
+			 *
+			 * tcp_header_len > skb->len never happens,
+			 * already checked by tcp_v[4,6]_rcv()
+			 *
+			 * Predicted packet is in window by definition.
+			 * seq == rcv_nxt and rcv_wup <= rcv_nxt.
+			 * Hence, check seq<=rcv_wup reduces to:
+			 */
+			if (tp->rx_opt.saw_tstamp &&
+			    tp->rcv_nxt == tp->rcv_wup)
+				tcp_store_ts_recent(tp);
+
+			/* We know that such packets are checksummed
+			 * on entry.
+			 */
+			tcp_ack(sk, skb, 0);
+			__kfree_skb(skb);
+			tcp_data_snd_check(sk);
+			return 0;
 		}
 	}
 
 slow_path:
-	if (len < (th->doff << 2) || tcp_checksum_complete_user(sk, skb))
+	/* Assumes header and options unchanged since checksum_init() */
+	if (tcp_checksum_complete_user(sk, skb))
 		goto csum_error;
 
 	/*
@@ -5416,7 +5398,7 @@ discard:
 }
 
 static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
-					 struct tcphdr *th, unsigned len)
+					 struct tcphdr *th)
 {
 	u8 *hash_location;
 	struct inet_connection_sock *icsk = inet_csk(sk);
@@ -5693,7 +5675,7 @@ reset_and_undo:
  */
 
 int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
-			  struct tcphdr *th, unsigned len)
+			  struct tcphdr *th)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct inet_connection_sock *icsk = inet_csk(sk);
@@ -5740,7 +5722,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
 		goto discard;
 
 	case TCP_SYN_SENT:
-		queued = tcp_rcv_synsent_state_process(sk, skb, th, len);
+		queued = tcp_rcv_synsent_state_process(sk, skb, th);
 		if (queued >= 0)
 			return queued;
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 0a76e41..f999e06 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1551,7 +1551,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
 
 	if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */
 		TCP_CHECK_TIMER(sk);
-		if (tcp_rcv_established(sk, skb, tcp_hdr(skb), skb->len)) {
+		if (tcp_rcv_established(sk, skb, tcp_hdr(skb))) {
 			rsk = sk;
 			goto reset;
 		}
@@ -1578,7 +1578,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
 	}
 
 	TCP_CHECK_TIMER(sk);
-	if (tcp_rcv_state_process(sk, skb, tcp_hdr(skb), skb->len)) {
+	if (tcp_rcv_state_process(sk, skb, tcp_hdr(skb))) {
 		rsk = sk;
 		goto reset;
 	}
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index f206ee5..37b7536 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -718,8 +718,7 @@ int tcp_child_process(struct sock *parent, struct sock *child,
 	int state = child->sk_state;
 
 	if (!sock_owned_by_user(child)) {
-		ret = tcp_rcv_state_process(child, skb, tcp_hdr(skb),
-					    skb->len);
+		ret = tcp_rcv_state_process(child, skb, tcp_hdr(skb));
 		/* Wakeup parent, send SIGIO */
 		if (state == TCP_SYN_RECV && child->sk_state != state)
 			parent->sk_data_ready(parent, 0);
diff --git a/net/ipv4/tcp_probe.c b/net/ipv4/tcp_probe.c
index 9bc805d..de2a32e 100644
--- a/net/ipv4/tcp_probe.c
+++ b/net/ipv4/tcp_probe.c
@@ -88,7 +88,7 @@ static inline int tcp_probe_avail(void)
  * Note: arguments must match tcp_rcv_established()!
  */
 static int jtcp_rcv_established(struct sock *sk, struct sk_buff *skb,
-			       struct tcphdr *th, unsigned len)
+			       struct tcphdr *th)
 {
 	const struct tcp_sock *tp = tcp_sk(sk);
 	const struct inet_sock *inet = inet_sk(sk);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index b76939a..3d08a4d 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1586,7 +1586,7 @@ static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
 
 	if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */
 		TCP_CHECK_TIMER(sk);
-		if (tcp_rcv_established(sk, skb, tcp_hdr(skb), skb->len))
+		if (tcp_rcv_established(sk, skb, tcp_hdr(skb)))
 			goto reset;
 		TCP_CHECK_TIMER(sk);
 		if (opt_skb)
@@ -1618,7 +1618,7 @@ static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
 	}
 
 	TCP_CHECK_TIMER(sk);
-	if (tcp_rcv_state_process(sk, skb, tcp_hdr(skb), skb->len))
+	if (tcp_rcv_state_process(sk, skb, tcp_hdr(skb)))
 		goto reset;
 	TCP_CHECK_TIMER(sk);
 	if (opt_skb)
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2010-02-25 20:49 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-02-15 12:23 [PATCH 0/7] tcp: bugs and cleanup updated to 2.6.33-rc8 (again) William Allen Simpson
2010-02-15 12:25 ` [PATCH v3 1/7] net: tcp_header_len_th and tcp_option_len_th William Allen Simpson
2010-02-15 12:27 ` [PATCH v3 2/7] net: remove old tcp_optlen function William Allen Simpson
2010-02-15 12:28 ` [PATCH v5 3/7] tcp: harmonize tcp_vx_rcv header length assumptions William Allen Simpson
2010-02-15 12:48   ` Andi Kleen
2010-02-15 12:59     ` William Allen Simpson
2010-02-15 12:31 ` [PATCH v4 4/7] tcp: input header length, prediction, and timestamp bugs William Allen Simpson
2010-02-15 15:10   ` Andi Kleen
2010-02-15 19:03     ` William Allen Simpson
2010-02-15 19:15       ` Andi Kleen
2010-02-15 21:24         ` William Allen Simpson
2010-02-15 21:24   ` [PATCH v5 " William Allen Simpson
2010-02-15 12:34 ` [PATCH v3 5/7] TCPCT part 2e: accept SYNACK data William Allen Simpson
2010-02-15 12:41 ` [PATCH v5 6/7] TCPCT part 2f: cleanup tcp_parse_options William Allen Simpson
2010-02-15 12:45 ` [PATCH v7 7/7] TCPCT part 2g: parse cookie pair and 64-bit timestamp William Allen Simpson
  -- strict thread matches above, loose matches on Subject: below --
2010-02-25 20:30 [PATCH 0/7] tcp: bugs and cleanup for 2.6.33 William Allen Simpson
2010-02-25 20:49 ` [PATCH v5 4/7] tcp: input header length, prediction, and timestamp bugs William Allen Simpson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).