Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: query: bnx2 and tg3 don't check tcp and/or ip header length validity?
From: David Miller @ 2009-10-15  0:49 UTC (permalink / raw)
  To: william.allen.simpson; +Cc: netdev
In-Reply-To: <4AD66F84.2010703@gmail.com>

This is transmit, and the packets can only come from the Linux
TCP stack, not some external entity.

You're being way too anal here, and adding these checks to
drivers would be just a lot of rediculious bloat.

^ permalink raw reply

* [PATCH -next] vmxnet: fix 2 build problems
From: Randy Dunlap @ 2009-10-15  1:17 UTC (permalink / raw)
  To: Stephen Rothwell, netdev, Shreyas Bhatewara
  Cc: linux-next, LKML, davem, pv-drivers
In-Reply-To: <20091014163445.f0441473.sfr@canb.auug.org.au>

From: Randy Dunlap <randy.dunlap@oracle.com>

vmxnet3 uses in_dev* interfaces so it should depend on INET.
Also fix so that the driver builds when CONFIG_PCI_MSI is disabled.

vmxnet3_drv.c:(.text+0x2a88cb): undefined reference to `in_dev_finish_destroy'

drivers/net/vmxnet3/vmxnet3_drv.c:1335: error: 'struct vmxnet3_intr' has no member named 'msix_entries'
drivers/net/vmxnet3/vmxnet3_drv.c:1384: error: 'struct vmxnet3_intr' has no member named 'msix_entries'
drivers/net/vmxnet3/vmxnet3_drv.c:2137: error: 'struct vmxnet3_intr' has no member named 'msix_entries'
drivers/net/vmxnet3/vmxnet3_drv.c:2138: error: 'struct vmxnet3_intr' has no member named 'msix_entries'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
---
 drivers/net/Kconfig               |    2 +-
 drivers/net/vmxnet3/vmxnet3_drv.c |   11 ++++++++++-
 2 files changed, 11 insertions(+), 2 deletions(-)

--- linux-next-20091014.orig/drivers/net/vmxnet3/vmxnet3_drv.c
+++ linux-next-20091014/drivers/net/vmxnet3/vmxnet3_drv.c
@@ -1314,9 +1314,11 @@ vmxnet3_netpoll(struct net_device *netde
 	struct vmxnet3_adapter *adapter = netdev_priv(netdev);
 	int irq;
 
+#ifdef CONFIG_PCI_MSI
 	if (adapter->intr.type == VMXNET3_IT_MSIX)
 		irq = adapter->intr.msix_entries[0].vector;
 	else
+#endif
 		irq = adapter->pdev->irq;
 
 	disable_irq(irq);
@@ -1330,12 +1332,15 @@ vmxnet3_request_irqs(struct vmxnet3_adap
 {
 	int err;
 
+#ifdef CONFIG_PCI_MSI
 	if (adapter->intr.type == VMXNET3_IT_MSIX) {
 		/* we only use 1 MSI-X vector */
 		err = request_irq(adapter->intr.msix_entries[0].vector,
 				  vmxnet3_intr, 0, adapter->netdev->name,
 				  adapter->netdev);
-	} else if (adapter->intr.type == VMXNET3_IT_MSI) {
+	} else
+#endif
+	if (adapter->intr.type == VMXNET3_IT_MSI) {
 		err = request_irq(adapter->pdev->irq, vmxnet3_intr, 0,
 				  adapter->netdev->name, adapter->netdev);
 	} else {
@@ -1376,6 +1381,7 @@ vmxnet3_free_irqs(struct vmxnet3_adapter
 	       adapter->intr.num_intrs <= 0);
 
 	switch (adapter->intr.type) {
+#ifdef CONFIG_PCI_MSI
 	case VMXNET3_IT_MSIX:
 	{
 		int i;
@@ -1385,6 +1391,7 @@ vmxnet3_free_irqs(struct vmxnet3_adapter
 				 adapter->netdev);
 		break;
 	}
+#endif
 	case VMXNET3_IT_MSI:
 		free_irq(adapter->pdev->irq, adapter->netdev);
 		break;
@@ -2134,6 +2141,7 @@ vmxnet3_alloc_intr_resources(struct vmxn
 	if (adapter->intr.type == VMXNET3_IT_AUTO) {
 		int err;
 
+#ifdef CONFIG_PCI_MSI
 		adapter->intr.msix_entries[0].entry = 0;
 		err = pci_enable_msix(adapter->pdev, adapter->intr.msix_entries,
 				      VMXNET3_LINUX_MAX_MSIX_VECT);
@@ -2142,6 +2150,7 @@ vmxnet3_alloc_intr_resources(struct vmxn
 			adapter->intr.type = VMXNET3_IT_MSIX;
 			return;
 		}
+#endif
 
 		err = pci_enable_msi(adapter->pdev);
 		if (!err) {
--- linux-next-20091014.orig/drivers/net/Kconfig
+++ linux-next-20091014/drivers/net/Kconfig
@@ -3232,7 +3232,7 @@ config VIRTIO_NET
 
 config VMXNET3
        tristate "VMware VMXNET3 ethernet driver"
-       depends on PCI && X86
+       depends on PCI && X86 && INET
        help
          This driver supports VMware's vmxnet3 virtual ethernet NIC.
          To compile this driver as a module, choose M here: the

^ permalink raw reply

* Re: [Pv-drivers] [PATCH -next] vmxnet: fix 2 build problems
From: Bhavesh Davda @ 2009-10-15  2:00 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Stephen Rothwell, netdev, Shreyas Bhatewara,
	pv-drivers@vmware.com, linux-next@vger.kernel.org, LKML,
	davem@davemloft.net
In-Reply-To: <20091014181715.09a82630.randy.dunlap@oracle.com>

Looks great! Thanks for making this change!

Signed-off-by: Bhavesh davda <bhavesh@vmware.com<mailto:bhavesh@vmware.com>>

- Bhavesh

I'm usually not as bad with my spelling as my iPhone makes it appear.


On Oct 14, 2009, at 6:18 PM, "Randy Dunlap" <randy.dunlap@oracle.com<mailto:randy.dunlap@oracle.com>> wrote:

Signed-off-by: Randy Dunlap <<mailto:randy.dunlap@oracle.com>randy.dunlap@oracle.com<mailto:randy.dunlap@oracle.com>>

^ permalink raw reply

* [PATCH net-next-2.6] net: sk_drops consolidation
From: Eric Dumazet @ 2009-10-15  2:47 UTC (permalink / raw)
  To: David S. Miller; +Cc: Linux Netdev List

sock_queue_rcv_skb() can update sk_drops itself, removing need for
callers to take care of it. This is more consistent since
sock_queue_rcv_skb() also reads sk_drops when queueing a skb.

This adds sk_drops managment to many protocols that not cared yet.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/core/sock.c        |   15 +++++++--------
 net/ieee802154/dgram.c |    1 -
 net/ieee802154/raw.c   |    1 -
 net/ipv4/raw.c         |    1 -
 net/ipv4/udp.c         |   19 ++++++++-----------
 net/ipv6/raw.c         |    3 +--
 net/ipv6/udp.c         |    6 ++----
 net/phonet/datagram.c  |    6 ++----
 net/phonet/pep.c       |    2 --
 9 files changed, 20 insertions(+), 34 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 43ca2c9..38713aa 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -274,7 +274,7 @@ static void sock_disable_timestamp(struct sock *sk, int flag)
 
 int sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 {
-	int err = 0;
+	int err;
 	int skb_len;
 	unsigned long flags;
 	struct sk_buff_head *list = &sk->sk_receive_queue;
@@ -284,17 +284,17 @@ int sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 	 */
 	if (atomic_read(&sk->sk_rmem_alloc) + skb->truesize >=
 	    (unsigned)sk->sk_rcvbuf) {
-		err = -ENOMEM;
-		goto out;
+		atomic_inc(&sk->sk_drops);
+		return -ENOMEM;
 	}
 
 	err = sk_filter(sk, skb);
 	if (err)
-		goto out;
+		return err;
 
 	if (!sk_rmem_schedule(sk, skb->truesize)) {
-		err = -ENOBUFS;
-		goto out;
+		atomic_inc(&sk->sk_drops);
+		return -ENOBUFS;
 	}
 
 	skb->dev = NULL;
@@ -314,8 +314,7 @@ int sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 
 	if (!sock_flag(sk, SOCK_DEAD))
 		sk->sk_data_ready(sk, skb_len);
-out:
-	return err;
+	return 0;
 }
 EXPORT_SYMBOL(sock_queue_rcv_skb);
 
diff --git a/net/ieee802154/dgram.c b/net/ieee802154/dgram.c
index 25ad956..9aac5ae 100644
--- a/net/ieee802154/dgram.c
+++ b/net/ieee802154/dgram.c
@@ -318,7 +318,6 @@ out:
 static int dgram_rcv_skb(struct sock *sk, struct sk_buff *skb)
 {
 	if (sock_queue_rcv_skb(sk, skb) < 0) {
-		atomic_inc(&sk->sk_drops);
 		kfree_skb(skb);
 		return NET_RX_DROP;
 	}
diff --git a/net/ieee802154/raw.c b/net/ieee802154/raw.c
index 769c8d1..9c9b85c 100644
--- a/net/ieee802154/raw.c
+++ b/net/ieee802154/raw.c
@@ -206,7 +206,6 @@ out:
 static int raw_rcv_skb(struct sock *sk, struct sk_buff *skb)
 {
 	if (sock_queue_rcv_skb(sk, skb) < 0) {
-		atomic_inc(&sk->sk_drops);
 		kfree_skb(skb);
 		return NET_RX_DROP;
 	}
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index f18172b..39e2a6b 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -292,7 +292,6 @@ static int raw_rcv_skb(struct sock * sk, struct sk_buff * skb)
 	/* Charge it to the socket. */
 
 	if (sock_queue_rcv_skb(sk, skb) < 0) {
-		atomic_inc(&sk->sk_drops);
 		kfree_skb(skb);
 		return NET_RX_DROP;
 	}
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index ee61b3f..45a8a7e 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1063,25 +1063,22 @@ EXPORT_SYMBOL(udp_lib_unhash);
 
 static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 {
-	int is_udplite = IS_UDPLITE(sk);
-	int rc;
+	int rc = sock_queue_rcv_skb(sk, skb);
+
+	if (rc < 0) {
+		int is_udplite = IS_UDPLITE(sk);
 
-	if ((rc = sock_queue_rcv_skb(sk, skb)) < 0) {
 		/* Note that an ENOMEM error is charged twice */
-		if (rc == -ENOMEM) {
+		if (rc == -ENOMEM)
 			UDP_INC_STATS_BH(sock_net(sk), UDP_MIB_RCVBUFERRORS,
 					 is_udplite);
-			atomic_inc(&sk->sk_drops);
-		}
-		goto drop;
+		UDP_INC_STATS_BH(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
+		kfree_skb(skb);
+		return -1;
 	}
 
 	return 0;
 
-drop:
-	UDP_INC_STATS_BH(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
-	kfree_skb(skb);
-	return -1;
 }
 
 /* returns:
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index d8375bc..fd737ef 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -381,8 +381,7 @@ static inline int rawv6_rcv_skb(struct sock * sk, struct sk_buff * skb)
 	}
 
 	/* Charge it to the socket. */
-	if (sock_queue_rcv_skb(sk,skb)<0) {
-		atomic_inc(&sk->sk_drops);
+	if (sock_queue_rcv_skb(sk, skb) < 0) {
 		kfree_skb(skb);
 		return NET_RX_DROP;
 	}
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 1f8e2af..b86425b 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -385,13 +385,11 @@ int udpv6_queue_rcv_skb(struct sock * sk, struct sk_buff *skb)
 			goto drop;
 	}
 
-	if ((rc = sock_queue_rcv_skb(sk,skb)) < 0) {
+	if ((rc = sock_queue_rcv_skb(sk, skb)) < 0) {
 		/* Note that an ENOMEM error is charged twice */
-		if (rc == -ENOMEM) {
+		if (rc == -ENOMEM)
 			UDP6_INC_STATS_BH(sock_net(sk),
 					UDP_MIB_RCVBUFERRORS, is_udplite);
-			atomic_inc(&sk->sk_drops);
-		}
 		goto drop;
 	}
 
diff --git a/net/phonet/datagram.c b/net/phonet/datagram.c
index ef5c75c..67f072e 100644
--- a/net/phonet/datagram.c
+++ b/net/phonet/datagram.c
@@ -159,11 +159,9 @@ out_nofree:
 static int pn_backlog_rcv(struct sock *sk, struct sk_buff *skb)
 {
 	int err = sock_queue_rcv_skb(sk, skb);
-	if (err < 0) {
+
+	if (err < 0)
 		kfree_skb(skb);
-		if (err == -ENOMEM)
-			atomic_inc(&sk->sk_drops);
-	}
 	return err ? NET_RX_DROP : NET_RX_SUCCESS;
 }
 
diff --git a/net/phonet/pep.c b/net/phonet/pep.c
index 5f32d21..cbaa1d6 100644
--- a/net/phonet/pep.c
+++ b/net/phonet/pep.c
@@ -360,8 +360,6 @@ static int pipe_do_rcv(struct sock *sk, struct sk_buff *skb)
 			err = sock_queue_rcv_skb(sk, skb);
 			if (!err)
 				return 0;
-			if (err == -ENOMEM)
-				atomic_inc(&sk->sk_drops);
 			break;
 		}
 

^ permalink raw reply related

* Re: [Pv-drivers] [PATCH -next] vmxnet: fix 2 build problems
From: David Miller @ 2009-10-15  3:39 UTC (permalink / raw)
  To: bhavesh
  Cc: randy.dunlap, sfr, netdev, sbhatewara, pv-drivers, linux-next,
	linux-kernel
In-Reply-To: <06257446-AA46-489A-80D7-D7C70096A744@vmware.com>

From: Bhavesh Davda <bhavesh@vmware.com>
Date: Wed, 14 Oct 2009 19:00:45 -0700

> Looks great! Thanks for making this change!
> 
> Signed-off-by: Bhavesh davda <bhavesh@vmware.com<mailto:bhavesh@vmware.com>>

Applied, thanks Randy.

^ permalink raw reply

* Re: [PATCH net-next-2.6] net: sk_drops consolidation
From: David Miller @ 2009-10-15  3:40 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <4AD68D4C.2080108@gmail.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 15 Oct 2009 04:47:40 +0200

> sock_queue_rcv_skb() can update sk_drops itself, removing need for
> callers to take care of it. This is more consistent since
> sock_queue_rcv_skb() also reads sk_drops when queueing a skb.
> 
> This adds sk_drops managment to many protocols that not cared yet.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Looks great, applied.

^ permalink raw reply

* Re: [net-next-2.6 PATCH 1/2] be2net: Add support for next generation of BladeEngine device
From: Ajit Khaparde @ 2009-10-15  3:49 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20091014.163345.28372235.davem@davemloft.net>

On 14/10/09 16:33 -0700, David Miller wrote:
> From: David Miller <davem@davemloft.net>
> Date: Wed, 14 Oct 2009 15:24:02 -0700 (PDT)
> 
> > From: Ajit Khaparde <ajitk@serverengines.com>
> > Date: Wed, 14 Oct 2009 17:54:09 +0530
> > 
> >> Add new PCI ids to support next generation of BladeEnigne device.
> >> 
> >> Signed-off-by: Ajit Khaparde <ajitk@serverengines.com>
> > 
> > Applied.
> 
> I have to revert your changes, how did this even build for you?
> 
> In file included from drivers/net/benet/be_cmds.c:18:
> drivers/net/benet/be.h:45:1: warning: "BE_DEVICE_ID1" redefined
> drivers/net/benet/be.h:44:1: warning: this is the location of the previous definition
> drivers/net/benet/be.h:48:1: warning: "OC_DEVICE_ID2" redefined
> drivers/net/benet/be.h:47:1: warning: this is the location of the previous definition
> 
> Well, obviously, since in your patch:
> 
>  #define BE_VENDOR_ID 		0x19a2
>  #define BE_DEVICE_ID1		0x211
> +#define BE_DEVICE_ID1		0x221
>  #define OC_DEVICE_ID1		0x700
>  #define OC_DEVICE_ID2		0x701
> +#define OC_DEVICE_ID2		0x710
> 
> drivers/net/benet/be.h:56: error: 'OC_DEVICE_ID3' undeclared (first use in this function)
> drivers/net/benet/be.h:58: error: 'BE_DEVICE_ID2' undeclared (first use in this function)
> 
> The mistake is obvious, and you provably didn't even compile check
> these changes.
> 
> How can I possibly take your patches seriously when it is clear how
> incredibly careless you are being here?

Sorry Dave. My bad. Now I realize what happened.
After I compiled and tried my changes, the setup got rebooted and I picked
the files from a wrong sandbox. And these wrong files got committed to git.
I will be careful next time.

^ permalink raw reply

* [PATCH net-next-2.6] be2net: Add support for next generation of BladeEngine device.
From: Ajit Khaparde @ 2009-10-15  4:57 UTC (permalink / raw)
  To: David Miller, netdev

Add new PCI ids to support next generation of BladeEngine device.

Signed-off-by: Ajit Khaparde <ajitk@serverengines.com>
---
 drivers/net/benet/be.h |   15 +++++++++++++--
 1 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/net/benet/be.h b/drivers/net/benet/be.h
index 4b61a91..ce75631 100644
--- a/drivers/net/benet/be.h
+++ b/drivers/net/benet/be.h
@@ -35,20 +35,31 @@
 #define DRV_VER			"2.101.205"
 #define DRV_NAME		"be2net"
 #define BE_NAME			"ServerEngines BladeEngine2 10Gbps NIC"
+#define BE3_NAME		"ServerEngines BladeEngine3 10Gbps NIC"
 #define OC_NAME			"Emulex OneConnect 10Gbps NIC"
+#define OC_NAME1		"Emulex OneConnect 10Gbps NIC (be3)"
 #define DRV_DESC		BE_NAME "Driver"
 
 #define BE_VENDOR_ID 		0x19a2
 #define BE_DEVICE_ID1		0x211
+#define BE_DEVICE_ID2		0x221
 #define OC_DEVICE_ID1		0x700
 #define OC_DEVICE_ID2		0x701
+#define OC_DEVICE_ID3		0x710
 
 static inline char *nic_name(struct pci_dev *pdev)
 {
-	if (pdev->device == OC_DEVICE_ID1 || pdev->device == OC_DEVICE_ID2)
+	switch (pdev->device) {
+	case OC_DEVICE_ID1:
+	case OC_DEVICE_ID2:
 		return OC_NAME;
-	else
+	case OC_DEVICE_ID3:
+		return OC_NAME1;
+	case BE_DEVICE_ID2:
+		return BE3_NAME;
+	default:
 		return BE_NAME;
+	}
 }
 
 /* Number of bytes of an RX frame that are copied to skb->data */
-- 
1.6.0.4


^ permalink raw reply related

* [net-next-2.6 PATCH 1/4] TCPCT part 1: initial SYN exchange with SYNACK data
From: William Allen Simpson @ 2009-10-15  5:28 UTC (permalink / raw)
  To: Linux Kernel Network Developers

[-- Attachment #1: Type: text/plain, Size: 766 bytes --]

Pass optional function parameters associated with sending SYNACK.
These parameters are not needed after sending SYNACK, and are not
used for retransmission.  Avoids extending struct tcp_request_sock,
and avoids allocating kernel memory.
---
  include/net/request_sock.h      |    3 ++-
  include/net/tcp.h               |    3 ++-
  net/dccp/ipv4.c                 |    5 +++--
  net/dccp/ipv6.c                 |    5 +++--
  net/dccp/minisocks.c            |    2 +-
  net/ipv4/inet_connection_sock.c |    2 +-
  net/ipv4/tcp_ipv4.c             |   12 +++++++-----
  net/ipv4/tcp_minisocks.c        |    2 +-
  net/ipv4/tcp_output.c           |    2 +-
  net/ipv6/tcp_ipv6.c             |   14 +++++++-------
  10 files changed, 28 insertions(+), 22 deletions(-)


[-- Attachment #2: TCPCT+1-1.patch --]
[-- Type: text/plain, Size: 7275 bytes --]

diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index c719084..cdd9e8b 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -33,7 +33,8 @@ struct request_sock_ops {
 	struct kmem_cache	*slab;
 	char		*slab_name;
 	int		(*rtx_syn_ack)(struct sock *sk,
-				       struct request_sock *req);
+				       struct request_sock *req,
+				       void *extend_values);
 	void		(*send_ack)(struct sock *sk, struct sk_buff *skb,
 				    struct request_sock *req);
 	void		(*send_reset)(struct sock *sk,
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 03a49c7..28bcaf7 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -443,7 +443,8 @@ extern int			tcp_connect(struct sock *sk);
 
 extern struct sk_buff *		tcp_make_synack(struct sock *sk,
 						struct dst_entry *dst,
-						struct request_sock *req);
+						struct request_sock *req,
+						void *extend_values);
 
 extern int			tcp_disconnect(struct sock *sk, int flags);
 
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index 7302e14..6fc9ea3 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -473,7 +473,8 @@ static struct dst_entry* dccp_v4_route_skb(struct net *net, struct sock *sk,
 	return &rt->u.dst;
 }
 
-static int dccp_v4_send_response(struct sock *sk, struct request_sock *req)
+static int dccp_v4_send_response(struct sock *sk, struct request_sock *req,
+				 void *extend_unused)
 {
 	int err = -1;
 	struct sk_buff *skb;
@@ -622,7 +623,7 @@ int dccp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 	dreq->dreq_iss	   = dccp_v4_init_sequence(skb);
 	dreq->dreq_service = service;
 
-	if (dccp_v4_send_response(sk, req))
+	if (dccp_v4_send_response(sk, req, NULL))
 		goto drop_and_free;
 
 	inet_csk_reqsk_queue_hash_add(sk, req, DCCP_TIMEOUT_INIT);
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index a2afb55..63fb189 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -241,7 +241,8 @@ out:
 }
 
 
-static int dccp_v6_send_response(struct sock *sk, struct request_sock *req)
+static int dccp_v6_send_response(struct sock *sk, struct request_sock *req,
+				 void *extend_unused)
 {
 	struct inet6_request_sock *ireq6 = inet6_rsk(req);
 	struct ipv6_pinfo *np = inet6_sk(sk);
@@ -468,7 +469,7 @@ static int dccp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 	dreq->dreq_iss	   = dccp_v6_init_sequence(skb);
 	dreq->dreq_service = service;
 
-	if (dccp_v6_send_response(sk, req))
+	if (dccp_v6_send_response(sk, req, NULL))
 		goto drop_and_free;
 
 	inet6_csk_reqsk_queue_hash_add(sk, req, DCCP_TIMEOUT_INIT);
diff --git a/net/dccp/minisocks.c b/net/dccp/minisocks.c
index 5ca49ce..af226a0 100644
--- a/net/dccp/minisocks.c
+++ b/net/dccp/minisocks.c
@@ -184,7 +184,7 @@ struct sock *dccp_check_req(struct sock *sk, struct sk_buff *skb,
 			 * counter (backoff, monitored by dccp_response_timer).
 			 */
 			req->retrans++;
-			req->rsk_ops->rtx_syn_ack(sk, req);
+			req->rsk_ops->rtx_syn_ack(sk, req, NULL);
 		}
 		/* Network Duplicate, discard packet */
 		return NULL;
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 9139e8f..b7314f2 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -504,7 +504,7 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
 			if (time_after_eq(now, req->expires)) {
 				if ((req->retrans < thresh ||
 				     (inet_rsk(req)->acked && req->retrans < max_retries))
-				    && !req->rsk_ops->rtx_syn_ack(parent, req)) {
+				    && !req->rsk_ops->rtx_syn_ack(parent, req, NULL)) {
 					unsigned long timeo;
 
 					if (req->retrans++ == 0)
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 9971870..2d25bd4 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -742,7 +742,7 @@ static void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
  *	socket.
  */
 static int __tcp_v4_send_synack(struct sock *sk, struct request_sock *req,
-				struct dst_entry *dst)
+				struct dst_entry *dst, void *extend_values)
 {
 	const struct inet_request_sock *ireq = inet_rsk(req);
 	int err = -1;
@@ -752,7 +752,7 @@ static int __tcp_v4_send_synack(struct sock *sk, struct request_sock *req,
 	if (!dst && (dst = inet_csk_route_req(sk, req)) == NULL)
 		return -1;
 
-	skb = tcp_make_synack(sk, dst, req);
+	skb = tcp_make_synack(sk, dst, req, extend_values);
 
 	if (skb) {
 		struct tcphdr *th = tcp_hdr(skb);
@@ -773,9 +773,10 @@ static int __tcp_v4_send_synack(struct sock *sk, struct request_sock *req,
 	return err;
 }
 
-static int tcp_v4_send_synack(struct sock *sk, struct request_sock *req)
+static int tcp_v4_send_synack(struct sock *sk, struct request_sock *req,
+			      void *extend_values)
 {
-	return __tcp_v4_send_synack(sk, req, NULL);
+	return __tcp_v4_send_synack(sk, req, NULL, extend_values);
 }
 
 /*
@@ -1333,7 +1334,8 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 	}
 	tcp_rsk(req)->snt_isn = isn;
 
-	if (__tcp_v4_send_synack(sk, req, dst) || want_cookie)
+	if (__tcp_v4_send_synack(sk, req, dst, NULL) ||
+	    want_cookie)
 		goto drop_and_free;
 
 	inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index e320afe..8819882 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -537,7 +537,7 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 		 * Enforce "SYN-ACK" according to figure 8, figure 6
 		 * of RFC793, fixed by RFC1122.
 		 */
-		req->rsk_ops->rtx_syn_ack(sk, req);
+		req->rsk_ops->rtx_syn_ack(sk, req, NULL);
 		return NULL;
 	}
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index fcd278a..765d80f 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2219,7 +2219,7 @@ int tcp_send_synack(struct sock *sk)
 
 /* Prepare a SYN-ACK. */
 struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
-				struct request_sock *req)
+				struct request_sock *req, void *extend_values)
 {
 	struct inet_request_sock *ireq = inet_rsk(req);
 	struct tcp_sock *tp = tcp_sk(sk);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 4517630..3b3d7b3 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -460,7 +460,8 @@ out:
 }
 
 
-static int tcp_v6_send_synack(struct sock *sk, struct request_sock *req)
+static int tcp_v6_send_synack(struct sock *sk, struct request_sock *req,
+			      void *extend_values)
 {
 	struct inet6_request_sock *treq = inet6_rsk(req);
 	struct ipv6_pinfo *np = inet6_sk(sk);
@@ -498,7 +499,7 @@ static int tcp_v6_send_synack(struct sock *sk, struct request_sock *req)
 	if ((err = xfrm_lookup(sock_net(sk), &dst, &fl, sk, 0)) < 0)
 		goto done;
 
-	skb = tcp_make_synack(sk, dst, req);
+	skb = tcp_make_synack(sk, dst, req, extend_values);
 	if (skb) {
 		struct tcphdr *th = tcp_hdr(skb);
 
@@ -1242,13 +1243,12 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 
 	security_inet_conn_request(sk, skb, req);
 
-	if (tcp_v6_send_synack(sk, req))
+	if (tcp_v6_send_synack(sk, req, NULL) ||
+	    want_cookie)
 		goto drop;
 
-	if (!want_cookie) {
-		inet6_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
-		return 0;
-	}
+	inet6_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
+	return 0;
 
 drop:
 	if (req)
-- 
1.6.0.4


^ permalink raw reply related

* [net-next-2.6 PATCH 3/4] TCPCT part 1: initial SYN exchange with SYNACK data
From: William Allen Simpson @ 2009-10-15  5:34 UTC (permalink / raw)
  To: Linux Kernel Network Developers
In-Reply-To: <4AD6B3E8.2050904@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 919 bytes --]

Redefine two TCP header functions to accept TCP header pointer.
When subtracting, return signed int to allow error checking.

In the only two existing files using the latter function, clean up
confusing and inconsistent mixing of both byte and word offsets.
However, remove proposed header length checking, and document the
assumptions instead.  In the immortal words of the reviewer:

   This is transmit, and the packets can only come from the Linux
   TCP stack, not some external entity.

   You're being way too anal here, and adding these checks to
   drivers would be just a lot of rediculious bloat. [sic]

These functions will also be used in subsequent patches that implement
additional features.
---
  drivers/net/bnx2.c  |   22 +++++++++++++---------
  drivers/net/tg3.c   |   32 +++++++++++++++++---------------
  include/linux/tcp.h |   10 ++++++++--
  3 files changed, 38 insertions(+), 26 deletions(-)


[-- Attachment #2: TCPCT+1-3.patch --]
[-- Type: text/plain, Size: 4444 bytes --]

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index 08cddb6..2cb342c 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -6330,18 +6330,17 @@ bnx2_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	}
 #endif
 	if ((mss = skb_shinfo(skb)->gso_size)) {
-		u32 tcp_opt_len;
-		struct iphdr *iph;
+		struct tcphdr *th = tcp_hdr(skb);
+		int tcp_opt_words = th->doff - (sizeof(*th) >> 2);
 
+		/* assumes positive tcp_opt_words without checking */
 		vlan_tag_flags |= TX_BD_FLAGS_SW_LSO;
 
-		tcp_opt_len = tcp_optlen(skb);
-
 		if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV6) {
 			u32 tcp_off = skb_transport_offset(skb) -
 				      sizeof(struct ipv6hdr) - ETH_HLEN;
 
-			vlan_tag_flags |= ((tcp_opt_len >> 2) << 8) |
+			vlan_tag_flags |= (tcp_opt_words << 8) |
 					  TX_BD_FLAGS_SW_FLAGS;
 			if (likely(tcp_off == 0))
 				vlan_tag_flags &= ~TX_BD_FLAGS_TCP6_OFF0_MSK;
@@ -6354,10 +6353,15 @@ bnx2_start_xmit(struct sk_buff *skb, struct net_device *dev)
 				mss |= (tcp_off & 0xc) << TX_BD_TCP6_OFF2_SHL;
 			}
 		} else {
-			iph = ip_hdr(skb);
-			if (tcp_opt_len || (iph->ihl > 5)) {
-				vlan_tag_flags |= ((iph->ihl - 5) +
-						   (tcp_opt_len >> 2)) << 8;
+			struct iphdr *iph = ip_hdr(skb);
+			int ip_opt_words = iph->ihl - (sizeof(*iph) >> 2);
+			int opt_words;
+
+			/* assumes positive ip_opt_words without checking */
+			opt_words = ip_opt_words + tcp_opt_words;
+
+			if (opt_words > 0) {
+				vlan_tag_flags |= opt_words << 8;
 			}
 		}
 	} else
diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
index ba5d3fe..7e89e2a 100644
--- a/drivers/net/tg3.c
+++ b/drivers/net/tg3.c
@@ -5230,7 +5230,8 @@ static netdev_tx_t tg3_start_xmit(struct sk_buff *skb,
 		else {
 			struct iphdr *iph = ip_hdr(skb);
 
-			tcp_opt_len = tcp_optlen(skb);
+			tcp_opt_len = tcp_option_len_th(tcp_hdr(skb));
+			/* assumes positive tcp_opt_len without checking */
 			ip_tcp_len = ip_hdrlen(skb) + sizeof(struct tcphdr);
 
 			iph->check = 0;
@@ -5392,7 +5393,8 @@ static netdev_tx_t tg3_start_xmit_dma_bug(struct sk_buff *skb,
 	mss = 0;
 	if ((mss = skb_shinfo(skb)->gso_size) != 0) {
 		struct iphdr *iph;
-		int tcp_opt_len, ip_tcp_len, hdr_len;
+		int tcp_opt_len, ip_hdr_len, ip_opt_len, ip_tcp_len, hdr_len;
+		int opt_bytes;
 
 		if (skb_header_cloned(skb) &&
 		    pskb_expand_head(skb, 0, 0, GFP_ATOMIC)) {
@@ -5400,10 +5402,12 @@ static netdev_tx_t tg3_start_xmit_dma_bug(struct sk_buff *skb,
 			goto out_unlock;
 		}
 
-		tcp_opt_len = tcp_optlen(skb);
-		ip_tcp_len = ip_hdrlen(skb) + sizeof(struct tcphdr);
-
+		tcp_opt_len = tcp_option_len_th(tcp_hdr(skb));
+		/* assumes positive tcp_opt_len without checking */
+		ip_hdr_len = ip_hdrlen(skb);
+		ip_tcp_len = ip_hdr_len + sizeof(struct tcphdr);
 		hdr_len = ip_tcp_len + tcp_opt_len;
+
 		if (unlikely((ETH_HLEN + hdr_len) > 80) &&
 			     (tp->tg3_flags2 & TG3_FLG2_TSO_BUG))
 			return (tg3_tso_bug(tp, skb));
@@ -5423,20 +5427,18 @@ static netdev_tx_t tg3_start_xmit_dma_bug(struct sk_buff *skb,
 								 IPPROTO_TCP,
 								 0);
 
+		ip_opt_len = ip_hdr_len - sizeof(struct iphdr);
+		/* assumes positive ip_opt_len without checking */
+		opt_bytes = ip_opt_len + tcp_opt_len;
+
 		if ((tp->tg3_flags2 & TG3_FLG2_HW_TSO) ||
 		    (GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5705)) {
-			if (tcp_opt_len || iph->ihl > 5) {
-				int tsflags;
-
-				tsflags = (iph->ihl - 5) + (tcp_opt_len >> 2);
-				mss |= (tsflags << 11);
+			if (opt_bytes > 0) {
+				mss |= (opt_bytes >> 2) << 11;
 			}
 		} else {
-			if (tcp_opt_len || iph->ihl > 5) {
-				int tsflags;
-
-				tsflags = (iph->ihl - 5) + (tcp_opt_len >> 2);
-				base_flags |= tsflags << 12;
+			if (opt_bytes > 0) {
+				base_flags |= (opt_bytes >> 2) << 12;
 			}
 		}
 	}
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 63ab660..d304ba5 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -217,9 +217,15 @@ static inline unsigned int tcp_hdrlen(const struct sk_buff *skb)
 	return tcp_hdr(skb)->doff * 4;
 }
 
-static inline unsigned int tcp_optlen(const struct sk_buff *skb)
+static inline unsigned int tcp_header_len_th(const struct tcphdr *th)
 {
-	return (tcp_hdr(skb)->doff - 5) * 4;
+	return th->doff * 4;
+}
+
+/* When doff is bad, this could be negative. */
+static inline int tcp_option_len_th(const struct tcphdr *th)
+{
+	return (int)tcp_header_len_th(th) - sizeof(*th);
 }
 
 /* This defines a selective acknowledgement block. */
-- 
1.6.0.4


^ permalink raw reply related

* [net-next-2.6 PATCH 4/4] TCPCT part 1: initial SYN exchange with SYNACK data
From: William Allen Simpson @ 2009-10-15  5:36 UTC (permalink / raw)
  To: Linux Kernel Network Developers
In-Reply-To: <4AD6B467.2080701@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 2057 bytes --]

This is a significantly revised implementation of an earlier (year-old)
patch that no longer applies cleanly, with permission of the original
author (Adam Langley).  That patch was previously reviewed:

   http://thread.gmane.org/gmane.linux.network/102586

The principle difference is using a TCP option to carry the cookie nonce,
instead of a user configured offset in the data.  This is more flexible and
less subject to user configuration error.  Such a cookie option has been
suggested for many years, and is also useful without SYN data, allowing
several related concepts to use the same extension option.

   "Re: SYN floods (was: does history repeat itself?)", September 9, 1996.
   http://www.merit.net/mail.archives/nanog/1996-09/msg00235.html

   "Re: what a new TCP header might look like", May 12, 1998.
   ftp://ftp.isi.edu/end2end/end2end-interest-1998.mail

Data structures are carefully composed to require minimal additions.
For example, the struct tcp_options_received cookie_plus variable fits
between existing 16-bit and 8-bit variables, requiring no additional
space (taking alignment into consideration).  There are no additions to
tcp_request_sock, and only 1 pointer and 1 flag byte in tcp_sock.

Allocations have been rearranged to avoid requiring GFP_ATOMIC, with
only one unavoidable exception in tcp_create_openreq_child(), where the
tcp_sock itself is created GFP_ATOMIC.

These functions will also be used in subsequent patches that implement
additional features.
---
  include/linux/tcp.h      |   35 +++++++-
  include/net/tcp.h        |   72 ++++++++++++++--
  net/ipv4/syncookies.c    |    5 +-
  net/ipv4/tcp.c           |  133 +++++++++++++++++++++++++++-
  net/ipv4/tcp_input.c     |   82 +++++++++++++++---
  net/ipv4/tcp_ipv4.c      |   62 +++++++++++--
  net/ipv4/tcp_minisocks.c |   43 +++++++---
  net/ipv4/tcp_output.c    |  223 ++++++++++++++++++++++++++++++++++++++++++---
  net/ipv6/syncookies.c    |    5 +-
  net/ipv6/tcp_ipv6.c      |   47 +++++++++-
  10 files changed, 641 insertions(+), 66 deletions(-)


[-- Attachment #2: TCPCT+1-4.patch --]
[-- Type: text/plain, Size: 39340 bytes --]

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index d304ba5..1c9a1d1 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -252,26 +252,36 @@ struct tcp_options_received {
 		sack_ok : 4,	/* SACK seen on SYN packet		*/
 		snd_wscale : 4,	/* Window scaling received from sender	*/
 		rcv_wscale : 4;	/* Window scaling to send to receiver	*/
-/*	SACKs data	*/
+	u8	cookie_plus: 6;	/* bytes in authenticator/cookie option	*/
 	u8	num_sacks;	/* Number of SACK blocks		*/
-	u16	user_mss;  	/* mss requested by user in ioctl */
+	u16	user_mss;	/* mss requested by user in ioctl	*/
 	u16	mss_clamp;	/* Maximal mss, negotiated at connection setup */
 };
 
+static inline void tcp_clear_options(struct tcp_options_received *rx_opt)
+{
+	rx_opt->tstamp_ok = rx_opt->sack_ok = 0;
+	rx_opt->wscale_ok = rx_opt->snd_wscale = 0;
+	rx_opt->cookie_plus = 0;
+}
+
 /* This is the max number of SACKS that we'll generate and process. It's safe
  * to increse this, although since:
  *   size = TCPOLEN_SACK_BASE_ALIGNED (4) + n * TCPOLEN_SACK_PERBLOCK (8)
  * only four options will fit in a standard TCP header */
 #define TCP_NUM_SACKS 4
 
+struct tcp_cookie_values;
+struct tcp_request_sock_ops;
+
 struct tcp_request_sock {
 	struct inet_request_sock 	req;
 #ifdef CONFIG_TCP_MD5SIG
 	/* Only used by TCP MD5 Signature so far. */
 	const struct tcp_request_sock_ops *af_specific;
 #endif
-	u32			 	rcv_isn;
-	u32			 	snt_isn;
+	u32				rcv_isn;
+	u32				snt_isn;
 };
 
 static inline struct tcp_request_sock *tcp_rsk(const struct request_sock *req)
@@ -441,6 +451,19 @@ struct tcp_sock {
 /* TCP MD5 Signature Option information */
 	struct tcp_md5sig_info	*md5sig_info;
 #endif
+
+	/* When the cookie options are generated and exchanged, then this
+	 * object holds a reference to them (cookie_values->kref).  Also
+	 * contains related tcp_cookie_transactions fields.
+	 */
+	struct tcp_cookie_values  	*cookie_values;
+
+	u8				cookie_in_always:1,
+					cookie_out_never:1,
+					extend_timestamp:1,
+					s_data_constant:1,
+					s_data_in:1,
+					s_data_out:1;
 };
 
 static inline struct tcp_sock *tcp_sk(const struct sock *sk)
@@ -459,6 +482,10 @@ struct tcp_timewait_sock {
 	u16			  tw_md5_keylen;
 	u8			  tw_md5_key[TCP_MD5SIG_MAXKEYLEN];
 #endif
+	/* Few sockets in timewait have cookies; in that case, then this
+	 * object holds a reference to it (tw_cookie_values->kref)
+	 */
+	struct tcp_cookie_values  *tw_cookie_values;
 };
 
 static inline struct tcp_timewait_sock *tcp_twsk(const struct sock *sk)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 63d17fd..a2d2c0f 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -30,6 +30,7 @@
 #include <linux/dmaengine.h>
 #include <linux/crypto.h>
 #include <linux/cryptohash.h>
+#include <linux/kref.h>
 
 #include <net/inet_connection_sock.h>
 #include <net/inet_timewait_sock.h>
@@ -167,6 +168,7 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
 #define TCPOPT_SACK             5       /* SACK Block */
 #define TCPOPT_TIMESTAMP	8	/* Better RTT estimations/PAWS */
 #define TCPOPT_MD5SIG		19	/* MD5 Signature (RFC2385) */
+#define TCPOPT_COOKIE		253	/* Cookie extension (experimental) */
 
 /*
  *     TCP option lengths
@@ -177,6 +179,10 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
 #define TCPOLEN_SACK_PERM      2
 #define TCPOLEN_TIMESTAMP      10
 #define TCPOLEN_MD5SIG         18
+#define TCPOLEN_COOKIE_BASE    2	/* Cookie-less header extension */
+#define TCPOLEN_COOKIE_PAIR    3	/* Cookie pair header extension */
+#define TCPOLEN_COOKIE_MAX     (TCPOLEN_COOKIE_BASE+TCP_COOKIE_MAX)
+#define TCPOLEN_COOKIE_MIN     (TCPOLEN_COOKIE_BASE+TCP_COOKIE_MIN)
 
 /* But this is what stacks really send out. */
 #define TCPOLEN_TSTAMP_ALIGNED		12
@@ -344,11 +350,6 @@ static inline void tcp_dec_quickack_mode(struct sock *sk,
 
 extern void tcp_enter_quickack_mode(struct sock *sk);
 
-static inline void tcp_clear_options(struct tcp_options_received *rx_opt)
-{
- 	rx_opt->tstamp_ok = rx_opt->sack_ok = rx_opt->wscale_ok = rx_opt->snd_wscale = 0;
-}
-
 #define	TCP_ECN_OK		1
 #define	TCP_ECN_QUEUE_CWR	2
 #define	TCP_ECN_DEMAND_CWR	4
@@ -410,7 +411,7 @@ extern int			tcp_recvmsg(struct kiocb *iocb, struct sock *sk,
 
 extern void			tcp_parse_options(struct sk_buff *skb,
 						  struct tcp_options_received *opt_rx,
-						  int estab);
+						  u8 **cryptic, int estab);
 
 extern u8			*tcp_parse_md5sig_option(struct tcphdr *th);
 
@@ -1482,6 +1483,65 @@ struct tcp_request_sock_ops {
 #endif
 };
 
+/**
+ * A tcp_sock contains a pointer to the current value, and this is cloned to
+ * the tcp_timewait_sock.
+ *
+ * @cookie_pair:	variable data from the option exchange.
+ *
+ * @cookie_desired:	user specified tcpct_cookie_desired.  Zero
+ *			indicates default (sysctl_tcp_cookie_size).
+ *			After cookie sent, remembers size of cookie.
+ *
+ * @s_data_desired:	user specified tcpct_s_data_desired.  When the
+ *			constant payload is specified (s_data_constant),
+ *			holds its length instead.
+ *
+ * @s_data_payload:	constant data that is to be included in the
+ *			payload of SYN or SYNACK segments when the
+ *			cookie option is present.
+ */
+struct tcp_cookie_values {
+	struct kref	kref;
+	u8		cookie_pair[TCP_COOKIE_PAIR_SIZE];
+	u8		cookie_pair_size;
+	u8		cookie_desired;
+	u16		s_data_desired;
+	u8		s_data_payload[0];
+};
+
+static inline void tcp_cookie_values_release(struct kref *kref)
+{
+	kfree(container_of(kref, struct tcp_cookie_values, kref));
+}
+
+/* The length of constant payload data.  Note that s_data_desired is
+ * overloaded, depending on s_data_constant: either the length of constant
+ * data (returned here) or the limit on variable data.
+ */
+static inline int tcp_s_data_size(const struct tcp_sock *tp)
+{
+	return (NULL != tp->cookie_values && tp->s_data_constant)
+		? tp->cookie_values->s_data_desired
+		: 0;
+}
+
+/* As tcp_request_sock has already been extended in other places, the
+ * only remaining method is to pass stack values along as function
+ * parameters.  These parameters are not needed after sending SYNACK.
+ */
+struct tcp_extend_values {
+	u8				cookie_bakery[TCP_COOKIE_MAX];
+	u8				cookie_plus;
+	u8				cookie_in_always:1,
+					cookie_out_never:1;
+};
+
+static inline struct tcp_extend_values *tcp_xv(const void *extend_values)
+{
+	return (struct tcp_extend_values *)extend_values;
+}
+
 extern void tcp_v4_init(void);
 extern void tcp_init(void);
 
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 5ec678a..cdab491 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -253,6 +253,8 @@ EXPORT_SYMBOL(cookie_check_timestamp);
 struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
 			     struct ip_options *opt)
 {
+	struct tcp_options_received tcp_opt;
+	u8 *cryptic_value;
 	struct inet_request_sock *ireq;
 	struct tcp_request_sock *treq;
 	struct tcp_sock *tp = tcp_sk(sk);
@@ -263,7 +265,6 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
 	int mss;
 	struct rtable *rt;
 	__u8 rcv_wscale;
-	struct tcp_options_received tcp_opt;
 
 	if (!sysctl_tcp_syncookies || !th->ack)
 		goto out;
@@ -278,7 +279,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
 
 	/* check for timestamp cookie support */
 	memset(&tcp_opt, 0, sizeof(tcp_opt));
-	tcp_parse_options(skb, &tcp_opt, 0);
+	tcp_parse_options(skb, &tcp_opt, &cryptic_value, 0);
 
 	if (tcp_opt.saw_tstamp)
 		cookie_check_timestamp(&tcp_opt);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index cf13726..0b47ffe 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2039,8 +2039,8 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 	int val;
 	int err = 0;
 
-	/* This is a string value all the others are int's */
-	if (optname == TCP_CONGESTION) {
+	/* These are data/string values, all the others are ints */
+	if (TCP_CONGESTION == optname) {
 		char name[TCP_CA_NAME_MAX];
 
 		if (optlen < 1)
@@ -2056,6 +2056,95 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 		err = tcp_set_congestion_control(sk, name);
 		release_sock(sk);
 		return err;
+	} else if (TCP_COOKIE_TRANSACTIONS == optname) {
+		struct tcp_cookie_transactions ctd;
+		struct tcp_cookie_values *cvp = NULL;
+
+		if (sizeof(ctd) > optlen) {
+			return -EINVAL;
+		}
+		if (copy_from_user(&ctd, optval, sizeof(ctd))) {
+			return -EFAULT;
+		}
+		if (sizeof(ctd.tcpct_value) < ctd.tcpct_used) {
+			return -EINVAL;
+		}
+		if (0 == ctd.tcpct_cookie_desired) {
+			/* default to global value */
+		} else if ((0x1 & ctd.tcpct_cookie_desired)
+			|| TCP_COOKIE_MAX < ctd.tcpct_cookie_desired
+			|| TCP_COOKIE_MIN > ctd.tcpct_cookie_desired) {
+			return -EINVAL;
+		}
+
+		if (TCP_COOKIE_OUT_NEVER & ctd.tcpct_flags) {
+			/* Supercedes all other values */
+			lock_sock(sk);
+			if (NULL != tp->cookie_values) {
+				kref_put(&tp->cookie_values->kref,
+					 tcp_cookie_values_release);
+				tp->cookie_values = NULL;
+			}
+			tp->cookie_in_always = 0; /* false */
+			tp->cookie_out_never = 1; /* true */
+			tp->extend_timestamp = 0; /* false */
+			tp->s_data_constant = 0; /* false */
+			tp->s_data_in = 0; /* false */
+			tp->s_data_out = 0; /* false */
+			release_sock(sk);
+			return err;
+		}
+
+		/* Allocate ancillary memory before locking.
+		 */
+		if (0 < ctd.tcpct_used
+		 || (NULL == tp->cookie_values
+		  && (0 < sysctl_tcp_cookie_size
+		   || 0 < ctd.tcpct_cookie_desired
+		   || 0 < ctd.tcpct_s_data_desired))) {
+			cvp = kmalloc(sizeof(*cvp) + ctd.tcpct_used,
+				      GFP_KERNEL);
+			if (NULL == cvp) {
+				return -ENOMEM;
+			}
+		}
+
+		lock_sock(sk);
+		tp->cookie_in_always = (TCP_COOKIE_IN_ALWAYS & ctd.tcpct_flags);
+		tp->cookie_out_never = 0; /* false */
+		tp->extend_timestamp = (TCP_EXTEND_TIMESTAMP & ctd.tcpct_flags);
+		tp->s_data_in = 0; /* false */
+		tp->s_data_out = 0; /* false */
+
+		if (NULL == cvp) {
+			/* No cookies by default. */
+			tp->s_data_constant = 0; /* false */
+		} else if (0 == ctd.tcpct_used) {
+			/* No constant payload data. */
+			cvp->cookie_desired = ctd.tcpct_cookie_desired;
+			cvp->s_data_desired = ctd.tcpct_s_data_desired;
+			tp->cookie_values = cvp;
+			tp->s_data_constant = 0; /* false */
+		} else {
+			/* Changes in values are recorded by a change in
+			 * pointer, ensuring that the cookie will differ,
+			 * without separately hashing each value later.
+			 */
+			if (unlikely(NULL != tp->cookie_values)) {
+				kref_put(&tp->cookie_values->kref,
+					 tcp_cookie_values_release);
+			}
+			kref_init(&cvp->kref);
+			memcpy(cvp->s_data_payload, ctd.tcpct_value,
+			       ctd.tcpct_used);
+			cvp->cookie_desired = ctd.tcpct_cookie_desired;
+			cvp->s_data_desired = ctd.tcpct_used;
+			tp->cookie_values = cvp;
+			tp->s_data_constant = 1; /* true */
+		}
+
+		release_sock(sk);
+		return err;
 	}
 
 	if (optlen < sizeof(int))
@@ -2387,6 +2476,46 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
 		if (copy_to_user(optval, icsk->icsk_ca_ops->name, len))
 			return -EFAULT;
 		return 0;
+
+	case TCP_COOKIE_TRANSACTIONS: {
+		struct tcp_cookie_transactions ctd;
+		struct tcp_cookie_values *cvp = tp->cookie_values;
+
+		if (get_user(len, optlen))
+			return -EFAULT;
+		if (len < sizeof(ctd))
+			return -EINVAL;
+
+		memset(&ctd, 0, sizeof(ctd));
+		ctd.tcpct_flags =
+			  (tp->cookie_in_always ? TCP_COOKIE_IN_ALWAYS : 0)
+			| (tp->cookie_out_never ? TCP_COOKIE_OUT_NEVER : 0)
+			| (tp->extend_timestamp ? TCP_EXTEND_TIMESTAMP : 0)
+			| (tp->s_data_in ? TCP_S_DATA_IN : 0)
+			| (tp->s_data_out ? TCP_S_DATA_OUT : 0);
+
+		if (NULL != cvp) {
+			/* Cookie(s) saved, return as nonce */
+			if (sizeof(ctd.tcpct_value) < cvp->cookie_pair_size) {
+				/* impossible? */
+				return -EINVAL;
+			}
+			memcpy(&ctd.tcpct_value[0], &cvp->cookie_pair[0],
+			       cvp->cookie_pair_size);
+			ctd.tcpct_used = cvp->cookie_pair_size;
+
+			ctd.tcpct_cookie_desired = cvp->cookie_desired;
+			ctd.tcpct_s_data_desired = cvp->s_data_desired;
+		}
+
+		if (copy_to_user(optval, &ctd, sizeof(ctd))) {
+			return -EFAULT;
+		}
+		if (put_user(sizeof(ctd), optlen)) {
+			return -EFAULT;
+		}
+		return 0;
+	}
 	default:
 		return -ENOPROTOOPT;
 	}
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index d86784b..200afa8 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3698,11 +3698,11 @@ old_ack:
  * the fast version below fails.
  */
 void tcp_parse_options(struct sk_buff *skb, struct tcp_options_received *opt_rx,
-		       int estab)
+		       u8 **cryptic, int estab)
 {
 	unsigned char *ptr;
 	struct tcphdr *th = tcp_hdr(skb);
-	int length = (th->doff * 4) - sizeof(struct tcphdr);
+	int length = tcp_option_len_th(th);
 
 	ptr = (unsigned char *)(th + 1);
 	opt_rx->saw_tstamp = 0;
@@ -3782,6 +3782,19 @@ void tcp_parse_options(struct sk_buff *skb, struct tcp_options_received *opt_rx,
 				 */
 				break;
 #endif
+			case TCPOPT_COOKIE:
+				/* This option carries 3 different lengths.
+				 */
+				if (TCPOLEN_COOKIE_MAX >= opsize
+				 && TCPOLEN_COOKIE_MIN <= opsize) {
+					opt_rx->cookie_plus = opsize;
+					*cryptic = ptr;
+				} else if (TCPOLEN_COOKIE_PAIR == opsize) {
+					/* not yet implemented */
+				} else if (TCPOLEN_COOKIE_BASE == opsize) {
+					/* not yet implemented */
+				}
+				break;
 			}
 
 			ptr += opsize-2;
@@ -3810,17 +3823,21 @@ static int tcp_parse_aligned_timestamp(struct tcp_sock *tp, struct tcphdr *th)
  * If it is wrong it falls back on tcp_parse_options().
  */
 static int tcp_fast_parse_options(struct sk_buff *skb, struct tcphdr *th,
-				  struct tcp_sock *tp)
+				  struct tcp_sock *tp, u8 **cryptic)
 {
-	if (th->doff == sizeof(struct tcphdr) >> 2) {
+	/* In the spirit of fast parsing, compare doff directly to shifted
+	 * constant values.  Because equality is used, short doff can be
+	 * ignored here, and checked later.
+	 */
+	if ((sizeof(*th) >> 2) == th->doff) {
 		tp->rx_opt.saw_tstamp = 0;
 		return 0;
 	} else if (tp->rx_opt.tstamp_ok &&
-		   th->doff == (sizeof(struct tcphdr)>>2)+(TCPOLEN_TSTAMP_ALIGNED>>2)) {
+		   ((sizeof(*th)+TCPOLEN_TSTAMP_ALIGNED)>>2) == th->doff) {
 		if (tcp_parse_aligned_timestamp(tp, th))
 			return 1;
 	}
-	tcp_parse_options(skb, &tp->rx_opt, 1);
+	tcp_parse_options(skb, &tp->rx_opt, cryptic, 1);
 	return 1;
 }
 
@@ -3830,7 +3847,7 @@ static int tcp_fast_parse_options(struct sk_buff *skb, struct tcphdr *th,
  */
 u8 *tcp_parse_md5sig_option(struct tcphdr *th)
 {
-	int length = (th->doff << 2) - sizeof (*th);
+	int length = tcp_option_len_th(th);
 	u8 *ptr = (u8*)(th + 1);
 
 	/* If the TCP option is too short, we can short cut */
@@ -5070,10 +5087,11 @@ out:
 static int tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
 			      struct tcphdr *th, int syn_inerr)
 {
+	u8 *cv;
 	struct tcp_sock *tp = tcp_sk(sk);
 
 	/* RFC1323: H1. Apply PAWS check first. */
-	if (tcp_fast_parse_options(skb, th, tp) && tp->rx_opt.saw_tstamp &&
+	if (tcp_fast_parse_options(skb, th, tp, &cv) && tp->rx_opt.saw_tstamp &&
 	    tcp_paws_discard(sk, skb)) {
 		if (!th->rst) {
 			NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSESTABREJECTED);
@@ -5361,11 +5379,14 @@ discard:
 static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 					 struct tcphdr *th, unsigned len)
 {
-	struct tcp_sock *tp = tcp_sk(sk);
+	u8 *cryptic_value;
 	struct inet_connection_sock *icsk = inet_csk(sk);
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct tcp_cookie_values *cvp = tp->cookie_values;
 	int saved_clamp = tp->rx_opt.mss_clamp;
+	int queued = 0;
 
-	tcp_parse_options(skb, &tp->rx_opt, 0);
+	tcp_parse_options(skb, &tp->rx_opt, &cryptic_value, 0);
 
 	if (th->ack) {
 		/* rfc793:
@@ -5462,6 +5483,42 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 		 * Change state from SYN-SENT only after copied_seq
 		 * is initialized. */
 		tp->copied_seq = tp->rcv_nxt;
+
+		if (NULL != cvp
+		 && 0 < cvp->cookie_pair_size
+		 && 0 < tp->rx_opt.cookie_plus) {
+			int cookie_size = tp->rx_opt.cookie_plus - TCPOLEN_COOKIE_BASE;
+			int cookie_pair_size = cvp->cookie_desired + cookie_size;
+
+			/* A cookie extension option was sent and returned.
+			 * Note that each incoming SYNACK replaces the
+			 * Responder cookie.  The initial exchange is most
+			 * fragile, as protection against spoofing relies
+			 * entirely upon the sequence and timestamp (above).
+			 * This replacement strategy allows the correct pair to
+			 * pass through, while any others will be filtered via
+			 * Responder verification later.
+			 */
+			if (sizeof(cvp->cookie_pair) >= cookie_pair_size) {
+				memcpy(&cvp->cookie_pair[cvp->cookie_desired],
+				       cryptic_value, cookie_size);
+				cvp->cookie_pair_size = cookie_pair_size;
+			}
+
+			if (tcp_header_len_th(th) < skb->len) {
+				/* Queue incoming transaction data. */
+				__skb_pull(skb, tcp_header_len_th(th));
+				__skb_queue_tail(&sk->sk_receive_queue, skb);
+				skb_set_owner_r(skb, sk);
+				sk->sk_data_ready(sk, 0);
+				tp->s_data_in = 1; /* true */
+				queued = 1; /* should be amount? */
+				tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
+				tp->rcv_wup = TCP_SKB_CB(skb)->end_seq;
+				tp->copied_seq = TCP_SKB_CB(skb)->seq + 1;
+			}
+		}
+
 		smp_mb();
 		tcp_set_state(sk, TCP_ESTABLISHED);
 
@@ -5513,11 +5570,14 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 						  TCP_DELACK_MAX, TCP_RTO_MAX);
 
 discard:
-			__kfree_skb(skb);
+			if (0 == queued)
+				__kfree_skb(skb);
 			return 0;
 		} else {
 			tcp_send_ack(sk);
 		}
+		if (0 < queued)
+			return 0; /* amount queued? */
 		return -1;
 	}
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 2d25bd4..7d5fd4d 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -217,7 +217,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	if (inet->opt)
 		inet_csk(sk)->icsk_ext_hdr_len = inet->opt->optlen;
 
-	tp->rx_opt.mss_clamp = 536;
+	tp->rx_opt.mss_clamp = TCP_MIN_RCVMSS;
 
 	/* Socket identity is still unknown (sport may be zero).
 	 * However we set state to SYN-SENT and not releasing socket
@@ -1211,9 +1211,12 @@ static struct timewait_sock_ops tcp_timewait_sock_ops = {
 
 int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 {
-	struct inet_request_sock *ireq;
+	struct tcp_extend_values tmp_ext;
 	struct tcp_options_received tmp_opt;
+	u8 *cryptic_value;
+	struct inet_request_sock *ireq;
 	struct request_sock *req;
+	struct tcp_sock *tp = tcp_sk(sk);
 	__be32 saddr = ip_hdr(skb)->saddr;
 	__be32 daddr = ip_hdr(skb)->daddr;
 	__u32 isn = TCP_SKB_CB(skb)->when;
@@ -1258,16 +1261,37 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 #endif
 
 	tcp_clear_options(&tmp_opt);
-	tmp_opt.mss_clamp = 536;
-	tmp_opt.user_mss  = tcp_sk(sk)->rx_opt.user_mss;
+	tmp_opt.mss_clamp = TCP_MIN_RCVMSS;
+	tmp_opt.user_mss  = tp->rx_opt.user_mss;
 
-	tcp_parse_options(skb, &tmp_opt, 0);
+	tcp_parse_options(skb, &tmp_opt, &cryptic_value, 0);
+
+	if (0 < tmp_opt.cookie_plus
+	 && tmp_opt.saw_tstamp
+	 && !tp->cookie_out_never
+	 && (0 < sysctl_tcp_cookie_size
+	  || (NULL != tp->cookie_values
+	   && 0 < tp->cookie_values->cookie_desired))) {
+#ifdef CONFIG_SYN_COOKIES
+		want_cookie = 0;	/* not our kind of cookie */
+#endif
+		tmp_ext.cookie_out_never = 0; /* false */
+		tmp_ext.cookie_plus = tmp_opt.cookie_plus;
+
+		/* secret recipe not yet implemented */
+	} else if (!tp->cookie_in_always) {
+		/* redundant indications, but ensure initialization. */
+		tmp_ext.cookie_out_never = 1; /* true */
+		tmp_ext.cookie_plus = 0;
+	} else {
+		goto drop_and_free;
+	}
+	tmp_ext.cookie_in_always = tp->cookie_in_always;
 
 	if (want_cookie && !tmp_opt.saw_tstamp)
 		tcp_clear_options(&tmp_opt);
 
 	tmp_opt.tstamp_ok = tmp_opt.saw_tstamp;
-
 	tcp_openreq_init(req, &tmp_opt, skb);
 
 	ireq = inet_rsk(req);
@@ -1334,7 +1358,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 	}
 	tcp_rsk(req)->snt_isn = isn;
 
-	if (__tcp_v4_send_synack(sk, req, dst, NULL) ||
+	if (__tcp_v4_send_synack(sk, req, dst, (void *)&tmp_ext) ||
 	    want_cookie)
 		goto drop_and_free;
 
@@ -1812,7 +1836,7 @@ static int tcp_v4_init_sock(struct sock *sk)
 	 */
 	tp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
 	tp->snd_cwnd_clamp = ~0;
-	tp->mss_cache = 536;
+	tp->mss_cache = TCP_MIN_RCVMSS;
 
 	tp->reordering = sysctl_tcp_reordering;
 	icsk->icsk_ca_ops = &tcp_init_congestion_ops;
@@ -1828,6 +1852,19 @@ static int tcp_v4_init_sock(struct sock *sk)
 	tp->af_specific = &tcp_sock_ipv4_specific;
 #endif
 
+	/* TCP Cookie Transactions */
+	if (0 < sysctl_tcp_cookie_size) {
+		/* Default, cookies without s_data. */
+		tp->cookie_values =
+			kzalloc(sizeof(*tp->cookie_values), sk->sk_allocation);
+		if (NULL != tp->cookie_values) {
+			kref_init(&tp->cookie_values->kref);
+		}
+	}
+	/* Presumed zeroed, in order of appearance:
+	 *	cookie_in_always, cookie_out_never, extend_timestamp,
+	 *	s_data_constant, s_data_in, s_data_out
+	 */
 	sk->sk_sndbuf = sysctl_tcp_wmem[1];
 	sk->sk_rcvbuf = sysctl_tcp_rmem[1];
 
@@ -1881,6 +1918,15 @@ void tcp_v4_destroy_sock(struct sock *sk)
 		sk->sk_sndmsg_page = NULL;
 	}
 
+	/*
+	 * If cookie or s_data exists, remove it.
+	 */
+	if (NULL != tp->cookie_values) {
+		kref_put(&tp->cookie_values->kref,
+			 tcp_cookie_values_release);
+		tp->cookie_values = NULL;
+	}
+
 	percpu_counter_dec(&tcp_sockets_allocated);
 }
 
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 8819882..0d33f5c 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -96,13 +96,14 @@ enum tcp_tw_status
 tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
 			   const struct tcphdr *th)
 {
-	struct tcp_timewait_sock *tcptw = tcp_twsk((struct sock *)tw);
 	struct tcp_options_received tmp_opt;
+	u8 *cryptic_value;
+	struct tcp_timewait_sock *tcptw = tcp_twsk((struct sock *)tw);
 	int paws_reject = 0;
 
 	tmp_opt.saw_tstamp = 0;
 	if (th->doff > (sizeof(*th) >> 2) && tcptw->tw_ts_recent_stamp) {
-		tcp_parse_options(skb, &tmp_opt, 0);
+		tcp_parse_options(skb, &tmp_opt, &cryptic_value, 0);
 
 		if (tmp_opt.saw_tstamp) {
 			tmp_opt.ts_recent	= tcptw->tw_ts_recent;
@@ -394,9 +395,12 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
 		/* Now setup tcp_sock */
 		newtp = tcp_sk(newsk);
 		newtp->pred_flags = 0;
-		newtp->rcv_wup = newtp->copied_seq = newtp->rcv_nxt = treq->rcv_isn + 1;
-		newtp->snd_sml = newtp->snd_una = newtp->snd_nxt = treq->snt_isn + 1;
-		newtp->snd_up = treq->snt_isn + 1;
+
+		newtp->rcv_wup = newtp->copied_seq =
+		newtp->rcv_nxt = treq->rcv_isn + 1;
+
+		newtp->snd_sml = newtp->snd_una = newtp->snd_nxt =
+		newtp->snd_up = treq->snt_isn + 1 + tcp_s_data_size(tcp_sk(sk));
 
 		tcp_prequeue_init(newtp);
 
@@ -429,9 +433,24 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
 		tcp_set_ca_state(newsk, TCP_CA_Open);
 		tcp_init_xmit_timers(newsk);
 		skb_queue_head_init(&newtp->out_of_order_queue);
-		newtp->write_seq = treq->snt_isn + 1;
-		newtp->pushed_seq = newtp->write_seq;
+		newtp->write_seq = newtp->pushed_seq =
+			treq->snt_isn + 1 + tcp_s_data_size(tcp_sk(sk));
 
+		/* TCP Cookie Transactions */
+		if (NULL != tcp_sk(sk)->cookie_values) {
+			/* Instead of reusing the original, replace with
+			 * default, cookies without s_data.
+			 */
+			newtp->cookie_values =
+				kzalloc(sizeof(*newtp->cookie_values), GFP_ATOMIC);
+			if (NULL != newtp->cookie_values) {
+				kref_init(&newtp->cookie_values->kref);
+			}
+		}
+		/* Presumed copied, in order of appearance:
+		 *	cookie_in_always, cookie_out_never, extend_timestamp,
+		 *	s_data_constant, s_data_in, s_data_out
+		 */
 		newtp->rx_opt.saw_tstamp = 0;
 
 		newtp->rx_opt.dsack = 0;
@@ -495,15 +514,16 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 			   struct request_sock *req,
 			   struct request_sock **prev)
 {
+	struct tcp_options_received tmp_opt;
+	u8 *cryptic_value;
 	const struct tcphdr *th = tcp_hdr(skb);
 	__be32 flg = tcp_flag_word(th) & (TCP_FLAG_RST|TCP_FLAG_SYN|TCP_FLAG_ACK);
 	int paws_reject = 0;
-	struct tcp_options_received tmp_opt;
 	struct sock *child;
 
 	tmp_opt.saw_tstamp = 0;
-	if (th->doff > (sizeof(struct tcphdr)>>2)) {
-		tcp_parse_options(skb, &tmp_opt, 0);
+	if (th->doff > (sizeof(*th) >> 2)) {
+		tcp_parse_options(skb, &tmp_opt, &cryptic_value, 0);
 
 		if (tmp_opt.saw_tstamp) {
 			tmp_opt.ts_recent = req->ts_recent;
@@ -596,7 +616,8 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 	 * Invalid ACK: reset will be sent by listening socket
 	 */
 	if ((flg & TCP_FLAG_ACK) &&
-	    (TCP_SKB_CB(skb)->ack_seq != tcp_rsk(req)->snt_isn + 1))
+	    (TCP_SKB_CB(skb)->ack_seq != tcp_rsk(req)->snt_isn + 1 +
+					 tcp_s_data_size(tcp_sk(sk))))
 		return sk;
 
 	/* Also, it would be not so bad idea to check rcv_tsecr, which
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index c235196..0a04684 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -370,6 +370,7 @@ static inline int tcp_urg_mode(const struct tcp_sock *tp)
 #define OPTION_TS		(1 << 1)
 #define OPTION_MD5		(1 << 2)
 #define OPTION_WSCALE		(1 << 3)
+#define OPTION_COOKIE_EXTENSION	(1 << 4)
 
 struct tcp_out_options {
 	u8 options;		/* bit field of OPTION_* */
@@ -377,8 +378,35 @@ struct tcp_out_options {
 	u8 num_sack_blocks;	/* number of SACK blocks to include */
 	u16 mss;		/* 0 to disable */
 	__u32 tsval, tsecr;	/* need to include OPTION_TS */
+	u8	*cookie_copy;	/* temporary pointer */
+	u8	cookie_size;	/* bytes in copy */
 };
 
+/* The sysctl int routines are generic, so check consistency here.
+ */
+static u8 tcp_cookie_size_check(u8 desired)
+{
+	if (0 < desired) {
+		/* previously specified */
+		return desired;
+	}
+	if (0 >= sysctl_tcp_cookie_size) {
+		/* no default specified */
+		return 0;
+	}
+	if (TCP_COOKIE_MIN > sysctl_tcp_cookie_size) {
+		return TCP_COOKIE_MIN;
+	}
+	if (TCP_COOKIE_MAX < sysctl_tcp_cookie_size) {
+		return TCP_COOKIE_MAX;
+	}
+	if (0x1 & sysctl_tcp_cookie_size) {
+		/* 8-bit multiple, illegal, fix it */
+		return (u8)(sysctl_tcp_cookie_size + 0x1);
+	}
+	return (u8)sysctl_tcp_cookie_size;
+}
+
 /* Write previously computed TCP options to the packet.
  *
  * Beware: Something in the Internet is very sensitive to the ordering of
@@ -395,11 +423,22 @@ struct tcp_out_options {
 static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
 			      const struct tcp_out_options *opts,
 			      __u8 **md5_hash) {
-	if (unlikely(OPTION_MD5 & opts->options)) {
-		*ptr++ = htonl((TCPOPT_NOP << 24) |
-			       (TCPOPT_NOP << 16) |
-			       (TCPOPT_MD5SIG << 8) |
-			       TCPOLEN_MD5SIG);
+	u8 options = opts->options;	/* mungable copy */
+
+	if (unlikely(OPTION_MD5 & options)) {
+		if (unlikely(OPTION_COOKIE_EXTENSION & options)) {
+			*ptr++ = htonl((TCPOPT_COOKIE << 24) |
+				       (TCPOLEN_COOKIE_BASE << 16) |
+				       (TCPOPT_MD5SIG << 8) |
+				       TCPOLEN_MD5SIG);
+		} else {
+			*ptr++ = htonl((TCPOPT_NOP << 24) |
+				       (TCPOPT_NOP << 16) |
+				       (TCPOPT_MD5SIG << 8) |
+				       TCPOLEN_MD5SIG);
+		}
+		/* larger cookies are incompatible */
+		options &= ~OPTION_COOKIE_EXTENSION;
 		*md5_hash = (__u8 *)ptr;
 		ptr += 4;
 	} else {
@@ -412,12 +451,13 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
 			       opts->mss);
 	}
 
-	if (likely(OPTION_TS & opts->options)) {
-		if (unlikely(OPTION_SACK_ADVERTISE & opts->options)) {
+	if (likely(OPTION_TS & options)) {
+		if (unlikely(OPTION_SACK_ADVERTISE & options)) {
 			*ptr++ = htonl((TCPOPT_SACK_PERM << 24) |
 				       (TCPOLEN_SACK_PERM << 16) |
 				       (TCPOPT_TIMESTAMP << 8) |
 				       TCPOLEN_TIMESTAMP);
+			options &= ~OPTION_SACK_ADVERTISE;
 		} else {
 			*ptr++ = htonl((TCPOPT_NOP << 24) |
 				       (TCPOPT_NOP << 16) |
@@ -428,15 +468,48 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
 		*ptr++ = htonl(opts->tsecr);
 	}
 
-	if (unlikely(OPTION_SACK_ADVERTISE & opts->options &&
-		     !(OPTION_TS & opts->options))) {
+	/* Specification requires after timestamp, so do it now.
+	 */
+	if (unlikely(OPTION_COOKIE_EXTENSION & options)) {
+		u8 *cookie_copy = opts->cookie_copy;
+		u8 cookie_size = opts->cookie_size;
+
+		if (unlikely(0x1 & cookie_size)) {
+			/* 8-bit multiple, illegal, ignore */
+			cookie_size = 0;
+		} else if (likely(0x2 & cookie_size)) {
+			__u8 *p = (__u8 *)ptr;
+
+			/* 16-bit multiple */
+			*p++ = TCPOPT_COOKIE;
+			*p++ = TCPOLEN_COOKIE_BASE + cookie_size;
+			*p++ = *cookie_copy++;
+			*p++ = *cookie_copy++;
+			ptr++;
+			cookie_size -= 2;
+		} else {
+			/* 32-bit multiple */
+			*ptr++ = htonl(((TCPOPT_NOP << 24) |
+				        (TCPOPT_NOP << 16) |
+				        (TCPOPT_COOKIE << 8) |
+				        TCPOLEN_COOKIE_BASE) +
+				       cookie_size);
+		}
+
+		if (0 < cookie_size) {
+			memcpy(ptr, cookie_copy, cookie_size);
+			ptr += (cookie_size >> 2);
+		}
+	}
+
+	if (unlikely(OPTION_SACK_ADVERTISE & options)) {
 		*ptr++ = htonl((TCPOPT_NOP << 24) |
 			       (TCPOPT_NOP << 16) |
 			       (TCPOPT_SACK_PERM << 8) |
 			       TCPOLEN_SACK_PERM);
 	}
 
-	if (unlikely(OPTION_WSCALE & opts->options)) {
+	if (unlikely(OPTION_WSCALE & options)) {
 		*ptr++ = htonl((TCPOPT_NOP << 24) |
 			       (TCPOPT_WINDOW << 16) |
 			       (TCPOLEN_WINDOW << 8) |
@@ -471,11 +544,18 @@ static unsigned tcp_syn_options(struct sock *sk, struct sk_buff *skb,
 				struct tcp_out_options *opts,
 				struct tcp_md5sig_key **md5) {
 	struct tcp_sock *tp = tcp_sk(sk);
+	struct tcp_cookie_values *cvp = tp->cookie_values;
 	unsigned size = 0;
+	u8 cookie_size = (!tp->cookie_out_never && NULL != cvp)
+			 ? tcp_cookie_size_check(cvp->cookie_desired)
+			 : 0;
 
 #ifdef CONFIG_TCP_MD5SIG
 	*md5 = tp->af_specific->md5_lookup(sk, sk);
 	if (*md5) {
+		if (0 < cookie_size) {
+			opts->options |= OPTION_COOKIE_EXTENSION;
+		}
 		opts->options |= OPTION_MD5;
 		size += TCPOLEN_MD5SIG_ALIGNED;
 	}
@@ -512,6 +592,63 @@ static unsigned tcp_syn_options(struct sock *sk, struct sk_buff *skb,
 			size += TCPOLEN_SACKPERM_ALIGNED;
 	}
 
+	/* Having both authentication and cookies for security is redundant,
+	 * and there's certainly not enough room.  Instead, the cookie-less
+	 * variant is proposed above.
+	 *
+	 * Consider the pessimal case with authentication.  The options
+	 * could look like:
+	 *   COOKIE|MD5(20) + MSS(4) + WSCALE(4) + SACK|TS(12) == 40
+	 *
+	 * (Currently, the timestamps && *MD5 test above prevents this.)
+	 *
+	 * Note that timestamps are required by the specification.
+	 *
+	 * Odd numbers of bytes are prohibited by the specification, ensuring
+	 * that the cookie is 16-bit aligned, and the resulting cookie pair is
+	 * 32-bit aligned.
+	 */
+	if (NULL == *md5
+	 && (OPTION_TS & opts->options)
+	 && 0 < cookie_size) {
+		int need = TCPOLEN_COOKIE_BASE + cookie_size;
+		int remaining = MAX_TCP_OPTION_SPACE - size;
+
+		if (0x2 & need) {
+			/* 32-bit multiple */
+			need += 2; /* NOPs */
+
+			if (need > remaining) {
+				/* try shrinking cookie to fit */
+				cookie_size -= 2;
+				need -= 4;
+			}
+		}
+		while (need > remaining && TCP_COOKIE_MIN <= cookie_size) {
+			cookie_size -= 4;
+			need -= 4;
+		}
+		if (TCP_COOKIE_MIN <= cookie_size) {
+			opts->options |= OPTION_COOKIE_EXTENSION;
+			opts->cookie_copy = &cvp->cookie_pair[0];
+			opts->cookie_size = cookie_size;
+
+			/* Remember for future incarnations. */
+			cvp->cookie_desired = cookie_size;
+
+			if (cvp->cookie_desired != cvp->cookie_pair_size) {
+				/* Currently use random bytes as a nonce,
+				 * assuming these are completely unpredictable
+				 * by hostile users of the same system.
+				 */
+				get_random_bytes(opts->cookie_copy,
+						 cookie_size);
+				cvp->cookie_pair_size = cookie_size;
+			}
+
+			size += need;
+		}
+	}
 	return size;
 }
 
@@ -520,14 +657,22 @@ static unsigned tcp_synack_options(struct sock *sk,
 				   struct request_sock *req,
 				   unsigned mss, struct sk_buff *skb,
 				   struct tcp_out_options *opts,
-				   struct tcp_md5sig_key **md5) {
-	unsigned size = 0;
+				   struct tcp_md5sig_key **md5,
+				   struct tcp_extend_values *xvp)
+{
 	struct inet_request_sock *ireq = inet_rsk(req);
+	unsigned size = 0;
+	u8 cookie_plus = (NULL != xvp && !xvp->cookie_out_never)
+			 ? xvp->cookie_plus
+			 : 0;
 	char doing_ts;
 
 #ifdef CONFIG_TCP_MD5SIG
 	*md5 = tcp_rsk(req)->af_specific->md5_lookup(sk, req);
 	if (*md5) {
+		if (0 < cookie_plus) {
+			opts->options |= OPTION_COOKIE_EXTENSION;
+		}
 		opts->options |= OPTION_MD5;
 		size += TCPOLEN_MD5SIG_ALIGNED;
 	}
@@ -561,6 +706,34 @@ static unsigned tcp_synack_options(struct sock *sk,
 			size += TCPOLEN_SACKPERM_ALIGNED;
 	}
 
+	/* Similar rationale to tcp_syn_options() applies here, too.
+	 * If the <SYN> options fit, the same options should fit now!
+	 */
+	if (NULL == *md5
+	 && doing_ts
+	 && 0 < cookie_plus) {
+		int need = cookie_plus; /* has TCPOLEN_COOKIE_BASE */
+		int remaining = MAX_TCP_OPTION_SPACE - size;
+
+		if (0x2 & need) {
+			/* 32-bit multiple */
+			need += 2; /* NOPs */
+		}
+		if (need <= remaining) {
+			opts->options |= OPTION_COOKIE_EXTENSION;
+			opts->cookie_copy = &xvp->cookie_bakery[0];
+			opts->cookie_size = cookie_plus - TCPOLEN_COOKIE_BASE;
+
+			/* secret recipe not yet implemented */
+			get_random_bytes(opts->cookie_copy,
+					 opts->cookie_size);
+
+			size += need;
+		} else {
+			/* There's no error return, so flag it. */
+			xvp->cookie_out_never = 1; /* true */
+		}
+	}
 	return size;
 }
 
@@ -2229,14 +2402,15 @@ int tcp_send_synack(struct sock *sk)
 struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
 				struct request_sock *req, void *extend_values)
 {
+	struct tcp_out_options opts;
+	struct tcp_extend_values *xvp = tcp_xv(extend_values);
 	struct inet_request_sock *ireq = inet_rsk(req);
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct tcphdr *th;
-	int tcp_header_size;
-	struct tcp_out_options opts;
 	struct sk_buff *skb;
 	struct tcp_md5sig_key *md5;
 	__u8 *md5_hash_location;
+	int tcp_header_size;
 	int mss;
 
 	skb = sock_wmalloc(sk, MAX_TCP_HEADER + 15, 1, GFP_ATOMIC);
@@ -2274,7 +2448,7 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
 #endif
 	TCP_SKB_CB(skb)->when = tcp_time_stamp;
 	tcp_header_size = tcp_synack_options(sk, req, mss,
-					     skb, &opts, &md5) +
+					     skb, &opts, &md5, xvp) +
 			  sizeof(struct tcphdr);
 
 	skb_push(skb, tcp_header_size);
@@ -2292,6 +2466,25 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
 	 */
 	tcp_init_nondata_skb(skb, tcp_rsk(req)->snt_isn,
 			     TCPCB_FLAG_SYN | TCPCB_FLAG_ACK);
+
+	/* If cookies are active, and constant data is available, copy it
+	 * directly from the listening socket.
+	 */
+	if (NULL != xvp
+	 && !xvp->cookie_out_never
+	 && 0 < xvp->cookie_plus
+	 && tp->s_data_constant) {
+		const struct tcp_cookie_values *cvp = tp->cookie_values;
+
+		if (NULL != cvp
+		 && 0 < cvp->s_data_desired) {
+			u8 *buf = skb_put(skb, cvp->s_data_desired);
+
+			memcpy(buf, cvp->s_data_payload, cvp->s_data_desired);
+			TCP_SKB_CB(skb)->end_seq += cvp->s_data_desired;
+		}
+	}
+
 	th->seq = htonl(TCP_SKB_CB(skb)->seq);
 	th->ack_seq = htonl(tcp_rsk(req)->rcv_isn + 1);
 
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index cbe55e5..2839349 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -159,6 +159,8 @@ static inline int cookie_check(struct sk_buff *skb, __u32 cookie)
 
 struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 {
+	struct tcp_options_received tcp_opt;
+	u8 *cryptic_value;
 	struct inet_request_sock *ireq;
 	struct inet6_request_sock *ireq6;
 	struct tcp_request_sock *treq;
@@ -171,7 +173,6 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 	int mss;
 	struct dst_entry *dst;
 	__u8 rcv_wscale;
-	struct tcp_options_received tcp_opt;
 
 	if (!sysctl_tcp_syncookies || !th->ack)
 		goto out;
@@ -186,7 +187,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 
 	/* check for timestamp cookie support */
 	memset(&tcp_opt, 0, sizeof(tcp_opt));
-	tcp_parse_options(skb, &tcp_opt, 0);
+	tcp_parse_options(skb, &tcp_opt, &cryptic_value, 0);
 
 	if (tcp_opt.saw_tstamp)
 		cookie_check_timestamp(&tcp_opt);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 3b3d7b3..1320825 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1161,11 +1161,13 @@ static struct sock *tcp_v6_hnd_req(struct sock *sk,struct sk_buff *skb)
  */
 static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 {
+	struct tcp_extend_values tmp_ext;
+	struct tcp_options_received tmp_opt;
+	u8 *cryptic_value;
 	struct inet6_request_sock *treq;
 	struct ipv6_pinfo *np = inet6_sk(sk);
-	struct tcp_options_received tmp_opt;
-	struct tcp_sock *tp = tcp_sk(sk);
 	struct request_sock *req = NULL;
+	struct tcp_sock *tp = tcp_sk(sk);
 	__u32 isn = TCP_SKB_CB(skb)->when;
 #ifdef CONFIG_SYN_COOKIES
 	int want_cookie = 0;
@@ -1205,7 +1207,29 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 	tmp_opt.mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) - sizeof(struct ipv6hdr);
 	tmp_opt.user_mss = tp->rx_opt.user_mss;
 
-	tcp_parse_options(skb, &tmp_opt, 0);
+	tcp_parse_options(skb, &tmp_opt, &cryptic_value, 0);
+
+	if (0 < tmp_opt.cookie_plus
+	 && tmp_opt.saw_tstamp
+	 && !tp->cookie_out_never
+	 && (0 < sysctl_tcp_cookie_size
+	  || (NULL != tp->cookie_values
+	   && 0 < tp->cookie_values->cookie_desired))) {
+#ifdef CONFIG_SYN_COOKIES
+		want_cookie = 0;	/* not our kind of cookie */
+#endif
+		tmp_ext.cookie_out_never = 0; /* false */
+		tmp_ext.cookie_plus = tmp_opt.cookie_plus;
+
+		/* secret recipe not yet implemented */
+	} else if (!tp->cookie_in_always) {
+		/* redundant indications, but ensure initialization. */
+		tmp_ext.cookie_out_never = 1; /* true */
+		tmp_ext.cookie_plus = 0;
+	} else {
+		goto drop;
+	}
+	tmp_ext.cookie_in_always = tp->cookie_in_always;
 
 	if (want_cookie && !tmp_opt.saw_tstamp)
 		tcp_clear_options(&tmp_opt);
@@ -1243,7 +1267,7 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 
 	security_inet_conn_request(sk, skb, req);
 
-	if (tcp_v6_send_synack(sk, req, NULL) ||
+	if (tcp_v6_send_synack(sk, req, (void *)&tmp_ext) ||
 	    want_cookie)
 		goto drop;
 
@@ -1848,7 +1872,7 @@ static int tcp_v6_init_sock(struct sock *sk)
 	 */
 	tp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
 	tp->snd_cwnd_clamp = ~0;
-	tp->mss_cache = 536;
+	tp->mss_cache = TCP_MIN_RCVMSS;
 
 	tp->reordering = sysctl_tcp_reordering;
 
@@ -1864,6 +1888,19 @@ static int tcp_v6_init_sock(struct sock *sk)
 	tp->af_specific = &tcp_sock_ipv6_specific;
 #endif
 
+	/* TCP Cookie Transactions */
+	if (0 < sysctl_tcp_cookie_size) {
+		/* Default, cookies without s_data. */
+		tp->cookie_values =
+			kzalloc(sizeof(*tp->cookie_values), sk->sk_allocation);
+		if (NULL != tp->cookie_values) {
+			kref_init(&tp->cookie_values->kref);
+		}
+	}
+	/* Presumed zeroed, in order of appearance:
+	 *	cookie_in_always, cookie_out_never, extend_timestamp,
+	 *	s_data_constant, s_data_in, s_data_out
+	 */
 	sk->sk_sndbuf = sysctl_tcp_wmem[1];
 	sk->sk_rcvbuf = sysctl_tcp_rmem[1];
 
-- 
1.6.0.4


^ permalink raw reply related

* [net-next-2.6 PATCH 2/4] TCPCT part 1: initial SYN exchange with SYNACK data
From: William Allen Simpson @ 2009-10-15  5:32 UTC (permalink / raw)
  To: Linux Kernel Network Developers
In-Reply-To: <4AD6B31B.3060402@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 874 bytes --]

Define sysctl (tcp_cookie_size) to turn on and off the cookie option
default globally, instead of a compiled configuration option.

Define per socket option (TCP_COOKIE_TRANSACTIONS) for setting constant
data values, retrieving variable cookie values, and other facilities.

This is a straightforward re-implementation of an earlier (year-old)
patch that no longer applies cleanly, with permission of the original
author (Adam Langley).  The patch was previously reviewed:

   http://thread.gmane.org/gmane.linux.network/102586

These functions will also be used in subsequent patches that implement
additional features.
---
  include/linux/tcp.h        |   31 ++++++++++++++++++++++++++++++-
  include/net/tcp.h          |    1 +
  net/ipv4/sysctl_net_ipv4.c |    8 ++++++++
  net/ipv4/tcp_output.c      |    8 ++++++++
  4 files changed, 47 insertions(+), 1 deletions(-)


[-- Attachment #2: TCPCT+1-2.patch --]
[-- Type: text/plain, Size: 3570 bytes --]

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 61723a7..63ab660 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -96,6 +96,7 @@ enum {
 #define TCP_QUICKACK		12	/* Block/reenable quick acks */
 #define TCP_CONGESTION		13	/* Congestion control algorithm */
 #define TCP_MD5SIG		14	/* TCP MD5 Signature (RFC2385) */
+#define TCP_COOKIE_TRANSACTIONS	15	/* TCP Cookie Transactions */
 
 #define TCPI_OPT_TIMESTAMPS	1
 #define TCPI_OPT_SACK		2
@@ -170,6 +171,34 @@ struct tcp_md5sig {
 	__u8	tcpm_key[TCP_MD5SIG_MAXKEYLEN];		/* key (binary) */
 };
 
+/* for TCP_COOKIE_TRANSACTIONS (TCPCT) socket option */
+#define TCP_COOKIE_MAX		16		/* 128-bits */
+#define TCP_COOKIE_MIN		 8		/*  64-bits */
+#define TCP_COOKIE_PAIR_SIZE	(2*TCP_COOKIE_MAX)
+
+#define TCP_S_DATA_MAX		64U		/* after TCP+IP options */
+#define TCP_S_DATA_MSS_DEFAULT	536U		/* default MSS (RFC1122) */
+
+/* Flags for both getsockopt and setsockopt */
+#define TCP_COOKIE_IN_ALWAYS	(1 << 0)	/* Discard SYN without cookie */
+#define TCP_COOKIE_OUT_NEVER	(1 << 1)	/* Prohibit outgoing cookies,
+						 * supercedes everything else. */
+#define TCP_EXTEND_TIMESTAMP	(1 << 4)	/* Initiate 64-bit timestamps */
+
+/* Flags for getsockopt */
+#define TCP_S_DATA_IN		(1 << 2)	/* Was data received? */
+#define TCP_S_DATA_OUT		(1 << 3)	/* Was data sent? */
+
+/* TCP_COOKIE_TRANSACTIONS data */
+struct tcp_cookie_transactions {
+	__u16	tcpct_flags;			/* see above */
+	__u8	__tcpct_pad1;			/* zero */
+	__u8	tcpct_cookie_desired;		/* bytes */
+	__u16	tcpct_s_data_desired;		/* bytes of variable data */
+	__u16	tcpct_used;			/* bytes in value */
+	__u8	tcpct_value[TCP_S_DATA_MSS_DEFAULT];
+};
+
 #ifdef __KERNEL__
 
 #include <linux/skbuff.h>
@@ -431,6 +460,6 @@ static inline struct tcp_timewait_sock *tcp_twsk(const struct sock *sk)
 	return (struct tcp_timewait_sock *)sk;
 }
 
-#endif
+#endif	/* __KERNEL__ */
 
 #endif	/* _LINUX_TCP_H */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 28bcaf7..63d17fd 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -237,6 +237,7 @@ extern int sysctl_tcp_base_mss;
 extern int sysctl_tcp_workaround_signed_windows;
 extern int sysctl_tcp_slow_start_after_idle;
 extern int sysctl_tcp_max_ssthresh;
+extern int sysctl_tcp_cookie_size;
 
 extern atomic_t tcp_memory_allocated;
 extern struct percpu_counter tcp_sockets_allocated;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 2dcf04d..3422c54 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -714,6 +714,14 @@ static struct ctl_table ipv4_table[] = {
 	},
 	{
 		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "tcp_cookie_size",
+		.data		= &sysctl_tcp_cookie_size,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
 		.procname	= "udp_mem",
 		.data		= &sysctl_udp_mem,
 		.maxlen		= sizeof(sysctl_udp_mem),
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 765d80f..c235196 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -59,6 +59,14 @@ int sysctl_tcp_base_mss __read_mostly = 512;
 /* By default, RFC2861 behavior.  */
 int sysctl_tcp_slow_start_after_idle __read_mostly = 1;
 
+#ifdef CONFIG_SYSCTL
+/* By default, let the user enable it. */
+int sysctl_tcp_cookie_size __read_mostly = 0;
+#else
+int sysctl_tcp_cookie_size __read_mostly = TCP_COOKIE_MAX;
+#endif
+
+
 /* Account for new data that has been sent to the network. */
 static void tcp_event_new_data_sent(struct sock *sk, struct sk_buff *skb)
 {
-- 
1.6.0.4


^ permalink raw reply related

* [PATCH] cxgb3: No need to wake queue in xmit handler
From: Krishna Kumar @ 2009-10-15  5:54 UTC (permalink / raw)
  To: davem; +Cc: netdev, divy, Krishna Kumar

The xmit handler doesn't need to wake the queue after stopping
it temporarily (some other drivers are doing the same).

Patch on net-next-2.6, multiple netperf sessions tested.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---

 drivers/net/cxgb3/sge.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff -ruNp org/drivers/net/cxgb3/sge.c new/drivers/net/cxgb3/sge.c
--- org/drivers/net/cxgb3/sge.c	2009-09-07 10:09:03.000000000 +0530
+++ new/drivers/net/cxgb3/sge.c	2009-09-07 10:11:11.000000000 +0530
@@ -1260,7 +1260,7 @@ netdev_tx_t t3_eth_xmit(struct sk_buff *
 		if (should_restart_tx(q) &&
 		    test_and_clear_bit(TXQ_ETH, &qs->txq_stopped)) {
 			q->restarts++;
-			netif_tx_wake_queue(txq);
+			netif_tx_start_queue(txq);
 		}
 	}
 

^ permalink raw reply

* [PATCH] genetlink: Optimize genl_register_family()
From: Krishna Kumar @ 2009-10-15  5:54 UTC (permalink / raw)
  To: davem; +Cc: netdev, Krishna Kumar

From: Krishna Kumar <krkumar2@in.ibm.com>

genl_register_family() doesn't need to call genl_family_find_byid
when GENL_ID_GENERATE is passed during register.

Patch on net-next-2.6, compile and reboot testing only.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
diff -ruNp org/net/netlink/genetlink.c new/net/netlink/genetlink.c
--- org/net/netlink/genetlink.c	2009-10-06 08:34:10.000000000 +0530
+++ new/net/netlink/genetlink.c	2009-10-12 13:57:38.000000000 +0530
@@ -374,11 +374,6 @@ int genl_register_family(struct genl_fam
 		goto errout_locked;
 	}
 
-	if (genl_family_find_byid(family->id)) {
-		err = -EEXIST;
-		goto errout_locked;
-	}
-
 	if (family->id == GENL_ID_GENERATE) {
 		u16 newid = genl_generate_id();
 
@@ -388,6 +383,9 @@ int genl_register_family(struct genl_fam
 		}
 
 		family->id = newid;
+	} else if (genl_family_find_byid(family->id)) {
+		err = -EEXIST;
+		goto errout_locked;
 	}
 
 	if (family->maxattr) {

^ permalink raw reply

* [PATCH] genetlink: Optimize and one bug fix in genl_generate_id()
From: Krishna Kumar @ 2009-10-15  5:55 UTC (permalink / raw)
  To: davem; +Cc: netdev, Krishna Kumar
In-Reply-To: <20091015055453.30128.12160.sendpatchset@localhost.localdomain>

From: Krishna Kumar <krkumar2@in.ibm.com>

1. GENL_MIN_ID is a valid id -> no need to start at
   GENL_MIN_ID + 1.
2. Avoid going through the ids two times: If we start at
   GENL_MIN_ID+1 (*or bigger*) and all ids are over!, the
   code iterates through the list twice (*or lesser*).
3. Simplify code - no need to start at idx=0 which gets
   reset to GENL_MIN_ID.

Patch on net-next-2.6. Reboot test shows that first id
passed to genl_register_family was 16, next two were
GENL_ID_GENERATE and genl_generate_id returned 17 & 18
(user level testing of same code shows expected values
across entire range of MIN/MAX).

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
diff -ruNp org/net/netlink/genetlink.c new/net/netlink/genetlink.c
--- org/net/netlink/genetlink.c	2009-10-12 13:57:38.000000000 +0530
+++ new/net/netlink/genetlink.c	2009-10-12 13:57:59.000000000 +0530
@@ -97,25 +97,17 @@ static struct genl_ops *genl_get_cmd(u8 
 */
 static inline u16 genl_generate_id(void)
 {
-	static u16 id_gen_idx;
-	int overflowed = 0;
+	static u16 id_gen_idx = GENL_MIN_ID;
+	int i;

-	do {
-		if (id_gen_idx == 0)
+	for (i = 0; i <= GENL_MAX_ID - GENL_MIN_ID; i++) {
+		if (!genl_family_find_byid(id_gen_idx))
+			return id_gen_idx;
+		if (++id_gen_idx > GENL_MAX_ID)
 			id_gen_idx = GENL_MIN_ID;
+	}

-		if (++id_gen_idx > GENL_MAX_ID) {
-			if (!overflowed) {
-				overflowed = 1;
-				id_gen_idx = 0;
-				continue;
-			} else
-				return 0;
-		}
-
-	} while (genl_family_find_byid(id_gen_idx));
-
-	return id_gen_idx;
+	return 0;
 }

 static struct genl_multicast_group notify_grp;

^ permalink raw reply

* [RFC] [PATCH 0/5] net: Implement fast TX queue selection
From: Krishna Kumar @ 2009-10-15  5:56 UTC (permalink / raw)
  To: davem; +Cc: netdev, herbert, Krishna Kumar, dada1

From: Krishna Kumar <krkumar2@in.ibm.com>

Multiqueue cards on routers/firewalls set skb->queue_mapping on
input which helps in faster xmit. Implement fast queue selection
for locally generated packets also, by saving the txq# for
connected sockets (in dev_pick_tx) and use it in subsequent
iterations. Locally generated packets for a connection will xmit
on the same txq, but routing & firewall loads should not be
affected by this patch. Tests shows the distribution across txq's
for 1-4 netperf sessions is similar to existing code.


                   Testing & results:
                   ------------------

1. Cycles/Iter (C/I) used by dev_pick_tx:
         (B -> Billion,   M -> Million)
   |--------------|------------------------|------------------------|
   |              |          ORG           |          NEW           |
   |  Test        |--------|---------|-----|--------|---------|-----|
   |              | Cycles |  Iters  | C/I | Cycles | Iters   | C/I |
   |--------------|--------|---------|-----|--------|---------|-----|
   | [TCP_STREAM, | 3.98 B | 12.47 M | 320 | 1.95 B | 12.92 M | 152 |
   |  UDP_STREAM, |        |         |     |        |         |     |         
   |  TCP_RR,     |        |         |     |        |         |     |        
   |  UDP_RR]     |        |         |     |        |         |     |        
   |--------------|--------|---------|-----|--------|---------|-----|        
   | [TCP_STREAM, | 8.92 B | 29.66 M | 300 | 3.82 B | 38.88 M | 98  |        
   |  TCP_RR,     |        |         |     |        |         |     |         
   |  UDP_RR]     |        |         |     |        |         |     |         
   |--------------|--------|---------|-----|--------|---------|-----|

2. Stress test (over 48 hours) : 1000 netperfs running combination
   of TCP_STREAM/RR, UDP_STREAM/RR (v4/6, NODELAY/~NODELAY for all
   tests), with some ssh sessions, reboots, modprobe -r driver, etc.

3. Performance test (10 hours): Single 10 hour netperf run of
   TCP_STREAM/RR, TCP_STREAM + NO_DELAY and UDP_RR. Results show an
   improvement in both performance and cpu utilization.

Tested on a 4-processor AMD Opteron 2.8 GHz system with 1GB memory,
10G Chelsio card. Each BW number is the sum of 3 iterations of
individual tests using 512, 16K, 64K & 128K I/O sizes, in Mb/s:

------------------------  TCP Tests  -----------------------
#procs  Org BW     New BW (%)     Org SD     New SD (%)
------------------------------------------------------------
1       77777.7    81011.0 (4.15)    42.3     40.2 (-5.11)
4       91599.2    91878.8 (.30)    955.9    919.3 (-3.83)
6       89533.3    91792.2 (2.52)  2262.0   2143.0 (-5.25)
8       87507.5    89161.9 (1.89)  4363.4   4073.6 (-6.64)
10      85152.4    85607.8 (.53)   6890.4   6851.2 (-.56)
------------------------------------------------------------

------------------------- TCP NO_DELAY Tests ---------------
#procs  Org BW     New BW (%)      Org SD      New SD (%)
------------------------------------------------------------
1       57001.9    57888.0 (1.55)     67.7      70.2 (3.75)
4       69555.1    69957.4 (.57)     823.0     834.3 (1.36)
6       71359.3    71918.7 (.78)    1740.8    1724.5 (-.93)
8       72577.6    72496.1 (-.11)   2955.4    2937.7 (-.59)
10      70829.6    71444.2 (.86)    4826.1    4673.4 (-3.16)
------------------------------------------------------------

----------------------- Request Response Tests --------------------
#procs  Org TPS     New TPS (%)      Org SD    New SD (%)
(1-10)
-------------------------------------------------------------------
TCP     1019245.9   1042626.4 (2.29) 16352.9   16459.8 (.65)
UDP     934598.64   942956.9  (.89)  11607.3   11593.2 (-.12)
-------------------------------------------------------------------

Thanks,

- KK

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---

^ permalink raw reply

* [RFC] [PATCH 1/5] net: Introduce sk_tx_queue_mapping
From: Krishna Kumar @ 2009-10-15  5:56 UTC (permalink / raw)
  To: davem; +Cc: netdev, herbert, Krishna Kumar, dada1
In-Reply-To: <20091015055602.30145.65852.sendpatchset@localhost.localdomain>

From: Krishna Kumar <krkumar2@in.ibm.com>

Introduce sk_tx_queue_mapping; and functions that set, test and get
this value. Reset sk_tx_queue_mapping to -1 whenever the dst cache
is set/reset, and in socket alloc & free (free probably doesn't need
it).

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 include/net/sock.h |   21 +++++++++++++++++++++
 net/core/sock.c    |    7 ++++++-
 2 files changed, 27 insertions(+), 1 deletion(-)

diff -ruNp org/include/net/sock.h new/include/net/sock.h
--- org/include/net/sock.h	2009-10-14 10:36:52.000000000 +0530
+++ new/include/net/sock.h	2009-10-14 17:59:44.000000000 +0530
@@ -107,6 +107,7 @@ struct net;
  *	@skc_node: main hash linkage for various protocol lookup tables
  *	@skc_nulls_node: main hash linkage for UDP/UDP-Lite protocol
  *	@skc_refcnt: reference count
+ *	@skc_tx_queue_mapping: tx queue number for this connection
  *	@skc_hash: hash value used with various protocol lookup tables
  *	@skc_family: network address family
  *	@skc_state: Connection state
@@ -128,6 +129,7 @@ struct sock_common {
 		struct hlist_nulls_node skc_nulls_node;
 	};
 	atomic_t		skc_refcnt;
+	int			skc_tx_queue_mapping;
 
 	unsigned int		skc_hash;
 	unsigned short		skc_family;
@@ -215,6 +217,7 @@ struct sock {
 #define sk_node			__sk_common.skc_node
 #define sk_nulls_node		__sk_common.skc_nulls_node
 #define sk_refcnt		__sk_common.skc_refcnt
+#define sk_tx_queue_mapping	__sk_common.skc_tx_queue_mapping
 
 #define sk_copy_start		__sk_common.skc_hash
 #define sk_hash			__sk_common.skc_hash
@@ -1094,8 +1097,24 @@ static inline void sock_put(struct sock 
 extern int sk_receive_skb(struct sock *sk, struct sk_buff *skb,
 			  const int nested);
 
+static inline void sk_record_tx_queue(struct sock *sk, int tx_queue)
+{
+        sk->sk_tx_queue_mapping = tx_queue;
+}
+
+static inline int sk_get_tx_queue(const struct sock *sk)
+{
+        return sk->sk_tx_queue_mapping;
+}
+
+static inline bool sk_tx_queue_recorded(const struct sock *sk)
+{
+        return (sk && sk->sk_tx_queue_mapping >= 0);
+}
+
 static inline void sk_set_socket(struct sock *sk, struct socket *sock)
 {
+	sk_record_tx_queue(sk, -1);
 	sk->sk_socket = sock;
 }
 
@@ -1152,6 +1171,7 @@ __sk_dst_set(struct sock *sk, struct dst
 {
 	struct dst_entry *old_dst;
 
+	sk_record_tx_queue(sk, -1);
 	old_dst = sk->sk_dst_cache;
 	sk->sk_dst_cache = dst;
 	dst_release(old_dst);
@@ -1170,6 +1190,7 @@ __sk_dst_reset(struct sock *sk)
 {
 	struct dst_entry *old_dst;
 
+	sk_record_tx_queue(sk, -1);
 	old_dst = sk->sk_dst_cache;
 	sk->sk_dst_cache = NULL;
 	dst_release(old_dst);
diff -ruNp org/net/core/sock.c new/net/core/sock.c
--- org/net/core/sock.c	2009-10-14 10:36:52.000000000 +0530
+++ new/net/core/sock.c	2009-10-14 17:59:46.000000000 +0530
@@ -358,6 +358,7 @@ struct dst_entry *__sk_dst_check(struct 
 	struct dst_entry *dst = sk->sk_dst_cache;
 
 	if (dst && dst->obsolete && dst->ops->check(dst, cookie) == NULL) {
+		sk_record_tx_queue(sk, -1);
 		sk->sk_dst_cache = NULL;
 		dst_release(dst);
 		return NULL;
@@ -954,7 +955,8 @@ static void sock_copy(struct sock *nsk, 
 	void *sptr = nsk->sk_security;
 #endif
 	BUILD_BUG_ON(offsetof(struct sock, sk_copy_start) !=
-		     sizeof(osk->sk_node) + sizeof(osk->sk_refcnt));
+		     sizeof(osk->sk_node) + sizeof(osk->sk_refcnt) +
+		     sizeof(osk->sk_tx_queue_mapping));
 	memcpy(&nsk->sk_copy_start, &osk->sk_copy_start,
 	       osk->sk_prot->obj_size - offsetof(struct sock, sk_copy_start));
 #ifdef CONFIG_SECURITY_NETWORK
@@ -998,6 +1000,7 @@ static struct sock *sk_prot_alloc(struct
 
 		if (!try_module_get(prot->owner))
 			goto out_free_sec;
+		sk_record_tx_queue(sk, -1);
 	}
 
 	return sk;
@@ -1017,6 +1020,8 @@ static void sk_prot_free(struct proto *p
 	struct kmem_cache *slab;
 	struct module *owner;
 
+	sk_record_tx_queue(sk, -1);
+
 	owner = prot->owner;
 	slab = prot->slab;
 

^ permalink raw reply

* [RFC] [PATCH 2/5] net: Use sk_tx_queue_mapping for connected sockets
From: Krishna Kumar @ 2009-10-15  5:57 UTC (permalink / raw)
  To: davem; +Cc: netdev, herbert, Krishna Kumar, dada1
In-Reply-To: <20091015055602.30145.65852.sendpatchset@localhost.localdomain>

From: Krishna Kumar <krkumar2@in.ibm.com>

For connected sockets, the first run of dev_pick_tx saves the
calculated txq in sk_tx_queue_mapping. This is not saved if
either skb rx was recorded, or if the device has a queue select
handler. Next iterations of dev_pick_tx uses the cached value of
sk_tx_queue_mapping.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 net/core/dev.c |   24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

diff -ruNp org/net/core/dev.c new/net/core/dev.c
--- org/net/core/dev.c	2009-10-14 17:59:40.000000000 +0530
+++ new/net/core/dev.c	2009-10-14 18:00:04.000000000 +0530
@@ -1791,13 +1791,25 @@ EXPORT_SYMBOL(skb_tx_hash);
 static struct netdev_queue *dev_pick_tx(struct net_device *dev,
 					struct sk_buff *skb)
 {
-	const struct net_device_ops *ops = dev->netdev_ops;
-	u16 queue_index = 0;
+	u16 queue_index;
+	struct sock *sk = skb->sk;
+
+	if (sk_tx_queue_recorded(sk)) {
+		queue_index = sk_get_tx_queue(sk);
+	} else {
+		const struct net_device_ops *ops = dev->netdev_ops;
 
-	if (ops->ndo_select_queue)
-		queue_index = ops->ndo_select_queue(dev, skb);
-	else if (dev->real_num_tx_queues > 1)
-		queue_index = skb_tx_hash(dev, skb);
+		if (ops->ndo_select_queue) {
+			queue_index = ops->ndo_select_queue(dev, skb);
+		} else {
+			queue_index = 0;
+			if (dev->real_num_tx_queues > 1)
+				queue_index = skb_tx_hash(dev, skb);
+
+			if (sk && sk->sk_dst_cache)
+				sk_record_tx_queue(sk, queue_index);
+		}
+	}
 
 	skb_set_queue_mapping(skb, queue_index);
 	return netdev_get_tx_queue(dev, queue_index);

^ permalink raw reply

* [RFC] [PATCH 3/5] net: IPv6 changes
From: Krishna Kumar @ 2009-10-15  5:57 UTC (permalink / raw)
  To: davem; +Cc: netdev, herbert, Krishna Kumar, dada1
In-Reply-To: <20091015055602.30145.65852.sendpatchset@localhost.localdomain>

From: Krishna Kumar <krkumar2@in.ibm.com>

IPv6: Reset sk_tx_queue_mapping when dst_cache is reset.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 net/ipv6/inet6_connection_sock.c |    1 +
 1 file changed, 1 insertion(+)

diff -ruNp org/net/ipv6/inet6_connection_sock.c new/net/ipv6/inet6_connection_sock.c
--- org/net/ipv6/inet6_connection_sock.c	2009-10-14 17:59:58.000000000 +0530
+++ new/net/ipv6/inet6_connection_sock.c	2009-10-14 18:00:11.000000000 +0530
@@ -168,6 +168,7 @@ struct dst_entry *__inet6_csk_dst_check(
 	if (dst) {
 		struct rt6_info *rt = (struct rt6_info *)dst;
 		if (rt->rt6i_flow_cache_genid != atomic_read(&flow_cache_genid)) {
+			sk_record_tx_queue(sk, -1);
 			sk->sk_dst_cache = NULL;
 			dst_release(dst);
 			dst = NULL;

^ permalink raw reply

* [RFC] [PATCH 4/5] net: Fix for dst_negative_advice
From: Krishna Kumar @ 2009-10-15  5:57 UTC (permalink / raw)
  To: davem; +Cc: netdev, herbert, Krishna Kumar, dada1
In-Reply-To: <20091015055602.30145.65852.sendpatchset@localhost.localdomain>

From: Krishna Kumar <krkumar2@in.ibm.com>

dst_negative_advice() should check for changed dst and reset
sk_tx_queue_mapping accordingly. Pass sock to the callers of
dst_negative_advice.

(sk_reset_txq is defined just for use by dst_negative_advice. The
only way I could find to get around this is to move dst_negative_()
from dst.h to dst.c, include sock.h in dst.c, etc)

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 include/net/dst.h      |   12 ++++++++++--
 net/core/sock.c        |    6 ++++++
 net/dccp/timer.c       |    4 ++--
 net/decnet/af_decnet.c |    2 +-
 net/ipv4/tcp_timer.c   |    4 ++--
 5 files changed, 21 insertions(+), 7 deletions(-)

diff -ruNp org/include/net/dst.h new/include/net/dst.h
--- org/include/net/dst.h	2009-10-14 18:00:07.000000000 +0530
+++ new/include/net/dst.h	2009-10-14 18:00:22.000000000 +0530
@@ -222,11 +222,19 @@ static inline void dst_confirm(struct ds
 		neigh_confirm(dst->neighbour);
 }
 
-static inline void dst_negative_advice(struct dst_entry **dst_p)
+static inline void dst_negative_advice(struct dst_entry **dst_p,
+				       struct sock *sk)
 {
 	struct dst_entry * dst = *dst_p;
-	if (dst && dst->ops->negative_advice)
+	if (dst && dst->ops->negative_advice) {
 		*dst_p = dst->ops->negative_advice(dst);
+
+		if (dst != *dst_p) {
+			extern void sk_reset_txq(struct sock *sk);
+
+			sk_reset_txq(sk);
+		}
+	}
 }
 
 static inline void dst_link_failure(struct sk_buff *skb)
diff -ruNp org/net/core/sock.c new/net/core/sock.c
--- org/net/core/sock.c	2009-10-14 18:00:07.000000000 +0530
+++ new/net/core/sock.c	2009-10-14 18:00:22.000000000 +0530
@@ -353,6 +353,12 @@ discard_and_relse:
 }
 EXPORT_SYMBOL(sk_receive_skb);
 
+void sk_reset_txq(struct sock *sk)
+{
+	sk_record_tx_queue(sk, -1);
+}
+EXPORT_SYMBOL(sk_reset_txq);
+
 struct dst_entry *__sk_dst_check(struct sock *sk, u32 cookie)
 {
 	struct dst_entry *dst = sk->sk_dst_cache;
diff -ruNp org/net/dccp/timer.c new/net/dccp/timer.c
--- org/net/dccp/timer.c	2009-10-14 18:00:07.000000000 +0530
+++ new/net/dccp/timer.c	2009-10-14 18:00:22.000000000 +0530
@@ -38,7 +38,7 @@ static int dccp_write_timeout(struct soc
 
 	if (sk->sk_state == DCCP_REQUESTING || sk->sk_state == DCCP_PARTOPEN) {
 		if (icsk->icsk_retransmits != 0)
-			dst_negative_advice(&sk->sk_dst_cache);
+			dst_negative_advice(&sk->sk_dst_cache, sk);
 		retry_until = icsk->icsk_syn_retries ?
 			    : sysctl_dccp_request_retries;
 	} else {
@@ -63,7 +63,7 @@ static int dccp_write_timeout(struct soc
 			   Golden words :-).
 		   */
 
-			dst_negative_advice(&sk->sk_dst_cache);
+			dst_negative_advice(&sk->sk_dst_cache, sk);
 		}
 
 		retry_until = sysctl_dccp_retries2;
diff -ruNp org/net/decnet/af_decnet.c new/net/decnet/af_decnet.c
--- org/net/decnet/af_decnet.c	2009-10-14 18:00:07.000000000 +0530
+++ new/net/decnet/af_decnet.c	2009-10-14 18:00:22.000000000 +0530
@@ -1955,7 +1955,7 @@ static int dn_sendmsg(struct kiocb *iocb
 	}
 
 	if ((flags & MSG_TRYHARD) && sk->sk_dst_cache)
-		dst_negative_advice(&sk->sk_dst_cache);
+		dst_negative_advice(&sk->sk_dst_cache, sk);
 
 	mss = scp->segsize_rem;
 	fctype = scp->services_rem & NSP_FC_MASK;
diff -ruNp org/net/ipv4/tcp_timer.c new/net/ipv4/tcp_timer.c
--- org/net/ipv4/tcp_timer.c	2009-10-14 18:00:07.000000000 +0530
+++ new/net/ipv4/tcp_timer.c	2009-10-14 18:00:22.000000000 +0530
@@ -141,14 +141,14 @@ static int tcp_write_timeout(struct sock
 
 	if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
 		if (icsk->icsk_retransmits)
-			dst_negative_advice(&sk->sk_dst_cache);
+			dst_negative_advice(&sk->sk_dst_cache, sk);
 		retry_until = icsk->icsk_syn_retries ? : sysctl_tcp_syn_retries;
 	} else {
 		if (retransmits_timed_out(sk, sysctl_tcp_retries1)) {
 			/* Black hole detection */
 			tcp_mtu_probing(icsk, sk);
 
-			dst_negative_advice(&sk->sk_dst_cache);
+			dst_negative_advice(&sk->sk_dst_cache, sk);
 		}
 
 		retry_until = sysctl_tcp_retries2;

^ permalink raw reply

* [RFC] [PATCH 5/5] net: Encapsulate inner code of __sk_dst_reset
From: Krishna Kumar @ 2009-10-15  5:57 UTC (permalink / raw)
  To: davem; +Cc: netdev, herbert, Krishna Kumar, dada1
In-Reply-To: <20091015055602.30145.65852.sendpatchset@localhost.localdomain>

From: Krishna Kumar <krkumar2@in.ibm.com>

Reuse code by adding ___sk_dst_reset() which is called by
__sk_dst_reset() and __sk_dst_check(). This new API is also
called from IPv6 to hide the internals of __sk_dst_reset
and the setting of sk_tx_queue_mapping.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 include/net/sock.h               |   13 +++++++++----
 net/core/sock.c                  |    4 +---
 net/ipv6/inet6_connection_sock.c |    4 +---
 3 files changed, 11 insertions(+), 10 deletions(-)

diff -ruNp org/include/net/sock.h new/include/net/sock.h
--- org/include/net/sock.h	2009-10-14 18:00:17.000000000 +0530
+++ new/include/net/sock.h	2009-10-14 18:00:30.000000000 +0530
@@ -1186,17 +1186,22 @@ sk_dst_set(struct sock *sk, struct dst_e
 }
 
 static inline void
-__sk_dst_reset(struct sock *sk)
+___sk_dst_reset(struct sock *sk, struct dst_entry *old_dst)
 {
-	struct dst_entry *old_dst;
-
 	sk_record_tx_queue(sk, -1);
-	old_dst = sk->sk_dst_cache;
 	sk->sk_dst_cache = NULL;
 	dst_release(old_dst);
 }
 
 static inline void
+__sk_dst_reset(struct sock *sk)
+{
+	struct dst_entry *old_dst = sk->sk_dst_cache;
+
+	___sk_dst_reset(sk, old_dst);
+}
+
+static inline void
 sk_dst_reset(struct sock *sk)
 {
 	write_lock(&sk->sk_dst_lock);
diff -ruNp org/net/core/sock.c new/net/core/sock.c
--- org/net/core/sock.c	2009-10-14 18:00:22.000000000 +0530
+++ new/net/core/sock.c	2009-10-14 18:00:30.000000000 +0530
@@ -364,9 +364,7 @@ struct dst_entry *__sk_dst_check(struct 
 	struct dst_entry *dst = sk->sk_dst_cache;
 
 	if (dst && dst->obsolete && dst->ops->check(dst, cookie) == NULL) {
-		sk_record_tx_queue(sk, -1);
-		sk->sk_dst_cache = NULL;
-		dst_release(dst);
+		___sk_dst_reset(sk, dst);
 		return NULL;
 	}
 
diff -ruNp org/net/ipv6/inet6_connection_sock.c new/net/ipv6/inet6_connection_sock.c
--- org/net/ipv6/inet6_connection_sock.c	2009-10-14 18:00:17.000000000 +0530
+++ new/net/ipv6/inet6_connection_sock.c	2009-10-14 18:00:30.000000000 +0530
@@ -168,9 +168,7 @@ struct dst_entry *__inet6_csk_dst_check(
 	if (dst) {
 		struct rt6_info *rt = (struct rt6_info *)dst;
 		if (rt->rt6i_flow_cache_genid != atomic_read(&flow_cache_genid)) {
-			sk_record_tx_queue(sk, -1);
-			sk->sk_dst_cache = NULL;
-			dst_release(dst);
+			___sk_dst_reset(sk, dst);
 			dst = NULL;
 		}
 	}

^ permalink raw reply

* Re: Moving drivers into staging (was Re: [GIT PULL] SCSI fixes for 2.6.32-rc3)
From: Ingo Molnar @ 2009-10-15  6:03 UTC (permalink / raw)
  To: Stefan Richter
  Cc: Joe Perches, Greg KH, Luis R. Rodriguez, James Bottomley,
	Linus Torvalds, Theodore Tso, Andrew Morton, linux-scsi,
	linux-kernel, Jing Huang, netdev, linux-wireless
In-Reply-To: <tkrat.88ac5cebebbd9689@s5r6.in-berlin.de>

* Stefan Richter <stefanr@s5r6.in-berlin.de> wrote:

> > Well, the answer is obvious i think. Tell me, at a glance, if you 
> > see a patch on lkml, which one is for a staging driver to be 
> > obsoleted, and which one is the one going upstream real soon? The 
> > patches say:
> > 
> >  +++ a/drivers/staging/foo/x.c
> > 
> >  +++ a/drivers/staging/bar/y.c
> > 
> > Then tell me the same at a glance if you see patches for:
> > 
> >  +++ a/drivers/staging/wip/x.c
> > 
> >  +++ a/drivers/staging/bad/y.c
> 
> Does this information matter much?

Yes. You might not appreciate it as you are active in a relatively 
narrow field (so all patches in your world have an 'obvious' place) - 
but i for example take most of the context of a change from the email 
itself and the more self-descriptive it is, the better. I would be more 
likely to review work-in-progress patches while not bother about 
obsolete drivers on the way out. YMMV.

> What's more interesting is whether development activity will _lead_ to 
> a driver being moved from bad or ugly to good.

... a prerequisite of which is for more developers to be accutely aware 
of in what state a driver is.

Anyway ... it's all up to Greg and he indicated that he wants the 
simplest structure, which is fair enough.

	Ingo

^ permalink raw reply

* [PATCH -mmotm -v2] Fix bitmap-introduce-bitmap_set-bitmap_clear-bitmap_find_next_zero_area. patch
From: Akinobu Mita @ 2009-10-15  6:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-usb, linux-ia64, linuxppc-dev, Paul Mackerras,
	H. Peter Anvin, sparclinux, Lothar Wassmann, x86, linux-altix,
	Ingo Molnar, Fenghua Yu, Yevgeny Petrilin, Thomas Gleixner,
	Tony Luck, Kyle Hubert, netdev, Greg Kroah-Hartman, linux-kernel,
	FUJITA Tomonori, David S. Miller
In-Reply-To: <20091014032258.GA18527@localhost.localdomain>

From: Akinobu Mita <akinobu.mita@gmail.com>
Subject: Fix bitmap-introduce-bitmap_set-bitmap_clear-bitmap_find_next_zero_area.patch

- Rewrite bitmap_set and bitmap_clear

  Instead of setting or clearing for each bit.

- Fix off-by-one errors in bitmap_find_next_zero_area

  This bug was derived from find_next_zero_area in iommu-helper.

- Add kerneldoc for bitmap_find_next_zero_area

This patch is supposed to be folded into
bitmap-introduce-bitmap_set-bitmap_clear-bitmap_find_next_zero_area.patch

* -v2
- Remove an extra test at the "index" value by find_next_bit

Index was just returned by find_next_zero_bit, so we know it's zero.

Noticed-by: Kyle Hubert <khubert@gmail.com>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
---
 lib/bitmap.c |   62 ++++++++++++++++++++++++++++++++++++++++++++-------------
 1 files changed, 48 insertions(+), 14 deletions(-)

diff --git a/lib/bitmap.c b/lib/bitmap.c
index 2415da4..962b863 100644
--- a/lib/bitmap.c
+++ b/lib/bitmap.c
@@ -271,28 +271,62 @@ int __bitmap_weight(const unsigned long *bitmap, int bits)
 }
 EXPORT_SYMBOL(__bitmap_weight);
 
-void bitmap_set(unsigned long *map, int i, int len)
-{
-	int end = i + len;
+#define BITMAP_FIRST_WORD_MASK(start) (~0UL << ((start) % BITS_PER_LONG))
 
-	while (i < end) {
-		__set_bit(i, map);
-		i++;
+void bitmap_set(unsigned long *map, int start, int nr)
+{
+	unsigned long *p = map + BIT_WORD(start);
+	const int size = start + nr;
+	int bits_to_set = BITS_PER_LONG - (start % BITS_PER_LONG);
+	unsigned long mask_to_set = BITMAP_FIRST_WORD_MASK(start);
+
+	while (nr - bits_to_set >= 0) {
+		*p |= mask_to_set;
+		nr -= bits_to_set;
+		bits_to_set = BITS_PER_LONG;
+		mask_to_set = ~0UL;
+		p++;
+	}
+	if (nr) {
+		mask_to_set &= BITMAP_LAST_WORD_MASK(size);
+		*p |= mask_to_set;
 	}
 }
 EXPORT_SYMBOL(bitmap_set);
 
 void bitmap_clear(unsigned long *map, int start, int nr)
 {
-	int end = start + nr;
-
-	while (start < end) {
-		__clear_bit(start, map);
-		start++;
+	unsigned long *p = map + BIT_WORD(start);
+	const int size = start + nr;
+	int bits_to_clear = BITS_PER_LONG - (start % BITS_PER_LONG);
+	unsigned long mask_to_clear = BITMAP_FIRST_WORD_MASK(start);
+
+	while (nr - bits_to_clear >= 0) {
+		*p &= ~mask_to_clear;
+		nr -= bits_to_clear;
+		bits_to_clear = BITS_PER_LONG;
+		mask_to_clear = ~0UL;
+		p++;
+	}
+	if (nr) {
+		mask_to_clear &= BITMAP_LAST_WORD_MASK(size);
+		*p &= ~mask_to_clear;
 	}
 }
 EXPORT_SYMBOL(bitmap_clear);
 
+/*
+ * bitmap_find_next_zero_area - find a contiguous aligned zero area
+ * @map: The address to base the search on
+ * @size: The bitmap size in bits
+ * @start: The bitnumber to start searching at
+ * @nr: The number of zeroed bits we're looking for
+ * @align_mask: Alignment mask for zero area
+ *
+ * The @align_mask should be one less than a power of 2; the effect is that
+ * the bit offset of all zero areas this function finds is multiples of that
+ * power of 2. A @align_mask of 0 means no alignment is required.
+ */
 unsigned long bitmap_find_next_zero_area(unsigned long *map,
 					 unsigned long size,
 					 unsigned long start,
@@ -304,12 +338,12 @@ again:
 	index = find_next_zero_bit(map, size, start);
 
 	/* Align allocation */
-	index = (index + align_mask) & ~align_mask;
+	index = __ALIGN_MASK(index, align_mask);
 
 	end = index + nr;
-	if (end >= size)
+	if (end > size)
 		return end;
-	i = find_next_bit(map, end, index);
+	i = find_next_bit(map, end, index + 1);
 	if (i < end) {
 		start = i + 1;
 		goto again;
-- 
1.5.4.3

^ permalink raw reply related

* Re: TCP_DEFER_ACCEPT is missing counter update
From: Willy Tarreau @ 2009-10-15  6:08 UTC (permalink / raw)
  To: David Miller; +Cc: ja, netdev, eric.dumazet
In-Reply-To: <20091014.154349.83940908.davem@davemloft.net>

On Wed, Oct 14, 2009 at 03:43:49PM -0700, David Miller wrote:
> For now I'm pushing Willy's change into Linus's tree.
> 
> After more discussion we can revert if necessary.
> 
> I won't submit this to -stable until the discussion is fully resolved.

Makes sense, thanks David.

BTW, I found a use case I didn't think about where current behaviour
causes trouble :

   https://bugs.launchpad.net/ubuntu/+source/apache2/+bug/134274
   http://lkml.indiana.edu/hypermail/linux/kernel/0711.0/0461.html

In summary, when front proxies establish pools of connections to
an apache server making use of TCP_DEFER_ACCEPT, the connection
never establishes on the apache server but silently expires in
SYN_RECV state. The front proxy sees lots of SYN/ACKs and sends
many ACKs trying to complete this connection and finally believes
it got it since the server eventually becomes silent. However,
when trying to send data over such a socket, the server immediately
returns an RST.

Such a problem would not happen if we would only drop the first
X packets (X >= 1 is already fine), because the front proxy would
establish the connection, send a second ACK in response to the
second SYN/ACK and the connection would then really be established
and would not have to expire early in SYN_RECV state.

If we really want to behave as it does today, well, let's not fix
it, but obviously, I fail to see what real world use it has, except
causing random and hard to debug issues :-/

Reading the articles below clearly make it think it was designed
to help with HTTP connections by skipping the first expected and
useless ACK packet before waking up the task :

   http://httpd.apache.org/docs/1.3/misc/perf-bsd44.html
   http://articles.techrepublic.com.com/5100-10878_11-1050771.html

and people still get caught :

   http://lkml.indiana.edu/hypermail/linux/kernel/0711.0/0416.html

Maybe it was a bit over-engineered, in the end causing it to fail
to satisfy the primary goal ?

Regards,
Willy

^ permalink raw reply

* Re: [net-next-2.6 PATCH 1/4] TCPCT part 1: initial SYN exchange with SYNACK data
From: David Miller @ 2009-10-15  6:15 UTC (permalink / raw)
  To: william.allen.simpson; +Cc: netdev
In-Reply-To: <4AD6B31B.3060402@gmail.com>

From: William Allen Simpson <william.allen.simpson@gmail.com>
Date: Thu, 15 Oct 2009 01:28:59 -0400

> Pass optional function parameters associated with sending SYNACK.
> These parameters are not needed after sending SYNACK, and are not
> used for retransmission.  Avoids extending struct tcp_request_sock,
> and avoids allocating kernel memory.

This callback is generic and isn't designed to take opaque
data.  All the other arguments to this callback are strongly
typed, and this is on purpose.

You'll need to find another way to implement this.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox