Netdev List
* Re: [PATCHv1] ethtool: added support for 40G link.
From: Ben Hutchings @ 2012-07-16 19:46 UTC (permalink / raw)
  To: Parav Pandit; +Cc: netdev
In-Reply-To: <847203bd-a11e-41c6-b451-abdeced5c5bf@exht1.ad.emulex.com>

On Wed, 2012-06-27 at 19:26 +0530, Parav Pandit wrote:
> 1. defined values for KR4, CR4, SR4, LR4 PHY.
> 
> Signed-off-by: Parav Pandit <parav.pandit@emulex.com>
[...]

Applied, thanks.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* [PATCH 0/4] pch_gbe: avoiding transmit timeouts (rev 2)
From: Andy Cress @ 2012-07-16 20:02 UTC (permalink / raw)
  To: netdev


When the interface is stressed with 6 VLANs, some transmit timeout stats
were observed, which is a potential precursor to the more severe netdev
watchdog timeout oops.  Also we saw more than the expected number of
transmit restarts, which impacted performance.  The following patches
were applied and resolved the symptom of the transmit timeout stats, and
reduced the number of transmit restarts.

This patch set includes the following patches:
0001-pch_gbe-Fix-the-checksum-fill-to-the-error-location.patch
0002-pch_gbe-fix-transmit-watchdog-timeout.patch
0003-pch_gbe-add-extra-clean-tx.patch  (includes bumping the version to 1.01)
0004-pch_gbe-vlan-skb-len-fix.patch

This rev2 has the following changes:
0001: tested with skb_checksum_start_offset, but cannot use it here; added
a comment
0004: delete transmit length error check as unnecessary

The resulting pch_gbe 1.01 driver has been tested on Kontron Tunnel Creek
EG20T modules and Intel Crown Bay EG20T modules, so I believe that these
patches are appropriate for consideration in the upstream pch_gbe driver.


Please review and comment.

Thanks,
Andy

^ permalink raw reply

* [PATCH 1/4] pch_gbe: Fix the checksum fill to the error location
From: Andy Cress @ 2012-07-16 20:03 UTC (permalink / raw)
  To: netdev


Author: Zhong Hongbo <hongbo.zhong@windriver.com>

Due to some unknown hardware limitations the pch_gbe hardware cannot
calculate checksums when the length of a network packet is less
than 64 bytes, and in that case we can surprisingly end up with
the destination IP address incorrectly changed.

When packets are forwarded at the network layer, they are not passed
up to the transport layer and analyzed there. Consequently, the
skb->transport_header pointer mistakenly remains the same as
skb->network_header, so the TCP checksum is wrongly written into
the destination IP field of the IP header.

We can fix this issue by manually calculating the offset of the TCP
checksum and updating it accordingly.

We would normally use skb_checksum_start_offset(skb) here, but in
this case it is sometimes -2 (csum_start=0 - skb_headroom=2 => -2),
hence the manual calculation.
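
The manual offset calculation described above can be illustrated in user
space. This is only a minimal sketch: it assumes a plain byte buffer stands
in for skb->data, and the helper names l4_offset_from_ip() and
checksum_start_offset() are hypothetical, not part of the driver.

```c
#include <stddef.h>
#include <stdint.h>

/* Offset of the transport header from the start of the frame, derived
 * from the IPv4 header itself rather than from a cached pointer.
 * ip_off is where the IPv4 header starts (14 for untagged Ethernet). */
static size_t l4_offset_from_ip(const uint8_t *frame, size_t ip_off)
{
	unsigned int ihl = frame[ip_off] & 0x0f; /* header length in 32-bit words */

	return ip_off + (size_t)ihl * 4;
}

/* Mirrors the arithmetic quoted in the commit message:
 * csum_start - headroom, which goes negative when csum_start is 0. */
static int checksum_start_offset(unsigned int csum_start, unsigned int headroom)
{
	return (int)csum_start - (int)headroom;
}
```

With a 20-byte IPv4 header (IHL = 5) behind a 14-byte Ethernet header, the
transport header would sit at offset 34 regardless of what the cached
transport_header pointer says.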

Signed-off-by: Zhong Hongbo <hongbo.zhong@windriver.com>
Merged-by: Andy Cress <andy.cress@us.kontron.com>

---
 drivers/net/pch_gbe/pch_gbe_main.c |   14 ++++++++------
 1 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
index 3787c64..1642bff 100644
--- a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
+++ b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
@@ -1178,32 +1178,35 @@ static void pch_gbe_tx_queue(struct pch_gbe_adapter *adapter,
 	/*
 	 * It is because the hardware accelerator does not support a checksum,
 	 * when the received data size is less than 64 bytes.
+	 * Note: skb_checksum_start_offset(skb) is sometimes -2 here.
 	 */
 	if (skb->len < PCH_GBE_SHORT_PKT && skb->ip_summed != CHECKSUM_NONE) {
+		struct iphdr *iph = ip_hdr(skb);
 		frame_ctrl |= PCH_GBE_TXD_CTRL_APAD |
 			      PCH_GBE_TXD_CTRL_TCPIP_ACC_OFF;
 		if (skb->protocol == htons(ETH_P_IP)) {
-			struct iphdr *iph = ip_hdr(skb);
 			unsigned int offset;
-			offset = skb_transport_offset(skb);
+			offset = (unsigned char *)((u8 *)iph + iph->ihl * 4) - skb->data;
 			if (iph->protocol == IPPROTO_TCP) {
+				struct tcphdr *tcphdr_point = (struct tcphdr *)((u8 *)iph + iph->ihl * 4);
 				skb->csum = 0;
-				tcp_hdr(skb)->check = 0;
+				tcphdr_point->check = 0;
 				skb->csum = skb_checksum(skb, offset,
 							 skb->len - offset, 0);
-				tcp_hdr(skb)->check =
+				tcphdr_point->check =
 					csum_tcpudp_magic(iph->saddr,
 							  iph->daddr,
 							  skb->len - offset,
 							  IPPROTO_TCP,
 							  skb->csum);
 			} else if (iph->protocol == IPPROTO_UDP) {
+				struct udphdr *udphdr_point = (struct udphdr *)((u8 *)iph + iph->ihl * 4);
 				skb->csum = 0;
-				udp_hdr(skb)->check = 0;
+				udphdr_point->check = 0;
 				skb->csum =
 					skb_checksum(skb, offset,
 						     skb->len - offset, 0);
-				udp_hdr(skb)->check =
+				udphdr_point->check =
 					csum_tcpudp_magic(iph->saddr,
 							  iph->daddr,
 							  skb->len - offset,

^ permalink raw reply related

* Re: [ethtool PATCH] ethtool: Resolve use of uninitialized memory in rxclass_get_dev_info
From: Ben Hutchings @ 2012-07-16 20:03 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: netdev, jeffrey.t.kirsher
In-Reply-To: <20120713165221.28140.92681.stgit@gitlad.jf.intel.com>

On Fri, 2012-07-13 at 09:55 -0700, Alexander Duyck wrote:
> The ethtool function for getting the rule count was not zeroing out the
> data field before passing it to the kernel.  As a result the value started
> uninitialized and was incorrectly returning a result indicating that
> devices supported setting new rule indexes.  In order to correct this I am
> adding a one line fix that sets data to zero before we pass the command to
> the kernel.

Right.  For 'get' commands with no parameters (besides the device) the
data copied back to userland is normally zero-initialised and then
filled out by the driver, and I seem to have worked on that assumption.
But because of the odd multiplexing of RX NFC commands
ETHTOOL_GRXCLSRLCNT doesn't work like that.  And for 'my' driver that
didn't matter.  Sorry about that.

(We should really have some explicit documentation of responsibility for
structure initialisation.)
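
Until such documentation exists, one defensive habit for userspace callers
is to zero the whole command structure before filling in the request. The
sketch below only illustrates that idea: it assumes the Linux UAPI header
<linux/ethtool.h>, prepare_rule_count_cmd() is a hypothetical helper, and
the patch under discussion takes the narrower route of setting only
nfccmd.data to 0.

```c
#include <string.h>
#include <linux/ethtool.h>

/* Prepare an ETHTOOL_GRXCLSRLCNT request with every other field zeroed,
 * so the kernel never sees stale stack contents in 'data'.
 * prepare_rule_count_cmd() is a hypothetical helper, not ethtool code. */
static void prepare_rule_count_cmd(struct ethtool_rxnfc *cmd)
{
	memset(cmd, 0, sizeof(*cmd));
	cmd->cmd = ETHTOOL_GRXCLSRLCNT;
}
```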

> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> ---
> 
> I am resending this since I didn't see any notification that it had been seen.
> I also realized that I had not clearly identified that this is an ethtool user
> space patch and not an ethtool kernel space patch.

It was perfectly clear and I had queued it up to review but hadn't yet
done so.

Ben.

>  rxclass.c |    1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
> 
> diff --git a/rxclass.c b/rxclass.c
> index 4d49aa6..e1633a8 100644
> --- a/rxclass.c
> +++ b/rxclass.c
> @@ -207,6 +207,7 @@ static int rxclass_get_dev_info(struct cmd_context *ctx, __u32 *count,
>  	int err;
>  
>  	nfccmd.cmd = ETHTOOL_GRXCLSRLCNT;
> +	nfccmd.data = 0;
>  	err = send_ioctl(ctx, &nfccmd);
>  	*count = nfccmd.rule_cnt;
>  	if (driver_select)
> 

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* [PATCH 2/4] pch_gbe: fix transmit watchdog timeout
From: Andy Cress @ 2012-07-16 20:04 UTC (permalink / raw)
  To: netdev


Author: Andy Cress <andy.cress@us.kontron.com>

An extended ping test with 6 vlans resulted in a driver oops with a
netdev transmit timeout.
Change the watchdog timeout (PCH_GBE_WATCHDOG_PERIOD) to 5 * HZ, more
like e1000e, to avoid unnecessary transmit timeouts.

Signed-off-by: Andy Cress <andy.cress@us.kontron.com>

diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
index 4c04843..a746064 100644
--- a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
+++ b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
@@ -35,7 +35,7 @@ const char pch_driver_version[] = DRV_VERSION;
 #define DSC_INIT16			0xC000
 #define PCH_GBE_DMA_ALIGN		0
 #define PCH_GBE_DMA_PADDING		2
-#define PCH_GBE_WATCHDOG_PERIOD		(1 * HZ)	/* watchdog time */
+#define PCH_GBE_WATCHDOG_PERIOD		(5 * HZ)	/* watchdog time */
 #define PCH_GBE_COPYBREAK_DEFAULT	256
 #define PCH_GBE_PCI_BAR			1
 #define PCH_GBE_RESERVE_MEMORY		0x200000	/* 2MB */

^ permalink raw reply related

* [PATCH 3/4] pch_gbe: add extra clean tx
From: Andy Cress @ 2012-07-16 20:04 UTC (permalink / raw)
  To: netdev

Author: Andy Cress <andy.cress@us.kontron.com>

This adds extra cleaning to the pch_gbe_clean_tx routine to avoid 
transmit timeouts on some BCM PHYs that have different timing.
Also update the DRV_VERSION to 1.01, and show it.
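
The extra cleaning pass can be pictured as a bounded scan around the
descriptor ring. The following is a user-space sketch only: an array of
status words stands in for the descriptor ring, find_dirty_desc() and
ring_unused() are hypothetical names, and the free-slot formula is the usual
e1000-style one, assumed here to match PCH_GBE_DESC_UNUSED.

```c
#include <stdint.h>

#define DSC_INIT16 0xC000u /* status value of an already-clean descriptor */

/* Free slots in a ring; e1000-style formula, assumed (not verified here)
 * to match the driver's PCH_GBE_DESC_UNUSED macro. */
static int ring_unused(unsigned int count, unsigned int next_to_clean,
		       unsigned int next_to_use)
{
	return (int)((next_to_clean > next_to_use ? 0 : count) +
		     next_to_clean - next_to_use - 1);
}

/* Starting at 'start', look at up to 'weight' descriptors, wrapping at
 * 'count'.  Returns the index of the first one whose status is not
 * DSC_INIT16 (one that still needs cleaning), or -1 if none is found. */
static int find_dirty_desc(const uint16_t *status, unsigned int count,
			   unsigned int start, unsigned int weight)
{
	unsigned int k = start;
	unsigned int j;

	for (j = 0; j < weight; j++) {
		if (status[k] != DSC_INIT16)
			return (int)k;
		if (++k >= count)
			k = 0; /* wrap around the ring */
	}
	return -1;
}
```

Bounding the scan by a weight keeps the extra work per poll limited even
when the whole ring is already clean.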

Signed-off-by: Andy Cress <andy.cress@us.kontron.com>

diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
index 42e9874..2ccdca6 100644
--- a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
+++ b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
@@ -26,7 +26,7 @@
 #include <linux/ptp_classify.h>
 #endif
 
-#define DRV_VERSION     "1.00"
+#define DRV_VERSION     "1.01"
 const char pch_driver_version[] = DRV_VERSION;
 
 #define PCI_DEVICE_ID_INTEL_IOH1_GBE	0x8802		/* Pci device ID */
@@ -1582,7 +1582,8 @@ pch_gbe_clean_tx(struct pch_gbe_adapter *adapter,
 	struct sk_buff *skb;
 	unsigned int i;
 	unsigned int cleaned_count = 0;
-	bool cleaned = true;
+	bool cleaned = false;
+	int unused, thresh;
 
 	pr_debug("next_to_clean : %d\n", tx_ring->next_to_clean);
 
@@ -1591,10 +1592,36 @@ pch_gbe_clean_tx(struct pch_gbe_adapter *adapter,
 	pr_debug("gbec_status:0x%04x  dma_status:0x%04x\n",
 		 tx_desc->gbec_status, tx_desc->dma_status);
 
+	unused = PCH_GBE_DESC_UNUSED(tx_ring);
+	thresh = tx_ring->count - PCH_GBE_TX_WEIGHT;
+	if ((tx_desc->gbec_status == DSC_INIT16) && (unused < thresh))
+	{  /* current marked clean, tx queue filling up, do extra clean */
+		int j, k;
+		if (unused < 8) {  /* tx queue nearly full */
+			pr_debug("clean_tx: transmit queue warning (%x,%x) unused=%d\n",
+				tx_ring->next_to_clean,tx_ring->next_to_use,unused);
+		}
+
+		/* current marked clean, scan for more that need cleaning. */
+		k = i;
+		for (j = 0; j < PCH_GBE_TX_WEIGHT; j++)
+		{
+			tx_desc = PCH_GBE_TX_DESC(*tx_ring, k);
+			if (tx_desc->gbec_status != DSC_INIT16) break; /*found*/
+			if (++k >= tx_ring->count) k = 0;  /*increment, wrap*/
+		}
+		if (j < PCH_GBE_TX_WEIGHT) {
+			pr_debug("clean_tx: unused=%d loops=%d found tx_desc[%x,%x:%x].gbec_status=%04x\n",
+				unused,j, i,k, tx_ring->next_to_use, tx_desc->gbec_status);
+			i = k;  /*found one to clean, usu gbec_status==2000.*/
+		}
+	}
+
 	while ((tx_desc->gbec_status & DSC_INIT16) == 0x0000) {
 		pr_debug("gbec_status:0x%04x\n", tx_desc->gbec_status);
 		buffer_info = &tx_ring->buffer_info[i];
 		skb = buffer_info->skb;
+		cleaned = true;
 
 		if ((tx_desc->gbec_status & PCH_GBE_TXD_GMAC_STAT_ABT)) {
 			adapter->stats.tx_aborted_errors++;
@@ -1642,18 +1669,21 @@ pch_gbe_clean_tx(struct pch_gbe_adapter *adapter,
 	}
 	pr_debug("called pch_gbe_unmap_and_free_tx_resource() %d count\n",
 		 cleaned_count);
-	/* Recover from running out of Tx resources in xmit_frame */
-	spin_lock(&tx_ring->tx_lock);
-	if (unlikely(cleaned && (netif_queue_stopped(adapter->netdev)))) {
-		netif_wake_queue(adapter->netdev);
-		adapter->stats.tx_restart_count++;
-		pr_debug("Tx wake queue\n");
-	}
+	if (cleaned_count > 0)  { /*skip this if nothing cleaned*/
+		/* Recover from running out of Tx resources in xmit_frame */
+		spin_lock(&tx_ring->tx_lock);
+		if (unlikely(cleaned && (netif_queue_stopped(adapter->netdev))))
+		{
+			netif_wake_queue(adapter->netdev);
+			adapter->stats.tx_restart_count++;
+			pr_debug("Tx wake queue\n");
+		}
 
-	tx_ring->next_to_clean = i;
+		tx_ring->next_to_clean = i;
 
-	pr_debug("next_to_clean : %d\n", tx_ring->next_to_clean);
-	spin_unlock(&tx_ring->tx_lock);
+		pr_debug("next_to_clean : %d\n", tx_ring->next_to_clean);
+		spin_unlock(&tx_ring->tx_lock);
+	}
 	return cleaned;
 }
 
@@ -2390,7 +2420,7 @@ static int pch_gbe_napi_poll(struct napi_struct *napi, int budget)
 	pch_gbe_clean_rx(adapter, adapter->rx_ring, &work_done, budget);
 	cleaned = pch_gbe_clean_tx(adapter, adapter->tx_ring);
 
-	if (!cleaned)
+	if (cleaned)
 		work_done = budget;
 	/* If no Tx and not enough Rx work done,
 	 * exit the polling mode
@@ -2796,6 +2826,7 @@ static int __init pch_gbe_init_module(void)
 {
 	int ret;
 
+	pr_info("EG20T PCH Gigabit Ethernet Driver - version %s\n",DRV_VERSION);
 	ret = pci_register_driver(&pch_gbe_driver);
 	if (copybreak != PCH_GBE_COPYBREAK_DEFAULT) {
 		if (copybreak == 0) {

^ permalink raw reply related

* [PATCH 4/4] pch_gbe: vlan skb len fix
From: Andy Cress @ 2012-07-16 20:05 UTC (permalink / raw)
  To: netdev


Author: Veaceslav Falico <vfalico@redhat.com>
Date:   Tue Apr 10 08:14:17 2012 +0200

pch_gbe_xmit_frame skb->len verification was incorrect in vlan case 
causing bogus transfer length errors.  One correction could be:
    offset = skb->protocol == htons(ETH_P_8021Q) ? 0 : 4;
    if (unlikely(skb->len > (adapter->hw.mac.max_frame_size - offset))) 
However, this verification is not necessary, so remove it.
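
For reference, the alternative correction mentioned above can be written as
a small predicate on plain integers. This is a hedged sketch: the function
name tx_len_too_long() and the ETH_P_8021Q_HOST constant are illustrative,
the ethertype is taken in host byte order, and the patch itself removes the
check rather than adopting it.

```c
#include <stdbool.h>
#include <stdint.h>

#define ETH_P_8021Q_HOST 0x8100 /* 802.1Q ethertype, host byte order */

/* The alternative check from the commit message, on plain integers:
 * proto is the frame's ethertype in host byte order. */
static bool tx_len_too_long(unsigned int skb_len, unsigned int max_frame_size,
			    uint16_t proto)
{
	unsigned int offset = (proto == ETH_P_8021Q_HOST) ? 0 : 4;

	return skb_len > max_frame_size - offset;
}
```

Assuming a max_frame_size of 1518 purely for illustration, a 1518-byte
VLAN-tagged frame passes this check, while the original fixed "- 4" test
would have rejected it as a length error.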

Merged-by: Andy Cress <andy.cress@us.kontron.com>
Signed-off-by: Andy Cress <andy.cress@us.kontron.com>

diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
index 2ccdca6..5eaac7f 100644
--- a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
+++ b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
@@ -2160,13 +2160,6 @@ static int pch_gbe_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
 	struct pch_gbe_tx_ring *tx_ring = adapter->tx_ring;
 	unsigned long flags;
 
-	if (unlikely(skb->len > (adapter->hw.mac.max_frame_size - 4))) {
-		pr_err("Transfer length Error: skb len: %d > max: %d\n",
-		       skb->len, adapter->hw.mac.max_frame_size);
-		dev_kfree_skb_any(skb);
-		adapter->stats.tx_length_errors++;
-		return NETDEV_TX_OK;
-	}
 	if (!spin_trylock_irqsave(&tx_ring->tx_lock, flags)) {
 		/* Collision - tell upper layer to requeue */
 		return NETDEV_TX_LOCKED;

^ permalink raw reply related

* Re: [PATCH] mlx4_en: map entire pages to increase throughput
From: Or Gerlitz @ 2012-07-16 20:36 UTC (permalink / raw)
  To: Rick Jones
  Cc: netdev@vger.kernel.org, leitao@linux.vnet.ibm.com,
	amirv@mellanox.com, yevgenyp@mellanox.co.il,
	klebers@linux.vnet.ibm.com, Thadeu Lima de Souza Cascardo,
	brking@linux.vnet.ibm.com, ogerlitz@mellanox.com,
	linuxppc-dev@lists.ozlabs.org, davem@davemloft.net,
	anton@samba.org
In-Reply-To: <50046EB1.5040909@hp.com>


On Mon, Jul 16, 2012 at 10:42 PM, Rick Jones <rick.jones2@hp.com> wrote:

> I was thinking more along the lines of an additional comparison,
> explicitly using netperf TCP_RR or something like it, not just the packets
> per second from a bulk transfer test.
>

TCP_STREAM would be good to know here as well

Or.

^ permalink raw reply

* Re: [PATCH] mlx4_en: map entire pages to increase throughput
From: Or Gerlitz @ 2012-07-16 20:43 UTC (permalink / raw)
  To: Rick Jones, Thadeu Lima de Souza Cascardo
  Cc: davem@davemloft.net, netdev@vger.kernel.org,
	yevgenyp@mellanox.co.il, ogerlitz@mellanox.com,
	amirv@mellanox.com, brking@linux.vnet.ibm.com,
	leitao@linux.vnet.ibm.com, klebers@linux.vnet.ibm.com,
	linuxppc-dev@lists.ozlabs.org, anton@samba.org
In-Reply-To: <50046EB1.5040909@hp.com>

On Mon, Jul 16, 2012 at 10:42 PM, Rick Jones <rick.jones2@hp.com> wrote:

> I was thinking more along the lines of an additional comparison,
> explicitly using netperf TCP_RR or something like it, not just the packets
> per second from a bulk transfer test.


TCP_STREAM from this setup before the patch would be good to know as well

^ permalink raw reply

* Re: [PATCH] mlx4_en: map entire pages to increase throughput
From: Thadeu Lima de Souza Cascardo @ 2012-07-16 20:47 UTC (permalink / raw)
  To: Rick Jones
  Cc: davem@davemloft.net, netdev@vger.kernel.org,
	yevgenyp@mellanox.co.il, ogerlitz@mellanox.com,
	amirv@mellanox.com, brking@linux.vnet.ibm.com,
	leitao@linux.vnet.ibm.com, klebers@linux.vnet.ibm.com,
	linuxppc-dev@lists.ozlabs.org, anton@samba.org
In-Reply-To: <50046EB1.5040909@hp.com>

On Mon, Jul 16, 2012 at 12:42:41PM -0700, Rick Jones wrote:
> On 07/16/2012 12:06 PM, Thadeu Lima de Souza Cascardo wrote:
> >On Mon, Jul 16, 2012 at 10:27:57AM -0700, Rick Jones wrote:
> >
> >>What is the effect on packet-per-second performance?  (eg aggregate,
> >>burst-mode netperf TCP_RR with TCP_NODELAY set or perhaps UDP_RR)
> >>
> >I used uperf with TCP_NODELAY and 16 threads sending from another
> >machine 64000-sized writes for 60 seconds.
> >
> >I get 5898op/s (3.02Gb/s) without the patch against 18022ops/s
> >(9.23Gb/s) with the patch.
> 
> I was thinking more along the lines of an additional comparison,
> explicitly using netperf TCP_RR or something like it, not just the
> packets per second from a bulk transfer test.
> 
> rick
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

I used a uperf profile that is similar to TCP_RR. It writes, then reads
some bytes. I kept the TCP_NODELAY flag.

Without the patch, I saw the following:

packet size	ops/s		Gb/s
1		337024		0.0027
90		276620		0.199
900		190455		1.37
4000		68863		2.20
9000		45638		3.29
60000		9409		4.52

With the patch:

packet size	ops/s		Gb/s
1		451738		0.0036
90		345682		0.248
900		272258		1.96
4000		127055		4.07
9000		106614		7.68
60000		30671		14.72
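
The Gb/s column in both tables appears to be consistent with
size x ops/s x 8 bits, using decimal giga. A small sketch of that
arithmetic, where gbps() is just an illustrative helper name:

```c
/* Throughput in Gb/s for a request/response run: bytes per operation
 * times operations per second, converted to bits and scaled by 1e9. */
static double gbps(double bytes_per_op, double ops_per_sec)
{
	return bytes_per_op * ops_per_sec * 8.0 / 1e9;
}
```

For the 60000-byte rows this gives roughly 4.52 Gb/s (9409 ops/s) without
the patch and roughly 14.72 Gb/s (30671 ops/s) with it.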

^ permalink raw reply

* Re: [PATCH] mlx4_en: map entire pages to increase throughput
From: Thadeu Lima de Souza Cascardo @ 2012-07-16 20:57 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Rick Jones, davem@davemloft.net, netdev@vger.kernel.org,
	yevgenyp@mellanox.co.il, ogerlitz@mellanox.com,
	amirv@mellanox.com, brking@linux.vnet.ibm.com,
	leitao@linux.vnet.ibm.com, klebers@linux.vnet.ibm.com,
	linuxppc-dev@lists.ozlabs.org, anton@samba.org
In-Reply-To: <CAJZOPZL3F+xdHSFfhg7v9A6DDjT6CPK=kgwyzcE6c0pGYFyupg@mail.gmail.com>

On Mon, Jul 16, 2012 at 11:43:33PM +0300, Or Gerlitz wrote:
> On Mon, Jul 16, 2012 at 10:42 PM, Rick Jones <rick.jones2@hp.com> wrote:
> 
> > I was thinking more along the lines of an additional comparison,
> > explicitly using netperf TCP_RR or something like it, not just the packets
> > per second from a bulk transfer test.
> 
> 
> TCP_STREAM from this setup before the patch would be good to know as well
> 

Hi, Or.

Does the stream test that I did with uperf using messages of 64000 bytes
fit?

TCP_NODELAY does not make a difference in this case. I get something
around 3Gbps before the patch and something around 9Gbps after the
patch.

Before the patch:

# ./uperf-1.0.3-beta/src/uperf -m tcp.xml
Starting 16 threads running profile:tcp_stream ...   0.00 seconds
Txn1          0 /1.00(s) =            0          16op/s
Txn2    20.81GB /59.26(s) =     3.02Gb/s        5914op/s
Txn3          0 /0.00(s) =            0      128295op/s
-------------------------------------------------------------------------------------------------------------------------------
Total   20.81GB /61.37(s) =     2.91Gb/s        5712op/s

Netstat statistics for this run
-------------------------------------------------------------------------------------------------------------------------------
Nic       opkts/s     ipkts/s     obits/s     ibits/s
eth6       252459       31694   3.06Gb/s  16.74Mb/s
eth0            2          18   3.87Kb/s  14.28Kb/s
-------------------------------------------------------------------------------------------------------------------------------

Run Statistics
Hostname           Time        Data   Throughput   Operations      Errors
-------------------------------------------------------------------------------------------------------------------------------
10.0.0.2         61.47s     20.81GB     2.91Gb/s       350528        0.00
master           61.37s     20.81GB     2.91Gb/s       350528        0.00
-------------------------------------------------------------------------------------------------------------------------------
Difference(%)     -0.16%      0.00%        0.16%        0.00%       0.00%


After the patch:

# ./uperf-1.0.3-beta/src/uperf -m tcp.xml
Starting 16 threads running profile:tcp_stream ...   0.00 seconds
Txn1          0 /1.00(s) =            0          16op/s
Txn2    64.50GB /60.27(s) =     9.19Gb/s       17975op/s
Txn3          0 /0.00(s) =            0
-------------------------------------------------------------------------------------------------------------------------------
Total   64.50GB /62.27(s) =     8.90Gb/s       17397op/s

Netstat statistics for this run
-------------------------------------------------------------------------------------------------------------------------------
Nic       opkts/s     ipkts/s     obits/s     ibits/s
eth6       769428       96018   9.31Gb/s  50.72Mb/s
eth0            1          15   2.48Kb/s  13.59Kb/s
-------------------------------------------------------------------------------------------------------------------------------

Run Statistics
Hostname           Time        Data   Throughput   Operations      Errors
-------------------------------------------------------------------------------------------------------------------------------
10.0.0.2         62.27s     64.36GB     8.88Gb/s      1081096        0.00
master           62.27s     64.50GB     8.90Gb/s      1083325        0.00
-------------------------------------------------------------------------------------------------------------------------------
Difference(%)     -0.00%      0.21%        0.21%        0.21%       0.00%


Profile tcp.xml:

<?xml version="1.0"?>
<profile name="TCP_STREAM">
  <group nthreads="16">
        <transaction iterations="1">
            <flowop type="connect" options="remotehost=10.0.0.2 protocol=tcp tcp_nodelay"/>
        </transaction>
        <transaction duration="60">
            <flowop type="write" options="count=160 size=64000"/>
        </transaction>
        <transaction iterations="1">
            <flowop type="disconnect" />
        </transaction>
  </group>
</profile>

^ permalink raw reply

* Re: [PATCH 2/3] ipvs: add missing lock in ip_vs_ftp_init_conn()
From: Pablo Neira Ayuso @ 2012-07-16 21:07 UTC (permalink / raw)
  To: Simon Horman
  Cc: lvs-devel, netdev, netfilter-devel, Wensong Zhang,
	Julian Anastasov, Hans Schillstrom, Jesper Dangaard Brouer,
	Xiaotian Feng, Xiaotian Feng, Patrick McHardy, David S. Miller
In-Reply-To: <1341965963-7275-3-git-send-email-horms@verge.net.au>

Hi Simon,

On Wed, Jul 11, 2012 at 09:19:22AM +0900, Simon Horman wrote:
> From: Xiaotian Feng <xtfeng@gmail.com>
> 
> We met a kernel panic in 2.6.32.43 kernel:
[...]
>  net/netfilter/ipvs/ip_vs_ftp.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_ftp.c b/net/netfilter/ipvs/ip_vs_ftp.c
> index b20b29c..c2bc264 100644
> --- a/net/netfilter/ipvs/ip_vs_ftp.c
> +++ b/net/netfilter/ipvs/ip_vs_ftp.c
> @@ -65,8 +65,10 @@ static int ip_vs_ftp_pasv;
>  static int
>  ip_vs_ftp_init_conn(struct ip_vs_app *app, struct ip_vs_conn *cp)
>  {
> +	spin_lock(&cp->lock);
>  	/* We use connection tracking for the command connection */
>  	cp->flags |= IP_VS_CONN_F_NFCT;
> +	spin_unlock(&cp->lock);
>  	return 0;

The conntrack support for FTP IPVS helper seems to be there since
2.6.37.

However, the patch description mentions 2.6.32.43.

Something doesn't match here, could you clarify this?

Thanks.

^ permalink raw reply

* Re: [PATCH] mlx4_en: map entire pages to increase throughput
From: Rick Jones @ 2012-07-16 21:08 UTC (permalink / raw)
  To: Thadeu Lima de Souza Cascardo
  Cc: davem@davemloft.net, netdev@vger.kernel.org,
	yevgenyp@mellanox.co.il, ogerlitz@mellanox.com,
	amirv@mellanox.com, brking@linux.vnet.ibm.com,
	leitao@linux.vnet.ibm.com, klebers@linux.vnet.ibm.com,
	linuxppc-dev@lists.ozlabs.org, anton@samba.org
In-Reply-To: <20120716204717.GA16137@oc1711230544.ibm.com>


I was thinking more along the lines of an additional comparison,
explicitly using netperf TCP_RR or something like it, not just the
packets per second from a bulk transfer test.

rick

> I used a uperf profile that is similar to TCP_RR. It writes, then reads
> some bytes. I kept the TCP_NODELAY flag.
>
> Without the patch, I saw the following:
>
> packet size	ops/s		Gb/s
> 1		337024		0.0027
> 90		276620		0.199
> 900		190455		1.37
> 4000		68863		2.20
> 9000		45638		3.29
> 60000		9409		4.52
>
> With the patch:
>
> packet size	ops/s		Gb/s
> 1		451738		0.0036
> 90		345682		0.248
> 900		272258		1.96
> 4000		127055		4.07
> 9000		106614		7.68
> 60000		30671		14.72
>

So, on the surface it looks like it did good things for PPS, though it 
would be nice to know what the CPU utilizations/service demands were as 
a sanity check - does uperf not have that sort of functionality?

I'm guessing there were several writes at a time for the 1 byte packet
size (sic - that is payload, not packet, and without TCP_NODELAY not
even payload necessarily).  How many writes does it have outstanding
before it does a read?  And does it take care to build up to that number
of writes to avoid batching during slowstart, even with TCP_NODELAY set?

rick jones

^ permalink raw reply

* [PATCH 0/7] TCP Fast Open client
From: Yuchung Cheng @ 2012-07-16 21:16 UTC (permalink / raw)
  To: davem, hkchu, edumazet, ncardwell; +Cc: sivasankar, netdev, Yuchung Cheng

This patch series implements the client functionality of TCP Fast Open.
TCP Fast Open (TFO) allows data to be carried in the SYN and SYN-ACK
packets and consumed by the receiving end during the initial connection
handshake, thus providing a saving of up to one full round trip time (RTT)
compared to standard TCP requiring a three-way handshake (3WHS) to
complete before data can be exchanged.

The protocol change is detailed in the IETF internet draft at
http://www.ietf.org/id/draft-ietf-tcpm-fastopen-00.txt . The research
paper (http://conferences.sigcomm.org/co-next/2011/papers/1569470463.pdf)
studied the performance impact of HTTP using Fast Open, based on this
Linux implementation and the Chrome browser.

To use Fast Open, the client application (active SYN sender) must
replace the connect() socket call with sendmsg() or sendto() with the new
MSG_FASTOPEN flag. If the server supports Fast Open, the data exchange
starts during the TCP handshake. Otherwise the connection will automatically
fall back to conventional TCP.
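
A client following the description above might look roughly like this.
It is a minimal sketch, not code from the series: tfo_send_first() is a
hypothetical helper, IPv4 only, and the fallback MSG_FASTOPEN definition
uses the value found in mainline Linux headers in case the libc headers
predate the flag.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef MSG_FASTOPEN
#define MSG_FASTOPEN 0x20000000 /* fallback for older libc headers */
#endif

/* Open a TCP connection and hand the first bytes to the kernel in one call.
 * Returns the socket on success, or -1 on error.  tfo_send_first() is a
 * hypothetical helper name used only for this sketch. */
static int tfo_send_first(const char *ipv4, uint16_t port,
			  const void *buf, size_t len)
{
	struct sockaddr_in addr;
	int fd;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_port = htons(port);
	if (inet_pton(AF_INET, ipv4, &addr.sin_addr) != 1)
		return -1; /* not a valid dotted-quad address */

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;

	/* sendto() with MSG_FASTOPEN takes the place of connect() + send() */
	if (sendto(fd, buf, len, MSG_FASTOPEN,
		   (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

After the call succeeds the socket is used like any other connected TCP
socket; whether the data actually travelled in the SYN is decided by the
kernel and the server.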


Yuchung Cheng (7):
  net-tcp: Fast Open base
  net-tcp: Fast Open client - cookie cache
  net-tcp: Fast Open client - sending SYN-data
  net-tcp: Fast Open client - receiving SYN-ACK
  net-tcp: Fast Open client - sendmsg(MSG_FASTOPEN)
  net-tcp: Fast Open client - detecting SYN-data drops
  net-tcp: Fast Open client - cookie-less mode

 Documentation/networking/ip-sysctl.txt |   13 +++
 include/linux/snmp.h                   |    3 +-
 include/linux/socket.h                 |    1 +
 include/linux/tcp.h                    |   17 +++-
 include/net/inet_common.h              |    6 +-
 include/net/inetpeer.h                 |    2 +
 include/net/tcp.h                      |   30 ++++++-
 net/ipv4/Makefile                      |    2 +-
 net/ipv4/af_inet.c                     |   26 ++++-
 net/ipv4/inetpeer.c                    |    2 +
 net/ipv4/proc.c                        |    1 +
 net/ipv4/syncookies.c                  |    2 +-
 net/ipv4/sysctl_net_ipv4.c             |   33 +++++++
 net/ipv4/tcp.c                         |   61 +++++++++++-
 net/ipv4/tcp_fastopen.c                |  163 ++++++++++++++++++++++++++++++++
 net/ipv4/tcp_input.c                   |   76 +++++++++++++--
 net/ipv4/tcp_ipv4.c                    |    5 +-
 net/ipv4/tcp_minisocks.c               |    4 +-
 net/ipv4/tcp_output.c                  |  153 +++++++++++++++++++++++++++---
 net/ipv6/syncookies.c                  |    2 +-
 net/ipv6/tcp_ipv6.c                    |    2 +-
 21 files changed, 561 insertions(+), 43 deletions(-)
 create mode 100644 net/ipv4/tcp_fastopen.c

-- 
1.7.7.3

^ permalink raw reply

* [PATCH 4/7] net-tcp: Fast Open client - receiving SYN-ACK
From: Yuchung Cheng @ 2012-07-16 21:16 UTC (permalink / raw)
  To: davem, hkchu, edumazet, ncardwell; +Cc: sivasankar, netdev, Yuchung Cheng
In-Reply-To: <1342473410-6265-1-git-send-email-ycheng@google.com>

On receiving the SYN-ACK after SYN-data, the client needs to
a) update the cached MSS and cookie (if included in SYN-ACK)
b) retransmit the data not yet acknowledged by the SYN-ACK in the final ACK of
   the handshake.
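
Because the SYN may now carry data, the SYN-ACK can acknowledge just the SYN
or some or all of the data, so the acceptable-ACK test becomes a range check
rather than an equality. The sketch below restates that RFC 793 range test
in user space; seq_after() and synack_ack_acceptable() are illustrative
names modelled on the kernel's after() helper, not code from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wraparound-safe "a is later than b" for 32-bit sequence numbers. */
static bool seq_after(uint32_t a, uint32_t b)
{
	return (int32_t)(b - a) < 0;
}

/* RFC 793 acceptability test for the ACK carried by a SYN-ACK:
 * SND.UNA < SEG.ACK <= SND.NXT. */
static bool synack_ack_acceptable(uint32_t ack, uint32_t snd_una,
				  uint32_t snd_nxt)
{
	return seq_after(ack, snd_una) && !seq_after(ack, snd_nxt);
}
```

For example, with an initial sequence number of 1000 and 100 bytes of data
in the SYN (snd_una = 1000, snd_nxt = 1101), both 1001 (SYN only) and 1101
(SYN plus all data) are acceptable, while 1000 and 1102 are not.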

Signed-off-by: Yuchung Cheng <ycheng@google.com>
---
 net/ipv4/tcp_input.c |   40 +++++++++++++++++++++++++++++++++++-----
 1 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index d6e16e2..5b09f71 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5610,6 +5610,34 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb)
 	}
 }
 
+static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
+				    struct tcp_fastopen_cookie *cookie)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct sk_buff *data = tcp_write_queue_head(sk);
+	u16 mss = tp->rx_opt.mss_clamp;
+
+	if (mss == tp->rx_opt.user_mss) {
+		struct tcp_options_received opt;
+		const u8 *hash_location;
+
+		/* Get original SYNACK MSS value if user MSS sets mss_clamp */
+		tcp_clear_options(&opt);
+		opt.user_mss = opt.mss_clamp = 0;
+		tcp_parse_options(synack, &opt, &hash_location, 0, NULL);
+		mss = opt.mss_clamp;
+	}
+
+	tcp_fastopen_cache_set(sk, &mss, cookie);
+
+	if (data) { /* Retransmit unacked data in SYN */
+		tcp_retransmit_skb(sk, data);
+		tcp_rearm_rto(sk);
+		return true;
+	}
+	return false;
+}
+
 static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 					 const struct tcphdr *th, unsigned int len)
 {
@@ -5617,9 +5645,10 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct tcp_cookie_values *cvp = tp->cookie_values;
+	struct tcp_fastopen_cookie foc = { .len = -1 };
 	int saved_clamp = tp->rx_opt.mss_clamp;
 
-	tcp_parse_options(skb, &tp->rx_opt, &hash_location, 0, NULL);
+	tcp_parse_options(skb, &tp->rx_opt, &hash_location, 0, &foc);
 
 	if (th->ack) {
 		/* rfc793:
@@ -5629,11 +5658,9 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 		 *	  If SEG.ACK =< ISS, or SEG.ACK > SND.NXT, send
 		 *        a reset (unless the RST bit is set, if so drop
 		 *        the segment and return)"
-		 *
-		 *  We do not send data with SYN, so that RFC-correct
-		 *  test reduces to:
 		 */
-		if (TCP_SKB_CB(skb)->ack_seq != tp->snd_nxt)
+		if (!after(TCP_SKB_CB(skb)->ack_seq, tp->snd_una) ||
+		    after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt))
 			goto reset_and_undo;
 
 		if (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr &&
@@ -5745,6 +5772,9 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 
 		tcp_finish_connect(sk, skb);
 
+		if (tp->syn_fastopen && tcp_rcv_fastopen_synack(sk, skb, &foc))
+			return -1;
+
 		if (sk->sk_write_pending ||
 		    icsk->icsk_accept_queue.rskq_defer_accept ||
 		    icsk->icsk_ack.pingpong) {
-- 
1.7.7.3

* [PATCH 2/7] net-tcp: Fast Open client - cookie cache
From: Yuchung Cheng @ 2012-07-16 21:16 UTC (permalink / raw)
  To: davem, hkchu, edumazet, ncardwell; +Cc: sivasankar, netdev, Yuchung Cheng
In-Reply-To: <1342473410-6265-1-git-send-email-ycheng@google.com>

The Fast Open cookie cache is used by a TCP Fast Open client to store
remote servers' Fast Open cookies. It stores one Fast Open cookie
per IP (v4 or v6) and by default 1024 cookies total. The size is
tunable via /proc/sys/net/ipv4/tcp_fastopen_cookies. Setting it to 0
will flush the cache.

The inetpeer cache also caches remote peers' information, but inactive
cache entries are recycled on the scale of minutes. Separate storage is
therefore required, although the lookup is still done via inetpeer.
Each inetpeer entry holds a cookie cache entry pointer (if TFO is used
on that IP). On cache write, the cookie cache entry is allocated and
stored in a list for LRU replacement. A spinlock protects any R/W
operation on the cookie cache entry and the list.
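
For readers who want to experiment with the replacement policy outside
the kernel, here is a minimal userspace sketch. It is not the patch's
implementation: the patch keeps entries on a linked list whose head is
the least recently used, while this sketch swaps in a fixed array with a
logical clock to stay short. The names cookie_entry, cache_get and
cache_set are hypothetical, the peer key is a plain integer standing in
for the inetpeer lookup, and locking is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_CAPACITY 2	/* tiny on purpose; the patch defaults to 1024 */

struct cookie_entry {
	uint32_t peer;		/* stand-in for the inetpeer lookup key */
	uint16_t mss;		/* cached MSS for this peer */
	int used;		/* slot holds a valid entry */
	uint64_t last_use;	/* logical clock; smallest value is the LRU */
};

static struct cookie_entry cache[CACHE_CAPACITY];
static uint64_t clock_now;

/* Return the entry for peer, or NULL. A hit refreshes its LRU position. */
static struct cookie_entry *cache_get(uint32_t peer)
{
	for (size_t i = 0; i < CACHE_CAPACITY; i++) {
		if (cache[i].used && cache[i].peer == peer) {
			cache[i].last_use = ++clock_now;
			return &cache[i];
		}
	}
	return NULL;
}

/* Insert or update peer, evicting the least recently used entry if full. */
static void cache_set(uint32_t peer, uint16_t mss)
{
	struct cookie_entry *e = cache_get(peer);

	if (e == NULL) {
		struct cookie_entry *victim = &cache[0];

		for (size_t i = 0; i < CACHE_CAPACITY; i++) {
			if (!cache[i].used) {	/* free slot: no eviction */
				victim = &cache[i];
				break;
			}
			if (cache[i].last_use < victim->last_use)
				victim = &cache[i];
		}
		e = victim;
		e->used = 1;
		e->peer = peer;
	}
	assert(e != NULL);
	e->mss = mss;
	e->last_use = ++clock_now;
}
```

With a capacity of two, storing peers 1 and 2, reading peer 1, and then
storing peer 3 evicts peer 2, because the read made peer 1 the most
recently used entry.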

Signed-off-by: Yuchung Cheng <ycheng@google.com>
---
 include/net/inetpeer.h     |    2 +
 include/net/tcp.h          |    6 ++
 net/ipv4/inetpeer.c        |    2 +
 net/ipv4/sysctl_net_ipv4.c |   26 ++++++++
 net/ipv4/tcp_fastopen.c    |  140 ++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 176 insertions(+), 0 deletions(-)

diff --git a/include/net/inetpeer.h b/include/net/inetpeer.h
index 53f464d..0240709 100644
--- a/include/net/inetpeer.h
+++ b/include/net/inetpeer.h
@@ -11,6 +11,7 @@
 #include <linux/init.h>
 #include <linux/jiffies.h>
 #include <linux/spinlock.h>
+#include <linux/tcp.h>
 #include <linux/rtnetlink.h>
 #include <net/ipv6.h>
 #include <linux/atomic.h>
@@ -53,6 +54,7 @@ struct inet_peer {
 		struct rcu_head         rcu;
 		struct inet_peer	*gc_next;
 	};
+	struct fastopen_entry		*fastopen;
 
 	/* following fields might be frequently dirtied */
 	__u32			dtime;	/* the time of last use of not referenced entries */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 87f486f..4b29688 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -385,6 +385,12 @@ enum tcp_tw_status {
 	TCP_TW_SYN = 3
 };
 
+/* From tcp_fastopen.c */
+extern int tcp_fastopen_size_cache(int size);
+extern void tcp_fastopen_cache_get(struct sock *sk, u16 *mss,
+				   struct tcp_fastopen_cookie *cookie);
+extern void tcp_fastopen_cache_set(struct sock *sk, u16 *mss,
+				   struct tcp_fastopen_cookie *cookie);
 
 extern enum tcp_tw_status tcp_timewait_state_process(struct inet_timewait_sock *tw,
 						     struct sk_buff *skb,
diff --git a/net/ipv4/inetpeer.c b/net/ipv4/inetpeer.c
index e1e0a4e..9151ddc 100644
--- a/net/ipv4/inetpeer.c
+++ b/net/ipv4/inetpeer.c
@@ -63,6 +63,7 @@
  *		   usually under some other lock to prevent node disappearing
  *		daddr: unchangeable
  *		ip_id_count: atomic value (no lock needed)
+ *		fastopen_entry: TCP Fast Open cookie cache. See tcp_fastopen.c
  */
 
 static struct kmem_cache *peer_cachep __read_mostly;
@@ -511,6 +512,7 @@ relookup:
 		p->metrics[RTAX_LOCK-1] = INETPEER_METRICS_NEW;
 		p->rate_tokens = 0;
 		p->rate_last = 0;
+		p->fastopen = NULL;
 		INIT_LIST_HEAD(&p->gc_list);
 
 		/* Link the node. */
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 2a946a19..8d4571a 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -29,6 +29,7 @@
 static int zero;
 static int two = 2;
 static int tcp_retr1_max = 255;
+static int tcp_fastopen_cookies_max = 65536;
 static int ip_local_port_range_min[] = { 1, 1 };
 static int ip_local_port_range_max[] = { 65535, 65535 };
 static int tcp_adv_win_scale_min = -31;
@@ -220,6 +221,25 @@ static int ipv4_tcp_mem(ctl_table *ctl, int write,
 	return 0;
 }
 
+static int proc_tcp_fastopen_cache_size(ctl_table *ctl, int write,
+					void __user *buffer, size_t *lenp,
+					loff_t *ppos)
+{
+	int ret;
+	int max_cookies = tcp_fastopen_size_cache(-1);
+	ctl_table tbl = {
+		.data = &max_cookies,
+		.maxlen = sizeof(max_cookies),
+		.extra1 = &zero,
+		.extra2 = &tcp_fastopen_cookies_max,
+	};
+
+	ret = proc_dointvec_minmax(&tbl, write, buffer, lenp, ppos);
+	if (write && ret == 0)
+		tcp_fastopen_size_cache(max_cookies);
+	return ret;
+}
+
 static struct ctl_table ipv4_table[] = {
 	{
 		.procname	= "tcp_timestamps",
@@ -374,6 +394,12 @@ static struct ctl_table ipv4_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 	{
+		.procname	= "tcp_fastopen_cookies",
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_tcp_fastopen_cache_size,
+	},
+	{
 		.procname	= "tcp_tw_recycle",
 		.data		= &tcp_death_row.sysctl_tw_recycle,
 		.maxlen		= sizeof(int),
diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
index a7f729c..40fdf21 100644
--- a/net/ipv4/tcp_fastopen.c
+++ b/net/ipv4/tcp_fastopen.c
@@ -1,10 +1,150 @@
 #include <linux/init.h>
 #include <linux/kernel.h>
+#include <linux/list.h>
+#include <linux/tcp.h>
+#include <net/inetpeer.h>
 
 int sysctl_tcp_fastopen;
 
+/* The Fast Open cookie cache is used by a TCP Fast Open client to store
+ * remote servers' Fast Open cookies. It stores one Fast Open cookie
+ * per IP and by default 1024 cookies total. The size is tunable via
+ * /proc/sys/net/ipv4/tcp_fastopen_cookies. Setting it to 0 will flush
+ * the cache.
+ *
+ * The inetpeer cache also caches remote peers' information but the
+ * inactive cache entries are recycled on the scale of minutes. Therefore
+ * a separate storage is required but the lookup is done via inetpeer.
+ * Each inetpeer entry holds a cookie cache entry pointer (if TFO is used
+ * on that IP). On cache write, the cookie cache entry is allocated and
+ * stored in a list for LRU replacement. A spinlock protects any R/W
+ * operation on the cookie cache entry and the list.
+ */
+struct fastopen_entry {
+	u16	mss;			/* TCP MSS value */
+	struct	tcp_fastopen_cookie	cookie;	/* TCP Fast Open cookie */
+	struct	list_head	lru_list;	/* cookie cache lru_list node */
+	struct	inet_peer	*peer;	/* inetpeer entry (for fast lookup) */
+};
+
+static struct tcp_fastopen_cookie_cache {
+	spinlock_t lock;		/* for lru_list, cnt, size, entries */
+	struct list_head lru_list;	/* head is the least recently used */
+	int cnt;			/* size of lru_list */
+	int size;			/* cache capacity */
+} cookie_cache;
+
+/* Evict the LRU entry if cache is full. Caller must hold cookie_cache.lock */
+static struct fastopen_entry *__tcp_fastopen_remove_lru(void)
+{
+	struct fastopen_entry *entry;
+
+	if (cookie_cache.cnt <= cookie_cache.size || cookie_cache.cnt <= 0)
+		return NULL;
+
+	entry = list_first_entry(&cookie_cache.lru_list,
+				 struct fastopen_entry, lru_list);
+	list_del_init(&entry->lru_list);
+	--cookie_cache.cnt;
+	entry->peer->fastopen = NULL;
+	return entry;
+}
+
+int tcp_fastopen_size_cache(int size)
+{
+	while (size >= 0) {
+		struct fastopen_entry *lru_entry;
+
+		spin_lock_bh(&cookie_cache.lock);
+		cookie_cache.size = size;
+		lru_entry = __tcp_fastopen_remove_lru();
+		spin_unlock_bh(&cookie_cache.lock);
+
+		if (lru_entry == NULL)
+			break;
+		inet_putpeer(lru_entry->peer);
+		kfree(lru_entry);
+	}
+	return cookie_cache.size;
+}
+
+static struct inet_peer *tcp_fastopen_inetpeer(struct sock *sk, int create)
+{
+	struct net *net = dev_net(__sk_dst_get(sk)->dev);
+
+	if (sk->sk_family == AF_INET)
+		return inet_getpeer_v4(net->ipv4.peers,
+				       inet_sk(sk)->inet_daddr, create);
+	else if (sk->sk_family == AF_INET6)
+		return inet_getpeer_v6(net->ipv6.peers,
+				       &inet6_sk(sk)->daddr, create);
+	return NULL;
+}
+
+void tcp_fastopen_cache_get(struct sock *sk, u16 *mss,
+			    struct tcp_fastopen_cookie *cookie)
+{
+	struct inet_peer *peer = tcp_fastopen_inetpeer(sk, 0);
+	struct fastopen_entry *entry;
+
+	if (peer == NULL)
+		return;
+
+	spin_lock_bh(&cookie_cache.lock);
+	entry = peer->fastopen;
+	if (entry != NULL) {
+		*mss = entry->mss;
+		*cookie = entry->cookie;
+		list_move_tail(&entry->lru_list, &cookie_cache.lru_list);
+	}
+	spin_unlock_bh(&cookie_cache.lock);
+
+	inet_putpeer(peer);
+}
+
+void tcp_fastopen_cache_set(struct sock *sk, u16 *mss,
+			    struct tcp_fastopen_cookie *cookie)
+{
+	struct inet_peer *peer = tcp_fastopen_inetpeer(sk, 1);
+	struct fastopen_entry *entry = NULL, *new_entry = NULL;
+
+	if (peer == NULL)
+		return;
+
+	spin_lock_bh(&cookie_cache.lock);
+	if (peer->fastopen == NULL) {
+		new_entry = kmalloc(sizeof(struct fastopen_entry), GFP_ATOMIC);
+		if (new_entry == NULL) {
+			spin_unlock_bh(&cookie_cache.lock);
+			goto out;
+		}
+		new_entry->peer = peer;
+		INIT_LIST_HEAD(&new_entry->lru_list);
+		peer->fastopen = new_entry;
+		++cookie_cache.cnt;
+	}
+	entry = peer->fastopen;
+	entry->mss = *mss;
+	if (cookie->len > 0)
+		entry->cookie = *cookie;
+	list_move_tail(&entry->lru_list, &cookie_cache.lru_list);
+	entry = __tcp_fastopen_remove_lru();
+	spin_unlock_bh(&cookie_cache.lock);
+
+	if (entry) {
+		inet_putpeer(entry->peer);
+		kfree(entry);
+	}
+out:
+	if (new_entry == NULL)
+		inet_putpeer(peer);
+}
+
 static int __init tcp_fastopen_init(void)
 {
+	INIT_LIST_HEAD(&cookie_cache.lru_list);
+	spin_lock_init(&cookie_cache.lock);
+	cookie_cache.size = 1024;
 	return 0;
 }
 
-- 
1.7.7.3

* [PATCH 3/7] net-tcp: Fast Open client - sending SYN-data
From: Yuchung Cheng @ 2012-07-16 21:16 UTC (permalink / raw)
  To: davem, hkchu, edumazet, ncardwell; +Cc: sivasankar, netdev, Yuchung Cheng
In-Reply-To: <1342473410-6265-1-git-send-email-ycheng@google.com>

This patch implements sending SYN-data in tcp_connect(). The data is
from tcp_sendmsg() with flag MSG_FASTOPEN (implemented in a later patch).

The length of the cookie in tcp_fastopen_req, init'd to 0, controls the
type of the SYN. If the cookie is not cached (len == 0), the host sends
a data-less SYN with the Fast Open cookie request option to solicit a
cookie from the remote. If the cookie is available (len > 0), the host
sends a SYN-data with the Fast Open cookie option. If the cookie length
is negative, the SYN will not include any Fast Open option (for fall
back operations).

To deal with middleboxes that may drop SYN with data or experimental TCP
option, the SYN-data is only sent once. SYN retransmits do not include
data or Fast Open options. The connection will fall back to regular TCP
handshake.
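
The three cases keyed on the cookie length can be summarized in a small
userspace sketch. This is only an illustration of the rule described
above, not code from the patch: the enum syn_kind and the function
classify_syn are hypothetical names, and the upper bound of 16 is
TCP_FASTOPEN_COOKIE_MAX from patch 1/7.

```c
#include <assert.h>

enum syn_kind {
	SYN_NO_FASTOPEN,	/* len < 0: no Fast Open option at all */
	SYN_COOKIE_REQUEST,	/* len == 0: data-less SYN asking for a cookie */
	SYN_DATA_WITH_COOKIE	/* len > 0: SYN carrying data plus the cookie */
};

/* Map the cookie length held in the request to the kind of SYN sent. */
static enum syn_kind classify_syn(int cookie_len)
{
	assert(cookie_len <= 16);	/* TCP_FASTOPEN_COOKIE_MAX */

	if (cookie_len < 0)
		return SYN_NO_FASTOPEN;
	if (cookie_len == 0)
		return SYN_COOKIE_REQUEST;
	return SYN_DATA_WITH_COOKIE;
}
```

Setting the length to -1 after the first SYN, as tcp_send_syn_data()
does, is what makes SYN retransmits fall into the first case.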

Signed-off-by: Yuchung Cheng <ycheng@google.com>
---
 include/linux/snmp.h  |    3 +-
 include/linux/tcp.h   |    6 ++-
 include/net/tcp.h     |    9 ++++
 net/ipv4/af_inet.c    |    7 +++
 net/ipv4/proc.c       |    1 +
 net/ipv4/tcp_output.c |  115 +++++++++++++++++++++++++++++++++++++++++++++----
 6 files changed, 130 insertions(+), 11 deletions(-)

diff --git a/include/linux/snmp.h b/include/linux/snmp.h
index 2e68f5b..c0a34c6 100644
--- a/include/linux/snmp.h
+++ b/include/linux/snmp.h
@@ -233,7 +233,8 @@ enum
 	LINUX_MIB_TCPREQQFULLDOCOOKIES,		/* TCPReqQFullDoCookies */
 	LINUX_MIB_TCPREQQFULLDROP,		/* TCPReqQFullDrop */
 	LINUX_MIB_TCPRETRANSFAIL,		/* TCPRetransFail */
-	LINUX_MIB_TCPRCVCOALESCE,			/* TCPRcvCoalesce */
+	LINUX_MIB_TCPRCVCOALESCE,		/* TCPRcvCoalesce */
+	LINUX_MIB_TCPFASTOPENACTIVE,		/* TCPFastOpenActive */
 	__LINUX_MIB_MAX
 };
 
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 12948f5..1edf96a 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -386,7 +386,8 @@ struct tcp_sock {
 		unused      : 1;
 	u8	repair_queue;
 	u8	do_early_retrans:1,/* Enable RFC5827 early-retransmit  */
-		early_retrans_delayed:1; /* Delayed ER timer installed */
+		early_retrans_delayed:1, /* Delayed ER timer installed */
+		syn_fastopen:1;	/* SYN includes Fast Open option */
 
 /* RTT measurement */
 	u32	srtt;		/* smoothed round trip time << 3	*/
@@ -500,6 +501,9 @@ struct tcp_sock {
 	struct tcp_md5sig_info	__rcu *md5sig_info;
 #endif
 
+/* TCP fastopen related information */
+	struct tcp_fastopen_request *fastopen_req;
+
 	/* When the cookie options are generated and exchanged, then this
 	 * object holds a reference to them (cookie_values->kref).  Also
 	 * contains related tcp_cookie_transactions fields.
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 4b29688..2d3b09d2 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1290,6 +1290,15 @@ extern int tcp_md5_hash_skb_data(struct tcp_md5sig_pool *, const struct sk_buff
 extern int tcp_md5_hash_key(struct tcp_md5sig_pool *hp,
 			    const struct tcp_md5sig_key *key);
 
+struct tcp_fastopen_request {
+	/* Fast Open cookie. Size 0 means a cookie request */
+	struct tcp_fastopen_cookie	cookie;
+	struct msghdr			*data;  /* data in MSG_FASTOPEN */
+	u16				copied;	/* queued in tcp_connect() */
+};
+
+void tcp_free_fastopen_req(struct tcp_sock *tp);
+
 /* write queue abstraction */
 static inline void tcp_write_queue_purge(struct sock *sk)
 {
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 07a02f6..6ef67b7 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -558,9 +558,14 @@ EXPORT_SYMBOL(inet_dgram_connect);
 
 static long inet_wait_for_connect(struct sock *sk, long timeo)
 {
+	const bool write = (sk->sk_protocol == IPPROTO_TCP) &&
+			   tcp_sk(sk)->fastopen_req &&
+			   tcp_sk(sk)->fastopen_req->data;
 	DEFINE_WAIT(wait);
 
 	prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
+	if (write)
+		sk->sk_write_pending++;
 
 	/* Basic assumption: if someone sets sk->sk_err, he _must_
 	 * change state of the socket from TCP_SYN_*.
@@ -576,6 +581,8 @@ static long inet_wait_for_connect(struct sock *sk, long timeo)
 		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
 	}
 	finish_wait(sk_sleep(sk), &wait);
+	if (write)
+		sk->sk_write_pending--;
 	return timeo;
 }
 
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 8af0d44..8f8c842 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -258,6 +258,7 @@ static const struct snmp_mib snmp4_net_list[] = {
 	SNMP_MIB_ITEM("TCPReqQFullDrop", LINUX_MIB_TCPREQQFULLDROP),
 	SNMP_MIB_ITEM("TCPRetransFail", LINUX_MIB_TCPRETRANSFAIL),
 	SNMP_MIB_ITEM("TCPRcvCoalesce", LINUX_MIB_TCPRCVCOALESCE),
+	SNMP_MIB_ITEM("TCPFastOpenActive", LINUX_MIB_TCPFASTOPENACTIVE),
 	SNMP_MIB_SENTINEL
 };
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 4849be7..8869328 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -596,6 +596,7 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb,
 	u8 cookie_size = (!tp->rx_opt.cookie_out_never && cvp != NULL) ?
 			 tcp_cookie_size_check(cvp->cookie_desired) :
 			 0;
+	struct tcp_fastopen_request *fastopen = tp->fastopen_req;
 
 #ifdef CONFIG_TCP_MD5SIG
 	*md5 = tp->af_specific->md5_lookup(sk, sk);
@@ -636,6 +637,16 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb,
 			remaining -= TCPOLEN_SACKPERM_ALIGNED;
 	}
 
+	if (fastopen && fastopen->cookie.len >= 0) {
+		u32 need = TCPOLEN_EXP_FASTOPEN_BASE + fastopen->cookie.len;
+		need = (need + 3) & ~3U;  /* Align to 32 bits */
+		if (remaining >= need) {
+			opts->options |= OPTION_FAST_OPEN_COOKIE;
+			opts->fastopen_cookie = &fastopen->cookie;
+			remaining -= need;
+			tp->syn_fastopen = 1;
+		}
+	}
 	/* Note that timestamps are required by the specification.
 	 *
 	 * Odd numbers of bytes are prohibited by the specification, ensuring
@@ -2824,6 +2835,96 @@ void tcp_connect_init(struct sock *sk)
 	tcp_clear_retrans(tp);
 }
 
+static void tcp_connect_queue_skb(struct sock *sk, struct sk_buff *skb)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
+
+	tcb->end_seq += skb->len;
+	skb_header_release(skb);
+	__tcp_add_write_queue_tail(sk, skb);
+	sk->sk_wmem_queued += skb->truesize;
+	sk_mem_charge(sk, skb->truesize);
+	tp->write_seq = tcb->end_seq;
+	tp->packets_out += tcp_skb_pcount(skb);
+}
+
+/* Build and send a SYN with data and (cached) Fast Open cookie. However,
+ * queue a data-only packet after the regular SYN, such that regular SYNs
+ * are retransmitted on timeouts. Also if the remote SYN-ACK acknowledges
+ * only the SYN sequence, the data are retransmitted in the first ACK.
+ * If cookie is not cached or other error occurs, falls back to send a
+ * regular SYN with Fast Open cookie request option.
+ */
+static int tcp_send_syn_data(struct sock *sk, struct sk_buff *syn)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct tcp_fastopen_request *fo = tp->fastopen_req;
+	int space, i, err = 0, iovlen = fo->data->msg_iovlen;
+	struct sk_buff *syn_data = NULL, *data;
+
+	tcp_fastopen_cache_get(sk, &tp->rx_opt.mss_clamp, &fo->cookie);
+	if (fo->cookie.len <= 0)
+		goto fallback;
+
+	/* MSS for SYN-data is based on cached MSS and bounded by PMTU and
+	 * user-MSS. Reserve maximum option space for middleboxes that add
+	 * private TCP options. The cost is reduced data space in SYN :(
+	 */
+	if (tp->rx_opt.user_mss && tp->rx_opt.user_mss < tp->rx_opt.mss_clamp)
+		tp->rx_opt.mss_clamp = tp->rx_opt.user_mss;
+	space = tcp_mtu_to_mss(sk, inet_csk(sk)->icsk_pmtu_cookie) -
+		MAX_TCP_OPTION_SPACE;
+
+	syn_data = skb_copy_expand(syn, skb_headroom(syn), space,
+				   sk->sk_allocation);
+	if (syn_data == NULL)
+		goto fallback;
+
+	for (i = 0; i < iovlen && syn_data->len < space; ++i) {
+		struct iovec *iov = &fo->data->msg_iov[i];
+		unsigned char __user *from = iov->iov_base;
+		int len = iov->iov_len;
+
+		if (syn_data->len + len > space)
+			len = space - syn_data->len;
+		else if (i + 1 == iovlen)
+			/* No more data pending in inet_wait_for_connect() */
+			fo->data = NULL;
+
+		if (skb_add_data(syn_data, from, len))
+			goto fallback;
+	}
+
+	/* Queue a data-only packet after the regular SYN for retransmission */
+	data = pskb_copy(syn_data, sk->sk_allocation);
+	if (data == NULL)
+		goto fallback;
+	TCP_SKB_CB(data)->seq++;
+	TCP_SKB_CB(data)->tcp_flags &= ~TCPHDR_SYN;
+	TCP_SKB_CB(data)->tcp_flags = (TCPHDR_ACK|TCPHDR_PSH);
+	tcp_connect_queue_skb(sk, data);
+	fo->copied = data->len;
+
+	if (tcp_transmit_skb(sk, syn_data, 0, sk->sk_allocation) == 0) {
+		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPFASTOPENACTIVE);
+		goto done;
+	}
+	syn_data = NULL;
+
+fallback:
+	/* Send a regular SYN with Fast Open cookie request option */
+	if (fo->cookie.len > 0)
+		fo->cookie.len = 0;
+	err = tcp_transmit_skb(sk, syn, 1, sk->sk_allocation);
+	if (err)
+		tp->syn_fastopen = 0;
+	kfree_skb(syn_data);
+done:
+	fo->cookie.len = -1;  /* Exclude Fast Open option for SYN retries */
+	return err;
+}
+
 /* Build a SYN and send it off. */
 int tcp_connect(struct sock *sk)
 {
@@ -2841,17 +2942,13 @@ int tcp_connect(struct sock *sk)
 	skb_reserve(buff, MAX_TCP_HEADER);
 
 	tcp_init_nondata_skb(buff, tp->write_seq++, TCPHDR_SYN);
+	tp->retrans_stamp = TCP_SKB_CB(buff)->when = tcp_time_stamp;
+	tcp_connect_queue_skb(sk, buff);
 	TCP_ECN_send_syn(sk, buff);
 
-	/* Send it off. */
-	TCP_SKB_CB(buff)->when = tcp_time_stamp;
-	tp->retrans_stamp = TCP_SKB_CB(buff)->when;
-	skb_header_release(buff);
-	__tcp_add_write_queue_tail(sk, buff);
-	sk->sk_wmem_queued += buff->truesize;
-	sk_mem_charge(sk, buff->truesize);
-	tp->packets_out += tcp_skb_pcount(buff);
-	err = tcp_transmit_skb(sk, buff, 1, sk->sk_allocation);
+	/* Send off SYN; include data in Fast Open. */
+	err = tp->fastopen_req ? tcp_send_syn_data(sk, buff) :
+	      tcp_transmit_skb(sk, buff, 1, sk->sk_allocation);
 	if (err == -ECONNREFUSED)
 		return err;
 
-- 
1.7.7.3

* [PATCH 1/7] net-tcp: Fast Open base
From: Yuchung Cheng @ 2012-07-16 21:16 UTC (permalink / raw)
  To: davem, hkchu, edumazet, ncardwell; +Cc: sivasankar, netdev, Yuchung Cheng
In-Reply-To: <1342473410-6265-1-git-send-email-ycheng@google.com>

This patch implements the common code for both the client and server.

1. TCP Fast Open option processing. Since Fast Open does not have an
   option number assigned by IANA yet, it shares the experimental option
   code 254 by implementing draft-ietf-tcpm-experimental-options
   with a 16-bit magic number 0xF989

2. The new sysctl tcp_fastopen

3. A placeholder init function
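
The wire layout described in item 1 can be sketched in userspace as
follows. This is a simplified illustration, not the kernel code: the
function names fastopen_opt_encode and fastopen_opt_cookie_len are
hypothetical, the constants mirror TCPOPT_EXP,
TCPOLEN_EXP_FASTOPEN_BASE and TCPOPT_FASTOPEN_MAGIC from this patch, and
the NOP padding to a 32-bit boundary as well as the SYN flag check are
left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OPT_KIND_EXP		254	/* TCPOPT_EXP */
#define OPT_FASTOPEN_BASE	4	/* kind + length + 2-byte magic */
#define OPT_MAGIC		0xF989	/* TCPOPT_FASTOPEN_MAGIC */
#define COOKIE_MIN		4	/* TCP_FASTOPEN_COOKIE_MIN */
#define COOKIE_MAX		16	/* TCP_FASTOPEN_COOKIE_MAX */

/* Write the option into buf and return its length (before NOP padding). */
static size_t fastopen_opt_encode(uint8_t *buf, const uint8_t *cookie,
				  size_t cookie_len)
{
	assert(cookie_len <= COOKIE_MAX);

	buf[0] = OPT_KIND_EXP;
	buf[1] = (uint8_t)(OPT_FASTOPEN_BASE + cookie_len);
	buf[2] = OPT_MAGIC >> 8;	/* magic in network byte order */
	buf[3] = OPT_MAGIC & 0xFF;
	if (cookie_len > 0)
		memcpy(buf + OPT_FASTOPEN_BASE, cookie, cookie_len);
	return OPT_FASTOPEN_BASE + cookie_len;
}

/* Return the cookie length, or -1 if this is not a valid Fast Open option. */
static int fastopen_opt_cookie_len(const uint8_t *buf, size_t buflen)
{
	int len;

	if (buflen < OPT_FASTOPEN_BASE || buf[0] != OPT_KIND_EXP)
		return -1;
	/* Option size must cover the base, fit the buffer, and be even. */
	if (buf[1] < OPT_FASTOPEN_BASE || (size_t)buf[1] > buflen ||
	    (buf[1] & 1))
		return -1;
	if (((buf[2] << 8) | buf[3]) != OPT_MAGIC)
		return -1;
	len = buf[1] - OPT_FASTOPEN_BASE;
	/* Zero length is a cookie request; otherwise enforce the bounds. */
	if (len != 0 && (len < COOKIE_MIN || len > COOKIE_MAX))
		return -1;
	return len;
}
```

An 8-byte cookie therefore yields a 12-byte option starting with the
bytes 254, 12, 0xF9, 0x89, and an empty cookie yields the 4-byte cookie
request.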

Signed-off-by: Yuchung Cheng <ycheng@google.com>
---
 include/linux/tcp.h        |   10 ++++++++++
 include/net/tcp.h          |    9 ++++++++-
 net/ipv4/Makefile          |    2 +-
 net/ipv4/syncookies.c      |    2 +-
 net/ipv4/sysctl_net_ipv4.c |    7 +++++++
 net/ipv4/tcp_fastopen.c    |   11 +++++++++++
 net/ipv4/tcp_input.c       |   26 ++++++++++++++++++++++----
 net/ipv4/tcp_ipv4.c        |    2 +-
 net/ipv4/tcp_minisocks.c   |    4 ++--
 net/ipv4/tcp_output.c      |   25 +++++++++++++++++++++----
 net/ipv6/syncookies.c      |    2 +-
 net/ipv6/tcp_ipv6.c        |    2 +-
 12 files changed, 86 insertions(+), 16 deletions(-)
 create mode 100644 net/ipv4/tcp_fastopen.c

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 1888169..12948f5 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -243,6 +243,16 @@ static inline unsigned int tcp_optlen(const struct sk_buff *skb)
 	return (tcp_hdr(skb)->doff - 5) * 4;
 }
 
+/* TCP Fast Open */
+#define TCP_FASTOPEN_COOKIE_MIN	4	/* Min Fast Open Cookie size in bytes */
+#define TCP_FASTOPEN_COOKIE_MAX	16	/* Max Fast Open Cookie size in bytes */
+
+/* TCP Fast Open Cookie as stored in memory */
+struct tcp_fastopen_cookie {
+	s8	len;
+	u8	val[TCP_FASTOPEN_COOKIE_MAX];
+};
+
 /* This defines a selective acknowledgement block. */
 struct tcp_sack_block_wire {
 	__be32	start_seq;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 439984b..87f486f 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -170,6 +170,11 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
 #define TCPOPT_TIMESTAMP	8	/* Better RTT estimations/PAWS */
 #define TCPOPT_MD5SIG		19	/* MD5 Signature (RFC2385) */
 #define TCPOPT_COOKIE		253	/* Cookie extension (experimental) */
+#define TCPOPT_EXP		254	/* Experimental */
+/* Magic number to be after the option value for sharing TCP
+ * experimental options. See draft-ietf-tcpm-experimental-options-00.txt
+ */
+#define TCPOPT_FASTOPEN_MAGIC	0xF989
 
 /*
  *     TCP option lengths
@@ -180,6 +185,7 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
 #define TCPOLEN_SACK_PERM      2
 #define TCPOLEN_TIMESTAMP      10
 #define TCPOLEN_MD5SIG         18
+#define TCPOLEN_EXP_FASTOPEN_BASE  4
 #define TCPOLEN_COOKIE_BASE    2	/* Cookie-less header extension */
 #define TCPOLEN_COOKIE_PAIR    3	/* Cookie pair header extension */
 #define TCPOLEN_COOKIE_MIN     (TCPOLEN_COOKIE_BASE+TCP_COOKIE_MIN)
@@ -222,6 +228,7 @@ extern int sysctl_tcp_retries1;
 extern int sysctl_tcp_retries2;
 extern int sysctl_tcp_orphan_retries;
 extern int sysctl_tcp_syncookies;
+extern int sysctl_tcp_fastopen;
 extern int sysctl_tcp_retrans_collapse;
 extern int sysctl_tcp_stdurg;
 extern int sysctl_tcp_rfc1337;
@@ -417,7 +424,7 @@ extern int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		       size_t len, int nonblock, int flags, int *addr_len);
 extern void tcp_parse_options(const struct sk_buff *skb,
 			      struct tcp_options_received *opt_rx, const u8 **hvpp,
-			      int estab);
+			      int estab, struct tcp_fastopen_cookie *foc);
 extern const u8 *tcp_parse_md5sig_option(const struct tcphdr *th);
 
 /*
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index 5a23e8b..63e6995 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -7,7 +7,7 @@ obj-y     := route.o inetpeer.o protocol.o \
 	     ip_output.o ip_sockglue.o inet_hashtables.o \
 	     inet_timewait_sock.o inet_connection_sock.o \
 	     tcp.o tcp_input.o tcp_output.o tcp_timer.o tcp_ipv4.o \
-	     tcp_minisocks.o tcp_cong.o tcp_metrics.o \
+	     tcp_minisocks.o tcp_cong.o tcp_metrics.o tcp_fastopen.o \
 	     datagram.o raw.o udp.o udplite.o \
 	     arp.o icmp.o devinet.o af_inet.o  igmp.o \
 	     fib_frontend.o fib_semantics.o fib_trie.o \
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index eab2a7f..650e152 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -293,7 +293,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
 
 	/* check for timestamp cookie support */
 	memset(&tcp_opt, 0, sizeof(tcp_opt));
-	tcp_parse_options(skb, &tcp_opt, &hash_location, 0);
+	tcp_parse_options(skb, &tcp_opt, &hash_location, 0, NULL);
 
 	if (!cookie_check_timestamp(&tcp_opt, &ecn_ok))
 		goto out;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 70730f7..2a946a19 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -367,6 +367,13 @@ static struct ctl_table ipv4_table[] = {
 	},
 #endif
 	{
+		.procname	= "tcp_fastopen",
+		.data		= &sysctl_tcp_fastopen,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
 		.procname	= "tcp_tw_recycle",
 		.data		= &tcp_death_row.sysctl_tw_recycle,
 		.maxlen		= sizeof(int),
diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
new file mode 100644
index 0000000..a7f729c
--- /dev/null
+++ b/net/ipv4/tcp_fastopen.c
@@ -0,0 +1,11 @@
+#include <linux/init.h>
+#include <linux/kernel.h>
+
+int sysctl_tcp_fastopen;
+
+static int __init tcp_fastopen_init(void)
+{
+	return 0;
+}
+
+late_initcall(tcp_fastopen_init);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 055ac49..d6e16e2 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3729,7 +3729,8 @@ old_ack:
  * the fast version below fails.
  */
 void tcp_parse_options(const struct sk_buff *skb, struct tcp_options_received *opt_rx,
-		       const u8 **hvpp, int estab)
+		       const u8 **hvpp, int estab,
+		       struct tcp_fastopen_cookie *foc)
 {
 	const unsigned char *ptr;
 	const struct tcphdr *th = tcp_hdr(skb);
@@ -3836,8 +3837,25 @@ void tcp_parse_options(const struct sk_buff *skb, struct tcp_options_received *o
 					break;
 				}
 				break;
-			}
 
+			case TCPOPT_EXP:
+				/* Fast Open option shares code 254 using a
+				 * 16 bits magic number. It's valid only in
+				 * SYN or SYN-ACK with an even size.
+				 */
+				if (opsize < TCPOLEN_EXP_FASTOPEN_BASE ||
+				    get_unaligned_be16(ptr) != TCPOPT_FASTOPEN_MAGIC ||
+				    foc == NULL || !th->syn || (opsize & 1))
+					break;
+				foc->len = opsize - TCPOLEN_EXP_FASTOPEN_BASE;
+				if (foc->len >= TCP_FASTOPEN_COOKIE_MIN &&
+				    foc->len <= TCP_FASTOPEN_COOKIE_MAX)
+					memcpy(foc->val, ptr + 2, foc->len);
+				else if (foc->len != 0)
+					foc->len = -1;
+				break;
+
+			}
 			ptr += opsize-2;
 			length -= opsize;
 		}
@@ -3879,7 +3897,7 @@ static bool tcp_fast_parse_options(const struct sk_buff *skb,
 		if (tcp_parse_aligned_timestamp(tp, th))
 			return true;
 	}
-	tcp_parse_options(skb, &tp->rx_opt, hvpp, 1);
+	tcp_parse_options(skb, &tp->rx_opt, hvpp, 1, NULL);
 	return true;
 }
 
@@ -5601,7 +5619,7 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 	struct tcp_cookie_values *cvp = tp->cookie_values;
 	int saved_clamp = tp->rx_opt.mss_clamp;
 
-	tcp_parse_options(skb, &tp->rx_opt, &hash_location, 0);
+	tcp_parse_options(skb, &tp->rx_opt, &hash_location, 0, NULL);
 
 	if (th->ack) {
 		/* rfc793:
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 7a0062c..588200e 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1314,7 +1314,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 	tcp_clear_options(&tmp_opt);
 	tmp_opt.mss_clamp = TCP_MSS_DEFAULT;
 	tmp_opt.user_mss  = tp->rx_opt.user_mss;
-	tcp_parse_options(skb, &tmp_opt, &hash_location, 0);
+	tcp_parse_options(skb, &tmp_opt, &hash_location, 0, NULL);
 
 	if (tmp_opt.cookie_plus > 0 &&
 	    tmp_opt.saw_tstamp &&
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index c66f2ed..5912ac3 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -97,7 +97,7 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
 
 	tmp_opt.saw_tstamp = 0;
 	if (th->doff > (sizeof(*th) >> 2) && tcptw->tw_ts_recent_stamp) {
-		tcp_parse_options(skb, &tmp_opt, &hash_location, 0);
+		tcp_parse_options(skb, &tmp_opt, &hash_location, 0, NULL);
 
 		if (tmp_opt.saw_tstamp) {
 			tmp_opt.ts_recent	= tcptw->tw_ts_recent;
@@ -534,7 +534,7 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 
 	tmp_opt.saw_tstamp = 0;
 	if (th->doff > (sizeof(struct tcphdr)>>2)) {
-		tcp_parse_options(skb, &tmp_opt, &hash_location, 0);
+		tcp_parse_options(skb, &tmp_opt, &hash_location, 0, NULL);
 
 		if (tmp_opt.saw_tstamp) {
 			tmp_opt.ts_recent = req->ts_recent;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 15a7c7b..4849be7 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -385,15 +385,17 @@ static inline bool tcp_urg_mode(const struct tcp_sock *tp)
 #define OPTION_MD5		(1 << 2)
 #define OPTION_WSCALE		(1 << 3)
 #define OPTION_COOKIE_EXTENSION	(1 << 4)
+#define OPTION_FAST_OPEN_COOKIE	(1 << 8)
 
 struct tcp_out_options {
-	u8 options;		/* bit field of OPTION_* */
+	u16 options;		/* bit field of OPTION_* */
+	u16 mss;		/* 0 to disable */
 	u8 ws;			/* window scale, 0 to disable */
 	u8 num_sack_blocks;	/* number of SACK blocks to include */
 	u8 hash_size;		/* bytes in hash_location */
-	u16 mss;		/* 0 to disable */
-	__u32 tsval, tsecr;	/* need to include OPTION_TS */
 	__u8 *hash_location;	/* temporary pointer, overloaded */
+	__u32 tsval, tsecr;	/* need to include OPTION_TS */
+	struct tcp_fastopen_cookie *fastopen_cookie;	/* Fast open cookie */
 };
 
 /* The sysctl int routines are generic, so check consistency here.
@@ -442,7 +444,7 @@ static u8 tcp_cookie_size_check(u8 desired)
 static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
 			      struct tcp_out_options *opts)
 {
-	u8 options = opts->options;	/* mungable copy */
+	u16 options = opts->options;	/* mungable copy */
 
 	/* Having both authentication and cookies for security is redundant,
 	 * and there's certainly not enough room.  Instead, the cookie-less
@@ -564,6 +566,21 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
 
 		tp->rx_opt.dsack = 0;
 	}
+
+	if (unlikely(OPTION_FAST_OPEN_COOKIE & options)) {
+		struct tcp_fastopen_cookie *foc = opts->fastopen_cookie;
+
+		*ptr++ = htonl((TCPOPT_EXP << 24) |
+			       ((TCPOLEN_EXP_FASTOPEN_BASE + foc->len) << 16) |
+			       TCPOPT_FASTOPEN_MAGIC);
+
+		memcpy(ptr, foc->val, foc->len);
+		if ((foc->len & 3) == 2) {
+			u8 *align = ((u8 *)ptr) + foc->len;
+			align[0] = align[1] = TCPOPT_NOP;
+		}
+		ptr += (foc->len + 3) >> 2;
+	}
 }
 
 /* Compute TCP options for SYN packets. This is not the final
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index 7bf3cc4..bb46061 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -177,7 +177,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 
 	/* check for timestamp cookie support */
 	memset(&tcp_opt, 0, sizeof(tcp_opt));
-	tcp_parse_options(skb, &tcp_opt, &hash_location, 0);
+	tcp_parse_options(skb, &tcp_opt, &hash_location, 0, NULL);
 
 	if (!cookie_check_timestamp(&tcp_opt, &ecn_ok))
 		goto out;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 3071f37..41eaeb5 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1062,7 +1062,7 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 	tcp_clear_options(&tmp_opt);
 	tmp_opt.mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) - sizeof(struct ipv6hdr);
 	tmp_opt.user_mss = tp->rx_opt.user_mss;
-	tcp_parse_options(skb, &tmp_opt, &hash_location, 0);
+	tcp_parse_options(skb, &tmp_opt, &hash_location, 0, NULL);
 
 	if (tmp_opt.cookie_plus > 0 &&
 	    tmp_opt.saw_tstamp &&
-- 
1.7.7.3

* [PATCH 7/7] net-tcp: Fast Open client - cookie-less mode
From: Yuchung Cheng @ 2012-07-16 21:16 UTC (permalink / raw)
  To: davem, hkchu, edumazet, ncardwell; +Cc: sivasankar, netdev, Yuchung Cheng
In-Reply-To: <1342473410-6265-1-git-send-email-ycheng@google.com>

In trusted networks (e.g., an intranet or data center), the client does not
need a Fast Open cookie to mitigate DoS attacks. In cookie-less mode,
sendmsg() with the MSG_FASTOPEN flag sends SYN-data regardless
of cookie availability.
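
The resulting client-side decision can be sketched in user space as below.
This is only an illustration, not the kernel code: tfo_can_send_syn_data()
is a hypothetical helper, the two flag values are copied from this series,
and the SYN-loss backoff from patch 6/7 is deliberately left out.

```c
#include <stdbool.h>

/* Bit flags for the tcp_fastopen sysctl; values copied from this series. */
#define TFO_CLIENT_ENABLE	1
#define TFO_CLIENT_NO_COOKIE	4

/*
 * Hypothetical helper (not part of the patch): may the client put data
 * in the SYN, given the sysctl bitmap and the cached cookie length?
 */
static bool tfo_can_send_syn_data(int sysctl_bits, int cached_cookie_len)
{
	/* Without the client bit, MSG_FASTOPEN is rejected outright. */
	if (!(sysctl_bits & TFO_CLIENT_ENABLE))
		return false;
	/* Cookie-less mode: send SYN-data whether or not a cookie is cached. */
	if (sysctl_bits & TFO_CLIENT_NO_COOKIE)
		return true;
	/* Default mode: SYN-data only with a valid cached cookie. */
	return cached_cookie_len > 0;
}
```

With the sysctl set to 5 (1 | 4) the helper returns true even when no cookie
is cached; with 1 it returns true only for a positive cached cookie length.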

Signed-off-by: Yuchung Cheng <ycheng@google.com>
---
 Documentation/networking/ip-sysctl.txt |    2 ++
 include/linux/tcp.h                    |    1 +
 include/net/tcp.h                      |    1 +
 net/ipv4/tcp_input.c                   |    8 ++++++--
 net/ipv4/tcp_output.c                  |    6 +++++-
 5 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index b3c5225..d67d858 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -476,6 +476,8 @@ tcp_fastopen - INTEGER
 
 	The values (bitmap) are:
 	1: Enables sending data in the opening SYN on the client
+	5: Enables sending data in the opening SYN on the client regardless
+	   of cookie availability.
 
 	Default: 0
 
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 1edf96a..9febfb6 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -387,6 +387,7 @@ struct tcp_sock {
 	u8	repair_queue;
 	u8	do_early_retrans:1,/* Enable RFC5827 early-retransmit  */
 		early_retrans_delayed:1, /* Delayed ER timer installed */
+		syn_data:1,	/* SYN includes data */
 		syn_fastopen:1;	/* SYN includes Fast Open option */
 
 /* RTT measurement */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 1b70444..99ec440 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -214,6 +214,7 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
 
 /* Bit Flags for sysctl_tcp_fastopen */
 #define	TFO_CLIENT_ENABLE	1
+#define	TFO_CLIENT_NO_COOKIE	4	/* Data in SYN w/o cookie option */
 
 extern struct inet_timewait_death_row tcp_death_row;
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index aef8514..5915ac2 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5614,7 +5614,7 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
 				    struct tcp_fastopen_cookie *cookie)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	struct sk_buff *data = tcp_write_queue_head(sk);
+	struct sk_buff *data = tp->syn_data ? tcp_write_queue_head(sk) : NULL;
 	u16 mss = tp->rx_opt.mss_clamp;
 	bool syn_drop;
 
@@ -5636,6 +5636,9 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
 	syn_drop = (cookie->len <= 0 && data &&
 		    inet_csk(sk)->icsk_retransmits);
 
+	if (!tp->syn_fastopen)  /* Ignore an unsolicited cookie */
+		cookie->len = -1;
+
 	tcp_fastopen_cache_set(sk, &mss, cookie, syn_drop);
 
 	if (data) { /* Retransmit unacked data in SYN */
@@ -5780,7 +5783,8 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 
 		tcp_finish_connect(sk, skb);
 
-		if (tp->syn_fastopen && tcp_rcv_fastopen_synack(sk, skb, &foc))
+		if ((tp->syn_fastopen || tp->syn_data) &&
+		    tcp_rcv_fastopen_synack(sk, skb, &foc))
 			return -1;
 
 		if (sk->sk_write_pending ||
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index c5cfd5e..27a32ac 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2864,6 +2864,7 @@ static int tcp_send_syn_data(struct sock *sk, struct sk_buff *syn)
 	struct sk_buff *syn_data = NULL, *data;
 	unsigned long last_syn_loss = 0;
 
+	tp->rx_opt.mss_clamp = tp->advmss;  /* If MSS is not cached */
 	tcp_fastopen_cache_get(sk, &tp->rx_opt.mss_clamp, &fo->cookie,
 			       &syn_loss, &last_syn_loss);
 	/* Recurring FO SYN losses: revert to regular handshake temporarily */
@@ -2873,7 +2874,9 @@ static int tcp_send_syn_data(struct sock *sk, struct sk_buff *syn)
 		goto fallback;
 	}
 
-	if (fo->cookie.len <= 0)
+	if (sysctl_tcp_fastopen & TFO_CLIENT_NO_COOKIE)
+		fo->cookie.len = -1;
+	else if (fo->cookie.len <= 0)
 		goto fallback;
 
 	/* MSS for SYN-data is based on cached MSS and bounded by PMTU and
@@ -2916,6 +2919,7 @@ static int tcp_send_syn_data(struct sock *sk, struct sk_buff *syn)
 	fo->copied = data->len;
 
 	if (tcp_transmit_skb(sk, syn_data, 0, sk->sk_allocation) == 0) {
+		tp->syn_data = (fo->copied > 0);
 		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPFASTOPENACTIVE);
 		goto done;
 	}
-- 
1.7.7.3


* [PATCH 6/7] net-tcp: Fast Open client - detecting SYN-data drops
From: Yuchung Cheng @ 2012-07-16 21:16 UTC (permalink / raw)
  To: davem, hkchu, edumazet, ncardwell; +Cc: sivasankar, netdev, Yuchung Cheng
In-Reply-To: <1342473410-6265-1-git-send-email-ycheng@google.com>

On paths with firewalls that drop SYNs carrying data or experimental TCP
options, Fast Open connections will experience SYN timeouts and poor
performance. The solution is to track such incidents in the cookie cache and
disable Fast Open temporarily.

Since only the original SYN includes data and/or the Fast Open option, the
SYN-ACK carries tell-tale signs (see tcp_rcv_fastopen_synack()) that reveal
such drops. If a path has recurring Fast Open SYN drops, Fast Open is
disabled for 2^(recurring_losses) minutes, starting from four minutes and
growing up to roughly one and a half days. sendmsg() with the MSG_FASTOPEN
flag still succeeds, but
it behaves as connect() then write().
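
The backoff arithmetic can be sketched in user space as follows. The helper
names are hypothetical and time is counted in plain seconds instead of
jiffies; the clamp on the shift count is an assumption of this sketch to
keep the shift well defined, and is not something the patch does.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: how long (in seconds) Fast Open stays disabled
 * after `syn_loss` recurring SYN losses. Mirrors the (60*HZ << syn_loss)
 * expression in the patch, with HZ factored out. 0 means "not disabled".
 */
static uint64_t tfo_backoff_seconds(unsigned int syn_loss)
{
	if (syn_loss <= 1)	/* a single loss does not trigger the backoff */
		return 0;
	if (syn_loss > 32)	/* sketch-only clamp, not in the patch */
		syn_loss = 32;
	return (uint64_t)60 << syn_loss;
}

/* Hypothetical helper: is Fast Open still disabled at time `now`? */
static bool tfo_disabled(unsigned int syn_loss, uint64_t now,
			 uint64_t last_syn_loss)
{
	uint64_t window = tfo_backoff_seconds(syn_loss);

	return window != 0 && now < last_syn_loss + window;
}
```

Two recurring losses give 240 seconds (four minutes); eleven give 122880
seconds, a little over 34 hours.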

Signed-off-by: Yuchung Cheng <ycheng@google.com>
---
 include/net/tcp.h       |    6 ++++--
 net/ipv4/tcp_fastopen.c |   16 ++++++++++++++--
 net/ipv4/tcp_input.c    |   10 +++++++++-
 net/ipv4/tcp_output.c   |   13 +++++++++++--
 4 files changed, 38 insertions(+), 7 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 8c20edd..1b70444 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -391,9 +391,11 @@ enum tcp_tw_status {
 /* From tcp_fastopen.c */
 extern int tcp_fastopen_size_cache(int size);
 extern void tcp_fastopen_cache_get(struct sock *sk, u16 *mss,
-				   struct tcp_fastopen_cookie *cookie);
+				   struct tcp_fastopen_cookie *cookie,
+				   int *syn_loss, unsigned long *last_syn_loss);
 extern void tcp_fastopen_cache_set(struct sock *sk, u16 *mss,
-				   struct tcp_fastopen_cookie *cookie);
+				   struct tcp_fastopen_cookie *cookie,
+				   bool syn_lost);
 
 extern enum tcp_tw_status tcp_timewait_state_process(struct inet_timewait_sock *tw,
 						     struct sk_buff *skb,
diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
index 40fdf21..b10e8ca 100644
--- a/net/ipv4/tcp_fastopen.c
+++ b/net/ipv4/tcp_fastopen.c
@@ -22,6 +22,8 @@ int sysctl_tcp_fastopen;
  */
 struct fastopen_entry {
 	u16	mss;			/* TCP MSS value */
+	u16	syn_loss:10;		/* Recurring Fast Open SYN losses */
+	unsigned long	last_syn_loss;	/* Last Fast Open SYN loss */
 	struct	tcp_fastopen_cookie	cookie;	/* TCP Fast Open cookie */
 	struct	list_head	lru_list;	/* cookie cache lru_list node */
 	struct	inet_peer	*peer;	/* inetpeer entry (for fast lookup) */
@@ -82,7 +84,8 @@ static struct inet_peer *tcp_fastopen_inetpeer(struct sock *sk, int create)
 }
 
 void tcp_fastopen_cache_get(struct sock *sk, u32 *mss,
-			    struct tcp_fastopen_cookie *cookie)
+			    struct tcp_fastopen_cookie *cookie,
+			    int *syn_loss, unsigned long *last_syn_loss)
 {
 	struct inet_peer *peer = tcp_fastopen_inetpeer(sk, 0);
 	struct fastopen_entry *entry;
@@ -95,6 +98,8 @@ void tcp_fastopen_cache_get(struct sock *sk, u32 *mss,
 	if (entry != NULL) {
 		*mss = entry->mss;
 		*cookie = entry->cookie;
+		*syn_loss = entry->syn_loss;
+		*last_syn_loss = *syn_loss ? entry->last_syn_loss : 0;
 		list_move_tail(&entry->lru_list, &cookie_cache.lru_list);
 	}
 	spin_unlock_bh(&cookie_cache.lock);
@@ -103,7 +108,7 @@ void tcp_fastopen_cache_get(struct sock *sk, u32 *mss,
 }
 
 void tcp_fastopen_cache_set(struct sock *sk, u32 *mss,
-			    struct tcp_fastopen_cookie *cookie)
+			    struct tcp_fastopen_cookie *cookie, bool syn_lost)
 {
 	struct inet_peer *peer = tcp_fastopen_inetpeer(sk, 1);
 	struct fastopen_entry *entry = NULL, *new_entry = NULL;
@@ -119,6 +124,8 @@ void tcp_fastopen_cache_set(struct sock *sk, u32 *mss,
 			goto out;
 		}
 		new_entry->peer = peer;
+		new_entry->mss = new_entry->cookie.len = 0;
+		new_entry->last_syn_loss = new_entry->syn_loss = 0;
 		INIT_LIST_HEAD(&new_entry->lru_list);
 		peer->fastopen = new_entry;
 		++cookie_cache.cnt;
@@ -127,6 +134,11 @@ void tcp_fastopen_cache_set(struct sock *sk, u32 *mss,
 	entry->mss = *mss;
 	if (cookie->len > 0)
 		entry->cookie = *cookie;
+	if (syn_lost) {
+		++entry->syn_loss;
+		entry->last_syn_loss = jiffies;
+	} else
+		entry->syn_loss = 0;
 	list_move_tail(&entry->lru_list, &cookie_cache.lru_list);
 	entry = __tcp_fastopen_remove_lru();
 	spin_unlock_bh(&cookie_cache.lock);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 5b09f71..aef8514 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5616,6 +5616,7 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *data = tcp_write_queue_head(sk);
 	u16 mss = tp->rx_opt.mss_clamp;
+	bool syn_drop;
 
 	if (mss == tp->rx_opt.user_mss) {
 		struct tcp_options_received opt;
@@ -5628,7 +5629,14 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
 		mss = opt.mss_clamp;
 	}
 
-	tcp_fastopen_cache_set(sk, &mss, cookie);
+	/* The SYN-ACK neither has cookie nor acknowledges the data. Presumably
+	 * the remote receives only the retransmitted (regular) SYNs: either
+	 * the original SYN-data or the corresponding SYN-ACK is lost.
+	 */
+	syn_drop = (cookie->len <= 0 && data &&
+		    inet_csk(sk)->icsk_retransmits);
+
+	tcp_fastopen_cache_set(sk, &mss, cookie, syn_drop);
 
 	if (data) { /* Retransmit unacked data in SYN */
 		tcp_retransmit_skb(sk, data);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 8869328..c5cfd5e 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2860,10 +2860,19 @@ static int tcp_send_syn_data(struct sock *sk, struct sk_buff *syn)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct tcp_fastopen_request *fo = tp->fastopen_req;
-	int space, i, err = 0, iovlen = fo->data->msg_iovlen;
+	int syn_loss = 0, space, i, err = 0, iovlen = fo->data->msg_iovlen;
 	struct sk_buff *syn_data = NULL, *data;
+	unsigned long last_syn_loss = 0;
+
+	tcp_fastopen_cache_get(sk, &tp->rx_opt.mss_clamp, &fo->cookie,
+			       &syn_loss, &last_syn_loss);
+	/* Recurring FO SYN losses: revert to regular handshake temporarily */
+	if (syn_loss > 1 &&
+	    time_before(jiffies, last_syn_loss + (60*HZ << syn_loss))) {
+		fo->cookie.len = -1;
+		goto fallback;
+	}
 
-	tcp_fastopen_cache_get(sk, &tp->rx_opt.mss_clamp, &fo->cookie);
 	if (fo->cookie.len <= 0)
 		goto fallback;
 
-- 
1.7.7.3


* [PATCH 5/7] net-tcp: Fast Open client - sendmsg(MSG_FASTOPEN)
From: Yuchung Cheng @ 2012-07-16 21:16 UTC (permalink / raw)
  To: davem, hkchu, edumazet, ncardwell; +Cc: sivasankar, netdev, Yuchung Cheng
In-Reply-To: <1342473410-6265-1-git-send-email-ycheng@google.com>

sendmsg() (or sendto()) with MSG_FASTOPEN is a combination of connect(2)
and write(2). The application should use it in place of connect() to
send data in the opening SYN packet.

For a blocking socket, sendmsg() blocks until all the data is buffered
locally and the handshake has completed, like a connect() call. If the TCP
handshake fails, it returns the same kind of errno as connect().

For a non-blocking socket, it returns the number of bytes queued (and
transmitted in the SYN-data packet) if a cookie is available. If no cookie
is available, it transmits a data-less SYN packet with a Fast Open
cookie request option and returns -EINPROGRESS, like connect().

Using MSG_FASTOPEN on a connecting or connected socket results in an errno
similar to that of a repeated connect() call. Therefore the application
should only use this flag on new sockets.

The buffer size of sendmsg() is independent of the MSS of the connection.
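
From the application side, usage could look roughly like the sketch below.
fastopen_send() is a hypothetical helper for a blocking IPv4 socket; the
MSG_FASTOPEN fallback value is the one defined in this patch, and the sketch
assumes a kernel with this series applied and the tcp_fastopen sysctl
enabled for the client.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef MSG_FASTOPEN
#define MSG_FASTOPEN 0x20000000	/* value from this patch */
#endif

/*
 * Hypothetical helper: open a TCP connection to ip:port and hand the
 * first bytes to the kernel in the same call, replacing connect().
 * Returns the socket fd, or -1 with errno set by the failing call.
 */
static int fastopen_send(const char *ip, uint16_t port,
			 const void *buf, size_t len, ssize_t *sent)
{
	struct sockaddr_in addr;
	ssize_t n;
	int fd;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_port = htons(port);
	if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1)
		return -1;	/* not a valid dotted-quad address */

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;

	/* No connect(): the destination address goes into sendto(). */
	n = sendto(fd, buf, len, MSG_FASTOPEN,
		   (struct sockaddr *)&addr, sizeof(addr));
	if (n < 0) {
		close(fd);
		return -1;
	}
	if (sent)
		*sent = n;	/* may be short; the caller sends the rest */
	return fd;
}
```

A caller that wants to keep working when the feature is unavailable can
retry with a plain connect() followed by write() when the helper fails.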

Signed-off-by: Yuchung Cheng <ycheng@google.com>
---
 Documentation/networking/ip-sysctl.txt |   11 ++++++
 include/linux/socket.h                 |    1 +
 include/net/inet_common.h              |    6 ++-
 include/net/tcp.h                      |    3 ++
 net/ipv4/af_inet.c                     |   19 +++++++---
 net/ipv4/tcp.c                         |   61 +++++++++++++++++++++++++++++---
 net/ipv4/tcp_ipv4.c                    |    3 ++
 7 files changed, 92 insertions(+), 12 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index e20c17a..b3c5225 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -468,6 +468,17 @@ tcp_syncookies - BOOLEAN
 	SYN flood warnings in logs not being really flooded, your server
 	is seriously misconfigured.
 
+tcp_fastopen - INTEGER
+	Enable TCP Fast Open feature (draft-ietf-tcpm-fastopen) to send data
+	in the opening SYN packet. To use this feature, the client application
+	must not use connect(). Instead, it should use sendmsg() or sendto()
+	with MSG_FASTOPEN flag which performs a TCP handshake automatically.
+
+	The values (bitmap) are:
+	1: Enables sending data in the opening SYN on the client
+
+	Default: 0
+
 tcp_syn_retries - INTEGER
 	Number of times initial SYNs for an active TCP connection attempt
 	will be retransmitted. Should not be higher than 255. Default value
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 25d6322..90297db 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -276,6 +276,7 @@ struct ucred {
 #else
 #define MSG_CMSG_COMPAT	0		/* We never have 32 bit fixups */
 #endif
+#define MSG_FASTOPEN	0x20000000 /* Send data in TCP SYN */
 
 
 /* Setsockoptions(2) level. Thanks to BSD these must match IPPROTO_xxx */
diff --git a/include/net/inet_common.h b/include/net/inet_common.h
index 22fac98..2340087 100644
--- a/include/net/inet_common.h
+++ b/include/net/inet_common.h
@@ -14,9 +14,11 @@ struct sockaddr;
 struct socket;
 
 extern int inet_release(struct socket *sock);
-extern int inet_stream_connect(struct socket *sock, struct sockaddr * uaddr,
+extern int inet_stream_connect(struct socket *sock, struct sockaddr *uaddr,
 			       int addr_len, int flags);
-extern int inet_dgram_connect(struct socket *sock, struct sockaddr * uaddr,
+extern int __inet_stream_connect(struct socket *sock, struct sockaddr *uaddr,
+				 int addr_len, int flags);
+extern int inet_dgram_connect(struct socket *sock, struct sockaddr *uaddr,
 			      int addr_len, int flags);
 extern int inet_accept(struct socket *sock, struct socket *newsock, int flags);
 extern int inet_sendmsg(struct kiocb *iocb, struct socket *sock,
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 2d3b09d2..8c20edd 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -212,6 +212,9 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
 /* TCP initial congestion window as per draft-hkchu-tcpm-initcwnd-01 */
 #define TCP_INIT_CWND		10
 
+/* Bit Flags for sysctl_tcp_fastopen */
+#define	TFO_CLIENT_ENABLE	1
+
 extern struct inet_timewait_death_row tcp_death_row;
 
 /* sysctl variables for tcp */
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 6ef67b7..c05ec41 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -590,8 +590,8 @@ static long inet_wait_for_connect(struct sock *sk, long timeo)
  *	Connect to a remote host. There is regrettably still a little
  *	TCP 'magic' in here.
  */
-int inet_stream_connect(struct socket *sock, struct sockaddr *uaddr,
-			int addr_len, int flags)
+int __inet_stream_connect(struct socket *sock, struct sockaddr *uaddr,
+			  int addr_len, int flags)
 {
 	struct sock *sk = sock->sk;
 	int err;
@@ -600,8 +600,6 @@ int inet_stream_connect(struct socket *sock, struct sockaddr *uaddr,
 	if (addr_len < sizeof(uaddr->sa_family))
 		return -EINVAL;
 
-	lock_sock(sk);
-
 	if (uaddr->sa_family == AF_UNSPEC) {
 		err = sk->sk_prot->disconnect(sk, flags);
 		sock->state = err ? SS_DISCONNECTING : SS_UNCONNECTED;
@@ -664,7 +662,6 @@ int inet_stream_connect(struct socket *sock, struct sockaddr *uaddr,
 	sock->state = SS_CONNECTED;
 	err = 0;
 out:
-	release_sock(sk);
 	return err;
 
 sock_error:
@@ -674,6 +671,18 @@ sock_error:
 		sock->state = SS_DISCONNECTING;
 	goto out;
 }
+EXPORT_SYMBOL(__inet_stream_connect);
+
+int inet_stream_connect(struct socket *sock, struct sockaddr *uaddr,
+			int addr_len, int flags)
+{
+	int err;
+
+	lock_sock(sock->sk);
+	err = __inet_stream_connect(sock, uaddr, addr_len, flags);
+	release_sock(sock->sk);
+	return err;
+}
 EXPORT_SYMBOL(inet_stream_connect);
 
 /*
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 4252cd8..581ecf0 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -270,6 +270,7 @@
 #include <linux/slab.h>
 
 #include <net/icmp.h>
+#include <net/inet_common.h>
 #include <net/tcp.h>
 #include <net/xfrm.h>
 #include <net/ip.h>
@@ -982,26 +983,67 @@ static inline int select_size(const struct sock *sk, bool sg)
 	return tmp;
 }
 
+void tcp_free_fastopen_req(struct tcp_sock *tp)
+{
+	if (tp->fastopen_req != NULL) {
+		kfree(tp->fastopen_req);
+		tp->fastopen_req = NULL;
+	}
+}
+
+static int tcp_sendmsg_fastopen(struct sock *sk, struct msghdr *msg, int *size)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	int err, flags;
+
+	if (!(sysctl_tcp_fastopen & TFO_CLIENT_ENABLE))
+		return -EOPNOTSUPP;
+	if (tp->fastopen_req != NULL)
+		return -EALREADY; /* Another Fast Open is in progress */
+
+	tp->fastopen_req = kzalloc(sizeof(struct tcp_fastopen_request),
+				   sk->sk_allocation);
+	if (unlikely(tp->fastopen_req == NULL))
+		return -ENOBUFS;
+	tp->fastopen_req->data = msg;
+
+	flags = (msg->msg_flags & MSG_DONTWAIT) ? O_NONBLOCK : 0;
+	err = __inet_stream_connect(sk->sk_socket, msg->msg_name,
+				    msg->msg_namelen, flags);
+	*size = tp->fastopen_req->copied;
+	tcp_free_fastopen_req(tp);
+	return err;
+}
+
 int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		size_t size)
 {
 	struct iovec *iov;
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *skb;
-	int iovlen, flags, err, copied;
-	int mss_now = 0, size_goal;
+	int iovlen, flags, err, copied = 0;
+	int mss_now = 0, size_goal, copied_syn = 0, offset = 0;
 	bool sg;
 	long timeo;
 
 	lock_sock(sk);
 
 	flags = msg->msg_flags;
+	if (flags & MSG_FASTOPEN) {
+		err = tcp_sendmsg_fastopen(sk, msg, &copied_syn);
+		if (err == -EINPROGRESS && copied_syn > 0)
+			goto out;
+		else if (err)
+			goto out_err;
+		offset = copied_syn;
+	}
+
 	timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
 
 	/* Wait for a connection to finish. */
 	if ((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT))
 		if ((err = sk_stream_wait_connect(sk, &timeo)) != 0)
-			goto out_err;
+			goto do_error;
 
 	if (unlikely(tp->repair)) {
 		if (tp->repair_queue == TCP_RECV_QUEUE) {
@@ -1037,6 +1079,15 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		unsigned char __user *from = iov->iov_base;
 
 		iov++;
+		if (unlikely(offset > 0)) {  /* Skip bytes copied in SYN */
+			if (offset >= seglen) {
+				offset -= seglen;
+				continue;
+			}
+			seglen -= offset;
+			from += offset;
+			offset = 0;
+		}
 
 		while (seglen > 0) {
 			int copy = 0;
@@ -1199,7 +1250,7 @@ out:
 	if (copied && likely(!tp->repair))
 		tcp_push(sk, flags, mss_now, tp->nonagle);
 	release_sock(sk);
-	return copied;
+	return copied + copied_syn;
 
 do_fault:
 	if (!skb->len) {
@@ -1212,7 +1263,7 @@ do_fault:
 	}
 
 do_error:
-	if (copied)
+	if (copied + copied_syn)
 		goto out;
 out_err:
 	err = sk_stream_error(sk, flags, err);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 588200e..b2f457f 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1959,6 +1959,9 @@ void tcp_v4_destroy_sock(struct sock *sk)
 		tp->cookie_values = NULL;
 	}
 
+	/* If socket is aborted during connect operation */
+	tcp_free_fastopen_req(tp);
+
 	sk_sockets_allocated_dec(sk);
 	sock_release_memcg(sk);
 }
-- 
1.7.7.3


* [PATCH] SUNRPC: Prevent kernel stack corruption on long values of flush
From: Sasha Levin @ 2012-07-16 22:01 UTC (permalink / raw)
  To: Trond.Myklebust, bfields, davem
  Cc: davej, linux-nfs, netdev, linux-kernel, Sasha Levin

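The buffer in read_flush() is too small for the longest value it may have to
hold: a 64-bit unsigned long can print as 20 digits, and the newline and the
terminating NUL need two more bytes. This can lead to kernel stack corruption: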

[   43.047329] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffffff833e64b4
[   43.047329]
[   43.049030] Pid: 6015, comm: trinity-child18 Tainted: G        W    3.5.0-rc7-next-20120716-sasha #221
[   43.050038] Call Trace:
[   43.050435]  [<ffffffff836c60c2>] panic+0xcd/0x1f4
[   43.050931]  [<ffffffff833e64b4>] ? read_flush.isra.7+0xe4/0x100
[   43.051602]  [<ffffffff810e94e6>] __stack_chk_fail+0x16/0x20
[   43.052206]  [<ffffffff833e64b4>] read_flush.isra.7+0xe4/0x100
[   43.052951]  [<ffffffff833e6500>] ? read_flush_pipefs+0x30/0x30
[   43.053594]  [<ffffffff833e652c>] read_flush_procfs+0x2c/0x30
[   43.053596]  [<ffffffff812b9a8c>] proc_reg_read+0x9c/0xd0
[   43.053596]  [<ffffffff812b99f0>] ? proc_reg_write+0xd0/0xd0
[   43.053596]  [<ffffffff81250d5b>] do_loop_readv_writev+0x4b/0x90
[   43.053596]  [<ffffffff81250fd6>] do_readv_writev+0xf6/0x1d0
[   43.053596]  [<ffffffff812510ee>] vfs_readv+0x3e/0x60
[   43.053596]  [<ffffffff812511b8>] sys_readv+0x48/0xb0
[   43.053596]  [<ffffffff8378167d>] system_call_fastpath+0x1a/0x1f
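
The size arithmetic can be checked in user space with a small sketch like
the one below. flush_buf_needed() is a hypothetical helper, not part of the
patch; it relies on snprintf() reporting the would-be length when given a
zero-sized buffer.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/*
 * Hypothetical helper: number of bytes needed to hold the "%lu\n"
 * rendering of v, including the terminating NUL.
 */
static size_t flush_buf_needed(unsigned long v)
{
	/* With a zero size, snprintf() only reports the would-be length. */
	int n = snprintf(NULL, 0, "%lu\n", v);

	assert(n > 0);
	return (size_t)n + 1;
}
```

On a system with a 64-bit unsigned long, flush_buf_needed(ULONG_MAX)
evaluates to 22, which matches the new size of tbuf.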

Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
---
 net/sunrpc/cache.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
index 2afd2a8..f86d95e 100644
--- a/net/sunrpc/cache.c
+++ b/net/sunrpc/cache.c
@@ -1409,11 +1409,11 @@ static ssize_t read_flush(struct file *file, char __user *buf,
 			  size_t count, loff_t *ppos,
 			  struct cache_detail *cd)
 {
-	char tbuf[20];
+	char tbuf[22];
 	unsigned long p = *ppos;
 	size_t len;
 
-	sprintf(tbuf, "%lu\n", convert_to_wallclock(cd->flush_time));
+	snprintf(tbuf, sizeof(tbuf), "%lu\n", convert_to_wallclock(cd->flush_time));
 	len = strlen(tbuf);
 	if (p >= len)
 		return 0;
-- 
1.7.8.6


* [PATCH ethtool 0/2] Backward-compatibility for feature get/set
From: Ben Hutchings @ 2012-07-16 22:17 UTC (permalink / raw)
  To: netdev; +Cc: linux-net-drivers

I've committed fixes for another 2 bugs in features (-k/-K options):

- A silly regression introduced in 3.4, not covered by unit tests which
don't yet check the command output
- An older bug (really a defect in the ETHTOOL_GFLAGS API) which we can
work around

Ben.

Ben Hutchings (2):
  Remove bogus error message when changing offload settings on older
    kernels
  Fix reporting of VLAN tag offload flags for Linux 2.6.24-2.6.36

 ethtool.c |  101 ++++++++++++++++++++++++++++++++++++++++++++----------------
 1 files changed, 74 insertions(+), 27 deletions(-)

-- 
1.7.7.6


-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


* [PATCH ethtool 1/2] Remove bogus error message when changing offload settings on older kernels
From: Ben Hutchings @ 2012-07-16 22:18 UTC (permalink / raw)
  To: netdev; +Cc: linux-net-drivers
In-Reply-To: <1342477041.2523.40.camel@bwh-desktop.uk.solarflarecom.com>

We should not be checking for fixed features when we have no
information about which are fixed.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 ethtool.c |   60 +++++++++++++++++++++++++++++++++---------------------------
 1 files changed, 33 insertions(+), 27 deletions(-)

diff --git a/ethtool.c b/ethtool.c
index 3c34273..b424756 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -1962,39 +1962,45 @@ static int do_sfeatures(struct cmd_context *ctx)
 	if (!old_state)
 		return 1;
 
-	/* For each offload that the user specified, update any
-	 * related features that the user did not specify and that
-	 * are not fixed.  Warn if all related features are fixed.
-	 */
-	for (i = 0; i < ARRAY_SIZE(off_flag_def); i++) {
-		int fixed = 1;
-
-		if (!(off_flags_mask & off_flag_def[i].value))
-			continue;
+	if (efeatures) {
+		/* For each offload that the user specified, update any
+		 * related features that the user did not specify and that
+		 * are not fixed.  Warn if all related features are fixed.
+		 */
+		for (i = 0; i < ARRAY_SIZE(off_flag_def); i++) {
+			int fixed = 1;
 
-		for (j = 0; j < defs->n_features; j++) {
-			if (defs->def[j].off_flag_index != i ||
-			    !FEATURE_BIT_IS_SET(old_state->features.features,
-						j, available) ||
-			    FEATURE_BIT_IS_SET(old_state->features.features,
-					       j, never_changed))
+			if (!(off_flags_mask & off_flag_def[i].value))
 				continue;
 
-			fixed = 0;
-			if (!FEATURE_BIT_IS_SET(efeatures->features, j, valid)) {
-				FEATURE_BIT_SET(efeatures->features, j, valid);
-				if (off_flags_wanted & off_flag_def[i].value)
-					FEATURE_BIT_SET(efeatures->features, j,
-							requested);
+			for (j = 0; j < defs->n_features; j++) {
+				if (defs->def[j].off_flag_index != i ||
+				    !FEATURE_BIT_IS_SET(
+					    old_state->features.features,
+					    j, available) ||
+				    FEATURE_BIT_IS_SET(
+					    old_state->features.features,
+					    j, never_changed))
+					continue;
+
+				fixed = 0;
+				if (!FEATURE_BIT_IS_SET(efeatures->features,
+							j, valid)) {
+					FEATURE_BIT_SET(efeatures->features,
+							j, valid);
+					if (off_flags_wanted &
+					    off_flag_def[i].value)
+						FEATURE_BIT_SET(
+							efeatures->features,
+							j, requested);
+				}
 			}
-		}
 
-		if (fixed)
-			fprintf(stderr, "Cannot change %s\n",
-				off_flag_def[i].long_name);
-	}
+			if (fixed)
+				fprintf(stderr, "Cannot change %s\n",
+					off_flag_def[i].long_name);
+		}
 
-	if (efeatures) {
 		err = send_ioctl(ctx, efeatures);
 		if (err < 0) {
 			perror("Cannot set device feature settings");
-- 
1.7.7.6



-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


* [PATCH ethtool 2/2] Fix reporting of VLAN tag offload flags for Linux 2.6.24-2.6.36
From: Ben Hutchings @ 2012-07-16 22:20 UTC (permalink / raw)
  To: netdev; +Cc: linux-net-drivers
In-Reply-To: <1342477041.2523.40.camel@bwh-desktop.uk.solarflarecom.com>

These kernel versions implement ETHTOOL_GFLAGS but do not include the
flags for VLAN tag offload (and do not implement ETHTOOL_GFEATURES).
Since the VLAN tag offload features were already defined and
implemented by many drivers, we shouldn't assume they are off.
Instead, since these feature flag values were stable, read them from
sysfs.
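
The parsing step can be sketched on its own as below. vlan_flags_from_sysfs()
is a hypothetical helper; the two bit values are assumed to mirror
ETH_FLAG_TXVLAN and ETH_FLAG_RXVLAN from <linux/ethtool.h>, and the
attribute text is assumed to be a number that strtoul() can parse with
base 0 (for example hexadecimal with a 0x prefix).

```c
#include <stdlib.h>

/* Assumed to mirror ETH_FLAG_TXVLAN / ETH_FLAG_RXVLAN in <linux/ethtool.h>. */
#define SKETCH_FLAG_TXVLAN	(1UL << 7)
#define SKETCH_FLAG_RXVLAN	(1UL << 8)

/*
 * Hypothetical helper: extract the VLAN tag offload bits from the text
 * read out of /sys/class/net/<dev>/features.
 */
static unsigned long vlan_flags_from_sysfs(const char *text)
{
	/* Base 0 lets strtoul() accept "0x..." as well as decimal text. */
	return strtoul(text, NULL, 0) &
	       (SKETCH_FLAG_RXVLAN | SKETCH_FLAG_TXVLAN);
}
```

All other feature bits in the attribute are masked off, so only the two
VLAN flags are merged into the offload state.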

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 ethtool.c |   41 +++++++++++++++++++++++++++++++++++++++++
 1 files changed, 41 insertions(+), 0 deletions(-)

diff --git a/ethtool.c b/ethtool.c
index b424756..9991ce2 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -33,6 +33,8 @@
 #include <sys/utsname.h>
 #include <limits.h>
 #include <ctype.h>
+#include <assert.h>
+#include <sys/fcntl.h>
 
 #include <sys/socket.h>
 #include <netinet/in.h>
@@ -1417,6 +1419,31 @@ static struct feature_defs *get_feature_defs(struct cmd_context *ctx)
 	return defs;
 }
 
+static int get_netdev_attr(struct cmd_context *ctx, const char *name,
+		    char *buf, size_t buf_len)
+{
+#ifdef TEST_ETHTOOL
+	errno = ENOENT;
+	return -1;
+#else
+	char path[40 + IFNAMSIZ];
+	ssize_t len;
+	int fd;
+
+	len = snprintf(path, sizeof(path), "/sys/class/net/%s/%s",
+		       ctx->devname, name);
+	assert(len < sizeof(path));
+	fd = open(path, O_RDONLY);
+	if (fd < 0)
+		return fd;
+	len = read(fd, buf, buf_len - 1);
+	if (len >= 0)
+		buf[len] = 0;
+	close(fd);
+	return len;
+#endif
+}
+
 static int do_gdrv(struct cmd_context *ctx)
 {
 	int err;
@@ -1858,6 +1885,20 @@ get_features(struct cmd_context *ctx, const struct feature_defs *defs)
 			perror("Cannot get device generic features");
 		else
 			allfail = 0;
+	} else {
+		/* We should have got VLAN tag offload flags through
+		 * ETHTOOL_GFLAGS.  However, prior to Linux 2.6.37
+		 * they were not exposed in this way - and since VLAN
+		 * tag offload was defined and implemented by many
+		 * drivers, we shouldn't assume they are off.
+		 * Instead, since these feature flag values were
+		 * stable, read them from sysfs.
+		 */
+		char buf[20];
+		if (get_netdev_attr(ctx, "features", buf, sizeof(buf)) > 0)
+			state->off_flags |=
+				strtoul(buf, NULL, 0) &
+				(ETH_FLAG_RXVLAN | ETH_FLAG_TXVLAN);
 	}
 
 	if (allfail) {
-- 
1.7.7.6


-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


