Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: route.c:645 suspicious rcu_dereference_check()
From: David Miller @ 2012-08-30 17:34 UTC (permalink / raw)
  To: eric.dumazet; +Cc: proski, netdev
In-Reply-To: <1346193187.3571.21.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 28 Aug 2012 15:33:07 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> [PATCH] ipv4: must use rcu protection while calling fib_lookup
> 
> Following lockdep splat was reported by Pavel Roskin :
 ...
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: Pavel Roskin <proski@gnu.org>

Applied, thanks.

It looks like the redirect handlers might have the same problem?

^ permalink raw reply

* Re: [net patch 0/2] bnx2x: Correct bnx2x operation with netpoll
From: David Miller @ 2012-08-30 17:37 UTC (permalink / raw)
  To: meravs; +Cc: netdev, eilong
In-Reply-To: <1346073980-30901-1-git-send-email-meravs@broadcom.com>

From: "Merav Sicron" <meravs@broadcom.com>
Date: Mon, 27 Aug 2012 16:26:18 +0300

> Hi Dave,
> This patch series corrects bnx2x operation with netpoll:
> The first patch is to move the netif_napi_add to the open call to avoid napi
> objects that are added but may not be later enabled.
> The second patch corrects the poll_bnx2x function itself to also work with
> MSI-X.
> 
> Please consider applying this patch series to net.

Applied.

^ permalink raw reply

* Re: [PATCH] skbuff: remove pointless conditional before kfree_skb()
From: David Miller @ 2012-08-30 17:39 UTC (permalink / raw)
  To: eric.dumazet; +Cc: fbl, weiyj.lk, yongjun_wei, netdev
In-Reply-To: <1346211528.3571.26.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 28 Aug 2012 20:38:48 -0700

> On Tue, 2012-08-28 at 17:39 -0300, Flavio Leitner wrote:
> 
>> Ok, and what if kfree_skb() becomes a macro that first checks
>> if the skb is NULL and if not, call the _kfree_skb() to
>> continue as before?
>> 
>> #define kfree_skb(skb)		\
>>         if (skb)		\
>> 		_kfree_skb(skb)	\
> 
> Then its adding a conditional test on each call site and increase
> kernel code size.
> 
> So if you plan submitting such patch, please keep the whole thing out of
> line.

I'm tossing this entire series.

Each and every case must be investigated individually and:

1) If the check is kept, a big comment explaining why is added
   to the code.

2) If the check is removed, a big piece of explanatory text is
   added to the commit log message explaining everything in
   full detail.

^ permalink raw reply

* Re: [PATCH 0/5] dev_<level> and dynamic_debug cleanups
From: Jim Cromie @ 2012-08-30 17:43 UTC (permalink / raw)
  To: Joe Perches
  Cc: Andrew Morton, netdev, Greg Kroah-Hartman, David S. Miller,
	Jason Baron, Kay Sievers, linux-kernel
In-Reply-To: <cover.1345978012.git.joe@perches.com>

On Sun, Aug 26, 2012 at 5:25 AM, Joe Perches <joe@perches.com> wrote:
> The recent commit to fix dynamic_debug was a bit unclean.
> Neaten the style for dynamic_debug.
> Reduce the stack use of message logging that uses netdev_printk
> Add utility functions dev_printk_emit and dev_vprintk_emit for /dev/kmsg.
>
> Joe Perches (5):
>   dev_dbg/dynamic_debug: Update to use printk_emit, optimize stack
>   netdev_printk/dynamic_netdev_dbg: Directly call printk_emit
>   netdev_printk/netif_printk: Remove a superfluous logging colon
>   dev: Add dev_vprintk_emit and dev_printk_emit
>   device and dynamic_debug: Use dev_vprintk_emit and dev_printk_emit
>

Ive tested this on 2 builds differing only by DYNAMIC_DEBUG
It works for me on x86-64

However, I just booted a non-dyndbg build on x86-32, and got this.


root@voyage:~# dmesg | grep Unknown
mac80211: Unknown symbol __dynamic_dev_dbg (err 0)
mac80211: Unknown symbol __dynamic_pr_debug (err 0)
scx200_acb: Unknown symbol __dynamic_dev_dbg (err 0)
scx200_acb: Unknown symbol __dynamic_pr_debug (err 0)
hwmon: Unknown symbol __dynamic_dev_dbg (err 0)
nsc_gpio: Unknown symbol __dynamic_dev_dbg (err 0)
nsc_gpio: Unknown symbol __dynamic_dev_dbg (err 0)
root@voyage:~#

It may be my error, will investigate asap,
but wanted to report now.

thanks

^ permalink raw reply

* Re: [PATCH 1/1] tcp: Wrong timeout for SYN segments
From: H.K. Jerry Chu @ 2012-08-30 17:59 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Alexander Bergmann, David Miller, netdev, linux-kernel
In-Reply-To: <1346332350.2586.10.camel@edumazet-glaptop>

On Thu, Aug 30, 2012 at 6:12 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2012-08-29 at 10:25 -0700, H.K. Jerry Chu wrote:
>
>> But it probably matter slightly more for TCP Fast Open (the server
>> side patch has
>> been completed and will be posted soon, after I finish breaking it up
>> into smaller
>> pieces for ease of review purpose), when a full socket will be created with data
>> passed to the app upon a valid SYN+data. Dropping a fully functioning socket
>> won't be the same as dropping a request_sock unknown to the app and letting
>> the other side retransmitting SYN (w/o data this time).
>>
>> >
>> > Sure, RFC numbers are what they are, but in practice, I doubt someone
>> > will really miss the extra SYNACK sent after ~32 seconds, since it would
>> > matter only for the last SYN attempted.
>>
>> I'd slightly prefer 1 extra retry plus longer wait time just to make
>> TCP Fast Open
>> a little more robust (even though the app protocol is required to be
>> idempotent).
>> But this is not a showstopper.
>
> Thats very good points indeed, thanks.
>
> Maybe we can increase SYNACK max retrans only if the FastOpen SYN cookie
> was validated.
>
> This way, we increase reliability without amplifying the effect of wild
> SYN packets.

Ok, will use sysctl_tcp_synack_retries + 1 in tcp_fastopen_synack_timer() of my
upcoming TCP Fast Open server patch (hope to submit today).

Jerry

>
>
>

^ permalink raw reply

* Re: [PATCH 1/1] tcp: Wrong timeout for SYN segments
From: H.K. Jerry Chu @ 2012-08-30 18:04 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, alex, netdev, linux-kernel
In-Reply-To: <20120830.124543.1590020152319166487.davem@davemloft.net>

On Thu, Aug 30, 2012 at 9:45 AM, David Miller <davem@davemloft.net> wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 30 Aug 2012 06:12:30 -0700
>
>> On Wed, 2012-08-29 at 10:25 -0700, H.K. Jerry Chu wrote:
>>
>>> But it probably matter slightly more for TCP Fast Open (the server
>>> side patch has
>>> been completed and will be posted soon, after I finish breaking it up
>>> into smaller
>>> pieces for ease of review purpose), when a full socket will be created with data
>>> passed to the app upon a valid SYN+data. Dropping a fully functioning socket
>>> won't be the same as dropping a request_sock unknown to the app and letting
>>> the other side retransmitting SYN (w/o data this time).
>>>
>>> >
>>> > Sure, RFC numbers are what they are, but in practice, I doubt someone
>>> > will really miss the extra SYNACK sent after ~32 seconds, since it would
>>> > matter only for the last SYN attempted.
>>>
>>> I'd slightly prefer 1 extra retry plus longer wait time just to make
>>> TCP Fast Open
>>> a little more robust (even though the app protocol is required to be
>>> idempotent).
>>> But this is not a showstopper.
>>
>> Thats very good points indeed, thanks.
>>
>> Maybe we can increase SYNACK max retrans only if the FastOpen SYN cookie
>> was validated.
>>
>> This way, we increase reliability without amplifying the effect of wild
>> SYN packets.
>
> Can we come to a final conclusion on this last point and arrive at a final
> patch?
>
> Thanks.

Acked-by: H.K. Jerry Chu <hkchu@google.com>

^ permalink raw reply

* Re: [PATCH 0/5] dev_<level> and dynamic_debug cleanups
From: Joe Perches @ 2012-08-30 18:25 UTC (permalink / raw)
  To: Jim Cromie
  Cc: Andrew Morton, netdev, Greg Kroah-Hartman, David S. Miller,
	Jason Baron, Kay Sievers, linux-kernel
In-Reply-To: <CAJfuBxyQc-XeNHxw0_XDs0JuCrNArNzcgsmrLya4ncDvSojZxQ@mail.gmail.com>

On Thu, 2012-08-30 at 11:43 -0600, Jim Cromie wrote:
> On Sun, Aug 26, 2012 at 5:25 AM, Joe Perches <joe@perches.com> wrote:
> > The recent commit to fix dynamic_debug was a bit unclean.
> > Neaten the style for dynamic_debug.
> > Reduce the stack use of message logging that uses netdev_printk
> > Add utility functions dev_printk_emit and dev_vprintk_emit for /dev/kmsg.
> >
> > Joe Perches (5):
> >   dev_dbg/dynamic_debug: Update to use printk_emit, optimize stack
> >   netdev_printk/dynamic_netdev_dbg: Directly call printk_emit
> >   netdev_printk/netif_printk: Remove a superfluous logging colon
> >   dev: Add dev_vprintk_emit and dev_printk_emit
> >   device and dynamic_debug: Use dev_vprintk_emit and dev_printk_emit
> >
> 
> Ive tested this on 2 builds differing only by DYNAMIC_DEBUG
> It works for me on x86-64
> 
> However, I just booted a non-dyndbg build on x86-32, and got this.
> 
> 
> root@voyage:~# dmesg | grep Unknown
> mac80211: Unknown symbol __dynamic_dev_dbg (err 0)
> mac80211: Unknown symbol __dynamic_pr_debug (err 0)
> scx200_acb: Unknown symbol __dynamic_dev_dbg (err 0)
> scx200_acb: Unknown symbol __dynamic_pr_debug (err 0)
> hwmon: Unknown symbol __dynamic_dev_dbg (err 0)
> nsc_gpio: Unknown symbol __dynamic_dev_dbg (err 0)
> nsc_gpio: Unknown symbol __dynamic_dev_dbg (err 0)
> root@voyage:~#
> 
> It may be my error, will investigate asap,
> but wanted to report now.

Thanks, but I don't know how this is possible.
There are no 32/64 component tests in this code.

^ permalink raw reply

* [PATCH] bonding: add some slack to arp monitoring time limits
From: Jiri Bohac @ 2012-08-30 22:02 UTC (permalink / raw)
  To: Jay Vosburgh
  Cc: Chris Friesen, Jiri Bohac, Andy Gospodarek, netdev, Petr Tesarik,
	davem
In-Reply-To: <24655.1345660922@death.nxdomain>

Currently, all the time limits in the bonding ARP monitor are in
multiples of arp_interval -- the time interval at which the ARP
monitor is periodically scheduled.

With a fast network round-trip and a little scheduling latency
of the ARP monitor work, a limit of n*delta_in_ticks may
effectively mean (n-1)*delta_in_ticks.

This is fatal in case of n==1  (the link will stay down
forever) and makes the behaviour non-deterministic in all the
other cases.

Add a delta_in_ticks/2 time slack to all the time limits.

Signed-off-by: Jiri Bohac <jbohac@suse.cz>

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 6fae5f3..0f04115 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2813,12 +2813,13 @@ void bond_loadbalance_arp_mon(struct work_struct *work)
 					    arp_work.work);
 	struct slave *slave, *oldcurrent;
 	int do_failover = 0;
-	int delta_in_ticks;
+	int delta_in_ticks, extra_ticks;
 	int i;
 
 	read_lock(&bond->lock);
 
 	delta_in_ticks = msecs_to_jiffies(bond->params.arp_interval);
+	extra_ticks = delta_in_ticks / 2;
 
 	if (bond->slave_cnt == 0)
 		goto re_arm;
@@ -2841,10 +2842,10 @@ void bond_loadbalance_arp_mon(struct work_struct *work)
 		if (slave->link != BOND_LINK_UP) {
 			if (time_in_range(jiffies,
 				trans_start - delta_in_ticks,
-				trans_start + delta_in_ticks) &&
+				trans_start + delta_in_ticks + extra_ticks) &&
 			    time_in_range(jiffies,
 				slave->dev->last_rx - delta_in_ticks,
-				slave->dev->last_rx + delta_in_ticks)) {
+				slave->dev->last_rx + delta_in_ticks + extra_ticks)) {
 
 				slave->link  = BOND_LINK_UP;
 				bond_set_active_slave(slave);
@@ -2874,10 +2875,10 @@ void bond_loadbalance_arp_mon(struct work_struct *work)
 			 */
 			if (!time_in_range(jiffies,
 				trans_start - delta_in_ticks,
-				trans_start + 2 * delta_in_ticks) ||
+				trans_start + 2 * delta_in_ticks + extra_ticks) ||
 			    !time_in_range(jiffies,
 				slave->dev->last_rx - delta_in_ticks,
-				slave->dev->last_rx + 2 * delta_in_ticks)) {
+				slave->dev->last_rx + 2 * delta_in_ticks + extra_ticks)) {
 
 				slave->link  = BOND_LINK_DOWN;
 				bond_set_backup_slave(slave);
@@ -2935,6 +2936,14 @@ static int bond_ab_arp_inspect(struct bonding *bond, int delta_in_ticks)
 	struct slave *slave;
 	int i, commit = 0;
 	unsigned long trans_start;
+	int extra_ticks;
+
+	/* All the time comparisons below need some extra time. Otherwise, on
+	 * fast networks the ARP probe/reply may arrive within the same jiffy
+	 * as it was sent.  Then, the next time the ARP monitor is run, one
+	 * arp_interval will already have passed in the comparisons.
+	 */
+	extra_ticks = delta_in_ticks / 2;
 
 	bond_for_each_slave(bond, slave, i) {
 		slave->new_link = BOND_LINK_NOCHANGE;
@@ -2942,7 +2951,7 @@ static int bond_ab_arp_inspect(struct bonding *bond, int delta_in_ticks)
 		if (slave->link != BOND_LINK_UP) {
 			if (time_in_range(jiffies,
 				slave_last_rx(bond, slave) - delta_in_ticks,
-				slave_last_rx(bond, slave) + delta_in_ticks)) {
+				slave_last_rx(bond, slave) + delta_in_ticks + extra_ticks)) {
 
 				slave->new_link = BOND_LINK_UP;
 				commit++;
@@ -2958,7 +2967,7 @@ static int bond_ab_arp_inspect(struct bonding *bond, int delta_in_ticks)
 		 */
 		if (time_in_range(jiffies,
 				  slave->jiffies - delta_in_ticks,
-				  slave->jiffies + 2 * delta_in_ticks))
+				  slave->jiffies + 2 * delta_in_ticks + extra_ticks))
 			continue;
 
 		/*
@@ -2978,7 +2987,7 @@ static int bond_ab_arp_inspect(struct bonding *bond, int delta_in_ticks)
 		    !bond->current_arp_slave &&
 		    !time_in_range(jiffies,
 			slave_last_rx(bond, slave) - delta_in_ticks,
-			slave_last_rx(bond, slave) + 3 * delta_in_ticks)) {
+			slave_last_rx(bond, slave) + 3 * delta_in_ticks + extra_ticks)) {
 
 			slave->new_link = BOND_LINK_DOWN;
 			commit++;
@@ -2994,10 +3003,10 @@ static int bond_ab_arp_inspect(struct bonding *bond, int delta_in_ticks)
 		if (bond_is_active_slave(slave) &&
 		    (!time_in_range(jiffies,
 			trans_start - delta_in_ticks,
-			trans_start + 2 * delta_in_ticks) ||
+			trans_start + 2 * delta_in_ticks + extra_ticks) ||
 		     !time_in_range(jiffies,
 			slave_last_rx(bond, slave) - delta_in_ticks,
-			slave_last_rx(bond, slave) + 2 * delta_in_ticks))) {
+			slave_last_rx(bond, slave) + 2 * delta_in_ticks + extra_ticks))) {
 
 			slave->new_link = BOND_LINK_DOWN;
 			commit++;
@@ -3029,7 +3038,7 @@ static void bond_ab_arp_commit(struct bonding *bond, int delta_in_ticks)
 			if ((!bond->curr_active_slave &&
 			     time_in_range(jiffies,
 					   trans_start - delta_in_ticks,
-					   trans_start + delta_in_ticks)) ||
+					   trans_start + delta_in_ticks + delta_in_ticks / 2)) ||
 			    bond->curr_active_slave != slave) {
 				slave->link = BOND_LINK_UP;
 				if (bond->current_arp_slave) {
-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, SUSE CZ

^ permalink raw reply related

* Re: macvlan/macvtap: guest/host cannot communicate when network cable is unplugged
From: ching @ 2012-08-30 22:32 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: netdev, Michael S. Tsirkin, qemu-devel, kaber
In-Reply-To: <CAJSP0QXkHLKpo0kbnPu7L1zH5aL2XCPdmzi7+-xFctU7Pchyqw@mail.gmail.com>

>> I also perform an additional test: the guests (macvtap bridge mode) CAN communicate each other no matter network cable is plugged or not.
> Strange.  I thought the original problem was that the macvtap guests
> cannot communicate with each other when the network cable is
> unplugged?
>
> Hopefully someone else can help you, I'm not familiar enough with
> macvlan/macvtap.
>
> Stefan
>

My word may be a little bit confusing, it should be

I also perform an additional test: I created two qemu guests (macvtap bridge mode on eth0), they CAN communicate each other no matter network cable of the host is unplugged or not.

and my problem is specified in the subject already: guest/host cannot communicate when network cable is unplugged

if my configuration seems correct, i may file a kernel bugzilla and seek for their advice.

ching

^ permalink raw reply

* [PATCH 0/3] tcp: TCP Fast Open, Server Side
From: H.K. Jerry Chu @ 2012-08-30 23:39 UTC (permalink / raw)
  To: davem, ycheng, edumazet, ncardwell
  Cc: sivasankar, therbert, netdev, Jerry Chu

From: Jerry Chu <hkchu@google.com>

This patch series provides the server (passive open) side code
for TCP Fast Open. Together with the earlier client side patches
it completes the TCP Fast Open implementation.

The server side Fast Open code accepts data carried in the SYN
packet with a valid Fast Open cookie, and passes it to the
application right away, allowing application to send back response
data, all before TCP's 3-way handshake finishes.

A simple cookie scheme together with capping the number of
outstanding TFO requests (still in TCP_SYN_RECV state) to a limit
per listener forms the main line of defense against spoofed SYN
attacks.

For more details about TCP Fast Open see our IETF internet draft
at http://www.ietf.org/id/draft-ietf-tcpm-fastopen-01.txt
and a research paper at
http://conferences.sigcomm.org/co-next/2011/papers/1569470463.pdf

A prototype implementation was first developed by Sivasankar
Radhakrishnan (sivasankar@cs.ucsd.edu).

A patch based on an older version of Linux kernel has been
undergoing internal tests at Google for the past few months.

Jerry Chu (3):
  tcp: TCP Fast Open Server - header & support functions
  tcp: TCP Fast Open Server - support TFO listeners
  tcp: TCP Fast Open Server - main code path

 Documentation/networking/ip-sysctl.txt |   29 +++-
 include/linux/snmp.h                   |    4 +
 include/linux/tcp.h                    |   45 +++++-
 include/net/request_sock.h             |   49 +++++--
 include/net/tcp.h                      |   52 +++++-
 net/core/request_sock.c                |   95 +++++++++++
 net/ipv4/af_inet.c                     |   28 +++-
 net/ipv4/inet_connection_sock.c        |   57 ++++++-
 net/ipv4/proc.c                        |    4 +
 net/ipv4/syncookies.c                  |    1 +
 net/ipv4/sysctl_net_ipv4.c             |   45 ++++++
 net/ipv4/tcp.c                         |   49 +++++-
 net/ipv4/tcp_fastopen.c                |   83 ++++++++++-
 net/ipv4/tcp_input.c                   |   75 +++++++--
 net/ipv4/tcp_ipv4.c                    |  270 ++++++++++++++++++++++++++++++--
 net/ipv4/tcp_minisocks.c               |   61 ++++++--
 net/ipv4/tcp_output.c                  |   21 ++-
 net/ipv4/tcp_timer.c                   |   39 +++++-
 net/ipv6/syncookies.c                  |    1 +
 net/ipv6/tcp_ipv6.c                    |    5 +-
 20 files changed, 916 insertions(+), 97 deletions(-)

-- 
1.7.7.3

^ permalink raw reply

* [PATCH 2/3] tcp: TCP Fast Open Server - support TFO listeners
From: H.K. Jerry Chu @ 2012-08-30 23:39 UTC (permalink / raw)
  To: davem, ycheng, edumazet, ncardwell
  Cc: sivasankar, therbert, netdev, Jerry Chu
In-Reply-To: <1346369948-1722-1-git-send-email-hkchu@google.com>

From: Jerry Chu <hkchu@google.com>

This patch builds on top of the previous patch to add the support
for TFO listeners. This includes -

1. allocating, properly initializing, and managing the per listener
fastopen_queue structure when TFO is enabled

2. changes to the inet_csk_accept code to support TFO. E.g., the
request_sock can no longer be freed upon accept(), not until 3WHS
finishes

3. allowing a TCP_SYN_RECV socket to properly poll() and sendmsg()
if it's a TFO socket

4. properly closing a TFO listener, and a TFO socket before 3WHS
finishes

5. supporting TCP_FASTOPEN socket option

6. modifying tcp_check_req() to use to check a TFO socket as well
as request_sock

7. supporting TCP's TFO cookie option

8. adding a new SYN-ACK retransmit handler to use the timer directly
off the TFO socket rather than the listener socket. Note that TFO
server side will not retransmit anything other than SYN-ACK until
the 3WHS is completed.

The patch also contains an important function
"reqsk_fastopen_remove()" to manage the somewhat complex relation
between a listener, its request_sock, and the corresponding child
socket. See the comment above the function for the detail.

Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
---
 include/net/request_sock.h      |   13 -----
 include/net/tcp.h               |    6 ++-
 net/core/request_sock.c         |   95 +++++++++++++++++++++++++++++++++++++++
 net/ipv4/af_inet.c              |   28 +++++++++++-
 net/ipv4/inet_connection_sock.c |   57 +++++++++++++++++++++---
 net/ipv4/syncookies.c           |    1 +
 net/ipv4/tcp.c                  |   49 +++++++++++++++++---
 net/ipv4/tcp_ipv4.c             |    4 +-
 net/ipv4/tcp_minisocks.c        |   61 ++++++++++++++++++++-----
 net/ipv4/tcp_output.c           |   21 +++++++--
 net/ipv4/tcp_timer.c            |   39 +++++++++++++++-
 net/ipv6/syncookies.c           |    1 +
 net/ipv6/tcp_ipv6.c             |    5 +-
 13 files changed, 330 insertions(+), 50 deletions(-)

diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index c3cdd6c..b01d8dd 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -226,19 +226,6 @@ static inline struct request_sock *reqsk_queue_remove(struct request_sock_queue
 	return req;
 }
 
-static inline struct sock *reqsk_queue_get_child(struct request_sock_queue *queue,
-						 struct sock *parent)
-{
-	struct request_sock *req = reqsk_queue_remove(queue);
-	struct sock *child = req->sk;
-
-	WARN_ON(child == NULL);
-
-	sk_acceptq_removed(parent);
-	__reqsk_free(req);
-	return child;
-}
-
 static inline int reqsk_queue_removed(struct request_sock_queue *queue,
 				      struct request_sock *req)
 {
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 1a88a97..2d1f70c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -414,7 +414,8 @@ extern enum tcp_tw_status tcp_timewait_state_process(struct inet_timewait_sock *
 						     const struct tcphdr *th);
 extern struct sock * tcp_check_req(struct sock *sk,struct sk_buff *skb,
 				   struct request_sock *req,
-				   struct request_sock **prev);
+				   struct request_sock **prev,
+				   bool fastopen);
 extern int tcp_child_process(struct sock *parent, struct sock *child,
 			     struct sk_buff *skb);
 extern bool tcp_use_frto(struct sock *sk);
@@ -468,7 +469,8 @@ extern int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr,
 extern int tcp_connect(struct sock *sk);
 extern struct sk_buff * tcp_make_synack(struct sock *sk, struct dst_entry *dst,
 					struct request_sock *req,
-					struct request_values *rvp);
+					struct request_values *rvp,
+					struct tcp_fastopen_cookie *foc);
 extern int tcp_disconnect(struct sock *sk, int flags);
 
 void tcp_connect_init(struct sock *sk);
diff --git a/net/core/request_sock.c b/net/core/request_sock.c
index 9b570a6..c31d9e8 100644
--- a/net/core/request_sock.c
+++ b/net/core/request_sock.c
@@ -15,6 +15,7 @@
 #include <linux/random.h>
 #include <linux/slab.h>
 #include <linux/string.h>
+#include <linux/tcp.h>
 #include <linux/vmalloc.h>
 
 #include <net/request_sock.h>
@@ -130,3 +131,97 @@ void reqsk_queue_destroy(struct request_sock_queue *queue)
 		kfree(lopt);
 }
 
+/*
+ * This function is called to set a Fast Open socket's "fastopen_rsk" field
+ * to NULL when a TFO socket no longer needs to access the request_sock.
+ * This happens only after 3WHS has been either completed or aborted (e.g.,
+ * RST is received).
+ *
+ * Before TFO, a child socket is created only after 3WHS is completed,
+ * hence it never needs to access the request_sock. things get a lot more
+ * complex with TFO. A child socket, accepted or not, has to access its
+ * request_sock for 3WHS processing, e.g., to retransmit SYN-ACK pkts,
+ * until 3WHS is either completed or aborted. Afterwards the req will stay
+ * until either the child socket is accepted, or in the rare case when the
+ * listener is closed before the child is accepted.
+ *
+ * In short, a request socket is only freed after BOTH 3WHS has completed
+ * (or aborted) and the child socket has been accepted (or listener closed).
+ * When a child socket is accepted, its corresponding req->sk is set to
+ * NULL since it's no longer needed. More importantly, "req->sk == NULL"
+ * will be used by the code below to determine if a child socket has been
+ * accepted or not, and the check is protected by the fastopenq->lock
+ * described below.
+ *
+ * Note that fastopen_rsk is only accessed from the child socket's context
+ * with its socket lock held. But a request_sock (req) can be accessed by
+ * both its child socket through fastopen_rsk, and a listener socket through
+ * icsk_accept_queue.rskq_accept_head. To protect the access a simple spin
+ * lock per listener "icsk->icsk_accept_queue.fastopenq->lock" is created.
+ * only in the rare case when both the listener and the child locks are held,
+ * e.g., in inet_csk_listen_stop() do we not need to acquire the lock.
+ * The lock also protects other fields such as fastopenq->qlen, which is
+ * decremented by this function when fastopen_rsk is no longer needed.
+ *
+ * Note that another solution was to simply use the existing socket lock
+ * from the listener. But first socket lock is difficult to use. It is not
+ * a simple spin lock - one must consider sock_owned_by_user() and arrange
+ * to use sk_add_backlog() stuff. But what really makes it infeasible is the
+ * locking hierarchy violation. E.g., inet_csk_listen_stop() may try to
+ * acquire a child's lock while holding listener's socket lock. A corner
+ * case might also exist in tcp_v4_hnd_req() that will trigger this locking
+ * order.
+ *
+ * When a TFO req is created, it needs to sock_hold its listener to prevent
+ * the latter data structure from going away.
+ *
+ * This function also sets "treq->listener" to NULL and unreference listener
+ * socket. treq->listener is used by the listener so it is protected by the
+ * fastopenq->lock in this function.
+ */
+void reqsk_fastopen_remove(struct sock *sk, struct request_sock *req,
+			   bool reset)
+{
+	struct sock *lsk = tcp_rsk(req)->listener;
+	struct fastopen_queue *fastopenq =
+	    inet_csk(lsk)->icsk_accept_queue.fastopenq;
+
+	BUG_ON(!spin_is_locked(&sk->sk_lock.slock) && !sock_owned_by_user(sk));
+
+	tcp_sk(sk)->fastopen_rsk = NULL;
+	spin_lock_bh(&fastopenq->lock);
+	fastopenq->qlen--;
+	tcp_rsk(req)->listener = NULL;
+	if (req->sk)	/* the child socket hasn't been accepted yet */
+		goto out;
+
+	if (!reset || lsk->sk_state != TCP_LISTEN) {
+		/* If the listener has been closed don't bother with the
+		 * special RST handling below.
+		 */
+		spin_unlock_bh(&fastopenq->lock);
+		sock_put(lsk);
+		reqsk_free(req);
+		return;
+	}
+	/* Wait for 60secs before removing a req that has triggered RST.
+	 * This is a simple defense against TFO spoofing attack - by
+	 * counting the req against fastopen.max_qlen, and disabling
+	 * TFO when the qlen exceeds max_qlen.
+	 *
+	 * For more details see CoNext'11 "TCP Fast Open" paper.
+	 */
+	req->expires = jiffies + 60*HZ;
+	if (fastopenq->rskq_rst_head == NULL)
+		fastopenq->rskq_rst_head = req;
+	else
+		fastopenq->rskq_rst_tail->dl_next = req;
+
+	req->dl_next = NULL;
+	fastopenq->rskq_rst_tail = req;
+	fastopenq->qlen++;
+out:
+	spin_unlock_bh(&fastopenq->lock);
+	sock_put(lsk);
+	return;
+}
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 6681ccf..4f70ef0 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -149,6 +149,11 @@ void inet_sock_destruct(struct sock *sk)
 		pr_err("Attempt to release alive inet socket %p\n", sk);
 		return;
 	}
+	if (sk->sk_type == SOCK_STREAM) {
+		struct fastopen_queue *fastopenq =
+			inet_csk(sk)->icsk_accept_queue.fastopenq;
+		kfree(fastopenq);
+	}
 
 	WARN_ON(atomic_read(&sk->sk_rmem_alloc));
 	WARN_ON(atomic_read(&sk->sk_wmem_alloc));
@@ -212,6 +217,26 @@ int inet_listen(struct socket *sock, int backlog)
 	 * we can only allow the backlog to be adjusted.
 	 */
 	if (old_state != TCP_LISTEN) {
+		/* Check special setups for testing purpose to enable TFO w/o
+		 * requiring TCP_FASTOPEN sockopt.
+		 * Note that only TCP sockets (SOCK_STREAM) will reach here.
+		 * Also fastopenq may already been allocated because this
+		 * socket was in TCP_LISTEN state previously but was
+		 * shutdown() (rather than close()).
+		 */
+		if ((sysctl_tcp_fastopen & TFO_SERVER_ENABLE) != 0 &&
+		    inet_csk(sk)->icsk_accept_queue.fastopenq == NULL) {
+			if ((sysctl_tcp_fastopen & TFO_SERVER_WO_SOCKOPT1) != 0)
+				err = fastopen_init_queue(sk, backlog);
+			else if ((sysctl_tcp_fastopen &
+				  TFO_SERVER_WO_SOCKOPT2) != 0)
+				err = fastopen_init_queue(sk,
+				    ((uint)sysctl_tcp_fastopen) >> 16);
+			else
+				err = 0;
+			if (err)
+				goto out;
+		}
 		err = inet_csk_listen_start(sk, backlog);
 		if (err)
 			goto out;
@@ -701,7 +726,8 @@ int inet_accept(struct socket *sock, struct socket *newsock, int flags)
 
 	sock_rps_record_flow(sk2);
 	WARN_ON(!((1 << sk2->sk_state) &
-		  (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT | TCPF_CLOSE)));
+		  (TCPF_ESTABLISHED | TCPF_SYN_RECV |
+		  TCPF_CLOSE_WAIT | TCPF_CLOSE)));
 
 	sock_graft(sk2, newsock);
 
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 7f75f21..8464b79 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -283,7 +283,9 @@ static int inet_csk_wait_for_connect(struct sock *sk, long timeo)
 struct sock *inet_csk_accept(struct sock *sk, int flags, int *err)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
+	struct request_sock_queue *queue = &icsk->icsk_accept_queue;
 	struct sock *newsk;
+	struct request_sock *req;
 	int error;
 
 	lock_sock(sk);
@@ -296,7 +298,7 @@ struct sock *inet_csk_accept(struct sock *sk, int flags, int *err)
 		goto out_err;
 
 	/* Find already established connection */
-	if (reqsk_queue_empty(&icsk->icsk_accept_queue)) {
+	if (reqsk_queue_empty(queue)) {
 		long timeo = sock_rcvtimeo(sk, flags & O_NONBLOCK);
 
 		/* If this is a non blocking socket don't sleep */
@@ -308,14 +310,32 @@ struct sock *inet_csk_accept(struct sock *sk, int flags, int *err)
 		if (error)
 			goto out_err;
 	}
-
-	newsk = reqsk_queue_get_child(&icsk->icsk_accept_queue, sk);
-	WARN_ON(newsk->sk_state == TCP_SYN_RECV);
+	req = reqsk_queue_remove(queue);
+	newsk = req->sk;
+
+	sk_acceptq_removed(sk);
+	if (sk->sk_type == SOCK_STREAM && queue->fastopenq != NULL) {
+		spin_lock_bh(&queue->fastopenq->lock);
+		if (tcp_rsk(req)->listener) {
+			/* We are still waiting for the final ACK from 3WHS
+			 * so can't free req now. Instead, we set req->sk to
+			 * NULL to signify that the child socket is taken
+			 * so reqsk_fastopen_remove() will free the req
+			 * when 3WHS finishes (or is aborted).
+			 */
+			req->sk = NULL;
+			req = NULL;
+		}
+		spin_unlock_bh(&queue->fastopenq->lock);
+	}
 out:
 	release_sock(sk);
+	if (req)
+		__reqsk_free(req);
 	return newsk;
 out_err:
 	newsk = NULL;
+	req = NULL;
 	*err = error;
 	goto out;
 }
@@ -720,13 +740,14 @@ EXPORT_SYMBOL_GPL(inet_csk_listen_start);
 void inet_csk_listen_stop(struct sock *sk)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
+	struct request_sock_queue *queue = &icsk->icsk_accept_queue;
 	struct request_sock *acc_req;
 	struct request_sock *req;
 
 	inet_csk_delete_keepalive_timer(sk);
 
 	/* make all the listen_opt local to us */
-	acc_req = reqsk_queue_yank_acceptq(&icsk->icsk_accept_queue);
+	acc_req = reqsk_queue_yank_acceptq(queue);
 
 	/* Following specs, it would be better either to send FIN
 	 * (and enter FIN-WAIT-1, it is normal close)
@@ -736,7 +757,7 @@ void inet_csk_listen_stop(struct sock *sk)
 	 * To be honest, we are not able to make either
 	 * of the variants now.			--ANK
 	 */
-	reqsk_queue_destroy(&icsk->icsk_accept_queue);
+	reqsk_queue_destroy(queue);
 
 	while ((req = acc_req) != NULL) {
 		struct sock *child = req->sk;
@@ -754,6 +775,19 @@ void inet_csk_listen_stop(struct sock *sk)
 
 		percpu_counter_inc(sk->sk_prot->orphan_count);
 
+		if (sk->sk_type == SOCK_STREAM && tcp_rsk(req)->listener) {
+			BUG_ON(tcp_sk(child)->fastopen_rsk != req);
+			BUG_ON(sk != tcp_rsk(req)->listener);
+
+			/* Paranoid, to prevent race condition if
+			 * an inbound pkt destined for child is
+			 * blocked by sock lock in tcp_v4_rcv().
+			 * Also to satisfy an assertion in
+			 * tcp_v4_destroy_sock().
+			 */
+			tcp_sk(child)->fastopen_rsk = NULL;
+			sock_put(sk);
+		}
 		inet_csk_destroy_sock(child);
 
 		bh_unlock_sock(child);
@@ -763,6 +797,17 @@ void inet_csk_listen_stop(struct sock *sk)
 		sk_acceptq_removed(sk);
 		__reqsk_free(req);
 	}
+	if (queue->fastopenq != NULL) {
+		/* Free all the reqs queued in rskq_rst_head. */
+		spin_lock_bh(&queue->fastopenq->lock);
+		acc_req = queue->fastopenq->rskq_rst_head;
+		queue->fastopenq->rskq_rst_head = NULL;
+		spin_unlock_bh(&queue->fastopenq->lock);
+		while ((req = acc_req) != NULL) {
+			acc_req = req->dl_next;
+			__reqsk_free(req);
+		}
+	}
 	WARN_ON(sk->sk_ack_backlog);
 }
 EXPORT_SYMBOL_GPL(inet_csk_listen_stop);
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 650e152..ba48e79 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -319,6 +319,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
 	ireq->tstamp_ok		= tcp_opt.saw_tstamp;
 	req->ts_recent		= tcp_opt.saw_tstamp ? tcp_opt.rcv_tsval : 0;
 	treq->snt_synack	= tcp_opt.saw_tstamp ? tcp_opt.rcv_tsecr : 0;
+	treq->listener		= NULL;
 
 	/* We throwed the options of the initial SYN away, so we hope
 	 * the ACK carries the same options again (see RFC1122 4.2.3.8)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 2109ff4..df83d74 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -486,8 +486,9 @@ unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
 	if (sk->sk_shutdown & RCV_SHUTDOWN)
 		mask |= POLLIN | POLLRDNORM | POLLRDHUP;
 
-	/* Connected? */
-	if ((1 << sk->sk_state) & ~(TCPF_SYN_SENT | TCPF_SYN_RECV)) {
+	/* Connected or passive Fast Open socket? */
+	if (sk->sk_state != TCP_SYN_SENT &&
+	    (sk->sk_state != TCP_SYN_RECV || tp->fastopen_rsk != NULL)) {
 		int target = sock_rcvlowat(sk, 0, INT_MAX);
 
 		if (tp->urg_seq == tp->copied_seq &&
@@ -840,10 +841,15 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffse
 	ssize_t copied;
 	long timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
 
-	/* Wait for a connection to finish. */
-	if ((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT))
+	/* Wait for a connection to finish. One exception is TCP Fast Open
+	 * (passive side) where data is allowed to be sent before a connection
+	 * is fully established.
+	 */
+	if (((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) &&
+	    !tcp_passive_fastopen(sk)) {
 		if ((err = sk_stream_wait_connect(sk, &timeo)) != 0)
 			goto out_err;
+	}
 
 	clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);
 
@@ -1042,10 +1048,15 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 
 	timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
 
-	/* Wait for a connection to finish. */
-	if ((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT))
+	/* Wait for a connection to finish. One exception is TCP Fast Open
+	 * (passive side) where data is allowed to be sent before a connection
+	 * is fully established.
+	 */
+	if (((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) &&
+	    !tcp_passive_fastopen(sk)) {
 		if ((err = sk_stream_wait_connect(sk, &timeo)) != 0)
 			goto do_error;
+	}
 
 	if (unlikely(tp->repair)) {
 		if (tp->repair_queue == TCP_RECV_QUEUE) {
@@ -2144,6 +2155,10 @@ void tcp_close(struct sock *sk, long timeout)
 		 * they look as CLOSING or LAST_ACK for Linux)
 		 * Probably, I missed some more holelets.
 		 * 						--ANK
+		 * XXX (TFO) - To start off we don't support SYN+ACK+FIN
+		 * in a single packet! (May consider it later but will
+		 * probably need API support or TCP_CORK SYN-ACK until
+		 * data is written and socket is closed.)
 		 */
 		tcp_send_fin(sk);
 	}
@@ -2215,8 +2230,16 @@ adjudge_to_death:
 		}
 	}
 
-	if (sk->sk_state == TCP_CLOSE)
+	if (sk->sk_state == TCP_CLOSE) {
+		struct request_sock *req = tcp_sk(sk)->fastopen_rsk;
+		/* We could get here with a non-NULL req if the socket is
+		 * aborted (e.g., closed with unread data) before 3WHS
+		 * finishes.
+		 */
+		if (req != NULL)
+			reqsk_fastopen_remove(sk, req, false);
 		inet_csk_destroy_sock(sk);
+	}
 	/* Otherwise, socket is reprieved until protocol close. */
 
 out:
@@ -2688,6 +2711,14 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 		else
 			icsk->icsk_user_timeout = msecs_to_jiffies(val);
 		break;
+
+	case TCP_FASTOPEN:
+		if (val >= 0 && ((1 << sk->sk_state) & (TCPF_CLOSE |
+		    TCPF_LISTEN)))
+			err = fastopen_init_queue(sk, val);
+		else
+			err = -EINVAL;
+		break;
 	default:
 		err = -ENOPROTOOPT;
 		break;
@@ -3501,11 +3532,15 @@ EXPORT_SYMBOL(tcp_cookie_generator);
 
 void tcp_done(struct sock *sk)
 {
+	struct request_sock *req = tcp_sk(sk)->fastopen_rsk;
+
 	if (sk->sk_state == TCP_SYN_SENT || sk->sk_state == TCP_SYN_RECV)
 		TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_ATTEMPTFAILS);
 
 	tcp_set_state(sk, TCP_CLOSE);
 	tcp_clear_xmit_timers(sk);
+	if (req != NULL)
+		reqsk_fastopen_remove(sk, req, false);
 
 	sk->sk_shutdown = SHUTDOWN_MASK;
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 36f02f9..bb148de 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -839,7 +839,7 @@ static int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
 	if (!dst && (dst = inet_csk_route_req(sk, &fl4, req)) == NULL)
 		return -1;
 
-	skb = tcp_make_synack(sk, dst, req, rvp);
+	skb = tcp_make_synack(sk, dst, req, rvp, NULL);
 
 	if (skb) {
 		__tcp_v4_send_check(skb, ireq->loc_addr, ireq->rmt_addr);
@@ -1554,7 +1554,7 @@ static struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb)
 	struct request_sock *req = inet_csk_search_req(sk, &prev, th->source,
 						       iph->saddr, iph->daddr);
 	if (req)
-		return tcp_check_req(sk, skb, req, prev);
+		return tcp_check_req(sk, skb, req, prev, false);
 
 	nsk = inet_lookup_established(sock_net(sk), &tcp_hashinfo, iph->saddr,
 			th->source, iph->daddr, th->dest, inet_iif(skb));
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 6ff7f10..e965319 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -507,6 +507,7 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
 			newicsk->icsk_ack.last_seg_size = skb->len - newtp->tcp_header_len;
 		newtp->rx_opt.mss_clamp = req->mss;
 		TCP_ECN_openreq_child(newtp, req);
+		newtp->fastopen_rsk = NULL;
 
 		TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_PASSIVEOPENS);
 	}
@@ -515,13 +516,18 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
 EXPORT_SYMBOL(tcp_create_openreq_child);
 
 /*
- *	Process an incoming packet for SYN_RECV sockets represented
- *	as a request_sock.
+ * Process an incoming packet for SYN_RECV sockets represented as a
+ * request_sock. Normally sk is the listener socket but for TFO it
+ * points to the child socket.
+ *
+ * XXX (TFO) - The current impl contains a special check for ack
+ * validation and inside tcp_v4_reqsk_send_ack(). Can we do better?
  */
 
 struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 			   struct request_sock *req,
-			   struct request_sock **prev)
+			   struct request_sock **prev,
+			   bool fastopen)
 {
 	struct tcp_options_received tmp_opt;
 	const u8 *hash_location;
@@ -530,6 +536,8 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 	__be32 flg = tcp_flag_word(th) & (TCP_FLAG_RST|TCP_FLAG_SYN|TCP_FLAG_ACK);
 	bool paws_reject = false;
 
+	BUG_ON(fastopen == (sk->sk_state == TCP_LISTEN));
+
 	tmp_opt.saw_tstamp = 0;
 	if (th->doff > (sizeof(struct tcphdr)>>2)) {
 		tcp_parse_options(skb, &tmp_opt, &hash_location, 0, NULL);
@@ -565,6 +573,9 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 		 *
 		 * Enforce "SYN-ACK" according to figure 8, figure 6
 		 * of RFC793, fixed by RFC1122.
+		 *
+		 * Note that even if there is new data in the SYN packet
+		 * they will be thrown away too.
 		 */
 		req->rsk_ops->rtx_syn_ack(sk, req, NULL);
 		return NULL;
@@ -622,9 +633,12 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 	 *                  sent (the segment carries an unacceptable ACK) ...
 	 *                  a reset is sent."
 	 *
-	 * Invalid ACK: reset will be sent by listening socket
+	 * Invalid ACK: reset will be sent by listening socket.
+	 * Note that the ACK validity check for a Fast Open socket is done
+	 * elsewhere and is checked directly against the child socket rather
+	 * than req because user data may have been sent out.
 	 */
-	if ((flg & TCP_FLAG_ACK) &&
+	if ((flg & TCP_FLAG_ACK) && !fastopen &&
 	    (TCP_SKB_CB(skb)->ack_seq !=
 	     tcp_rsk(req)->snt_isn + 1 + tcp_s_data_size(tcp_sk(sk))))
 		return sk;
@@ -637,7 +651,7 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 	/* RFC793: "first check sequence number". */
 
 	if (paws_reject || !tcp_in_window(TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq,
-					  tcp_rsk(req)->rcv_isn + 1, tcp_rsk(req)->rcv_isn + 1 + req->rcv_wnd)) {
+					  tcp_rsk(req)->rcv_nxt, tcp_rsk(req)->rcv_nxt + req->rcv_wnd)) {
 		/* Out of window: send ACK and drop. */
 		if (!(flg & TCP_FLAG_RST))
 			req->rsk_ops->send_ack(sk, skb, req);
@@ -648,7 +662,7 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 
 	/* In sequence, PAWS is OK. */
 
-	if (tmp_opt.saw_tstamp && !after(TCP_SKB_CB(skb)->seq, tcp_rsk(req)->rcv_isn + 1))
+	if (tmp_opt.saw_tstamp && !after(TCP_SKB_CB(skb)->seq, tcp_rsk(req)->rcv_nxt))
 		req->ts_recent = tmp_opt.rcv_tsval;
 
 	if (TCP_SKB_CB(skb)->seq == tcp_rsk(req)->rcv_isn) {
@@ -667,10 +681,19 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 
 	/* ACK sequence verified above, just make sure ACK is
 	 * set.  If ACK not set, just silently drop the packet.
+	 *
+	 * XXX (TFO) - if we ever allow "data after SYN", the
+	 * following check needs to be removed.
 	 */
 	if (!(flg & TCP_FLAG_ACK))
 		return NULL;
 
+	/* For Fast Open no more processing is needed (sk is the
+	 * child socket).
+	 */
+	if (fastopen)
+		return sk;
+
 	/* While TCP_DEFER_ACCEPT is active, drop bare ACK. */
 	if (req->retrans < inet_csk(sk)->icsk_accept_queue.rskq_defer_accept &&
 	    TCP_SKB_CB(skb)->end_seq == tcp_rsk(req)->rcv_isn + 1) {
@@ -706,11 +729,21 @@ listen_overflow:
 	}
 
 embryonic_reset:
-	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_EMBRYONICRSTS);
-	if (!(flg & TCP_FLAG_RST))
+	if (!(flg & TCP_FLAG_RST)) {
+		/* Received a bad SYN pkt - for TFO We try not to reset
+		 * the local connection unless it's really necessary to
+		 * avoid becoming vulnerable to outside attack aiming at
+		 * resetting legit local connections.
+		 */
 		req->rsk_ops->send_reset(sk, skb);
-
-	inet_csk_reqsk_queue_drop(sk, req, prev);
+	} else if (fastopen) { /* received a valid RST pkt */
+		reqsk_fastopen_remove(sk, req, true);
+		tcp_reset(sk);
+	}
+	if (!fastopen) {
+		inet_csk_reqsk_queue_drop(sk, req, prev);
+		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_EMBRYONICRSTS);
+	}
 	return NULL;
 }
 EXPORT_SYMBOL(tcp_check_req);
@@ -719,6 +752,12 @@ EXPORT_SYMBOL(tcp_check_req);
  * Queue segment on the new socket if the new socket is active,
  * otherwise we just shortcircuit this and continue with
  * the new socket.
+ *
+ * For the vast majority of cases child->sk_state will be TCP_SYN_RECV
+ * when entering. But other states are possible due to a race condition
+ * where after __inet_lookup_established() fails but before the listener
+ * locked is obtained, other packets cause the same connection to
+ * be created.
  */
 
 int tcp_child_process(struct sock *parent, struct sock *child,
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index d046326..9383b51 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -702,7 +702,8 @@ static unsigned int tcp_synack_options(struct sock *sk,
 				   unsigned int mss, struct sk_buff *skb,
 				   struct tcp_out_options *opts,
 				   struct tcp_md5sig_key **md5,
-				   struct tcp_extend_values *xvp)
+				   struct tcp_extend_values *xvp,
+				   struct tcp_fastopen_cookie *foc)
 {
 	struct inet_request_sock *ireq = inet_rsk(req);
 	unsigned int remaining = MAX_TCP_OPTION_SPACE;
@@ -747,7 +748,15 @@ static unsigned int tcp_synack_options(struct sock *sk,
 		if (unlikely(!ireq->tstamp_ok))
 			remaining -= TCPOLEN_SACKPERM_ALIGNED;
 	}
-
+	if (foc != NULL) {
+		u32 need = TCPOLEN_EXP_FASTOPEN_BASE + foc->len;
+		need = (need + 3) & ~3U;  /* Align to 32 bits */
+		if (remaining >= need) {
+			opts->options |= OPTION_FAST_OPEN_COOKIE;
+			opts->fastopen_cookie = foc;
+			remaining -= need;
+		}
+	}
 	/* Similar rationale to tcp_syn_options() applies here, too.
 	 * If the <SYN> options fit, the same options should fit now!
 	 */
@@ -2658,7 +2667,8 @@ int tcp_send_synack(struct sock *sk)
  */
 struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
 				struct request_sock *req,
-				struct request_values *rvp)
+				struct request_values *rvp,
+				struct tcp_fastopen_cookie *foc)
 {
 	struct tcp_out_options opts;
 	struct tcp_extend_values *xvp = tcp_xv(rvp);
@@ -2718,7 +2728,7 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
 #endif
 	TCP_SKB_CB(skb)->when = tcp_time_stamp;
 	tcp_header_size = tcp_synack_options(sk, req, mss,
-					     skb, &opts, &md5, xvp)
+					     skb, &opts, &md5, xvp, foc)
 			+ sizeof(*th);
 
 	skb_push(skb, tcp_header_size);
@@ -2772,7 +2782,8 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
 	}
 
 	th->seq = htonl(TCP_SKB_CB(skb)->seq);
-	th->ack_seq = htonl(tcp_rsk(req)->rcv_isn + 1);
+	/* XXX data is queued and acked as is. No buffer/window check */
+	th->ack_seq = htonl(tcp_rsk(req)->rcv_nxt);
 
 	/* RFC1323: The window in SYN & SYN/ACK segments is never scaled. */
 	th->window = htons(min(req->rcv_wnd, 65535U));
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index b774a03..fc04711 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -305,6 +305,35 @@ static void tcp_probe_timer(struct sock *sk)
 }
 
 /*
+ *	Timer for Fast Open socket to retransmit SYNACK. Note that the
+ *	sk here is the child socket, not the parent (listener) socket.
+ */
+static void tcp_fastopen_synack_timer(struct sock *sk)
+{
+	struct inet_connection_sock *icsk = inet_csk(sk);
+	int max_retries = icsk->icsk_syn_retries ? :
+	    sysctl_tcp_synack_retries + 1; /* add one more retry for fastopen */
+	struct request_sock *req;
+
+	req = tcp_sk(sk)->fastopen_rsk;
+	req->rsk_ops->syn_ack_timeout(sk, req);
+
+	if (req->retrans >= max_retries) {
+		tcp_write_err(sk);
+		return;
+	}
+	/* XXX (TFO) - Unlike regular SYN-ACK retransmit, we ignore error
+	 * returned from rtx_syn_ack() to make it more persistent like
+	 * regular retransmit because if the child socket has been accepted
+	 * it's not good to give up too easily.
+	 */
+	req->rsk_ops->rtx_syn_ack(sk, req, NULL);
+	req->retrans++;
+	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
+			  TCP_TIMEOUT_INIT << req->retrans, TCP_RTO_MAX);
+}
+
+/*
  *	The TCP retransmit timer.
  */
 
@@ -317,7 +346,15 @@ void tcp_retransmit_timer(struct sock *sk)
 		tcp_resume_early_retransmit(sk);
 		return;
 	}
-
+	if (tp->fastopen_rsk) {
+		BUG_ON(sk->sk_state != TCP_SYN_RECV &&
+		    sk->sk_state != TCP_FIN_WAIT1);
+		tcp_fastopen_synack_timer(sk);
+		/* Before we receive ACK to our SYN-ACK don't retransmit
+		 * anything else (e.g., data or FIN segments).
+		 */
+		return;
+	}
 	if (!tp->packets_out)
 		goto out;
 
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index bb46061..182ab9a 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -190,6 +190,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 	ireq = inet_rsk(req);
 	ireq6 = inet6_rsk(req);
 	treq = tcp_rsk(req);
+	treq->listener = NULL;
 
 	if (security_inet_conn_request(sk, skb, req))
 		goto out_free;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index f99b81d..09078b9 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -475,7 +475,7 @@ static int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
 	if (!dst && (dst = inet6_csk_route_req(sk, fl6, req)) == NULL)
 		goto done;
 
-	skb = tcp_make_synack(sk, dst, req, rvp);
+	skb = tcp_make_synack(sk, dst, req, rvp, NULL);
 
 	if (skb) {
 		__tcp_v6_send_check(skb, &treq->loc_addr, &treq->rmt_addr);
@@ -987,7 +987,7 @@ static struct sock *tcp_v6_hnd_req(struct sock *sk,struct sk_buff *skb)
 				   &ipv6_hdr(skb)->saddr,
 				   &ipv6_hdr(skb)->daddr, inet6_iif(skb));
 	if (req)
-		return tcp_check_req(sk, skb, req, prev);
+		return tcp_check_req(sk, skb, req, prev, false);
 
 	nsk = __inet6_lookup_established(sock_net(sk), &tcp_hashinfo,
 			&ipv6_hdr(skb)->saddr, th->source,
@@ -1179,6 +1179,7 @@ have_isn:
 	    want_cookie)
 		goto drop_and_free;
 
+	tcp_rsk(req)->listener = NULL;
 	inet6_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
 	return 0;
 
-- 
1.7.7.3

^ permalink raw reply related

* [PATCH 1/3] tcp: TCP Fast Open Server - header & support functions
From: H.K. Jerry Chu @ 2012-08-30 23:39 UTC (permalink / raw)
  To: davem, ycheng, edumazet, ncardwell
  Cc: sivasankar, therbert, netdev, Jerry Chu
In-Reply-To: <1346369948-1722-1-git-send-email-hkchu@google.com>

From: Jerry Chu <hkchu@google.com>

This patch adds all the necessary data structure and support
functions to implement TFO server side. It also documents a number
of flags for the sysctl_tcp_fastopen knob, and adds a few Linux
extension MIBs.

In addition, it includes the following:

1. a new TCP_FASTOPEN socket option an application must call to
supply a max backlog allowed in order to enable TFO on its listener.

2. A number of key data structures:
"fastopen_rsk" in tcp_sock - for a big socket to access its
request_sock for retransmission and ack processing purpose. It is
non-NULL iff 3WHS not completed.

"fastopenq" in request_sock_queue - points to a per Fast Open
listener data structure "fastopen_queue" to keep track of qlen (# of
outstanding Fast Open requests) and max_qlen, among other things.

"listener" in tcp_request_sock - to point to the original listener
for book-keeping purpose, i.e., to maintain qlen against max_qlen
as part of defense against IP spoofing attack.

3. various data structure and functions, many in tcp_fastopen.c, to
support server side Fast Open cookie operations, including
/proc/sys/net/ipv4/tcp_fastopen_key to allow manual rekeying.

Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
---
 Documentation/networking/ip-sysctl.txt |   29 ++++++++---
 include/linux/snmp.h                   |    4 ++
 include/linux/tcp.h                    |   45 ++++++++++++++++-
 include/net/request_sock.h             |   36 ++++++++++++++
 include/net/tcp.h                      |   46 +++++++++++++++---
 net/ipv4/proc.c                        |    4 ++
 net/ipv4/sysctl_net_ipv4.c             |   45 +++++++++++++++++
 net/ipv4/tcp_fastopen.c                |   83 +++++++++++++++++++++++++++++++-
 net/ipv4/tcp_input.c                   |    4 +-
 9 files changed, 276 insertions(+), 20 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index ca447b3..5385b28 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -465,16 +465,31 @@ tcp_syncookies - BOOLEAN
 tcp_fastopen - INTEGER
 	Enable TCP Fast Open feature (draft-ietf-tcpm-fastopen) to send data
 	in the opening SYN packet. To use this feature, the client application
-	must not use connect(). Instead, it should use sendmsg() or sendto()
-	with MSG_FASTOPEN flag which performs a TCP handshake automatically.
-
-	The values (bitmap) are:
-	1: Enables sending data in the opening SYN on the client
-	5: Enables sending data in the opening SYN on the client regardless
-	   of cookie availability.
+	must use sendmsg() or sendto() with MSG_FASTOPEN flag rather than
+	connect() to perform a TCP handshake automatically.
+
+	The values (bitmap) are
+	1: Enables sending data in the opening SYN on the client.
+	2: Enables TCP Fast Open on the server side, i.e., allowing data in
+	   a SYN packet to be accepted and passed to the application before
+	   3-way hand shake finishes.
+	4: Send data in the opening SYN regardless of cookie availability and
+	   without a cookie option.
+	0x100: Accept SYN data w/o validating the cookie.
+	0x200: Accept data-in-SYN w/o any cookie option present.
+	0x400/0x800: Enable Fast Open on all listeners regardless of the
+	   TCP_FASTOPEN socket option. The two different flags designate two
+	   different ways of setting max_qlen without the TCP_FASTOPEN socket
+	   option.
 
 	Default: 0
 
+	Note that the client & server side Fast Open flags (1 and 2
+	respectively) must be also enabled before the rest of flags can take
+	effect.
+
+	See include/net/tcp.h and the code for more details.
+
 tcp_syn_retries - INTEGER
 	Number of times initial SYNs for an active TCP connection attempt
 	will be retransmitted. Should not be higher than 255. Default value
diff --git a/include/linux/snmp.h b/include/linux/snmp.h
index ad6e3a6..fdfba23 100644
--- a/include/linux/snmp.h
+++ b/include/linux/snmp.h
@@ -241,6 +241,10 @@ enum
 	LINUX_MIB_TCPCHALLENGEACK,		/* TCPChallengeACK */
 	LINUX_MIB_TCPSYNCHALLENGE,		/* TCPSYNChallenge */
 	LINUX_MIB_TCPFASTOPENACTIVE,		/* TCPFastOpenActive */
+	LINUX_MIB_TCPFASTOPENPASSIVE,		/* TCPFastOpenPassive*/
+	LINUX_MIB_TCPFASTOPENPASSIVEFAIL,	/* TCPFastOpenPassiveFail */
+	LINUX_MIB_TCPFASTOPENLISTENOVERFLOW,	/* TCPFastOpenListenOverflow */
+	LINUX_MIB_TCPFASTOPENCOOKIEREQD,	/* TCPFastOpenCookieReqd */
 	__LINUX_MIB_MAX
 };
 
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index eb125a4..418beda 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -110,6 +110,7 @@ enum {
 #define TCP_REPAIR_QUEUE	20
 #define TCP_QUEUE_SEQ		21
 #define TCP_REPAIR_OPTIONS	22
+#define TCP_FASTOPEN		23	/* Enable FastOpen on listeners */
 
 struct tcp_repair_opt {
 	__u32	opt_code;
@@ -246,6 +247,7 @@ static inline unsigned int tcp_optlen(const struct sk_buff *skb)
 /* TCP Fast Open */
 #define TCP_FASTOPEN_COOKIE_MIN	4	/* Min Fast Open Cookie size in bytes */
 #define TCP_FASTOPEN_COOKIE_MAX	16	/* Max Fast Open Cookie size in bytes */
+#define TCP_FASTOPEN_COOKIE_SIZE 8	/* the size employed by this impl. */
 
 /* TCP Fast Open Cookie as stored in memory */
 struct tcp_fastopen_cookie {
@@ -312,9 +314,14 @@ struct tcp_request_sock {
 	/* Only used by TCP MD5 Signature so far. */
 	const struct tcp_request_sock_ops *af_specific;
 #endif
+	struct sock			*listener; /* needed for TFO */
 	u32				rcv_isn;
 	u32				snt_isn;
 	u32				snt_synack; /* synack sent time */
+	u32				rcv_nxt; /* the ack # by SYNACK. For
+						  * FastOpen it's the seq#
+						  * after data-in-SYN.
+						  */
 };
 
 static inline struct tcp_request_sock *tcp_rsk(const struct request_sock *req)
@@ -505,14 +512,18 @@ struct tcp_sock {
 	struct tcp_md5sig_info	__rcu *md5sig_info;
 #endif
 
-/* TCP fastopen related information */
-	struct tcp_fastopen_request *fastopen_req;
-
 	/* When the cookie options are generated and exchanged, then this
 	 * object holds a reference to them (cookie_values->kref).  Also
 	 * contains related tcp_cookie_transactions fields.
 	 */
 	struct tcp_cookie_values  *cookie_values;
+
+/* TCP fastopen related information */
+	struct tcp_fastopen_request *fastopen_req;
+	/* fastopen_rsk points to request_sock that resulted in this big
+	 * socket. Used to retransmit SYNACKs etc.
+	 */
+	struct request_sock *fastopen_rsk;
 };
 
 enum tsq_flags {
@@ -552,6 +563,34 @@ static inline struct tcp_timewait_sock *tcp_twsk(const struct sock *sk)
 	return (struct tcp_timewait_sock *)sk;
 }
 
+static inline bool tcp_passive_fastopen(const struct sock *sk)
+{
+	return (sk->sk_state == TCP_SYN_RECV &&
+		tcp_sk(sk)->fastopen_rsk != NULL);
+}
+
+static inline bool fastopen_cookie_present(struct tcp_fastopen_cookie *foc)
+{
+	return (foc)->len != -1;
+}
+
+static inline int fastopen_init_queue(struct sock *sk, int backlog)
+{
+	struct request_sock_queue *queue =
+	    &inet_csk(sk)->icsk_accept_queue;
+
+	if (queue->fastopenq == NULL) {
+		queue->fastopenq = kzalloc(
+		    sizeof(struct fastopen_queue),
+		    sk->sk_allocation);
+		if (queue->fastopenq == NULL)
+			return -ENOMEM;
+		spin_lock_init(&queue->fastopenq->lock);
+	}
+	queue->fastopenq->max_qlen = backlog;
+	return 0;
+}
+
 #endif	/* __KERNEL__ */
 
 #endif	/* _LINUX_TCP_H */
diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index 4c0766e..c3cdd6c 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -106,6 +106,34 @@ struct listen_sock {
 	struct request_sock	*syn_table[0];
 };
 
+/*
+ * For a TCP Fast Open listener -
+ *	lock - protects the access to all the reqsk, which is co-owned by
+ *		the listener and the child socket.
+ *	qlen - pending TFO requests (still in TCP_SYN_RECV).
+ *	max_qlen - max TFO reqs allowed before TFO is disabled.
+ *
+ *	XXX (TFO) - ideally these fields can be made as part of "listen_sock"
+ *	structure above. But there is some implementation difficulty due to
+ *	listen_sock being part of request_sock_queue hence will be freed when
+ *	a listener is stopped. But TFO related fields may continue to be
+ *	accessed even after a listener is closed, until its sk_refcnt drops
+ *	to 0 implying no more outstanding TFO reqs. One solution is to keep
+ *	listen_opt around until	sk_refcnt drops to 0. But there is some other
+ *	complexity that needs to be resolved. E.g., a listener can be disabled
+ *	temporarily through shutdown()->tcp_disconnect(), and re-enabled later.
+ */
+struct fastopen_queue {
+	struct request_sock	*rskq_rst_head; /* Keep track of past TFO */
+	struct request_sock	*rskq_rst_tail; /* requests that caused RST.
+						 * This is part of the defense
+						 * against spoofing attack.
+						 */
+	spinlock_t	lock;
+	int		qlen;		/* # of pending (TCP_SYN_RECV) reqs */
+	int		max_qlen;	/* != 0 iff TFO is currently enabled */
+};
+
 /** struct request_sock_queue - queue of request_socks
  *
  * @rskq_accept_head - FIFO head of established children
@@ -129,6 +157,12 @@ struct request_sock_queue {
 	u8			rskq_defer_accept;
 	/* 3 bytes hole, try to pack */
 	struct listen_sock	*listen_opt;
+	struct fastopen_queue	*fastopenq; /* This is non-NULL iff TFO has been
+					     * enabled on this listener. Check
+					     * max_qlen != 0 in fastopen_queue
+					     * to determine if TFO is enabled
+					     * right at this moment.
+					     */
 };
 
 extern int reqsk_queue_alloc(struct request_sock_queue *queue,
@@ -136,6 +170,8 @@ extern int reqsk_queue_alloc(struct request_sock_queue *queue,
 
 extern void __reqsk_queue_destroy(struct request_sock_queue *queue);
 extern void reqsk_queue_destroy(struct request_sock_queue *queue);
+extern void reqsk_fastopen_remove(struct sock *sk,
+				  struct request_sock *req, bool reset);
 
 static inline struct request_sock *
 	reqsk_queue_yank_acceptq(struct request_sock_queue *queue)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 9a0021d..1a88a97 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -214,8 +214,24 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
 
 /* Bit Flags for sysctl_tcp_fastopen */
 #define	TFO_CLIENT_ENABLE	1
+#define	TFO_SERVER_ENABLE	2
 #define	TFO_CLIENT_NO_COOKIE	4	/* Data in SYN w/o cookie option */
 
+/* Process SYN data but skip cookie validation */
+#define	TFO_SERVER_COOKIE_NOT_CHKED	0x100
+/* Accept SYN data w/o any cookie option */
+#define	TFO_SERVER_COOKIE_NOT_REQD	0x200
+
+/* Force enable TFO on all listeners, i.e., not requiring the
+ * TCP_FASTOPEN socket option. SOCKOPT1/2 determine how to set max_qlen.
+ */
+#define	TFO_SERVER_WO_SOCKOPT1	0x400
+#define	TFO_SERVER_WO_SOCKOPT2	0x800
+/* Always create TFO child sockets on a TFO listener even when
+ * cookie/data not present. (For testing purpose!)
+ */
+#define	TFO_SERVER_ALWAYS	0x1000
+
 extern struct inet_timewait_death_row tcp_death_row;
 
 /* sysctl variables for tcp */
@@ -411,12 +427,6 @@ extern void tcp_metrics_init(void);
 extern bool tcp_peer_is_proven(struct request_sock *req, struct dst_entry *dst, bool paws_check);
 extern bool tcp_remember_stamp(struct sock *sk);
 extern bool tcp_tw_remember_stamp(struct inet_timewait_sock *tw);
-extern void tcp_fastopen_cache_get(struct sock *sk, u16 *mss,
-				   struct tcp_fastopen_cookie *cookie,
-				   int *syn_loss, unsigned long *last_syn_loss);
-extern void tcp_fastopen_cache_set(struct sock *sk, u16 mss,
-				   struct tcp_fastopen_cookie *cookie,
-				   bool syn_lost);
 extern void tcp_fetch_timewait_stamp(struct sock *sk, struct dst_entry *dst);
 extern void tcp_disable_fack(struct tcp_sock *tp);
 extern void tcp_close(struct sock *sk, long timeout);
@@ -527,6 +537,7 @@ extern void tcp_send_delayed_ack(struct sock *sk);
 extern void tcp_cwnd_application_limited(struct sock *sk);
 extern void tcp_resume_early_retransmit(struct sock *sk);
 extern void tcp_rearm_rto(struct sock *sk);
+extern void tcp_reset(struct sock *sk);
 
 /* tcp_timer.c */
 extern void tcp_init_xmit_timers(struct sock *);
@@ -576,6 +587,7 @@ extern int tcp_mtu_to_mss(struct sock *sk, int pmtu);
 extern int tcp_mss_to_mtu(struct sock *sk, int mss);
 extern void tcp_mtup_init(struct sock *sk);
 extern void tcp_valid_rtt_meas(struct sock *sk, u32 seq_rtt);
+extern void tcp_init_buffer_space(struct sock *sk);
 
 static inline void tcp_bound_rto(const struct sock *sk)
 {
@@ -1094,6 +1106,7 @@ static inline void tcp_openreq_init(struct request_sock *req,
 	req->rcv_wnd = 0;		/* So that tcp_send_synack() knows! */
 	req->cookie_ts = 0;
 	tcp_rsk(req)->rcv_isn = TCP_SKB_CB(skb)->seq;
+	tcp_rsk(req)->rcv_nxt = TCP_SKB_CB(skb)->seq + 1;
 	req->mss = rx_opt->mss_clamp;
 	req->ts_recent = rx_opt->saw_tstamp ? rx_opt->rcv_tsval : 0;
 	ireq->tstamp_ok = rx_opt->tstamp_ok;
@@ -1298,15 +1311,34 @@ extern int tcp_md5_hash_skb_data(struct tcp_md5sig_pool *, const struct sk_buff
 extern int tcp_md5_hash_key(struct tcp_md5sig_pool *hp,
 			    const struct tcp_md5sig_key *key);
 
+/* From tcp_fastopen.c */
+extern void tcp_fastopen_cache_get(struct sock *sk, u16 *mss,
+				   struct tcp_fastopen_cookie *cookie,
+				   int *syn_loss, unsigned long *last_syn_loss);
+extern void tcp_fastopen_cache_set(struct sock *sk, u16 mss,
+				   struct tcp_fastopen_cookie *cookie,
+				   bool syn_lost);
 struct tcp_fastopen_request {
 	/* Fast Open cookie. Size 0 means a cookie request */
 	struct tcp_fastopen_cookie	cookie;
 	struct msghdr			*data;  /* data in MSG_FASTOPEN */
 	u16				copied;	/* queued in tcp_connect() */
 };
-
 void tcp_free_fastopen_req(struct tcp_sock *tp);
 
+extern struct tcp_fastopen_context __rcu *tcp_fastopen_ctx;
+int tcp_fastopen_reset_cipher(void *key, unsigned int len);
+void tcp_fastopen_cookie_gen(__be32 addr, struct tcp_fastopen_cookie *foc);
+
+#define TCP_FASTOPEN_KEY_LENGTH 16
+
+/* Fastopen key context */
+struct tcp_fastopen_context {
+	struct crypto_cipher __rcu	*tfm;
+	__u8				key[TCP_FASTOPEN_KEY_LENGTH];
+	struct rcu_head			rcu;
+};
+
 /* write queue abstraction */
 static inline void tcp_write_queue_purge(struct sock *sk)
 {
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 957acd1..8de53e1 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -263,6 +263,10 @@ static const struct snmp_mib snmp4_net_list[] = {
 	SNMP_MIB_ITEM("TCPChallengeACK", LINUX_MIB_TCPCHALLENGEACK),
 	SNMP_MIB_ITEM("TCPSYNChallenge", LINUX_MIB_TCPSYNCHALLENGE),
 	SNMP_MIB_ITEM("TCPFastOpenActive", LINUX_MIB_TCPFASTOPENACTIVE),
+	SNMP_MIB_ITEM("TCPFastOpenPassive", LINUX_MIB_TCPFASTOPENPASSIVE),
+	SNMP_MIB_ITEM("TCPFastOpenPassiveFail", LINUX_MIB_TCPFASTOPENPASSIVEFAIL),
+	SNMP_MIB_ITEM("TCPFastOpenListenOverflow", LINUX_MIB_TCPFASTOPENLISTENOVERFLOW),
+	SNMP_MIB_ITEM("TCPFastOpenCookieReqd", LINUX_MIB_TCPFASTOPENCOOKIEREQD),
 	SNMP_MIB_SENTINEL
 };
 
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 3e78c79..9205e49 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -232,6 +232,45 @@ static int ipv4_tcp_mem(ctl_table *ctl, int write,
 	return 0;
 }
 
+int proc_tcp_fastopen_key(ctl_table *ctl, int write, void __user *buffer,
+			  size_t *lenp, loff_t *ppos)
+{
+	ctl_table tbl = { .maxlen = (TCP_FASTOPEN_KEY_LENGTH * 2 + 10) };
+	struct tcp_fastopen_context *ctxt;
+	int ret;
+	u32  user_key[4]; /* 16 bytes, matching TCP_FASTOPEN_KEY_LENGTH */
+
+	tbl.data = kmalloc(tbl.maxlen, GFP_KERNEL);
+	if (!tbl.data)
+		return -ENOMEM;
+
+	rcu_read_lock();
+	ctxt = rcu_dereference(tcp_fastopen_ctx);
+	if (ctxt)
+		memcpy(user_key, ctxt->key, TCP_FASTOPEN_KEY_LENGTH);
+	rcu_read_unlock();
+
+	snprintf(tbl.data, tbl.maxlen, "%08x-%08x-%08x-%08x",
+		user_key[0], user_key[1], user_key[2], user_key[3]);
+	ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
+
+	if (write && ret == 0) {
+		if (sscanf(tbl.data, "%x-%x-%x-%x", user_key, user_key + 1,
+			   user_key + 2, user_key + 3) != 4) {
+			ret = -EINVAL;
+			goto bad_key;
+		}
+		tcp_fastopen_reset_cipher(user_key, TCP_FASTOPEN_KEY_LENGTH);
+	}
+
+bad_key:
+	pr_debug("proc FO key set 0x%x-%x-%x-%x <- 0x%s: %u\n",
+	       user_key[0], user_key[1], user_key[2], user_key[3],
+	       (char *)tbl.data, ret);
+	kfree(tbl.data);
+	return ret;
+}
+
 static struct ctl_table ipv4_table[] = {
 	{
 		.procname	= "tcp_timestamps",
@@ -386,6 +425,12 @@ static struct ctl_table ipv4_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 	{
+		.procname	= "tcp_fastopen_key",
+		.mode		= 0600,
+		.maxlen		= ((TCP_FASTOPEN_KEY_LENGTH * 2) + 10),
+		.proc_handler	= proc_tcp_fastopen_key,
+	},
+	{
 		.procname	= "tcp_tw_recycle",
 		.data		= &tcp_death_row.sysctl_tw_recycle,
 		.maxlen		= sizeof(int),
diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
index a7f729c..8f7ef0a 100644
--- a/net/ipv4/tcp_fastopen.c
+++ b/net/ipv4/tcp_fastopen.c
@@ -1,10 +1,91 @@
+#include <linux/err.h>
 #include <linux/init.h>
 #include <linux/kernel.h>
+#include <linux/list.h>
+#include <linux/tcp.h>
+#include <linux/rcupdate.h>
+#include <linux/rculist.h>
+#include <net/inetpeer.h>
+#include <net/tcp.h>
 
-int sysctl_tcp_fastopen;
+int sysctl_tcp_fastopen __read_mostly;
+
+struct tcp_fastopen_context __rcu *tcp_fastopen_ctx;
+
+static DEFINE_SPINLOCK(tcp_fastopen_ctx_lock);
+
+static void tcp_fastopen_ctx_free(struct rcu_head *head)
+{
+	struct tcp_fastopen_context *ctx =
+	    container_of(head, struct tcp_fastopen_context, rcu);
+	crypto_free_cipher(ctx->tfm);
+	kfree(ctx);
+}
+
+int tcp_fastopen_reset_cipher(void *key, unsigned int len)
+{
+	int err;
+	struct tcp_fastopen_context *ctx, *octx;
+
+	ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return -ENOMEM;
+	ctx->tfm = crypto_alloc_cipher("aes", 0, 0);
+
+	if (IS_ERR(ctx->tfm)) {
+		err = PTR_ERR(ctx->tfm);
+error:		kfree(ctx);
+		pr_err("TCP: TFO aes cipher alloc error: %d\n", err);
+		return err;
+	}
+	err = crypto_cipher_setkey(ctx->tfm, key, len);
+	if (err) {
+		pr_err("TCP: TFO cipher key error: %d\n", err);
+		crypto_free_cipher(ctx->tfm);
+		goto error;
+	}
+	memcpy(ctx->key, key, len);
+
+	spin_lock(&tcp_fastopen_ctx_lock);
+
+	octx = rcu_dereference_protected(tcp_fastopen_ctx,
+				lockdep_is_held(&tcp_fastopen_ctx_lock));
+	rcu_assign_pointer(tcp_fastopen_ctx, ctx);
+	spin_unlock(&tcp_fastopen_ctx_lock);
+
+	if (octx)
+		call_rcu(&octx->rcu, tcp_fastopen_ctx_free);
+	return err;
+}
+
+/* Computes the fastopen cookie for the peer.
+ * The peer address is a 128 bits long (pad with zeros for IPv4).
+ *
+ * The caller must check foc->len to determine if a valid cookie
+ * has been generated successfully.
+*/
+void tcp_fastopen_cookie_gen(__be32 addr, struct tcp_fastopen_cookie *foc)
+{
+	__be32 peer_addr[4] = { addr, 0, 0, 0 };
+	struct tcp_fastopen_context *ctx;
+
+	rcu_read_lock();
+	ctx = rcu_dereference(tcp_fastopen_ctx);
+	if (ctx) {
+		crypto_cipher_encrypt_one(ctx->tfm,
+					  foc->val,
+					  (__u8 *)peer_addr);
+		foc->len = TCP_FASTOPEN_COOKIE_SIZE;
+	}
+	rcu_read_unlock();
+}
 
 static int __init tcp_fastopen_init(void)
 {
+	__u8 key[TCP_FASTOPEN_KEY_LENGTH];
+
+	get_random_bytes(key, sizeof(key));
+	tcp_fastopen_reset_cipher(key, sizeof(key));
 	return 0;
 }
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index bcfccc5..d184f23 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -378,7 +378,7 @@ static void tcp_fixup_rcvbuf(struct sock *sk)
 /* 4. Try to fixup all. It is made immediately after connection enters
  *    established state.
  */
-static void tcp_init_buffer_space(struct sock *sk)
+void tcp_init_buffer_space(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	int maxwin;
@@ -4039,7 +4039,7 @@ static inline bool tcp_sequence(const struct tcp_sock *tp, u32 seq, u32 end_seq)
 }
 
 /* When we get a reset we do this. */
-static void tcp_reset(struct sock *sk)
+void tcp_reset(struct sock *sk)
 {
 	/* We want the right error as BSD sees it (and indeed as we do). */
 	switch (sk->sk_state) {
-- 
1.7.7.3

^ permalink raw reply related

* [PATCH 3/3] tcp: TCP Fast Open Server - main code path
From: H.K. Jerry Chu @ 2012-08-30 23:39 UTC (permalink / raw)
  To: davem, ycheng, edumazet, ncardwell
  Cc: sivasankar, therbert, netdev, Jerry Chu
In-Reply-To: <1346369948-1722-1-git-send-email-hkchu@google.com>

From: Jerry Chu <hkchu@google.com>

This patch adds the main processing path to complete the TFO server
patches.

A TFO request (i.e., SYN+data packet with a TFO cookie option) first
gets processed in tcp_v4_conn_request(). If it passes the various TFO
checks by tcp_fastopen_check(), a child socket will be created right
away to be accepted by applications, rather than waiting for the 3WHS
to finish.

In additon to the use of TFO cookie, a simple max_qlen based scheme
is put in place to fend off spoofed TFO attack.

When a valid ACK comes back to tcp_rcv_state_process(), it will cause
the state of the child socket to switch from either TCP_SYN_RECV to
TCP_ESTABLISHED, or TCP_FIN_WAIT1 to TCP_FIN_WAIT2. At this time
retransmission will resume for any unack'ed (data, FIN,...) segments.

Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
---
 net/ipv4/tcp_input.c |   71 +++++++++++---
 net/ipv4/tcp_ipv4.c  |  266 +++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 310 insertions(+), 27 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index d184f23..ba179b6 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3124,6 +3124,12 @@ void tcp_rearm_rto(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
+	/* If the retrans timer is currently being used by Fast Open
+	 * for SYN-ACK retrans purpose, stay put.
+	 */
+	if (tp->fastopen_rsk)
+		return;
+
 	if (!tp->packets_out) {
 		inet_csk_clear_xmit_timer(sk, ICSK_TIME_RETRANS);
 	} else {
@@ -5896,7 +5902,9 @@ discard:
 		tcp_send_synack(sk);
 #if 0
 		/* Note, we could accept data and URG from this segment.
-		 * There are no obstacles to make this.
+		 * There are no obstacles to make this (except that we must
+		 * either change tcp_recvmsg() to prevent it from returning data
+		 * before 3WHS completes per RFC793, or employ TCP Fast Open).
 		 *
 		 * However, if we ignore data in ACKless segments sometimes,
 		 * we have no reasons to accept it sometimes.
@@ -5936,6 +5944,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct inet_connection_sock *icsk = inet_csk(sk);
+	struct request_sock *req;
 	int queued = 0;
 
 	tp->rx_opt.saw_tstamp = 0;
@@ -5991,7 +6000,14 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
 		return 0;
 	}
 
-	if (!tcp_validate_incoming(sk, skb, th, 0))
+	req = tp->fastopen_rsk;
+	if (req != NULL) {
+		BUG_ON(sk->sk_state != TCP_SYN_RECV &&
+		    sk->sk_state != TCP_FIN_WAIT1);
+
+		if (tcp_check_req(sk, skb, req, NULL, true) == NULL)
+			goto discard;
+	} else if (!tcp_validate_incoming(sk, skb, th, 0))
 		return 0;
 
 	/* step 5: check the ACK field */
@@ -6001,7 +6017,22 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
 		switch (sk->sk_state) {
 		case TCP_SYN_RECV:
 			if (acceptable) {
-				tp->copied_seq = tp->rcv_nxt;
+				/* Once we leave TCP_SYN_RECV, we no longer
+				 * need req so release it.
+				 */
+				if (req) {
+					reqsk_fastopen_remove(sk, req, false);
+				} else {
+					/* Make sure socket is routed, for
+					 * correct metrics.
+					 */
+					icsk->icsk_af_ops->rebuild_header(sk);
+					tcp_init_congestion_control(sk);
+
+					tcp_mtup_init(sk);
+					tcp_init_buffer_space(sk);
+					tp->copied_seq = tp->rcv_nxt;
+				}
 				smp_mb();
 				tcp_set_state(sk, TCP_ESTABLISHED);
 				sk->sk_state_change(sk);
@@ -6023,23 +6054,27 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
 				if (tp->rx_opt.tstamp_ok)
 					tp->advmss -= TCPOLEN_TSTAMP_ALIGNED;
 
-				/* Make sure socket is routed, for
-				 * correct metrics.
-				 */
-				icsk->icsk_af_ops->rebuild_header(sk);
-
-				tcp_init_metrics(sk);
-
-				tcp_init_congestion_control(sk);
+				if (req) {
+					/* Re-arm the timer because data may
+					 * have been sent out. This is similar
+					 * to the regular data transmission case
+					 * when new data has just been ack'ed.
+					 *
+					 * (TFO) - we could try to be more
+					 * aggressive and retranmitting any data
+					 * sooner based on when they were sent
+					 * out.
+					 */
+					tcp_rearm_rto(sk);
+				} else
+					tcp_init_metrics(sk);
 
 				/* Prevent spurious tcp_cwnd_restart() on
 				 * first data packet.
 				 */
 				tp->lsndtime = tcp_time_stamp;
 
-				tcp_mtup_init(sk);
 				tcp_initialize_rcv_mss(sk);
-				tcp_init_buffer_space(sk);
 				tcp_fast_path_on(tp);
 			} else {
 				return 1;
@@ -6047,6 +6082,16 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
 			break;
 
 		case TCP_FIN_WAIT1:
+			/* If we enter the TCP_FIN_WAIT1 state and we are a
+			 * Fast Open socket and this is the first acceptable
+			 * ACK we have received, this would have acknowledged
+			 * our SYNACK so stop the SYNACK timer.
+			 */
+			if (acceptable && req != NULL) {
+				/* We no longer need the request sock. */
+				reqsk_fastopen_remove(sk, req, false);
+				tcp_rearm_rto(sk);
+			}
 			if (tp->snd_una == tp->write_seq) {
 				struct dst_entry *dst;
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index bb148de..232e51c 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -352,6 +352,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
 	const int code = icmp_hdr(icmp_skb)->code;
 	struct sock *sk;
 	struct sk_buff *skb;
+	struct request_sock *req;
 	__u32 seq;
 	__u32 remaining;
 	int err;
@@ -394,9 +395,12 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
 
 	icsk = inet_csk(sk);
 	tp = tcp_sk(sk);
+	req = tp->fastopen_rsk;
 	seq = ntohl(th->seq);
 	if (sk->sk_state != TCP_LISTEN &&
-	    !between(seq, tp->snd_una, tp->snd_nxt)) {
+	    !between(seq, tp->snd_una, tp->snd_nxt) &&
+	    (req == NULL || seq != tcp_rsk(req)->snt_isn)) {
+		/* For a Fast Open socket, allow seq to be snt_isn. */
 		NET_INC_STATS_BH(net, LINUX_MIB_OUTOFWINDOWICMPS);
 		goto out;
 	}
@@ -435,6 +439,8 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
 		    !icsk->icsk_backoff)
 			break;
 
+		/* XXX (TFO) - revisit the following logic for TFO */
+
 		if (sock_owned_by_user(sk))
 			break;
 
@@ -466,6 +472,14 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
 		goto out;
 	}
 
+	/* XXX (TFO) - if it's a TFO socket and has been accepted, rather
+	 * than following the TCP_SYN_RECV case and closing the socket,
+	 * we ignore the ICMP error and keep trying like a fully established
+	 * socket. Is this the right thing to do?
+	 */
+	if (req && req->sk == NULL)
+		goto out;
+
 	switch (sk->sk_state) {
 		struct request_sock *req, **prev;
 	case TCP_LISTEN:
@@ -498,7 +512,8 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
 
 	case TCP_SYN_SENT:
 	case TCP_SYN_RECV:  /* Cannot happen.
-			       It can f.e. if SYNs crossed.
+			       It can f.e. if SYNs crossed,
+			       or Fast Open.
 			     */
 		if (!sock_owned_by_user(sk)) {
 			sk->sk_err = err;
@@ -809,8 +824,12 @@ static void tcp_v4_timewait_ack(struct sock *sk, struct sk_buff *skb)
 static void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
 				  struct request_sock *req)
 {
-	tcp_v4_send_ack(skb, tcp_rsk(req)->snt_isn + 1,
-			tcp_rsk(req)->rcv_isn + 1, req->rcv_wnd,
+	/* sk->sk_state == TCP_LISTEN -> for regular TCP_SYN_RECV
+	 * sk->sk_state == TCP_SYN_RECV -> for Fast Open.
+	 */
+	tcp_v4_send_ack(skb, (sk->sk_state == TCP_LISTEN) ?
+			tcp_rsk(req)->snt_isn + 1 : tcp_sk(sk)->snd_nxt,
+			tcp_rsk(req)->rcv_nxt, req->rcv_wnd,
 			req->ts_recent,
 			0,
 			tcp_md5_do_lookup(sk, (union tcp_md5_addr *)&ip_hdr(skb)->daddr,
@@ -1272,6 +1291,178 @@ static const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops = {
 };
 #endif
 
+static bool tcp_fastopen_check(struct sock *sk, struct sk_buff *skb,
+			       struct request_sock *req,
+			       struct tcp_fastopen_cookie *foc,
+			       struct tcp_fastopen_cookie *valid_foc)
+{
+	bool skip_cookie = false;
+	struct fastopen_queue *fastopenq;
+
+	if (likely(!fastopen_cookie_present(foc))) {
+		/* See include/net/tcp.h for the meaning of these knobs */
+		if ((sysctl_tcp_fastopen & TFO_SERVER_ALWAYS) ||
+		    ((sysctl_tcp_fastopen & TFO_SERVER_COOKIE_NOT_REQD) &&
+		    (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq + 1)))
+			skip_cookie = true; /* no cookie to validate */
+		else
+			return false;
+	}
+	fastopenq = inet_csk(sk)->icsk_accept_queue.fastopenq;
+	/* A FO option is present; bump the counter. */
+	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPFASTOPENPASSIVE);
+
+	/* Make sure the listener has enabled fastopen, and we don't
+	 * exceed the max # of pending TFO requests allowed before trying
+	 * to validating the cookie in order to avoid burning CPU cycles
+	 * unnecessarily.
+	 *
+	 * XXX (TFO) - The implication of checking the max_qlen before
+	 * processing a cookie request is that clients can't differentiate
+	 * between qlen overflow causing Fast Open to be disabled
+	 * temporarily vs a server not supporting Fast Open at all.
+	 */
+	if ((sysctl_tcp_fastopen & TFO_SERVER_ENABLE) == 0 ||
+	    fastopenq == NULL || fastopenq->max_qlen == 0)
+		return false;
+
+	if (fastopenq->qlen >= fastopenq->max_qlen) {
+		struct request_sock *req1;
+		spin_lock(&fastopenq->lock);
+		req1 = fastopenq->rskq_rst_head;
+		if ((req1 == NULL) || time_after(req1->expires, jiffies)) {
+			spin_unlock(&fastopenq->lock);
+			NET_INC_STATS_BH(sock_net(sk),
+			    LINUX_MIB_TCPFASTOPENLISTENOVERFLOW);
+			/* Avoid bumping LINUX_MIB_TCPFASTOPENPASSIVEFAIL*/
+			foc->len = -1;
+			return false;
+		}
+		fastopenq->rskq_rst_head = req1->dl_next;
+		fastopenq->qlen--;
+		spin_unlock(&fastopenq->lock);
+		reqsk_free(req1);
+	}
+	if (skip_cookie) {
+		tcp_rsk(req)->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
+		return true;
+	}
+	if (foc->len == TCP_FASTOPEN_COOKIE_SIZE) {
+		if ((sysctl_tcp_fastopen & TFO_SERVER_COOKIE_NOT_CHKED) == 0) {
+			tcp_fastopen_cookie_gen(ip_hdr(skb)->saddr, valid_foc);
+			if ((valid_foc->len != TCP_FASTOPEN_COOKIE_SIZE) ||
+			    memcmp(&foc->val[0], &valid_foc->val[0],
+			    TCP_FASTOPEN_COOKIE_SIZE) != 0)
+				return false;
+			valid_foc->len = -1;
+		}
+		/* Acknowledge the data received from the peer. */
+		tcp_rsk(req)->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
+		return true;
+	} else if (foc->len == 0) { /* Client requesting a cookie */
+		tcp_fastopen_cookie_gen(ip_hdr(skb)->saddr, valid_foc);
+		NET_INC_STATS_BH(sock_net(sk),
+		    LINUX_MIB_TCPFASTOPENCOOKIEREQD);
+	} else {
+		/* Client sent a cookie with wrong size. Treat it
+		 * the same as invalid and return a valid one.
+		 */
+		tcp_fastopen_cookie_gen(ip_hdr(skb)->saddr, valid_foc);
+	}
+	return false;
+}
+
+static int tcp_v4_conn_req_fastopen(struct sock *sk,
+				    struct sk_buff *skb,
+				    struct sk_buff *skb_synack,
+				    struct request_sock *req,
+				    struct request_values *rvp)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct request_sock_queue *queue = &inet_csk(sk)->icsk_accept_queue;
+	const struct inet_request_sock *ireq = inet_rsk(req);
+	struct sock *child;
+
+	req->retrans = 0;
+	req->sk = NULL;
+
+	child = inet_csk(sk)->icsk_af_ops->syn_recv_sock(sk, skb, req, NULL);
+	if (child == NULL) {
+		NET_INC_STATS_BH(sock_net(sk),
+				 LINUX_MIB_TCPFASTOPENPASSIVEFAIL);
+		kfree_skb(skb_synack);
+		return -1;
+	}
+	ip_build_and_send_pkt(skb_synack, sk, ireq->loc_addr,
+			ireq->rmt_addr, ireq->opt);
+	/* XXX (TFO) - is it ok to ignore error and continue? */
+
+	spin_lock(&queue->fastopenq->lock);
+	queue->fastopenq->qlen++;
+	spin_unlock(&queue->fastopenq->lock);
+
+	/* Initialize the child socket. Have to fix some values to take
+	 * into account the child is a Fast Open socket and is created
+	 * only out of the bits carried in the SYN packet.
+	 */
+	tp = tcp_sk(child);
+
+	tp->fastopen_rsk = req;
+	/* Do a hold on the listner sk so that if the listener is being
+	 * closed, the child that has been accepted can live on and still
+	 * access listen_lock.
+	 */
+	sock_hold(sk);
+	tcp_rsk(req)->listener = sk;
+
+	/* RFC1323: The window in SYN & SYN/ACK segments is never
+	 * scaled. So correct it appropriately.
+	 */
+	tp->snd_wnd = ntohs(tcp_hdr(skb)->window);
+
+	/* Activate the retrans timer so that SYNACK can be retransmitted.
+	 * The request socket is not added to the SYN table of the parent
+	 * because it's been added to the accept queue directly.
+	 */
+	inet_csk_reset_xmit_timer(child, ICSK_TIME_RETRANS,
+	    TCP_TIMEOUT_INIT, TCP_RTO_MAX);
+
+	/* Add the child socket directly into the accept queue */
+	inet_csk_reqsk_queue_add(sk, req, child);
+
+	/* Now finish processing the fastopen child socket. */
+	inet_csk(child)->icsk_af_ops->rebuild_header(child);
+	tcp_init_congestion_control(child);
+	tcp_mtup_init(child);
+	tcp_init_buffer_space(child);
+	tcp_init_metrics(child);
+
+	/* Queue the data carried in the SYN packet. We need to first
+	 * bump skb's refcnt because the caller will attempt to free it.
+	 *
+	 * XXX (TFO) - we honor a zero-payload TFO request for now.
+	 * (Any reason not to?)
+	 */
+	if (TCP_SKB_CB(skb)->end_seq == TCP_SKB_CB(skb)->seq + 1) {
+		/* Don't queue the skb if there is no payload in SYN.
+		 * XXX (TFO) - How about SYN+FIN?
+		 */
+		tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
+	} else {
+		skb = skb_get(skb);
+		skb_dst_drop(skb);
+		__skb_pull(skb, tcp_hdr(skb)->doff * 4);
+		skb_set_owner_r(skb, child);
+		__skb_queue_tail(&child->sk_receive_queue, skb);
+		tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
+	}
+	sk->sk_data_ready(sk, 0);
+	bh_unlock_sock(child);
+	sock_put(child);
+	WARN_ON(req->sk == NULL);
+	return 0;
+}
+
 int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_extend_values tmp_ext;
@@ -1285,6 +1476,12 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 	__be32 daddr = ip_hdr(skb)->daddr;
 	__u32 isn = TCP_SKB_CB(skb)->when;
 	bool want_cookie = false;
+	struct flowi4 fl4;
+	struct fastopen_queue *fastopenq;
+	struct tcp_fastopen_cookie foc = { .len = -1 };
+	struct tcp_fastopen_cookie valid_foc = { .len = -1 };
+	struct sk_buff *skb_synack;
+	int do_fastopen = 0;
 
 	/* Never answer to SYNs send to broadcast or multicast */
 	if (skb_rtable(skb)->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST))
@@ -1319,7 +1516,8 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 	tcp_clear_options(&tmp_opt);
 	tmp_opt.mss_clamp = TCP_MSS_DEFAULT;
 	tmp_opt.user_mss  = tp->rx_opt.user_mss;
-	tcp_parse_options(skb, &tmp_opt, &hash_location, 0, NULL);
+	tcp_parse_options(skb, &tmp_opt, &hash_location, 0,
+	    want_cookie ? NULL : &foc);
 
 	if (tmp_opt.cookie_plus > 0 &&
 	    tmp_opt.saw_tstamp &&
@@ -1377,8 +1575,6 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 		isn = cookie_v4_init_sequence(sk, skb, &req->mss);
 		req->cookie_ts = tmp_opt.tstamp_ok;
 	} else if (!isn) {
-		struct flowi4 fl4;
-
 		/* VJ's idea. We save last timestamp seen
 		 * from the destination in peer table, when entering
 		 * state TIME-WAIT, and check against it before
@@ -1419,14 +1615,52 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 	tcp_rsk(req)->snt_isn = isn;
 	tcp_rsk(req)->snt_synack = tcp_time_stamp;
 
-	if (tcp_v4_send_synack(sk, dst, req,
-			       (struct request_values *)&tmp_ext,
-			       skb_get_queue_mapping(skb),
-			       want_cookie) ||
-	    want_cookie)
+	if (dst == NULL) {
+		dst = inet_csk_route_req(sk, &fl4, req);
+		if (dst == NULL)
+			goto drop_and_free;
+	}
+	do_fastopen = tcp_fastopen_check(sk, skb, req, &foc, &valid_foc);
+
+	/* We don't call tcp_v4_send_synack() directly because we need
+	 * to make sure a child socket can be created successfully before
+	 * sending back synack!
+	 *
+	 * XXX (TFO) - Ideally one would simply call tcp_v4_send_synack()
+	 * (or better yet, call tcp_send_synack() in the child context
+	 * directly, but will have to fix bunch of other code first)
+	 * after syn_recv_sock() except one will need to first fix the
+	 * latter to remove its dependency on the current implementation
+	 * of tcp_v4_send_synack()->tcp_select_initial_window().
+	 */
+	skb_synack = tcp_make_synack(sk, dst, req,
+	    (struct request_values *)&tmp_ext,
+	    fastopen_cookie_present(&valid_foc) ? &valid_foc : NULL);
+
+	if (skb_synack) {
+		__tcp_v4_send_check(skb_synack, ireq->loc_addr, ireq->rmt_addr);
+		skb_set_queue_mapping(skb_synack, skb_get_queue_mapping(skb));
+	} else
+		goto drop_and_free;
+
+	if (likely(!do_fastopen)) {
+		int err;
+		err = ip_build_and_send_pkt(skb_synack, sk, ireq->loc_addr,
+		     ireq->rmt_addr, ireq->opt);
+		err = net_xmit_eval(err);
+		if (err || want_cookie)
+			goto drop_and_free;
+
+		tcp_rsk(req)->listener = NULL;
+		/* Add the request_sock to the SYN table */
+		inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
+		if (fastopen_cookie_present(&foc) && foc.len != 0)
+			NET_INC_STATS_BH(sock_net(sk),
+			    LINUX_MIB_TCPFASTOPENPASSIVEFAIL);
+	} else if (tcp_v4_conn_req_fastopen(sk, skb, skb_synack, req,
+	    (struct request_values *)&tmp_ext))
 		goto drop_and_free;
 
-	inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
 	return 0;
 
 drop_and_release:
@@ -1977,6 +2211,7 @@ void tcp_v4_destroy_sock(struct sock *sk)
 			 tcp_cookie_values_release);
 		tp->cookie_values = NULL;
 	}
+	BUG_ON(tp->fastopen_rsk != NULL);
 
 	/* If socket is aborted during connect operation */
 	tcp_free_fastopen_req(tp);
@@ -2425,6 +2660,7 @@ static void get_tcp4_sock(struct sock *sk, struct seq_file *f, int i, int *len)
 	const struct tcp_sock *tp = tcp_sk(sk);
 	const struct inet_connection_sock *icsk = inet_csk(sk);
 	const struct inet_sock *inet = inet_sk(sk);
+	struct fastopen_queue *fastopenq = icsk->icsk_accept_queue.fastopenq;
 	__be32 dest = inet->inet_daddr;
 	__be32 src = inet->inet_rcv_saddr;
 	__u16 destp = ntohs(inet->inet_dport);
@@ -2469,7 +2705,9 @@ static void get_tcp4_sock(struct sock *sk, struct seq_file *f, int i, int *len)
 		jiffies_to_clock_t(icsk->icsk_ack.ato),
 		(icsk->icsk_ack.quick << 1) | icsk->icsk_ack.pingpong,
 		tp->snd_cwnd,
-		tcp_in_initial_slowstart(tp) ? -1 : tp->snd_ssthresh,
+		sk->sk_state == TCP_LISTEN ?
+		    (fastopenq ? fastopenq->max_qlen : 0) :
+		    (tcp_in_initial_slowstart(tp) ? -1 : tp->snd_ssthresh),
 		len);
 }
 
-- 
1.7.7.3

^ permalink raw reply related

* [PATCH 1/2] ipv4: Improve the scaling of the ARP cache for multicast destinations.
From: Bob Gilligan @ 2012-08-31  0:55 UTC (permalink / raw)
  To: netdev


The ARP cache maintains entries for both unicast and multicast IPv4
next-hop destinations.  The MAC addresses for unicast destinations are
determined by running the ARP protocol, but those for multicast
destinations are determined by a simple direct mapping from the
destination IPv4 multicast address.

Currently, the ARP cache maintains one entry for each IPv4 multicast
destination for each interface that has members in that group.  On a
multicast router that is forwarding traffic for many groups via many
interfaces, the number of ARP cache entries for multicast destinations
can become large. It could be as many as: (number of interfaces) *
(number of groups).  Beside using a great deal of memory, these entries
consume space in the ARP cache that could otherwise be occupied by
unicast entries, makeing it more likely that the ARP cache will become
full.

The mapping from multicast IPv4 address to MAC address can just as
easily be done at the time a packet is to be sent.  With this change,
we maintain one ARP cache entry for each interface that has at least
one multicast group member.  All routes to IPv4 multicast destinations
via a particular interface use the same ARP cache entry.  This entry
does not store the MAC address to use.  Instead, packets for multicast
destinations go to a new output function that maps the destination
IPv4 multicast address into the MAC address and forms the MAC header.

Signed-off-by: Bob Gilligan <gilligan@aristanetworks.com>
---
 net/ipv4/arp.c   |   49 +++++++++++++++++++++++++++++++++++++++++++++----
 net/ipv4/route.c |   14 ++++++++++++--
 2 files changed, 57 insertions(+), 6 deletions(-)

Index: b/net/ipv4/arp.c
===================================================================
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -126,6 +126,7 @@ static int arp_constructor(struct neighb
 static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb);
 static void arp_error_report(struct neighbour *neigh, struct sk_buff *skb);
 static void parp_redo(struct sk_buff *skb);
+static int arp_multicast_output(struct neighbour *neigh, struct sk_buff *skb);
 
 static const struct neigh_ops arp_generic_ops = {
 	.family =		AF_INET,
@@ -157,6 +158,13 @@ static const struct neigh_ops arp_broken
 	.connected_output =	neigh_compat_output,
 };
 
+static const struct neigh_ops arp_multicast_ops = {
+	.family =		AF_INET,
+	.error_report =		arp_error_report,
+	.output =		arp_multicast_output,
+	.connected_output =	arp_multicast_output,
+};
+
 struct neigh_table arp_tbl = {
 	.family		= AF_INET,
 	.key_len	= 4,
@@ -217,6 +225,38 @@ static u32 arp_hash(const void *pkey,
 	return arp_hashfn(*(u32 *)pkey, dev, *hash_rnd);
 }
 
+
+/*
+ * Output function for IPv4 multicast destinations.  We map the
+ * next-hop address directly into the destination MAC addr here so
+ * that we don't have to store it in the ARP cache entry.  This allows
+ * routes for multiple multicast destinations to share a single ARP
+ * cache entry.
+ */
+static int arp_multicast_output(struct neighbour *neigh, struct sk_buff *skb)
+{
+	int err;
+	struct dst_entry *dst = skb_dst(skb);
+	struct rtable *rt = (struct rtable *)dst;
+	struct net_device *dev = neigh->dev;
+	unsigned char ha[ALIGN(MAX_ADDR_LEN, sizeof(unsigned long))];
+
+	__skb_pull(skb, skb_network_offset(skb));
+
+	arp_mc_map(rt->rt_gateway, ha, dev, 1);
+
+	err = dev_hard_header(skb, dev, ntohs(skb->protocol), ha, NULL,
+			      skb->len);
+	if (err >= 0)
+		err = dev_queue_xmit(skb);
+	else {
+		err = -EINVAL;
+		kfree_skb(skb);
+	}
+	return err;
+}
+
+
 static int arp_constructor(struct neighbour *neigh)
 {
 	__be32 addr = *(__be32 *)neigh->primary_key;
@@ -287,10 +327,9 @@ static int arp_constructor(struct neighb
 #endif
 		}
 #endif
-		if (neigh->type == RTN_MULTICAST) {
+		if (neigh->type == RTN_MULTICAST)
 			neigh->nud_state = NUD_NOARP;
-			arp_mc_map(addr, neigh->ha, dev, 1);
-		} else if (dev->flags & (IFF_NOARP | IFF_LOOPBACK)) {
+		else if (dev->flags & (IFF_NOARP | IFF_LOOPBACK)) {
 			neigh->nud_state = NUD_NOARP;
 			memcpy(neigh->ha, dev->dev_addr, dev->addr_len);
 		} else if (neigh->type == RTN_BROADCAST ||
@@ -299,7 +338,9 @@ static int arp_constructor(struct neighb
 			memcpy(neigh->ha, dev->broadcast, dev->addr_len);
 		}
 
-		if (dev->header_ops->cache)
+		if (neigh->type == RTN_MULTICAST)
+			neigh->ops = &arp_multicast_ops;
+		else if (dev->header_ops->cache)
 			neigh->ops = &arp_hh_ops;
 		else
 			neigh->ops = &arp_generic_ops;
Index: b/net/ipv4/route.c
===================================================================
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1114,6 +1114,7 @@ static int slow_chain_length(const struc
 static struct neighbour *ipv4_neigh_lookup(const struct dst_entry *dst, const void *daddr)
 {
 	static const __be32 inaddr_any = 0;
+	static const __be32 inaddr_unspec_group = htonl(INADDR_UNSPEC_GROUP);
 	struct net_device *dev = dst->dev;
 	const __be32 *pkey = daddr;
 	const struct rtable *rt;
@@ -1123,8 +1124,17 @@ static struct neighbour *ipv4_neigh_look
 
 	if (dev->flags & (IFF_LOOPBACK | IFF_POINTOPOINT))
 		pkey = &inaddr_any;
-	else if (rt->rt_gateway)
-		pkey = (const __be32 *) &rt->rt_gateway;
+	else {
+		if (rt->rt_gateway)
+			pkey = (const __be32 *) &rt->rt_gateway;
+		if (pkey && ipv4_is_multicast(*pkey))
+			/*
+			 * Map all multicast destinations to a single
+			 * address so tht they share a single ARP
+			 * cache entry per interface.
+			 */
+			pkey = &inaddr_unspec_group;
+	}
 
 	n = __ipv4_neigh_lookup(dev, *(__force u32 *)pkey);
 	if (n)

^ permalink raw reply

* [PATCH 2/2] ipv6: Improve the scaling of the IPv6 neighbor cache for multicast destinations.
From: Bob Gilligan @ 2012-08-31  0:55 UTC (permalink / raw)
  To: netdev


As with the IPv4 ARP cache, the IPv6 neighbor cache maintains entries
for both unicast and multicast IPv6 next-hop destinations.  The MAC
addresses for unicast destinations are determined by running the
Neighbor Discovery protocol, but those for multicast destinations are
determined by a simple direct mapping from the destination IPv6
multicast address.

Currently, the IPv6 neighbor cache maintains one entry for each IPv6
multicast destination for each interface that has members in that
group.  On a multicast router that is forwarding traffic for many
groups via many interfaces, the number of IPv6 neighbor cache entries
for multicast destinations can become large. It could be as many as:
(number of interfaces) * (number of groups).  Beside using a great
deal of memory, these entries consume space in the IPv6 neighbor cache
that could otherwise be occupied by unicast entries, makeing it more
likely that the IPv6 neighbor cache will become full.

The mapping from multicast IPv6 address to MAC address can just as
easily be done at the time a packet is to be sent.  With this change,
we maintain one IPv6 neighbor cache entry for each interface that has
at least one multicast group member.  All routes to IPv6 multicast
destinations via a particular interface use the same IPv6 neighbor
cache entry.  This entry does not store the MAC address to use.
Instead, packets for multicast destinations go to a new output
function that maps the destination IPv6 multicast address into the MAC
address and forms the MAC header.

Signed-off-by: Bob Gilligan <gilligan@aristanetworks.com>
---
 net/ipv6/ndisc.c |   48 ++++++++++++++++++++++++++++++++++++++++++++----
 net/ipv6/route.c |   13 +++++++++++--
 2 files changed, 55 insertions(+), 6 deletions(-)

Index: b/net/ipv6/ndisc.c
===================================================================
--- a/net/ipv6/ndisc.c
+++ b/net/ipv6/ndisc.c
@@ -90,6 +90,7 @@ static void ndisc_error_report(struct ne
 static int pndisc_constructor(struct pneigh_entry *n);
 static void pndisc_destructor(struct pneigh_entry *n);
 static void pndisc_redo(struct sk_buff *skb);
+static int ndisc_multicast_output(struct neighbour *neigh, struct sk_buff *skb);
 
 static const struct neigh_ops ndisc_generic_ops = {
 	.family =		AF_INET6,
@@ -114,6 +115,13 @@ static const struct neigh_ops ndisc_dire
 	.connected_output =	neigh_direct_output,
 };
 
+static const struct neigh_ops ndisc_multicast_ops = {
+	.family =		AF_INET6,
+	.error_report =		ndisc_error_report,
+	.output =		ndisc_multicast_output,
+	.connected_output =	ndisc_multicast_output,
+};
+
 struct neigh_table nd_tbl = {
 	.family =	AF_INET6,
 	.key_len =	sizeof(struct in6_addr),
@@ -342,6 +350,37 @@ static u32 ndisc_hash(const void *pkey,
 	return ndisc_hashfn(pkey, dev, hash_rnd);
 }
 
+/*
+ * Output function for IPv6 multicast destinations.  We map the
+ * nex-hop address directly into the destination MAC addr here so
+ * that we don't have to store it in the neighbor cache entry.  This allows
+ * routes for multiple multicast destinations to share a single neighbor
+ * cache entry.
+ */
+static int ndisc_multicast_output(struct neighbour *neigh, struct sk_buff *skb)
+{
+	int err;
+	struct dst_entry *dst = skb_dst(skb);
+	struct rt6_info *rt = (struct rt6_info *)dst;
+	struct net_device *dev = neigh->dev;
+	unsigned char ha[ALIGN(MAX_ADDR_LEN, sizeof(unsigned long))];
+
+	__skb_pull(skb, skb_network_offset(skb));
+
+	ndisc_mc_map(&rt->rt6i_gateway, ha, dev, 1);
+
+	err = dev_hard_header(skb, dev, ntohs(skb->protocol), ha, NULL,
+			      skb->len);
+	if (err >= 0)
+		err = dev_queue_xmit(skb);
+	else {
+		err = -EINVAL;
+		kfree_skb(skb);
+	}
+	return err;
+}
+
+
 static int ndisc_constructor(struct neighbour *neigh)
 {
 	struct in6_addr *addr = (struct in6_addr*)&neigh->primary_key;
@@ -365,10 +404,9 @@ static int ndisc_constructor(struct neig
 		neigh->ops = &ndisc_direct_ops;
 		neigh->output = neigh_direct_output;
 	} else {
-		if (is_multicast) {
+		if (is_multicast)
 			neigh->nud_state = NUD_NOARP;
-			ndisc_mc_map(addr, neigh->ha, dev, 1);
-		} else if (dev->flags&(IFF_NOARP|IFF_LOOPBACK)) {
+		else if (dev->flags&(IFF_NOARP|IFF_LOOPBACK)) {
 			neigh->nud_state = NUD_NOARP;
 			memcpy(neigh->ha, dev->dev_addr, dev->addr_len);
 			if (dev->flags&IFF_LOOPBACK)
@@ -377,7 +415,9 @@ static int ndisc_constructor(struct neig
 			neigh->nud_state = NUD_NOARP;
 			memcpy(neigh->ha, dev->broadcast, dev->addr_len);
 		}
-		if (dev->header_ops->cache)
+		if (is_multicast)
+			neigh->ops = &ndisc_multicast_ops;
+		else if (dev->header_ops->cache)
 			neigh->ops = &ndisc_hh_ops;
 		else
 			neigh->ops = &ndisc_generic_ops;
Index: b/net/ipv6/route.c
===================================================================
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -138,6 +138,8 @@ static struct neighbour *ip6_neigh_looku
 	struct neighbour *n;
 
 	daddr = choose_neigh_daddr(rt, daddr);
+	if (ipv6_addr_type(daddr) & IPV6_ADDR_MULTICAST)
+		daddr = &in6addr_linklocal_allnodes;
 	n = __ipv6_neigh_lookup(&nd_tbl, dst->dev, daddr);
 	if (n)
 		return n;
@@ -146,9 +148,15 @@ static struct neighbour *ip6_neigh_looku
 
 static int rt6_bind_neighbour(struct rt6_info *rt, struct net_device *dev)
 {
-	struct neighbour *n = __ipv6_neigh_lookup(&nd_tbl, dev, &rt->rt6i_gateway);
+	struct neighbour *n;
+	void *daddr = &rt->rt6i_gateway;
+	
+	if (ipv6_addr_type(daddr) & IPV6_ADDR_MULTICAST)
+		daddr = &in6addr_linklocal_allnodes;
+
+	n = __ipv6_neigh_lookup(&nd_tbl, dev, daddr);
 	if (!n) {
-		n = neigh_create(&nd_tbl, &rt->rt6i_gateway, dev);
+		n = neigh_create(&nd_tbl, daddr, dev);
 		if (IS_ERR(n))
 			return PTR_ERR(n);
 	}
@@ -1128,6 +1136,7 @@ struct dst_entry *icmp6_dst_alloc(struct
 		}
 	}
 
+	rt->rt6i_gateway = fl6->daddr;
 	rt->dst.flags |= DST_HOST;
 	rt->dst.output  = ip6_output;
 	dst_set_neighbour(&rt->dst, neigh);

^ permalink raw reply

* Re: [PATCH 1/2] ipv4: Improve the scaling of the ARP cache for multicast destinations.
From: David Miller @ 2012-08-31  1:06 UTC (permalink / raw)
  To: gilligan; +Cc: netdev
In-Reply-To: <50400B68.3060302@aristanetworks.com>

From: Bob Gilligan <gilligan@aristanetworks.com>
Date: Thu, 30 Aug 2012 17:55:04 -0700

> The mapping from multicast IPv4 address to MAC address can just as
> easily be done at the time a packet is to be sent.  With this change,
> we maintain one ARP cache entry for each interface that has at least
> one multicast group member.  All routes to IPv4 multicast destinations
> via a particular interface use the same ARP cache entry.  This entry
> does not store the MAC address to use.  Instead, packets for multicast
> destinations go to a new output function that maps the destination
> IPv4 multicast address into the MAC address and forms the MAC header.

Doing an ARP MC mapping on every packet is much more expensive than
doing a copy of the hard header cache.

I do not believe the memory consumption issue you use to justify this
change is a real issue.

If you are talking to that many multicast groups actively, you do want
that many neighbour cache entries.  This is not different from talking
to nearly every IP address on a local /8 subnet.  You'll have a huge
number of neighbour table entries in that case as well.

If your the actual steady state number of active groups being spoken
to is smaller, you can tune the neighbour cache thresholds to collect
old less used entries more quickly.

And this today is trivial, since routes no longer hold a reference
to neighbour entries.  Therefore any neighbour entry whatsoever can
be immediately reclaimed at any moment.

I'm not fond of these patches, and adding yet more special cases to
the neighbour layer, and therefore will not apply them.

^ permalink raw reply

* Re: [PATCH] userns: Add basic quota support v4
From: Dave Chinner @ 2012-08-31  1:17 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jan Kara, linux-kernel, netdev, linux-fsdevel, Serge E. Hallyn,
	David Miller, Steven Whitehouse, Mark Fasheh, Joel Becker,
	Ben Myers, Alex Elder, Dmitry Monakhov, Abhijith Das
In-Reply-To: <87r4qqnixd.fsf@xmission.com>

On Wed, Aug 29, 2012 at 02:31:26AM -0700, Eric W. Biederman wrote:
> 
> Dave thanks for taking the time to take a detailed look at this code.
> 
> Dave Chinner <david@fromorbit.com> writes:
> 
> > On Tue, Aug 28, 2012 at 12:09:56PM -0700, Eric W. Biederman wrote:
> >> 
> >> Add the data type struct kqid which holds the kernel internal form of
> >> the owning identifier of a quota.  struct kqid is a replacement for
> >> the implicit union of uid, gid and project stored in an unsigned int
> >> and the quota type field that is was used in the quota data
> >> structures.  Making the data type explicit allows the kuid_t and
> >> kgid_t type safety to propogate more thoroughly through the code,
> >> revealing more places where uid/gid conversions need be made.
> >> 
> >> Along with the data type struct kqid comes the helper functions
> >> qid_eq, qid_lt, from_kqid, from_kqid_munged, qid_valid, make_kqid,
> >
> > I think Jan's comment about from_kqid being named id_from_kgid is
> > better, though I also think it would read better as kqid_to_id().
> > ie:
> >
> > 	id = kqid_to_id(ns, qid);
> 
> kqid and qid are the same thing just in a different encoding.
> Emphasizing the quota identifier instead of the kernel vs user encoding
> change is paying attention to the wrong thing.

Not from a quota perspective. The only thing the quota code really
cares about is the quota identifier, not the encoding.

Fundamentally, from_kqid() doen't tell me anything about what I'm
getting from the kqid. There's code all over the place that used the
"<format>_to_<other format>" convention because it's obvious what is
being converted from/to. e.g. cpu_to_beXX, compat_to_ptr,
dma_to_phys, pfn_to_page, etc.  Best practises say "follow existing
conventions".

> Using make_kqid and from_kqid follows the exact same conventions as I have
> established for kuids and kgids.  So if you learn one you have learned
> them all.

For those of us that have to look at it once every few months,
following the same conventions as all the other code in the kernel
(i.e. kqid_to_id()) tells me everything I need to know without
having to go through the process of looking up the unusual
from_kqid() function and then from_kuid() to find out what it is
actually doing....

> >> make_kqid_invalid, make_kqid_uid, make_kqid_gid.
> >
> > and these named something like uid_to_kqid()
> 
> The last two are indeed weird, and definitely not the common case,
> since there is no precedent I can almost see doing something different
> but I don't see a good case for a different name.

There's plenty of precendence in other code that converts format.
A very common convention that is used everywhere is DEFINE_...().
That would be make the code easier to grasp than "make...".

> >> Change struct dquot dq_id to a struct kqid and remove the now
> >> unecessary dq_type.
> >> 
> >> Update the signature of dqget, quota_send_warning, dquot_get_dqblk,
> >> and dquot_set_dqblk to use struct kqid.
> >> 
> >> Make minimal changes to ext3, ext4, gfs2, ocfs2, and xfs to deal with
> >> the change in quota structures and signatures.  The ocfs2 changes are
> >> larger than most because of the extensive tracing throughout the ocfs2
> >> quota code that prints out dq_id.
> >
> > How did you test that this all works?
> 
> By making it a compile error if you get a conversion wrong and making it
> a rule not to make any logic changes.
>
> That combined with code review
> and running the code a bit to make certain I did not horribly mess up.

But no actual regression testing. You're messing with code that I
will have to triage when it goes wrong for a user, so IMO your code
has to pass the same bar as the code I write has to pass for review
- please regression test your code and write new regression tests
for new functionality.

> > e.g. run xfstests -g quota on
> > each of those filesystems and check for no regressions? And if you
> > wrote any tests, can you convert them to be part of xfstests so that
> > namespace aware quotas get tested regularly?
> 
> I have not written any tests, and running the xfstests in a namespace
> should roughly be a matter of "unshare -U xfstest -g quota"  It isn't
> quite that easy because  /proc/self/uid_map and /proc/self/gid_map need

Asking people to run the entire regression test suite differently
and with special setup magic won't get the code tested regularly.
Writing a new, self contained test that exercises quota in multiple
namespaces simultaneously is what is needed - that way people who
don't even know that namespaces exist will be regression testing
it...

> >> --- a/include/linux/quota.h
> >> +++ b/include/linux/quota.h
> >> @@ -181,10 +181,161 @@ enum {
> >>  #include <linux/dqblk_v2.h>
> >>  
> >>  #include <linux/atomic.h>
> >> +#include <linux/uidgid.h>
> >>  
> >>  typedef __kernel_uid32_t qid_t; /* Type in which we store ids in memory */
> >>  typedef long long qsize_t;	/* Type in which we store sizes */
> >
> > From fs/xfs/xfs_types.h:
> >
> > typedef __uint32_t              prid_t;         /* project ID */
> >
> > Perhaps it would be better to have an official kprid_t definition
> > here, i.e:
> 
> We can always improve.  For this patch I don't see it making a
> useful difference.

If you are touching the code and introducing new types and concepts
at a global level, then this is precisely when you should be getting
the definitions of those new types correct.

> prid_t isn't exported to non xfs code.  The project id is already
> stored in an unsigned int or a qid_t.

You missed my point - that you are now making a distinction
outside XFS about project quotas. i.e. effectively making it
a first-class quota citizen. That means it needs an equivalent
global type definition and not rely implictly on some subsystem
getting the type conversion correct.

Indeed, isn't that the entire point of your patch set - using the
correct global types throughout the stack to avoid type conversion
errors?

> >> +struct kqid {			/* Type in which we store the quota identifier */
> >> +	union {
> >> +		kuid_t uid;
> >> +		kgid_t gid;
> >> +		qid_t prj;
> >
> > 		kprid_t prid;
> >
> 
> Hmm.  The naming here is interesting.  No one calls the project id prj.
> So I have an avoidable inconsistency here.  xfs most commonly uses
> projid and occassionally calls it prid.

Wrong way around. XFS uses projid in one place - xfs_set_projid() -
and prid in 30 others. The type is xfs_prid_t as well. prid is much
prefered to projid...

> So I am inclined to rename this field projid, but I don't know if it is
> worth respinning the patch for something so trivial.

I'd like to think you are joking, given that the patch series is all
about using consistent, verifiable identifiers... :/

> >> +	};
> >> +	int type; /* USRQUOTA (uid) or GRPQUOTA (gid) or XQM_PRJQUOTA (prj) */
> >> +};
> >> +
> >> +static inline bool qid_eq(struct kqid left, struct kqid right)
> >> +{
> >> +	if (left.type != right.type)
> >> +		return false;
> >> +	switch(left.type) {
> >> +	case USRQUOTA:
> >> +		return uid_eq(left.uid, right.uid);
> >> +	case GRPQUOTA:
> >> +		return gid_eq(left.gid, right.gid);
> >> +	case XQM_PRJQUOTA:
> >> +		return left.prj == right.prj;
> >> +	default:
> >> +		BUG();
> >
> > BUG()? Seriously? The most this justifies is a WARN_ON_ONCE() to
> > indicate a potential programming error, not bringing down the entire
> > machine.
> 
> Dead serious.  Any other type value is impossible and unsupported.

So make the compiler check it (use an enum, or perhaps
BUILD_BUG_ON()).  Don't bring the machine down because you haven't
thought about error handling properly.

> BUG is not panic.  BUG is an oops. BUG won't bring down the machine
> unless the sysadmin wants it too.

panic, oops, it's the same thing to most enterprise distros as they
are configured to dump the kernel or reboot when an oops occurs.

Even if they are not configured this way, any oops that occurs with
a filesystem lock held (like these will) will eventually result in a
filesystem deadlock and subsequent fatal application and/or system
hang. This most definitely is not a condition that should be
triggering a fatal error.

> Beyond that I am not interesting in supporting exploitable abuse
> of this type.

/me jumps up and down on his seat excitedly

Oh, Oh, it's Security Theatre time! Can I have a go!? Please?

If someone can cause an unsupported quota type to appear
in this field, using BUG() creates a wonderful DOS exploit...

:)

> >> +static inline bool qid_lt(struct kqid left, struct kqid right)
> >> +{
> >> +	if (left.type < right.type)
> >> +		return true;
> >> +	if (left.type > right.type)
> >> +		return false;
> >> +	switch (left.type) {
> >> +	case USRQUOTA:
> >> +		return uid_lt(left.uid, right.uid);
> >> +	case GRPQUOTA:
> >> +		return gid_lt(left.gid, right.gid);
> >> +	case XQM_PRJQUOTA:
> >> +		return left.prj < right.prj;
> >> +	default:
> >> +		BUG();
> >> +	}
> >> +}
> >
> > What is this function used for? it's not referenced at all by the
> > patch, and there's no documentation/comments explaining why it
> > exists or how it is intended to be used....
> 
> This function is introduced early, but it is needed for gfs2 as
> performs a sort of quota identifiers.

Introduce it when it is used so the correctness can be determined in
the context that uses it.

> >> +static inline qid_t from_kqid(struct user_namespace *user_ns, struct kqid qid)
> >> +{
> >> +	switch (qid.type) {
> >> +	case USRQUOTA:
> >> +		return from_kuid(user_ns, qid.uid);
> >> +	case GRPQUOTA:
> >> +		return from_kgid(user_ns, qid.gid);
> >> +	case XQM_PRJQUOTA:
> >> +		return (user_ns == &init_user_ns) ? qid.prj : -1;
> >> +	default:
> >> +		BUG();
> >> +	}
> >> +}
> >
> > Oh, this can return an error. That's only checked in a coupl eof
> > places this function is called. it needs tobe checked everywhere,
> > otherwise we now have the possibility of quota usage being accounted
> > to uid/gid/prid 0xffffffff when namespace matches are not found.
> 
> No this is not an error condition.  Returning -1 is the mapping that is
> used when there is not a mapping entry.
>
> Depending on the circumstances not having a mapping can be an error,
> but it can also be a don't care condition.

Which the XFS code would then use as the quota ID for accounting.
That's wrong. So, not having a mapping in this case is an error than
needs to be handled. You didn't add any error handling.

> All in kernel values are defined as having a mapping into the initial
> user namespace.
> 
> Looking at my tree the only calls of from_kqid not mapping the id
> into the initial user namespace are from_kqid_munged and they report
> to userspace.  from_kqid_munged returns fs_overflowuid or fs_overflowgid
> in case of a mapping failure.
>
> projects ids can only be accessed if capable(CAP_SYS_ADMIN) which
> you can only have in the initial user namespace so for now there will
> alwasy be a mapping for project ids.

from_kuid and from_kgid can return -1 when a mapping fails, so it is
not just project quotas that are at issue here. Returning -1 as a
quota id is almost always going to be an invalid ID. If you can't
account quota correctly, then the operation should not be allowed to
proceed. IOWs, no mapping is an error from a quota perspective.

And FWIW, explanations like this need to be in the code....

> >> +static inline qid_t from_kqid_munged(struct user_namespace *user_ns,
> >> +				   struct kqid qid)
> >
> > What does munging do to the return value? how is it different to
> > from_kqid()? Document your API....
> 
> This bit certainly certainly could use a bit more explanation.

Yes, the lack of comments in your code is a bit disturbing.

> Mapping of uids and gids from one domain to another is not new in linux.
> It originates with the transition from 16bit identifiers to 32bit
> identifiers.  In most places when there is a 32bit identifier that can
> not be represented we return a fs_overflowid aka (u16)-2 aka nobody.
> 
> So in general when we want to pass a value to userspace from_kqid_munged
> is called, and if there is a mapping failure we report the value that
> userspace set fs_overflowuid or fs_overflowgid to.  For project ids it
> which are restricted to the initial user namespace no mapping failures
> can occur.

Please add this as a comment to the code.

> >> +static inline struct kqid make_kqid_uid(kuid_t uid)
> >> +{
> >> +	struct kqid kqid = {
> >> +		.type = USRQUOTA,
> >> +		.uid  = uid,
> >> +	};
> >> +	return kqid;
> >> +}
> >
> > Isn't this sort of construct frowned upon? i.e. returning a
> > structure out of scope? It may be inline code and hence work, but
> > this strikes me as a landmine waiting for someone to tread on....
> 
> Not at all because I am returning the structure by value.
> 
> Return a structure by reference is the case that doesn't work.

Right, but you completely missed my point. It's unconventional,
tricky code - returning a structure by value is not something that
is commonly used. If I have to stop and think hard about why 5 lines
of apparently simple code is really OK, then IMO you are doing
something wrong. I might only look at this once every 6 months - I
don't want to have to spend time working this tricky stuff out
every time I look at it...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply

* Re: [PATCH 0/5] dev_<level> and dynamic_debug cleanups
From: Jim Cromie @ 2012-08-31  3:48 UTC (permalink / raw)
  To: Joe Perches
  Cc: Andrew Morton, netdev, Greg Kroah-Hartman, David S. Miller,
	Jason Baron, Kay Sievers, linux-kernel
In-Reply-To: <CAJfuBxyQc-XeNHxw0_XDs0JuCrNArNzcgsmrLya4ncDvSojZxQ@mail.gmail.com>

On Thu, Aug 30, 2012 at 11:43 AM, Jim Cromie <jim.cromie@gmail.com> wrote:
> On Sun, Aug 26, 2012 at 5:25 AM, Joe Perches <joe@perches.com> wrote:
>> The recent commit to fix dynamic_debug was a bit unclean.
>> Neaten the style for dynamic_debug.
>> Reduce the stack use of message logging that uses netdev_printk
>> Add utility functions dev_printk_emit and dev_vprintk_emit for /dev/kmsg.
>>
>> Joe Perches (5):
>>   dev_dbg/dynamic_debug: Update to use printk_emit, optimize stack
>>   netdev_printk/dynamic_netdev_dbg: Directly call printk_emit
>>   netdev_printk/netif_printk: Remove a superfluous logging colon
>>   dev: Add dev_vprintk_emit and dev_printk_emit
>>   device and dynamic_debug: Use dev_vprintk_emit and dev_printk_emit
>>
>
> Ive tested this on 2 builds differing only by DYNAMIC_DEBUG
> It works for me on x86-64
>
> However, I just booted a non-dyndbg build on x86-32, and got this.
>

Ok, transient error, went away with a clean build.

tested-by: Jim Cromie <jim.cromie@gmail.com>

thanks

^ permalink raw reply

* [net-next 1/8] e1000e: use correct type for read of 32-bit register
From: Jeff Kirsher @ 2012-08-31  5:16 UTC (permalink / raw)
  To: davem; +Cc: Bruce Allan, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1346390174-30449-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Bruce Allan <bruce.w.allan@intel.com>

The POEMB register is 32 bits, not 16.

Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/e1000e/82571.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/82571.c b/drivers/net/ethernet/intel/e1000e/82571.c
index 080c890..c985864 100644
--- a/drivers/net/ethernet/intel/e1000e/82571.c
+++ b/drivers/net/ethernet/intel/e1000e/82571.c
@@ -653,7 +653,7 @@ static void e1000_put_hw_semaphore_82574(struct e1000_hw *hw)
  **/
 static s32 e1000_set_d0_lplu_state_82574(struct e1000_hw *hw, bool active)
 {
-	u16 data = er32(POEMB);
+	u32 data = er32(POEMB);
 
 	if (active)
 		data |= E1000_PHY_CTRL_D0A_LPLU;
@@ -677,7 +677,7 @@ static s32 e1000_set_d0_lplu_state_82574(struct e1000_hw *hw, bool active)
  **/
 static s32 e1000_set_d3_lplu_state_82574(struct e1000_hw *hw, bool active)
 {
-	u16 data = er32(POEMB);
+	u32 data = er32(POEMB);
 
 	if (!active) {
 		data &= ~E1000_PHY_CTRL_NOND0A_LPLU;
-- 
1.7.11.4

^ permalink raw reply related

* [net-next 0/8][pull request] Intel Wired LAN Driver Updates
From: Jeff Kirsher @ 2012-08-31  5:16 UTC (permalink / raw)
  To: davem; +Cc: Jeff Kirsher, netdev, gospo, sassmann

This series contains updates to e1000e and ixgbevf.

The following are changes since commit 761743ebc92df72053e736fce953a5d2e90099d5:
  net/fsl_pq_mdio: add support for the Fman 1G MDIO controller
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next master

Alexander Duyck (2):
  ixgbevf: Add suspend and resume support to the VF
  ixgbevf: Cleanup handling of configuration for jumbo frames

Bruce Allan (6):
  e1000e: use correct type for read of 32-bit register
  e1000e: cleanup strict checkpatch MEMORY_BARRIER checks
  e1000e: cleanup strict checkpatch check
  e1000e: cleanup - remove inapplicable comment
  e1000e: cleanup - remove unnecessary variable
  e1000e: update driver version number

 drivers/net/ethernet/intel/e1000e/82571.c         |   4 +-
 drivers/net/ethernet/intel/e1000e/ethtool.c       |   3 +-
 drivers/net/ethernet/intel/e1000e/netdev.c        |  19 ++-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h      |   4 +-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 160 +++++++++++++++++-----
 drivers/net/ethernet/intel/ixgbevf/vf.c           |  14 ++
 drivers/net/ethernet/intel/ixgbevf/vf.h           |   1 +
 7 files changed, 164 insertions(+), 41 deletions(-)

-- 
1.7.11.4

^ permalink raw reply

* [net-next 2/8] e1000e: cleanup strict checkpatch MEMORY_BARRIER checks
From: Jeff Kirsher @ 2012-08-31  5:16 UTC (permalink / raw)
  To: davem; +Cc: Bruce Allan, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1346390174-30449-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Bruce Allan <bruce.w.allan@intel.com>

Add comments to memory barriers per strict checkpatch.

Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 46c3b1f..fb6c813 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -3746,6 +3746,10 @@ static irqreturn_t e1000_intr_msi_test(int irq, void *data)
 	e_dbg("icr is %08X\n", icr);
 	if (icr & E1000_ICR_RXSEQ) {
 		adapter->flags &= ~FLAG_MSI_TEST_FAILED;
+		/*
+		 * Force memory writes to complete before acknowledging the
+		 * interrupt is handled.
+		 */
 		wmb();
 	}
 
@@ -3787,6 +3791,10 @@ static int e1000_test_msi_interrupt(struct e1000_adapter *adapter)
 		goto msi_test_failed;
 	}
 
+	/*
+	 * Force memory writes to complete before enabling and firing an
+	 * interrupt.
+	 */
 	wmb();
 
 	e1000_irq_enable(adapter);
@@ -3798,7 +3806,7 @@ static int e1000_test_msi_interrupt(struct e1000_adapter *adapter)
 
 	e1000_irq_disable(adapter);
 
-	rmb();
+	rmb();			/* read flags after interrupt has been fired */
 
 	if (adapter->flags & FLAG_MSI_TEST_FAILED) {
 		adapter->int_mode = E1000E_INT_MODE_LEGACY;
-- 
1.7.11.4

^ permalink raw reply related

* [net-next 3/8] e1000e: cleanup strict checkpatch check
From: Jeff Kirsher @ 2012-08-31  5:16 UTC (permalink / raw)
  To: davem; +Cc: Bruce Allan, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1346390174-30449-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Bruce Allan <bruce.w.allan@intel.com>

CHECK: multiple assignments should be avoided

Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/e1000e/ethtool.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/e1000e/ethtool.c b/drivers/net/ethernet/intel/e1000e/ethtool.c
index 2e76f06..c11ac27 100644
--- a/drivers/net/ethernet/intel/e1000e/ethtool.c
+++ b/drivers/net/ethernet/intel/e1000e/ethtool.c
@@ -1942,7 +1942,8 @@ static int e1000_set_coalesce(struct net_device *netdev,
 		return -EINVAL;
 
 	if (ec->rx_coalesce_usecs == 4) {
-		adapter->itr = adapter->itr_setting = 4;
+		adapter->itr_setting = 4;
+		adapter->itr = adapter->itr_setting;
 	} else if (ec->rx_coalesce_usecs <= 3) {
 		adapter->itr = 20000;
 		adapter->itr_setting = ec->rx_coalesce_usecs;
-- 
1.7.11.4

^ permalink raw reply related

* [net-next 4/8] e1000e: cleanup - remove inapplicable comment
From: Jeff Kirsher @ 2012-08-31  5:16 UTC (permalink / raw)
  To: davem; +Cc: Bruce Allan, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1346390174-30449-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Bruce Allan <bruce.w.allan@intel.com>

Early Receive has been disabled in the driver so this comment is no longer
applicable.

Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index fb6c813..c35d354 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -3446,7 +3446,7 @@ void e1000e_reset(struct e1000_adapter *adapter)
 
 			/*
 			 * if short on Rx space, Rx wins and must trump Tx
-			 * adjustment or use Early Receive if available
+			 * adjustment
 			 */
 			if (pba < min_rx_space)
 				pba = min_rx_space;
-- 
1.7.11.4

^ permalink raw reply related

* [net-next 5/8] e1000e: cleanup - remove unnecessary variable
From: Jeff Kirsher @ 2012-08-31  5:16 UTC (permalink / raw)
  To: davem; +Cc: Bruce Allan, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1346390174-30449-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Bruce Allan <bruce.w.allan@intel.com>

Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index c35d354..c0815ce 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -4669,7 +4669,7 @@ static int e1000_tso(struct e1000_ring *tx_ring, struct sk_buff *skb)
 	struct e1000_buffer *buffer_info;
 	unsigned int i;
 	u32 cmd_length = 0;
-	u16 ipcse = 0, tucse, mss;
+	u16 ipcse = 0, mss;
 	u8 ipcss, ipcso, tucss, tucso, hdr_len;
 
 	if (!skb_is_gso(skb))
@@ -4703,7 +4703,6 @@ static int e1000_tso(struct e1000_ring *tx_ring, struct sk_buff *skb)
 	ipcso = (void *)&(ip_hdr(skb)->check) - (void *)skb->data;
 	tucss = skb_transport_offset(skb);
 	tucso = (void *)&(tcp_hdr(skb)->check) - (void *)skb->data;
-	tucse = 0;
 
 	cmd_length |= (E1000_TXD_CMD_DEXT | E1000_TXD_CMD_TSE |
 	               E1000_TXD_CMD_TCP | (skb->len - (hdr_len)));
@@ -4717,7 +4716,7 @@ static int e1000_tso(struct e1000_ring *tx_ring, struct sk_buff *skb)
 	context_desc->lower_setup.ip_fields.ipcse  = cpu_to_le16(ipcse);
 	context_desc->upper_setup.tcp_fields.tucss = tucss;
 	context_desc->upper_setup.tcp_fields.tucso = tucso;
-	context_desc->upper_setup.tcp_fields.tucse = cpu_to_le16(tucse);
+	context_desc->upper_setup.tcp_fields.tucse = 0;
 	context_desc->tcp_seg_setup.fields.mss     = cpu_to_le16(mss);
 	context_desc->tcp_seg_setup.fields.hdr_len = hdr_len;
 	context_desc->cmd_and_length = cpu_to_le32(cmd_length);
-- 
1.7.11.4

^ permalink raw reply related

* [net-next 6/8] e1000e: update driver version number
From: Jeff Kirsher @ 2012-08-31  5:16 UTC (permalink / raw)
  To: davem; +Cc: Bruce Allan, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1346390174-30449-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Bruce Allan <bruce.w.allan@intel.com>

Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index c0815ce..095a6be 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -56,7 +56,7 @@
 
 #define DRV_EXTRAVERSION "-k"
 
-#define DRV_VERSION "2.0.0" DRV_EXTRAVERSION
+#define DRV_VERSION "2.1.4" DRV_EXTRAVERSION
 char e1000e_driver_name[] = "e1000e";
 const char e1000e_driver_version[] = DRV_VERSION;
 
-- 
1.7.11.4

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox