Netdev List

Netdev List
 help / color / mirror / Atom feed

* Question about IPV6 forwarding and proxy_ndp
From: Maciej Żenczykowski @ 2009-10-18  7:30 UTC (permalink / raw)
  To: Linux Networking, YOSHIFUJI Hideaki

I would like to have a machine:

* do IPv6 forwarding
- thus "echo 1 > /proc/sys/net/ipv6/conf/all/forwarding"

* accept router advertisements and default route from eth0
- thus "echo 0 > /proc/sys/net/ipv6/conf/eth0/forwarding"

* do proxy NDP on eth0 for a specific v6 address
- thus "echo 1 > /proc/sys/net/ipv6/conf/eth0/proxy_ndp"
- and "ip -6 neigh add proxy ${PROXIED_V6_IP} dev eth0"

Problem is, this doesn't work (no ndp responses for the proxied v6 ip)...
While, if I "echo 1 > /proc/sys/net/ipv6/conf/eth0/forwarding" than
proxy ndp starts working, but I lose my default route...

I think the problem is at:

http://lxr.linux.no/linux+v2.6.31/net/ipv6/ndisc.c#L832

 831                if (ipv6_chk_acast_addr(net, dev, &msg->target) ||
 832                    (idev->cnf.forwarding &&
 833                     (net->ipv6.devconf_all->proxy_ndp ||
idev->cnf.proxy_ndp) &&
 834                     (is_router = pndisc_is_router(&msg->target,
dev)) >= 0)) {
 835                        if (!(NEIGH_CB(skb)->flags & LOCALLY_ENQUEUED) &&
 836                            skb->pkt_type != PACKET_HOST &&
 837                            inc != 0 &&
 838                            idev->nd_parms->proxy_delay != 0) {

notice that we require idev->cnf.forwarding to be true.

Should this perhaps be changed to
    (net->ipv6.devconf_all->forwarding || idev->cnf.forwarding)

or even just
    net->ipv6.devconf_all->forwarding
?

It's not clear to me what the reasoning behind that if statement is...

- Maciej

^ permalink raw reply

* Re: Question about IPV6 forwarding and proxy_ndp
From: Maciej Żenczykowski @ 2009-10-18  7:37 UTC (permalink / raw)
  To: Linux Networking, YOSHIFUJI Hideaki
In-Reply-To: <55a4f86e0910180030s52484bffs4ca42eb1ff4d5131@mail.gmail.com>

Just a quick follow up.

The current logic for when to do proxy ndp seems to be:

network_device_is_forwarding && (global_proxy_ndp_is_on ||
network_device_is configured_for_proxy_ndp)

Maybe this should be:

(network_device_is_forwarding && global_proxy_ndp_is_on) ||
network_device_is configured_for_proxy_ndp

After all, if the admin has explicitly set proxy ndp on this specific
device, then maybe he knows best?

Alternatively it could also just be:

global_proxy_ndp_is_on || network_device_is configured_for_proxy_ndp

and not bother with checking forwarding at all.


[It should be pointed out that ipv6/conf/all/forwarding is a very
different beast than ipv6/conf/<device>/forwarding, the first globally
turns on ipv6 forwarding, the second switches a device between 'I am a
host' and 'I am a router' modes of operation]

^ permalink raw reply

* Re: [PATCH] iputils: ping by mark
From: jamal @ 2009-10-18 11:37 UTC (permalink / raw)
  To: Maciej Żenczykowski; +Cc: Rob.Townley, YOSHIFUJI Hideaki, netdev
In-Reply-To: <55a4f86e0910171846w45dda0e9w86f570b087c1543d@mail.gmail.com>

On Sat, 2009-10-17 at 18:46 -0700, Maciej Żenczykowski wrote:
> Try it with a udp packet or a tcp connection - so_mark and ip rule
> fwmark only work for raw sockets (and maybe some other special cases),
> unless you're lucky and the ip(6)tables mangle module just happens to
> rerun the routing decision (because it mangles the packet in some
> other way...).

It works fine with tcp and udp and to emphasize: i have never seen it
broken.
Above you mention iptables - I dont use it and that maybe the missing
part in our discussion.
I should note though that rpf is broken with policy routing;-> Now that
you got me going on this, I will post a patch. 

cheers,
jamal

^ permalink raw reply

* [PATCH] net: Fix RPF to work with policy routing
From: jamal @ 2009-10-18 12:12 UTC (permalink / raw)
  To: netdev, David Miller; +Cc: Atis Elsts, eric.dumazet, Maciej Żenczykowski

[-- Attachment #1: Type: text/plain, Size: 129 bytes --]


policy routing never worked with mark.

I tested this with ping and the skbedit patch i posted a few days back.

cheers,
jamal


[-- Attachment #2: policy-mark-rpf --]
[-- Type: text/plain, Size: 3076 bytes --]

commit f7c6fd2465d8e6f4f89c5d1262da10b4a6d499d0
Author: Jamal Hadi Salim <hadi@cyberus.ca>
Date:   Sun Oct 18 08:04:51 2009 -0400

    [PATCH] net: Fix RPF to work with policy routing
    Policy routing is not looked up by mark on reverse path filtering.
    This fixes it.
    
    Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index ef91fe9..4d22fab 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -210,7 +210,8 @@ extern struct fib_table *fib_get_table(struct net *net, u32 id);
 extern const struct nla_policy rtm_ipv4_policy[];
 extern void		ip_fib_init(void);
 extern int fib_validate_source(__be32 src, __be32 dst, u8 tos, int oif,
-			       struct net_device *dev, __be32 *spec_dst, u32 *itag);
+			       struct net_device *dev, __be32 *spec_dst,
+			       u32 *itag, u32 mark);
 extern void fib_select_default(struct net *net, const struct flowi *flp,
 			       struct fib_result *res);
 
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index e2f9505..aa00398 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -229,14 +229,17 @@ unsigned int inet_dev_addr_type(struct net *net, const struct net_device *dev,
  */
 
 int fib_validate_source(__be32 src, __be32 dst, u8 tos, int oif,
-			struct net_device *dev, __be32 *spec_dst, u32 *itag)
+			struct net_device *dev, __be32 *spec_dst,
+			u32 *itag, u32 mark)
 {
 	struct in_device *in_dev;
 	struct flowi fl = { .nl_u = { .ip4_u =
 				      { .daddr = src,
 					.saddr = dst,
 					.tos = tos } },
+			    .mark = mark,
 			    .iif = oif };
+
 	struct fib_result res;
 	int no_addr, rpf;
 	int ret;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 278f46f..9744fc5 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1852,7 +1852,7 @@ static int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 			goto e_inval;
 		spec_dst = inet_select_addr(dev, 0, RT_SCOPE_LINK);
 	} else if (fib_validate_source(saddr, 0, tos, 0,
-					dev, &spec_dst, &itag) < 0)
+					dev, &spec_dst, &itag, 0) < 0)
 		goto e_inval;
 
 	rth = dst_alloc(&ipv4_dst_ops);
@@ -1965,7 +1965,7 @@ static int __mkroute_input(struct sk_buff *skb,
 
 
 	err = fib_validate_source(saddr, daddr, tos, FIB_RES_OIF(*res),
-				  in_dev->dev, &spec_dst, &itag);
+				  in_dev->dev, &spec_dst, &itag, skb->mark);
 	if (err < 0) {
 		ip_handle_martian_source(in_dev->dev, in_dev, skb, daddr,
 					 saddr);
@@ -2139,7 +2139,7 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 		int result;
 		result = fib_validate_source(saddr, daddr, tos,
 					     net->loopback_dev->ifindex,
-					     dev, &spec_dst, &itag);
+					     dev, &spec_dst, &itag, skb->mark);
 		if (result < 0)
 			goto martian_source;
 		if (result)
@@ -2168,7 +2168,7 @@ brd_input:
 		spec_dst = inet_select_addr(dev, 0, RT_SCOPE_LINK);
 	else {
 		err = fib_validate_source(saddr, 0, tos, 0, dev, &spec_dst,
-					  &itag);
+					  &itag, skb->mark);
 		if (err < 0)
 			goto martian_source;
 		if (err)

^ permalink raw reply related

* Re: [PATCH] net: Fix RPF to work with policy routing
From: jamal @ 2009-10-18 12:13 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Atis Elsts, eric.dumazet, Maciej Żenczykowski
In-Reply-To: <1255867954.4815.25.camel@dogo.mojatatu.com>

On Sun, 2009-10-18 at 08:12 -0400, jamal wrote:
> policy routing never worked with mark.

I meant policy routing, mark and RPF never worked together ;->

cheers,
jamal


^ permalink raw reply

* [PATCH][RFC]: ingress socket filter by mark
From: jamal @ 2009-10-18 12:42 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Atis Elsts, eric.dumazet, Maciej Żenczykowski

[-- Attachment #1: Type: text/plain, Size: 960 bytes --]

Maciej forced me to dig into this ;->

at the socket level if a packet arrives with a different mark than
what we bind to, drop it. I have tested this patch and it drops a packet
with mismatching mark.

There are several approaches - and i think the patch suggestion i have
made here maybe too strict. I assume that if someone binds to a mark,
they want to not only send packets with that mark but receive
only if that mark is set. 
A looser check would be something along the line accept as well if mark
is not set i.e
if (sk->sk_mark && skb->mark && sk->sk_mark != skb->mark)

Alternatively i could add one bit in the socket flags and have it so
that check is made only if app has been explicit:
if (sock_flag(sk, SOCK_CHK_SOMARK) && sk->sk_mark != skb->mark) drop

Another approach  is to set sock filter from app. I dont like this
approach because it will be the least usable from app level and would be
the least simple from kernel level.

cheers,
jamal

[-- Attachment #2: filt-sock-m --]
[-- Type: text/x-patch, Size: 375 bytes --]

diff --git a/net/core/filter.c b/net/core/filter.c
index d1d779c..6fcf577 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -85,6 +85,9 @@ int sk_filter(struct sock *sk, struct sk_buff *skb)
 	if (err)
 		return err;

+	if (sk->sk_mark && sk->sk_mark != skb->mark)
+		return -EPERM;
+
 	rcu_read_lock_bh();
 	filter = rcu_dereference(sk->sk_filter);
 	if (filter) {

^ permalink raw reply related

* Re: PF_RING: Include in main line kernel?
From: Harald Welte @ 2009-10-18 12:38 UTC (permalink / raw)
  To: Brad Doctor; +Cc: netdev, Luca Deri
In-Reply-To: <a07586b0910140733s1976cd05u39286de42d9fac23@mail.gmail.com>

Hi Brad and Luca,

On Wed, Oct 14, 2009 at 08:33:08AM -0600, Brad Doctor wrote:

> On behalf of the users and developers of the PF_RING project, we would
> like to ask consideration to include the PF_RING module in the main
> line kernel.

First of all, let me state that I think the mainline support for nProbe/nTop is
something that I have been hoping for many years.  I think the performance you
are achieving is remarkable, and it would be very usable to have this
capability of high performance zero-copy packet access from userspace as a
stock feature of the Linux kernel.

The actual PF_RING implementation has been criticized a couple of times even in
the past.  One general point I remember from past discussions in the kernel
network community was that there is too much overlap with PF_PACKET, and that
this could possibly be extended with a ring buffer rather than replaced with a
fairly similar alternative mechanism.

So let's see what kind of solution the current discussion thread will come up
with... let's hope eventually we'll have the functionality in the kernel.
-- 
- Harald Welte <laforge@gnumonks.org>           http://laforge.gnumonks.org/
============================================================================
"Privacy in residential applications is a desirable marketing option."
                                                  (ETSI EN 300 175-7 Ch. A6)

^ permalink raw reply

* [OT] ntop / GPL (was Re: PF_RING: Include in main line kernel?)
From: Harald Welte @ 2009-10-18 12:47 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: Brad Doctor, netdev, Luca Deri
In-Reply-To: <4AD60053.1030804@gmail.com>

Hi Jarek, Brad, Luca,

[putting my gpl-violations.org hat on]

On Wed, Oct 14, 2009 at 06:46:11PM +0200, Jarek Poplawski wrote:
> Brad Doctor wrote, On 10/14/2009 04:33 PM:
> 
> > Download ntop
> > 
> > ntop is distributed under the GNU GPL. In order to be entitled to download
> > ntop you must accept the GNU license. 
> 
> I can't find such a thing neither in GNU GPL v2:

This is true.  The GPL does never need to be accepted for mere use (i.e.
running) the program.  This is at least true for the continental european
copyright systems, where any legally obtained copy of a program implicitly
carries the permission for running the program.  Only for any other activity
you will need to accept the license.

but, like others posted in this thread, ntop is not the PF_RING code.

-- 
- Harald Welte <laforge@gnumonks.org>           http://laforge.gnumonks.org/
============================================================================
"Privacy in residential applications is a desirable marketing option."
                                                  (ETSI EN 300 175-7 Ch. A6)

^ permalink raw reply

* Re: PF_RING: Include in main line kernel?
From: Harald Welte @ 2009-10-18 12:45 UTC (permalink / raw)
  To: Ben Greear
  Cc: Evgeniy Polyakov, David Miller, deri, shemminger, brad.doctor,
	netdev
In-Reply-To: <4AD74EF1.6080106@candelatech.com>

Hi Ben,

On Thu, Oct 15, 2009 at 09:33:53AM -0700, Ben Greear wrote:

> Bonding, bridging, mac-vlans, pktgen-rx logic, sniffers, and others could.
> The only trick is ordering...it may be better to keep the hard-coded hooks in
> a definite order than to allow users to switch them around.

why does this sound like the netfilter hooks with their priorities to me?  The
only difference is that netfilter hooks are inside the layer 3 protocols at the
moment, whereas this is one layer lower...

-- 
- Harald Welte <laforge@gnumonks.org>           http://laforge.gnumonks.org/
============================================================================
"Privacy in residential applications is a desirable marketing option."
                                                  (ETSI EN 300 175-7 Ch. A6)

^ permalink raw reply

* Re: PF_RING: Include in main line kernel?
From: Harald Welte @ 2009-10-18 12:43 UTC (permalink / raw)
  To: David Miller; +Cc: greearb, zbr, deri, shemminger, brad.doctor, netdev
In-Reply-To: <20091014.144923.112167161.davem@davemloft.net>

Hi Dave,

On Wed, Oct 14, 2009 at 02:49:23PM -0700, David Miller wrote:

> > Maybe something similar to the attached patch?
> 
> This is not something I'm interested in applying.
> 
> It makes implementing proprietary complete networking stacks
> for Linux way too easy.

How does it make it any easier?  Even right now you can implement an entire
protocol family in your own module, either by registering as netpoll handler,
or even using the regular dev_add_pack().

-- 
- Harald Welte <laforge@gnumonks.org>           http://laforge.gnumonks.org/
============================================================================
"Privacy in residential applications is a desirable marketing option."
                                                  (ETSI EN 300 175-7 Ch. A6)

^ permalink raw reply

* Re: PF_RING: Include in main line kernel?
From: Harald Welte @ 2009-10-18 12:50 UTC (permalink / raw)
  To: Luca Deri; +Cc: Brent Cook, Brad Doctor, netdev
In-Reply-To: <903D7FEC-34E8-4F70-ABC0-E56A3A5FBBE6@ntop.org>

On Wed, Oct 14, 2009 at 09:54:26PM +0200, Luca Deri wrote:
> Brent
> contrary to other socket types, PF_RING allows
> - packets to be filtered using both BPF and ACL-like filters
> - parsing information is returned as metadata with the packet (i.e.
> you don't have to parse the packet again as it happens with BPF)
> - ACL-like filters allows you to specify advanced features such as
> port ranges or packet payload match

So it seems there is some added features over the existing functionality, plus
probably increased performance mainly to hooking earlier in the packet receive
flow.

What would normally be done is to try to make incremental changes
to the existing code and extend their features/performacne, rather than
adding something relatively similar alternative.

> I agree with you that PF_RING has some overlaps with PACKET_RX/
> TX_RING, but the main idea behind PF_RING is not just to accelerate
> packet capture. For instance in PF_RING you can have actions
> attached to rules, or extend PF_RING filtering/packet handling by
> means of plugins.

Is this in the actual kernel code?  I am not sure whether people generally want
to see ayet another packet filter in Linux ;)

Regards,
-- 
- Harald Welte <laforge@gnumonks.org>           http://laforge.gnumonks.org/
============================================================================
"Privacy in residential applications is a desirable marketing option."
                                                  (ETSI EN 300 175-7 Ch. A6)

^ permalink raw reply

* Re: PF_RING: Include in main line kernel?
From: Harald Welte @ 2009-10-18 12:56 UTC (permalink / raw)
  To: Luca Deri; +Cc: David Miller, bcook, brad.doctor, netdev
In-Reply-To: <C803B824-8249-44BA-ACCD-D9AE4AA21F92@ntop.org>

Dear Luca,

On Wed, Oct 14, 2009 at 10:34:17PM +0200, Luca Deri wrote:

> so do you want me to start porting PF_RING facilities into
> PF_PACKET? As I have said I;m not against this: my goal is to
> include this work into the linux kernel, as it has been separate for
> too long.

I would suggest you do some analysis of each of the features that PF_RING
currently offers, and think about how each those features could be added
as an incremental change to PF_PACKET.

Then post this as a RFC to the list, see what people think about each of those
issues and follow with the actual implementation.  If you think the actual
implementation might be easier than to make a written proposal, that would
of course also be great.

Regards,
-- 
- Harald Welte <laforge@gnumonks.org>           http://laforge.gnumonks.org/
============================================================================
"Privacy in residential applications is a desirable marketing option."
                                                  (ETSI EN 300 175-7 Ch. A6)

^ permalink raw reply

* [PATCH 0/4 v3] net: Implement fast TX queue selection
From: Krishna Kumar @ 2009-10-18 13:07 UTC (permalink / raw)
  To: davem; +Cc: netdev, herbert, Krishna Kumar, dada1

From: Krishna Kumar <krkumar2@in.ibm.com>

Notes:
    1.  Eric suggested:
	- To use u16 for txq#, but I am using an "int" for now as that
	  avoids one unnecessary subtraction during tx.
	- An improvement of caching the txq at connection establishment
	  time (TBD later) so as to use rxq# = txq#.
	- Drivers can call sk_tx_queue_set() to set the txq if they are
	  going to call skb_tx_hash() internally.
    2. v3 patch stress tested with 1000 netperfs, reboot's, etc.

Changelog [from v2]
--------------------
	1. Changed names of functions setting, getting and returning the
	   txq#; and added a new one to reset the txq#.
	2. Free sk doesn't need to reset txq#.

Changelog [from v1]
--------------------
	1. Changed IPv6 code to call __sk_dst_reset() directly.
	2. Removed the patch re-arranging ("encapsulating") __sk_dst_reset()

Multiqueue cards on routers/firewalls set skb->queue_mapping on
input which helps in faster xmit. Implement fast queue selection
for locally generated packets also, by saving the txq# for
connected sockets (in dev_pick_tx) and use it in subsequent
iterations. Locally generated packets for a connection will xmit
on the same txq, but routing & firewall loads should not be
affected by this patch. Tests shows the distribution across txq's
for 1-4 netperf sessions is similar to existing code.


                   Testing & results:
                   ------------------

1. Cycles/Iter (C/I) used by dev_pick_tx:
         (B -> Billion,   M -> Million)
   |--------------|------------------------|------------------------|
   |              |          ORG           |          NEW           |
   |  Test        |--------|---------|-----|--------|---------|-----|
   |              | Cycles |  Iters  | C/I | Cycles | Iters   | C/I |
   |--------------|--------|---------|-----|--------|---------|-----|
   | [TCP_STREAM, | 3.98 B | 12.47 M | 320 | 1.95 B | 12.92 M | 152 |
   |  UDP_STREAM, |        |         |     |        |         |     |         
   |  TCP_RR,     |        |         |     |        |         |     |        
   |  UDP_RR]     |        |         |     |        |         |     |        
   |--------------|--------|---------|-----|--------|---------|-----|        
   | [TCP_STREAM, | 8.92 B | 29.66 M | 300 | 3.82 B | 38.88 M | 98  |        
   |  TCP_RR,     |        |         |     |        |         |     |         
   |  UDP_RR]     |        |         |     |        |         |     |         
   |--------------|--------|---------|-----|--------|---------|-----|

2. Stress test (over 48 hours) : 1000 netperfs running combination
   of TCP_STREAM/RR, UDP_STREAM/RR (v4/6, NODELAY/~NODELAY for all
   tests), with some ssh sessions, reboots, modprobe -r driver, etc.

3. Performance test (10 hours): Single 10 hour netperf run of
   TCP_STREAM/RR, TCP_STREAM + NO_DELAY and UDP_RR. Results show an
   improvement in both performance and cpu utilization.

Tested on a 4-processor AMD Opteron 2.8 GHz system with 1GB memory,
10G Chelsio card. Each BW number is the sum of 3 iterations of
individual tests using 512, 16K, 64K & 128K I/O sizes, in Mb/s:

------------------------  TCP Tests  -----------------------
#procs  Org BW     New BW (%)     Org SD     New SD (%)
------------------------------------------------------------
1       77777.7    81011.0 (4.15)    42.3     40.2 (-5.11)
4       91599.2    91878.8 (.30)    955.9    919.3 (-3.83)
6       89533.3    91792.2 (2.52)  2262.0   2143.0 (-5.25)
8       87507.5    89161.9 (1.89)  4363.4   4073.6 (-6.64)
10      85152.4    85607.8 (.53)   6890.4   6851.2 (-.56)
------------------------------------------------------------

------------------------- TCP NO_DELAY Tests ---------------
#procs  Org BW     New BW (%)      Org SD      New SD (%)
------------------------------------------------------------
1       57001.9    57888.0 (1.55)     67.7      70.2 (3.75)
4       69555.1    69957.4 (.57)     823.0     834.3 (1.36)
6       71359.3    71918.7 (.78)    1740.8    1724.5 (-.93)
8       72577.6    72496.1 (-.11)   2955.4    2937.7 (-.59)
10      70829.6    71444.2 (.86)    4826.1    4673.4 (-3.16)
------------------------------------------------------------

----------------------- Request Response Tests --------------------
#procs  Org TPS     New TPS (%)      Org SD    New SD (%)
(1-10)
-------------------------------------------------------------------
TCP     1019245.9   1042626.4 (2.29) 16352.9   16459.8 (.65)
UDP     934598.64   942956.9  (.89)  11607.3   11593.2 (-.12)
-------------------------------------------------------------------

Thanks,

- KK

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---

^ permalink raw reply

* [PATCH 1/4 v3] net: Introduce sk_tx_queue_mapping
From: Krishna Kumar @ 2009-10-18 13:07 UTC (permalink / raw)
  To: davem; +Cc: netdev, herbert, Krishna Kumar, dada1
In-Reply-To: <20091018130727.3960.32107.sendpatchset@localhost.localdomain>

From: Krishna Kumar <krkumar2@in.ibm.com>

Introduce sk_tx_queue_mapping; and functions that set, test and
get this value. Reset sk_tx_queue_mapping to -1 whenever the dst
cache is set/reset, and in socket alloc. Setting txq to -1 and
using valid txq=<0 to n-1> allows the tx path to use the value
of sk_tx_queue_mapping directly instead of subtracting 1 on every
tx.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 include/net/sock.h |   26 ++++++++++++++++++++++++++
 net/core/sock.c    |    5 ++++-
 2 files changed, 30 insertions(+), 1 deletion(-)

diff -ruNp org/include/net/sock.h new/include/net/sock.h
--- org/include/net/sock.h	2009-10-16 18:53:40.000000000 +0530
+++ new/include/net/sock.h	2009-10-16 21:38:44.000000000 +0530
@@ -107,6 +107,7 @@ struct net;
  *	@skc_node: main hash linkage for various protocol lookup tables
  *	@skc_nulls_node: main hash linkage for UDP/UDP-Lite protocol
  *	@skc_refcnt: reference count
+ *	@skc_tx_queue_mapping: tx queue number for this connection
  *	@skc_hash: hash value used with various protocol lookup tables
  *	@skc_family: network address family
  *	@skc_state: Connection state
@@ -128,6 +129,7 @@ struct sock_common {
 		struct hlist_nulls_node skc_nulls_node;
 	};
 	atomic_t		skc_refcnt;
+	int			skc_tx_queue_mapping;
 
 	unsigned int		skc_hash;
 	unsigned short		skc_family;
@@ -215,6 +217,7 @@ struct sock {
 #define sk_node			__sk_common.skc_node
 #define sk_nulls_node		__sk_common.skc_nulls_node
 #define sk_refcnt		__sk_common.skc_refcnt
+#define sk_tx_queue_mapping	__sk_common.skc_tx_queue_mapping
 
 #define sk_copy_start		__sk_common.skc_hash
 #define sk_hash			__sk_common.skc_hash
@@ -1094,8 +1097,29 @@ static inline void sock_put(struct sock 
 extern int sk_receive_skb(struct sock *sk, struct sk_buff *skb,
 			  const int nested);
 
+static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
+{
+	sk->sk_tx_queue_mapping = tx_queue;
+}
+
+static inline void sk_tx_queue_clear(struct sock *sk)
+{
+	sk->sk_tx_queue_mapping = -1;
+}
+
+static inline int sk_tx_queue_get(const struct sock *sk)
+{
+	return sk->sk_tx_queue_mapping;
+}
+
+static inline bool sk_tx_queue_recorded(const struct sock *sk)
+{
+	return (sk && sk->sk_tx_queue_mapping >= 0);
+}
+
 static inline void sk_set_socket(struct sock *sk, struct socket *sock)
 {
+	sk_tx_queue_clear(sk);
 	sk->sk_socket = sock;
 }
 
@@ -1152,6 +1176,7 @@ __sk_dst_set(struct sock *sk, struct dst
 {
 	struct dst_entry *old_dst;
 
+	sk_tx_queue_clear(sk);
 	old_dst = sk->sk_dst_cache;
 	sk->sk_dst_cache = dst;
 	dst_release(old_dst);
@@ -1170,6 +1195,7 @@ __sk_dst_reset(struct sock *sk)
 {
 	struct dst_entry *old_dst;
 
+	sk_tx_queue_clear(sk);
 	old_dst = sk->sk_dst_cache;
 	sk->sk_dst_cache = NULL;
 	dst_release(old_dst);
diff -ruNp org/net/core/sock.c new/net/core/sock.c
--- org/net/core/sock.c	2009-10-16 18:53:40.000000000 +0530
+++ new/net/core/sock.c	2009-10-16 21:29:02.000000000 +0530
@@ -357,6 +357,7 @@ struct dst_entry *__sk_dst_check(struct 
 	struct dst_entry *dst = sk->sk_dst_cache;
 
 	if (dst && dst->obsolete && dst->ops->check(dst, cookie) == NULL) {
+		sk_tx_queue_clear(sk);
 		sk->sk_dst_cache = NULL;
 		dst_release(dst);
 		return NULL;
@@ -953,7 +954,8 @@ static void sock_copy(struct sock *nsk, 
 	void *sptr = nsk->sk_security;
 #endif
 	BUILD_BUG_ON(offsetof(struct sock, sk_copy_start) !=
-		     sizeof(osk->sk_node) + sizeof(osk->sk_refcnt));
+		     sizeof(osk->sk_node) + sizeof(osk->sk_refcnt) +
+		     sizeof(osk->sk_tx_queue_mapping));
 	memcpy(&nsk->sk_copy_start, &osk->sk_copy_start,
 	       osk->sk_prot->obj_size - offsetof(struct sock, sk_copy_start));
 #ifdef CONFIG_SECURITY_NETWORK
@@ -997,6 +999,7 @@ static struct sock *sk_prot_alloc(struct
 
 		if (!try_module_get(prot->owner))
 			goto out_free_sec;
+		sk_tx_queue_clear(sk);
 	}
 
 	return sk;

^ permalink raw reply

* [PATCH 2/4 v3] net: Use sk_tx_queue_mapping for connected sockets
From: Krishna Kumar @ 2009-10-18 13:07 UTC (permalink / raw)
  To: davem; +Cc: netdev, herbert, Krishna Kumar, dada1
In-Reply-To: <20091018130727.3960.32107.sendpatchset@localhost.localdomain>

From: Krishna Kumar <krkumar2@in.ibm.com>

For connected sockets, the first run of dev_pick_tx saves the
calculated txq in sk_tx_queue_mapping. This is not saved if
either the device has a queue select or the socket is not
connected. Next iterations of dev_pick_tx uses the cached value
of sk_tx_queue_mapping.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 net/core/dev.c |   24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

diff -ruNp org/net/core/dev.c new/net/core/dev.c
--- org/net/core/dev.c	2009-10-16 18:53:40.000000000 +0530
+++ new/net/core/dev.c	2009-10-16 21:30:38.000000000 +0530
@@ -1791,13 +1791,25 @@ EXPORT_SYMBOL(skb_tx_hash);
 static struct netdev_queue *dev_pick_tx(struct net_device *dev,
 					struct sk_buff *skb)
 {
-	const struct net_device_ops *ops = dev->netdev_ops;
-	u16 queue_index = 0;
+	u16 queue_index;
+	struct sock *sk = skb->sk;
+
+	if (sk_tx_queue_recorded(sk)) {
+		queue_index = sk_tx_queue_get(sk);
+	} else {
+		const struct net_device_ops *ops = dev->netdev_ops;
 
-	if (ops->ndo_select_queue)
-		queue_index = ops->ndo_select_queue(dev, skb);
-	else if (dev->real_num_tx_queues > 1)
-		queue_index = skb_tx_hash(dev, skb);
+		if (ops->ndo_select_queue) {
+			queue_index = ops->ndo_select_queue(dev, skb);
+		} else {
+			queue_index = 0;
+			if (dev->real_num_tx_queues > 1)
+				queue_index = skb_tx_hash(dev, skb);
+
+			if (sk && sk->sk_dst_cache)
+				sk_record_tx_queue(sk, queue_index);
+		}
+	}
 
 	skb_set_queue_mapping(skb, queue_index);
 	return netdev_get_tx_queue(dev, queue_index);

^ permalink raw reply

* [PATCH 3/4 v3] net: IPv6 changes
From: Krishna Kumar @ 2009-10-18 13:08 UTC (permalink / raw)
  To: davem; +Cc: netdev, herbert, Krishna Kumar, dada1
In-Reply-To: <20091018130727.3960.32107.sendpatchset@localhost.localdomain>

From: Krishna Kumar <krkumar2@in.ibm.com>

IPv6: Reset sk_tx_queue_mapping when dst_cache is reset. Use existing
macro to do the work.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 net/ipv6/inet6_connection_sock.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff -ruNp org/net/ipv6/inet6_connection_sock.c new/net/ipv6/inet6_connection_sock.c
--- org/net/ipv6/inet6_connection_sock.c	2009-10-16 21:29:19.000000000 +0530
+++ new/net/ipv6/inet6_connection_sock.c	2009-10-16 21:31:00.000000000 +0530
@@ -168,8 +168,7 @@ struct dst_entry *__inet6_csk_dst_check(
 	if (dst) {
 		struct rt6_info *rt = (struct rt6_info *)dst;
 		if (rt->rt6i_flow_cache_genid != atomic_read(&flow_cache_genid)) {
-			sk->sk_dst_cache = NULL;
-			dst_release(dst);
+			__sk_dst_reset(sk);
 			dst = NULL;
 		}
 	}

^ permalink raw reply

* [PATCH 4/4 v3] net: Fix for dst_negative_advice
From: Krishna Kumar @ 2009-10-18 13:08 UTC (permalink / raw)
  To: davem; +Cc: netdev, herbert, Krishna Kumar, dada1
In-Reply-To: <20091018130727.3960.32107.sendpatchset@localhost.localdomain>

From: Krishna Kumar <krkumar2@in.ibm.com>

dst_negative_advice() should check for changed dst and reset
sk_tx_queue_mapping accordingly. Pass sock to the callers of
dst_negative_advice.

(sk_reset_txq is defined just for use by dst_negative_advice. The
only way I could find to get around this is to move dst_negative_()
from dst.h to dst.c, include sock.h in dst.c, etc)

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 include/net/dst.h      |   12 ++++++++++--
 net/core/sock.c        |    6 ++++++
 net/dccp/timer.c       |    4 ++--
 net/decnet/af_decnet.c |    2 +-
 net/ipv4/tcp_timer.c   |    4 ++--
 5 files changed, 21 insertions(+), 7 deletions(-)

diff -ruNp org/include/net/dst.h new/include/net/dst.h
--- org/include/net/dst.h	2009-10-16 21:30:56.000000000 +0530
+++ new/include/net/dst.h	2009-10-16 21:31:30.000000000 +0530
@@ -222,11 +222,19 @@ static inline void dst_confirm(struct ds
 		neigh_confirm(dst->neighbour);
 }
 
-static inline void dst_negative_advice(struct dst_entry **dst_p)
+static inline void dst_negative_advice(struct dst_entry **dst_p,
+				       struct sock *sk)
 {
 	struct dst_entry * dst = *dst_p;
-	if (dst && dst->ops->negative_advice)
+	if (dst && dst->ops->negative_advice) {
 		*dst_p = dst->ops->negative_advice(dst);
+
+		if (dst != *dst_p) {
+			extern void sk_reset_txq(struct sock *sk);
+
+			sk_reset_txq(sk);
+		}
+	}
 }
 
 static inline void dst_link_failure(struct sk_buff *skb)
diff -ruNp org/net/core/sock.c new/net/core/sock.c
--- org/net/core/sock.c	2009-10-16 21:30:56.000000000 +0530
+++ new/net/core/sock.c	2009-10-16 21:32:33.000000000 +0530
@@ -352,6 +352,12 @@ discard_and_relse:
 }
 EXPORT_SYMBOL(sk_receive_skb);
 
+void sk_reset_txq(struct sock *sk)
+{
+	sk_tx_queue_clear(sk);
+}
+EXPORT_SYMBOL(sk_reset_txq);
+
 struct dst_entry *__sk_dst_check(struct sock *sk, u32 cookie)
 {
 	struct dst_entry *dst = sk->sk_dst_cache;
diff -ruNp org/net/dccp/timer.c new/net/dccp/timer.c
--- org/net/dccp/timer.c	2009-10-16 21:30:56.000000000 +0530
+++ new/net/dccp/timer.c	2009-10-16 21:31:30.000000000 +0530
@@ -38,7 +38,7 @@ static int dccp_write_timeout(struct soc
 
 	if (sk->sk_state == DCCP_REQUESTING || sk->sk_state == DCCP_PARTOPEN) {
 		if (icsk->icsk_retransmits != 0)
-			dst_negative_advice(&sk->sk_dst_cache);
+			dst_negative_advice(&sk->sk_dst_cache, sk);
 		retry_until = icsk->icsk_syn_retries ?
 			    : sysctl_dccp_request_retries;
 	} else {
@@ -63,7 +63,7 @@ static int dccp_write_timeout(struct soc
 			   Golden words :-).
 		   */
 
-			dst_negative_advice(&sk->sk_dst_cache);
+			dst_negative_advice(&sk->sk_dst_cache, sk);
 		}
 
 		retry_until = sysctl_dccp_retries2;
diff -ruNp org/net/decnet/af_decnet.c new/net/decnet/af_decnet.c
--- org/net/decnet/af_decnet.c	2009-10-16 21:30:56.000000000 +0530
+++ new/net/decnet/af_decnet.c	2009-10-16 21:31:30.000000000 +0530
@@ -1955,7 +1955,7 @@ static int dn_sendmsg(struct kiocb *iocb
 	}
 
 	if ((flags & MSG_TRYHARD) && sk->sk_dst_cache)
-		dst_negative_advice(&sk->sk_dst_cache);
+		dst_negative_advice(&sk->sk_dst_cache, sk);
 
 	mss = scp->segsize_rem;
 	fctype = scp->services_rem & NSP_FC_MASK;
diff -ruNp org/net/ipv4/tcp_timer.c new/net/ipv4/tcp_timer.c
--- org/net/ipv4/tcp_timer.c	2009-10-16 21:30:56.000000000 +0530
+++ new/net/ipv4/tcp_timer.c	2009-10-16 21:31:30.000000000 +0530
@@ -141,14 +141,14 @@ static int tcp_write_timeout(struct sock
 
 	if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
 		if (icsk->icsk_retransmits)
-			dst_negative_advice(&sk->sk_dst_cache);
+			dst_negative_advice(&sk->sk_dst_cache, sk);
 		retry_until = icsk->icsk_syn_retries ? : sysctl_tcp_syn_retries;
 	} else {
 		if (retransmits_timed_out(sk, sysctl_tcp_retries1)) {
 			/* Black hole detection */
 			tcp_mtu_probing(icsk, sk);
 
-			dst_negative_advice(&sk->sk_dst_cache);
+			dst_negative_advice(&sk->sk_dst_cache, sk);
 		}
 
 		retry_until = sysctl_tcp_retries2;

^ permalink raw reply

* Re: PF_RING: Include in main line kernel?
From: Evgeniy Polyakov @ 2009-10-18 14:18 UTC (permalink / raw)
  To: Harald Welte; +Cc: David Miller, greearb, deri, shemminger, brad.doctor, netdev
In-Reply-To: <20091018124337.GE27747@prithivi.gnumonks.org>

On Sun, Oct 18, 2009 at 02:43:37PM +0200, Harald Welte (laforge@gnumonks.org) wrote:
> How does it make it any easier?  Even right now you can implement an entire
> protocol family in your own module, either by registering as netpoll handler,
> or even using the regular dev_add_pack().

Well, it does, since packet will be processed by the main stack after
that, and module will work with the copy only. But I agree that this is
a weak argument.

If it is still a blocking one, what about implementing additional
gpl-only list of handlers which will have 'consumed' skb check? I
believe it would be enough to put it only in single place after the
bridge?

-- 
	Evgeniy Polyakov

^ permalink raw reply

* Re: PF_RING: Include in main line kernel?
From: Evgeniy Polyakov @ 2009-10-18 14:50 UTC (permalink / raw)
  To: Harald Welte; +Cc: Luca Deri, Brent Cook, Brad Doctor, netdev
In-Reply-To: <20091018125014.GH27747@prithivi.gnumonks.org>

On Sun, Oct 18, 2009 at 02:50:14PM +0200, Harald Welte (laforge@gnumonks.org) wrote:
> > contrary to other socket types, PF_RING allows
> > - packets to be filtered using both BPF and ACL-like filters
> > - parsing information is returned as metadata with the packet (i.e.
> > you don't have to parse the packet again as it happens with BPF)
> > - ACL-like filters allows you to specify advanced features such as
> > port ranges or packet payload match
> 
> So it seems there is some added features over the existing functionality, plus
> probably increased performance mainly to hooking earlier in the packet receive
> flow.
> 
> What would normally be done is to try to make incremental changes
> to the existing code and extend their features/performacne, rather than
> adding something relatively similar alternative.

PF_PACKET as is can not be made faster - it requires a packet copy, so
virtually this is an end of the game, while mapped packet socket is
quite different and does not require that expensive copy. And while
currently difference between both goes down, it still exists and may
hummer some use cases.

PF_RING uses another ring structure and I saw comparisons of both (many
years ago though), where pf_ring was faster. Unfortunately there is no
way to easily adopt its mapping into pf_packet ring without breaking
compatibility, but I wonder whether performance different between both
still exists and can it be a main factor for the preference. If
difference is not visible, than I believe the only way for PF_RING is to
extend existing packet sockets with its other features.

-- 
	Evgeniy Polyakov

^ permalink raw reply

* Re: [net-next-2.6 PATCH 1/4 revised] TCPCT part 1a: extend struct tcp_request_sock
From: William Allen Simpson @ 2009-10-18 15:57 UTC (permalink / raw)
  To: Linux Kernel Network Developers
In-Reply-To: <4AD8AFC0.1090101@gmail.com>

William Allen Simpson wrote:
> Pass additional parameters associated with sending SYNACK.  This
> is not as straightforward or architecturally clean as previously
> proposed, and has the unfortunate side effect of potentially
> including otherwise unneeded headers for related protocols, but
> that problem will affect very few files.
> ---
>  include/net/extend_request_sock.h |   37 
> +++++++++++++++++++++++++++++++++++++
>  1 files changed, 37 insertions(+), 0 deletions(-)
>  create mode 100644 include/net/extend_request_sock.h
> 
This technique appears to be unworkable:

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 9971870..30c4808 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -71,6 +71,7 @@
  #include <net/timewait_sock.h>
  #include <net/xfrm.h>
  #include <net/netdma.h>
+#include <net/extend_request_sock.h>

  #include <linux/inet.h>
  #include <linux/ipv6.h>
@@ -1195,6 +1196,15 @@ struct request_sock_ops tcp_request_sock_ops __read_mostly = {
  	.send_reset	=	tcp_v4_send_reset,
  };

+struct request_sock_ops tcp4_extend_request_sock_ops __read_mostly = {
+	.family		=	PF_INET,
+	.obj_size	=	sizeof(struct extend_request_sock),
+	.rtx_syn_ack	=	tcp_v4_send_synack,
+	.send_ack	=	tcp_v4_reqsk_send_ack,
+	.destructor	=	tcp_v4_reqsk_destructor,
+	.send_reset	=	tcp_v4_send_reset,
+};
+

...

+		req = inet_reqsk_alloc(&tcp4_extend_request_sock_ops);
+		if (NULL == req)
+			goto drop;
+

Many hours of investigation demonstrated that the obj_size isn't actually
used to allocate the structure.  Heck, it's not even checked to determine
whether there's enough room!  Instead, the kernel crashes later, as the
extended variables are accessed!

Returning to the architecturally clean parameters of the previous patch
series, that has the distinct advantage of actually working....

^ permalink raw reply related

* Re: [PATCH/RFC] make unregister_netdev() delete more than 4 interfaces per second
From: Benjamin LaHaise @ 2009-10-18 16:13 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <4ADA98EE.9040509@gmail.com>

On Sun, Oct 18, 2009 at 06:26:22AM +0200, Eric Dumazet wrote:
> Unfortunatly this slow down fast path by an order of magnitude.
> 
> atomic_dec() is pretty cheap (and eventually could use a per_cpu thing,
> now we have a new and sexy per_cpu allocator), but atomic_dec_and_test()
> is not that cheap and more important forbids a per_cpu conversion.

dev_put() is not a fast path by any means.  atomic_dec_and_test() costs 
the same as atomic_dec() on any modern CPU -- the cost is in the cacheline 
bouncing and serialisation both require.  The case of the device count 
becoming 0 is quite rare -- any device with a route on it will never hit 
a reference count of 0.

		-ben

^ permalink raw reply

* [net-next-2.6 PATCH 1/4 resent] TCPCT part 1a: add function parameter for sending SYNACK
From: William Allen Simpson @ 2009-10-18 16:14 UTC (permalink / raw)
  To: Linux Kernel Network Developers

[-- Attachment #1: Type: text/plain, Size: 906 bytes --]

Add optional function parameters associated with sending SYNACK.
These parameters are not needed after sending SYNACK, and are not
used for retransmission.  Avoids extending struct tcp_request_sock,
and avoids allocating kernel memory.

Only affects DCCP as it also uses common struct request_sock_ops,
but this void parameter is currently reserved for future use.
---
   include/net/request_sock.h      |    3 ++-
   include/net/tcp.h               |    3 ++-
   net/dccp/ipv4.c                 |    5 +++--
   net/dccp/ipv6.c                 |    5 +++--
   net/dccp/minisocks.c            |    2 +-
   net/ipv4/inet_connection_sock.c |    2 +-
   net/ipv4/tcp_ipv4.c             |   12 +++++++-----
   net/ipv4/tcp_minisocks.c        |    2 +-
   net/ipv4/tcp_output.c           |    2 +-
   net/ipv6/tcp_ipv6.c             |   14 +++++++-------
   10 files changed, 28 insertions(+), 22 deletions(-)



[-- Attachment #2: TCPCT+1-1.patch --]
[-- Type: text/plain, Size: 7276 bytes --]

diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index c719084..cdd9e8b 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -33,7 +33,8 @@ struct request_sock_ops {
 	struct kmem_cache	*slab;
 	char		*slab_name;
 	int		(*rtx_syn_ack)(struct sock *sk,
-				       struct request_sock *req);
+				       struct request_sock *req,
+				       void *extend_values);
 	void		(*send_ack)(struct sock *sk, struct sk_buff *skb,
 				    struct request_sock *req);
 	void		(*send_reset)(struct sock *sk,
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 03a49c7..28bcaf7 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -443,7 +443,8 @@ extern int			tcp_connect(struct sock *sk);
 
 extern struct sk_buff *		tcp_make_synack(struct sock *sk,
 						struct dst_entry *dst,
-						struct request_sock *req);
+						struct request_sock *req,
+						void *extend_values);
 
 extern int			tcp_disconnect(struct sock *sk, int flags);
 
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index 7302e14..6fc9ea3 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -473,7 +473,8 @@ static struct dst_entry* dccp_v4_route_skb(struct net *net, struct sock *sk,
 	return &rt->u.dst;
 }
 
-static int dccp_v4_send_response(struct sock *sk, struct request_sock *req)
+static int dccp_v4_send_response(struct sock *sk, struct request_sock *req,
+				 void *extend_unused)
 {
 	int err = -1;
 	struct sk_buff *skb;
@@ -622,7 +623,7 @@ int dccp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 	dreq->dreq_iss	   = dccp_v4_init_sequence(skb);
 	dreq->dreq_service = service;
 
-	if (dccp_v4_send_response(sk, req))
+	if (dccp_v4_send_response(sk, req, NULL))
 		goto drop_and_free;
 
 	inet_csk_reqsk_queue_hash_add(sk, req, DCCP_TIMEOUT_INIT);
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index a2afb55..63fb189 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -241,7 +241,8 @@ out:
 }
 
 
-static int dccp_v6_send_response(struct sock *sk, struct request_sock *req)
+static int dccp_v6_send_response(struct sock *sk, struct request_sock *req,
+				 void *extend_unused)
 {
 	struct inet6_request_sock *ireq6 = inet6_rsk(req);
 	struct ipv6_pinfo *np = inet6_sk(sk);
@@ -468,7 +469,7 @@ static int dccp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 	dreq->dreq_iss	   = dccp_v6_init_sequence(skb);
 	dreq->dreq_service = service;
 
-	if (dccp_v6_send_response(sk, req))
+	if (dccp_v6_send_response(sk, req, NULL))
 		goto drop_and_free;
 
 	inet6_csk_reqsk_queue_hash_add(sk, req, DCCP_TIMEOUT_INIT);
diff --git a/net/dccp/minisocks.c b/net/dccp/minisocks.c
index 5ca49ce..af226a0 100644
--- a/net/dccp/minisocks.c
+++ b/net/dccp/minisocks.c
@@ -184,7 +184,7 @@ struct sock *dccp_check_req(struct sock *sk, struct sk_buff *skb,
 			 * counter (backoff, monitored by dccp_response_timer).
 			 */
 			req->retrans++;
-			req->rsk_ops->rtx_syn_ack(sk, req);
+			req->rsk_ops->rtx_syn_ack(sk, req, NULL);
 		}
 		/* Network Duplicate, discard packet */
 		return NULL;
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 9139e8f..b7314f2 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -504,7 +504,7 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
 			if (time_after_eq(now, req->expires)) {
 				if ((req->retrans < thresh ||
 				     (inet_rsk(req)->acked && req->retrans < max_retries))
-				    && !req->rsk_ops->rtx_syn_ack(parent, req)) {
+				    && !req->rsk_ops->rtx_syn_ack(parent, req, NULL)) {
 					unsigned long timeo;
 
 					if (req->retrans++ == 0)
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 9971870..2d25bd4 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -742,7 +742,7 @@ static void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
  *	socket.
  */
 static int __tcp_v4_send_synack(struct sock *sk, struct request_sock *req,
-				struct dst_entry *dst)
+				struct dst_entry *dst, void *extend_values)
 {
 	const struct inet_request_sock *ireq = inet_rsk(req);
 	int err = -1;
@@ -752,7 +752,7 @@ static int __tcp_v4_send_synack(struct sock *sk, struct request_sock *req,
 	if (!dst && (dst = inet_csk_route_req(sk, req)) == NULL)
 		return -1;
 
-	skb = tcp_make_synack(sk, dst, req);
+	skb = tcp_make_synack(sk, dst, req, extend_values);
 
 	if (skb) {
 		struct tcphdr *th = tcp_hdr(skb);
@@ -773,9 +773,10 @@ static int __tcp_v4_send_synack(struct sock *sk, struct request_sock *req,
 	return err;
 }
 
-static int tcp_v4_send_synack(struct sock *sk, struct request_sock *req)
+static int tcp_v4_send_synack(struct sock *sk, struct request_sock *req,
+			      void *extend_values)
 {
-	return __tcp_v4_send_synack(sk, req, NULL);
+	return __tcp_v4_send_synack(sk, req, NULL, extend_values);
 }
 
 /*
@@ -1333,7 +1334,8 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 	}
 	tcp_rsk(req)->snt_isn = isn;
 
-	if (__tcp_v4_send_synack(sk, req, dst) || want_cookie)
+	if (__tcp_v4_send_synack(sk, req, dst, NULL) ||
+	    want_cookie)
 		goto drop_and_free;
 
 	inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index e320afe..8819882 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -537,7 +537,7 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 		 * Enforce "SYN-ACK" according to figure 8, figure 6
 		 * of RFC793, fixed by RFC1122.
 		 */
-		req->rsk_ops->rtx_syn_ack(sk, req);
+		req->rsk_ops->rtx_syn_ack(sk, req, NULL);
 		return NULL;
 	}
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index fcd278a..765d80f 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2219,7 +2219,7 @@ int tcp_send_synack(struct sock *sk)
 
 /* Prepare a SYN-ACK. */
 struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
-				struct request_sock *req)
+				struct request_sock *req, void *extend_values)
 {
 	struct inet_request_sock *ireq = inet_rsk(req);
 	struct tcp_sock *tp = tcp_sk(sk);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 4517630..3b3d7b3 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -460,7 +460,8 @@ out:
 }
 
 
-static int tcp_v6_send_synack(struct sock *sk, struct request_sock *req)
+static int tcp_v6_send_synack(struct sock *sk, struct request_sock *req,
+			      void *extend_values)
 {
 	struct inet6_request_sock *treq = inet6_rsk(req);
 	struct ipv6_pinfo *np = inet6_sk(sk);
@@ -498,7 +499,7 @@ static int tcp_v6_send_synack(struct sock *sk, struct request_sock *req)
 	if ((err = xfrm_lookup(sock_net(sk), &dst, &fl, sk, 0)) < 0)
 		goto done;
 
-	skb = tcp_make_synack(sk, dst, req);
+	skb = tcp_make_synack(sk, dst, req, extend_values);
 	if (skb) {
 		struct tcphdr *th = tcp_hdr(skb);
 
@@ -1242,13 +1243,12 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 
 	security_inet_conn_request(sk, skb, req);
 
-	if (tcp_v6_send_synack(sk, req))
+	if (tcp_v6_send_synack(sk, req, NULL) ||
+	    want_cookie)
 		goto drop;
 
-	if (!want_cookie) {
-		inet6_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
-		return 0;
-	}
+	inet6_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
+	return 0;
 
 drop:
 	if (req)
-- 
1.6.0.4



^ permalink raw reply related

* [net-next-2.6 PATCH 4/4 resent] TCPCT part 1d: initial SYN exchange with SYNACK data
From: William Allen Simpson @ 2009-10-18 16:28 UTC (permalink / raw)
  To: Linux Kernel Network Developers

[-- Attachment #1: Type: text/plain, Size: 2666 bytes --]



-------- Original Message --------
Subject: [net-next-2.6 PATCH 4/4] TCPCT part 1: initial SYN exchange with SYNACK data
Date: Thu, 15 Oct 2009 01:36:48 -0400
From: William Allen Simpson <william.allen.simpson@gmail.com>
To: Linux Kernel Network Developers <netdev@vger.kernel.org>
References: <4AD6B31B.3060402@gmail.com> <4AD6B3E8.2050904@gmail.com> <4AD6B467.2080701@gmail.com>

This is a significantly revised implementation of an earlier (year-old)
patch that no longer applies cleanly, with permission of the original
author (Adam Langley).  That patch was previously reviewed:

    http://thread.gmane.org/gmane.linux.network/102586

The principle difference is using a TCP option to carry the cookie nonce,
instead of a user configured offset in the data.  This is more flexible and
less subject to user configuration error.  Such a cookie option has been
suggested for many years, and is also useful without SYN data, allowing
several related concepts to use the same extension option.

    "Re: SYN floods (was: does history repeat itself?)", September 9, 1996.
    http://www.merit.net/mail.archives/nanog/1996-09/msg00235.html

    "Re: what a new TCP header might look like", May 12, 1998.
    ftp://ftp.isi.edu/end2end/end2end-interest-1998.mail

Data structures are carefully composed to require minimal additions.
For example, the struct tcp_options_received cookie_plus variable fits
between existing 16-bit and 8-bit variables, requiring no additional
space (taking alignment into consideration).  There are no additions to
tcp_request_sock, and only 1 pointer and 1 flag byte in tcp_sock.

Allocations have been rearranged to avoid requiring GFP_ATOMIC, with
only one unavoidable exception in tcp_create_openreq_child(), where the
tcp_sock itself is created GFP_ATOMIC.

These functions will also be used in subsequent patches that implement
additional features.

Requires:
   TCPCT part 1a: add function parameter for sending SYNACK
   TCPCT part 1b: sysctl_tcp_cookie_size and TCP_COOKIE_TRANSACTIONS
   TCPCT part 1c: redefine TCP header functions *_len_th(), cleanup
---
   include/linux/tcp.h      |   35 +++++++-
   include/net/tcp.h        |   72 ++++++++++++++--
   net/ipv4/syncookies.c    |    5 +-
   net/ipv4/tcp.c           |  133 +++++++++++++++++++++++++++-
   net/ipv4/tcp_input.c     |   82 +++++++++++++++---
   net/ipv4/tcp_ipv4.c      |   62 +++++++++++--
   net/ipv4/tcp_minisocks.c |   43 +++++++---
   net/ipv4/tcp_output.c    |  223 ++++++++++++++++++++++++++++++++++++++++++---
   net/ipv6/syncookies.c    |    5 +-
   net/ipv6/tcp_ipv6.c      |   47 +++++++++-
   10 files changed, 641 insertions(+), 66 deletions(-)



[-- Attachment #2: TCPCT+1-4.patch --]
[-- Type: text/plain, Size: 39341 bytes --]

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index d304ba5..1c9a1d1 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -252,26 +252,36 @@ struct tcp_options_received {
 		sack_ok : 4,	/* SACK seen on SYN packet		*/
 		snd_wscale : 4,	/* Window scaling received from sender	*/
 		rcv_wscale : 4;	/* Window scaling to send to receiver	*/
-/*	SACKs data	*/
+	u8	cookie_plus: 6;	/* bytes in authenticator/cookie option	*/
 	u8	num_sacks;	/* Number of SACK blocks		*/
-	u16	user_mss;  	/* mss requested by user in ioctl */
+	u16	user_mss;	/* mss requested by user in ioctl	*/
 	u16	mss_clamp;	/* Maximal mss, negotiated at connection setup */
 };
 
+static inline void tcp_clear_options(struct tcp_options_received *rx_opt)
+{
+	rx_opt->tstamp_ok = rx_opt->sack_ok = 0;
+	rx_opt->wscale_ok = rx_opt->snd_wscale = 0;
+	rx_opt->cookie_plus = 0;
+}
+
 /* This is the max number of SACKS that we'll generate and process. It's safe
  * to increse this, although since:
  *   size = TCPOLEN_SACK_BASE_ALIGNED (4) + n * TCPOLEN_SACK_PERBLOCK (8)
  * only four options will fit in a standard TCP header */
 #define TCP_NUM_SACKS 4
 
+struct tcp_cookie_values;
+struct tcp_request_sock_ops;
+
 struct tcp_request_sock {
 	struct inet_request_sock 	req;
 #ifdef CONFIG_TCP_MD5SIG
 	/* Only used by TCP MD5 Signature so far. */
 	const struct tcp_request_sock_ops *af_specific;
 #endif
-	u32			 	rcv_isn;
-	u32			 	snt_isn;
+	u32				rcv_isn;
+	u32				snt_isn;
 };
 
 static inline struct tcp_request_sock *tcp_rsk(const struct request_sock *req)
@@ -441,6 +451,19 @@ struct tcp_sock {
 /* TCP MD5 Signature Option information */
 	struct tcp_md5sig_info	*md5sig_info;
 #endif
+
+	/* When the cookie options are generated and exchanged, then this
+	 * object holds a reference to them (cookie_values->kref).  Also
+	 * contains related tcp_cookie_transactions fields.
+	 */
+	struct tcp_cookie_values  	*cookie_values;
+
+	u8				cookie_in_always:1,
+					cookie_out_never:1,
+					extend_timestamp:1,
+					s_data_constant:1,
+					s_data_in:1,
+					s_data_out:1;
 };
 
 static inline struct tcp_sock *tcp_sk(const struct sock *sk)
@@ -459,6 +482,10 @@ struct tcp_timewait_sock {
 	u16			  tw_md5_keylen;
 	u8			  tw_md5_key[TCP_MD5SIG_MAXKEYLEN];
 #endif
+	/* Few sockets in timewait have cookies; in that case, then this
+	 * object holds a reference to it (tw_cookie_values->kref)
+	 */
+	struct tcp_cookie_values  *tw_cookie_values;
 };
 
 static inline struct tcp_timewait_sock *tcp_twsk(const struct sock *sk)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 63d17fd..a2d2c0f 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -30,6 +30,7 @@
 #include <linux/dmaengine.h>
 #include <linux/crypto.h>
 #include <linux/cryptohash.h>
+#include <linux/kref.h>
 
 #include <net/inet_connection_sock.h>
 #include <net/inet_timewait_sock.h>
@@ -167,6 +168,7 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
 #define TCPOPT_SACK             5       /* SACK Block */
 #define TCPOPT_TIMESTAMP	8	/* Better RTT estimations/PAWS */
 #define TCPOPT_MD5SIG		19	/* MD5 Signature (RFC2385) */
+#define TCPOPT_COOKIE		253	/* Cookie extension (experimental) */
 
 /*
  *     TCP option lengths
@@ -177,6 +179,10 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
 #define TCPOLEN_SACK_PERM      2
 #define TCPOLEN_TIMESTAMP      10
 #define TCPOLEN_MD5SIG         18
+#define TCPOLEN_COOKIE_BASE    2	/* Cookie-less header extension */
+#define TCPOLEN_COOKIE_PAIR    3	/* Cookie pair header extension */
+#define TCPOLEN_COOKIE_MAX     (TCPOLEN_COOKIE_BASE+TCP_COOKIE_MAX)
+#define TCPOLEN_COOKIE_MIN     (TCPOLEN_COOKIE_BASE+TCP_COOKIE_MIN)
 
 /* But this is what stacks really send out. */
 #define TCPOLEN_TSTAMP_ALIGNED		12
@@ -344,11 +350,6 @@ static inline void tcp_dec_quickack_mode(struct sock *sk,
 
 extern void tcp_enter_quickack_mode(struct sock *sk);
 
-static inline void tcp_clear_options(struct tcp_options_received *rx_opt)
-{
- 	rx_opt->tstamp_ok = rx_opt->sack_ok = rx_opt->wscale_ok = rx_opt->snd_wscale = 0;
-}
-
 #define	TCP_ECN_OK		1
 #define	TCP_ECN_QUEUE_CWR	2
 #define	TCP_ECN_DEMAND_CWR	4
@@ -410,7 +411,7 @@ extern int			tcp_recvmsg(struct kiocb *iocb, struct sock *sk,
 
 extern void			tcp_parse_options(struct sk_buff *skb,
 						  struct tcp_options_received *opt_rx,
-						  int estab);
+						  u8 **cryptic, int estab);
 
 extern u8			*tcp_parse_md5sig_option(struct tcphdr *th);
 
@@ -1482,6 +1483,65 @@ struct tcp_request_sock_ops {
 #endif
 };
 
+/**
+ * A tcp_sock contains a pointer to the current value, and this is cloned to
+ * the tcp_timewait_sock.
+ *
+ * @cookie_pair:	variable data from the option exchange.
+ *
+ * @cookie_desired:	user specified tcpct_cookie_desired.  Zero
+ *			indicates default (sysctl_tcp_cookie_size).
+ *			After cookie sent, remembers size of cookie.
+ *
+ * @s_data_desired:	user specified tcpct_s_data_desired.  When the
+ *			constant payload is specified (s_data_constant),
+ *			holds its length instead.
+ *
+ * @s_data_payload:	constant data that is to be included in the
+ *			payload of SYN or SYNACK segments when the
+ *			cookie option is present.
+ */
+struct tcp_cookie_values {
+	struct kref	kref;
+	u8		cookie_pair[TCP_COOKIE_PAIR_SIZE];
+	u8		cookie_pair_size;
+	u8		cookie_desired;
+	u16		s_data_desired;
+	u8		s_data_payload[0];
+};
+
+static inline void tcp_cookie_values_release(struct kref *kref)
+{
+	kfree(container_of(kref, struct tcp_cookie_values, kref));
+}
+
+/* The length of constant payload data.  Note that s_data_desired is
+ * overloaded, depending on s_data_constant: either the length of constant
+ * data (returned here) or the limit on variable data.
+ */
+static inline int tcp_s_data_size(const struct tcp_sock *tp)
+{
+	return (NULL != tp->cookie_values && tp->s_data_constant)
+		? tp->cookie_values->s_data_desired
+		: 0;
+}
+
+/* As tcp_request_sock has already been extended in other places, the
+ * only remaining method is to pass stack values along as function
+ * parameters.  These parameters are not needed after sending SYNACK.
+ */
+struct tcp_extend_values {
+	u8				cookie_bakery[TCP_COOKIE_MAX];
+	u8				cookie_plus;
+	u8				cookie_in_always:1,
+					cookie_out_never:1;
+};
+
+static inline struct tcp_extend_values *tcp_xv(const void *extend_values)
+{
+	return (struct tcp_extend_values *)extend_values;
+}
+
 extern void tcp_v4_init(void);
 extern void tcp_init(void);
 
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 5ec678a..cdab491 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -253,6 +253,8 @@ EXPORT_SYMBOL(cookie_check_timestamp);
 struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
 			     struct ip_options *opt)
 {
+	struct tcp_options_received tcp_opt;
+	u8 *cryptic_value;
 	struct inet_request_sock *ireq;
 	struct tcp_request_sock *treq;
 	struct tcp_sock *tp = tcp_sk(sk);
@@ -263,7 +265,6 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
 	int mss;
 	struct rtable *rt;
 	__u8 rcv_wscale;
-	struct tcp_options_received tcp_opt;
 
 	if (!sysctl_tcp_syncookies || !th->ack)
 		goto out;
@@ -278,7 +279,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
 
 	/* check for timestamp cookie support */
 	memset(&tcp_opt, 0, sizeof(tcp_opt));
-	tcp_parse_options(skb, &tcp_opt, 0);
+	tcp_parse_options(skb, &tcp_opt, &cryptic_value, 0);
 
 	if (tcp_opt.saw_tstamp)
 		cookie_check_timestamp(&tcp_opt);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index cf13726..0b47ffe 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2039,8 +2039,8 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 	int val;
 	int err = 0;
 
-	/* This is a string value all the others are int's */
-	if (optname == TCP_CONGESTION) {
+	/* These are data/string values, all the others are ints */
+	if (TCP_CONGESTION == optname) {
 		char name[TCP_CA_NAME_MAX];
 
 		if (optlen < 1)
@@ -2056,6 +2056,95 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 		err = tcp_set_congestion_control(sk, name);
 		release_sock(sk);
 		return err;
+	} else if (TCP_COOKIE_TRANSACTIONS == optname) {
+		struct tcp_cookie_transactions ctd;
+		struct tcp_cookie_values *cvp = NULL;
+
+		if (sizeof(ctd) > optlen) {
+			return -EINVAL;
+		}
+		if (copy_from_user(&ctd, optval, sizeof(ctd))) {
+			return -EFAULT;
+		}
+		if (sizeof(ctd.tcpct_value) < ctd.tcpct_used) {
+			return -EINVAL;
+		}
+		if (0 == ctd.tcpct_cookie_desired) {
+			/* default to global value */
+		} else if ((0x1 & ctd.tcpct_cookie_desired)
+			|| TCP_COOKIE_MAX < ctd.tcpct_cookie_desired
+			|| TCP_COOKIE_MIN > ctd.tcpct_cookie_desired) {
+			return -EINVAL;
+		}
+
+		if (TCP_COOKIE_OUT_NEVER & ctd.tcpct_flags) {
+			/* Supercedes all other values */
+			lock_sock(sk);
+			if (NULL != tp->cookie_values) {
+				kref_put(&tp->cookie_values->kref,
+					 tcp_cookie_values_release);
+				tp->cookie_values = NULL;
+			}
+			tp->cookie_in_always = 0; /* false */
+			tp->cookie_out_never = 1; /* true */
+			tp->extend_timestamp = 0; /* false */
+			tp->s_data_constant = 0; /* false */
+			tp->s_data_in = 0; /* false */
+			tp->s_data_out = 0; /* false */
+			release_sock(sk);
+			return err;
+		}
+
+		/* Allocate ancillary memory before locking.
+		 */
+		if (0 < ctd.tcpct_used
+		 || (NULL == tp->cookie_values
+		  && (0 < sysctl_tcp_cookie_size
+		   || 0 < ctd.tcpct_cookie_desired
+		   || 0 < ctd.tcpct_s_data_desired))) {
+			cvp = kmalloc(sizeof(*cvp) + ctd.tcpct_used,
+				      GFP_KERNEL);
+			if (NULL == cvp) {
+				return -ENOMEM;
+			}
+		}
+
+		lock_sock(sk);
+		tp->cookie_in_always = (TCP_COOKIE_IN_ALWAYS & ctd.tcpct_flags);
+		tp->cookie_out_never = 0; /* false */
+		tp->extend_timestamp = (TCP_EXTEND_TIMESTAMP & ctd.tcpct_flags);
+		tp->s_data_in = 0; /* false */
+		tp->s_data_out = 0; /* false */
+
+		if (NULL == cvp) {
+			/* No cookies by default. */
+			tp->s_data_constant = 0; /* false */
+		} else if (0 == ctd.tcpct_used) {
+			/* No constant payload data. */
+			cvp->cookie_desired = ctd.tcpct_cookie_desired;
+			cvp->s_data_desired = ctd.tcpct_s_data_desired;
+			tp->cookie_values = cvp;
+			tp->s_data_constant = 0; /* false */
+		} else {
+			/* Changes in values are recorded by a change in
+			 * pointer, ensuring that the cookie will differ,
+			 * without separately hashing each value later.
+			 */
+			if (unlikely(NULL != tp->cookie_values)) {
+				kref_put(&tp->cookie_values->kref,
+					 tcp_cookie_values_release);
+			}
+			kref_init(&cvp->kref);
+			memcpy(cvp->s_data_payload, ctd.tcpct_value,
+			       ctd.tcpct_used);
+			cvp->cookie_desired = ctd.tcpct_cookie_desired;
+			cvp->s_data_desired = ctd.tcpct_used;
+			tp->cookie_values = cvp;
+			tp->s_data_constant = 1; /* true */
+		}
+
+		release_sock(sk);
+		return err;
 	}
 
 	if (optlen < sizeof(int))
@@ -2387,6 +2476,46 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
 		if (copy_to_user(optval, icsk->icsk_ca_ops->name, len))
 			return -EFAULT;
 		return 0;
+
+	case TCP_COOKIE_TRANSACTIONS: {
+		struct tcp_cookie_transactions ctd;
+		struct tcp_cookie_values *cvp = tp->cookie_values;
+
+		if (get_user(len, optlen))
+			return -EFAULT;
+		if (len < sizeof(ctd))
+			return -EINVAL;
+
+		memset(&ctd, 0, sizeof(ctd));
+		ctd.tcpct_flags =
+			  (tp->cookie_in_always ? TCP_COOKIE_IN_ALWAYS : 0)
+			| (tp->cookie_out_never ? TCP_COOKIE_OUT_NEVER : 0)
+			| (tp->extend_timestamp ? TCP_EXTEND_TIMESTAMP : 0)
+			| (tp->s_data_in ? TCP_S_DATA_IN : 0)
+			| (tp->s_data_out ? TCP_S_DATA_OUT : 0);
+
+		if (NULL != cvp) {
+			/* Cookie(s) saved, return as nonce */
+			if (sizeof(ctd.tcpct_value) < cvp->cookie_pair_size) {
+				/* impossible? */
+				return -EINVAL;
+			}
+			memcpy(&ctd.tcpct_value[0], &cvp->cookie_pair[0],
+			       cvp->cookie_pair_size);
+			ctd.tcpct_used = cvp->cookie_pair_size;
+
+			ctd.tcpct_cookie_desired = cvp->cookie_desired;
+			ctd.tcpct_s_data_desired = cvp->s_data_desired;
+		}
+
+		if (copy_to_user(optval, &ctd, sizeof(ctd))) {
+			return -EFAULT;
+		}
+		if (put_user(sizeof(ctd), optlen)) {
+			return -EFAULT;
+		}
+		return 0;
+	}
 	default:
 		return -ENOPROTOOPT;
 	}
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index d86784b..200afa8 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3698,11 +3698,11 @@ old_ack:
  * the fast version below fails.
  */
 void tcp_parse_options(struct sk_buff *skb, struct tcp_options_received *opt_rx,
-		       int estab)
+		       u8 **cryptic, int estab)
 {
 	unsigned char *ptr;
 	struct tcphdr *th = tcp_hdr(skb);
-	int length = (th->doff * 4) - sizeof(struct tcphdr);
+	int length = tcp_option_len_th(th);
 
 	ptr = (unsigned char *)(th + 1);
 	opt_rx->saw_tstamp = 0;
@@ -3782,6 +3782,19 @@ void tcp_parse_options(struct sk_buff *skb, struct tcp_options_received *opt_rx,
 				 */
 				break;
 #endif
+			case TCPOPT_COOKIE:
+				/* This option carries 3 different lengths.
+				 */
+				if (TCPOLEN_COOKIE_MAX >= opsize
+				 && TCPOLEN_COOKIE_MIN <= opsize) {
+					opt_rx->cookie_plus = opsize;
+					*cryptic = ptr;
+				} else if (TCPOLEN_COOKIE_PAIR == opsize) {
+					/* not yet implemented */
+				} else if (TCPOLEN_COOKIE_BASE == opsize) {
+					/* not yet implemented */
+				}
+				break;
 			}
 
 			ptr += opsize-2;
@@ -3810,17 +3823,21 @@ static int tcp_parse_aligned_timestamp(struct tcp_sock *tp, struct tcphdr *th)
  * If it is wrong it falls back on tcp_parse_options().
  */
 static int tcp_fast_parse_options(struct sk_buff *skb, struct tcphdr *th,
-				  struct tcp_sock *tp)
+				  struct tcp_sock *tp, u8 **cryptic)
 {
-	if (th->doff == sizeof(struct tcphdr) >> 2) {
+	/* In the spirit of fast parsing, compare doff directly to shifted
+	 * constant values.  Because equality is used, short doff can be
+	 * ignored here, and checked later.
+	 */
+	if ((sizeof(*th) >> 2) == th->doff) {
 		tp->rx_opt.saw_tstamp = 0;
 		return 0;
 	} else if (tp->rx_opt.tstamp_ok &&
-		   th->doff == (sizeof(struct tcphdr)>>2)+(TCPOLEN_TSTAMP_ALIGNED>>2)) {
+		   ((sizeof(*th)+TCPOLEN_TSTAMP_ALIGNED)>>2) == th->doff) {
 		if (tcp_parse_aligned_timestamp(tp, th))
 			return 1;
 	}
-	tcp_parse_options(skb, &tp->rx_opt, 1);
+	tcp_parse_options(skb, &tp->rx_opt, cryptic, 1);
 	return 1;
 }
 
@@ -3830,7 +3847,7 @@ static int tcp_fast_parse_options(struct sk_buff *skb, struct tcphdr *th,
  */
 u8 *tcp_parse_md5sig_option(struct tcphdr *th)
 {
-	int length = (th->doff << 2) - sizeof (*th);
+	int length = tcp_option_len_th(th);
 	u8 *ptr = (u8*)(th + 1);
 
 	/* If the TCP option is too short, we can short cut */
@@ -5070,10 +5087,11 @@ out:
 static int tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
 			      struct tcphdr *th, int syn_inerr)
 {
+	u8 *cv;
 	struct tcp_sock *tp = tcp_sk(sk);
 
 	/* RFC1323: H1. Apply PAWS check first. */
-	if (tcp_fast_parse_options(skb, th, tp) && tp->rx_opt.saw_tstamp &&
+	if (tcp_fast_parse_options(skb, th, tp, &cv) && tp->rx_opt.saw_tstamp &&
 	    tcp_paws_discard(sk, skb)) {
 		if (!th->rst) {
 			NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSESTABREJECTED);
@@ -5361,11 +5379,14 @@ discard:
 static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 					 struct tcphdr *th, unsigned len)
 {
-	struct tcp_sock *tp = tcp_sk(sk);
+	u8 *cryptic_value;
 	struct inet_connection_sock *icsk = inet_csk(sk);
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct tcp_cookie_values *cvp = tp->cookie_values;
 	int saved_clamp = tp->rx_opt.mss_clamp;
+	int queued = 0;
 
-	tcp_parse_options(skb, &tp->rx_opt, 0);
+	tcp_parse_options(skb, &tp->rx_opt, &cryptic_value, 0);
 
 	if (th->ack) {
 		/* rfc793:
@@ -5462,6 +5483,42 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 		 * Change state from SYN-SENT only after copied_seq
 		 * is initialized. */
 		tp->copied_seq = tp->rcv_nxt;
+
+		if (NULL != cvp
+		 && 0 < cvp->cookie_pair_size
+		 && 0 < tp->rx_opt.cookie_plus) {
+			int cookie_size = tp->rx_opt.cookie_plus - TCPOLEN_COOKIE_BASE;
+			int cookie_pair_size = cvp->cookie_desired + cookie_size;
+
+			/* A cookie extension option was sent and returned.
+			 * Note that each incoming SYNACK replaces the
+			 * Responder cookie.  The initial exchange is most
+			 * fragile, as protection against spoofing relies
+			 * entirely upon the sequence and timestamp (above).
+			 * This replacement strategy allows the correct pair to
+			 * pass through, while any others will be filtered via
+			 * Responder verification later.
+			 */
+			if (sizeof(cvp->cookie_pair) >= cookie_pair_size) {
+				memcpy(&cvp->cookie_pair[cvp->cookie_desired],
+				       cryptic_value, cookie_size);
+				cvp->cookie_pair_size = cookie_pair_size;
+			}
+
+			if (tcp_header_len_th(th) < skb->len) {
+				/* Queue incoming transaction data. */
+				__skb_pull(skb, tcp_header_len_th(th));
+				__skb_queue_tail(&sk->sk_receive_queue, skb);
+				skb_set_owner_r(skb, sk);
+				sk->sk_data_ready(sk, 0);
+				tp->s_data_in = 1; /* true */
+				queued = 1; /* should be amount? */
+				tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
+				tp->rcv_wup = TCP_SKB_CB(skb)->end_seq;
+				tp->copied_seq = TCP_SKB_CB(skb)->seq + 1;
+			}
+		}
+
 		smp_mb();
 		tcp_set_state(sk, TCP_ESTABLISHED);
 
@@ -5513,11 +5570,14 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 						  TCP_DELACK_MAX, TCP_RTO_MAX);
 
 discard:
-			__kfree_skb(skb);
+			if (0 == queued)
+				__kfree_skb(skb);
 			return 0;
 		} else {
 			tcp_send_ack(sk);
 		}
+		if (0 < queued)
+			return 0; /* amount queued? */
 		return -1;
 	}
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 2d25bd4..7d5fd4d 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -217,7 +217,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	if (inet->opt)
 		inet_csk(sk)->icsk_ext_hdr_len = inet->opt->optlen;
 
-	tp->rx_opt.mss_clamp = 536;
+	tp->rx_opt.mss_clamp = TCP_MIN_RCVMSS;
 
 	/* Socket identity is still unknown (sport may be zero).
 	 * However we set state to SYN-SENT and not releasing socket
@@ -1211,9 +1211,12 @@ static struct timewait_sock_ops tcp_timewait_sock_ops = {
 
 int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 {
-	struct inet_request_sock *ireq;
+	struct tcp_extend_values tmp_ext;
 	struct tcp_options_received tmp_opt;
+	u8 *cryptic_value;
+	struct inet_request_sock *ireq;
 	struct request_sock *req;
+	struct tcp_sock *tp = tcp_sk(sk);
 	__be32 saddr = ip_hdr(skb)->saddr;
 	__be32 daddr = ip_hdr(skb)->daddr;
 	__u32 isn = TCP_SKB_CB(skb)->when;
@@ -1258,16 +1261,37 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 #endif
 
 	tcp_clear_options(&tmp_opt);
-	tmp_opt.mss_clamp = 536;
-	tmp_opt.user_mss  = tcp_sk(sk)->rx_opt.user_mss;
+	tmp_opt.mss_clamp = TCP_MIN_RCVMSS;
+	tmp_opt.user_mss  = tp->rx_opt.user_mss;
 
-	tcp_parse_options(skb, &tmp_opt, 0);
+	tcp_parse_options(skb, &tmp_opt, &cryptic_value, 0);
+
+	if (0 < tmp_opt.cookie_plus
+	 && tmp_opt.saw_tstamp
+	 && !tp->cookie_out_never
+	 && (0 < sysctl_tcp_cookie_size
+	  || (NULL != tp->cookie_values
+	   && 0 < tp->cookie_values->cookie_desired))) {
+#ifdef CONFIG_SYN_COOKIES
+		want_cookie = 0;	/* not our kind of cookie */
+#endif
+		tmp_ext.cookie_out_never = 0; /* false */
+		tmp_ext.cookie_plus = tmp_opt.cookie_plus;
+
+		/* secret recipe not yet implemented */
+	} else if (!tp->cookie_in_always) {
+		/* redundant indications, but ensure initialization. */
+		tmp_ext.cookie_out_never = 1; /* true */
+		tmp_ext.cookie_plus = 0;
+	} else {
+		goto drop_and_free;
+	}
+	tmp_ext.cookie_in_always = tp->cookie_in_always;
 
 	if (want_cookie && !tmp_opt.saw_tstamp)
 		tcp_clear_options(&tmp_opt);
 
 	tmp_opt.tstamp_ok = tmp_opt.saw_tstamp;
-
 	tcp_openreq_init(req, &tmp_opt, skb);
 
 	ireq = inet_rsk(req);
@@ -1334,7 +1358,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 	}
 	tcp_rsk(req)->snt_isn = isn;
 
-	if (__tcp_v4_send_synack(sk, req, dst, NULL) ||
+	if (__tcp_v4_send_synack(sk, req, dst, (void *)&tmp_ext) ||
 	    want_cookie)
 		goto drop_and_free;
 
@@ -1812,7 +1836,7 @@ static int tcp_v4_init_sock(struct sock *sk)
 	 */
 	tp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
 	tp->snd_cwnd_clamp = ~0;
-	tp->mss_cache = 536;
+	tp->mss_cache = TCP_MIN_RCVMSS;
 
 	tp->reordering = sysctl_tcp_reordering;
 	icsk->icsk_ca_ops = &tcp_init_congestion_ops;
@@ -1828,6 +1852,19 @@ static int tcp_v4_init_sock(struct sock *sk)
 	tp->af_specific = &tcp_sock_ipv4_specific;
 #endif
 
+	/* TCP Cookie Transactions */
+	if (0 < sysctl_tcp_cookie_size) {
+		/* Default, cookies without s_data. */
+		tp->cookie_values =
+			kzalloc(sizeof(*tp->cookie_values), sk->sk_allocation);
+		if (NULL != tp->cookie_values) {
+			kref_init(&tp->cookie_values->kref);
+		}
+	}
+	/* Presumed zeroed, in order of appearance:
+	 *	cookie_in_always, cookie_out_never, extend_timestamp,
+	 *	s_data_constant, s_data_in, s_data_out
+	 */
 	sk->sk_sndbuf = sysctl_tcp_wmem[1];
 	sk->sk_rcvbuf = sysctl_tcp_rmem[1];
 
@@ -1881,6 +1918,15 @@ void tcp_v4_destroy_sock(struct sock *sk)
 		sk->sk_sndmsg_page = NULL;
 	}
 
+	/*
+	 * If cookie or s_data exists, remove it.
+	 */
+	if (NULL != tp->cookie_values) {
+		kref_put(&tp->cookie_values->kref,
+			 tcp_cookie_values_release);
+		tp->cookie_values = NULL;
+	}
+
 	percpu_counter_dec(&tcp_sockets_allocated);
 }
 
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 8819882..0d33f5c 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -96,13 +96,14 @@ enum tcp_tw_status
 tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
 			   const struct tcphdr *th)
 {
-	struct tcp_timewait_sock *tcptw = tcp_twsk((struct sock *)tw);
 	struct tcp_options_received tmp_opt;
+	u8 *cryptic_value;
+	struct tcp_timewait_sock *tcptw = tcp_twsk((struct sock *)tw);
 	int paws_reject = 0;
 
 	tmp_opt.saw_tstamp = 0;
 	if (th->doff > (sizeof(*th) >> 2) && tcptw->tw_ts_recent_stamp) {
-		tcp_parse_options(skb, &tmp_opt, 0);
+		tcp_parse_options(skb, &tmp_opt, &cryptic_value, 0);
 
 		if (tmp_opt.saw_tstamp) {
 			tmp_opt.ts_recent	= tcptw->tw_ts_recent;
@@ -394,9 +395,12 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
 		/* Now setup tcp_sock */
 		newtp = tcp_sk(newsk);
 		newtp->pred_flags = 0;
-		newtp->rcv_wup = newtp->copied_seq = newtp->rcv_nxt = treq->rcv_isn + 1;
-		newtp->snd_sml = newtp->snd_una = newtp->snd_nxt = treq->snt_isn + 1;
-		newtp->snd_up = treq->snt_isn + 1;
+
+		newtp->rcv_wup = newtp->copied_seq =
+		newtp->rcv_nxt = treq->rcv_isn + 1;
+
+		newtp->snd_sml = newtp->snd_una = newtp->snd_nxt =
+		newtp->snd_up = treq->snt_isn + 1 + tcp_s_data_size(tcp_sk(sk));
 
 		tcp_prequeue_init(newtp);
 
@@ -429,9 +433,24 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
 		tcp_set_ca_state(newsk, TCP_CA_Open);
 		tcp_init_xmit_timers(newsk);
 		skb_queue_head_init(&newtp->out_of_order_queue);
-		newtp->write_seq = treq->snt_isn + 1;
-		newtp->pushed_seq = newtp->write_seq;
+		newtp->write_seq = newtp->pushed_seq =
+			treq->snt_isn + 1 + tcp_s_data_size(tcp_sk(sk));
 
+		/* TCP Cookie Transactions */
+		if (NULL != tcp_sk(sk)->cookie_values) {
+			/* Instead of reusing the original, replace with
+			 * default, cookies without s_data.
+			 */
+			newtp->cookie_values =
+				kzalloc(sizeof(*newtp->cookie_values), GFP_ATOMIC);
+			if (NULL != newtp->cookie_values) {
+				kref_init(&newtp->cookie_values->kref);
+			}
+		}
+		/* Presumed copied, in order of appearance:
+		 *	cookie_in_always, cookie_out_never, extend_timestamp,
+		 *	s_data_constant, s_data_in, s_data_out
+		 */
 		newtp->rx_opt.saw_tstamp = 0;
 
 		newtp->rx_opt.dsack = 0;
@@ -495,15 +514,16 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 			   struct request_sock *req,
 			   struct request_sock **prev)
 {
+	struct tcp_options_received tmp_opt;
+	u8 *cryptic_value;
 	const struct tcphdr *th = tcp_hdr(skb);
 	__be32 flg = tcp_flag_word(th) & (TCP_FLAG_RST|TCP_FLAG_SYN|TCP_FLAG_ACK);
 	int paws_reject = 0;
-	struct tcp_options_received tmp_opt;
 	struct sock *child;
 
 	tmp_opt.saw_tstamp = 0;
-	if (th->doff > (sizeof(struct tcphdr)>>2)) {
-		tcp_parse_options(skb, &tmp_opt, 0);
+	if (th->doff > (sizeof(*th) >> 2)) {
+		tcp_parse_options(skb, &tmp_opt, &cryptic_value, 0);
 
 		if (tmp_opt.saw_tstamp) {
 			tmp_opt.ts_recent = req->ts_recent;
@@ -596,7 +616,8 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 	 * Invalid ACK: reset will be sent by listening socket
 	 */
 	if ((flg & TCP_FLAG_ACK) &&
-	    (TCP_SKB_CB(skb)->ack_seq != tcp_rsk(req)->snt_isn + 1))
+	    (TCP_SKB_CB(skb)->ack_seq != tcp_rsk(req)->snt_isn + 1 +
+					 tcp_s_data_size(tcp_sk(sk))))
 		return sk;
 
 	/* Also, it would be not so bad idea to check rcv_tsecr, which
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index c235196..0a04684 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -370,6 +370,7 @@ static inline int tcp_urg_mode(const struct tcp_sock *tp)
 #define OPTION_TS		(1 << 1)
 #define OPTION_MD5		(1 << 2)
 #define OPTION_WSCALE		(1 << 3)
+#define OPTION_COOKIE_EXTENSION	(1 << 4)
 
 struct tcp_out_options {
 	u8 options;		/* bit field of OPTION_* */
@@ -377,8 +378,35 @@ struct tcp_out_options {
 	u8 num_sack_blocks;	/* number of SACK blocks to include */
 	u16 mss;		/* 0 to disable */
 	__u32 tsval, tsecr;	/* need to include OPTION_TS */
+	u8	*cookie_copy;	/* temporary pointer */
+	u8	cookie_size;	/* bytes in copy */
 };
 
+/* The sysctl int routines are generic, so check consistency here.
+ */
+static u8 tcp_cookie_size_check(u8 desired)
+{
+	if (0 < desired) {
+		/* previously specified */
+		return desired;
+	}
+	if (0 >= sysctl_tcp_cookie_size) {
+		/* no default specified */
+		return 0;
+	}
+	if (TCP_COOKIE_MIN > sysctl_tcp_cookie_size) {
+		return TCP_COOKIE_MIN;
+	}
+	if (TCP_COOKIE_MAX < sysctl_tcp_cookie_size) {
+		return TCP_COOKIE_MAX;
+	}
+	if (0x1 & sysctl_tcp_cookie_size) {
+		/* 8-bit multiple, illegal, fix it */
+		return (u8)(sysctl_tcp_cookie_size + 0x1);
+	}
+	return (u8)sysctl_tcp_cookie_size;
+}
+
 /* Write previously computed TCP options to the packet.
  *
  * Beware: Something in the Internet is very sensitive to the ordering of
@@ -395,11 +423,22 @@ struct tcp_out_options {
 static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
 			      const struct tcp_out_options *opts,
 			      __u8 **md5_hash) {
-	if (unlikely(OPTION_MD5 & opts->options)) {
-		*ptr++ = htonl((TCPOPT_NOP << 24) |
-			       (TCPOPT_NOP << 16) |
-			       (TCPOPT_MD5SIG << 8) |
-			       TCPOLEN_MD5SIG);
+	u8 options = opts->options;	/* mungable copy */
+
+	if (unlikely(OPTION_MD5 & options)) {
+		if (unlikely(OPTION_COOKIE_EXTENSION & options)) {
+			*ptr++ = htonl((TCPOPT_COOKIE << 24) |
+				       (TCPOLEN_COOKIE_BASE << 16) |
+				       (TCPOPT_MD5SIG << 8) |
+				       TCPOLEN_MD5SIG);
+		} else {
+			*ptr++ = htonl((TCPOPT_NOP << 24) |
+				       (TCPOPT_NOP << 16) |
+				       (TCPOPT_MD5SIG << 8) |
+				       TCPOLEN_MD5SIG);
+		}
+		/* larger cookies are incompatible */
+		options &= ~OPTION_COOKIE_EXTENSION;
 		*md5_hash = (__u8 *)ptr;
 		ptr += 4;
 	} else {
@@ -412,12 +451,13 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
 			       opts->mss);
 	}
 
-	if (likely(OPTION_TS & opts->options)) {
-		if (unlikely(OPTION_SACK_ADVERTISE & opts->options)) {
+	if (likely(OPTION_TS & options)) {
+		if (unlikely(OPTION_SACK_ADVERTISE & options)) {
 			*ptr++ = htonl((TCPOPT_SACK_PERM << 24) |
 				       (TCPOLEN_SACK_PERM << 16) |
 				       (TCPOPT_TIMESTAMP << 8) |
 				       TCPOLEN_TIMESTAMP);
+			options &= ~OPTION_SACK_ADVERTISE;
 		} else {
 			*ptr++ = htonl((TCPOPT_NOP << 24) |
 				       (TCPOPT_NOP << 16) |
@@ -428,15 +468,48 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
 		*ptr++ = htonl(opts->tsecr);
 	}
 
-	if (unlikely(OPTION_SACK_ADVERTISE & opts->options &&
-		     !(OPTION_TS & opts->options))) {
+	/* Specification requires after timestamp, so do it now.
+	 */
+	if (unlikely(OPTION_COOKIE_EXTENSION & options)) {
+		u8 *cookie_copy = opts->cookie_copy;
+		u8 cookie_size = opts->cookie_size;
+
+		if (unlikely(0x1 & cookie_size)) {
+			/* 8-bit multiple, illegal, ignore */
+			cookie_size = 0;
+		} else if (likely(0x2 & cookie_size)) {
+			__u8 *p = (__u8 *)ptr;
+
+			/* 16-bit multiple */
+			*p++ = TCPOPT_COOKIE;
+			*p++ = TCPOLEN_COOKIE_BASE + cookie_size;
+			*p++ = *cookie_copy++;
+			*p++ = *cookie_copy++;
+			ptr++;
+			cookie_size -= 2;
+		} else {
+			/* 32-bit multiple */
+			*ptr++ = htonl(((TCPOPT_NOP << 24) |
+				        (TCPOPT_NOP << 16) |
+				        (TCPOPT_COOKIE << 8) |
+				        TCPOLEN_COOKIE_BASE) +
+				       cookie_size);
+		}
+
+		if (0 < cookie_size) {
+			memcpy(ptr, cookie_copy, cookie_size);
+			ptr += (cookie_size >> 2);
+		}
+	}
+
+	if (unlikely(OPTION_SACK_ADVERTISE & options)) {
 		*ptr++ = htonl((TCPOPT_NOP << 24) |
 			       (TCPOPT_NOP << 16) |
 			       (TCPOPT_SACK_PERM << 8) |
 			       TCPOLEN_SACK_PERM);
 	}
 
-	if (unlikely(OPTION_WSCALE & opts->options)) {
+	if (unlikely(OPTION_WSCALE & options)) {
 		*ptr++ = htonl((TCPOPT_NOP << 24) |
 			       (TCPOPT_WINDOW << 16) |
 			       (TCPOLEN_WINDOW << 8) |
@@ -471,11 +544,18 @@ static unsigned tcp_syn_options(struct sock *sk, struct sk_buff *skb,
 				struct tcp_out_options *opts,
 				struct tcp_md5sig_key **md5) {
 	struct tcp_sock *tp = tcp_sk(sk);
+	struct tcp_cookie_values *cvp = tp->cookie_values;
 	unsigned size = 0;
+	u8 cookie_size = (!tp->cookie_out_never && NULL != cvp)
+			 ? tcp_cookie_size_check(cvp->cookie_desired)
+			 : 0;
 
 #ifdef CONFIG_TCP_MD5SIG
 	*md5 = tp->af_specific->md5_lookup(sk, sk);
 	if (*md5) {
+		if (0 < cookie_size) {
+			opts->options |= OPTION_COOKIE_EXTENSION;
+		}
 		opts->options |= OPTION_MD5;
 		size += TCPOLEN_MD5SIG_ALIGNED;
 	}
@@ -512,6 +592,63 @@ static unsigned tcp_syn_options(struct sock *sk, struct sk_buff *skb,
 			size += TCPOLEN_SACKPERM_ALIGNED;
 	}
 
+	/* Having both authentication and cookies for security is redundant,
+	 * and there's certainly not enough room.  Instead, the cookie-less
+	 * variant is proposed above.
+	 *
+	 * Consider the pessimal case with authentication.  The options
+	 * could look like:
+	 *   COOKIE|MD5(20) + MSS(4) + WSCALE(4) + SACK|TS(12) == 40
+	 *
+	 * (Currently, the timestamps && *MD5 test above prevents this.)
+	 *
+	 * Note that timestamps are required by the specification.
+	 *
+	 * Odd numbers of bytes are prohibited by the specification, ensuring
+	 * that the cookie is 16-bit aligned, and the resulting cookie pair is
+	 * 32-bit aligned.
+	 */
+	if (NULL == *md5
+	 && (OPTION_TS & opts->options)
+	 && 0 < cookie_size) {
+		int need = TCPOLEN_COOKIE_BASE + cookie_size;
+		int remaining = MAX_TCP_OPTION_SPACE - size;
+
+		if (0x2 & need) {
+			/* 32-bit multiple */
+			need += 2; /* NOPs */
+
+			if (need > remaining) {
+				/* try shrinking cookie to fit */
+				cookie_size -= 2;
+				need -= 4;
+			}
+		}
+		while (need > remaining && TCP_COOKIE_MIN <= cookie_size) {
+			cookie_size -= 4;
+			need -= 4;
+		}
+		if (TCP_COOKIE_MIN <= cookie_size) {
+			opts->options |= OPTION_COOKIE_EXTENSION;
+			opts->cookie_copy = &cvp->cookie_pair[0];
+			opts->cookie_size = cookie_size;
+
+			/* Remember for future incarnations. */
+			cvp->cookie_desired = cookie_size;
+
+			if (cvp->cookie_desired != cvp->cookie_pair_size) {
+				/* Currently use random bytes as a nonce,
+				 * assuming these are completely unpredictable
+				 * by hostile users of the same system.
+				 */
+				get_random_bytes(opts->cookie_copy,
+						 cookie_size);
+				cvp->cookie_pair_size = cookie_size;
+			}
+
+			size += need;
+		}
+	}
 	return size;
 }
 
@@ -520,14 +657,22 @@ static unsigned tcp_synack_options(struct sock *sk,
 				   struct request_sock *req,
 				   unsigned mss, struct sk_buff *skb,
 				   struct tcp_out_options *opts,
-				   struct tcp_md5sig_key **md5) {
-	unsigned size = 0;
+				   struct tcp_md5sig_key **md5,
+				   struct tcp_extend_values *xvp)
+{
 	struct inet_request_sock *ireq = inet_rsk(req);
+	unsigned size = 0;
+	u8 cookie_plus = (NULL != xvp && !xvp->cookie_out_never)
+			 ? xvp->cookie_plus
+			 : 0;
 	char doing_ts;
 
 #ifdef CONFIG_TCP_MD5SIG
 	*md5 = tcp_rsk(req)->af_specific->md5_lookup(sk, req);
 	if (*md5) {
+		if (0 < cookie_plus) {
+			opts->options |= OPTION_COOKIE_EXTENSION;
+		}
 		opts->options |= OPTION_MD5;
 		size += TCPOLEN_MD5SIG_ALIGNED;
 	}
@@ -561,6 +706,34 @@ static unsigned tcp_synack_options(struct sock *sk,
 			size += TCPOLEN_SACKPERM_ALIGNED;
 	}
 
+	/* Similar rationale to tcp_syn_options() applies here, too.
+	 * If the <SYN> options fit, the same options should fit now!
+	 */
+	if (NULL == *md5
+	 && doing_ts
+	 && 0 < cookie_plus) {
+		int need = cookie_plus; /* has TCPOLEN_COOKIE_BASE */
+		int remaining = MAX_TCP_OPTION_SPACE - size;
+
+		if (0x2 & need) {
+			/* 32-bit multiple */
+			need += 2; /* NOPs */
+		}
+		if (need <= remaining) {
+			opts->options |= OPTION_COOKIE_EXTENSION;
+			opts->cookie_copy = &xvp->cookie_bakery[0];
+			opts->cookie_size = cookie_plus - TCPOLEN_COOKIE_BASE;
+
+			/* secret recipe not yet implemented */
+			get_random_bytes(opts->cookie_copy,
+					 opts->cookie_size);
+
+			size += need;
+		} else {
+			/* There's no error return, so flag it. */
+			xvp->cookie_out_never = 1; /* true */
+		}
+	}
 	return size;
 }
 
@@ -2229,14 +2402,15 @@ int tcp_send_synack(struct sock *sk)
 struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
 				struct request_sock *req, void *extend_values)
 {
+	struct tcp_out_options opts;
+	struct tcp_extend_values *xvp = tcp_xv(extend_values);
 	struct inet_request_sock *ireq = inet_rsk(req);
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct tcphdr *th;
-	int tcp_header_size;
-	struct tcp_out_options opts;
 	struct sk_buff *skb;
 	struct tcp_md5sig_key *md5;
 	__u8 *md5_hash_location;
+	int tcp_header_size;
 	int mss;
 
 	skb = sock_wmalloc(sk, MAX_TCP_HEADER + 15, 1, GFP_ATOMIC);
@@ -2274,7 +2448,7 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
 #endif
 	TCP_SKB_CB(skb)->when = tcp_time_stamp;
 	tcp_header_size = tcp_synack_options(sk, req, mss,
-					     skb, &opts, &md5) +
+					     skb, &opts, &md5, xvp) +
 			  sizeof(struct tcphdr);
 
 	skb_push(skb, tcp_header_size);
@@ -2292,6 +2466,25 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
 	 */
 	tcp_init_nondata_skb(skb, tcp_rsk(req)->snt_isn,
 			     TCPCB_FLAG_SYN | TCPCB_FLAG_ACK);
+
+	/* If cookies are active, and constant data is available, copy it
+	 * directly from the listening socket.
+	 */
+	if (NULL != xvp
+	 && !xvp->cookie_out_never
+	 && 0 < xvp->cookie_plus
+	 && tp->s_data_constant) {
+		const struct tcp_cookie_values *cvp = tp->cookie_values;
+
+		if (NULL != cvp
+		 && 0 < cvp->s_data_desired) {
+			u8 *buf = skb_put(skb, cvp->s_data_desired);
+
+			memcpy(buf, cvp->s_data_payload, cvp->s_data_desired);
+			TCP_SKB_CB(skb)->end_seq += cvp->s_data_desired;
+		}
+	}
+
 	th->seq = htonl(TCP_SKB_CB(skb)->seq);
 	th->ack_seq = htonl(tcp_rsk(req)->rcv_isn + 1);
 
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index cbe55e5..2839349 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -159,6 +159,8 @@ static inline int cookie_check(struct sk_buff *skb, __u32 cookie)
 
 struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 {
+	struct tcp_options_received tcp_opt;
+	u8 *cryptic_value;
 	struct inet_request_sock *ireq;
 	struct inet6_request_sock *ireq6;
 	struct tcp_request_sock *treq;
@@ -171,7 +173,6 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 	int mss;
 	struct dst_entry *dst;
 	__u8 rcv_wscale;
-	struct tcp_options_received tcp_opt;
 
 	if (!sysctl_tcp_syncookies || !th->ack)
 		goto out;
@@ -186,7 +187,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 
 	/* check for timestamp cookie support */
 	memset(&tcp_opt, 0, sizeof(tcp_opt));
-	tcp_parse_options(skb, &tcp_opt, 0);
+	tcp_parse_options(skb, &tcp_opt, &cryptic_value, 0);
 
 	if (tcp_opt.saw_tstamp)
 		cookie_check_timestamp(&tcp_opt);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 3b3d7b3..1320825 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1161,11 +1161,13 @@ static struct sock *tcp_v6_hnd_req(struct sock *sk,struct sk_buff *skb)
  */
 static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 {
+	struct tcp_extend_values tmp_ext;
+	struct tcp_options_received tmp_opt;
+	u8 *cryptic_value;
 	struct inet6_request_sock *treq;
 	struct ipv6_pinfo *np = inet6_sk(sk);
-	struct tcp_options_received tmp_opt;
-	struct tcp_sock *tp = tcp_sk(sk);
 	struct request_sock *req = NULL;
+	struct tcp_sock *tp = tcp_sk(sk);
 	__u32 isn = TCP_SKB_CB(skb)->when;
 #ifdef CONFIG_SYN_COOKIES
 	int want_cookie = 0;
@@ -1205,7 +1207,29 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 	tmp_opt.mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) - sizeof(struct ipv6hdr);
 	tmp_opt.user_mss = tp->rx_opt.user_mss;
 
-	tcp_parse_options(skb, &tmp_opt, 0);
+	tcp_parse_options(skb, &tmp_opt, &cryptic_value, 0);
+
+	if (0 < tmp_opt.cookie_plus
+	 && tmp_opt.saw_tstamp
+	 && !tp->cookie_out_never
+	 && (0 < sysctl_tcp_cookie_size
+	  || (NULL != tp->cookie_values
+	   && 0 < tp->cookie_values->cookie_desired))) {
+#ifdef CONFIG_SYN_COOKIES
+		want_cookie = 0;	/* not our kind of cookie */
+#endif
+		tmp_ext.cookie_out_never = 0; /* false */
+		tmp_ext.cookie_plus = tmp_opt.cookie_plus;
+
+		/* secret recipe not yet implemented */
+	} else if (!tp->cookie_in_always) {
+		/* redundant indications, but ensure initialization. */
+		tmp_ext.cookie_out_never = 1; /* true */
+		tmp_ext.cookie_plus = 0;
+	} else {
+		goto drop;
+	}
+	tmp_ext.cookie_in_always = tp->cookie_in_always;
 
 	if (want_cookie && !tmp_opt.saw_tstamp)
 		tcp_clear_options(&tmp_opt);
@@ -1243,7 +1267,7 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 
 	security_inet_conn_request(sk, skb, req);
 
-	if (tcp_v6_send_synack(sk, req, NULL) ||
+	if (tcp_v6_send_synack(sk, req, (void *)&tmp_ext) ||
 	    want_cookie)
 		goto drop;
 
@@ -1848,7 +1872,7 @@ static int tcp_v6_init_sock(struct sock *sk)
 	 */
 	tp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
 	tp->snd_cwnd_clamp = ~0;
-	tp->mss_cache = 536;
+	tp->mss_cache = TCP_MIN_RCVMSS;
 
 	tp->reordering = sysctl_tcp_reordering;
 
@@ -1864,6 +1888,19 @@ static int tcp_v6_init_sock(struct sock *sk)
 	tp->af_specific = &tcp_sock_ipv6_specific;
 #endif
 
+	/* TCP Cookie Transactions */
+	if (0 < sysctl_tcp_cookie_size) {
+		/* Default, cookies without s_data. */
+		tp->cookie_values =
+			kzalloc(sizeof(*tp->cookie_values), sk->sk_allocation);
+		if (NULL != tp->cookie_values) {
+			kref_init(&tp->cookie_values->kref);
+		}
+	}
+	/* Presumed zeroed, in order of appearance:
+	 *	cookie_in_always, cookie_out_never, extend_timestamp,
+	 *	s_data_constant, s_data_in, s_data_out
+	 */
 	sk->sk_sndbuf = sysctl_tcp_wmem[1];
 	sk->sk_rcvbuf = sysctl_tcp_rmem[1];
 
-- 
1.6.0.4



^ permalink raw reply related

* Re: [PATCH][RFC]: ingress socket filter by mark
From: Eric Dumazet @ 2009-10-18 17:28 UTC (permalink / raw)
  To: hadi; +Cc: netdev, David Miller, Atis Elsts, Maciej Z.enczykowski
In-Reply-To: <1255869758.4815.40.camel@dogo.mojatatu.com>

jamal a écrit :
> Maciej forced me to dig into this ;->
> 
> at the socket level if a packet arrives with a different mark than
> what we bind to, drop it. I have tested this patch and it drops a packet
> with mismatching mark.
> 
> There are several approaches - and i think the patch suggestion i have
> made here maybe too strict. I assume that if someone binds to a mark,
> they want to not only send packets with that mark but receive
> only if that mark is set. 
> A looser check would be something along the line accept as well if mark
> is not set i.e
> if (sk->sk_mark && skb->mark && sk->sk_mark != skb->mark)
> 
> Alternatively i could add one bit in the socket flags and have it so
> that check is made only if app has been explicit:
> if (sock_flag(sk, SOCK_CHK_SOMARK) && sk->sk_mark != skb->mark) drop
> 
> Another approach  is to set sock filter from app. I dont like this
> approach because it will be the least usable from app level and would be
> the least simple from kernel level.
> 
> cheers,
> jamal
> 

I vote for extending BPF, and not adding the price of a compare
for each packet. Only users wanting mark filtering should pay the price.


^ permalink raw reply

* Re: PF_RING: Include in main line kernel?
From: Luca Deri @ 2009-10-18 17:37 UTC (permalink / raw)
  To: Harald Welte; +Cc: Brad Doctor, netdev
In-Reply-To: <20091018123811.GD27747@prithivi.gnumonks.org>

Harald
many thanks for your support. As I have stated before my wish is to  
include into the the mainstream kernel some features that I have  
implemented in PF_RING and on which many users rely since very long  
time. I understand that there are some overlaps with PF_PACKET and I'm  
willing to work with the kernel maintainers to address this issue.

The only thing I want to say is that PF_RING is *not* just for  
accelerating packet capture. This was the minimal goal when in 2003 I  
have coded the first release. PF_RING is a kernel module that  
implements several features (e.g. advanced packet filtering,  
extensible architecture by means of plugins, balancing, multicore/ 
multiqueue, packet modification and retransmission) that ease the  
implementation of efficient applications not limited to packet capture  
application. So in this view PF_RING has been designed to be a  
superset of PF_PACKET, because the needs I (and many other people  
have) are not of just having efficient packet capture.

This said I'm already at work to modify PF_RING so that it's a pure  
module that does not require any change in the mainstream kernel (i.e.  
net/core/dev.c). I'm almost done so I plan to release by tomorrow a  
new PF_RING release that implements this. Of course some changes into  
the kernel (such as Ben's patch) would ease PF_RING's life and pave  
the way to new kernel modules.

Done that I will start working at the RFC that you proposed.

Cheers Luca

PS. Just to clarify, when I say 'packet filtering' I mean the ability  
for packet capture applications to specify filters more advanced that  
BPF (even if BPF is supported by PF_RING) that prevent those  
applications from receiving packets they don't like, but that in any  
case will continue their journey into the kernel; this has nothing to  
do with netfilter filtering).

On Oct 18, 2009, at 2:38 PM, Harald Welte wrote:

> Hi Brad and Luca,
>
> On Wed, Oct 14, 2009 at 08:33:08AM -0600, Brad Doctor wrote:
>
>> On behalf of the users and developers of the PF_RING project, we  
>> would
>> like to ask consideration to include the PF_RING module in the main
>> line kernel.
>
> First of all, let me state that I think the mainline support for  
> nProbe/nTop is
> something that I have been hoping for many years.  I think the  
> performance you
> are achieving is remarkable, and it would be very usable to have this
> capability of high performance zero-copy packet access from  
> userspace as a
> stock feature of the Linux kernel.
>
> The actual PF_RING implementation has been criticized a couple of  
> times even in
> the past.  One general point I remember from past discussions in the  
> kernel
> network community was that there is too much overlap with PF_PACKET,  
> and that
> this could possibly be extended with a ring buffer rather than  
> replaced with a
> fairly similar alternative mechanism.
>
> So let's see what kind of solution the current discussion thread  
> will come up
> with... let's hope eventually we'll have the functionality in the  
> kernel.
> -- 
> - Harald Welte <laforge@gnumonks.org>           http://laforge.gnumonks.org/
> = 
> = 
> = 
> = 
> = 
> = 
> ======================================================================
> "Privacy in residential applications is a desirable marketing option."
>                                                  (ETSI EN 300 175-7  
> Ch. A6)

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox