Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: b44: Reset due to FIFO overflow.
From: Eric Dumazet @ 2010-06-28 11:09 UTC (permalink / raw)
  To: James Courtier-Dutton; +Cc: Mitchell Erblich, netdev
In-Reply-To: <AANLkTimK4mGdsSq206aqfusXPvnQczbYDlOWSYXAbOQJ@mail.gmail.com>

Le lundi 28 juin 2010 à 11:17 +0100, James Courtier-Dutton a écrit :
> On 28 June 2010 11:00, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > Problem is we receive a spike of RX network frames (possibly UDP or some
> > other RX only trafic), and chip raises an RX fifo overflow _error_
> > indication.
> >
> 
> The cause of the RX overflow is in my case is TCP.
> It is reproducible in mythtv.
> While watching LiveTV, press "s" for the program guide.
> The program guide is implemented into mythtv by a SQL query that
> results in a large response.
> The kernel is probably not servicing the RX FIFO quickly enough due to
> it being busy doing something else. In this case, probably a video
> mode switch.
> 

Thats strange, b44 has a big RX ring... and tcp sender should wait for
ACK...

> > Some hardware are buggy enough that such error indication is fatal and
> > _require_ hardware reset. Thats life. I suspect b44 driver doing a full
> > reset is not a random guess from driver author, but to avoid a complete
> > NIC lockup.
> >
> 
> Interesting, which hardware, apart from the b44, is it that "requires"
> a hardware reset after a RX FIFO overflow.

Just take a look at some net drivers and you'll see some of them have
this requirement.

rtl8169_rx_interrupt()
...
	if (status & RxFOVF) {
		rtl8169_schedule_work(dev, rtl8169_reset_task);
		dev->stats.rx_fifo_errors++;
	}





^ permalink raw reply

* Re: [PATCH -next] bnx2x: fail when try to setup unsupported features
From: Eilon Greenstein @ 2010-06-28 11:07 UTC (permalink / raw)
  To: Stanislaw Gruszka; +Cc: netdev@vger.kernel.org, Amerigo Wang
In-Reply-To: <20100628112811.29881285@dhcp-lab-109.englab.brq.redhat.com>

On Mon, 2010-06-28 at 02:28 -0700, Stanislaw Gruszka wrote:
> Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>

Signed-off-by: Eilon Greenstein <eilong@broadcom.com>

Thanks Stanislaw!

> ---
>  drivers/net/bnx2x_main.c |    3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/net/bnx2x_main.c b/drivers/net/bnx2x_main.c
> index 57ff5b3..0809f6c 100644
> --- a/drivers/net/bnx2x_main.c
> +++ b/drivers/net/bnx2x_main.c
> @@ -10982,6 +10982,9 @@ static int bnx2x_set_flags(struct net_device *dev, u32 data)
>  	int changed = 0;
>  	int rc = 0;
>  
> +	if (data & ~(ETH_FLAG_LRO | ETH_FLAG_RXHASH))
> +		return -EOPNOTSUPP;
> +
>  	if (bp->recovery_state != BNX2X_RECOVERY_DONE) {
>  		printk(KERN_ERR "Handling parity error recovery. Try again later\n");
>  		return -EAGAIN;




^ permalink raw reply

* Re: [RFC][BUG-FIX] the problem of checksum checking in UDP protocol
From: Shan Wei @ 2010-06-28 10:49 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, Ronciak, John, netdev
In-Reply-To: <1277530086.2481.15.camel@edumazet-laptop>

Eric Dumazet wrote, at 06/26/2010 01:28 PM:
>> (This patch is not complete, it's just for my idea.)
>> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
>> index 1dd1aff..47f7e86 100644
>> --- a/net/ipv6/udp.c
>> +++ b/net/ipv6/udp.c
>> @@ -723,6 +723,10 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
>>                 if (ulen < skb->len) {
>>                         if (pskb_trim_rcsum(skb, ulen))
>>                                 goto short_packet;
>> +
>> +                       if (skb_csum_unnecessary(skb))
>> +                               skb->ip_summed = CHECKSUM_NONE;
>> +
>>                         saddr = &ipv6_hdr(skb)->saddr;
>>                         daddr = &ipv6_hdr(skb)->daddr;
>>                         uh = udp_hdr(skb);
>>
> 
> I really dont know if this fix is the right one.
> 
> pskb_trim_rcsum() already contains a check. Should this check be changed
> to include yours ?

Oh..... I don't think so.
pskb_trim_rcsum() is also used when IPv4/IPv6 protocol receiving packets
and reassembling fragments. IP protocol does the right check and should
trust CHECKSUM_UNNECESSARY flag that drivers set, So we no need to 
change IP protocol.
If we add the skb_csum_unnecessary(skb) check into pskb_trim_rcsum() and
reset ip_summed with CHECKSUM_NONE, the checksum check that NIC hardward 
has done is wasted.

Only for UDP protocol over IPv4/IPv6, and length parameter is lower than
skb->len, We reset ip_summed with CHECKSUM_NONE.

-- 
Best Regards
-----
Shan Wei

^ permalink raw reply

* Re: b44: Reset due to FIFO overflow.
From: James Courtier-Dutton @ 2010-06-28 10:24 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1277716394.4235.235.camel@edumazet-laptop>

On 28 June 2010 10:13, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> Hi
>
> Problem is we dont know if a Receive Fifo overflow is a minor or major
> indication from b44 chip.
>
> A minor indication would be : Chip tells us one or more frame were lost.
> No special action needed from driver.
>
> A major indication (as of current implemented in b44 driver) is :
> I am completely out of order and need a reset. Please do it.
>
> Patch to switch from major to minor indication is easy, but we dont know
> if its valid or not.
>
> diff --git a/drivers/net/b44.h b/drivers/net/b44.h
> index e1905a4..514dc3a 100644
> --- a/drivers/net/b44.h
> +++ b/drivers/net/b44.h
> @@ -42,7 +42,7 @@
>  #define  ISTAT_EMAC            0x04000000 /* EMAC Interrupt */
>  #define  ISTAT_MII_WRITE       0x08000000 /* MII Write Interrupt */
>  #define  ISTAT_MII_READ                0x10000000 /* MII Read Interrupt */
> -#define  ISTAT_ERRORS (ISTAT_DSCE|ISTAT_DATAE|ISTAT_DPE|ISTAT_RDU|ISTAT_RFO|ISTAT_TFU)
> +#define  ISTAT_ERRORS (ISTAT_DSCE|ISTAT_DATAE|ISTAT_DPE|ISTAT_RDU|ISTAT_TFU)
>  #define B44_IMASK      0x0024UL /* Interrupt Mask */
>  #define  IMASK_DEF             (ISTAT_ERRORS | ISTAT_TO | ISTAT_RX | ISTAT_TX)
>  #define B44_GPTIMER    0x0028UL /* General Purpose Timer */
>
>
>

Ok, are you saying that all I have to do is apply this patch,
reproduce the problem condition, and if it recovers OK, then we can go
with this fix?
If so, I will try it out after work.

I will probably add a printk in before the ISTAT_ERRORS test, to
inform me when that ISTAT_RFO has actually happened.

But is doing nothing the right thing?
I would have thought that one would have to at least start and stop
the FIFO in order for the write/read pointers to be in the correct
positions or at least change the read pointer to do the equivalent of
flush the buffer.
Is there any of this sort of control over the FIFO possible?

Kind Regards

James

^ permalink raw reply

* Re: b44: Reset due to FIFO overflow.
From: James Courtier-Dutton @ 2010-06-28 10:17 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Mitchell Erblich, netdev
In-Reply-To: <1277719251.4235.306.camel@edumazet-laptop>

On 28 June 2010 11:00, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> Problem is we receive a spike of RX network frames (possibly UDP or some
> other RX only trafic), and chip raises an RX fifo overflow _error_
> indication.
>

The cause of the RX overflow is in my case is TCP.
It is reproducible in mythtv.
While watching LiveTV, press "s" for the program guide.
The program guide is implemented into mythtv by a SQL query that
results in a large response.
The kernel is probably not servicing the RX FIFO quickly enough due to
it being busy doing something else. In this case, probably a video
mode switch.

> Some hardware are buggy enough that such error indication is fatal and
> _require_ hardware reset. Thats life. I suspect b44 driver doing a full
> reset is not a random guess from driver author, but to avoid a complete
> NIC lockup.
>

Interesting, which hardware, apart from the b44, is it that "requires"
a hardware reset after a RX FIFO overflow.
Now, it is true that the driver for the b44 is just as bad on windows,
but there the recovery requires a windows reboot!

Kind Regards

James

^ permalink raw reply

* nat bypass
From: ratheesh k @ 2010-06-28 10:13 UTC (permalink / raw)
  To: Netfilter mailing list, netdev

Hi,

  A -------> R ------->S

I have a linux machine A is connected to Linux machine R . Machine R
is having two network interfaces and acting as a router .
It has a dhcp server running  . It will assign ip in 192.168.1.0/24
subnet to all machine connected on lan side ( A is connected also in
lan side ) . Wan side of R is connected to HTTP server S . There is
also a DHCP server running on S to assign ip in 10.232.18.0/24 subnet
.  Is there any way , in which NAT should be bypassed to get ip from
DHCP server running  on S . My question is : How can A will get  an ip
from 10.232.18.0/24 pool ip .?
ebtables is an option ? How can we make it ?
Is there any other optimal way ?

Thanks,
Ratheesh

^ permalink raw reply

* [PATCHv2] vhost-net: add dhclient work-around from userspace
From: Michael S. Tsirkin @ 2010-06-28 10:08 UTC (permalink / raw)
  To: Michael S. Tsirkin, Aristeu Rozanski, Herbert Xu, Juan Quintela,
	David S. Miller

Userspace virtio server has the following hack
so guests rely on it, and we have to replicate it, too:

Use port number to detect incoming IPv4 DHCP response packets,
and fill in the checksum for these.

The issue we are solving is that on linux guests, some apps
that use recvmsg with AF_PACKET sockets, don't know how to
handle CHECKSUM_PARTIAL;
The interface to return the relevant information was added
in 8dc4194474159660d7f37c495e3fc3f10d0db8cc,
and older userspace does not use it.
One important user of recvmsg with AF_PACKET is dhclient,
so we add a work-around just for DHCP.

Don't bother applying the hack to IPv6 as userspace virtio does not
have a work-around for that - let's hope guests will do the right
thing wrt IPv6.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---

Dave, I'm going to put this patch on the vhost tree,
no need for you to bother merging it - you'll get
it with a pull request.


 drivers/vhost/net.c |   44 +++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 43 insertions(+), 1 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index cc19595..03bba6a 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -24,6 +24,10 @@
 #include <linux/if_tun.h>
 #include <linux/if_macvlan.h>
 
+#include <linux/ip.h>
+#include <linux/udp.h>
+#include <linux/netdevice.h>
+
 #include <net/sock.h>
 
 #include "vhost.h"
@@ -186,6 +190,44 @@ static void handle_tx(struct vhost_net *net)
 	unuse_mm(net->dev.mm);
 }
 
+static int peek_head(struct sock *sk)
+{
+	struct sk_buff *skb;
+
+	lock_sock(sk);
+	skb = skb_peek(&sk->sk_receive_queue);
+	if (unlikely(!skb)) {
+		release_sock(sk);
+		return 0;
+	}
+	/* Userspace virtio server has the following hack so
+	 * guests rely on it, and we have to replicate it, too: */
+	/* Use port number to detect incoming IPv4 DHCP response packets,
+	 * and fill in the checksum. */
+
+	/* The issue we are solving is that on linux guests, some apps
+	 * that use recvmsg with AF_PACKET sockets, don't know how to
+	 * handle CHECKSUM_PARTIAL;
+	 * The interface to return the relevant information was added in
+	 * 8dc4194474159660d7f37c495e3fc3f10d0db8cc,
+	 * and older userspace does not use it.
+	 * One important user of recvmsg with AF_PACKET is dhclient,
+	 * so we add a work-around just for DHCP. */
+	if (skb->ip_summed == CHECKSUM_PARTIAL &&
+	    skb_headlen(skb) >= skb_transport_offset(skb) +
+				sizeof(struct udphdr) &&
+	    udp_hdr(skb)->dest == htons(68) &&
+	    skb_network_header_len(skb) >= sizeof(struct iphdr) &&
+	    ip_hdr(skb)->protocol == IPPROTO_UDP &&
+	    skb->protocol == htons(ETH_P_IP)) {
+		skb_checksum_help(skb);
+		/* Restore ip_summed value: tun passes it to user. */
+		skb->ip_summed = CHECKSUM_PARTIAL;
+	}
+	release_sock(sk);
+	return 1;
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_rx(struct vhost_net *net)
@@ -222,7 +264,7 @@ static void handle_rx(struct vhost_net *net)
 	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
 		vq->log : NULL;
 
-	for (;;) {
+	while (peek_head(sock->sk)) {
 		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
 					 ARRAY_SIZE(vq->iov),
 					 &out, &in,
-- 
1.7.1.12.g42b7f

^ permalink raw reply related

* Re: b44: Reset due to FIFO overflow.
From: Eric Dumazet @ 2010-06-28 10:00 UTC (permalink / raw)
  To: Mitchell Erblich; +Cc: James Courtier-Dutton, netdev
In-Reply-To: <5E7CE189-1002-4723-ACB2-D537B71BA5F3@earthlink.net>

Le lundi 28 juin 2010 à 02:33 -0700, Mitchell Erblich a écrit :

> 
> Why wouldn't the ability to recv frames after a Recv FIFO overflow
>  indicate that a reset is NOT required?
> 

Do you have datasheets of all b44 chips ? I dont.

> Thus,  should't it be an indication of congestion if associated with a single
> flow and either speed up (reduce latency to service) the recv side or 
> slow down the xmit side?

I dont understand what you are saying. xmit is not the problem here.
And driver is flow agnostic. Its well before network stack.

Problem is we receive a spike of RX network frames (possibly UDP or some
other RX only trafic), and chip raises an RX fifo overflow _error_
indication.

Some hardware are buggy enough that such error indication is fatal and
_require_ hardware reset. Thats life. I suspect b44 driver doing a full
reset is not a random guess from driver author, but to avoid a complete
NIC lockup.

Refs:

http://bugzilla.kernel.org/show_bug.cgi?id=7696
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=216338

commit 5fc7d61aee1a7f7d3448f8fbccaa93371ebeecb0

^ permalink raw reply

* Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
From: Michael S. Tsirkin @ 2010-06-28 10:00 UTC (permalink / raw)
  To: Xin, Xiaohui
  Cc: Herbert Xu, Dong, Eddie, Stephen Hemminger,
	netdev@vger.kernel.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, mingo@elte.hu, davem@davemloft.net,
	jdike@linux.intel.com
In-Reply-To: <F2E9EB7348B8264F86B6AB8151CE2D7916B0129230@shsmsx502.ccr.corp.intel.com>

On Mon, Jun 28, 2010 at 05:56:07PM +0800, Xin, Xiaohui wrote:
> >-----Original Message-----
> >From: Herbert Xu [mailto:herbert@gondor.apana.org.au]
> >Sent: Sunday, June 27, 2010 2:15 PM
> >To: Dong, Eddie
> >Cc: Xin, Xiaohui; Stephen Hemminger; netdev@vger.kernel.org; kvm@vger.kernel.org;
> >linux-kernel@vger.kernel.org; mst@redhat.com; mingo@elte.hu; davem@davemloft.net;
> >jdike@linux.intel.com
> >Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
> >
> >On Fri, Jun 25, 2010 at 09:03:46AM +0800, Dong, Eddie wrote:
> >>
> >> In current patch, each SKB for the assigned device (SRIOV VF or NIC or a complete
> >queue pairs) uses the buffer from guest, so it eliminates copy completely in software and
> >requires hardware to do so. If we can have an additonal place to store the buffer per skb (may
> >cause copy later on), we can do copy later on or re-post the buffer to assigned NIC driver
> >later on. But that may be not very clean either :(
> >
> >OK, if I understand you correctly then I don't think have a
> >problem.  With your current patch-set you have exactly the same
> >situation when the skb->data is reallocated as a kernel buffer.
> >
> 
> When will skb->data to be reallocated? May you point me the code path?
> 
> >This is OK because as you correctly argue, it is a rare situation.
> >
> >With my proposal you will need to get this extra external buffer
> >in even less cases, because you'd only need to do it if the skb
> >head grows, which only happens if it becomes encapsulated.
> >So let me explain it in a bit more detail:
> >
> >Our packet starts out as a purely non-linear skb, i.e., skb->head
> >contains nothing and all the page frags come from the guest.
> >
> >During host processing we may pull data into skb->head but the
> >first frag will remain unless we pull all of it.  If we did do
> >that then you would have a free external buffer anyway.
> >
> >Now in the common case the header may be modified or pulled, but
> >it very rarely grows.  So you can just copy the header back into
> >the first frag just before we give it to the guest.
> >
> Since the data is still there, so recompute the page offset and size is ok, right?

Question: can devices use parts of the same page
in frags of different skbs (or for other purposes)? If they do,
we'll corrupt that memory if we try to stick the header there.

We have another option, reserve some buffers
posted by guest and use them if we need to copy
the header. This seems the most straight-forward to me.

> >Only in the case where the packet header grows (e.g., encapsulation)
> >would you need to get an extra external buffer.
> >
> >Cheers,
> >--
> >Email: Herbert Xu <herbert@gondor.apana.org.au>
> >Home Page: http://gondor.apana.org.au/~herbert/
> >PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
From: Xin, Xiaohui @ 2010-06-28  9:56 UTC (permalink / raw)
  To: Herbert Xu, Dong, Eddie
  Cc: Stephen Hemminger, netdev@vger.kernel.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, mst@redhat.com, mingo@elte.hu,
	davem@davemloft.net, jdike@linux.intel.com
In-Reply-To: <20100627061455.GA16782@gondor.apana.org.au>

>-----Original Message-----
>From: Herbert Xu [mailto:herbert@gondor.apana.org.au]
>Sent: Sunday, June 27, 2010 2:15 PM
>To: Dong, Eddie
>Cc: Xin, Xiaohui; Stephen Hemminger; netdev@vger.kernel.org; kvm@vger.kernel.org;
>linux-kernel@vger.kernel.org; mst@redhat.com; mingo@elte.hu; davem@davemloft.net;
>jdike@linux.intel.com
>Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
>
>On Fri, Jun 25, 2010 at 09:03:46AM +0800, Dong, Eddie wrote:
>>
>> In current patch, each SKB for the assigned device (SRIOV VF or NIC or a complete
>queue pairs) uses the buffer from guest, so it eliminates copy completely in software and
>requires hardware to do so. If we can have an additonal place to store the buffer per skb (may
>cause copy later on), we can do copy later on or re-post the buffer to assigned NIC driver
>later on. But that may be not very clean either :(
>
>OK, if I understand you correctly then I don't think have a
>problem.  With your current patch-set you have exactly the same
>situation when the skb->data is reallocated as a kernel buffer.
>

When will skb->data to be reallocated? May you point me the code path?

>This is OK because as you correctly argue, it is a rare situation.
>
>With my proposal you will need to get this extra external buffer
>in even less cases, because you'd only need to do it if the skb
>head grows, which only happens if it becomes encapsulated.
>So let me explain it in a bit more detail:
>
>Our packet starts out as a purely non-linear skb, i.e., skb->head
>contains nothing and all the page frags come from the guest.
>
>During host processing we may pull data into skb->head but the
>first frag will remain unless we pull all of it.  If we did do
>that then you would have a free external buffer anyway.
>
>Now in the common case the header may be modified or pulled, but
>it very rarely grows.  So you can just copy the header back into
>the first frag just before we give it to the guest.
>
Since the data is still there, so recompute the page offset and size is ok, right?

>Only in the case where the packet header grows (e.g., encapsulation)
>would you need to get an extra external buffer.
>
>Cheers,
>--
>Email: Herbert Xu <herbert@gondor.apana.org.au>
>Home Page: http://gondor.apana.org.au/~herbert/
>PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* [PATCH -next] netxen: fail when try to setup unsupported features
From: Stanislaw Gruszka @ 2010-06-28  9:33 UTC (permalink / raw)
  To: netdev; +Cc: Amerigo Wang, Amit Kumar Salecha

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
---
 drivers/net/netxen/netxen_nic_ethtool.c |   13 ++++++++++---
 1 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/net/netxen/netxen_nic_ethtool.c b/drivers/net/netxen/netxen_nic_ethtool.c
index 20f7c58..6d94ee5 100644
--- a/drivers/net/netxen/netxen_nic_ethtool.c
+++ b/drivers/net/netxen/netxen_nic_ethtool.c
@@ -887,12 +887,19 @@ static int netxen_nic_set_flags(struct net_device *netdev, u32 data)
 	struct netxen_adapter *adapter = netdev_priv(netdev);
 	int hw_lro;
 
+	if (data & ~ETH_FLAG_LRO)
+		return -EOPNOTSUPP;
+
 	if (!(adapter->capabilities & NX_FW_CAPABILITY_HW_LRO))
 		return -EINVAL;
 
-	ethtool_op_set_flags(netdev, data);
-
-	hw_lro = (data & ETH_FLAG_LRO) ? NETXEN_NIC_LRO_ENABLED : 0;
+	if (data & ETH_FLAG_LRO) {
+		hw_lro = NETXEN_NIC_LRO_ENABLED;
+		netdev->features |= NETIF_F_LRO;
+	} else {
+		hw_lro = 0;
+		netdev->features &= ~NETIF_F_LRO;
+	}
 
 	if (netxen_config_hw_lro(adapter, hw_lro))
 		return -EIO;
-- 
1.5.5.6


^ permalink raw reply related

* Re: b44: Reset due to FIFO overflow.
From: Mitchell Erblich @ 2010-06-28  9:33 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: James Courtier-Dutton, netdev
In-Reply-To: <1277716394.4235.235.camel@edumazet-laptop>


On Jun 28, 2010, at 2:13 AM, Eric Dumazet wrote:

> Le lundi 28 juin 2010 à 08:41 +0100, James Courtier-Dutton a écrit :
>> Hi,
>> 
>> Reference:
>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/279102
>> 
>> I can see this bug and can reproduce it 100% on demand.
>> The problem seems to be that when the b44 has a incoming FIFO buffer
>> overflow, it resets the entire card, dis-associates with the access
>> point and therefore takes some time before it can pass traffic again.
>> Can anyone point me to some code that would just recover the FIFO
>> instead of reset the entire card?
>> 
>> I am a kernel developer, but I don't have any data sheets on this card
>> so was hoping someone with more knowledge of its workings, could help
>> me.
>> 
>> I can then test it, and see if it is a good fix or not.
>> 
> 
> Hi
> 
> Problem is we dont know if a Receive Fifo overflow is a minor or major
> indication from b44 chip.
> 
> A minor indication would be : Chip tells us one or more frame were lost.
> No special action needed from driver.
> 
> A major indication (as of current implemented in b44 driver) is :
> I am completely out of order and need a reset. Please do it.
> 
> Patch to switch from major to minor indication is easy, but we dont know
> if its valid or not.
> 
> diff --git a/drivers/net/b44.h b/drivers/net/b44.h
> index e1905a4..514dc3a 100644
> --- a/drivers/net/b44.h
> +++ b/drivers/net/b44.h
> @@ -42,7 +42,7 @@
> #define  ISTAT_EMAC		0x04000000 /* EMAC Interrupt */
> #define  ISTAT_MII_WRITE	0x08000000 /* MII Write Interrupt */
> #define  ISTAT_MII_READ		0x10000000 /* MII Read Interrupt */
> -#define  ISTAT_ERRORS (ISTAT_DSCE|ISTAT_DATAE|ISTAT_DPE|ISTAT_RDU|ISTAT_RFO|ISTAT_TFU)
> +#define  ISTAT_ERRORS (ISTAT_DSCE|ISTAT_DATAE|ISTAT_DPE|ISTAT_RDU|ISTAT_TFU)
> #define B44_IMASK	0x0024UL /* Interrupt Mask */
> #define  IMASK_DEF		(ISTAT_ERRORS | ISTAT_TO | ISTAT_RX | ISTAT_TX)
> #define B44_GPTIMER	0x0028UL /* General Purpose Timer */
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Why wouldn't the ability to recv frames after a Recv FIFO overflow
 indicate that a reset is NOT required?

Thus,  should't it be an indication of congestion if associated with a single
flow and either speed up (reduce latency to service) the recv side or 
slow down the xmit side?

Mitchell Erblich

^ permalink raw reply

* [PATCH -next] qlcnic: fail when try to setup unsupported features
From: Stanislaw Gruszka @ 2010-06-28  9:31 UTC (permalink / raw)
  To: netdev; +Cc: Amerigo Wang, Amit Kumar Salecha, Anirban Chakraborty

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
---
 drivers/net/qlcnic/qlcnic_ethtool.c |   13 ++++++++++---
 1 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/net/qlcnic/qlcnic_ethtool.c b/drivers/net/qlcnic/qlcnic_ethtool.c
index d4e803e..b9d5acb 100644
--- a/drivers/net/qlcnic/qlcnic_ethtool.c
+++ b/drivers/net/qlcnic/qlcnic_ethtool.c
@@ -983,12 +983,19 @@ static int qlcnic_set_flags(struct net_device *netdev, u32 data)
 	struct qlcnic_adapter *adapter = netdev_priv(netdev);
 	int hw_lro;
 
+	if (data & ~ETH_FLAG_LRO)
+		return -EOPNOTSUPP;
+
 	if (!(adapter->capabilities & QLCNIC_FW_CAPABILITY_HW_LRO))
 		return -EINVAL;
 
-	ethtool_op_set_flags(netdev, data);
-
-	hw_lro = (data & ETH_FLAG_LRO) ? QLCNIC_LRO_ENABLED : 0;
+	if (data & ETH_FLAG_LRO) {
+		hw_lro = QLCNIC_LRO_ENABLED;
+		netdev->features |= NETIF_F_LRO;
+	} else {
+		hw_lro = 0;
+		netdev->features &= ~NETIF_F_LRO;
+	}
 
 	if (qlcnic_config_hw_lro(adapter, hw_lro))
 		return -EIO;
-- 
1.5.5.6


^ permalink raw reply related

* [PATCH -next] vmxnet3: fail when try to setup unsupported features
From: Stanislaw Gruszka @ 2010-06-28  9:29 UTC (permalink / raw)
  To: netdev; +Cc: Amerigo Wang, Shreyas Bhatewara

Return EOPNOTSUPP in ethtool_ops->set_flags.

Fix coding style while at it.

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
---
 drivers/net/vmxnet3/vmxnet3_ethtool.c |    9 +++++++--
 1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/vmxnet3/vmxnet3_ethtool.c b/drivers/net/vmxnet3/vmxnet3_ethtool.c
index 3935c44..8a71a21 100644
--- a/drivers/net/vmxnet3/vmxnet3_ethtool.c
+++ b/drivers/net/vmxnet3/vmxnet3_ethtool.c
@@ -276,16 +276,21 @@ vmxnet3_get_strings(struct net_device *netdev, u32 stringset, u8 *buf)
 }
 
 static u32
-vmxnet3_get_flags(struct net_device *netdev) {
+vmxnet3_get_flags(struct net_device *netdev)
+{
 	return netdev->features;
 }
 
 static int
-vmxnet3_set_flags(struct net_device *netdev, u32 data) {
+vmxnet3_set_flags(struct net_device *netdev, u32 data)
+{
 	struct vmxnet3_adapter *adapter = netdev_priv(netdev);
 	u8 lro_requested = (data & ETH_FLAG_LRO) == 0 ? 0 : 1;
 	u8 lro_present = (netdev->features & NETIF_F_LRO) == 0 ? 0 : 1;
 
+	if (data & ~ETH_FLAG_LRO)
+		return -EOPNOTSUPP;
+
 	if (lro_requested ^ lro_present) {
 		/* toggle the LRO feature*/
 		netdev->features ^= NETIF_F_LRO;
-- 
1.5.5.6


^ permalink raw reply related

* [PATCH -next] bnx2x: fail when try to setup unsupported features
From: Stanislaw Gruszka @ 2010-06-28  9:28 UTC (permalink / raw)
  To: netdev; +Cc: Amerigo Wang, Eilon Greenstein

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
---
 drivers/net/bnx2x_main.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/drivers/net/bnx2x_main.c b/drivers/net/bnx2x_main.c
index 57ff5b3..0809f6c 100644
--- a/drivers/net/bnx2x_main.c
+++ b/drivers/net/bnx2x_main.c
@@ -10982,6 +10982,9 @@ static int bnx2x_set_flags(struct net_device *dev, u32 data)
 	int changed = 0;
 	int rc = 0;
 
+	if (data & ~(ETH_FLAG_LRO | ETH_FLAG_RXHASH))
+		return -EOPNOTSUPP;
+
 	if (bp->recovery_state != BNX2X_RECOVERY_DONE) {
 		printk(KERN_ERR "Handling parity error recovery. Try again later\n");
 		return -EAGAIN;
-- 
1.5.5.6


^ permalink raw reply related

* [PATCH -next] e1000e: fail when try to setup unsupported features
From: Stanislaw Gruszka @ 2010-06-28  9:26 UTC (permalink / raw)
  To: netdev; +Cc: Amerigo Wang, Jeff Kirsher

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
---
 drivers/net/e1000e/ethtool.c |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/drivers/net/e1000e/ethtool.c b/drivers/net/e1000e/ethtool.c
index 77c5829..6355a1b 100644
--- a/drivers/net/e1000e/ethtool.c
+++ b/drivers/net/e1000e/ethtool.c
@@ -2051,7 +2051,6 @@ static const struct ethtool_ops e1000_ethtool_ops = {
 	.get_coalesce		= e1000_get_coalesce,
 	.set_coalesce		= e1000_set_coalesce,
 	.get_flags		= ethtool_op_get_flags,
-	.set_flags		= ethtool_op_set_flags,
 };
 
 void e1000e_set_ethtool_ops(struct net_device *netdev)
-- 
1.5.5.6

^ permalink raw reply related

* Re: b44: Reset due to FIFO overflow.
From: Eric Dumazet @ 2010-06-28  9:13 UTC (permalink / raw)
  To: James Courtier-Dutton; +Cc: netdev
In-Reply-To: <AANLkTinwnTw3Yqy3POghliZDRmRYUfJL6Gy67GNNLiYv@mail.gmail.com>

Le lundi 28 juin 2010 à 08:41 +0100, James Courtier-Dutton a écrit :
> Hi,
> 
> Reference:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/279102
> 
> I can see this bug and can reproduce it 100% on demand.
> The problem seems to be that when the b44 has a incoming FIFO buffer
> overflow, it resets the entire card, dis-associates with the access
> point and therefore takes some time before it can pass traffic again.
> Can anyone point me to some code that would just recover the FIFO
> instead of reset the entire card?
> 
> I am a kernel developer, but I don't have any data sheets on this card
> so was hoping someone with more knowledge of its workings, could help
> me.
> 
> I can then test it, and see if it is a good fix or not.
> 

Hi

Problem is we dont know if a Receive Fifo overflow is a minor or major
indication from b44 chip.

A minor indication would be : Chip tells us one or more frame were lost.
No special action needed from driver.

A major indication (as of current implemented in b44 driver) is :
I am completely out of order and need a reset. Please do it.

Patch to switch from major to minor indication is easy, but we dont know
if its valid or not.

diff --git a/drivers/net/b44.h b/drivers/net/b44.h
index e1905a4..514dc3a 100644
--- a/drivers/net/b44.h
+++ b/drivers/net/b44.h
@@ -42,7 +42,7 @@
 #define  ISTAT_EMAC		0x04000000 /* EMAC Interrupt */
 #define  ISTAT_MII_WRITE	0x08000000 /* MII Write Interrupt */
 #define  ISTAT_MII_READ		0x10000000 /* MII Read Interrupt */
-#define  ISTAT_ERRORS (ISTAT_DSCE|ISTAT_DATAE|ISTAT_DPE|ISTAT_RDU|ISTAT_RFO|ISTAT_TFU)
+#define  ISTAT_ERRORS (ISTAT_DSCE|ISTAT_DATAE|ISTAT_DPE|ISTAT_RDU|ISTAT_TFU)
 #define B44_IMASK	0x0024UL /* Interrupt Mask */
 #define  IMASK_DEF		(ISTAT_ERRORS | ISTAT_TO | ISTAT_RX | ISTAT_TX)
 #define B44_GPTIMER	0x0028UL /* General Purpose Timer */

^ permalink raw reply related

* Re: Reviewing batman-adv for net/
From: Sven Eckelmann @ 2010-06-28  8:55 UTC (permalink / raw)
  To: b.a.t.m.a.n-ZwoEplunGu2X36UT3dwllkB+6BGkLq7r
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, Hagen Paul Pfeifer,
	b.a.t.m.a.n-ZwoEplunGu2X36UT3dwlltHuzzzSOjJt, David S. Miller
In-Reply-To: <20100627222706.GC8285@nuttenaction>

[-- Attachment #1: Type: Text/Plain, Size: 2409 bytes --]

Hagen Paul Pfeifer wrote:

Thanks for your comments. Marek already answered the first two questions, and 
I will write something about the last part.

[...]
> o The major code is twisted about bookkeeping stuff, configuration aspects
> and so on. The minor codebase is about protocol processing. What about a
> generalized architecture and a userspace implementation of the protocol?
> And communication via netlink sockets.

Even if it sounds interesting to have a userspace implementation - it just 
doesn't work. Maybe you have something different in mind - so please comment a 
little bit more about about your idea.

There were different implementations for that protocol, first one was a 
complete userspace implementation - but it was just slow and not really 
usable. The second one was not using skb because it was only a port of the 
userspace implementation to the kernel. It was faster (better throughput, not 
as much latency added), but still we noticed that we couldn't saturate our 
fast links due to the overhead we added. That was the time the current form of 
processing and forwarding using skbs was implemented. It was quite plain to us 
that those "tiny" parts must be processed as fast as possible and we are not 
able to communicate a lot inside the kernel to get the information we need to 
forward packets or process new ones.

So I would assume that when we communicate with userspace which does some 
processing of the incoming packets and changes them to get forwarded, we would 
have the same overhead problems as before. That means we add unnecessary 
latency and reduce the throughput a lot.

It works quite well for layer 3 mesh protocols (olsr, batman, babel, bmx, ..) 
because they must not care about the actual routing of the packets - 
everything is done by the IP routing code. But it does not work for things 
which must route ethernet frames as there does not exist such a framework and 
it is hard to create one which everyone will like and has enough information 
to provide all features they need. Meshing with ethernet frames is relative 
young (please correct me) and we see all the time that we need more things 
which couldn't be done by layer 3 (or at least with a lot more work). An 
example would be interface interface alternating which can reduce 
interferences when communicating over many hops.

thanks,
	Sven

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* b44: Reset due to FIFO overflow.
From: James Courtier-Dutton @ 2010-06-28  7:41 UTC (permalink / raw)
  To: netdev

Hi,

Reference:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/279102

I can see this bug and can reproduce it 100% on demand.
The problem seems to be that when the b44 has a incoming FIFO buffer
overflow, it resets the entire card, dis-associates with the access
point and therefore takes some time before it can pass traffic again.
Can anyone point me to some code that would just recover the FIFO
instead of reset the entire card?

I am a kernel developer, but I don't have any data sheets on this card
so was hoping someone with more knowledge of its workings, could help
me.

I can then test it, and see if it is a good fix or not.

Kind Regards

James

^ permalink raw reply

* Re: Reviewing batman-adv for net/
From: Henning Rogge @ 2010-06-28  5:32 UTC (permalink / raw)
  To: b.a.t.m.a.n-ZwoEplunGu2X36UT3dwllkB+6BGkLq7r
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, Hagen Paul Pfeifer, Marek Lindner,
	David S. Miller
In-Reply-To: <201006280115.25518.lindner_marek-LWAfsSFWpa4@public.gmane.org>

[-- Attachment #1: Type: Text/Plain, Size: 1518 bytes --]

Am Montag 28 Juni 2010, 01:15:23 schrieb Marek Lindner:
> the OLSR as standardized by the IETF is known to be flawed when used
> outside of a simulator (even the IETF Manet people know this - I spoke
> with one of them). We have assembled a few documents explaining some of
> its weaknesses on our website (www.open-mesh.org) but I suggest you get in
> touch with the folks of www.olsr.org. They can go into the details of why
> they don't follow the RFC.
OLSRv1 lacks the support for routing metrics in the IETF RFC document. Most of 
the research groups began integrating routing metrics later, and OLSRv2 (which 
is worked on at the moment, 2 RFC's done, 1 at IESG, 1 as a draft) will 
contain specified ways to include a routing metric on a link and how to use 
it. We were talking about the generic metric encoding some weeks ago on the 
MANET WG list.

Using hopcount metric (as described in the RFC 3626) will result in a "worst 
link first" strategy in wireless networks, because you always optimize for 
long links which will break often.

It's not difficult to integrate a routing metric into OLSRv1, but in the old 
OLSRv1 RFC (2003) there it's not specified. We (the olsr.org team) use a 
custom message format to add metric information to our hello/tc messages, the 
NRL use a different one for their own metric aware OLSR.

Henning Rogge (olsr.org team)

-- 
1) You can't win.
2) You can't break even.
3) You can't leave the game.
— The Laws of Thermodynamics, summarized

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* Re: [PATCH 2/2 v2] Driver core: reduce duplicated code
From: Eric Miao @ 2010-06-28  5:27 UTC (permalink / raw)
  To: Uwe Kleine-König
  Cc: Sascha Hauer, Greg KH, Randy Dunlap, Dmitry Torokhov,
	Anisse Astier, Greg Kroah-Hartman, Magnus Damm, Rafael J. Wysocki,
	Paul Mundt, linux-doc, linux-kernel, netdev
In-Reply-To: <20100628051606.GA16445@pengutronix.de>

2010/6/28 Uwe Kleine-König <u.kleine-koenig@pengutronix.de>:
> Hi Eric,
>
> On Mon, Jun 28, 2010 at 12:55:45PM +0800, Eric Miao wrote:
>> I suggest you to have a look into arch/arm/mach-mmp/devices.c and
>> arch/arm/mach-mmp/pxa{168,910}.c as well as
>> arch/arm/mach-mmp/include/mach/pxa{168,910}.h, maybe we can find
>> some common practice.
> I think I like this approach in general, I already thought about not
> passing all parameters as function/macro arguments, too.  But maybe this
> becomes too excessive for imx as I would need too many of these device
> desc for the different imx variants?!
>
> Anyhow a few things I thought when looking in the files you suggested:
>
>  - Why not use an array for all uart devdescs, maybe the code for
>   pxa168_add_uart could become a bit smaller then?:
>
>        extern struct pxa_device_desc pxa168_device_uart[2];
>        ...
>        static inline int pxa168_add_uart(int id)
>        {
>                struct pxa_device_desc *d = pxa168_device_uart + id;
>
>                if (id < 0 || id > 2)
>                        return -EINVAL;
>
>                return pxa_register_device(d, NULL, 0);
>        }
>
>   (Ditto for the other types obviously.)

That's a good suggestion, yet it came that way for two reasons:

1. the initial naming mess, uart0 was later renamed to uart1, e.g.
2. and the restrictions of PXA{168,910}_DEVICE() macros, these
   macros are handy to simplify the definition, but may require fancy
   tricks to make it support array

>
>  - shouldn't all these pxa_device_descs and pxa168_add_$device functions
>   be __initdata and __init?
>

pxa{168,910}_add_device() are actually 'static inline' so my assumption
is they will be inlined when referenced, otherwise won't occupy any code
space. The *_descs, however, they are __initdata if you look into the
definitions of PXA{168,910}_DEVICES

>  - pxa_register_device is better than my add_resndata function in (at
>   least) one aspect as it sets coherent_dma_mask, too.  This is
>   something I missed when trying to add mxc-mmc (IIRC) devices.
>
> Thanks
> Uwe
>
> --
> Pengutronix e.K.                           | Uwe Kleine-König            |
> Industrial Linux Solutions                 | http://www.pengutronix.de/  |
>

^ permalink raw reply

* Re: [PATCH 2/2 v2] Driver core: reduce duplicated code
From: Uwe Kleine-König @ 2010-06-28  5:16 UTC (permalink / raw)
  To: Eric Miao, Sascha Hauer
  Cc: Greg KH, Randy Dunlap, Dmitry Torokhov, Anisse Astier,
	Greg Kroah-Hartman, Magnus Damm, Rafael J. Wysocki, Paul Mundt,
	linux-doc, linux-kernel, netdev
In-Reply-To: <AANLkTinMLrXyqKZ76AiWiw6N_glrWWrP-2_aUTzQm5Xr@mail.gmail.com>

Hi Eric,

On Mon, Jun 28, 2010 at 12:55:45PM +0800, Eric Miao wrote:
> I suggest you to have a look into arch/arm/mach-mmp/devices.c and
> arch/arm/mach-mmp/pxa{168,910}.c as well as
> arch/arm/mach-mmp/include/mach/pxa{168,910}.h, maybe we can find
> some common practice.
I think I like this approach in general, I already thought about not
passing all parameters as function/macro arguments, too.  But maybe this
becomes too excessive for imx as I would need too many of these device
desc for the different imx variants?!

Anyhow a few things I thought when looking in the files you suggested:

 - Why not use an array for all uart devdescs, maybe the code for
   pxa168_add_uart could become a bit smaller then?:

	extern struct pxa_device_desc pxa168_device_uart[2];
	...
	static inline int pxa168_add_uart(int id)
	{
		struct pxa_device_desc *d = pxa168_device_uart + id;

		if (id < 0 || id > 2)
			return -EINVAL;

		return pxa_register_device(d, NULL, 0);
	}

   (Ditto for the other types obviously.)

 - shouldn't all these pxa_device_descs and pxa168_add_$device functions
   be __initdata and __init?

 - pxa_register_device is better than my add_resndata function in (at
   least) one aspect as it sets coherent_dma_mask, too.  This is
   something I missed when trying to add mxc-mmc (IIRC) devices.

Thanks
Uwe

-- 
Pengutronix e.K.                           | Uwe Kleine-König            |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |

^ permalink raw reply

* Re: dhclient, checksum and tap
From: David Miller @ 2010-06-28  4:59 UTC (permalink / raw)
  To: mst; +Cc: herbert.xu, netdev
In-Reply-To: <20100627082439.GA8472@redhat.com>

From: "Michael S. Tsirkin" <mst@redhat.com>
Date: Sun, 27 Jun 2010 11:24:40 +0300

> Just to spell it out for me, you think the hack should be done
> in vhost-net?

Pretty much, yes.

^ permalink raw reply

* Re: [PATCH 2/2 v2] Driver core: reduce duplicated code
From: Eric Miao @ 2010-06-28  4:55 UTC (permalink / raw)
  To: Uwe Kleine-König
  Cc: Greg KH, Randy Dunlap, Dmitry Torokhov, Anisse Astier,
	Greg Kroah-Hartman, Magnus Damm, Rafael J. Wysocki, Paul Mundt,
	linux-doc, linux-kernel, netdev
In-Reply-To: <20100622052314.GA17128@pengutronix.de>

2010/6/22 Uwe Kleine-König <u.kleine-koenig@pengutronix.de>:
> Hi Greg,
>
>> > I changed the semantic slightly to only call
>> > platform_device_add_resources if data != NULL instead of size != 0.  The
>> > idea is to support wrappers like:
>> >
>> >     #define add_blablub(id, pdata) \
>> >             platform_device_register_resndata(NULL, "blablub", id, \
>> >                     NULL, 0, pdata, sizeof(struct blablub_platform_data))
>> >
>> > that don't fail if pdata=NULL.  Ditto for res.
>>
>> That's fine, but why would you want to have a #define for something like
>> this?  Is it really needed?
> Well, what is really needed?  I intend to use it on arm/imx.  I have
> several different machines using similar SoCs and so I want to have a
> function à la:
>
>        struct platform_device *__init imx_add_imx_i2c(int id,
>                resource_size_t iobase, resource_size_t iosize, int irq,
>                const struct imxi2c_platform_data *pdata)
>
> that builds a struct resource[] and then calls
> platform_device_register_resndata().  And then I have a set of macros
> like:
>
>        #define imx21_add_i2c_imx(pdata)        \
>                imx_add_imx_i2c(0, MX2x_I2C_BASE_ADDR, SZ_4K, MX2x_INT_I2C, pdata)
>        #define imx25_add_imx_i2c0(pdata)       \
>                imx_add_imx_i2c(0, MX25_I2C1_BASE_ADDR, SZ_16K, MX25_INT_I2C1, pdata)
>        ##define imx25_add_imx_i2c1(pdata)      \
>                imx_add_imx_i2c(1, MX25_I2C2_BASE_ADDR, SZ_16K, MX25_INT_I2C2, pdata)
>
> etc.  The final goal is to get rid of files like
> arch/arm/mach-mx3/devices.c.
>

Hi Uwe,

I suggest you to have a look into arch/arm/mach-mmp/devices.c and
arch/arm/mach-mmp/pxa{168,910}.c as well as
arch/arm/mach-mmp/include/mach/pxa{168,910}.h, maybe we can find
some common practice.

>> Anyway, this version looks fine to me, I'll go apply it.
> \o/
>
> Best regards and thanks
> Uwe
>
> --
> Pengutronix e.K.                           | Uwe Kleine-König            |
> Industrial Linux Solutions                 | http://www.pengutronix.de/  |
>

^ permalink raw reply

* [PATCH v2] netfilter: xtables target SYNPROXY
From: Changli Gao @ 2010-06-28  2:27 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: David S. Miller, Alexey Kuznetsov, Jan Engelhardt,
	Jozsef Kadlecsik, Pekka Savola (ipv6), James Morris,
	Hideaki YOSHIFUJI, netfilter-devel, netdev, Changli Gao

xtables target SYNPROXY.

This patch implements an xtables target SYNPROXY. As the connection to the
TCP server won't be established until the ACK from the client is received, it
can protect the TCP server from the SYN-flood attacks.

It works in the raw table of the PREROUTING chain, before conntracking system.
Syncookies is used, so no new state is introduced into the conntracking system.
In fact, until the first connection is established, conntracking system doesn't
see any packets. So when there is a SYN-flood attack, conntracking system won't
be busy on finding and deleting the un-assured ct.

As the SYN-packet of the second connection request is sent locally, the DNAT
rules which are in the PREROUTING chain should be moved to the OUTPUT chain.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
----
 include/net/netfilter/nf_conntrack.h        |   10 
 include/net/netfilter/nf_conntrack_core.h   |   21 
 include/net/netfilter/nf_conntrack_extend.h |    2 
 include/net/tcp.h                           |    7 
 net/ipv4/syncookies.c                       |   22 
 net/ipv4/tcp_ipv4.c                         |    9 
 net/netfilter/Kconfig                       |   17 
 net/netfilter/Makefile                      |    1 
 net/netfilter/nf_conntrack_core.c           |   45 +
 net/netfilter/xt_SYNPROXY.c                 |  678 ++++++++++++++++++++++++++++
 10 files changed, 793 insertions(+), 19 deletions(-)
diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index e624dae..5e6d8e4 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -311,5 +311,15 @@ do {							\
 #define MODULE_ALIAS_NFCT_HELPER(helper) \
         MODULE_ALIAS("nfct-helper-" helper)
 
+#if defined(CONFIG_NETFILTER_XT_TARGET_SYNPROXY) || \
+    defined(CONFIG_NETFILTER_XT_TARGET_SYNPROXY_MODULE)
+extern unsigned int (*syn_proxy_pre_hook)(struct sk_buff *skb,
+					  struct nf_conn *ct,
+					  enum ip_conntrack_info ctinfo);
+
+extern unsigned int (*syn_proxy_post_hook)(struct sk_buff *skb,
+					   struct nf_conn *ct,
+					   enum ip_conntrack_info ctinfo);
+#endif
 #endif /* __KERNEL__ */
 #endif /* _NF_CONNTRACK_H */
diff --git a/include/net/netfilter/nf_conntrack_core.h b/include/net/netfilter/nf_conntrack_core.h
index aced085..637b404 100644
--- a/include/net/netfilter/nf_conntrack_core.h
+++ b/include/net/netfilter/nf_conntrack_core.h
@@ -54,6 +54,23 @@ nf_conntrack_find_get(struct net *net, u16 zone,
 
 extern int __nf_conntrack_confirm(struct sk_buff *skb);
 
+static inline unsigned int syn_proxy_post_call(struct sk_buff *skb,
+					       struct nf_conn *ct,
+					       enum ip_conntrack_info ctinfo)
+{
+	unsigned int ret = NF_ACCEPT;
+#if defined(CONFIG_NETFILTER_XT_TARGET_SYNPROXY) || \
+    defined(CONFIG_NETFILTER_XT_TARGET_SYNPROXY_MODULE)
+	unsigned int (*syn_proxy)(struct sk_buff *, struct nf_conn *,
+				  enum ip_conntrack_info);
+	syn_proxy = rcu_dereference(syn_proxy_post_hook);
+	if (syn_proxy)
+		ret = syn_proxy(skb, ct, ctinfo);
+#endif
+
+	return ret;
+}
+
 /* Confirm a connection: returns NF_DROP if packet must be dropped. */
 static inline int nf_conntrack_confirm(struct sk_buff *skb)
 {
@@ -63,8 +80,10 @@ static inline int nf_conntrack_confirm(struct sk_buff *skb)
 	if (ct && !nf_ct_is_untracked(ct)) {
 		if (!nf_ct_is_confirmed(ct))
 			ret = __nf_conntrack_confirm(skb);
-		if (likely(ret == NF_ACCEPT))
+		if (likely(ret == NF_ACCEPT)) {
 			nf_ct_deliver_cached_events(ct);
+			ret = syn_proxy_post_call(skb, ct, skb->nfctinfo);
+		}
 	}
 	return ret;
 }
diff --git a/include/net/netfilter/nf_conntrack_extend.h b/include/net/netfilter/nf_conntrack_extend.h
index 32d15bd..b2ae7e9 100644
--- a/include/net/netfilter/nf_conntrack_extend.h
+++ b/include/net/netfilter/nf_conntrack_extend.h
@@ -11,6 +11,7 @@ enum nf_ct_ext_id {
 	NF_CT_EXT_ACCT,
 	NF_CT_EXT_ECACHE,
 	NF_CT_EXT_ZONE,
+	NF_CT_EXT_SYNPROXY,
 	NF_CT_EXT_NUM,
 };
 
@@ -19,6 +20,7 @@ enum nf_ct_ext_id {
 #define NF_CT_EXT_ACCT_TYPE struct nf_conn_counter
 #define NF_CT_EXT_ECACHE_TYPE struct nf_conntrack_ecache
 #define NF_CT_EXT_ZONE_TYPE struct nf_conntrack_zone
+#define NF_CT_EXT_SYNPROXY_TYPE struct syn_proxy_state
 
 /* Extensions: optional stuff which isn't permanently in struct. */
 struct nf_ct_ext {
diff --git a/include/net/tcp.h b/include/net/tcp.h
index c2f96c2..06f28d3 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -460,8 +460,11 @@ extern int			tcp_disconnect(struct sock *sk, int flags);
 extern __u32 syncookie_secret[2][16-4+SHA_DIGEST_WORDS];
 extern struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb, 
 				    struct ip_options *opt);
-extern __u32 cookie_v4_init_sequence(struct sock *sk, struct sk_buff *skb, 
-				     __u16 *mss);
+extern __u32 __cookie_v4_init_sequence(__be32 saddr, __be32 daddr,
+				       __be16 sport, __be16 dport, __u32 seq,
+				       __u16 *mssp);
+extern int cookie_v4_check_sequence(const struct iphdr *iph,
+				    const struct tcphdr *th, __u32 cookie);
 
 extern __u32 cookie_init_timestamp(struct request_sock *req);
 extern bool cookie_check_timestamp(struct tcp_options_received *opt, bool *);
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 650cace..3adcba3 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -159,26 +159,21 @@ static __u16 const msstab[] = {
  * Generate a syncookie.  mssp points to the mss, which is returned
  * rounded down to the value encoded in the cookie.
  */
-__u32 cookie_v4_init_sequence(struct sock *sk, struct sk_buff *skb, __u16 *mssp)
+__u32 __cookie_v4_init_sequence(__be32 saddr, __be32 daddr, __be16 sport,
+				__be16 dport, __u32 seq, __u16 *mssp)
 {
-	const struct iphdr *iph = ip_hdr(skb);
-	const struct tcphdr *th = tcp_hdr(skb);
 	int mssind;
 	const __u16 mss = *mssp;
 
-	tcp_synq_overflow(sk);
-
 	for (mssind = ARRAY_SIZE(msstab) - 1; mssind ; mssind--)
 		if (mss >= msstab[mssind])
 			break;
 	*mssp = msstab[mssind];
 
-	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_SYNCOOKIESSENT);
-
-	return secure_tcp_syn_cookie(iph->saddr, iph->daddr,
-				     th->source, th->dest, ntohl(th->seq),
+	return secure_tcp_syn_cookie(saddr, daddr, sport, dport, seq,
 				     jiffies / (HZ * 60), mssind);
 }
+EXPORT_SYMBOL(__cookie_v4_init_sequence);
 
 /*
  * This (misnamed) value is the age of syncookie which is permitted.
@@ -191,10 +186,9 @@ __u32 cookie_v4_init_sequence(struct sock *sk, struct sk_buff *skb, __u16 *mssp)
  * Check if a ack sequence number is a valid syncookie.
  * Return the decoded mss if it is, or 0 if not.
  */
-static inline int cookie_check(struct sk_buff *skb, __u32 cookie)
+int cookie_v4_check_sequence(const struct iphdr *iph, const struct tcphdr *th,
+			     __u32 cookie)
 {
-	const struct iphdr *iph = ip_hdr(skb);
-	const struct tcphdr *th = tcp_hdr(skb);
 	__u32 seq = ntohl(th->seq) - 1;
 	__u32 mssind = check_tcp_syn_cookie(cookie, iph->saddr, iph->daddr,
 					    th->source, th->dest, seq,
@@ -203,6 +197,7 @@ static inline int cookie_check(struct sk_buff *skb, __u32 cookie)
 
 	return mssind < ARRAY_SIZE(msstab) ? msstab[mssind] : 0;
 }
+EXPORT_SYMBOL(cookie_v4_check_sequence);
 
 static inline struct sock *get_cookie_sock(struct sock *sk, struct sk_buff *skb,
 					   struct request_sock *req,
@@ -282,7 +277,8 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
 		goto out;
 
 	if (tcp_synq_no_recent_overflow(sk) ||
-	    (mss = cookie_check(skb, cookie)) == 0) {
+	    (mss = cookie_v4_check_sequence(ip_hdr(skb), tcp_hdr(skb),
+					    cookie)) == 0) {
 		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_SYNCOOKIESFAILED);
 		goto out;
 	}
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 8fa32f5..3b094c7 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1332,7 +1332,14 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 		TCP_ECN_create_request(req, tcp_hdr(skb));
 
 	if (want_cookie) {
-		isn = cookie_v4_init_sequence(sk, skb, &req->mss);
+		struct tcphdr *th;
+
+		tcp_synq_overflow(sk);
+		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_SYNCOOKIESSENT);
+		th = tcp_hdr(skb);
+		isn = __cookie_v4_init_sequence(saddr, daddr, th->source,
+						th->dest, ntohl(th->seq),
+						&req->mss);
 		req->cookie_ts = tmp_opt.tstamp_ok;
 	} else if (!isn) {
 		struct inet_peer *peer = NULL;
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index 413ed24..fd8ad8c 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -560,6 +560,23 @@ config NETFILTER_XT_TARGET_SECMARK
 
 	  To compile it as a module, choose M here.  If unsure, say N.
 
+config NETFILTER_XT_TARGET_SYNPROXY
+	tristate '"SYNPROXY" target support (EXPERIMENTAL)'
+	depends on EXPERIMENTAL
+	depends on SYN_COOKIES
+	depends on IP_NF_RAW
+	depends on NF_CONNTRACK
+	depends on NETFILTER_ADVANCED
+	help
+	  The SYNPROXY target allows a raw rule to specify that some TCP
+	  connections are relayed to protect the TCP servers from the SYN-flood
+	  DoS attacks. Syn cookies is used to save the initial state, so no
+	  conntrack is needed until the client side connection is established.
+	  It frees the connection tracking system from creating/deleting
+	  conntracks when SYN-flood DoS attack acts.
+
+	  To compile it as a module, choose M here.  If unsure, say N.
+
 config NETFILTER_XT_TARGET_TCPMSS
 	tristate '"TCPMSS" target support'
 	depends on (IPV6 || IPV6=n)
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index e28420a..4e32834 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -62,6 +62,7 @@ obj-$(CONFIG_NETFILTER_XT_TARGET_TCPOPTSTRIP) += xt_TCPOPTSTRIP.o
 obj-$(CONFIG_NETFILTER_XT_TARGET_TEE) += xt_TEE.o
 obj-$(CONFIG_NETFILTER_XT_TARGET_TRACE) += xt_TRACE.o
 obj-$(CONFIG_NETFILTER_XT_TARGET_IDLETIMER) += xt_IDLETIMER.o
+obj-$(CONFIG_NETFILTER_XT_TARGET_SYNPROXY) += xt_SYNPROXY.o
 
 # matches
 obj-$(CONFIG_NETFILTER_XT_MATCH_CLUSTER) += xt_cluster.o
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 16b41b4..dd85d6f 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -800,6 +800,26 @@ resolve_normal_ct(struct net *net, struct nf_conn *tmpl,
 	return ct;
 }
 
+static inline unsigned int syn_proxy_pre_call(int protonum, struct sk_buff *skb,
+					      struct nf_conn *ct,
+					      enum ip_conntrack_info ctinfo)
+{
+	unsigned int ret = NF_ACCEPT;
+#if defined(CONFIG_NETFILTER_XT_TARGET_SYNPROXY) || \
+    defined(CONFIG_NETFILTER_XT_TARGET_SYNPROXY_MODULE)
+	unsigned int (*syn_proxy)(struct sk_buff *, struct nf_conn *,
+				  enum ip_conntrack_info);
+
+	if (protonum == IPPROTO_TCP) {
+		syn_proxy = rcu_dereference(syn_proxy_pre_hook);
+		if (syn_proxy)
+			ret = syn_proxy(skb, ct, ctinfo);
+	}
+#endif
+
+	return ret;
+}
+
 unsigned int
 nf_conntrack_in(struct net *net, u_int8_t pf, unsigned int hooknum,
 		struct sk_buff *skb)
@@ -855,8 +875,9 @@ nf_conntrack_in(struct net *net, u_int8_t pf, unsigned int hooknum,
 			       l3proto, l4proto, &set_reply, &ctinfo);
 	if (!ct) {
 		/* Not valid part of a connection */
-		NF_CT_STAT_INC_ATOMIC(net, invalid);
-		ret = NF_ACCEPT;
+		ret = syn_proxy_pre_call(protonum, skb, NULL, ctinfo);
+		if (ret == NF_ACCEPT)
+			NF_CT_STAT_INC_ATOMIC(net, invalid);
 		goto out;
 	}
 
@@ -869,6 +890,9 @@ nf_conntrack_in(struct net *net, u_int8_t pf, unsigned int hooknum,
 
 	NF_CT_ASSERT(skb->nfct);
 
+	ret = syn_proxy_pre_call(protonum, skb, ct, ctinfo);
+	if (ret != NF_ACCEPT)
+		goto out;
 	ret = l4proto->packet(ct, skb, dataoff, ctinfo, pf, hooknum);
 	if (ret <= 0) {
 		/* Invalid: inverse of the return code tells
@@ -1476,6 +1500,17 @@ s16 (*nf_ct_nat_offset)(const struct nf_conn *ct,
 			u32 seq);
 EXPORT_SYMBOL_GPL(nf_ct_nat_offset);
 
+#if defined(CONFIG_NETFILTER_XT_TARGET_SYNPROXY) || \
+    defined(CONFIG_NETFILTER_XT_TARGET_SYNPROXY_MODULE)
+unsigned int (*syn_proxy_pre_hook)(struct sk_buff *skb, struct nf_conn *ct,
+				   enum ip_conntrack_info ctinfo);
+EXPORT_SYMBOL(syn_proxy_pre_hook);
+
+unsigned int (*syn_proxy_post_hook)(struct sk_buff *skb, struct nf_conn *ct,
+				    enum ip_conntrack_info ctinfo);
+EXPORT_SYMBOL(syn_proxy_post_hook);
+#endif
+
 int nf_conntrack_init(struct net *net)
 {
 	int ret;
@@ -1496,6 +1531,12 @@ int nf_conntrack_init(struct net *net)
 
 		/* Howto get NAT offsets */
 		rcu_assign_pointer(nf_ct_nat_offset, NULL);
+
+#if defined(CONFIG_NETFILTER_XT_TARGET_SYNPROXY) || \
+    defined(CONFIG_NETFILTER_XT_TARGET_SYNPROXY_MODULE)
+		rcu_assign_pointer(syn_proxy_pre_hook, NULL);
+		rcu_assign_pointer(syn_proxy_post_hook, NULL);
+#endif
 	}
 	return 0;
 
diff --git a/net/netfilter/xt_SYNPROXY.c b/net/netfilter/xt_SYNPROXY.c
new file mode 100644
index 0000000..5e05259
--- /dev/null
+++ b/net/netfilter/xt_SYNPROXY.c
@@ -0,0 +1,678 @@
+/* (C) 2010- Changli Gao <xiaosuo@gmail.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * It bases on ipt_REJECT.c
+ */
+#define pr_fmt(fmt) "SYNPROXY: " fmt
+#include <linux/module.h>
+#include <linux/skbuff.h>
+#include <linux/slab.h>
+#include <linux/ip.h>
+#include <linux/udp.h>
+#include <linux/icmp.h>
+#include <linux/unaligned/access_ok.h>
+#include <net/icmp.h>
+#include <net/ip.h>
+#include <net/tcp.h>
+#include <net/route.h>
+#include <net/dst.h>
+#include <net/netfilter/nf_conntrack.h>
+#include <net/netfilter/nf_conntrack_extend.h>
+#include <linux/netfilter/x_tables.h>
+#include <linux/netfilter_ipv4/ip_tables.h>
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Changli Gao <xiaosuo@gmail.com>");
+MODULE_DESCRIPTION("Xtables: \"SYNPROXY\" target for IPv4");
+MODULE_ALIAS("ipt_SYNPROXY");
+
+enum {
+	TCP_SEND_FLAG_NOTRACE	= 0x1,
+	TCP_SEND_FLAG_SYNCOOKIE	= 0x2,
+	TCP_SEND_FLAG_ACK2SYN	= 0x4,
+};
+
+struct syn_proxy_state {
+	u16	seq_inited;
+	__be16	window;
+	u32	seq_diff;
+};
+
+static int get_mtu(const struct dst_entry *dst)
+{
+	int mtu;
+
+	mtu = dst_mtu(dst);
+	if (mtu)
+		return mtu;
+
+	return dst->dev ? dst->dev->mtu : 0;
+}
+
+static int get_advmss(const struct dst_entry *dst)
+{
+	int advmss;
+
+	advmss = dst_metric(dst, RTAX_ADVMSS);
+	if (advmss)
+		return advmss;
+	advmss = get_mtu(dst);
+	if (advmss)
+		return advmss - (sizeof(struct iphdr) + sizeof(struct tcphdr));
+
+	return TCP_MSS_DEFAULT;
+}
+
+static int syn_proxy_route(struct sk_buff *skb, struct net *net, u16 *pmss)
+{
+	const struct iphdr *iph = ip_hdr(skb);
+	struct rtable *rt;
+	struct flowi fl = {};
+	unsigned int type;
+	int flags = 0;
+	int err;
+	u16 mss;
+
+	type = inet_addr_type(net, iph->saddr);
+	if (type != RTN_LOCAL) {
+		type = inet_addr_type(net, iph->daddr);
+		if (type == RTN_LOCAL)
+			flags |= FLOWI_FLAG_ANYSRC;
+	}
+
+	if (type == RTN_LOCAL) {
+		fl.nl_u.ip4_u.daddr = iph->daddr;
+		fl.nl_u.ip4_u.saddr = iph->saddr;
+		fl.nl_u.ip4_u.tos = RT_TOS(iph->tos);
+		fl.flags = flags;
+		err = ip_route_output_key(net, &rt, &fl);
+		if (err)
+			goto out;
+
+		skb_dst_set(skb, &rt->dst);
+	} else {
+		/* non-local src, find valid iif to satisfy
+		 * rp-filter when calling ip_route_input. */
+		fl.nl_u.ip4_u.daddr = iph->saddr;
+		err = ip_route_output_key(net, &rt, &fl);
+		if (err)
+			goto out;
+
+		err = ip_route_input(skb, iph->daddr, iph->saddr,
+				     RT_TOS(iph->tos), rt->dst.dev);
+		if (err) {
+			dst_release(&rt->dst);
+			goto out;
+		}
+		if (pmss) {
+			mss = get_advmss(&rt->dst);
+			if (*pmss > mss)
+				*pmss = mss;
+		}
+		dst_release(&rt->dst);
+	}
+
+	err = skb_dst(skb)->error;
+	if (!err && pmss) {
+		mss = get_advmss(skb_dst(skb));
+		if (*pmss > mss)
+			*pmss = mss;
+	}
+
+out:
+	return err;
+}
+
+static int tcp_send(__be32 src, __be32 dst, __be16 sport, __be16 dport,
+		    u32 seq, u32 ack_seq, __be16 window, u16 mss, u8 tcp_flags,
+		    u8 tos, struct net_device *dev, int flags,
+		    struct sk_buff *oskb)
+{
+	struct sk_buff *skb;
+	struct iphdr *iph;
+	struct tcphdr *th;
+	int err, len;
+
+	len = sizeof(*th);
+	if (mss)
+		len += TCPOLEN_MSS;
+
+	skb = NULL;
+	/* caller must give me a large enough oskb */
+	if (oskb) {
+		unsigned char *odata = oskb->data;
+
+		if (skb_recycle_check(oskb, 0)) {
+			oskb->data = odata;
+			skb_reset_tail_pointer(oskb);
+			skb = oskb;
+			pr_debug("recycle skb\n");
+		}
+	}
+	if (!skb) {
+		skb = alloc_skb(LL_MAX_HEADER + sizeof(*iph) + len, GFP_ATOMIC);
+		if (!skb) {
+			err = -ENOMEM;
+			goto out;
+		}
+		skb_reserve(skb, LL_MAX_HEADER);
+	}
+
+	skb_reset_network_header(skb);
+	if (!(flags & TCP_SEND_FLAG_ACK2SYN) || skb != oskb) {
+		iph = (struct iphdr *)skb_put(skb, sizeof(*iph));
+		iph->version	= 4;
+		iph->ihl	= sizeof(*iph) / 4;
+		iph->tos	= tos;
+		/* tot_len is set in ip_local_out() */
+		iph->id		= 0;
+		iph->frag_off	= htons(IP_DF);
+		iph->protocol	= IPPROTO_TCP;
+		iph->saddr	= src;
+		iph->daddr	= dst;
+		th = (struct tcphdr *)skb_put(skb, len);
+		th->source	= sport;
+		th->dest	= dport;
+	} else {
+		iph = (struct iphdr *)skb->data;
+		iph->id		= 0;
+		iph->frag_off	= htons(IP_DF);
+		skb_put(skb, iph->ihl * 4 + len);
+		th = (struct tcphdr *)(skb->data + iph->ihl * 4);
+	}
+
+	th->seq		= htonl(seq);
+	th->ack_seq	= htonl(ack_seq);
+	tcp_flag_byte(th) = tcp_flags;
+	th->doff	= len / 4;
+	th->window	= window;
+	th->urg_ptr	= 0;
+
+	if ((flags & TCP_SEND_FLAG_SYNCOOKIE) && mss)
+		err = syn_proxy_route(skb, dev_net(dev), &mss);
+	else
+		err = syn_proxy_route(skb, dev_net(dev), NULL);
+	if (err)
+		goto err_out;
+
+	if ((flags & TCP_SEND_FLAG_SYNCOOKIE)) {
+		if (mss) {
+			th->seq = htonl(__cookie_v4_init_sequence(dst, src,
+								  dport, sport,
+								  ack_seq - 1,
+								  &mss));
+		} else {
+			mss = TCP_MSS_DEFAULT;
+			th->seq = htonl(__cookie_v4_init_sequence(dst, src,
+								  dport, sport,
+								  ack_seq - 1,
+								  &mss));
+			mss = 0;
+		}
+	}
+
+	if (mss)
+		* (__force __be32 *)(th + 1) = htonl((TCPOPT_MSS << 24) |
+						     (TCPOLEN_MSS << 16) |
+						     mss);
+	skb->ip_summed = CHECKSUM_PARTIAL;
+	th->check = ~tcp_v4_check(len, src, dst, 0);
+	skb->csum_start = (unsigned char *)th - skb->head;
+	skb->csum_offset = offsetof(struct tcphdr, check);
+
+	if (!(flags & TCP_SEND_FLAG_ACK2SYN) || skb != oskb)
+		iph->ttl	= dst_metric(skb_dst(skb), RTAX_HOPLIMIT);
+
+	if (skb->len > get_mtu(skb_dst(skb))) {
+		if (printk_ratelimit())
+			pr_warning("%s has smaller mtu: %d\n",
+				   skb_dst(skb)->dev->name,
+				   get_mtu(skb_dst(skb)));
+		err = -EINVAL;
+		goto err_out;
+	}
+
+	if ((flags & TCP_SEND_FLAG_NOTRACE)) {
+		skb->nfct = &nf_ct_untracked_get()->ct_general;
+		skb->nfctinfo = IP_CT_NEW;
+		nf_conntrack_get(skb->nfct);
+	}
+
+	pr_debug("ip_local_out: %pI4n:%hu -> %pI4n:%hu (seq=%u, "
+		 "ack_seq=%u mss=%hu flags=%hhx)\n", &src, ntohs(th->source),
+		 &dst, ntohs(th->dest), ntohl(th->seq), ack_seq, mss,
+		 tcp_flags);
+
+	err = ip_local_out(skb);
+	if (err > 0)
+		err = net_xmit_errno(err);
+
+	pr_debug("ip_local_out: return with %d\n", err);
+out:
+	if (oskb && oskb != skb)
+		kfree_skb(oskb);
+
+	return err;
+
+err_out:
+	kfree_skb(skb);
+	goto out;
+}
+
+static int get_mss(u8 *data, int len)
+{
+	u8 olen;
+
+	while (len >= TCPOLEN_MSS) {
+		switch (data[0]) {
+		case TCPOPT_EOL:
+			return 0;
+		case TCPOPT_NOP:
+			data++;
+			len--;
+			break;
+		case TCPOPT_MSS:
+			if (data[1] != TCPOLEN_MSS)
+				return -EINVAL;
+			return get_unaligned_be16(data + 2);
+		default:
+			olen = data[1];
+			if (olen < 2 || olen > len)
+				return -EINVAL;
+			data += olen;
+			len -= olen;
+			break;
+		}
+	}
+
+	return 0;
+}
+
+static DEFINE_PER_CPU(struct syn_proxy_state, syn_proxy_state);
+
+/* syn_proxy_pre isn't under the protection of nf_conntrack_proto_tcp.c */
+static unsigned int syn_proxy_pre(struct sk_buff *skb, struct nf_conn *ct,
+				  enum ip_conntrack_info ctinfo)
+{
+	struct syn_proxy_state *state;
+	struct iphdr *iph;
+	struct tcphdr *th, _th;
+
+	/* only support IPv4 now */
+	iph = ip_hdr(skb);
+	if (iph->version != 4)
+		return NF_ACCEPT;
+
+	th = skb_header_pointer(skb, iph->ihl * 4, sizeof(_th), &_th);
+	if (th == NULL)
+		return NF_DROP;
+
+	if (!ct || !nf_ct_is_confirmed(ct)) {
+		int ret;
+
+		if (!th->syn && th->ack) {
+			u16 mss;
+			struct sk_buff *rec_skb;
+
+			mss = cookie_v4_check_sequence(iph, th,
+						       ntohl(th->ack_seq) - 1);
+			if (!mss)
+				return NF_ACCEPT;
+
+			pr_debug("%pI4n:%hu -> %pI4n:%hu(mss=%hu)\n",
+				 &iph->saddr, ntohs(th->source),
+				 &iph->daddr, ntohs(th->dest), mss);
+
+			if (skb_tailroom(skb) < TCPOLEN_MSS &&
+			    skb->len < iph->ihl * 4 + sizeof(*th) + TCPOLEN_MSS)
+				rec_skb = NULL;
+			else
+				rec_skb = skb;
+
+			local_bh_disable();
+			state = &__get_cpu_var(syn_proxy_state);
+			state->seq_inited = 1;
+			state->window = th->window;
+			state->seq_diff = ntohl(th->ack_seq) - 1;
+			if (rec_skb)
+				tcp_send(iph->saddr, iph->daddr, 0, 0,
+					 ntohl(th->seq) - 1, 0, th->window,
+					 mss, TCPHDR_SYN, 0, skb->dev,
+					 TCP_SEND_FLAG_ACK2SYN, rec_skb);
+			else
+				tcp_send(iph->saddr, iph->daddr, th->source,
+					 th->dest, ntohl(th->seq) - 1, 0,
+					 th->window, mss, TCPHDR_SYN,
+					 iph->tos, skb->dev, 0, NULL);
+			state->seq_inited = 0;
+			local_bh_enable();
+
+			if (!rec_skb)
+				kfree_skb(skb);
+
+			return NF_STOLEN;
+		}
+
+		if (!ct || !th->syn || th->ack)
+			return NF_ACCEPT;
+
+		ret = NF_ACCEPT;
+		local_bh_disable();
+		state = &__get_cpu_var(syn_proxy_state);
+		if (state->seq_inited) {
+			struct syn_proxy_state *nstate;
+
+			nstate = nf_ct_ext_add(ct, NF_CT_EXT_SYNPROXY,
+					       GFP_ATOMIC);
+			if (nstate != NULL) {
+				nstate->seq_inited = 0;
+				nstate->window = state->window;
+				nstate->seq_diff = state->seq_diff;
+				pr_debug("seq_diff: %u\n", nstate->seq_diff);
+			} else {
+				ret = NF_DROP;
+			}
+		}
+		local_bh_enable();
+
+		return ret;
+	}
+
+	state = nf_ct_ext_find(ct, NF_CT_EXT_SYNPROXY);
+	if (!state)
+		return NF_ACCEPT;
+
+	if (CTINFO2DIR(ctinfo) == IP_CT_DIR_ORIGINAL) {
+		__be32 newack;
+
+		/* don't need to mangle duplicate SYN packets */
+		if (th->syn && !th->ack)
+			return NF_ACCEPT;
+		if (!skb_make_writable(skb, ip_hdrlen(skb) + sizeof(*th)))
+			return NF_DROP;
+		th = (struct tcphdr *)(skb->data + ip_hdrlen(skb));
+		newack = htonl(ntohl(th->ack_seq) - state->seq_diff);
+		inet_proto_csum_replace4(&th->check, skb, th->ack_seq, newack,
+					 0);
+		pr_debug("alter ack seq: %u -> %u\n",
+			 ntohl(th->ack_seq), ntohl(newack));
+		th->ack_seq = newack;
+	} else {
+		/* Simultaneous open ? Oh, no. The connection between
+		 * client and us is established. */
+		if (th->syn && !th->ack)
+			return NF_DROP;
+	}
+
+	return NF_ACCEPT;
+}
+
+static unsigned int syn_proxy_mangle_pkt(struct sk_buff *skb, struct iphdr *iph,
+					 struct tcphdr *th, u32 seq_diff)
+{
+	__be32 new;
+	int olen;
+
+	if (skb->len < (iph->ihl + th->doff) * 4)
+		return NF_DROP;
+	if (!skb_make_writable(skb, (iph->ihl + th->doff) * 4))
+		return NF_DROP;
+	iph = (struct iphdr *)(skb->data);
+	th = (struct tcphdr *)(skb->data + iph->ihl * 4);
+
+	new = tcp_flag_word(th) & (~TCP_FLAG_SYN);
+	inet_proto_csum_replace4(&th->check, skb, tcp_flag_word(th), new, 0);
+	tcp_flag_word(th) = new;
+
+	new = htonl(ntohl(th->seq) + seq_diff);
+	inet_proto_csum_replace4(&th->check, skb, th->seq, new, 0);
+	pr_debug("alter seq: %u -> %u\n", ntohl(th->seq), ntohl(new));
+	th->seq = new;
+
+	olen = th->doff - sizeof(*th) / 4;
+	if (olen) {
+		__be32 *opt;
+
+		opt = (__force __be32 *)(th + 1);
+#define TCPOPT_EOL_WORD ((TCPOPT_EOL << 24) + (TCPOPT_EOL << 16) + \
+			 (TCPOPT_EOL << 8) + TCPOPT_EOL)
+		inet_proto_csum_replace4(&th->check, skb, *opt, TCPOPT_EOL_WORD,
+					 0);
+		*opt = TCPOPT_EOL_WORD;
+	}
+
+	return NF_ACCEPT;
+}
+
+static unsigned int syn_proxy_post(struct sk_buff *skb, struct nf_conn *ct,
+				   enum ip_conntrack_info ctinfo)
+{
+	struct syn_proxy_state *state;
+	struct iphdr *iph;
+	struct tcphdr *th;
+
+	/* untraced packets don't have NF_CT_EXT_SYNPROXY ext, as they don't
+	 * enter syn_proxy_pre() */
+	state = nf_ct_ext_find(ct, NF_CT_EXT_SYNPROXY);
+	if (state == NULL)
+		return NF_ACCEPT;
+
+	iph = ip_hdr(skb);
+	if (!skb_make_writable(skb, iph->ihl * 4 + sizeof(*th)))
+		return NF_DROP;
+	th = (struct tcphdr *)(skb->data + iph->ihl * 4);
+	if (!state->seq_inited) {
+		if (th->syn) {
+			/* It must be from original direction, as the ones
+			 * from the other side are dropped in function
+			 * syn_proxy_pre() */
+			if (!th->ack)
+				return NF_ACCEPT;
+
+			pr_debug("SYN-ACK %pI4n:%hu -> %pI4n:%hu "
+				 "(seq=%u ack_seq=%u)\n",
+				 &iph->saddr, ntohs(th->source), &iph->daddr,
+				 ntohs(th->dest), ntohl(th->seq),
+				 ntohl(th->ack_seq));
+
+			/* SYN-ACK from reply direction with the protection
+			 * of conntrack */
+			spin_lock_bh(&ct->lock);
+			if (!state->seq_inited) {
+				state->seq_inited = 1;
+				pr_debug("update seq_diff %u -> %u\n",
+					 state->seq_diff,
+					 state->seq_diff - ntohl(th->seq));
+				state->seq_diff -= ntohl(th->seq);
+			}
+			spin_unlock_bh(&ct->lock);
+			tcp_send(iph->daddr, iph->saddr, th->dest, th->source,
+				 ntohl(th->ack_seq),
+				 ntohl(th->seq) + 1 + state->seq_diff,
+				 state->window, 0, TCPHDR_ACK, iph->tos,
+				 skb->dev, 0, NULL);
+
+			return syn_proxy_mangle_pkt(skb, iph, th,
+						    state->seq_diff + 1);
+		} else {
+			__be32 newseq;
+
+			if (!th->rst)
+				return NF_ACCEPT;
+			newseq = htonl(state->seq_diff + 1);
+			inet_proto_csum_replace4(&th->check, skb, th->seq,
+						 newseq, 0);
+			pr_debug("alter RST seq: %u -> %u\n",
+				 ntohl(th->seq), ntohl(newseq));
+			th->seq = newseq;
+
+			return NF_ACCEPT;
+		}
+	}
+
+	/* ct should be in ESTABLISHED state, but if the ack packets from
+	 * us are lost. */
+	if (th->syn) {
+		if (!th->ack)
+			return NF_ACCEPT;
+
+		tcp_send(iph->daddr, iph->saddr, th->dest, th->source,
+			 ntohl(th->ack_seq),
+			 ntohl(th->seq) + 1 + state->seq_diff,
+			 state->window, 0, TCPHDR_ACK, iph->tos,
+			 skb->dev, 0, NULL);
+
+		return syn_proxy_mangle_pkt(skb, iph, th, state->seq_diff + 1);
+	}
+
+	if (CTINFO2DIR(ctinfo) == IP_CT_DIR_REPLY) {
+		__be32 newseq;
+
+		newseq = htonl(ntohl(th->seq) + state->seq_diff);
+		inet_proto_csum_replace4(&th->check, skb, th->seq, newseq, 0);
+		pr_debug("alter seq: %u -> %u\n", ntohl(th->seq),
+			 ntohl(newseq));
+		th->seq = newseq;
+	}
+
+	return NF_ACCEPT;
+}
+
+static unsigned int tcp_process(struct sk_buff *skb)
+{
+	const struct iphdr *iph;
+	const struct tcphdr *th;
+	int err;
+	u16 mss;
+
+	iph = ip_hdr(skb);
+	if (iph->frag_off & htons(IP_OFFSET))
+		goto out;
+	if (!pskb_may_pull(skb, iph->ihl * 4 + sizeof(*th)))
+		goto out;
+	th = (const struct tcphdr *)(skb->data + iph->ihl * 4);
+	if ((tcp_flag_byte(th) &
+	     (TCPHDR_FIN | TCPHDR_RST | TCPHDR_ACK | TCPHDR_SYN)) != TCPHDR_SYN)
+		goto out;
+
+	if (nf_ip_checksum(skb, NF_INET_PRE_ROUTING, iph->ihl * 4, IPPROTO_TCP))
+		goto out;
+	mss = 0;
+	if (th->doff > sizeof(*th) / 4) {
+		if (!pskb_may_pull(skb, (iph->ihl + th->doff) * 4))
+			goto out;
+		err = get_mss((u8 *)(th + 1), th->doff * 4 - sizeof(*th));
+		if (err < 0)
+			goto out;
+		if (err != 0)
+			mss = err;
+	} else if (th->doff != sizeof(*th) / 4)
+		goto out;
+
+	tcp_send(iph->daddr, iph->saddr, th->dest, th->source, 0,
+		 ntohl(th->seq) + 1, 0, mss, TCPHDR_SYN | TCPHDR_ACK,
+		 iph->tos, skb->dev,
+		 TCP_SEND_FLAG_NOTRACE | TCP_SEND_FLAG_SYNCOOKIE, skb);
+
+	return NF_STOLEN;
+
+out:
+	return NF_DROP;
+}
+
+static unsigned int synproxy_tg(struct sk_buff *skb,
+				const struct xt_action_param *par)
+{
+	struct nf_conn *ct;
+	enum ip_conntrack_info ctinfo;
+	int ret;
+
+	/* received from lo */
+	ct = nf_ct_get(skb, &ctinfo);
+	if (ct)
+		return IPT_CONTINUE;
+
+	local_bh_disable();
+	if (!__get_cpu_var(syn_proxy_state).seq_inited)
+		ret = tcp_process(skb);
+	else
+		ret = IPT_CONTINUE;
+	local_bh_enable();
+
+	return ret;
+}
+
+static int synproxy_tg_check(const struct xt_tgchk_param *par)
+{
+	int ret;
+
+	ret = nf_ct_l3proto_try_module_get(par->family);
+	if (ret < 0)
+		pr_info("cannot load conntrack support for proto=%u\n",
+			par->family);
+
+	return ret;
+}
+
+static void synproxy_tg_destroy(const struct xt_tgdtor_param *par)
+{
+	nf_ct_l3proto_module_put(par->family);
+}
+
+static struct xt_target synproxy_tg_reg __read_mostly = {
+	.name		= "SYNPROXY",
+	.family		= NFPROTO_IPV4,
+	.target		= synproxy_tg,
+	.table		= "raw",
+	.hooks		= 1 << NF_INET_PRE_ROUTING,
+	.proto		= IPPROTO_TCP,
+	.checkentry	= synproxy_tg_check,
+	.destroy	= synproxy_tg_destroy,
+	.me		= THIS_MODULE,
+};
+
+static struct nf_ct_ext_type syn_proxy_state_ext __read_mostly = {
+	.len	= sizeof(struct syn_proxy_state),
+	.align	= __alignof__(struct syn_proxy_state),
+	.id	= NF_CT_EXT_SYNPROXY,
+};
+
+static int __init synproxy_tg_init(void)
+{
+	int err;
+
+	rcu_assign_pointer(syn_proxy_pre_hook, syn_proxy_pre);
+	rcu_assign_pointer(syn_proxy_post_hook, syn_proxy_post);
+	err = nf_ct_extend_register(&syn_proxy_state_ext);
+	if (err)
+		goto err_out;
+	err = xt_register_target(&synproxy_tg_reg);
+	if (err)
+		goto err_out2;
+
+	return err;
+
+err_out2:
+	nf_ct_extend_unregister(&syn_proxy_state_ext);
+err_out:
+	rcu_assign_pointer(syn_proxy_post_hook, NULL);
+	rcu_assign_pointer(syn_proxy_pre_hook, NULL);
+	rcu_barrier();
+
+	return err;
+}
+
+static void __exit synproxy_tg_exit(void)
+{
+	xt_unregister_target(&synproxy_tg_reg);
+	nf_ct_extend_unregister(&syn_proxy_state_ext);
+	rcu_assign_pointer(syn_proxy_post_hook, NULL);
+	rcu_assign_pointer(syn_proxy_pre_hook, NULL);
+	rcu_barrier();
+}
+
+module_init(synproxy_tg_init);
+module_exit(synproxy_tg_exit);

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox