Netdev List

* [PATCH 1/2] 8139cp: set ring address before enabling receiver
From: Jason Wang @ 2012-06-01  4:19 UTC (permalink / raw)
  To: netdev, davem, linux-kernel; +Cc: mst

Currently, we enable the receiver before setting the ring address, which could
lead the card to DMA into unexpected areas. Solve this by setting the ring address
before enabling the receiver.

btw, I found and tested this in qemu, as I don't have an 8139cp card at hand. Please
review it carefully.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/ethernet/realtek/8139cp.c |   22 +++++++++++-----------
 1 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/realtek/8139cp.c b/drivers/net/ethernet/realtek/8139cp.c
index 5eef290..7f08779 100644
--- a/drivers/net/ethernet/realtek/8139cp.c
+++ b/drivers/net/ethernet/realtek/8139cp.c
@@ -979,6 +979,17 @@ static void cp_init_hw (struct cp_private *cp)
 	cpw32_f (MAC0 + 0, le32_to_cpu (*(__le32 *) (dev->dev_addr + 0)));
 	cpw32_f (MAC0 + 4, le32_to_cpu (*(__le32 *) (dev->dev_addr + 4)));
 
+	cpw32_f(HiTxRingAddr, 0);
+	cpw32_f(HiTxRingAddr + 4, 0);
+
+	ring_dma = cp->ring_dma;
+	cpw32_f(RxRingAddr, ring_dma & 0xffffffff);
+	cpw32_f(RxRingAddr + 4, (ring_dma >> 16) >> 16);
+
+	ring_dma += sizeof(struct cp_desc) * CP_RX_RING_SIZE;
+	cpw32_f(TxRingAddr, ring_dma & 0xffffffff);
+	cpw32_f(TxRingAddr + 4, (ring_dma >> 16) >> 16);
+
 	cp_start_hw(cp);
 	cpw8(TxThresh, 0x06); /* XXX convert magic num to a constant */
 
@@ -992,17 +1003,6 @@ static void cp_init_hw (struct cp_private *cp)
 
 	cpw8(Config5, cpr8(Config5) & PMEStatus);
 
-	cpw32_f(HiTxRingAddr, 0);
-	cpw32_f(HiTxRingAddr + 4, 0);
-
-	ring_dma = cp->ring_dma;
-	cpw32_f(RxRingAddr, ring_dma & 0xffffffff);
-	cpw32_f(RxRingAddr + 4, (ring_dma >> 16) >> 16);
-
-	ring_dma += sizeof(struct cp_desc) * CP_RX_RING_SIZE;
-	cpw32_f(TxRingAddr, ring_dma & 0xffffffff);
-	cpw32_f(TxRingAddr + 4, (ring_dma >> 16) >> 16);
-
 	cpw16(MultiIntr, 0);
 
 	cpw8_f(Cfg9346, Cfg9346_Lock);

^ permalink raw reply related
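
The ring-address writes moved by the patch above split one DMA address into two 32-bit register values. Below is a minimal userspace sketch of that split, assuming a 64-bit address type; the helper names (`ring_addr_lo`, `ring_addr_hi`, `tx_ring_addr`) are hypothetical and do not exist in 8139cp.c.

```c
#include <stdint.h>

/* Hypothetical helpers, not taken from 8139cp.c: they mirror the
 * arithmetic used for the RxRingAddr/TxRingAddr register writes. */

/* Low 32 bits, written to RxRingAddr (or TxRingAddr). */
static uint32_t ring_addr_lo(uint64_t ring_dma)
{
	return (uint32_t)(ring_dma & 0xffffffff);
}

/* High 32 bits, written to RxRingAddr + 4 (or TxRingAddr + 4).
 * The patch shifts twice by 16: if the address type were only
 * 32 bits wide, a single ">> 32" would be undefined behaviour in C,
 * whereas two 16-bit shifts are well defined and simply yield 0. */
static uint32_t ring_addr_hi(uint64_t ring_dma)
{
	return (uint32_t)((ring_dma >> 16) >> 16);
}

/* The TX descriptors follow the RX descriptors in the same DMA
 * allocation, so the TX ring starts after rx_ring_size descriptors. */
static uint64_t tx_ring_addr(uint64_t ring_dma, uint64_t desc_size,
			     uint64_t rx_ring_size)
{
	return ring_dma + desc_size * rx_ring_size;
}
```

The point of the patch is ordering only: these values have to reach the card before the receiver is enabled, otherwise the card may DMA to whatever address the registers held before.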

* Re: [V2 PATCH] net: sock: validate data_len before allocating skb in sock_alloc_send_pskb()
From: Jason Wang @ 2012-06-01  3:09 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, edumazet, mst, linux-kernel, stable
In-Reply-To: <20120531.182145.119572313886189417.davem@davemloft.net>

On 06/01/2012 06:21 AM, David Miller wrote:
> From: Jason Wang<jasowang@redhat.com>
> Date: Thu, 31 May 2012 15:18:10 +0800
>
>> We need to validate the number of pages consumed by data_len, otherwise the frags
>> array could be overflowed by userspace. So this patch validates data_len and
>> returns -EMSGSIZE when data_len may occupy more frags than MAX_SKB_FRAGS.
>>
>> Signed-off-by: Jason Wang<jasowang@redhat.com>
> Applied and queued up for -stable.
>
> Please do not add explicit stable CC:'s to networking patches, I queue
> appropriate changes up myself, and submit them only when I feel that
> the change has had sufficient exposure and testing in Linus's tree.

Sure, I will pay attention next time.

^ permalink raw reply
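
The check described in the quoted commit message can be sketched as follows. This is a userspace illustration, not the patch itself (the patch body is not quoted in this thread); the constants `EX_PAGE_SHIFT` and `EX_MAX_SKB_FRAGS` and the function `ex_check_data_len` are assumptions chosen for the example.

```c
#include <errno.h>

/* Illustrative values only; the kernel derives the real ones from
 * its configuration (PAGE_SHIFT, MAX_SKB_FRAGS). */
#define EX_PAGE_SHIFT    12
#define EX_PAGE_SIZE     (1UL << EX_PAGE_SHIFT)
#define EX_MAX_SKB_FRAGS 17UL

/* Return 0 if data_len fits into the frags array, -EMSGSIZE if it
 * would need more page fragments than the array can hold.
 * Note: this sketch does not guard against data_len values so large
 * that the rounding addition itself wraps around. */
static int ex_check_data_len(unsigned long data_len)
{
	/* Round up to whole pages: each page occupies one frag slot. */
	unsigned long npages = (data_len + EX_PAGE_SIZE - 1) >> EX_PAGE_SHIFT;

	if (npages > EX_MAX_SKB_FRAGS)
		return -EMSGSIZE;
	return 0;
}
```

Doing the check before any allocation means an oversized request from userspace is rejected without touching the frags array at all.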

* Fw: [Bug 43327] New: IP routing: cached route is applied to wrong network interface
From: Stephen Hemminger @ 2012-06-01  0:46 UTC (permalink / raw)
  To: netdev



Begin forwarded message:

Date: Fri,  1 Jun 2012 00:18:37 +0000 (UTC)
From: bugzilla-daemon@bugzilla.kernel.org
To: shemminger@linux-foundation.org
Subject: [Bug 43327] New: IP routing: cached route is applied to wrong network interface


https://bugzilla.kernel.org/show_bug.cgi?id=43327

           Summary: IP routing: cached route is applied to wrong network
                    interface
           Product: Networking
           Version: 2.5
    Kernel Version: 3.1.10
          Platform: All
        OS/Version: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: IPV4
        AssignedTo: shemminger@linux-foundation.org
        ReportedBy: dschnabel@appneta.com
        Regression: No


IP routing: cached route is applied to wrong network interface


Dynamic route changes like ICMP redirects are cached in the cache routing table
of the kernel. This cache table can be displayed using the command "route -nC"
or "ip route show cache".
Routes in this table are used before checking the Routing Policy Database
(RPDB). In a certain use case a wrong route entry is created in the cache
table.

This is my network setup:
* the Linux machine has 2 network interfaces (eth0 and eth1) with IP addresses
in different subnets
** eth0: 172.16.124.217/24 (Subnet A)
** eth1: 172.16.128.219/24 (Subnet B)
* IP rules to accomplish two default gateways
** root@myBox:~# ip rule show
0:      from all lookup local
32764:  from 172.16.128.219 lookup E1
32765:  from 172.16.124.217 lookup E0
32766:  from all lookup main
32767:  from all lookup default
** root@myBox:~# ip route show table E0
default via 172.16.124.254 dev eth0
** root@myBox:~# ip route show table E1
default via 172.16.128.254 dev eth1
* Both gateways are connected to Subnet C

This is what it looks like:

    ************                                                                  #      ************
    * Subnet A *                                                                  #      * Subnet C *
    ************             +-------------------+      +-------------------+     #      ************
                             |                   |      |                   |     #
         +-------------------+ GW 172.16.124.254 +------+ GW 172.16.124.18  +-----#---------------+
         | 172.16.124.217    |                   |      |                   |     #               |
  +------+--------+          +-------------------+      +---------+---------+     #               |
  |     eth0      |                                               |               #      +--------+----------+
  |               |                                               |               #      |      Target       |
  | Linux Machine |                                      ##################       #      |   IP 10.20.2.252  |
  |               |                                               |               #      +--------+----------+
  |     eth1      |                                               |               #
  +------+--------+          +-------------------+                |               #
         | 172.16.128.219    |                   |                |               #
         +-------------------+ GW 172.16.128.254 +----------------+               #
                             |                   |                                #
    ************             +-------------------+                                #
    * Subnet B *                                                                  #
    ************                                                                  #



I can ping the target from both interfaces:
ping 10.20.2.252 -I 172.16.124.217
ping 10.20.2.252 -I 172.16.128.219

When pinging from eth0 (172.16.124.217), the gateway 172.16.124.254 returns a
redirect to gateway 172.16.124.18, since it is in the same network:
root@myBox:~# ping 10.20.2.252 -I 172.16.124.217
PING 10.20.2.252 (10.20.2.252) from 172.16.124.217 : 56(84) bytes of data.
64 bytes from 10.20.2.252: icmp_seq=1 ttl=63 time=81.4 ms
From 172.16.124.254: icmp_seq=1 Redirect Host(New nexthop: 172.16.124.18)
64 bytes from 10.20.2.252: icmp_seq=2 ttl=63 time=0.277 ms
64 bytes from 10.20.2.252: icmp_seq=3 ttl=63 time=0.238 ms
64 bytes from 10.20.2.252: icmp_seq=4 ttl=63 time=0.236 ms

And this redirect will create a new entry in the cache table:
root@myBox:~# route -nC | grep 172.16.124.18
172.16.124.217  10.20.2.252     172.16.124.18         0      0        2 eth0

So far so good. Here comes the problem.

When I now ping the same target from eth1 (172.16.128.219), it no longer
works:
root@myBox:~# ping 10.20.2.252 -I 172.16.128.219
PING 10.20.2.252 (10.20.2.252) from 172.16.128.219 : 56(84) bytes of data.
From 172.16.128.219 icmp_seq=2 Destination Host Unreachable
From 172.16.128.219 icmp_seq=3 Destination Host Unreachable
From 172.16.128.219 icmp_seq=4 Destination Host Unreachable
^C
--- 10.20.2.252 ping statistics ---
4 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2999ms

I check the cache table and notice another entry:
root@portwell19:~# route -nC | grep 172.16.124.18
172.16.124.217  10.20.2.252     172.16.124.18         0      0        2 eth0
172.16.128.219  10.20.2.252     172.16.124.18         0      0        7 eth1

That means eth1 is now trying to reach 10.20.2.252 using the gateway
172.16.124.18. It's obvious that this won't work since eth1 is in a different
subnet.
So the entry in the cache table is wrong. After clearing the cache with "ip
route flush table cache" the ping from eth1 works again.

I did some research:
The cache routing table works on an AVL tree of Internet Peers. Those peers are
stored in a structure called inet_peer (include/net/inetpeer.h). A lookup is
done by the call to inet_getpeer_v4() in net/ipv4/route.c which takes the
destination address (10.20.2.252 in my case) as the first argument. So if the
destination address matches then the peer is returned and saved to the cache
table regardless of the source address.

Two possible fixes I can think of:
* A peer lookup should be done not only by the destination address but also by
the source address (or netmask)
* The inet_peer structure should contain a field for the source address (or
netmask). Then after lookup via inet_getpeer_v4() check the source address (or
netmask) of the returned peer.

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

^ permalink raw reply
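
The lookup problem described in the bug report above, and the first of its two proposed fixes, can be illustrated with a toy cache. This is a sketch under stated assumptions: `struct ex_peer`, `ex_lookup_daddr` and `ex_lookup_pair` are hypothetical and are not the kernel's inet_peer code, which uses an AVL tree rather than a linear table.

```c
#include <stddef.h>
#include <stdint.h>

/* Build a host-order IPv4 address from its four octets. */
#define EX_IP(a, b, c, d) \
	(((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
	 ((uint32_t)(c) << 8) | (uint32_t)(d))

/* Hypothetical cache entry: a redirect learned for one flow. */
struct ex_peer {
	uint32_t saddr; /* source address the redirect was learned on */
	uint32_t daddr; /* destination address */
	uint32_t gw;    /* gateway announced by the ICMP redirect */
};

/* State after the first ping in the report: eth0 learned a redirect. */
static const struct ex_peer ex_cache[] = {
	{ EX_IP(172, 16, 124, 217), EX_IP(10, 20, 2, 252),
	  EX_IP(172, 16, 124, 18) },
};
#define EX_CACHE_LEN (sizeof(ex_cache) / sizeof(ex_cache[0]))

/* Lookup keyed on destination only: the behaviour the report
 * describes, which matches regardless of the source address. */
static const struct ex_peer *ex_lookup_daddr(uint32_t daddr)
{
	for (size_t i = 0; i < EX_CACHE_LEN; i++)
		if (ex_cache[i].daddr == daddr)
			return &ex_cache[i];
	return NULL;
}

/* Lookup keyed on (source, destination): the first proposed fix. */
static const struct ex_peer *ex_lookup_pair(uint32_t saddr, uint32_t daddr)
{
	for (size_t i = 0; i < EX_CACHE_LEN; i++)
		if (ex_cache[i].saddr == saddr && ex_cache[i].daddr == daddr)
			return &ex_cache[i];
	return NULL;
}
```

With the destination-only key, a ping sourced from 172.16.128.219 finds the entry learned on eth0 and inherits the gateway 172.16.124.18, which is unreachable from Subnet B. With the pair key, the same lookup misses and the policy routing tables would be consulted instead.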

* Re: r8169 vpd r/w failed
From: Francois Romieu @ 2012-05-31 23:49 UTC (permalink / raw)
  To: Dave Jones; +Cc: netdev, Hayes Wang, Fedora Kernel Team
In-Reply-To: <20120531174126.GA4189@redhat.com>

Dave Jones <davej@redhat.com> :
[...]
>  We occasionally see reports from r8169 where it seems to hang
> reading vpd data from the NIC. The printk in drivers/pci/
> suggests it's likely a firmware bug on the device, and recommends the user
> contact the vendor for an update (does Realtek even do firmware updates?)
> 
> Is this as the message suggests a firmware bug, or should r8169 be
> avoiding trying to read vpd ?

I have not exercised vpd on the 816x much, but I did experience an
unbounded read with the 8168 while trying to access the vpd back
in March. The output exhibited a periodic pattern, so I got rid of
it with something similar to drivers/pci/quirks.c::quirk_brcm_570x_limit_vpd.

I can resurrect it but I'd rather give the pending softirq stuff a try
first and I must test davandra netif_napi_del patch as well. Btw I have
just upgraded my workstation to F17 and it's time for some sleep.

> https://bugzilla.redhat.com/show_bug.cgi?id=827137 is one example.

(990FX)

Adding a RTL_GIGA_MAC_VER_34 case similar to RTL_GIGA_MAC_VER_24 in
rtl_init_rxcfg may temporarily help if netdev watchdog triggers (see
the mess known as https://bugzilla.kernel.org/show_bug.cgi?id=42899).

-- 
Ueimor

^ permalink raw reply

* Re: [PATCH] cipso: handle CIPSO options correctly when NetLabel is disabled
From: David Miller @ 2012-05-31 23:07 UTC (permalink / raw)
  To: pmoore; +Cc: netdev, linux-security-module, stable
In-Reply-To: <20120531200922.6265.81763.stgit@sifl>

From: Paul Moore <pmoore@redhat.com>
Date: Thu, 31 May 2012 16:09:23 -0400

> When NetLabel is not enabled, e.g. CONFIG_NETLABEL=n, and the system
> receives a CIPSO tagged packet it is dropped (cipso_v4_validate()
> returns non-zero).  In most cases this is the correct and desired
> behavior, however, in the case where we are simply forwarding the
> traffic, e.g. acting as a network bridge, this becomes a problem.
> 
> This patch fixes the forwarding problem by providing the basic CIPSO
> validation code directly in ip_options_compile() without the need for
> the NetLabel or CIPSO code.  The new validation code can not perform
> any of the CIPSO option label/value verification that
> cipso_v4_validate() does, but it can verify the basic CIPSO option
> format.
> 
> The behavior when NetLabel is enabled is unchanged.
> 
> Signed-off-by: Paul Moore <pmoore@redhat.com>

I don't like this at all.

The only conclusion I can come to is that cipso_v4_validate() is doing
the wrong thing when NETLABEL is disabled.

There is never a good reason to crap all over a function with ifdefs.
This is especially true when it's being done to paper over a function
with poor semantics.

The whole idea is to abstract and put all of this kind of logic into
cipso_v4_validate().


^ permalink raw reply

* Re: [PATCH net-next] tcp: avoid tx starvation by SYNACK packets
From: David Miller @ 2012-05-31 23:03 UTC (permalink / raw)
  To: eric.dumazet; +Cc: hans.schillstrom, netdev, ncardwell, therbert, brouer
In-Reply-To: <1338501397.2760.1395.camel@edumazet-glaptop>


Is the net-next tree open yet?

^ permalink raw reply

* Re: [V2 PATCH] net: sock: validate data_len before allocating skb in sock_alloc_send_pskb()
From: David Miller @ 2012-05-31 22:21 UTC (permalink / raw)
  To: jasowang; +Cc: netdev, edumazet, mst, linux-kernel, stable
In-Reply-To: <20120531071809.6392.26677.stgit@amd-6168-8-1.englab.nay.redhat.com>

From: Jason Wang <jasowang@redhat.com>
Date: Thu, 31 May 2012 15:18:10 +0800

> We need to validate the number of pages consumed by data_len, otherwise the frags
> array could be overflowed by userspace. So this patch validates data_len and
> returns -EMSGSIZE when data_len may occupy more frags than MAX_SKB_FRAGS.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>

Applied and queued up for -stable.

Please do not add explicit stable CC:'s to networking patches, I queue
appropriate changes up myself, and submit them only when I feel that
the change has had sufficient exposure and testing in Linus's tree.

^ permalink raw reply

* Re: [PATCH net 3/3] bql: Avoid possible inconsistent calculation.
From: David Miller @ 2012-05-31 22:20 UTC (permalink / raw)
  To: shimoda.hiroaki; +Cc: therbert, eric.dumazet, denys, netdev
In-Reply-To: <20120531072537.920f0cb0.shimoda.hiroaki@gmail.com>

From: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com>
Date: Thu, 31 May 2012 07:25:37 +0900

> dql->num_queued could change while processing dql_completed().
> To keep the calculation consistent, add an on-stack variable.
> 
> Signed-off-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com>

Applied and queued up for -stable.

^ permalink raw reply

* Re: [PATCH net 2/3] bql: Avoid unneeded limit decrement.
From: David Miller @ 2012-05-31 22:20 UTC (permalink / raw)
  To: shimoda.hiroaki; +Cc: therbert, eric.dumazet, denys, netdev
In-Reply-To: <20120531072519.16464513.shimoda.hiroaki@gmail.com>

From: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com>
Date: Thu, 31 May 2012 07:25:19 +0900

> When the pattern below is observed,
> 
>                                                TIME
>        dql_queued()         dql_completed()     |
>       a) initial state                          |
>                                                 |
>       b) X bytes queued                         V
> 
>       c) Y bytes queued
>                            d) X bytes completed
>       e) Z bytes queued
>                            f) Y bytes completed
> 
> a) dql->limit already has some value and there is no in-flight packet.
> b) X bytes queued.
> c) Y bytes queued and the limit is exceeded.
> d) X bytes completed; dql->prev_ovlimit is set and
>    dql->prev_num_queued is set to Y.
> e) Z bytes queued.
> f) Y bytes completed. inprogress and prev_inprogress are true.
> 
> At f), according to the comment, all_prev_completed becomes
> true and limit should be increased. But POSDIFF() ignores the
> (completed == dql->prev_num_queued) case, so limit is decreased.
> 
> Signed-off-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com>

Applied and queued up for -stable.

^ permalink raw reply

* Re: [PATCH net 1/3] bql: Fix POSDIFF() to integer overflow aware.
From: David Miller @ 2012-05-31 22:19 UTC (permalink / raw)
  To: shimoda.hiroaki; +Cc: therbert, eric.dumazet, denys, netdev
In-Reply-To: <20120531072439.6c634a0b.shimoda.hiroaki@gmail.com>

From: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com>
Date: Thu, 31 May 2012 07:24:39 +0900

> POSDIFF() fails to take into account the integer overflow case.
> 
> Signed-off-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com>

Applied and queued up for -stable.

^ permalink raw reply
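
The overflow case mentioned in the commit message above can be shown with free-running 32-bit counters. The patch body is not quoted in this thread, so the following is only a sketch of one common wrap-aware technique; `ex_posdiff_naive` and `ex_posdiff_wrap` are hypothetical names. It assumes the usual two's-complement result when converting an out-of-range unsigned value to `int32_t`.

```c
#include <stdint.h>

/* Naive "positive difference": compares the raw counter values, so it
 * gives the wrong answer once one counter has wrapped past zero. */
static uint32_t ex_posdiff_naive(uint32_t a, uint32_t b)
{
	return a > b ? a - b : 0;
}

/* Wrap-aware version: the unsigned subtraction is computed modulo
 * 2^32, then reinterpreted as signed to decide which counter is
 * "ahead". This is correct as long as the true distance between the
 * two counters is less than 2^31. */
static uint32_t ex_posdiff_wrap(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) > 0 ? a - b : 0;
}
```

For example, with a counter that advanced from 0xfffffffb to 5 (ten bytes later, across the wrap), the naive version reports 0 while the wrap-aware version reports 10.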

* Re: [PATCH net 0/6] batch of mlx4 fixes, mostly to SRIOV
From: David Miller @ 2012-05-31 22:19 UTC (permalink / raw)
  To: yevgenyp; +Cc: netdev, ogerlitz, yevgenyp, jackm
In-Reply-To: <1338405295-15427-1-git-send-email-yevgenyp@mellanox.com>

From: Yevgeny Petrilin <yevgenyp@mellanox.com>
Date: Wed, 30 May 2012 22:14:49 +0300

> Batch of fixes to the mlx4_core and mlx4_en drivers, prepared
> by Jack Morgenstein, who leads our SRIOV development efforts. They
> fix various issues; all except for one relate to the driver's
> SRIOV functionality.

All applied, thanks.

^ permalink raw reply

* [PATCH net-next] tcp: avoid tx starvation by SYNACK packets
From: Eric Dumazet @ 2012-05-31 21:56 UTC (permalink / raw)
  To: Hans Schillstrom
  Cc: netdev, Neal Cardwell, Tom Herbert, Jesper Dangaard Brouer

From: Eric Dumazet <edumazet@google.com>

pfifo_fast being the default Qdisc, it's pretty easy to fill it with
SYNACK (small) packets while the host is under SYNFLOOD attack.

Packets of established TCP sessions are dropped and the host appears almost
dead.

Avoid this problem by assigning TC_PRIO_FILLER priority to SYNACKs
generated in SYNCOOKIE mode, so that these packets are enqueued into
pfifo_fast band 2.

Other packets, queued to band 0 or 1, are dequeued before any SYNACK
packets waiting in band 2.

Reported-by: Hans Schillstrom <hans.schillstrom@ericsson.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
---
 net/dccp/ipv4.c                  |    3 +++
 net/ipv4/ip_output.c             |    2 +-
 net/ipv4/tcp_ipv4.c              |   13 +++++++++----
 net/ipv6/inet6_connection_sock.c |    1 +
 net/ipv6/ip6_output.c            |    2 +-
 net/ipv6/tcp_ipv6.c              |   10 +++++++---
 6 files changed, 22 insertions(+), 9 deletions(-)

diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index 07f5579..d8a3d87 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -515,6 +515,8 @@ static int dccp_v4_send_response(struct sock *sk, struct request_sock *req,
 
 		dh->dccph_checksum = dccp_v4_csum_finish(skb, ireq->loc_addr,
 							      ireq->rmt_addr);
+		
+		skb->priority = sk->sk_priority;
 		err = ip_build_and_send_pkt(skb, sk, ireq->loc_addr,
 					    ireq->rmt_addr,
 					    ireq->opt);
@@ -556,6 +558,7 @@ static void dccp_v4_ctl_send_reset(struct sock *sk, struct sk_buff *rxskb)
 	skb_dst_set(skb, dst_clone(dst));
 
 	bh_lock_sock(ctl_sk);
+	skb->priority = ctl_sk->sk_priority;
 	err = ip_build_and_send_pkt(skb, ctl_sk,
 				    rxiph->daddr, rxiph->saddr, NULL);
 	bh_unlock_sock(ctl_sk);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 451f97c..407e2fc 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -168,7 +168,7 @@ int ip_build_and_send_pkt(struct sk_buff *skb, struct sock *sk,
 		ip_options_build(skb, &opt->opt, daddr, rt, 0);
 	}
 
-	skb->priority = sk->sk_priority;
+	/* skb->priority is set by the caller */
 	skb->mark = sk->sk_mark;
 
 	/* Send it out. */
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index a43b87d..613e713 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -81,7 +81,7 @@
 #include <linux/stddef.h>
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
-
+#include <linux/pkt_sched.h>
 #include <linux/crypto.h>
 #include <linux/scatterlist.h>
 
@@ -824,7 +824,8 @@ static void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
  */
 static int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
 			      struct request_sock *req,
-			      struct request_values *rvp)
+			      struct request_values *rvp,
+			      bool syncookie)
 {
 	const struct inet_request_sock *ireq = inet_rsk(req);
 	struct flowi4 fl4;
@@ -840,6 +841,9 @@ static int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
 	if (skb) {
 		__tcp_v4_send_check(skb, ireq->loc_addr, ireq->rmt_addr);
 
+		/* SYNACK sent in SYNCOOKIE mode have low priority */
+		skb->priority = syncookie ? TC_PRIO_FILLER : sk->sk_priority;
+
 		err = ip_build_and_send_pkt(skb, sk, ireq->loc_addr,
 					    ireq->rmt_addr,
 					    ireq->opt);
@@ -854,7 +858,7 @@ static int tcp_v4_rtx_synack(struct sock *sk, struct request_sock *req,
 			      struct request_values *rvp)
 {
 	TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS);
-	return tcp_v4_send_synack(sk, NULL, req, rvp);
+	return tcp_v4_send_synack(sk, NULL, req, rvp, false);
 }
 
 /*
@@ -1422,7 +1426,8 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 	tcp_rsk(req)->snt_synack = tcp_time_stamp;
 
 	if (tcp_v4_send_synack(sk, dst, req,
-			       (struct request_values *)&tmp_ext) ||
+			       (struct request_values *)&tmp_ext,
+			       want_cookie) ||
 	    want_cookie)
 		goto drop_and_free;
 
diff --git a/net/ipv6/inet6_connection_sock.c b/net/ipv6/inet6_connection_sock.c
index e6cee52..5812a74 100644
--- a/net/ipv6/inet6_connection_sock.c
+++ b/net/ipv6/inet6_connection_sock.c
@@ -248,6 +248,7 @@ int inet6_csk_xmit(struct sk_buff *skb, struct flowi *fl_unused)
 	/* Restore final destination back after routing done */
 	fl6.daddr = np->daddr;
 
+	skb->priority = sk->sk_priority;
 	res = ip6_xmit(sk, skb, &fl6, np->opt, np->tclass);
 	rcu_read_unlock();
 	return res;
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 17b8c67..61c0ea8 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -241,7 +241,7 @@ int ip6_xmit(struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
 	hdr->saddr = fl6->saddr;
 	hdr->daddr = *first_hop;
 
-	skb->priority = sk->sk_priority;
+	/* skb->priority is set by the caller */
 	skb->mark = sk->sk_mark;
 
 	mtu = dst_mtu(dst);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 554d599..b618413 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -43,6 +43,7 @@
 #include <linux/ipv6.h>
 #include <linux/icmpv6.h>
 #include <linux/random.h>
+#include <linux/pkt_sched.h>
 
 #include <net/tcp.h>
 #include <net/ndisc.h>
@@ -476,7 +477,7 @@ out:
 
 
 static int tcp_v6_send_synack(struct sock *sk, struct request_sock *req,
-			      struct request_values *rvp)
+			      struct request_values *rvp, bool syncookie)
 {
 	struct inet6_request_sock *treq = inet6_rsk(req);
 	struct ipv6_pinfo *np = inet6_sk(sk);
@@ -512,6 +513,7 @@ static int tcp_v6_send_synack(struct sock *sk, struct request_sock *req,
 	if (skb) {
 		__tcp_v6_send_check(skb, &treq->loc_addr, &treq->rmt_addr);
 
+		skb->priority = syncookie ? TC_PRIO_FILLER : sk->sk_priority;
 		fl6.daddr = treq->rmt_addr;
 		err = ip6_xmit(sk, skb, &fl6, opt, np->tclass);
 		err = net_xmit_eval(err);
@@ -528,7 +530,7 @@ static int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req,
 			     struct request_values *rvp)
 {
 	TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS);
-	return tcp_v6_send_synack(sk, req, rvp);
+	return tcp_v6_send_synack(sk, req, rvp, false);
 }
 
 static void tcp_v6_reqsk_destructor(struct request_sock *req)
@@ -906,6 +908,7 @@ static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
 	dst = ip6_dst_lookup_flow(ctl_sk, &fl6, NULL, false);
 	if (!IS_ERR(dst)) {
 		skb_dst_set(buff, dst);
+		buff->priority = ctl_sk->sk_priority;
 		ip6_xmit(ctl_sk, buff, &fl6, NULL, tclass);
 		TCP_INC_STATS_BH(net, TCP_MIB_OUTSEGS);
 		if (rst)
@@ -1213,7 +1216,8 @@ have_isn:
 	security_inet_conn_request(sk, skb, req);
 
 	if (tcp_v6_send_synack(sk, req,
-			       (struct request_values *)&tmp_ext) ||
+			       (struct request_values *)&tmp_ext,
+			       want_cookie) ||
 	    want_cookie)
 		goto drop_and_free;
 

^ permalink raw reply related
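
The band selection that the patch above relies on can be sketched in userspace. The priority values and the priority-to-band table below are written down from memory of include/linux/pkt_sched.h and net/sched/sch_generic.c and should be checked against the tree; the `EX_`/`ex_` names are hypothetical.

```c
/* Priority values as recalled from include/linux/pkt_sched.h
 * (assumption: verify against the kernel tree). */
enum {
	EX_TC_PRIO_BESTEFFORT  = 0,
	EX_TC_PRIO_FILLER      = 1,
	EX_TC_PRIO_INTERACTIVE = 6,
	EX_TC_PRIO_MAX         = 15
};

/* pfifo_fast priority-to-band map as recalled from sch_generic.c:
 * band 0 is served first, band 2 last. */
static const int ex_prio2band[EX_TC_PRIO_MAX + 1] = {
	1, 2, 2, 2, 1, 2, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1
};

/* Band a packet with the given skb->priority would be queued to. */
static int ex_band_for_priority(unsigned int priority)
{
	return ex_prio2band[priority & EX_TC_PRIO_MAX];
}

/* Strict-priority dequeue: return the lowest-numbered non-empty band,
 * or -1 if all three bands are empty. */
static int ex_next_band(const int qlen[3])
{
	for (int band = 0; band < 3; band++)
		if (qlen[band] > 0)
			return band;
	return -1;
}
```

Under these assumptions a SYNACK tagged TC_PRIO_FILLER lands in band 2, while default-priority traffic of established sessions lands in band 1 and is always dequeued first, so a flood of cookie SYNACKs can no longer crowd it out of the same band.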

* r8169: IO_PAGE_FAULT & netdev watchdog
From: Vincent Pelletier @ 2012-05-31 21:31 UTC (permalink / raw)
  To: netdev

Hi.

First of all, I'm running 3.3.4 as of debian experimental (the rest of
userland is from sid). I am not subscribed to this list, so please keep me
in CC.

I'm consistently getting errors when using btlaunchmanycurses (multi-torrent
downloader) after a few minutes. I usually first notice the network being down
(no traffic), then find this in syslog (see at bottom).

Then, I "ifdown eth0;rmmod r8169;modprobe r8169" (which implicitly ifup's),
but the network never comes back - at least no traffic can go through - until
reboot.

www.kerneloops.org being down (apparently for quite some time...), I thought I
should report here.

I'm quite sure this problem also occurred on 3.2, but I don't know the exact
version I was using at that time. I have only had this motherboard for a few
months, and the previous one didn't have an IOMMU - which in my understanding is
what causes (well, detects actually) this error.

May 31 22:54:55 x2 kernel: [78579.111904] AMD-Vi: Event logged [IO_PAGE_FAULT device=05:00.0 domain=0x0019 address=0x0000000000003000 flags=0x0050]
May 31 22:55:07 x2 kernel: [78590.832047] ------------[ cut here ]------------
May 31 22:55:07 x2 kernel: [78590.832067] WARNING: at /build/buildd-linux-2.6_3.3.4-1~experimental.1-amd64-_y3OdD/linux-2.6-3.3.4/debian/build/source_amd64_none/net/sched/sch_generic.c:256 dev_watchdog+0xf2/0x151()
May 31 22:55:07 x2 kernel: [78590.832080] Hardware name: GA-990FXA-UD3
May 31 22:55:07 x2 kernel: [78590.832087] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
May 31 22:55:07 x2 kernel: [78590.832093] Modules linked in: pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hrtimer cpufreq_powersave cpufreq_stats cpufreq_userspace cpufreq_conservative xt_multiport iptable_filter ip_tables x_tables tun parport_pc ppdev lp parport binfmt_misc ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi fuse nfsd nfs nfs_acl auth_rpcgss fscache lockd sunrpc ext3 mbcache jbd dm_crypt raid1 md_mod powernow_k8 mperf adt7475 it87 hwmon_vid snd_emu10k1_synth snd_emux_synth snd_seq_midi_emul snd_seq_virmidi snd_emu10k1 snd_util_mem snd_ac97_codec snd_hwdep snd_pcm_oss snd_mixer_oss joydev snd_pcm snd_page_alloc nouveau snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq video ttm drm_kms_helper drm sp5100_tco i2c_piix4 snd_seq_device k10temp snd_timer i2c_core mxm_wmi snd emu10k1_gp gameport edac_mce_amd edac_core evdev pcspkr wmi processor soundcore ac97_bus button thermal_sys sr_mod cdrom usbhid hid power_supply re
May 31 22:55:07 x2 kernel: iserfs dm_mod nbd usb_storage uas sd_mod crc_t10dif ohci_hcd firewire_ohci firewire_core crc_itu_t ahci libahci ehci_hcd xhci_hcd r8169 mii libata scsi_mod usbcore usb_common [last unloaded: scsi_wait_scan]
May 31 22:55:07 x2 kernel: [78590.832306] Pid: 0, comm: swapper/0 Tainted: G        W  O 3.3.0-trunk-amd64 #1
May 31 22:55:07 x2 kernel: [78590.832314] Call Trace:
May 31 22:55:07 x2 kernel: [78590.832319]  <IRQ>  [<ffffffff810387cb>] ? warn_slowpath_common+0x78/0x8c
May 31 22:55:07 x2 kernel: [78590.832339]  [<ffffffff81038877>] ? warn_slowpath_fmt+0x45/0x4a
May 31 22:55:07 x2 kernel: [78590.832349]  [<ffffffff812aa28d>] ? netif_tx_lock+0x40/0x76
May 31 22:55:07 x2 kernel: [78590.832363]  [<ffffffff812aa3ff>] ? dev_watchdog+0xf2/0x151
May 31 22:55:07 x2 kernel: [78590.832374]  [<ffffffff81043ef1>] ? run_timer_softirq+0x19a/0x261
May 31 22:55:07 x2 kernel: [78590.832383]  [<ffffffff812aa30d>] ? netif_tx_unlock+0x4a/0x4a
May 31 22:55:07 x2 kernel: [78590.832395]  [<ffffffff8103de20>] ? __do_softirq+0xb9/0x177
May 31 22:55:07 x2 kernel: [78590.832405]  [<ffffffff8106d15b>] ? timekeeping_get_ns+0xd/0x2a
May 31 22:55:07 x2 kernel: [78590.832417]  [<ffffffff81358b5c>] ? call_softirq+0x1c/0x30
May 31 22:55:07 x2 kernel: [78590.832428]  [<ffffffff8100fa35>] ? do_softirq+0x3c/0x7b
May 31 22:55:07 x2 kernel: [78590.832438]  [<ffffffff8103e088>] ? irq_exit+0x3c/0x96
May 31 22:55:07 x2 kernel: [78590.832447]  [<ffffffff8100f763>] ? do_IRQ+0x82/0x98
May 31 22:55:07 x2 kernel: [78590.832459]  [<ffffffff8135282e>] ? common_interrupt+0x6e/0x6e
May 31 22:55:07 x2 kernel: [78590.832464]  <EOI>  [<ffffffff8102b0c8>] ? native_safe_halt+0x2/0x3
May 31 22:55:07 x2 kernel: [78590.832481]  [<ffffffff81014798>] ? default_idle+0x47/0x7f
May 31 22:55:07 x2 kernel: [78590.832490]  [<ffffffff8101488f>] ? amd_e400_idle+0xbf/0xe4
May 31 22:55:07 x2 kernel: [78590.832500]  [<ffffffff8100d252>] ? cpu_idle+0xaf/0xf7
May 31 22:55:07 x2 kernel: [78590.832510]  [<ffffffff8169ab37>] ? start_kernel+0x3bd/0x3c8
May 31 22:55:07 x2 kernel: [78590.832519]  [<ffffffff8169a140>] ? early_idt_handlers+0x140/0x140
May 31 22:55:07 x2 kernel: [78590.832529]  [<ffffffff8169a3c3>] ? x86_64_start_kernel+0x104/0x111
May 31 22:55:07 x2 kernel: [78590.832537] ---[ end trace 627ebd8c70d61b1a ]---
May 31 22:55:07 x2 kernel: [78590.848660] r8169 0000:05:00.0: eth0: link up
May 31 22:55:19 x2 kernel: [78602.848659] r8169 0000:05:00.0: eth0: link up
May 31 22:55:31 x2 kernel: [78614.848656] r8169 0000:05:00.0: eth0: link up
May 31 22:55:43 x2 kernel: [78626.848800] r8169 0000:05:00.0: eth0: link up
May 31 22:55:55 x2 ovpn-nexedi[2610]: NOTE: OpenVPN 2.1 requires '--script-security 2' or higher to call user-defined scripts or executables
May 31 22:56:31 x2 kernel: [78674.848666] r8169 0000:05:00.0: eth0: link up
May 31 22:57:19 x2 kernel: [78722.848598] r8169 0000:05:00.0: eth0: link up
May 31 22:58:07 x2 kernel: [78770.848662] r8169 0000:05:00.0: eth0: link up
May 31 22:58:17 x2 avahi-daemon[2744]: Withdrawing address record for 192.168.0.16 on eth0.
May 31 22:58:17 x2 avahi-daemon[2744]: Leaving mDNS multicast group on interface eth0.IPv4 with address 192.168.0.16.
May 31 22:58:17 x2 avahi-daemon[2744]: Interface eth0.IPv4 no longer relevant for mDNS.
May 31 22:58:17 x2 avahi-daemon[2744]: Interface eth0.IPv6 no longer relevant for mDNS.
May 31 22:58:17 x2 avahi-daemon[2744]: Leaving mDNS multicast group on interface eth0.IPv6 with address fe80::52e5:49ff:feb4:ed6f.
May 31 22:58:17 x2 avahi-daemon[2744]: Withdrawing address record for fe80::52e5:49ff:feb4:ed6f on eth0.
May 31 22:58:25 x2 avahi-daemon[2744]: Withdrawing workstation service for tun0.
May 31 22:59:29 x2 avahi-daemon[2744]: Withdrawing workstation service for eth0.
May 31 22:59:33 x2 kernel: [78856.929121] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
May 31 22:59:33 x2 kernel: [78856.929312] r8169 0000:05:00.0: irq 41 for MSI/MSI-X
May 31 22:59:33 x2 kernel: [78856.930671] r8169 0000:05:00.0: eth0: RTL8168evl/8111evl at 0xffffc90000c1e000, 50:e5:49:b4:ed:6f, XID 0c900880 IRQ 41
May 31 22:59:33 x2 kernel: [78856.930685] r8169 0000:05:00.0: eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
May 31 22:59:33 x2 avahi-daemon[2744]: Joining mDNS multicast group on interface eth0.IPv4 with address 192.168.0.16.
May 31 22:59:33 x2 kernel: [78857.169029] r8169 0000:05:00.0: eth0: link down
May 31 22:59:33 x2 kernel: [78857.169043] r8169 0000:05:00.0: eth0: link down
May 31 22:59:33 x2 kernel: [78857.171749] ADDRCONF(NETDEV_UP): eth0: link is not ready
May 31 22:59:33 x2 avahi-daemon[2744]: New relevant interface eth0.IPv4 for mDNS.
May 31 22:59:33 x2 avahi-daemon[2744]: Registering new address record for 192.168.0.16 on eth0.IPv4.
May 31 22:59:36 x2 kernel: [78859.538358] r8169 0000:05:00.0: eth0: link up
May 31 22:59:36 x2 kernel: [78859.539012] ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
May 31 22:59:37 x2 avahi-daemon[2744]: Joining mDNS multicast group on interface eth0.IPv6 with address fe80::52e5:49ff:feb4:ed6f.
May 31 22:59:37 x2 avahi-daemon[2744]: New relevant interface eth0.IPv6 for mDNS.
May 31 22:59:37 x2 avahi-daemon[2744]: Registering new address record for fe80::52e5:49ff:feb4:ed6f on eth0.*.
May 31 22:59:46 x2 kernel: [78870.104066] eth0: no IPv6 routers present
May 31 23:00:00 x2 kernel: [78883.792620] r8169 0000:05:00.0: eth0: link up
May 31 23:00:37 x2 kerneloops: Submitted 2 kernel oopses to www.kerneloops.org
May 31 23:00:48 x2 kernel: [78931.792643] r8169 0000:05:00.0: eth0: link up
May 31 23:01:21 x2 kernel: [78965.124469] r8169 0000:05:00.0: eth0: link down
May 31 23:01:26 x2 kernel: [78969.278184] r8169 0000:05:00.0: eth0: link up
May 31 23:01:27 x2 kerneloops: Submitted 1 kernel oopses to www.kerneloops.org
May 31 23:01:44 x2 kernel: [78987.792649] r8169 0000:05:00.0: eth0: link up
May 31 23:02:32 x2 kernel: [79035.792636] r8169 0000:05:00.0: eth0: link up
May 31 23:02:54 x2 shutdown[9402]: shutting down for system reboot

Regards,
-- 
Vincent Pelletier

^ permalink raw reply

* Re: [PATCH 2/4] can: cc770: Fix likely misuse of | for &
From: Marc Kleine-Budde @ 2012-05-31 20:54 UTC (permalink / raw)
  To: Joe Perches; +Cc: linux-kernel, Wolfgang Grandegger, linux-can, netdev
In-Reply-To: <6c251b03dc626215cf696e894ac1cdda530f38d9.1338408931.git.joe@perches.com>

[-- Attachment #1: Type: text/plain, Size: 1330 bytes --]

On 05/30/2012 10:25 PM, Joe Perches wrote:
> Using | with a constant is always true.
> Likely this should have been &.
> 
> Signed-off-by: Joe Perches <joe@perches.com>

Sounds reasonable. And there are no in-tree users of the platform driver
that this fix could break.

Committed to linux-can,
Marc

> ---
>  drivers/net/can/cc770/cc770_platform.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/net/can/cc770/cc770_platform.c b/drivers/net/can/cc770/cc770_platform.c
> index 53115ee..688371c 100644
> --- a/drivers/net/can/cc770/cc770_platform.c
> +++ b/drivers/net/can/cc770/cc770_platform.c
> @@ -154,7 +154,7 @@ static int __devinit cc770_get_platform_data(struct platform_device *pdev,
>  	struct cc770_platform_data *pdata = pdev->dev.platform_data;
>  
>  	priv->can.clock.freq = pdata->osc_freq;
> -	if (priv->cpu_interface | CPUIF_DSC)
> +	if (priv->cpu_interface & CPUIF_DSC)
>  		priv->can.clock.freq /= 2;
>  	priv->clkout = pdata->cor;
>  	priv->bus_config = pdata->bcr;


-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

^ permalink raw reply
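
The point of the cc770 fix above can be shown with a minimal userspace
sketch. The flag value below is a hypothetical stand-in, not the real
CPUIF_DSC constant from the driver headers: OR-ing any value with a
non-zero constant can never give zero, so the original test was taken
unconditionally, while AND checks whether the bit is actually set.

```c
#include <stdbool.h>

/* Hypothetical stand-in for CPUIF_DSC; the real value is in the cc770 headers. */
#define CPUIF_DSC_DEMO 0x40u

/* Buggy form: (x | non-zero constant) is never zero, so this is always true. */
bool dsc_set_buggy(unsigned int cpu_interface)
{
	return (cpu_interface | CPUIF_DSC_DEMO) != 0;
}

/* Fixed form: true only when the bit is really set in cpu_interface. */
bool dsc_set_fixed(unsigned int cpu_interface)
{
	return (cpu_interface & CPUIF_DSC_DEMO) != 0;
}
```

With the buggy form the clock frequency would be halved for every
configuration, which is why the one-character change matters.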

* [PATCH] cipso: handle CIPSO options correctly when NetLabel is disabled
From: Paul Moore @ 2012-05-31 20:09 UTC (permalink / raw)
  To: netdev; +Cc: linux-security-module, stable

When NetLabel is not enabled, e.g. CONFIG_NETLABEL=n, and the system
receives a CIPSO-tagged packet, the packet is dropped (cipso_v4_validate()
returns non-zero).  In most cases this is the correct and desired
behavior; however, when we are simply forwarding the traffic, e.g.
acting as a network bridge, this becomes a problem.

This patch fixes the forwarding problem by providing the basic CIPSO
validation code directly in ip_options_compile() without the need for
the NetLabel or CIPSO code.  The new validation code cannot perform
any of the CIPSO option label/value verification that
cipso_v4_validate() does, but it can verify the basic CIPSO option
format.

The behavior when NetLabel is enabled is unchanged.

Signed-off-by: Paul Moore <pmoore@redhat.com>
---
 net/ipv4/ip_options.c |   20 ++++++++++++++++++++
 1 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/net/ipv4/ip_options.c b/net/ipv4/ip_options.c
index 708b994..ca2c919 100644
--- a/net/ipv4/ip_options.c
+++ b/net/ipv4/ip_options.c
@@ -439,10 +439,30 @@ int ip_options_compile(struct net *net,
 				goto error;
 			}
 			opt->cipso = optptr - iph;
+#ifndef CONFIG_NETLABEL
+			if (optlen < 8) {
+				pp_ptr = optptr + 1;
+				goto error;
+			}
+			if (get_unaligned_be32(&optptr[2]) != 0) {
+				unsigned int iter;
+				for (iter = 6; iter < optlen;) {
+					if (optptr[iter+1] > (optlen - iter)) {
+						pp_ptr = optptr + iter;
+						goto error;
+					}
+					iter += optptr[iter + 1];
+				}
+			} else {
+				pp_ptr = optptr + 2;
+				goto error;
+			}
+#else
 			if (cipso_v4_validate(skb, &optptr)) {
 				pp_ptr = optptr;
 				goto error;
 			}
+#endif /* CONFIG_NETLABEL */
 			break;
 		      case IPOPT_SEC:
 		      case IPOPT_SID:

^ permalink raw reply related
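
The basic format check described in the CIPSO patch above can be sketched
as a standalone userspace function. This is only an illustration under
stated assumptions, not the kernel code: the function name and return
convention are hypothetical, and the two guards marked "extra guard" are
additions of this sketch that the posted patch does not contain.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Returns -1 if the option passes the basic format check, otherwise the
 * offset of the byte the check objects to (the role pp_ptr plays in the
 * patch).  opt points at the option type byte; optlen is the option
 * length the caller took from opt[1].
 */
int cipso_basic_check(const uint8_t *opt, size_t optlen)
{
	size_t iter;
	uint32_t doi;

	/* type + length + 4-byte DOI + at least one 2-byte tag header */
	if (optlen < 8)
		return 1;

	/* Big-endian DOI in bytes 2..5; zero is rejected, as in the patch. */
	doi = ((uint32_t)opt[2] << 24) | ((uint32_t)opt[3] << 16) |
	      ((uint32_t)opt[4] << 8) | (uint32_t)opt[5];
	if (doi == 0)
		return 2;

	/* Walk the tags: each tag is type, length, then its payload. */
	for (iter = 6; iter < optlen;) {
		size_t taglen;

		/* Extra guard (assumption): the tag length byte must exist. */
		if (iter + 1 >= optlen)
			return (int)iter;
		taglen = opt[iter + 1];

		/* Extra guard (assumption): a tag shorter than its own
		 * two-byte header would never advance the loop. */
		if (taglen < 2 || taglen > optlen - iter)
			return (int)iter;
		iter += taglen;
	}
	return -1;
}
```

As in the patch, none of the label or value semantics are verified here;
only the DOI and the tag lengths are checked for consistency with the
option length.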

* Re: [PATCH 1/3] drivers/net: Convert compare_ether_addr to ether_addr_equal
From: Joe Perches @ 2012-05-31 18:49 UTC (permalink / raw)
  To: Jussi Kivilinna
  Cc: David Miller, linville-2XuSBdqkA4R54TAoqtyWWQ,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20120531150124.119853l3a0cbvj40-tzMWlZeEOor1KXRcyAk9cg@public.gmane.org>

On Thu, 2012-05-31 at 15:01 +0300, Jussi Kivilinna wrote:
> Quoting David Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>:
> > From: Joe Perches <joe-6d6DIl74uiNBDgjK7y7TUQ@public.gmane.org>
> > Date: Thu, 10 May 2012 09:11:28 -0700
> >> On Thu, 2012-05-10 at 17:32 +0300, Jussi Kivilinna wrote:
> >>> Quoting Joe Perches <joe-6d6DIl74uiNBDgjK7y7TUQ@public.gmane.org>:
> >>> > Use the new bool function ether_addr_equal to add
> >>> > some clarity and reduce the likelihood for misuse
> >>> > of compare_ether_addr for sorting.
> >> []
> >>> > diff --git a/drivers/net/wireless/rndis_wlan.c
> >> []
> >>> > @@ -2139,7 +2139,7 @@ resize_buf:
> >>> >  	while (check_bssid_list_item(bssid, bssid_len, buf, len)) {
> >>> >  		if (rndis_bss_info_update(usbdev, bssid) && match_bssid &&
> >>> >  		    matched) {
> >>> > -			if (compare_ether_addr(bssid->mac, match_bssid))
> >>> > +			if (!ether_addr_equal(bssid->mac, match_bssid))
> >>>
> >>> While reviewing this, I noticed that the original code above is wrong.
> >>> It should be !compare_ether_addr. So should I push a patch fixing this
> >>> through wireless-testing, although it will later conflict with this patch?
[]
> That line/compare was added in response to a hardware bug where the
> bssid-list does not contain the BSSID and other information of the
> currently connected AP (the spec insists that the device must provide
> this information in the list when connected). The lack of bssid-data on
> the current connection then causes a WARN_ON somewhere in cfg80211.
> The workaround was to check whether the bssid-list returns the current
> bssid and, if it does not, to construct the bssid information manually
> in other ways. And this workaround worked, with the inverse check. Which
> must mean that when the hardware is experiencing the problem, it is
> actually returning an empty bssid-list.
> 
> The inverse check causes the workaround to be activated when the
> bssid-list returns only one entry, the currently connected BSSID. That
> does not cause problems in itself, just slightly more inaccurate
> information in the scan-list.

Thanks.

That information would be useful in the
eventual commit message.

cheers, Joe

--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply
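
The two return conventions discussed in the thread above are easy to mix
up, so here is a minimal userspace sketch of them. The helper names are
hypothetical stand-ins built on memcmp(), not the kernel implementations:
the compare-style helper returns non-zero when the addresses differ, the
equal-style helper returns true when they match.

```c
#include <stdbool.h>
#include <string.h>

#define ETH_ALEN_DEMO 6	/* MAC address length in bytes */

/* Compare-style convention: non-zero means the addresses DIFFER
 * (hypothetical stand-in for how compare_ether_addr() is used). */
unsigned int mac_compare_demo(const unsigned char *a, const unsigned char *b)
{
	return memcmp(a, b, ETH_ALEN_DEMO) != 0;
}

/* Bool-style convention: true means the addresses are EQUAL
 * (hypothetical stand-in for how ether_addr_equal() is used). */
bool mac_equal_demo(const unsigned char *a, const unsigned char *b)
{
	return memcmp(a, b, ETH_ALEN_DEMO) == 0;
}
```

A mechanical conversion therefore turns "if (compare(a, b))" into
"if (!equal(a, b))", which keeps the existing behaviour unchanged,
including a test that was already inverted before the conversion.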

* r8169 vpd r/w failed
From: Dave Jones @ 2012-05-31 17:41 UTC (permalink / raw)
  To: romieu; +Cc: netdev, Fedora Kernel Team

Francois,
 We occasionally see reports from r8169 where it seems to hang
reading vpd data from the NIC. The printk in drivers/pci/
suggests it's likely a firmware bug on the device, and recommends the user
contact the vendor for an update (does Realtek even do firmware updates?)

Is this as the message suggests a firmware bug, or should r8169 be
avoiding trying to read vpd ?

https://bugzilla.redhat.com/show_bug.cgi?id=827137 is one example.

thanks,

	Dave

^ permalink raw reply

* Re: [RFC PATCH 2/2] tcp: Early SYN limit and SYN cookie handling to mitigate SYN floods
From: Eric Dumazet @ 2012-05-31 17:16 UTC (permalink / raw)
  To: Hans Schillstrom
  Cc: Rick Jones, Andi Kleen, Jesper Dangaard Brouer,
	Jesper Dangaard Brouer, netdev@vger.kernel.org, Christoph Paasch,
	David S. Miller, Martin Topholm, Florian Westphal, Tom Herbert
In-Reply-To: <201205311731.57159.hans.schillstrom@ericsson.com>

On Thu, 2012-05-31 at 17:31 +0200, Hans Schillstrom wrote:
> On Thursday 31 May 2012 16:09:21 Eric Dumazet wrote:
> > On Thu, 2012-05-31 at 10:45 +0200, Hans Schillstrom wrote:
> > 
> > > I can see plenty "IPv4: dst cache overflow"
> > > 
> > 
> > This is probably the most serious problem in DDoS attacks.
> > 
> > I have a patch for this problem.
> > 
> > The idea is to not cache dst entries in the following cases:
> > 
> > 1) Input dst, if listener queue is full (syncookies possibly engaged)
> > 
> > 2) Output dst of SYNACK messages.
> > 
> Sounds like a good idea;
> if you need some testing, just send the patches.
> 

Here is the patch, works pretty well for me

 include/net/dst.h   |    1 +
 net/ipv4/route.c    |   20 +++++++++++++++-----
 net/ipv4/tcp_ipv4.c |    6 ++++++
 3 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/include/net/dst.h b/include/net/dst.h
index bed833d..e0109c4 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -60,6 +60,7 @@ struct dst_entry {
 #define DST_NOCOUNT		0x0020
 #define DST_NOPEER		0x0040
 #define DST_FAKE_RTABLE		0x0080
+#define DST_EPHEMERAL		0x0100
 
 	short			error;
 	short			obsolete;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 98b30d0..51b3e78 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -754,6 +754,15 @@ static inline int rt_is_expired(struct rtable *rth)
 	return rth->rt_genid != rt_genid(dev_net(rth->dst.dev));
 }
 
+static bool rt_is_expired_or_ephemeral(struct rtable *rth)
+{
+	if (rt_is_expired(rth))
+		return true;
+
+	return (atomic_read(&rth->dst.__refcnt) == 0) && 
+	       (rth->dst.flags & DST_EPHEMERAL);
+}
+
 /*
  * Perform a full scan of hash table and free all entries.
  * Can be called by a softirq or a process.
@@ -873,7 +882,7 @@ static void rt_check_expire(void)
 		while ((rth = rcu_dereference_protected(*rthp,
 					lockdep_is_held(rt_hash_lock_addr(i)))) != NULL) {
 			prefetch(rth->dst.rt_next);
-			if (rt_is_expired(rth)) {
+			if (rt_is_expired_or_ephemeral(rth)) {
 				*rthp = rth->dst.rt_next;
 				rt_free(rth);
 				continue;
@@ -1040,7 +1049,7 @@ static int rt_garbage_collect(struct dst_ops *ops)
 			spin_lock_bh(rt_hash_lock_addr(k));
 			while ((rth = rcu_dereference_protected(*rthp,
 					lockdep_is_held(rt_hash_lock_addr(k)))) != NULL) {
-				if (!rt_is_expired(rth) &&
+				if (!rt_is_expired_or_ephemeral(rth) &&
 					!rt_may_expire(rth, tmo, expire)) {
 					tmo >>= 1;
 					rthp = &rth->dst.rt_next;
@@ -1159,7 +1168,8 @@ restart:
 	candp = NULL;
 	now = jiffies;
 
-	if (!rt_caching(dev_net(rt->dst.dev))) {
+	if (!rt_caching(dev_net(rt->dst.dev)) ||
+	    dst_entries_get_fast(&ipv4_dst_ops) > (ip_rt_max_size >> 1)) {
 		/*
 		 * If we're not caching, just tell the caller we
 		 * were successful and don't touch the route.  The
@@ -1194,7 +1204,7 @@ restart:
 	spin_lock_bh(rt_hash_lock_addr(hash));
 	while ((rth = rcu_dereference_protected(*rthp,
 			lockdep_is_held(rt_hash_lock_addr(hash)))) != NULL) {
-		if (rt_is_expired(rth)) {
+		if (rt_is_expired_or_ephemeral(rth)) {
 			*rthp = rth->dst.rt_next;
 			rt_free(rth);
 			continue;
@@ -1390,7 +1400,7 @@ static void rt_del(unsigned int hash, struct rtable *rt)
 	ip_rt_put(rt);
 	while ((aux = rcu_dereference_protected(*rthp,
 			lockdep_is_held(rt_hash_lock_addr(hash)))) != NULL) {
-		if (aux == rt || rt_is_expired(aux)) {
+		if (aux == rt || rt_is_expired_or_ephemeral(aux)) {
 			*rthp = aux->dst.rt_next;
 			rt_free(aux);
 			continue;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index a43b87d..30c5275 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -835,6 +835,9 @@ static int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
 	if (!dst && (dst = inet_csk_route_req(sk, &fl4, req)) == NULL)
 		return -1;
 
+	if (atomic_read(&dst->__refcnt) == 1)
+		dst->flags |= DST_EPHEMERAL;
+
 	skb = tcp_make_synack(sk, dst, req, rvp);
 
 	if (skb) {
@@ -1291,6 +1294,9 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 	 * evidently real one.
 	 */
 	if (inet_csk_reqsk_queue_is_full(sk) && !isn) {
+		/* under attack, free dst as soon as possible */
+		skb_dst(skb)->flags |= DST_EPHEMERAL;
+
 		want_cookie = tcp_syn_flood_action(sk, skb, "TCP");
 		if (!want_cookie)
 			goto drop;

^ permalink raw reply related
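
The reclaim rule added by the patch above can be sketched in isolation.
The struct and function names below are hypothetical, simplified
stand-ins (a plain int replaces the atomic refcount, an integer
generation replaces rt_genid); only the shape of the predicate follows
rt_is_expired_or_ephemeral() from the patch.

```c
#include <stdbool.h>

#define DST_EPHEMERAL_DEMO 0x0100u	/* same bit value as in the patch */

/* Hypothetical, simplified stand-in for a cached route entry. */
struct route_demo {
	int refcnt;		/* users currently holding the entry */
	unsigned int flags;
	int genid;		/* generation the entry was created in */
};

/* An entry from an older generation is stale. */
bool route_is_expired(const struct route_demo *rt, int current_genid)
{
	return rt->genid != current_genid;
}

/* Stale entries, or unreferenced entries marked ephemeral, may be freed. */
bool route_is_expired_or_ephemeral(const struct route_demo *rt,
				   int current_genid)
{
	if (route_is_expired(rt, current_genid))
		return true;

	return rt->refcnt == 0 && (rt->flags & DST_EPHEMERAL_DEMO) != 0;
}
```

The refcount test matters: an ephemeral entry that is still in use is
left alone and only becomes a reclaim candidate once its last user has
dropped it.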

* Re: [PATCH 1/25 v2] rdma/cm: define native IB address
From: Hal Rosenstock @ 2012-05-31 15:53 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Roland Dreier, David Miller,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <1828884A29C6694DAF28B7E6B8A823733B7669D7-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>

On 3/7/2012 12:45 PM, Hefty, Sean wrote:
>> On Mon, Feb 27, 2012 at 2:22 PM, Hefty, Sean <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
>>>>> --- a/include/linux/socket.h
>>>>> +++ b/include/linux/socket.h
>>>>> @@ -184,6 +184,7 @@ struct ucred {
>>>>>  #define AF_PPPOX       24      /* PPPoX sockets                */
>>>>>  #define AF_WANPIPE     25      /* Wanpipe API Sockets */
>>>>>  #define AF_LLC         26      /* Linux LLC                    */
>>>>> +#define AF_IB          27      /* Native InfiniBand address    */
>>>>>  #define AF_CAN         29      /* Controller Area Network      */
>>>>>  #define AF_TIPC                30      /* TIPC sockets                 */
>>>>>  #define AF_BLUETOOTH   31      /* Bluetooth sockets            */
>>>>> @@ -227,6 +228,7 @@ struct ucred {
>>>>>  #define PF_PPPOX       AF_PPPOX
>>>>>  #define PF_WANPIPE     AF_WANPIPE
>>>>>  #define PF_LLC         AF_LLC
>>>>> +#define PF_IB          AF_IB
>>>>>  #define PF_CAN         AF_CAN
>>>>>  #define PF_TIPC                AF_TIPC
>>>>>  #define PF_BLUETOOTH   AF_BLUETOOTH
>>>>
>>>> Has this been run by the networking community?  Are they OK with this
>>>> assignment?
>>>
>>> I did copy netdev on the original submissions, but I don't remember any
>>> explicit ack or nack.
>>
>> David, any feeling yay or nay about adding these?
>>
>> Is the kernel the final arbiter of AF_xxx / PF_xxx assignments, or
>> is there anything else we have to worry about?
> 
> To clarify the intent of this change:
> 
> The RDMA CM allows users to specify addresses using struct sockaddr.  Today, only INET/6 are supported.
> The intent is to allow a user to specify native InfiniBand addresses through that interface.
> In the more immediate term, this helps to solve InfiniBand scaling issues.

Yes, this is key for InfiniBand scaling.

> Longer term, this can also be used to control path failover.

This AF_IB patch series appears to be stalled unless I missed something
on the list. What needs to be done to revive it/move it along ?

Thanks.

-- Hal

> 
> - Sean
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply

* Re: [RFC PATCH 2/2] tcp: Early SYN limit and SYN cookie handling to mitigate SYN floods
From: Hans Schillstrom @ 2012-05-31 15:31 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Rick Jones, Andi Kleen, Jesper Dangaard Brouer,
	Jesper Dangaard Brouer, netdev@vger.kernel.org, Christoph Paasch,
	David S. Miller, Martin Topholm, Florian Westphal, Tom Herbert
In-Reply-To: <1338473361.2760.1361.camel@edumazet-glaptop>

On Thursday 31 May 2012 16:09:21 Eric Dumazet wrote:
> On Thu, 2012-05-31 at 10:45 +0200, Hans Schillstrom wrote:
> 
> > I can see plenty "IPv4: dst cache overflow"
> > 
> 
> This is probably the most serious problem in DDoS attacks.
> 
> I have a patch for this problem.
> 
> The idea is to not cache dst entries in the following cases:
> 
> 1) Input dst, if listener queue is full (syncookies possibly engaged)
> 
> 2) Output dst of SYNACK messages.
> 
Sounds like a good idea;
if you need some testing, just send the patches.

-- 
Regards
Hans Schillstrom <hans.schillstrom@ericsson.com>

^ permalink raw reply

* Re: [RFC PATCH 2/2] tcp: Early SYN limit and SYN cookie handling to mitigate SYN floods
From: Eric Dumazet @ 2012-05-31 14:09 UTC (permalink / raw)
  To: Hans Schillstrom
  Cc: Rick Jones, Andi Kleen, Jesper Dangaard Brouer,
	Jesper Dangaard Brouer, netdev@vger.kernel.org, Christoph Paasch,
	David S. Miller, Martin Topholm, Florian Westphal, Tom Herbert
In-Reply-To: <201205311045.03556.hans.schillstrom@ericsson.com>

On Thu, 2012-05-31 at 10:45 +0200, Hans Schillstrom wrote:

> I can see plenty "IPv4: dst cache overflow"
> 

This is probably the most serious problem in DDoS attacks.

I have a patch for this problem.

The idea is to not cache dst entries in the following cases:

1) Input dst, if listener queue is full (syncookies possibly engaged)

2) Output dst of SYNACK messages.

^ permalink raw reply

* [RFC v2 PATCH 3/3] tcp: SYN retransmits, fallback to slow-locked/no-cookie path
From: Jesper Dangaard Brouer @ 2012-05-31 13:40 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, Christoph Paasch, Eric Dumazet,
	David S. Miller, Martin Topholm
  Cc: Florian Westphal, Hans Schillstrom
In-Reply-To: <20120531133807.10311.79711.stgit@localhost.localdomain>

Handle retransmitted SYN packets by falling back to the slow
locked processing path (instead of dropping the reqsk, as the
previous patch did).

This handles the case where the original SYN/ACK was not
dropped, but was somehow delayed in the network, and the
SYN-retransmission timer on the client side fires before the
SYN/ACK reaches the client.

Note that this does introduce a new SYN attack vector.  Using this
vector of false retransmits on the big machine in the testlab,
performance is reduced to 251Kpps SYN packets, compared to approx
400Kpps when dropping reqsks early (SYN generator speed 750Kpps).

Signed-off-by: Martin Topholm <mph@hoth.dk>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---

 net/ipv4/tcp_ipv4.c |   20 +++++++++-----------
 1 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 29e9c4a..d2ff5c3 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1307,24 +1307,22 @@ int tcp_v4_syn_conn_limit(struct sock *sk, struct sk_buff *skb)
 
 	/* Check for existing connection request (reqsk) as this might
 	 *   be a retransmitted SYN which have gotten into the
-	 *   reqsk_queue.  If so, we choose to drop the reqsk, and use
-	 *   SYN cookies to restore the state later, even-though this
-	 *   can cause issues, if the original SYN/ACK didn't get
+	 *   reqsk_queue.  If so, we simple fallback to the slow
+	 *   locked processing path.  Even-though this might introduce
+	 *   a new SYN attack vector.
+	 *   This will handle the case, where the original SYN/ACK didn't get
 	 *   dropped, but somehow were delayed in the network and the
 	 *   SYN-retransmission timer on the client-side fires before
-	 *   the SYN/ACK reaches the client.  We choose to neglect
-	 *   this situation as we are under attack, and don't want to
-	 *   open an attack vector, of falling back to the slow locked
-	 *   path.
+	 *   the SYN/ACK reaches the client.
 	 */
 	bh_lock_sock(sk);
 	exist_req = inet_csk_search_req(sk, &prev, tcp_hdr(skb)->source, saddr, daddr);
-	if (exist_req) { /* Drop existing reqsk */
+	if (exist_req) {
 		if (TCP_SKB_CB(skb)->seq == tcp_rsk(exist_req)->rcv_isn)
 			net_warn_ratelimited("Retransmitted SYN from %pI4"
-					     " (orig reqsk dropped)", &saddr);
-
-		inet_csk_reqsk_queue_drop(sk, exist_req, prev);
+					     " (don't do SYN cookie)", &saddr);
+		bh_unlock_sock(sk);
+		goto no_limit;
 	}
 	bh_unlock_sock(sk);
 

^ permalink raw reply related
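
The decision order of tcp_v4_syn_conn_limit() after this patch can be
summarised in a small standalone sketch. The enum, the struct and the
function name are hypothetical; the real function tests the skb and
socket directly and returns only 0 (continue on the slow path) or 1
(stop processing), so both SYN_DROP and SYN_COOKIE_SENT below correspond
to a return value of 1.

```c
#include <stdbool.h>

enum syn_verdict {
	SYN_SLOW_PATH,		/* "no_limit": continue with locked processing */
	SYN_DROP,		/* stop processing without sending a cookie */
	SYN_COOKIE_SENT		/* answered with a SYN cookie, stop processing */
};

/* Hypothetical flattened view of the conditions the function tests. */
struct syn_ctx {
	bool bcast_or_mcast;	/* SYN sent to broadcast or multicast */
	bool timewait_isn;	/* isn != 0: hit a live timewait bucket */
	bool queue_full;	/* request sock queue is full */
	bool cookies_enabled;	/* result of the syn flood action check */
	bool existing_reqsk;	/* a reqsk for this peer is already queued */
};

enum syn_verdict syn_limit_verdict(const struct syn_ctx *c)
{
	if (c->bcast_or_mcast)
		return SYN_DROP;
	if (c->timewait_isn || !c->queue_full)
		return SYN_SLOW_PATH;
	if (!c->cookies_enabled)
		return SYN_DROP;	/* queue full and no cookies */
	if (c->existing_reqsk)
		return SYN_SLOW_PATH;	/* retransmitted SYN, patch 3/3 */
	return SYN_COOKIE_SENT;
}
```

The only difference from patch 2/3 is the existing_reqsk branch: there
the queued reqsk was dropped and a cookie was sent anyway.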

* [RFC v2 PATCH 2/3] tcp: Early SYN limit and SYN cookie handling to mitigate SYN floods
From: Jesper Dangaard Brouer @ 2012-05-31 13:40 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, Christoph Paasch, Eric Dumazet,
	David S. Miller, Martin Topholm
  Cc: Florian Westphal, Hans Schillstrom
In-Reply-To: <20120531133807.10311.79711.stgit@localhost.localdomain>

TCP SYN handling is on the slow path via tcp_v4_rcv(), and is
performed while holding the spinlock bh_lock_sock().

Real-life and testlab experiments show that the kernel chokes
when reaching 130Kpps SYN floods (powerful Nehalem, 16 cores).
Measuring with perf reveals that it is caused by the
bh_lock_sock_nested() call in tcp_v4_rcv().

With this patch, the machine can handle 750Kpps (the max of the SYN
flood generator) with cycles to spare; CPU load on the big machine
dropped from 100% to 1%.

Note that we only handle SYN cookies early on; normal SYN packets
are still processed under bh_lock_sock().

V2:
 - Check for existing connection request (reqsk)
 - Avoid (unlikely) variable race in tcp_make_synack for tcp_full_space(sk)

Signed-off-by: Martin Topholm <mph@hoth.dk>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---

 net/ipv4/tcp_ipv4.c   |   48 +++++++++++++++++++++++++++++++++++++++++-------
 net/ipv4/tcp_output.c |   20 ++++++++++++++------
 2 files changed, 55 insertions(+), 13 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index ed9d35a..29e9c4a 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1274,8 +1274,10 @@ static const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops = {
  */
 int tcp_v4_syn_conn_limit(struct sock *sk, struct sk_buff *skb)
 {
-	struct request_sock *req;
+	struct request_sock *req = NULL;
 	struct inet_request_sock *ireq;
+	struct request_sock *exist_req;
+	struct request_sock **prev;
 	struct tcp_options_received tmp_opt;
 	__be32 saddr = ip_hdr(skb)->saddr;
 	__be32 daddr = ip_hdr(skb)->daddr;
@@ -1290,7 +1292,10 @@ int tcp_v4_syn_conn_limit(struct sock *sk, struct sk_buff *skb)
 	if (isn)
 		goto no_limit;
 
-	/* Start sending SYN cookies when request sock queue is full*/
+	/* Start sending SYN cookies when request sock queue is full
+	 * - Should lock while full queue check, but we don't need
+	 *   that precise/exact threshold here.
+	 */
 	if (!inet_csk_reqsk_queue_is_full(sk))
 		goto no_limit;
 
@@ -1300,6 +1305,29 @@ int tcp_v4_syn_conn_limit(struct sock *sk, struct sk_buff *skb)
 	if (!tcp_syn_flood_action(sk, skb, "TCP"))
 		goto drop; /* Not enabled, indicate drop, due to queue full */
 
+	/* Check for existing connection request (reqsk) as this might
+	 *   be a retransmitted SYN which have gotten into the
+	 *   reqsk_queue.  If so, we choose to drop the reqsk, and use
+	 *   SYN cookies to restore the state later, even-though this
+	 *   can cause issues, if the original SYN/ACK didn't get
+	 *   dropped, but somehow were delayed in the network and the
+	 *   SYN-retransmission timer on the client-side fires before
+	 *   the SYN/ACK reaches the client.  We choose to neglect
+	 *   this situation as we are under attack, and don't want to
+	 *   open an attack vector, of falling back to the slow locked
+	 *   path.
+	 */
+	bh_lock_sock(sk);
+	exist_req = inet_csk_search_req(sk, &prev, tcp_hdr(skb)->source, saddr, daddr);
+	if (exist_req) { /* Drop existing reqsk */
+		if (TCP_SKB_CB(skb)->seq == tcp_rsk(exist_req)->rcv_isn)
+			net_warn_ratelimited("Retransmitted SYN from %pI4"
+					     " (orig reqsk dropped)", &saddr);
+
+		inet_csk_reqsk_queue_drop(sk, exist_req, prev);
+	}
+	bh_unlock_sock(sk);
+
 	/* Allocate a request_sock */
 	req = inet_reqsk_alloc(&tcp_request_sock_ops);
 	if (!req) {
@@ -1331,6 +1359,7 @@ int tcp_v4_syn_conn_limit(struct sock *sk, struct sk_buff *skb)
 	ireq->no_srccheck = inet_sk(sk)->transparent;
 	ireq->opt = tcp_v4_save_options(sk, skb);
 
+	/* Considering lock here, cannot determine security module behavior */
 	if (security_inet_conn_request(sk, skb, req))
 		goto drop_and_free;
 
@@ -1345,7 +1374,10 @@ int tcp_v4_syn_conn_limit(struct sock *sk, struct sk_buff *skb)
 	tcp_rsk(req)->snt_isn = isn;
 	tcp_rsk(req)->snt_synack = tcp_time_stamp;
 
-	/* Send SYN-ACK containing cookie */
+	/* Send SYN-ACK containing cookie
+	 * - tcp_v4_send_synack() handles alloc of a dst route cache,
+	 *   but also releases it immediately afterwards
+	 */
 	tcp_v4_send_synack(sk, NULL, req, NULL);
 
 drop_and_free:
@@ -1382,10 +1414,6 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 	if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1)
 		goto drop;
 
-	/* SYN cookie handling */
-	if (tcp_v4_syn_conn_limit(sk, skb))
-		goto drop;
-
 	req = inet_reqsk_alloc(&tcp_request_sock_ops);
 	if (!req)
 		goto drop;
@@ -1792,6 +1820,12 @@ int tcp_v4_rcv(struct sk_buff *skb)
 	if (!sk)
 		goto no_tcp_socket;
 
+	/* Early and parallel SYN limit check, that sends syncookies */
+	if (sk->sk_state == TCP_LISTEN && th->syn && !th->ack && !th->fin) {
+		if (tcp_v4_syn_conn_limit(sk, skb))
+			goto discard_and_relse;
+	}
+
 process:
 	if (sk->sk_state == TCP_TIME_WAIT)
 		goto do_time_wait;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 803cbfe..81fd4fc 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2458,6 +2458,7 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
 	int tcp_header_size;
 	int mss;
 	int s_data_desired = 0;
+	int tcp_full_space_val;
 
 	if (cvp != NULL && cvp->s_data_constant && cvp->s_data_desired)
 		s_data_desired = cvp->s_data_desired;
@@ -2479,13 +2480,16 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
 		/* Set this up on the first call only */
 		req->window_clamp = tp->window_clamp ? : dst_metric(dst, RTAX_WINDOW);
 
+		/* Instruct compiler not do additional loads */
+		ACCESS_ONCE(tcp_full_space_val) = tcp_full_space(sk);
+
 		/* limit the window selection if the user enforce a smaller rx buffer */
 		if (sk->sk_userlocks & SOCK_RCVBUF_LOCK &&
-		    (req->window_clamp > tcp_full_space(sk) || req->window_clamp == 0))
-			req->window_clamp = tcp_full_space(sk);
+		    (req->window_clamp > tcp_full_space_val || req->window_clamp == 0))
+			req->window_clamp = tcp_full_space_val;
 
 		/* tcp_full_space because it is guaranteed to be the first packet */
-		tcp_select_initial_window(tcp_full_space(sk),
+		tcp_select_initial_window(tcp_full_space_val,
 			mss - (ireq->tstamp_ok ? TCPOLEN_TSTAMP_ALIGNED : 0),
 			&req->rcv_wnd,
 			&req->window_clamp,
@@ -2582,6 +2586,7 @@ void tcp_connect_init(struct sock *sk)
 {
 	const struct dst_entry *dst = __sk_dst_get(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
+	int tcp_full_space_val;
 	__u8 rcv_wscale;
 
 	/* We'll fix this up when we get a response from the other end.
@@ -2610,12 +2615,15 @@ void tcp_connect_init(struct sock *sk)
 
 	tcp_initialize_rcv_mss(sk);
 
+	/* Instruct compiler not do additional loads */
+	ACCESS_ONCE(tcp_full_space_val) = tcp_full_space(sk);
+
 	/* limit the window selection if the user enforce a smaller rx buffer */
 	if (sk->sk_userlocks & SOCK_RCVBUF_LOCK &&
-	    (tp->window_clamp > tcp_full_space(sk) || tp->window_clamp == 0))
-		tp->window_clamp = tcp_full_space(sk);
+	    (tp->window_clamp > tcp_full_space_val || tp->window_clamp == 0))
+		tp->window_clamp = tcp_full_space_val;
 
-	tcp_select_initial_window(tcp_full_space(sk),
+	tcp_select_initial_window(tcp_full_space_val,
 				  tp->advmss - (tp->rx_opt.ts_recent_stamp ? tp->tcp_header_len - sizeof(struct tcphdr) : 0),
 				  &tp->rcv_wnd,
 				  &tp->window_clamp,

^ permalink raw reply related
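
The "read the value once, then reuse the local copy" pattern from the
tcp_output.c hunks above can be sketched in userspace as follows. This is
an illustration under assumptions, not the patch's exact construct: the
patch applies ACCESS_ONCE() to the local variable being stored to, while
this sketch puts the volatile access on the load of the shared value. All
names are hypothetical.

```c
/* Hypothetical shared value that another context may change at any time. */
int shared_space_demo = 4096;

/* Read through a volatile-qualified pointer so that the compiler performs
 * exactly one load here and cannot re-read the variable later. */
static int read_once_int(const int *p)
{
	return *(const volatile int *)p;
}

/* Clamp a window against a single snapshot of the shared value, following
 * the shape of the window_clamp test in the patch. */
int clamp_window_demo(int window_clamp)
{
	int space = read_once_int(&shared_space_demo);	/* one snapshot */

	if (window_clamp > space || window_clamp == 0)
		window_clamp = space;

	return window_clamp;
}
```

Because the comparison and the assignment both use the same snapshot,
the result stays consistent even if the shared value changes between
the two steps.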

* [RFC v2 PATCH 1/3] tcp: extract syncookie part of tcp_v4_conn_request()
From: Jesper Dangaard Brouer @ 2012-05-31 13:39 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, Christoph Paasch, Eric Dumazet,
	David S. Miller, Martin Topholm
  Cc: Florian Westphal, Hans Schillstrom
In-Reply-To: <20120531133807.10311.79711.stgit@localhost.localdomain>

From: Jesper Dangaard Brouer <jbrouer@redhat.com>

Move the SYN cookie handling from tcp_v4_conn_request() into a separate
function named tcp_v4_syn_conn_limit(). The semantics should be
almost the same.

Besides code cleanup, this patch prepares for handling SYN cookies
at an earlier step, to avoid a spinlock and achieve parallel processing.

Signed-off-by: Martin Topholm <mph@hoth.dk>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---

 net/ipv4/tcp_ipv4.c |  122 +++++++++++++++++++++++++++++++++++++++++----------
 1 files changed, 98 insertions(+), 24 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index a43b87d..ed9d35a 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1268,6 +1268,95 @@ static const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops = {
 };
 #endif
 
+/* Check SYN connect limit and send SYN-ACK cookies
+ * - Return 0 = No limitation needed, continue processing
+ * - Return 1 = Stop processing, free SKB, SYN cookie send (if enabled)
+ */
+int tcp_v4_syn_conn_limit(struct sock *sk, struct sk_buff *skb)
+{
+	struct request_sock *req;
+	struct inet_request_sock *ireq;
+	struct tcp_options_received tmp_opt;
+	__be32 saddr = ip_hdr(skb)->saddr;
+	__be32 daddr = ip_hdr(skb)->daddr;
+	__u32 isn = TCP_SKB_CB(skb)->when;
+	const u8 *hash_location; /* No really used */
+
+	/* Never answer to SYNs send to broadcast or multicast */
+	if (skb_rtable(skb)->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST))
+		goto drop;
+
+	/* If "isn" is not zero, this request hit alive timewait bucket */
+	if (isn)
+		goto no_limit;
+
+	/* Start sending SYN cookies when request sock queue is full*/
+	if (!inet_csk_reqsk_queue_is_full(sk))
+		goto no_limit;
+
+	/* Check if SYN cookies are enabled
+	 * - Side effect: NET_INC_STATS_BH counters + printk logging
+	 */
+	if (!tcp_syn_flood_action(sk, skb, "TCP"))
+		goto drop; /* Not enabled, indicate drop, due to queue full */
+
+	/* Allocate a request_sock */
+	req = inet_reqsk_alloc(&tcp_request_sock_ops);
+	if (!req) {
+		net_warn_ratelimited ("%s: Could not alloc request_sock"
+				      ", drop conn from %pI4",
+				      __func__, &saddr);
+		goto drop;
+	}
+
+#ifdef CONFIG_TCP_MD5SIG
+	tcp_rsk(req)->af_specific = &tcp_request_sock_ipv4_ops;
+#endif
+
+	tcp_clear_options(&tmp_opt);
+	tmp_opt.mss_clamp = TCP_MSS_DEFAULT;
+	tmp_opt.user_mss  = tcp_sk(sk)->rx_opt.user_mss;
+	tcp_parse_options(skb, &tmp_opt, &hash_location, 0);
+
+	if (!tmp_opt.saw_tstamp)
+		tcp_clear_options(&tmp_opt);
+
+	tmp_opt.tstamp_ok = tmp_opt.saw_tstamp;
+	tcp_openreq_init(req, &tmp_opt, skb);
+
+	/* Update req as an inet_request_sock (typecast trick)*/
+	ireq = inet_rsk(req);
+	ireq->loc_addr = daddr;
+	ireq->rmt_addr = saddr;
+	ireq->no_srccheck = inet_sk(sk)->transparent;
+	ireq->opt = tcp_v4_save_options(sk, skb);
+
+	if (security_inet_conn_request(sk, skb, req))
+		goto drop_and_free;
+
+	/* Cookie support for ECN if the TCP timestamp option is available */
+	if (tmp_opt.tstamp_ok)
+		TCP_ECN_create_request(req, skb);
+
+	/* Encode cookie in InitialSeqNum of SYN-ACK packet */
+	isn = cookie_v4_init_sequence(sk, skb, &req->mss);
+	req->cookie_ts = tmp_opt.tstamp_ok;
+
+	tcp_rsk(req)->snt_isn = isn;
+	tcp_rsk(req)->snt_synack = tcp_time_stamp;
+
+	/* Send SYN-ACK containing cookie */
+	tcp_v4_send_synack(sk, NULL, req, NULL);
+
+drop_and_free:
+	reqsk_free(req);
+drop:
+	return 1;
+no_limit:
+	return 0;
+}
+
+/* Handle SYN request */
 int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_extend_values tmp_ext;
@@ -1280,22 +1369,11 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 	__be32 saddr = ip_hdr(skb)->saddr;
 	__be32 daddr = ip_hdr(skb)->daddr;
 	__u32 isn = TCP_SKB_CB(skb)->when;
-	bool want_cookie = false;
 
 	/* Never answer to SYNs send to broadcast or multicast */
 	if (skb_rtable(skb)->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST))
 		goto drop;
 
-	/* TW buckets are converted to open requests without
-	 * limitations, they conserve resources and peer is
-	 * evidently real one.
-	 */
-	if (inet_csk_reqsk_queue_is_full(sk) && !isn) {
-		want_cookie = tcp_syn_flood_action(sk, skb, "TCP");
-		if (!want_cookie)
-			goto drop;
-	}
-
 	/* Accept backlog is full. If we have already queued enough
 	 * of warm entries in syn queue, drop request. It is better than
 	 * clogging syn queue with openreqs with exponentially increasing
@@ -1304,6 +1382,10 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 	if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1)
 		goto drop;
 
+	/* SYN cookie handling */
+	if (tcp_v4_syn_conn_limit(sk, skb))
+		goto drop;
+
 	req = inet_reqsk_alloc(&tcp_request_sock_ops);
 	if (!req)
 		goto drop;
@@ -1317,6 +1399,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 	tmp_opt.user_mss  = tp->rx_opt.user_mss;
 	tcp_parse_options(skb, &tmp_opt, &hash_location, 0);
 
+	/* Handle RFC6013 - TCP Cookie Transactions (TCPCT) options */
 	if (tmp_opt.cookie_plus > 0 &&
 	    tmp_opt.saw_tstamp &&
 	    !tp->rx_opt.cookie_out_never &&
@@ -1339,7 +1422,6 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 		while (l-- > 0)
 			*c++ ^= *hash_location++;
 
-		want_cookie = false;	/* not our kind of cookie */
 		tmp_ext.cookie_out_never = 0; /* false */
 		tmp_ext.cookie_plus = tmp_opt.cookie_plus;
 	} else if (!tp->rx_opt.cookie_in_always) {
@@ -1351,12 +1433,10 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 	}
 	tmp_ext.cookie_in_always = tp->rx_opt.cookie_in_always;
 
-	if (want_cookie && !tmp_opt.saw_tstamp)
-		tcp_clear_options(&tmp_opt);
-
 	tmp_opt.tstamp_ok = tmp_opt.saw_tstamp;
 	tcp_openreq_init(req, &tmp_opt, skb);
 
+	/* Update req as an inet_request_sock (typecast trick) */
 	ireq = inet_rsk(req);
 	ireq->loc_addr = daddr;
 	ireq->rmt_addr = saddr;
@@ -1366,13 +1446,9 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 	if (security_inet_conn_request(sk, skb, req))
 		goto drop_and_free;
 
-	if (!want_cookie || tmp_opt.tstamp_ok)
-		TCP_ECN_create_request(req, skb);
+	TCP_ECN_create_request(req, skb);
 
-	if (want_cookie) {
-		isn = cookie_v4_init_sequence(sk, skb, &req->mss);
-		req->cookie_ts = tmp_opt.tstamp_ok;
-	} else if (!isn) {
+	if (!isn) {
 		struct inet_peer *peer = NULL;
 		struct flowi4 fl4;
 
@@ -1422,8 +1498,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 	tcp_rsk(req)->snt_synack = tcp_time_stamp;
 
 	if (tcp_v4_send_synack(sk, dst, req,
-			       (struct request_values *)&tmp_ext) ||
-	    want_cookie)
+			       (struct request_values *)&tmp_ext))
 		goto drop_and_free;
 
 	inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
@@ -1438,7 +1513,6 @@ drop:
 }
 EXPORT_SYMBOL(tcp_v4_conn_request);
 
-
 /*
  * The three way handshake has completed - we got a valid synack -
  * now create the new socket.
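
For review purposes, the order of checks in tcp_v4_syn_conn_limit() can be
modelled in plain userspace C. This is only a rough sketch and not kernel code:
struct syn_ctx, enum syn_outcome and syn_conn_limit_model() are hypothetical
names invented for illustration, and each boolean field merely stands in for
the result of the kernel call named in its comment.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the state the new function inspects. */
struct syn_ctx {
	bool bcast_or_mcast;   /* rt_flags has RTCF_BROADCAST or RTCF_MULTICAST */
	unsigned int isn;      /* non-zero: SYN hit a live timewait bucket */
	bool reqsk_queue_full; /* inet_csk_reqsk_queue_is_full() */
	bool cookies_enabled;  /* tcp_syn_flood_action() returned true */
	bool alloc_ok;         /* inet_reqsk_alloc() returned non-NULL */
	bool lsm_rejects;      /* security_inet_conn_request() returned non-zero */
};

enum syn_outcome {
	SYN_CONTINUE = 0, /* no limit applies, caller keeps processing */
	SYN_STOP = 1,     /* caller must drop the skb */
};

/*
 * Walks the checks in the same order as the patch. *cookie_sent reports
 * whether the model reached the point where a SYN-ACK cookie is sent.
 */
static enum syn_outcome syn_conn_limit_model(const struct syn_ctx *c,
					     bool *cookie_sent)
{
	*cookie_sent = false;

	if (c->bcast_or_mcast)
		return SYN_STOP;
	if (c->isn)
		return SYN_CONTINUE;
	if (!c->reqsk_queue_full)
		return SYN_CONTINUE;
	if (!c->cookies_enabled)
		return SYN_STOP;
	if (!c->alloc_ok)
		return SYN_STOP;
	if (c->lsm_rejects)
		return SYN_STOP;

	*cookie_sent = true;
	return SYN_STOP; /* the request sock is freed even after a cookie is sent */
}
```

The point the model makes explicit is that a return value of 1 covers two
cases, a plain drop and a drop after a cookie SYN-ACK went out. The caller
treats both identically.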
