Netdev List
* [PATCH 0/4 v4] net: Implement fast TX queue selection
From: Krishna Kumar @ 2009-10-20  9:46 UTC
  To: davem; +Cc: netdev, herbert, Krishna Kumar, dada1

From: Krishna Kumar <krkumar2@in.ibm.com>

Notes:
------
    1.  Eric suggested:
	- To use u16 for txq#, but I am using an "int" for now as that
	  avoids one unnecessary subtraction during tx.
	- An improvement of caching the txq at connection establishment
	  time so as to use rxq# = txq# (TBD later).
	- Drivers can call sk_tx_queue_set() to set the txq if they are
	  going to call skb_tx_hash() internally.
    2. v3 & v4 patches stress tested with 1000 netperfs, reboots, etc.


Changelog [from v3]:
--------------------
	1. Changed the order of patches so that the patch setting the
	   txq is moved to the end. This results in bisect-safe patches.
	2. Fixed a build failure.


Changelog [from v2]:
--------------------
	1. Changed names of functions setting, getting and returning the
	   txq#; and added a new one to reset the txq#.
	2. Free sk doesn't need to reset txq#.


Changelog [from v1]:
--------------------
	1. Changed IPv6 code to call __sk_dst_reset() directly.
	2. Removed the patch re-arranging ("encapsulating") __sk_dst_reset()


Multiqueue cards on routers/firewalls set skb->queue_mapping on
input, which helps in faster xmit. Implement fast queue selection
for locally generated packets too, by saving the txq# for
connected sockets (in dev_pick_tx) and using it in subsequent
iterations. Locally generated packets for a connection will xmit
on the same txq, but routing & firewall loads should not be
affected by this patch. Tests show that the distribution across
txqs for 1-4 netperf sessions is similar to the existing code.


                   Testing & results:
                   ------------------

1. Cycles/Iter (C/I) used by dev_pick_tx:
         (B -> Billion,   M -> Million)
   |--------------|------------------------|------------------------|
   |              |          ORG           |          NEW           |
   |  Test        |--------|---------|-----|--------|---------|-----|
   |              | Cycles |  Iters  | C/I | Cycles | Iters   | C/I |
   |--------------|--------|---------|-----|--------|---------|-----|
   | [TCP_STREAM, | 3.98 B | 12.47 M | 320 | 1.95 B | 12.92 M | 152 |
   |  UDP_STREAM, |        |         |     |        |         |     |
   |  TCP_RR,     |        |         |     |        |         |     |
   |  UDP_RR]     |        |         |     |        |         |     |
   |--------------|--------|---------|-----|--------|---------|-----|
   | [TCP_STREAM, | 8.92 B | 29.66 M | 300 | 3.82 B | 38.88 M | 98  |
   |  TCP_RR,     |        |         |     |        |         |     |
   |  UDP_RR]     |        |         |     |        |         |     |
   |--------------|--------|---------|-----|--------|---------|-----|

2. Stress test (over 48 hours) : 1000 netperfs running combination
   of TCP_STREAM/RR, UDP_STREAM/RR (v4/6, NODELAY/~NODELAY for all
   tests), with some ssh sessions, reboots, modprobe -r driver, etc.

3. Performance test (10 hours): a single 10-hour netperf run of
   TCP_STREAM/RR, TCP_STREAM + NO_DELAY and UDP_RR. Results show an
   improvement in both performance and CPU utilization.

Tested on a 4-processor AMD Opteron 2.8 GHz system with 1GB memory,
10G Chelsio card. Each BW number is the sum of 3 iterations of
individual tests using 512, 16K, 64K & 128K I/O sizes, in Mb/s:

------------------------  TCP Tests  -----------------------
#procs  Org BW     New BW (%)     Org SD     New SD (%)
------------------------------------------------------------
1       77777.7    81011.0 (4.15)    42.3     40.2 (-5.11)
4       91599.2    91878.8 (.30)    955.9    919.3 (-3.83)
6       89533.3    91792.2 (2.52)  2262.0   2143.0 (-5.25)
8       87507.5    89161.9 (1.89)  4363.4   4073.6 (-6.64)
10      85152.4    85607.8 (.53)   6890.4   6851.2 (-.56)
------------------------------------------------------------

------------------------- TCP NO_DELAY Tests ---------------
#procs  Org BW     New BW (%)      Org SD      New SD (%)
------------------------------------------------------------
1       57001.9    57888.0 (1.55)     67.7      70.2 (3.75)
4       69555.1    69957.4 (.57)     823.0     834.3 (1.36)
6       71359.3    71918.7 (.78)    1740.8    1724.5 (-.93)
8       72577.6    72496.1 (-.11)   2955.4    2937.7 (-.59)
10      70829.6    71444.2 (.86)    4826.1    4673.4 (-3.16)
------------------------------------------------------------

----------------------- Request Response Tests --------------------
#procs  Org TPS     New TPS (%)      Org SD    New SD (%)
(1-10)
-------------------------------------------------------------------
TCP     1019245.9   1042626.4 (2.29) 16352.9   16459.8 (.65)
UDP     934598.64   942956.9  (.89)  11607.3   11593.2 (-.12)
-------------------------------------------------------------------
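
As a rough cross-check of the derived columns above, C/I is cycles
divided by iterations and each (%) is the relative change of New over
Org. The sketch below is only illustrative: the helper names are
hypothetical, and since the posted inputs are rounded the recomputed
values agree with the tables only approximately.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: average cycles spent per dev_pick_tx call. */
static double cycles_per_iter(double cycles, double iters)
{
	assert(iters > 0.0);
	return cycles / iters;
}

/* Hypothetical helper: relative change of new_val over org, in percent. */
static double pct_change(double org, double new_val)
{
	assert(org != 0.0);
	return (new_val - org) / org * 100.0;
}

/* Recompute a few entries from the rounded figures posted above. */
static void print_cross_check(void)
{
	printf("ORG C/I (4 tests): %.1f\n", cycles_per_iter(3.98e9, 12.47e6));
	printf("NEW C/I (3 tests): %.1f\n", cycles_per_iter(3.82e9, 38.88e6));
	printf("TCP 1 proc BW:     %.2f%%\n", pct_change(77777.7, 81011.0));
	printf("TCP 8 procs SD:    %.2f%%\n", pct_change(4363.4, 4073.6));
}
```

Calling print_cross_check() from a small main() is enough to compare
against the tables; a negative SD change means less CPU per unit of
work.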

Thanks,

- KK

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
---

* [PATCH 1/4 v4] net: Introduce sk_tx_queue_mapping
From: Krishna Kumar @ 2009-10-20  9:46 UTC
  To: davem; +Cc: netdev, herbert, Krishna Kumar, dada1
In-Reply-To: <20091020094607.10404.81794.sendpatchset@localhost.localdomain>

From: Krishna Kumar <krkumar2@in.ibm.com>

Introduce sk_tx_queue_mapping; and functions that set, test and
get this value. Reset sk_tx_queue_mapping to -1 whenever the dst
cache is set/reset, and in socket alloc. Setting txq to -1 and
using valid txq=<0 to n-1> allows the tx path to use the value
of sk_tx_queue_mapping directly instead of subtracting 1 on every
tx.
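
To illustrate the trade-off, here is a minimal user-space sketch (not
kernel code) contrasting the two encodings. struct sock_model and the
txq_u16_*() helpers are hypothetical and exist only for comparison;
the txq_int_*() helpers follow the shape of the ones added below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant part of struct sock. */
struct sock_model {
	int      txq_int;	/* -1 = not recorded, else queue 0..n-1 */
	uint16_t txq_u16;	/* 0 = not recorded, else queue + 1 */
};

/* int encoding: the stored value is the queue number itself. */
static inline void txq_int_set(struct sock_model *sk, int q)
{
	assert(q >= 0);
	sk->txq_int = q;
}

static inline void txq_int_clear(struct sock_model *sk)
{
	sk->txq_int = -1;
}

static inline bool txq_int_recorded(const struct sock_model *sk)
{
	return sk->txq_int >= 0;
}

static inline int txq_int_get(const struct sock_model *sk)
{
	return sk->txq_int;		/* no arithmetic on the tx path */
}

/* u16 alternative: 0 marks "unset", so every read subtracts 1. */
static inline void txq_u16_set(struct sock_model *sk, int q)
{
	assert(q >= 0 && q < UINT16_MAX);
	sk->txq_u16 = (uint16_t)(q + 1);
}

static inline void txq_u16_clear(struct sock_model *sk)
{
	sk->txq_u16 = 0;
}

static inline bool txq_u16_recorded(const struct sock_model *sk)
{
	return sk->txq_u16 != 0;
}

static inline int txq_u16_get(const struct sock_model *sk)
{
	return sk->txq_u16 - 1;		/* the extra subtraction */
}
```

Both variants return the same queue; the int form simply trades two
extra bytes per socket for one less operation per transmitted packet.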

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 include/net/sock.h |   26 ++++++++++++++++++++++++++
 net/core/sock.c    |    5 ++++-
 2 files changed, 30 insertions(+), 1 deletion(-)

diff -ruNp org/include/net/sock.h new/include/net/sock.h
--- org/include/net/sock.h	2009-10-16 18:53:40.000000000 +0530
+++ new/include/net/sock.h	2009-10-16 21:38:44.000000000 +0530
@@ -107,6 +107,7 @@ struct net;
  *	@skc_node: main hash linkage for various protocol lookup tables
  *	@skc_nulls_node: main hash linkage for UDP/UDP-Lite protocol
  *	@skc_refcnt: reference count
+ *	@skc_tx_queue_mapping: tx queue number for this connection
  *	@skc_hash: hash value used with various protocol lookup tables
  *	@skc_family: network address family
  *	@skc_state: Connection state
@@ -128,6 +129,7 @@ struct sock_common {
 		struct hlist_nulls_node skc_nulls_node;
 	};
 	atomic_t		skc_refcnt;
+	int			skc_tx_queue_mapping;
 
 	unsigned int		skc_hash;
 	unsigned short		skc_family;
@@ -215,6 +217,7 @@ struct sock {
 #define sk_node			__sk_common.skc_node
 #define sk_nulls_node		__sk_common.skc_nulls_node
 #define sk_refcnt		__sk_common.skc_refcnt
+#define sk_tx_queue_mapping	__sk_common.skc_tx_queue_mapping
 
 #define sk_copy_start		__sk_common.skc_hash
 #define sk_hash			__sk_common.skc_hash
@@ -1094,8 +1097,29 @@ static inline void sock_put(struct sock 
 extern int sk_receive_skb(struct sock *sk, struct sk_buff *skb,
 			  const int nested);
 
+static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
+{
+	sk->sk_tx_queue_mapping = tx_queue;
+}
+
+static inline void sk_tx_queue_clear(struct sock *sk)
+{
+	sk->sk_tx_queue_mapping = -1;
+}
+
+static inline int sk_tx_queue_get(const struct sock *sk)
+{
+	return sk->sk_tx_queue_mapping;
+}
+
+static inline bool sk_tx_queue_recorded(const struct sock *sk)
+{
+	return (sk && sk->sk_tx_queue_mapping >= 0);
+}
+
 static inline void sk_set_socket(struct sock *sk, struct socket *sock)
 {
+	sk_tx_queue_clear(sk);
 	sk->sk_socket = sock;
 }
 
@@ -1152,6 +1176,7 @@ __sk_dst_set(struct sock *sk, struct dst
 {
 	struct dst_entry *old_dst;
 
+	sk_tx_queue_clear(sk);
 	old_dst = sk->sk_dst_cache;
 	sk->sk_dst_cache = dst;
 	dst_release(old_dst);
@@ -1170,6 +1195,7 @@ __sk_dst_reset(struct sock *sk)
 {
 	struct dst_entry *old_dst;
 
+	sk_tx_queue_clear(sk);
 	old_dst = sk->sk_dst_cache;
 	sk->sk_dst_cache = NULL;
 	dst_release(old_dst);
diff -ruNp org/net/core/sock.c new/net/core/sock.c
--- org/net/core/sock.c	2009-10-16 18:53:40.000000000 +0530
+++ new/net/core/sock.c	2009-10-16 21:29:02.000000000 +0530
@@ -357,6 +357,7 @@ struct dst_entry *__sk_dst_check(struct 
 	struct dst_entry *dst = sk->sk_dst_cache;
 
 	if (dst && dst->obsolete && dst->ops->check(dst, cookie) == NULL) {
+		sk_tx_queue_clear(sk);
 		sk->sk_dst_cache = NULL;
 		dst_release(dst);
 		return NULL;
@@ -953,7 +954,8 @@ static void sock_copy(struct sock *nsk, 
 	void *sptr = nsk->sk_security;
 #endif
 	BUILD_BUG_ON(offsetof(struct sock, sk_copy_start) !=
-		     sizeof(osk->sk_node) + sizeof(osk->sk_refcnt));
+		     sizeof(osk->sk_node) + sizeof(osk->sk_refcnt) +
+		     sizeof(osk->sk_tx_queue_mapping));
 	memcpy(&nsk->sk_copy_start, &osk->sk_copy_start,
 	       osk->sk_prot->obj_size - offsetof(struct sock, sk_copy_start));
 #ifdef CONFIG_SECURITY_NETWORK
@@ -997,6 +999,7 @@ static struct sock *sk_prot_alloc(struct
 
 		if (!try_module_get(prot->owner))
 			goto out_free_sec;
+		sk_tx_queue_clear(sk);
 	}
 
 	return sk;

* [PATCH 2/4 v4] net: IPv6 changes
From: Krishna Kumar @ 2009-10-20  9:46 UTC
  To: davem; +Cc: netdev, herbert, Krishna Kumar, dada1
In-Reply-To: <20091020094607.10404.81794.sendpatchset@localhost.localdomain>

From: Krishna Kumar <krkumar2@in.ibm.com>

IPv6: Reset sk_tx_queue_mapping when dst_cache is reset. Use the existing
__sk_dst_reset() helper to do the work.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 net/ipv6/inet6_connection_sock.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff -ruNp org/net/ipv6/inet6_connection_sock.c new/net/ipv6/inet6_connection_sock.c
--- org/net/ipv6/inet6_connection_sock.c	2009-10-16 21:29:19.000000000 +0530
+++ new/net/ipv6/inet6_connection_sock.c	2009-10-16 21:31:00.000000000 +0530
@@ -168,8 +168,7 @@ struct dst_entry *__inet6_csk_dst_check(
 	if (dst) {
 		struct rt6_info *rt = (struct rt6_info *)dst;
 		if (rt->rt6i_flow_cache_genid != atomic_read(&flow_cache_genid)) {
-			sk->sk_dst_cache = NULL;
-			dst_release(dst);
+			__sk_dst_reset(sk);
 			dst = NULL;
 		}
 	}

* [patch 1/3] KS8851: Add soft reset at probe time
From: Ben Dooks @ 2009-10-20  9:49 UTC
  To: netdev; +Cc: Ping.Doong
In-Reply-To: <20091020094902.274646871@fluff.org.uk>

[-- Attachment #1: ks8851-reset-at-probe.patch --]
[-- Type: text/plain, Size: 984 bytes --]

Issue a full soft reset at probe time.

This was reported by Doong Ping of Micrel, but with no explanation of why
this is necessary or what bug it fixes. Add it, as it does not seem to hurt
the current driver and ensures that the device is in a known state when we
start setting it up.

Signed-off-by: Ben Dooks <ben@simtec.co.uk>

---
 drivers/net/ks8851.c |    3 +++
 1 file changed, 3 insertions(+)

Index: b/drivers/net/ks8851.c
===================================================================
--- a/drivers/net/ks8851.c	2009-10-12 23:04:15.000000000 +0100
+++ b/drivers/net/ks8851.c	2009-10-13 13:28:43.000000000 +0100
@@ -1239,6 +1239,9 @@ static int __devinit ks8851_probe(struct
 	ndev->netdev_ops = &ks8851_netdev_ops;
 	ndev->irq = spi->irq;
 
+	/* issue a global soft reset to reset the device. */
+	ks8851_soft_reset(ks, GRR_GSR);
+
 	/* simple check for a valid chip being connected to the bus */
 
 	if ((ks8851_rdreg16(ks, KS_CIDER) & ~CIDER_REV_MASK) != CIDER_ID) {


* [patch 3/3] KS8851: Fix ks8851_set_rx_mode() for IFF_MULTICAST
From: Ben Dooks @ 2009-10-20  9:49 UTC
  To: netdev; +Cc: Ping.Doong
In-Reply-To: <20091020094902.274646871@fluff.org.uk>

[-- Attachment #1: ks8851-fix-mc-rxmode.patch --]
[-- Type: text/plain, Size: 1096 bytes --]

In ks8851_set_rx_mode() the case handling IFF_MULTICAST was also setting
the RXCR1_RXAE bit by accident. This meant that all unicast frames were
being accepted by the device. Remove RXCR1_RXAE from this case.

Note, RXCR1_RXAE was also masking a problem with setting the MAC address
properly, so this needs to be applied after fixing the MAC write order.

Fixes a bug reported by Doong, Ping of Micrel. This version of the
patch avoids setting RXCR1_RXME for all cases.

Signed-off-by: Ben Dooks <ben@simtec.co.uk>

---
 drivers/net/ks8851.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: b/drivers/net/ks8851.c
===================================================================
--- a/drivers/net/ks8851.c	2009-10-13 13:28:50.000000000 +0100
+++ b/drivers/net/ks8851.c	2009-10-13 13:28:51.000000000 +0100
@@ -980,7 +980,7 @@ static void ks8851_set_rx_mode(struct ne
 			mcptr = mcptr->next;
 		}
 
-		rxctrl.rxcr1 = RXCR1_RXME | RXCR1_RXAE | RXCR1_RXPAFMA;
+		rxctrl.rxcr1 = RXCR1_RXME | RXCR1_RXPAFMA;
 	} else {
 		/* just accept broadcast / unicast */
 		rxctrl.rxcr1 = RXCR1_RXPAFMA;


* [patch 0/3] KS8851 updates for -rc5
From: Ben Dooks @ 2009-10-20  9:49 UTC
  To: netdev; +Cc: Ping.Doong

Patches for some minor problems with the KS8851 SPI network driver
for inclusion into the kernel. This set is based on -rc5.

* [patch 2/3] KS8851: Fix MAC address write order
From: Ben Dooks @ 2009-10-20  9:49 UTC
  To: netdev; +Cc: Ping.Doong
In-Reply-To: <20091020094902.274646871@fluff.org.uk>

[-- Attachment #1: ks8851-fix-mac-order.patch --]
[-- Type: text/plain, Size: 2490 bytes --]

The MAC address register was being written in the wrong order, so add
a new address macro to convert a MAC address byte index to its register
address, and a ks8851_wrreg8() function to write each byte without having
to worry about any difficult byte swapping.
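
As a rough user-space illustration (not driver code) of the mapping
this introduces: byte i of the address goes to register 0x15 - i, so
dev_addr[0] lands at 0x15 and dev_addr[5] at 0x10. The regs[] array and
model_write_mac() are hypothetical; KS_MAR() and the byte-enable
calculation follow the patch, with the macro body parenthesised here.

```c
#include <assert.h>
#include <stdint.h>

#define ETH_ALEN	6
#define MAR_BASE	0x10	/* lowest MAC register, KS_MARL in the patch */

/* Register address for MAC byte _m, as in the patch (parenthesised). */
#define KS_MAR(_m)	(0x15 - (_m))

/* Byte-enable bit that ks8851_wrreg8() derives from the register address. */
static unsigned byte_enable(unsigned reg)
{
	return 1u << (reg & 3);
}

/*
 * Model of the new write loop. regs[] is a hypothetical image of the six
 * MAC registers 0x10..0x15, indexed by (address - MAR_BASE).
 */
static void model_write_mac(uint8_t regs[ETH_ALEN],
			    const uint8_t addr[ETH_ALEN])
{
	for (int i = 0; i < ETH_ALEN; i++) {
		int reg = KS_MAR(i);

		assert(reg >= MAR_BASE && reg < MAR_BASE + ETH_ALEN);
		regs[reg - MAR_BASE] = addr[i];
	}
}
```

Writing one byte at a time keeps the result independent of host
endianness, which the previous u16 casts of dev_addr were not.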

Fixes a bug reported by Doong, Ping of Micrel.

Signed-off-by: Ben Dooks <ben@simtec.co.uk>

---
 drivers/net/ks8851.c |   39 ++++++++++++++++++++++++++++++++++-----
 drivers/net/ks8851.h |    1 +
 2 files changed, 35 insertions(+), 5 deletions(-)

Index: b/drivers/net/ks8851.c
===================================================================
--- a/drivers/net/ks8851.c	2009-10-06 14:25:53.000000000 +0100
+++ b/drivers/net/ks8851.c	2009-10-06 14:42:39.000000000 +0100
@@ -171,6 +171,36 @@ static void ks8851_wrreg16(struct ks8851
 }
 
 /**
+ * ks8851_wrreg8 - write 8bit register value to chip
+ * @ks: The chip state
+ * @reg: The register address
+ * @val: The value to write
+ *
+ * Issue a write to put the value @val into the register specified in @reg.
+ */
+static void ks8851_wrreg8(struct ks8851_net *ks, unsigned reg, unsigned val)
+{
+	struct spi_transfer *xfer = &ks->spi_xfer1;
+	struct spi_message *msg = &ks->spi_msg1;
+	__le16 txb[2];
+	int ret;
+	int bit;
+
+	bit = 1 << (reg & 3);
+
+	txb[0] = cpu_to_le16(MK_OP(bit, reg) | KS_SPIOP_WR);
+	txb[1] = val;
+
+	xfer->tx_buf = txb;
+	xfer->rx_buf = NULL;
+	xfer->len = 3;
+
+	ret = spi_sync(ks->spidev, msg);
+	if (ret < 0)
+		ks_err(ks, "spi_sync() failed\n");
+}
+
+/**
  * ks8851_rx_1msg - select whether to use one or two messages for spi read
  * @ks: The device structure
  *
@@ -322,13 +352,12 @@ static void ks8851_soft_reset(struct ks8
 static int ks8851_write_mac_addr(struct net_device *dev)
 {
 	struct ks8851_net *ks = netdev_priv(dev);
-	u16 *mcp = (u16 *)dev->dev_addr;
+	int i;
 
 	mutex_lock(&ks->lock);
 
-	ks8851_wrreg16(ks, KS_MARL, mcp[0]);
-	ks8851_wrreg16(ks, KS_MARM, mcp[1]);
-	ks8851_wrreg16(ks, KS_MARH, mcp[2]);
+	for (i = 0; i < ETH_ALEN; i++)
+		ks8851_wrreg8(ks, KS_MAR(i), dev->dev_addr[i]);
 
 	mutex_unlock(&ks->lock);
 
Index: b/drivers/net/ks8851.h
===================================================================
--- a/drivers/net/ks8851.h	2009-10-06 14:34:32.000000000 +0100
+++ b/drivers/net/ks8851.h	2009-10-06 14:35:08.000000000 +0100
@@ -16,6 +16,7 @@
 #define CCR_32PIN				(1 << 0)
 
 /* MAC address registers */
+#define KS_MAR(_m)				(0x15 - (_m))
 #define KS_MARL					0x10
 #define KS_MARM					0x12
 #define KS_MARH					0x14


* [PATCH 3/4 v4] net: Fix for dst_negative_advice
From: Krishna Kumar @ 2009-10-20  9:46 UTC
  To: davem; +Cc: netdev, herbert, Krishna Kumar, dada1
In-Reply-To: <20091020094607.10404.81794.sendpatchset@localhost.localdomain>

From: Krishna Kumar <krkumar2@in.ibm.com>

dst_negative_advice() should check for a changed dst and reset
sk_tx_queue_mapping accordingly. Callers of dst_negative_advice()
now pass the sock as well.

(sk_reset_txq() is defined just for use by dst_negative_advice(). The
only other way around this that I could find is to move
dst_negative_advice() from dst.h to dst.c, include sock.h in dst.c, etc.)

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 include/net/dst.h      |   12 ++++++++++--
 net/core/sock.c        |    6 ++++++
 net/dccp/timer.c       |    4 ++--
 net/decnet/af_decnet.c |    2 +-
 net/ipv4/tcp_timer.c   |    4 ++--
 5 files changed, 21 insertions(+), 7 deletions(-)

diff -ruNp org/include/net/dst.h new/include/net/dst.h
--- org/include/net/dst.h	2009-10-16 21:30:56.000000000 +0530
+++ new/include/net/dst.h	2009-10-16 21:31:30.000000000 +0530
@@ -222,11 +222,19 @@ static inline void dst_confirm(struct ds
 		neigh_confirm(dst->neighbour);
 }
 
-static inline void dst_negative_advice(struct dst_entry **dst_p)
+static inline void dst_negative_advice(struct dst_entry **dst_p,
+				       struct sock *sk)
 {
 	struct dst_entry * dst = *dst_p;
-	if (dst && dst->ops->negative_advice)
+	if (dst && dst->ops->negative_advice) {
 		*dst_p = dst->ops->negative_advice(dst);
+
+		if (dst != *dst_p) {
+			extern void sk_reset_txq(struct sock *sk);
+
+			sk_reset_txq(sk);
+		}
+	}
 }
 
 static inline void dst_link_failure(struct sk_buff *skb)
diff -ruNp org/net/core/sock.c new/net/core/sock.c
--- org/net/core/sock.c	2009-10-16 21:30:56.000000000 +0530
+++ new/net/core/sock.c	2009-10-16 21:32:33.000000000 +0530
@@ -352,6 +352,12 @@ discard_and_relse:
 }
 EXPORT_SYMBOL(sk_receive_skb);
 
+void sk_reset_txq(struct sock *sk)
+{
+	sk_tx_queue_clear(sk);
+}
+EXPORT_SYMBOL(sk_reset_txq);
+
 struct dst_entry *__sk_dst_check(struct sock *sk, u32 cookie)
 {
 	struct dst_entry *dst = sk->sk_dst_cache;
diff -ruNp org/net/dccp/timer.c new/net/dccp/timer.c
--- org/net/dccp/timer.c	2009-10-16 21:30:56.000000000 +0530
+++ new/net/dccp/timer.c	2009-10-16 21:31:30.000000000 +0530
@@ -38,7 +38,7 @@ static int dccp_write_timeout(struct soc
 
 	if (sk->sk_state == DCCP_REQUESTING || sk->sk_state == DCCP_PARTOPEN) {
 		if (icsk->icsk_retransmits != 0)
-			dst_negative_advice(&sk->sk_dst_cache);
+			dst_negative_advice(&sk->sk_dst_cache, sk);
 		retry_until = icsk->icsk_syn_retries ?
 			    : sysctl_dccp_request_retries;
 	} else {
@@ -63,7 +63,7 @@ static int dccp_write_timeout(struct soc
 			   Golden words :-).
 		   */
 
-			dst_negative_advice(&sk->sk_dst_cache);
+			dst_negative_advice(&sk->sk_dst_cache, sk);
 		}
 
 		retry_until = sysctl_dccp_retries2;
diff -ruNp org/net/decnet/af_decnet.c new/net/decnet/af_decnet.c
--- org/net/decnet/af_decnet.c	2009-10-16 21:30:56.000000000 +0530
+++ new/net/decnet/af_decnet.c	2009-10-16 21:31:30.000000000 +0530
@@ -1955,7 +1955,7 @@ static int dn_sendmsg(struct kiocb *iocb
 	}
 
 	if ((flags & MSG_TRYHARD) && sk->sk_dst_cache)
-		dst_negative_advice(&sk->sk_dst_cache);
+		dst_negative_advice(&sk->sk_dst_cache, sk);
 
 	mss = scp->segsize_rem;
 	fctype = scp->services_rem & NSP_FC_MASK;
diff -ruNp org/net/ipv4/tcp_timer.c new/net/ipv4/tcp_timer.c
--- org/net/ipv4/tcp_timer.c	2009-10-16 21:30:56.000000000 +0530
+++ new/net/ipv4/tcp_timer.c	2009-10-16 21:31:30.000000000 +0530
@@ -141,14 +141,14 @@ static int tcp_write_timeout(struct sock
 
 	if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
 		if (icsk->icsk_retransmits)
-			dst_negative_advice(&sk->sk_dst_cache);
+			dst_negative_advice(&sk->sk_dst_cache, sk);
 		retry_until = icsk->icsk_syn_retries ? : sysctl_tcp_syn_retries;
 	} else {
 		if (retransmits_timed_out(sk, sysctl_tcp_retries1)) {
 			/* Black hole detection */
 			tcp_mtu_probing(icsk, sk);
 
-			dst_negative_advice(&sk->sk_dst_cache);
+			dst_negative_advice(&sk->sk_dst_cache, sk);
 		}
 
 		retry_until = sysctl_tcp_retries2;

* [PATCH 4/4 v4] net: Use sk_tx_queue_mapping for connected sockets
From: Krishna Kumar @ 2009-10-20  9:50 UTC
  To: davem; +Cc: netdev, herbert, Krishna Kumar, dada1
In-Reply-To: <20091020094607.10404.81794.sendpatchset@localhost.localdomain>

From: Krishna Kumar <krkumar2@in.ibm.com>

For connected sockets, the first run of dev_pick_tx saves the
calculated txq in sk_tx_queue_mapping. This is not saved if
either the device has its own queue selection routine
(ndo_select_queue) or the socket is not connected. Subsequent
calls to dev_pick_tx use the cached value of sk_tx_queue_mapping.
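
The flow can be sketched in user space roughly as follows. This is a
simplified model, not the kernel function: struct sock_model, struct
dev_model and model_tx_hash() are hypothetical stand-ins, has_dst
stands in for a non-NULL sk_dst_cache, and the ndo_select_queue branch
(which bypasses the cache) is left out.

```c
#include <assert.h>
#include <stdbool.h>

struct sock_model {
	int  tx_queue_mapping;	/* -1 until a queue has been recorded */
	bool has_dst;		/* models sk->sk_dst_cache != NULL */
};

struct dev_model {
	unsigned num_tx_queues;
	unsigned hash_calls;	/* counts slow-path hash computations */
};

/* Hypothetical stand-in for skb_tx_hash(): any deterministic spread. */
static unsigned model_tx_hash(struct dev_model *dev, unsigned flow_hash)
{
	dev->hash_calls++;
	return flow_hash % dev->num_tx_queues;
}

/* sk may be NULL, as for packets that are not locally generated. */
static unsigned model_pick_tx(struct dev_model *dev, struct sock_model *sk,
			      unsigned flow_hash)
{
	unsigned queue_index = 0;

	assert(dev->num_tx_queues > 0);

	/* Fast path: reuse the queue recorded for this socket. */
	if (sk && sk->tx_queue_mapping >= 0)
		return (unsigned)sk->tx_queue_mapping;

	/* Slow path: compute the queue as before. */
	if (dev->num_tx_queues > 1)
		queue_index = model_tx_hash(dev, flow_hash);

	/* Record it only for connected sockets. */
	if (sk && sk->has_dst)
		sk->tx_queue_mapping = (int)queue_index;

	return queue_index;
}
```

In this model a connected socket pays for the hash once, while an
unconnected one (or a NULL sk) takes the slow path on every call.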

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 net/core/dev.c |   24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

diff -ruNp org/net/core/dev.c new/net/core/dev.c
--- org/net/core/dev.c	2009-10-19 15:43:30.000000000 +0530
+++ new/net/core/dev.c	2009-10-20 12:24:40.000000000 +0530
@@ -1791,13 +1791,25 @@ EXPORT_SYMBOL(skb_tx_hash);
 static struct netdev_queue *dev_pick_tx(struct net_device *dev,
 					struct sk_buff *skb)
 {
-	const struct net_device_ops *ops = dev->netdev_ops;
-	u16 queue_index = 0;
+	u16 queue_index;
+	struct sock *sk = skb->sk;
+
+	if (sk_tx_queue_recorded(sk)) {
+		queue_index = sk_tx_queue_get(sk);
+	} else {
+		const struct net_device_ops *ops = dev->netdev_ops;
 
-	if (ops->ndo_select_queue)
-		queue_index = ops->ndo_select_queue(dev, skb);
-	else if (dev->real_num_tx_queues > 1)
-		queue_index = skb_tx_hash(dev, skb);
+		if (ops->ndo_select_queue) {
+			queue_index = ops->ndo_select_queue(dev, skb);
+		} else {
+			queue_index = 0;
+			if (dev->real_num_tx_queues > 1)
+				queue_index = skb_tx_hash(dev, skb);
+
+			if (sk && sk->sk_dst_cache)
+				sk_tx_queue_set(sk, queue_index);
+		}
+	}
 
 	skb_set_queue_mapping(skb, queue_index);
 	return netdev_get_tx_queue(dev, queue_index);

* Re: [PATCH] Add new cs89x0 driver
From: Ben Dooks @ 2009-10-20  9:55 UTC
  To: Sascha Hauer
  Cc: netdev, Lennert Buytenhek, Ivo Clarysse, Gilles Chanteperdrix
In-Reply-To: <1240387172-21818-1-git-send-email-s.hauer@pengutronix.de>

On Wed, Apr 22, 2009 at 09:59:30AM +0200, Sascha Hauer wrote:
> While this is a very old chip it is still in use on some embedded boards.
> I see myself unable to fix the in Kernel driver to bring it to driver
> model support, so this patch adds a new driver designed to replace
> the old one, at least for non-ISA hardware.

I think it is worthwhile; there are a number of SMDK boards and similar
which have used this chip, including boards released in the last year
or so.

I will try and see if I can do some reviewing and testing of this in
the next few days.

-- 
Ben (ben@fluff.org, http://www.fluff.org/)

  'a smiley only costs 4 bytes'

* Re: [Patch] (updated) netfilter: remove deprecated CONFIG_NF_CT_ACCT
From: Cong Wang @ 2009-10-20  9:58 UTC
  To: Krzysztof Olędzki; +Cc: linux-kernel, netdev, akpm, netfilter-devel
In-Reply-To: <4AD70FC2.5050302@ans.pl>


Sorry that I lost this thread. :(

Krzysztof Olędzki wrote:
> [resend, previously I incorrectly copied netfilter-devel-owner]
> 
> Adding CC to netfilter-devel@vger.kernel.org, where netfilter
> related issues are discussed.
> 
> On Thu, 15 Oct 2009, Amerigo Wang wrote:
> 
>>
>> CONFIG_NF_CT_ACCT is scheduled to be removed in 2.6.29.
>>
>> Cc: Krzysztof Piotr Oledzki <ole@ans.pl>
>> Signed-off-by: WANG Cong <amwang@redhat.com>
> 
> This is not so easy - it was decided that we don't want to remove it
> just like that because of connbytes. The final conclusion was to keep it
> disabled by default (if not enabled by kernel/modules/sysctl option) but
> to enable it automatically for the current NS if netfilter rules contain
> "-m connbytes".
> 
> Sorry, I should have gotten into this and finished it earlier, my bad. :(

No problem.

So we just want CONFIG_NF_CT_ACCT=n in default configs?

Thanks!

* Re: kernel panic in latest vanilla stable, while using nameif with "alive" pppoe interfaces
From: Denys Fedoryschenko @ 2009-10-20 11:36 UTC
  To: Cyrill Gorcunov
  Cc: Michal Ostrowski, Eric Dumazet, netdev, linux-ppp, paulus,
	mostrows
In-Reply-To: <aa79d98a0910192004g43de90d5p823293b888cf0c1d@mail.gmail.com>

On Tuesday 20 October 2009 06:04:35 Cyrill Gorcunov wrote:
> On 10/20/09, Denys Fedoryschenko <denys@visp.net.lb> wrote:
> > On Tuesday 20 October 2009 00:22:39 Michal Ostrowski wrote:
> >> I'm assuming that there was a race in us sending patches at nearly the
> >> same
> >> time I'm convinced now that the flush_lock can die, and the patch I sent
> >> out kills it.
> >
> > o_O
> >
> > I am drowning in patches. Just let me know which one to test :-)
>
> Oh ;) Try out latest Michal's patch (and then mine). I'll continue
> digg this issue at next spare time slot. Thanks!
Tried your patch, panic almost immediately
Here is a text of panic message over netconsole
http://www.nuclearcat.com/files/panic_pppoe3.txt
It is  different now, before it was pppoe_device_event, now in pppoe_rcv


* Re: kernel panic in latest vanilla stable, while using nameif with "alive" pppoe interfaces
From: Cyrill Gorcunov @ 2009-10-20 11:50 UTC
  To: Denys Fedoryschenko
  Cc: Michal Ostrowski, Eric Dumazet, netdev, linux-ppp, paulus,
	mostrows
In-Reply-To: <200910201436.50249.denys@visp.net.lb>

On 10/20/09, Denys Fedoryschenko <denys@visp.net.lb> wrote:
> On Tuesday 20 October 2009 06:04:35 Cyrill Gorcunov wrote:
>> On 10/20/09, Denys Fedoryschenko <denys@visp.net.lb> wrote:
>> > On Tuesday 20 October 2009 00:22:39 Michal Ostrowski wrote:
>> >> I'm assuming that there was a race in us sending patches at nearly the
>> >> same
>> >> time I'm convinced now that the flush_lock can die, and the patch I
>> >> sent
>> >> out kills it.
>> >
>> > o_O
>> >
>> > I am drowning in patches. Just let me know which one to test :-)
>>
>> Oh ;) Try out latest Michal's patch (and then mine). I'll continue
>> digg this issue at next spare time slot. Thanks!
> Tried your patch, panic almost immediately
> Here is a text of panic message over netconsole
> http://www.nuclearcat.com/files/panic_pppoe3.txt
> It is  different now, before it was pppoe_device_event, now in pppoe_rcv
>
>
Thanks a lot! I'll back with new one in a couple of hours. Meanwhile i
suppose you may try Michal's patch as well.

* Re: kernel panic in latest vanilla stable, while using nameif with "alive" pppoe interfaces
From: Denys Fedoryschenko @ 2009-10-20 11:52 UTC
  To: Cyrill Gorcunov
  Cc: Michal Ostrowski, Eric Dumazet, netdev, linux-ppp, paulus,
	mostrows
In-Reply-To: <aa79d98a0910200450n1a283f87hf946bb3b756d0bd@mail.gmail.com>

On Tuesday 20 October 2009 14:50:09 Cyrill Gorcunov wrote:
> On 10/20/09, Denys Fedoryschenko <denys@visp.net.lb> wrote:
> > On Tuesday 20 October 2009 06:04:35 Cyrill Gorcunov wrote:
> >> On 10/20/09, Denys Fedoryschenko <denys@visp.net.lb> wrote:
> >> > On Tuesday 20 October 2009 00:22:39 Michal Ostrowski wrote:
> >> >> I'm assuming that there was a race in us sending patches at nearly
> >> >> the same
> >> >> time I'm convinced now that the flush_lock can die, and the patch I
> >> >> sent
> >> >> out kills it.
> >> >
> >> > o_O
> >> >
> >> > I am drowning in patches. Just let me know which one to test :-)
> >>
> >> Oh ;) Try out latest Michal's patch (and then mine). I'll continue
> >> digg this issue at next spare time slot. Thanks!
> >
> > Tried your patch, panic almost immediately
> > Here is a text of panic message over netconsole
> > http://www.nuclearcat.com/files/panic_pppoe3.txt
> > It is  different now, before it was pppoe_device_event, now in pppoe_rcv
>
> Thanks a lot! I'll back with new one in a couple of hours. Meanwhile i
> suppose you may try Michal's patch as well.
I did, it didn't help.
Maybe i can run some debugging options in kernel?
Also i can add debug(printk) lines in kernel if you want, to see where is bug 
appearing.
Note, i told to Michal, so will tell here, this pc is hyperthreading P4, as i 
know it is very good to trigger various SMP race conditions.
I can try also it with nosmp if u want.


* Re: [PATCH net-next V2 1/3] iwmc3200top: Add Intel Wireless MultiCom 3200 top driver.
From: Tomas Winkler @ 2009-10-20 11:53 UTC
  To: David Miller, Marcel Holtmann
  Cc: linville, netdev, linux-wireless, linux-mmc, yi.zhu,
	inaky.perez-gonzalez, cindy.h.kao, guy.cohen, ron.rindjunsky
In-Reply-To: <20091019.215457.43252934.davem@davemloft.net>

On Tue, Oct 20, 2009 at 6:54 AM, David Miller <davem@davemloft.net> wrote:
> From: Tomas Winkler <tomas.winkler@intel.com>
> Date: Sat, 17 Oct 2009 21:09:34 +0200
>
>> This patch adds Intel Wireless MultiCom 3200 top driver.
>> IWMC3200 is 4Wireless Com CHIP (GPS/BT/WiFi/WiMAX).
>> Top driver is responsible for device initialization and firmware download.
>> Firmware handled by top is responsible for top itself and
>> as well as bluetooth and GPS coms. (Wifi and WiMax provide their own firmware)
>> In addition top driver is used to retrieve firmware logs
>> and supports other debugging features
>>
>> Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
>
> Applied to net-next-2.6

Thanks Dave

Marcel
I want to send out the BT driver now. Would you like the patch against
bluetooth-next-2.6.git, in which case I will wait until you sync, or can
you pick it up from net-next if Dave is OK with that?

Thanks
Tomas

* [PATCH] ifb: should not use __dev_get_by_index() without locks
From: Eric Dumazet @ 2009-10-20 12:35 UTC
  To: David Miller; +Cc: netdev
In-Reply-To: <20091019.212855.179405364.davem@davemloft.net>

David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 20 Oct 2009 06:23:54 +0200
> 
>> I wonder if the whole thing could use RCU somehow, since some
>> workloads hit this dev_base_lock rwlock pretty hard...
> 
> True, but for now we'll put your fix in :-)

Here is another vulnerable point, which needs the following patch.

Thanks

[PATCH] ifb: should not use __dev_get_by_index() without locks

At this point (ri_tasklet()), neither RTNL nor dev_base_lock is held,
so we must use dev_get_by_index() instead of __dev_get_by_index().

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---

diff --git a/drivers/net/ifb.c b/drivers/net/ifb.c
index 801f088..030913f 100644
--- a/drivers/net/ifb.c
+++ b/drivers/net/ifb.c
@@ -98,12 +98,13 @@ static void ri_tasklet(unsigned long dev)
 		stats->tx_packets++;
 		stats->tx_bytes +=skb->len;
 
-		skb->dev = __dev_get_by_index(&init_net, skb->iif);
+		skb->dev = dev_get_by_index(&init_net, skb->iif);
 		if (!skb->dev) {
 			dev_kfree_skb(skb);
 			stats->tx_dropped++;
 			break;
 		}
+		dev_put(skb->dev);
 		skb->iif = _dev->ifindex;
 
 		if (from & AT_EGRESS) {


* Policy routing + route "via" gives a strange behavior
From: Guido Trotter @ 2009-10-20 13:28 UTC
  To: netdev


Hi,

I'm seeing what I think might be a strange kernel behavior when setting up a
route "via" a gateway, with policy routing. When adding a route with a gateway,
the kernel accepts it only if the gateway is reachable via that device. For
example:

ip route add default dev eth1 via 192.168.5.254

is only accepted if there is a route like:
192.168.5.0/24 dev eth1 scope link

in the main routing table, which, of course, is OK: otherwise the kernel
wouldn't be able to reach 192.168.5.254 in the first place.

Now, when adding policy routing to the mix, if I do:
ip route add table 100 default dev eth1 via 192.168.5.254

This is also refused unless a route like the one above appears in the default
table, even though it does appear in table 100. Is this the right behavior, and
if so, why? It seems to me that it should be acceptable to have the network
route only in the separate routing table, since the "via" will only be
used by traffic hitting that table anyway.

Thanks a lot,

Guido



* Re: kernel panic in latest vanilla stable, while using nameif with "alive" pppoe interfaces
From: Cyrill Gorcunov @ 2009-10-20 13:42 UTC (permalink / raw)
  To: Denys Fedoryschenko
  Cc: Michal Ostrowski, Eric Dumazet, netdev, linux-ppp, paulus,
	mostrows
In-Reply-To: <200910201452.58789.denys@visp.net.lb>

[Denys Fedoryschenko - Tue, Oct 20, 2009 at 02:52:58PM +0300]
...
| > Thanks a lot! I'll back with new one in a couple of hours. Meanwhile i
| > suppose you may try Michal's patch as well.
| I did, it didn't help.
| Maybe i can run some debugging options in kernel?
| Also i can add debug(printk) lines in kernel if you want, to see where is bug 
| appearing.
| Note, i told to Michal, so will tell here, this pc is hyperthreading P4, as i 
| know it is very good to trigger various SMP race conditions.
| I can try also it with nosmp if u want.
| 

Thanks Denys, I'm preparing a new patch (I just got back from the office
and had no internet connection, which is why the reply is delayed, sorry).

	-- Cyrill


* Re: kernel panic in latest vanilla stable, while using nameif with "alive" pppoe interfaces
From: Denys Fedoryschenko @ 2009-10-20 13:50 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Michal Ostrowski, Eric Dumazet, netdev, linux-ppp, paulus,
	mostrows
In-Reply-To: <20091020134217.GA5181@lenovo>

On Tuesday 20 October 2009 16:42:17 Cyrill Gorcunov wrote:
> [Denys Fedoryschenko - Tue, Oct 20, 2009 at 02:52:58PM +0300]
> ...
>
> | > Thanks a lot! I'll back with new one in a couple of hours. Meanwhile i
> | > suppose you may try Michal's patch as well.
> |
> | I did, it didn't help.
> | Maybe i can run some debugging options in kernel?
> | Also i can add debug(printk) lines in kernel if you want, to see where is
> | bug appearing.
> | Note, i told to Michal, so will tell here, this pc is hyperthreading P4,
> | as i know it is very good to trigger various SMP race conditions.
> | I can try also it with nosmp if u want.
>
> Thanks Denys, I'm preparing new patch (just back from office
> and had no inet connection that is why reply is delayed, sorry).
There is no problem at all.
This rename operation is only planned for the future and the host is redundant,
so I can run tests on it anytime.



* Re: [PATCHv2 1/4] First Patch on TFRC-SP. Copy base files from TFRC
From: Ivo Calado @ 2009-10-20 13:51 UTC (permalink / raw)
  To: Gerrit Renker, Ivo Calado, dccp, netdev
In-Reply-To: <20091019052153.GB3366@gerrit.erg.abdn.ac.uk>

On Mon, Oct 19, 2009 at 02:21, Gerrit Renker <gerrit@erg.abdn.ac.uk> wrote:
> | First Patch on TFRC-SP.
> Please find attached one edit that I made.
>
> I added unwinding the initialisation of tfrc_lib in the case where the
> initialisation of tfrc_sp_lib fails.  Unwinding is now done in the reverse
> order of the steps done during initialisation.
>

Agree.
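
The reverse-order unwinding described in the quoted text can be sketched
roughly as follows. This is a minimal illustration only: toy_lib_init(),
toy_lib_exit(), toy_sp_lib_init() and toy_module_init() are hypothetical
stand-ins, not the actual tfrc_lib or tfrc_sp_lib functions, and the failure
switch exists purely so the error path can be exercised.

```c
#include <assert.h>

/* Hypothetical flags recording which library is currently initialised. */
int toy_lib_up;
int toy_sp_lib_up;
/* Set to nonzero to make the second init step fail (illustration only). */
int toy_sp_lib_should_fail;

int toy_lib_init(void)
{
	toy_lib_up = 1;
	return 0;
}

void toy_lib_exit(void)
{
	assert(toy_lib_up);
	toy_lib_up = 0;
}

int toy_sp_lib_init(void)
{
	if (toy_sp_lib_should_fail)
		return -1;
	toy_sp_lib_up = 1;
	return 0;
}

/* Initialise both libraries; on failure, undo the completed steps in the
 * reverse order of initialisation. */
int toy_module_init(void)
{
	int rc;

	rc = toy_lib_init();
	if (rc)
		goto out;
	rc = toy_sp_lib_init();
	if (rc)
		goto out_unwind_lib;
	return 0;

out_unwind_lib:
	toy_lib_exit();
out:
	return rc;
}
```

With more than two steps, each additional step would get its own label above
the previous one, so a failure at step N falls through the cleanups for steps
N-1 down to 1.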



-- 
Ivo Augusto Andrade Rocha Calado
MSc. Candidate
Embedded Systems and Pervasive Computing Lab - http://embedded.ufcg.edu.br
Systems and Computing Department - http://www.dsc.ufcg.edu.br
Electrical Engineering and Informatics Center - http://www.ceei.ufcg.edu.br
Federal University of Campina Grande - http://www.ufcg.edu.br

PGP: 0x03422935
Quidquid latine dictum sit, altum viditur.


* Re: kernel panic in latest vanilla stable, while using nameif with "alive" pppoe interfaces
From: Cyrill Gorcunov @ 2009-10-20 13:59 UTC (permalink / raw)
  To: Denys Fedoryschenko
  Cc: Michal Ostrowski, Eric Dumazet, netdev, linux-ppp, paulus,
	mostrows
In-Reply-To: <200910201650.10066.denys@visp.net.lb>

[Denys Fedoryschenko - Tue, Oct 20, 2009 at 04:50:09PM +0300]
|
...
| >
| > Thanks Denys, I'm preparing new patch (just back from office
| > and had no inet connection that is why reply is delayed, sorry).
| There is no problem at all.
| This rename operation is just future operation and host is redundant, so i can 
| do tests on it anytime.
| 

OK, here it is, please try it (it's still a draft version though)

	-- Cyrill
---
 drivers/net/pppoe.c |  106 +++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 81 insertions(+), 25 deletions(-)

Index: linux-2.6.git/drivers/net/pppoe.c
=====================================================================
--- linux-2.6.git.orig/drivers/net/pppoe.c
+++ linux-2.6.git/drivers/net/pppoe.c
@@ -313,8 +313,8 @@ static void pppoe_flush_dev(struct net_d
 			sk = sk_pppox(po);
 			spin_lock(&flush_lock);
 			po->pppoe_dev = NULL;
-			spin_unlock(&flush_lock);
 			dev_put(dev);
+			spin_unlock(&flush_lock);
 
 			/* We always grab the socket lock, followed by the
 			 * hash_lock, in that order.  Since we should
@@ -386,13 +386,21 @@ static struct notifier_block pppoe_notif
 static int pppoe_rcv_core(struct sock *sk, struct sk_buff *skb)
 {
 	struct pppox_sock *po = pppox_sk(sk);
-	struct pppox_sock *relay_po;
+	struct pppox_sock *relay_po = NULL;
+	struct net_device *dev = NULL;
 
 	if (sk->sk_state & PPPOX_BOUND) {
 		ppp_input(&po->chan, skb);
 	} else if (sk->sk_state & PPPOX_RELAY) {
-		relay_po = get_item_by_addr(dev_net(po->pppoe_dev),
-						&po->pppoe_relay);
+		struct pppoe_net *pn = pppoe_pernet(sock_net(sk));
+		read_lock_bh(&pn->hash_lock);
+		dev = po->pppoe_dev;
+		if (dev) {
+			dev_hold(dev);
+			relay_po = get_item_by_addr(dev_net(dev),
+					&po->pppoe_relay);
+		}
+		read_unlock_bh(&pn->hash_lock);
 		if (relay_po == NULL)
 			goto abort_kfree;
 
@@ -401,6 +409,7 @@ static int pppoe_rcv_core(struct sock *s
 
 		if (!__pppoe_xmit(sk_pppox(relay_po), skb))
 			goto abort_put;
+		dev_put(dev);
 	} else {
 		if (sock_queue_rcv_skb(sk, skb))
 			goto abort_kfree;
@@ -412,6 +421,8 @@ abort_put:
 	sock_put(sk_pppox(relay_po));
 
 abort_kfree:
+	if (dev)
+		dev_put(dev);
 	kfree_skb(skb);
 	return NET_RX_DROP;
 }
@@ -625,8 +636,8 @@ static int pppoe_connect(struct socket *
 	struct sock *sk = sock->sk;
 	struct sockaddr_pppox *sp = (struct sockaddr_pppox *)uservaddr;
 	struct pppox_sock *po = pppox_sk(sk);
-	struct net_device *dev;
-	struct pppoe_net *pn;
+	struct net_device *dev = NULL;
+	struct pppoe_net *pn = NULL;
 	int error;
 
 	lock_sock(sk);
@@ -652,12 +663,15 @@ static int pppoe_connect(struct socket *
 	/* Delete the old binding */
 	if (stage_session(po->pppoe_pa.sid)) {
 		pppox_unbind_sock(sk);
+		spin_lock(&flush_lock);
 		if (po->pppoe_dev) {
 			pn = pppoe_pernet(dev_net(po->pppoe_dev));
 			delete_item(pn, po->pppoe_pa.sid,
 				po->pppoe_pa.remote, po->pppoe_ifindex);
 			dev_put(po->pppoe_dev);
+			po->pppoe_dev = NULL;
 		}
+		spin_unlock(&flush_lock);
 		memset(sk_pppox(po) + 1, 0,
 		       sizeof(struct pppox_sock) - sizeof(struct sock));
 		sk->sk_state = PPPOX_NONE;
@@ -670,10 +684,11 @@ static int pppoe_connect(struct socket *
 		if (!dev)
 			goto end;
 
+		write_lock_bh(&pn->hash_lock);
+		dev_hold(dev);
 		po->pppoe_dev = dev;
 		po->pppoe_ifindex = dev->ifindex;
 		pn = pppoe_pernet(dev_net(dev));
-		write_lock_bh(&pn->hash_lock);
 		if (!(dev->flags & IFF_UP)) {
 			write_unlock_bh(&pn->hash_lock);
 			goto err_put;
@@ -700,6 +715,7 @@ static int pppoe_connect(struct socket *
 			goto err_put;
 
 		sk->sk_state = PPPOX_CONNECTED;
+		dev_put(dev);
 	}
 
 	po->num = sp->sa_addr.pppoe.sid;
@@ -708,10 +724,13 @@ end:
 	release_sock(sk);
 	return error;
 err_put:
+	dev_put(dev);
+	write_lock_bh(&pn->hash_lock);
 	if (po->pppoe_dev) {
 		dev_put(po->pppoe_dev);
 		po->pppoe_dev = NULL;
 	}
+	write_unlock_bh(&pn->hash_lock);
 	goto end;
 }
 
@@ -738,6 +757,8 @@ static int pppoe_ioctl(struct socket *so
 {
 	struct sock *sk = sock->sk;
 	struct pppox_sock *po = pppox_sk(sk);
+	struct pppoe_net *pn = pppoe_pernet(sock_net(sk));
+	unsigned int mtu = 0;
 	int val;
 	int err;
 
@@ -746,11 +767,17 @@ static int pppoe_ioctl(struct socket *so
 		err = -ENXIO;
 		if (!(sk->sk_state & PPPOX_CONNECTED))
 			break;
-
+		read_lock_bh(&pn->hash_lock);
+		err = -ENODEV;
+		if (po->pppoe_dev) {
+			mtu = po->pppoe_dev->mtu;
+			err = 0;
+		}
+		read_unlock_bh(&pn->hash_lock);
+		if (err)
+			break;
 		err = -EFAULT;
-		if (put_user(po->pppoe_dev->mtu -
-			     sizeof(struct pppoe_hdr) -
-			     PPP_HDRLEN,
+		if (put_user(mtu - sizeof(struct pppoe_hdr) - PPP_HDRLEN,
 			     (int __user *)arg))
 			break;
 		err = 0;
@@ -761,13 +788,21 @@ static int pppoe_ioctl(struct socket *so
 		if (!(sk->sk_state & PPPOX_CONNECTED))
 			break;
 
+		read_lock_bh(&pn->hash_lock);
+		err = -ENODEV;
+		if (po->pppoe_dev) {
+			mtu = po->pppoe_dev->mtu;
+			err = 0;
+		}
+		read_unlock_bh(&pn->hash_lock);
+		if (err)
+			break;
+
 		err = -EFAULT;
 		if (get_user(val, (int __user *)arg))
 			break;
 
-		if (val < (po->pppoe_dev->mtu
-			   - sizeof(struct pppoe_hdr)
-			   - PPP_HDRLEN))
+		if (val < (mtu - sizeof(struct pppoe_hdr) - PPP_HDRLEN))
 			err = 0;
 		else
 			err = -EINVAL;
@@ -839,10 +874,11 @@ static int pppoe_sendmsg(struct kiocb *i
 	struct sk_buff *skb;
 	struct sock *sk = sock->sk;
 	struct pppox_sock *po = pppox_sk(sk);
+	struct pppoe_net *pn = pppoe_pernet(sock_net(sk));
 	int error;
 	struct pppoe_hdr hdr;
 	struct pppoe_hdr *ph;
-	struct net_device *dev;
+	struct net_device *dev = NULL;
 	char *start;
 
 	lock_sock(sk);
@@ -856,18 +892,27 @@ static int pppoe_sendmsg(struct kiocb *i
 	hdr.code = 0;
 	hdr.sid = po->num;
 
-	dev = po->pppoe_dev;
+	read_lock_bh(&pn->hash_lock);
+	error = -ENODEV;
+	if (po->pppoe_dev) {
+		dev = po->pppoe_dev;
+		dev_hold(dev);
+		error = 0;
+	}
+	read_unlock_bh(&pn->hash_lock);
+	if (error)
+		goto end;
 
 	error = -EMSGSIZE;
 	if (total_len > (dev->mtu + dev->hard_header_len))
-		goto end;
+		goto end_put;
 
 
 	skb = sock_wmalloc(sk, total_len + dev->hard_header_len + 32,
 			   0, GFP_KERNEL);
 	if (!skb) {
 		error = -ENOMEM;
-		goto end;
+		goto end_put;
 	}
 
 	/* Reserve space for headers. */
@@ -885,7 +930,7 @@ static int pppoe_sendmsg(struct kiocb *i
 	error = memcpy_fromiovec(start, m->msg_iov, total_len);
 	if (error < 0) {
 		kfree_skb(skb);
-		goto end;
+		goto end_put;
 	}
 
 	error = total_len;
@@ -898,6 +943,8 @@ static int pppoe_sendmsg(struct kiocb *i
 
 	dev_queue_xmit(skb);
 
+end_put:
+	dev_put(dev);
 end:
 	release_sock(sk);
 	return error;
@@ -911,21 +958,28 @@ end:
 static int __pppoe_xmit(struct sock *sk, struct sk_buff *skb)
 {
 	struct pppox_sock *po = pppox_sk(sk);
-	struct net_device *dev = po->pppoe_dev;
+	struct pppoe_net *pn = pppoe_pernet(sock_net(sk));
+	struct net_device *dev;
 	struct pppoe_hdr *ph;
 	int data_len = skb->len;
 
-	if (sock_flag(sk, SOCK_DEAD) || !(sk->sk_state & PPPOX_CONNECTED))
+	read_lock_bh(&pn->hash_lock);
+	if (!po->pppoe_dev) {
+		read_unlock_bh(&pn->hash_lock);
 		goto abort;
+	}
+	dev = po->pppoe_dev;
+	dev_hold(dev);
+	read_unlock_bh(&pn->hash_lock);
 
-	if (!dev)
-		goto abort;
+	if (sock_flag(sk, SOCK_DEAD) || !(sk->sk_state & PPPOX_CONNECTED))
+		goto abort_put;
 
 	/* Copy the data if there is no space for the header or if it's
 	 * read-only.
 	 */
 	if (skb_cow_head(skb, sizeof(*ph) + dev->hard_header_len))
-		goto abort;
+		goto abort_put;
 
 	__skb_push(skb, sizeof(*ph));
 	skb_reset_network_header(skb);
@@ -944,9 +998,11 @@ static int __pppoe_xmit(struct sock *sk,
 			po->pppoe_pa.remote, NULL, data_len);
 
 	dev_queue_xmit(skb);
-
+	dev_put(dev);
 	return 1;
 
+abort_put:
+	dev_put(dev);
 abort:
 	kfree_skb(skb);
 	return 1;
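
The pattern the draft applies in pppoe_sendmsg() and __pppoe_xmit() (check the
device pointer under the lock, pin the device, drop the lock, and release the
pin on every exit path) can be sketched in userspace roughly as below. This is
an illustration under stated assumptions, not the driver code: toy_dev,
toy_sock, toy_flush() and toy_send() are hypothetical, a pthread mutex stands
in for hash_lock, and the refcount is a plain int changed under that mutex
rather than the kernel's dev_hold()/dev_put().

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

struct toy_dev {
	int refcnt;
	unsigned int mtu;
};

struct toy_sock {
	struct toy_dev *dev;	/* may be cleared concurrently by a flush */
};

pthread_mutex_t toy_hash_lock = PTHREAD_MUTEX_INITIALIZER;

/* Flush side: detach the device and drop the socket's own reference. */
void toy_flush(struct toy_sock *po)
{
	pthread_mutex_lock(&toy_hash_lock);
	if (po->dev) {
		po->dev->refcnt--;
		po->dev = NULL;
	}
	pthread_mutex_unlock(&toy_hash_lock);
}

/* Send side: 0 on success, -ENODEV if detached, -EMSGSIZE if too long. */
int toy_send(struct toy_sock *po, unsigned int len)
{
	struct toy_dev *dev = NULL;
	int error = -ENODEV;

	pthread_mutex_lock(&toy_hash_lock);
	if (po->dev) {
		dev = po->dev;
		dev->refcnt++;	/* pin before dropping the lock */
		error = 0;
	}
	pthread_mutex_unlock(&toy_hash_lock);
	if (error)
		goto end;	/* nothing was pinned, nothing to release */

	error = -EMSGSIZE;
	if (len > dev->mtu)
		goto end_put;
	error = 0;		/* the real code would build and queue the skb */

end_put:
	pthread_mutex_lock(&toy_hash_lock);
	assert(dev->refcnt > 0);
	dev->refcnt--;		/* release the pin on success and failure alike */
	pthread_mutex_unlock(&toy_hash_lock);
end:
	return error;
}
```

The two labels mirror the end_put/end split in the draft: paths that pinned
the device leave through end_put, while the path that found no device skips
the release.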


* Re: kernel panic in latest vanilla stable, while using nameif with "alive" pppoe interfaces
From: Denys Fedoryschenko @ 2009-10-20 14:20 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Michal Ostrowski, Eric Dumazet, netdev, linux-ppp, paulus,
	mostrows
In-Reply-To: <20091020135920.GB5181@lenovo>

It panics almost immediately on boot (even on old operations that were stable;
it seems to happen on the first pppoe customer login attempt). I will rebuild
the kernel and, if it is of interest, will try to get the panic message.

On Tuesday 20 October 2009 16:59:20 Cyrill Gorcunov wrote:
> [Denys Fedoryschenko - Tue, Oct 20, 2009 at 04:50:09PM +0300]
>
> ...
>
> | > Thanks Denys, I'm preparing new patch (just back from office
> | > and had no inet connection that is why reply is delayed, sorry).
> |
> | There is no problem at all.
> | This rename operation is just future operation and host is redundant, so
> | i can do tests on it anytime.
>
> ok, here is it, please try (it's still a draft version though)
>
> 	-- Cyrill
> ---
>  drivers/net/pppoe.c |  106 +++++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 81 insertions(+), 25 deletions(-)
>
> Index: linux-2.6.git/drivers/net/pppoe.c
> =====================================================================
> --- linux-2.6.git.orig/drivers/net/pppoe.c
> +++ linux-2.6.git/drivers/net/pppoe.c
> @@ -313,8 +313,8 @@ static void pppoe_flush_dev(struct net_d
>  			sk = sk_pppox(po);
>  			spin_lock(&flush_lock);
>  			po->pppoe_dev = NULL;
> -			spin_unlock(&flush_lock);
>  			dev_put(dev);
> +			spin_unlock(&flush_lock);
>
>  			/* We always grab the socket lock, followed by the
>  			 * hash_lock, in that order.  Since we should
> @@ -386,13 +386,21 @@ static struct notifier_block pppoe_notif
>  static int pppoe_rcv_core(struct sock *sk, struct sk_buff *skb)
>  {
>  	struct pppox_sock *po = pppox_sk(sk);
> -	struct pppox_sock *relay_po;
> +	struct pppox_sock *relay_po = NULL;
> +	struct net_device *dev = NULL;
>
>  	if (sk->sk_state & PPPOX_BOUND) {
>  		ppp_input(&po->chan, skb);
>  	} else if (sk->sk_state & PPPOX_RELAY) {
> -		relay_po = get_item_by_addr(dev_net(po->pppoe_dev),
> -						&po->pppoe_relay);
> +		struct pppoe_net *pn = pppoe_pernet(sock_net(sk));
> +		read_lock_bh(&pn->hash_lock);
> +		dev = po->pppoe_dev;
> +		if (dev) {
> +			dev_hold(dev);
> +			relay_po = get_item_by_addr(dev_net(dev),
> +					&po->pppoe_relay);
> +		}
> +		read_unlock_bh(&pn->hash_lock);
>  		if (relay_po == NULL)
>  			goto abort_kfree;
>
> @@ -401,6 +409,7 @@ static int pppoe_rcv_core(struct sock *s
>
>  		if (!__pppoe_xmit(sk_pppox(relay_po), skb))
>  			goto abort_put;
> +		dev_put(dev);
>  	} else {
>  		if (sock_queue_rcv_skb(sk, skb))
>  			goto abort_kfree;
> @@ -412,6 +421,8 @@ abort_put:
>  	sock_put(sk_pppox(relay_po));
>
>  abort_kfree:
> +	if (dev)
> +		dev_put(dev);
>  	kfree_skb(skb);
>  	return NET_RX_DROP;
>  }
> @@ -625,8 +636,8 @@ static int pppoe_connect(struct socket *
>  	struct sock *sk = sock->sk;
>  	struct sockaddr_pppox *sp = (struct sockaddr_pppox *)uservaddr;
>  	struct pppox_sock *po = pppox_sk(sk);
> -	struct net_device *dev;
> -	struct pppoe_net *pn;
> +	struct net_device *dev = NULL;
> +	struct pppoe_net *pn = NULL;
>  	int error;
>
>  	lock_sock(sk);
> @@ -652,12 +663,15 @@ static int pppoe_connect(struct socket *
>  	/* Delete the old binding */
>  	if (stage_session(po->pppoe_pa.sid)) {
>  		pppox_unbind_sock(sk);
> +		spin_lock(&flush_lock);
>  		if (po->pppoe_dev) {
>  			pn = pppoe_pernet(dev_net(po->pppoe_dev));
>  			delete_item(pn, po->pppoe_pa.sid,
>  				po->pppoe_pa.remote, po->pppoe_ifindex);
>  			dev_put(po->pppoe_dev);
> +			po->pppoe_dev = NULL;
>  		}
> +		spin_unlock(&flush_lock);
>  		memset(sk_pppox(po) + 1, 0,
>  		       sizeof(struct pppox_sock) - sizeof(struct sock));
>  		sk->sk_state = PPPOX_NONE;
> @@ -670,10 +684,11 @@ static int pppoe_connect(struct socket *
>  		if (!dev)
>  			goto end;
>
> +		write_lock_bh(&pn->hash_lock);
> +		dev_hold(dev);
>  		po->pppoe_dev = dev;
>  		po->pppoe_ifindex = dev->ifindex;
>  		pn = pppoe_pernet(dev_net(dev));
> -		write_lock_bh(&pn->hash_lock);
>  		if (!(dev->flags & IFF_UP)) {
>  			write_unlock_bh(&pn->hash_lock);
>  			goto err_put;
> @@ -700,6 +715,7 @@ static int pppoe_connect(struct socket *
>  			goto err_put;
>
>  		sk->sk_state = PPPOX_CONNECTED;
> +		dev_put(dev);
>  	}
>
>  	po->num = sp->sa_addr.pppoe.sid;
> @@ -708,10 +724,13 @@ end:
>  	release_sock(sk);
>  	return error;
>  err_put:
> +	dev_put(dev);
> +	write_lock_bh(&pn->hash_lock);
>  	if (po->pppoe_dev) {
>  		dev_put(po->pppoe_dev);
>  		po->pppoe_dev = NULL;
>  	}
> +	write_unlock_bh(&pn->hash_lock);
>  	goto end;
>  }
>
> @@ -738,6 +757,8 @@ static int pppoe_ioctl(struct socket *so
>  {
>  	struct sock *sk = sock->sk;
>  	struct pppox_sock *po = pppox_sk(sk);
> +	struct pppoe_net *pn = pppoe_pernet(sock_net(sk));
> +	unsigned int mtu = 0;
>  	int val;
>  	int err;
>
> @@ -746,11 +767,17 @@ static int pppoe_ioctl(struct socket *so
>  		err = -ENXIO;
>  		if (!(sk->sk_state & PPPOX_CONNECTED))
>  			break;
> -
> +		read_lock_bh(&pn->hash_lock);
> +		err = -ENODEV;
> +		if (po->pppoe_dev) {
> +			mtu = po->pppoe_dev->mtu;
> +			err = 0;
> +		}
> +		read_unlock_bh(&pn->hash_lock);
> +		if (err)
> +			break;
>  		err = -EFAULT;
> -		if (put_user(po->pppoe_dev->mtu -
> -			     sizeof(struct pppoe_hdr) -
> -			     PPP_HDRLEN,
> +		if (put_user(mtu - sizeof(struct pppoe_hdr) - PPP_HDRLEN,
>  			     (int __user *)arg))
>  			break;
>  		err = 0;
> @@ -761,13 +788,21 @@ static int pppoe_ioctl(struct socket *so
>  		if (!(sk->sk_state & PPPOX_CONNECTED))
>  			break;
>
> +		read_lock_bh(&pn->hash_lock);
> +		err = -ENODEV;
> +		if (po->pppoe_dev) {
> +			mtu = po->pppoe_dev->mtu;
> +			err = 0;
> +		}
> +		read_unlock_bh(&pn->hash_lock);
> +		if (err)
> +			break;
> +
>  		err = -EFAULT;
>  		if (get_user(val, (int __user *)arg))
>  			break;
>
> -		if (val < (po->pppoe_dev->mtu
> -			   - sizeof(struct pppoe_hdr)
> -			   - PPP_HDRLEN))
> +		if (val < (mtu - sizeof(struct pppoe_hdr) - PPP_HDRLEN))
>  			err = 0;
>  		else
>  			err = -EINVAL;
> @@ -839,10 +874,11 @@ static int pppoe_sendmsg(struct kiocb *i
>  	struct sk_buff *skb;
>  	struct sock *sk = sock->sk;
>  	struct pppox_sock *po = pppox_sk(sk);
> +	struct pppoe_net *pn = pppoe_pernet(sock_net(sk));
>  	int error;
>  	struct pppoe_hdr hdr;
>  	struct pppoe_hdr *ph;
> -	struct net_device *dev;
> +	struct net_device *dev = NULL;
>  	char *start;
>
>  	lock_sock(sk);
> @@ -856,18 +892,27 @@ static int pppoe_sendmsg(struct kiocb *i
>  	hdr.code = 0;
>  	hdr.sid = po->num;
>
> -	dev = po->pppoe_dev;
> +	read_lock_bh(&pn->hash_lock);
> +	error = -ENODEV;
> +	if (po->pppoe_dev) {
> +		dev = po->pppoe_dev;
> +		dev_hold(dev);
> +		error = 0;
> +	}
> +	read_unlock_bh(&pn->hash_lock);
> +	if (error)
> +		goto end;
>
>  	error = -EMSGSIZE;
>  	if (total_len > (dev->mtu + dev->hard_header_len))
> -		goto end;
> +		goto end_put;
>
>
>  	skb = sock_wmalloc(sk, total_len + dev->hard_header_len + 32,
>  			   0, GFP_KERNEL);
>  	if (!skb) {
>  		error = -ENOMEM;
> -		goto end;
> +		goto end_put;
>  	}
>
>  	/* Reserve space for headers. */
> @@ -885,7 +930,7 @@ static int pppoe_sendmsg(struct kiocb *i
>  	error = memcpy_fromiovec(start, m->msg_iov, total_len);
>  	if (error < 0) {
>  		kfree_skb(skb);
> -		goto end;
> +		goto end_put;
>  	}
>
>  	error = total_len;
> @@ -898,6 +943,8 @@ static int pppoe_sendmsg(struct kiocb *i
>
>  	dev_queue_xmit(skb);
>
> +end_put:
> +	dev_put(dev);
>  end:
>  	release_sock(sk);
>  	return error;
> @@ -911,21 +958,28 @@ end:
>  static int __pppoe_xmit(struct sock *sk, struct sk_buff *skb)
>  {
>  	struct pppox_sock *po = pppox_sk(sk);
> -	struct net_device *dev = po->pppoe_dev;
> +	struct pppoe_net *pn = pppoe_pernet(sock_net(sk));
> +	struct net_device *dev;
>  	struct pppoe_hdr *ph;
>  	int data_len = skb->len;
>
> -	if (sock_flag(sk, SOCK_DEAD) || !(sk->sk_state & PPPOX_CONNECTED))
> +	read_lock_bh(&pn->hash_lock);
> +	if (!po->pppoe_dev) {
> +		read_unlock_bh(&pn->hash_lock);
>  		goto abort;
> +	}
> +	dev = po->pppoe_dev;
> +	dev_hold(dev);
> +	read_unlock_bh(&pn->hash_lock);
>
> -	if (!dev)
> -		goto abort;
> +	if (sock_flag(sk, SOCK_DEAD) || !(sk->sk_state & PPPOX_CONNECTED))
> +		goto abort_put;
>
>  	/* Copy the data if there is no space for the header or if it's
>  	 * read-only.
>  	 */
>  	if (skb_cow_head(skb, sizeof(*ph) + dev->hard_header_len))
> -		goto abort;
> +		goto abort_put;
>
>  	__skb_push(skb, sizeof(*ph));
>  	skb_reset_network_header(skb);
> @@ -944,9 +998,11 @@ static int __pppoe_xmit(struct sock *sk,
>  			po->pppoe_pa.remote, NULL, data_len);
>
>  	dev_queue_xmit(skb);
> -
> +	dev_put(dev);
>  	return 1;
>
> +abort_put:
> +	dev_put(dev);
>  abort:
>  	kfree_skb(skb);
>  	return 1;




* Re: kernel panic in latest vanilla stable, while using nameif with "alive" pppoe interfaces
From: Cyrill Gorcunov @ 2009-10-20 14:23 UTC (permalink / raw)
  To: Denys Fedoryschenko
  Cc: Michal Ostrowski, Eric Dumazet, netdev, linux-ppp, paulus,
	mostrows
In-Reply-To: <200910201720.00473.denys@visp.net.lb>

[Denys Fedoryschenko - Tue, Oct 20, 2009 at 05:20:00PM +0300]
|
| It panics almost immediately on boot(even on old operations  that was stable, 
| seems on first pppoe customer login attempt), i will rebuild kernel and if 
| interesting will try to get panic message.
| 
...

OK, thanks. I'll continue digging.


* [PATCH RFC] Per route TCP options
From: Gilad Ben-Yossef @ 2009-10-20 15:22 UTC (permalink / raw)
  To: netdev; +Cc: ori, Gilad Ben-Yossef

Turn the global sysctls that allow disabling TCP SACK, DSACK,
timestamp and window scale into per-route-entry feature options,
laying the groundwork for future removal of the relevant global sysctls.

You really only want to disable SACK, DSACK, timestamps or window
scaling if you've got a piece of broken networking equipment somewhere,
as a stopgap until you can bring a big enough hammer to deal with
the broken equipment. It doesn't make sense to "punish" all the
connections going through the machine to destinations unrelated
to the broken equipment.

This is doubly true when you're dealing with network containers
used to isolate several virtual domains.

The per-route options are implemented in free bits of the route entry's
features property, some of which were already reserved by name for these
options, so this does not inflate any structure, and I expect that
once the appropriate global sysctls are removed the overall code
base will be smaller.

Tested on x86 using Qemu/KVM.  

Will send the matching patch to iproute2 if/when this is ACKed or
if someone wants to test this.

Patchset based on original work by Ori Finkelman and Yoni Amit 
from ComSleep Ltd.

Gilad Ben-Yossef (8):
  Only parse time stamp TCP option in time wait sock
  Allow tcp_parse_options to consult dst entry
  Infrastructure for querying route entry features
  Add the no SACK route option feature
  Allow disabling TCP timestamp options per route
  Allow to turn off TCP window scale opt per route
  Allow disabling of DSACK TCP option per route
  Document future removal of sysctl_tcp_* options

 Documentation/feature-removal-schedule.txt |   12 ++++++++++++
 include/linux/rtnetlink.h                  |    6 ++++--
 include/net/dst.h                          |    8 +++++++-
 include/net/tcp.h                          |    3 ++-
 net/ipv4/syncookies.c                      |   27 ++++++++++++++-------------
 net/ipv4/tcp_input.c                       |   26 ++++++++++++++++++--------
 net/ipv4/tcp_ipv4.c                        |   19 ++++++++++---------
 net/ipv4/tcp_minisocks.c                   |    8 +++++---
 net/ipv4/tcp_output.c                      |   18 +++++++++++++-----
 net/ipv6/syncookies.c                      |   28 +++++++++++++++-------------
 net/ipv6/tcp_ipv6.c                        |    3 ++-
 11 files changed, 102 insertions(+), 56 deletions(-)
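
The idea of per-route feature bits from the cover letter above can be sketched
as follows. This is a toy model under assumptions, not the patch's code: the
TOY_FEATURE_* values, struct toy_route, toy_route_feature() and
toy_sack_allowed() are hypothetical, and combining the global knob with the
per-route bit by a logical AND is an assumption about how the two would
coexist while the sysctls still exist.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature bits; the real bit assignments may differ. */
enum {
	TOY_FEATURE_NO_SACK   = 1u << 0,
	TOY_FEATURE_NO_TSTAMP = 1u << 1,
	TOY_FEATURE_NO_WSCALE = 1u << 2,
	TOY_FEATURE_NO_DSACK  = 1u << 3,
};

/* Hypothetical route entry carrying a single bitmask of features. */
struct toy_route {
	uint32_t features;
};

/* Nonzero if the route has the given bit set. A NULL route means there is
 * no per-route override at all. */
int toy_route_feature(const struct toy_route *rt, uint32_t feature)
{
	return rt != NULL && (rt->features & feature) != 0;
}

/* SACK is offered only if the global knob allows it and the route toward
 * the peer does not veto it. */
int toy_sack_allowed(int sysctl_sack, const struct toy_route *rt)
{
	return sysctl_sack && !toy_route_feature(rt, TOY_FEATURE_NO_SACK);
}
```

Because the options live in an existing bitmask, marking one destination as
broken only sets a bit on that route and leaves every other connection
through the machine untouched.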



* [PATCH RFC] Allow tcp_parse_options to consult dst entry
From: Gilad Ben-Yossef @ 2009-10-20 15:22 UTC (permalink / raw)
  To: netdev; +Cc: ori, Gilad Ben-Yossef
In-Reply-To: <1256052161-14156-2-git-send-email-gilad@codefidence.com>

We need tcp_parse_options() to be aware of the dst_entry so that it
can take per-dst_entry TCP option settings into account.

Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Signed-off-by: Ori Finkelman <ori@comsleep.com>
Signed-off-by: Yony Amit <yony@comsleep.com>

---
 include/net/tcp.h        |    3 ++-
 net/ipv4/syncookies.c    |   27 ++++++++++++++-------------
 net/ipv4/tcp_input.c     |    9 ++++++---
 net/ipv4/tcp_ipv4.c      |   19 ++++++++++---------
 net/ipv4/tcp_minisocks.c |    7 +++++--
 net/ipv6/syncookies.c    |   28 +++++++++++++++-------------
 net/ipv6/tcp_ipv6.c      |    3 ++-
 7 files changed, 54 insertions(+), 42 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 03a49c7..740d09b 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -409,7 +409,8 @@ extern int			tcp_recvmsg(struct kiocb *iocb, struct sock *sk,
 
 extern void			tcp_parse_options(struct sk_buff *skb,
 						  struct tcp_options_received *opt_rx,
-						  int estab);
+						  int estab,
+						  struct dst_entry *dst);
 
 extern u8			*tcp_parse_md5sig_option(struct tcphdr *th);
 
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index a6e0e07..4990dd4 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -276,13 +276,6 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
 
 	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_SYNCOOKIESRECV);
 
-	/* check for timestamp cookie support */
-	memset(&tcp_opt, 0, sizeof(tcp_opt));
-	tcp_parse_options(skb, &tcp_opt, 0);
-
-	if (tcp_opt.saw_tstamp)
-		cookie_check_timestamp(&tcp_opt);
-
 	ret = NULL;
 	req = inet_reqsk_alloc(&tcp_request_sock_ops); /* for safety */
 	if (!req)
@@ -298,12 +291,6 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
 	ireq->loc_addr		= ip_hdr(skb)->daddr;
 	ireq->rmt_addr		= ip_hdr(skb)->saddr;
 	ireq->ecn_ok		= 0;
-	ireq->snd_wscale	= tcp_opt.snd_wscale;
-	ireq->rcv_wscale	= tcp_opt.rcv_wscale;
-	ireq->sack_ok		= tcp_opt.sack_ok;
-	ireq->wscale_ok		= tcp_opt.wscale_ok;
-	ireq->tstamp_ok		= tcp_opt.saw_tstamp;
-	req->ts_recent		= tcp_opt.saw_tstamp ? tcp_opt.rcv_tsval : 0;
 
 	/* We throwed the options of the initial SYN away, so we hope
 	 * the ACK carries the same options again (see RFC1122 4.2.3.8)
@@ -351,6 +338,20 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
 		}
 	}
 
+	/* check for timestamp cookie support */
+	memset(&tcp_opt, 0, sizeof(tcp_opt));
+	tcp_parse_options(skb, &tcp_opt, 0, &rt->u.dst);
+
+	if (tcp_opt.saw_tstamp)
+		cookie_check_timestamp(&tcp_opt);
+
+	ireq->snd_wscale        = tcp_opt.snd_wscale;
+	ireq->rcv_wscale        = tcp_opt.rcv_wscale;
+	ireq->sack_ok           = tcp_opt.sack_ok;
+	ireq->wscale_ok         = tcp_opt.wscale_ok;
+	ireq->tstamp_ok         = tcp_opt.saw_tstamp;
+	req->ts_recent          = tcp_opt.saw_tstamp ? tcp_opt.rcv_tsval : 0;
+
 	/* Try to redo what tcp_v4_send_synack did. */
 	req->window_clamp = tp->window_clamp ? :dst_metric(&rt->u.dst, RTAX_WINDOW);
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index d86784b..d502f49 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3698,12 +3698,14 @@ old_ack:
  * the fast version below fails.
  */
 void tcp_parse_options(struct sk_buff *skb, struct tcp_options_received *opt_rx,
-		       int estab)
+		       int estab,  struct dst_entry *dst)
 {
 	unsigned char *ptr;
 	struct tcphdr *th = tcp_hdr(skb);
 	int length = (th->doff * 4) - sizeof(struct tcphdr);
 
+	BUG_ON(!estab && !dst);
+
 	ptr = (unsigned char *)(th + 1);
 	opt_rx->saw_tstamp = 0;
 
@@ -3820,7 +3822,7 @@ static int tcp_fast_parse_options(struct sk_buff *skb, struct tcphdr *th,
 		if (tcp_parse_aligned_timestamp(tp, th))
 			return 1;
 	}
-	tcp_parse_options(skb, &tp->rx_opt, 1);
+	tcp_parse_options(skb, &tp->rx_opt, 1, NULL);
 	return 1;
 }
 
@@ -5364,8 +5366,9 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	int saved_clamp = tp->rx_opt.mss_clamp;
+	struct dst_entry *dst = __sk_dst_get(sk);
 
-	tcp_parse_options(skb, &tp->rx_opt, 0);
+	tcp_parse_options(skb, &tp->rx_opt, 0, dst);
 
 	if (th->ack) {
 		/* rfc793:
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 7cda24b..1cb0ec4 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1256,11 +1256,18 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 	tcp_rsk(req)->af_specific = &tcp_request_sock_ipv4_ops;
 #endif
 
+	ireq = inet_rsk(req);
+	ireq->loc_addr = daddr;
+	ireq->rmt_addr = saddr;
+	ireq->no_srccheck = inet_sk(sk)->transparent;
+	ireq->opt = tcp_v4_save_options(sk, skb);
+
+	dst = inet_csk_route_req(sk, req);
 	tcp_clear_options(&tmp_opt);
 	tmp_opt.mss_clamp = 536;
 	tmp_opt.user_mss  = tcp_sk(sk)->rx_opt.user_mss;
 
-	tcp_parse_options(skb, &tmp_opt, 0);
+	tcp_parse_options(skb, &tmp_opt, 0, dst);
 
 	if (want_cookie && !tmp_opt.saw_tstamp)
 		tcp_clear_options(&tmp_opt);
@@ -1269,14 +1276,8 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 
 	tcp_openreq_init(req, &tmp_opt, skb);
 
-	ireq = inet_rsk(req);
-	ireq->loc_addr = daddr;
-	ireq->rmt_addr = saddr;
-	ireq->no_srccheck = inet_sk(sk)->transparent;
-	ireq->opt = tcp_v4_save_options(sk, skb);
-
 	if (security_inet_conn_request(sk, skb, req))
-		goto drop_and_free;
+		goto drop_and_release;
 
 	if (!want_cookie)
 		TCP_ECN_create_request(req, tcp_hdr(skb));
@@ -1301,7 +1302,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 		 */
 		if (tmp_opt.saw_tstamp &&
 		    tcp_death_row.sysctl_tw_recycle &&
-		    (dst = inet_csk_route_req(sk, req)) != NULL &&
+		    dst != NULL &&
 		    (peer = rt_get_peer((struct rtable *)dst)) != NULL &&
 		    peer->v4daddr == saddr) {
 			if (get_seconds() < peer->tcp_ts_stamp + TCP_PAWS_MSL &&
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index c49a550..70ff955 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -101,7 +101,7 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
 	int paws_reject = 0;
 
 	if (th->doff > (sizeof(*th) >> 2) && tcptw->tw_ts_recent_stamp) {
-		tcp_parse_options(skb, &tmp_opt, 1);
+		tcp_parse_options(skb, &tmp_opt, 1, NULL);
 
 		if (tmp_opt.saw_tstamp) {
 			tmp_opt.ts_recent	= tcptw->tw_ts_recent;
@@ -499,10 +499,11 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 	int paws_reject = 0;
 	struct tcp_options_received tmp_opt;
 	struct sock *child;
+	struct dst_entry *dst = inet_csk_route_req(sk, req);
 
 	tmp_opt.saw_tstamp = 0;
 	if (th->doff > (sizeof(struct tcphdr)>>2)) {
-		tcp_parse_options(skb, &tmp_opt, 0);
+		tcp_parse_options(skb, &tmp_opt, 0, dst);
 
 		if (tmp_opt.saw_tstamp) {
 			tmp_opt.ts_recent = req->ts_recent;
@@ -515,6 +516,8 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 		}
 	}
 
+	dst_release(dst);
+
 	/* Check for pure retransmitted SYN. */
 	if (TCP_SKB_CB(skb)->seq == tcp_rsk(req)->rcv_isn &&
 	    flg == TCP_FLAG_SYN &&
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index 6b6ae91..6ece408 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -184,13 +184,6 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 
 	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_SYNCOOKIESRECV);
 
-	/* check for timestamp cookie support */
-	memset(&tcp_opt, 0, sizeof(tcp_opt));
-	tcp_parse_options(skb, &tcp_opt, 0);
-
-	if (tcp_opt.saw_tstamp)
-		cookie_check_timestamp(&tcp_opt);
-
 	ret = NULL;
 	req = inet6_reqsk_alloc(&tcp6_request_sock_ops);
 	if (!req)
@@ -224,12 +217,6 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 	req->expires = 0UL;
 	req->retrans = 0;
 	ireq->ecn_ok		= 0;
-	ireq->snd_wscale	= tcp_opt.snd_wscale;
-	ireq->rcv_wscale	= tcp_opt.rcv_wscale;
-	ireq->sack_ok		= tcp_opt.sack_ok;
-	ireq->wscale_ok		= tcp_opt.wscale_ok;
-	ireq->tstamp_ok		= tcp_opt.saw_tstamp;
-	req->ts_recent		= tcp_opt.saw_tstamp ? tcp_opt.rcv_tsval : 0;
 	treq->rcv_isn = ntohl(th->seq) - 1;
 	treq->snt_isn = cookie;
 
@@ -264,6 +251,21 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 			goto out_free;
 	}
 
+	/* check for timestamp cookie support */
+	memset(&tcp_opt, 0, sizeof(tcp_opt));
+	tcp_parse_options(skb, &tcp_opt, 0, dst);
+
+	if (tcp_opt.saw_tstamp)
+		cookie_check_timestamp(&tcp_opt);
+
+	req->ts_recent          = tcp_opt.saw_tstamp ? tcp_opt.rcv_tsval : 0;
+
+	ireq->snd_wscale        = tcp_opt.snd_wscale;
+	ireq->rcv_wscale        = tcp_opt.rcv_wscale;
+	ireq->sack_ok           = tcp_opt.sack_ok;
+	ireq->wscale_ok         = tcp_opt.wscale_ok;
+	ireq->tstamp_ok         = tcp_opt.saw_tstamp;
+
 	req->window_clamp = tp->window_clamp ? :dst_metric(dst, RTAX_WINDOW);
 	tcp_select_initial_window(tcp_full_space(sk), req->mss,
 				  &req->rcv_wnd, &req->window_clamp,
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 21d100b..2eebab5 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1165,6 +1165,7 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct request_sock *req = NULL;
 	__u32 isn = TCP_SKB_CB(skb)->when;
+	struct dst_entry *dst = __sk_dst_get(sk);
 #ifdef CONFIG_SYN_COOKIES
 	int want_cookie = 0;
 #else
@@ -1203,7 +1204,7 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 	tmp_opt.mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) - sizeof(struct ipv6hdr);
 	tmp_opt.user_mss = tp->rx_opt.user_mss;
 
-	tcp_parse_options(skb, &tmp_opt, 0);
+	tcp_parse_options(skb, &tmp_opt, 0, dst);
 
 	if (want_cookie && !tmp_opt.saw_tstamp)
 		tcp_clear_options(&tmp_opt);
-- 
1.5.6.3

