* [RFC] [PATCH 0/5] net: Implement fast TX queue selection
@ 2009-10-15  5:56 Krishna Kumar
  2009-10-15  5:56 ` [RFC] [PATCH 1/5] net: Introduce sk_tx_queue_mapping Krishna Kumar
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: Krishna Kumar @ 2009-10-15  5:56 UTC (permalink / raw)
  To: davem; +Cc: netdev, herbert, Krishna Kumar, dada1

From: Krishna Kumar <krkumar2@in.ibm.com>

Multiqueue cards on routers/firewalls set skb->queue_mapping on
input, which helps in faster xmit. Implement fast queue selection
for locally generated packets too, by saving the txq# for
connected sockets (in dev_pick_tx) and using it in subsequent
iterations. Locally generated packets for a connection will xmit
on the same txq, but routing & firewall loads should not be
affected by this patch. Tests show that the distribution across
txq's for 1-4 netperf sessions is similar to the existing code.
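
In short, dev_pick_tx() becomes (condensed from patch 2/5 below):

	if (sk_tx_queue_recorded(sk)) {
		/* fast path: reuse the txq cached on the socket */
		queue_index = sk_get_tx_queue(sk);
	} else if (ops->ndo_select_queue) {
		queue_index = ops->ndo_select_queue(dev, skb);
	} else {
		/* 0 for single-queue devices, else hash */
		queue_index = skb_tx_hash(dev, skb);
		if (sk && sk->sk_dst_cache)	/* connected socket */
			sk_record_tx_queue(sk, queue_index);
	}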


                   Testing & results:
                   ------------------

1. Cycles/Iter (C/I) used by dev_pick_tx:
         (B -> Billion,   M -> Million)
   |--------------|------------------------|------------------------|
   |              |          ORG           |          NEW           |
   |  Test        |--------|---------|-----|--------|---------|-----|
   |              | Cycles |  Iters  | C/I | Cycles | Iters   | C/I |
   |--------------|--------|---------|-----|--------|---------|-----|
   | [TCP_STREAM, | 3.98 B | 12.47 M | 320 | 1.95 B | 12.92 M | 152 |
   |  UDP_STREAM, |        |         |     |        |         |     |
   |  TCP_RR,     |        |         |     |        |         |     |
   |  UDP_RR]     |        |         |     |        |         |     |
   |--------------|--------|---------|-----|--------|---------|-----|
   | [TCP_STREAM, | 8.92 B | 29.66 M | 300 | 3.82 B | 38.88 M | 98  |
   |  TCP_RR,     |        |         |     |        |         |     |
   |  UDP_RR]     |        |         |     |        |         |     |
   |--------------|--------|---------|-----|--------|---------|-----|

2. Stress test (over 48 hours): 1000 netperfs running a combination
   of TCP_STREAM/RR and UDP_STREAM/RR (v4/v6, with and without NODELAY
   for all tests), plus some ssh sessions, reboots, modprobe -r of the
   driver, etc.

3. Performance test (10 hours): A single 10-hour netperf run of
   TCP_STREAM/RR, TCP_STREAM + NO_DELAY and UDP_RR. Results show an
   improvement in both performance and CPU utilization.

Tested on a 4-processor AMD Opteron 2.8 GHz system with 1GB memory and
a 10G Chelsio card. Each BW number is the sum of 3 iterations of
individual tests using 512, 16K, 64K & 128K I/O sizes, in Mb/s. SD is
the corresponding netperf service demand (lower is better); (%) is the
change relative to Org:

------------------------  TCP Tests  -----------------------
#procs  Org BW     New BW (%)     Org SD     New SD (%)
------------------------------------------------------------
1       77777.7    81011.0 (4.15)    42.3     40.2 (-5.11)
4       91599.2    91878.8 (.30)    955.9    919.3 (-3.83)
6       89533.3    91792.2 (2.52)  2262.0   2143.0 (-5.25)
8       87507.5    89161.9 (1.89)  4363.4   4073.6 (-6.64)
10      85152.4    85607.8 (.53)   6890.4   6851.2 (-.56)
------------------------------------------------------------

------------------------- TCP NO_DELAY Tests ---------------
#procs  Org BW     New BW (%)      Org SD      New SD (%)
------------------------------------------------------------
1       57001.9    57888.0 (1.55)     67.7      70.2 (3.75)
4       69555.1    69957.4 (.57)     823.0     834.3 (1.36)
6       71359.3    71918.7 (.78)    1740.8    1724.5 (-.93)
8       72577.6    72496.1 (-.11)   2955.4    2937.7 (-.59)
10      70829.6    71444.2 (.86)    4826.1    4673.4 (-3.16)
------------------------------------------------------------

----------------------- Request Response Tests --------------------
#procs  Org TPS     New TPS (%)      Org SD    New SD (%)
(1-10)
-------------------------------------------------------------------
TCP     1019245.9   1042626.4 (2.29) 16352.9   16459.8 (.65)
UDP     934598.64   942956.9  (.89)  11607.3   11593.2 (-.12)
-------------------------------------------------------------------

Thanks,

- KK

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---


* [RFC] [PATCH 1/5] net: Introduce sk_tx_queue_mapping
  2009-10-15  5:56 [RFC] [PATCH 0/5] net: Implement fast TX queue selection Krishna Kumar
@ 2009-10-15  5:56 ` Krishna Kumar
  2009-10-15 10:32   ` Eric Dumazet
  2009-10-15  5:57 ` [RFC] [PATCH 2/5] net: Use sk_tx_queue_mapping for connected sockets Krishna Kumar
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 10+ messages in thread
From: Krishna Kumar @ 2009-10-15  5:56 UTC (permalink / raw)
  To: davem; +Cc: netdev, herbert, Krishna Kumar, dada1

From: Krishna Kumar <krkumar2@in.ibm.com>

Introduce sk_tx_queue_mapping, and functions that set, test and get
this value. Reset sk_tx_queue_mapping to -1 whenever the dst cache
is set or reset, and in socket alloc & free (free probably doesn't
need it).

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 include/net/sock.h |   21 +++++++++++++++++++++
 net/core/sock.c    |    7 ++++++-
 2 files changed, 27 insertions(+), 1 deletion(-)

diff -ruNp org/include/net/sock.h new/include/net/sock.h
--- org/include/net/sock.h	2009-10-14 10:36:52.000000000 +0530
+++ new/include/net/sock.h	2009-10-14 17:59:44.000000000 +0530
@@ -107,6 +107,7 @@ struct net;
  *	@skc_node: main hash linkage for various protocol lookup tables
  *	@skc_nulls_node: main hash linkage for UDP/UDP-Lite protocol
  *	@skc_refcnt: reference count
+ *	@skc_tx_queue_mapping: tx queue number for this connection
  *	@skc_hash: hash value used with various protocol lookup tables
  *	@skc_family: network address family
  *	@skc_state: Connection state
@@ -128,6 +129,7 @@ struct sock_common {
 		struct hlist_nulls_node skc_nulls_node;
 	};
 	atomic_t		skc_refcnt;
+	int			skc_tx_queue_mapping;
 
 	unsigned int		skc_hash;
 	unsigned short		skc_family;
@@ -215,6 +217,7 @@ struct sock {
 #define sk_node			__sk_common.skc_node
 #define sk_nulls_node		__sk_common.skc_nulls_node
 #define sk_refcnt		__sk_common.skc_refcnt
+#define sk_tx_queue_mapping	__sk_common.skc_tx_queue_mapping
 
 #define sk_copy_start		__sk_common.skc_hash
 #define sk_hash			__sk_common.skc_hash
@@ -1094,8 +1097,24 @@ static inline void sock_put(struct sock 
 extern int sk_receive_skb(struct sock *sk, struct sk_buff *skb,
 			  const int nested);
 
+static inline void sk_record_tx_queue(struct sock *sk, int tx_queue)
+{
+	sk->sk_tx_queue_mapping = tx_queue;
+}
+
+static inline int sk_get_tx_queue(const struct sock *sk)
+{
+	return sk->sk_tx_queue_mapping;
+}
+
+static inline bool sk_tx_queue_recorded(const struct sock *sk)
+{
+	return (sk && sk->sk_tx_queue_mapping >= 0);
+}
+
 static inline void sk_set_socket(struct sock *sk, struct socket *sock)
 {
+	sk_record_tx_queue(sk, -1);
 	sk->sk_socket = sock;
 }
 
@@ -1152,6 +1171,7 @@ __sk_dst_set(struct sock *sk, struct dst
 {
 	struct dst_entry *old_dst;
 
+	sk_record_tx_queue(sk, -1);
 	old_dst = sk->sk_dst_cache;
 	sk->sk_dst_cache = dst;
 	dst_release(old_dst);
@@ -1170,6 +1190,7 @@ __sk_dst_reset(struct sock *sk)
 {
 	struct dst_entry *old_dst;
 
+	sk_record_tx_queue(sk, -1);
 	old_dst = sk->sk_dst_cache;
 	sk->sk_dst_cache = NULL;
 	dst_release(old_dst);
diff -ruNp org/net/core/sock.c new/net/core/sock.c
--- org/net/core/sock.c	2009-10-14 10:36:52.000000000 +0530
+++ new/net/core/sock.c	2009-10-14 17:59:46.000000000 +0530
@@ -358,6 +358,7 @@ struct dst_entry *__sk_dst_check(struct 
 	struct dst_entry *dst = sk->sk_dst_cache;
 
 	if (dst && dst->obsolete && dst->ops->check(dst, cookie) == NULL) {
+		sk_record_tx_queue(sk, -1);
 		sk->sk_dst_cache = NULL;
 		dst_release(dst);
 		return NULL;
@@ -954,7 +955,8 @@ static void sock_copy(struct sock *nsk, 
 	void *sptr = nsk->sk_security;
 #endif
 	BUILD_BUG_ON(offsetof(struct sock, sk_copy_start) !=
-		     sizeof(osk->sk_node) + sizeof(osk->sk_refcnt));
+		     sizeof(osk->sk_node) + sizeof(osk->sk_refcnt) +
+		     sizeof(osk->sk_tx_queue_mapping));
 	memcpy(&nsk->sk_copy_start, &osk->sk_copy_start,
 	       osk->sk_prot->obj_size - offsetof(struct sock, sk_copy_start));
 #ifdef CONFIG_SECURITY_NETWORK
@@ -998,6 +1000,7 @@ static struct sock *sk_prot_alloc(struct
 
 		if (!try_module_get(prot->owner))
 			goto out_free_sec;
+		sk_record_tx_queue(sk, -1);
 	}
 
 	return sk;
@@ -1017,6 +1020,8 @@ static void sk_prot_free(struct proto *p
 	struct kmem_cache *slab;
 	struct module *owner;
 
+	sk_record_tx_queue(sk, -1);
+
 	owner = prot->owner;
 	slab = prot->slab;
 


* [RFC] [PATCH 2/5] net: Use sk_tx_queue_mapping for connected sockets
  2009-10-15  5:56 [RFC] [PATCH 0/5] net: Implement fast TX queue selection Krishna Kumar
  2009-10-15  5:56 ` [RFC] [PATCH 1/5] net: Introduce sk_tx_queue_mapping Krishna Kumar
@ 2009-10-15  5:57 ` Krishna Kumar
  2009-10-15  9:41   ` Eric Dumazet
  2009-10-15  5:57 ` [RFC] [PATCH 3/5] net: IPv6 changes Krishna Kumar
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 10+ messages in thread
From: Krishna Kumar @ 2009-10-15  5:57 UTC (permalink / raw)
  To: davem; +Cc: netdev, herbert, Krishna Kumar, dada1

From: Krishna Kumar <krkumar2@in.ibm.com>

For connected sockets, the first run of dev_pick_tx saves the
calculated txq in sk_tx_queue_mapping. The txq is not saved if
the skb's rx queue was recorded, or if the device has a queue
select handler. Subsequent runs of dev_pick_tx use the cached
value of sk_tx_queue_mapping.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 net/core/dev.c |   24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

diff -ruNp org/net/core/dev.c new/net/core/dev.c
--- org/net/core/dev.c	2009-10-14 17:59:40.000000000 +0530
+++ new/net/core/dev.c	2009-10-14 18:00:04.000000000 +0530
@@ -1791,13 +1791,25 @@ EXPORT_SYMBOL(skb_tx_hash);
 static struct netdev_queue *dev_pick_tx(struct net_device *dev,
 					struct sk_buff *skb)
 {
-	const struct net_device_ops *ops = dev->netdev_ops;
-	u16 queue_index = 0;
+	u16 queue_index;
+	struct sock *sk = skb->sk;
+
+	if (sk_tx_queue_recorded(sk)) {
+		queue_index = sk_get_tx_queue(sk);
+	} else {
+		const struct net_device_ops *ops = dev->netdev_ops;
 
-	if (ops->ndo_select_queue)
-		queue_index = ops->ndo_select_queue(dev, skb);
-	else if (dev->real_num_tx_queues > 1)
-		queue_index = skb_tx_hash(dev, skb);
+		if (ops->ndo_select_queue) {
+			queue_index = ops->ndo_select_queue(dev, skb);
+		} else {
+			queue_index = 0;
+			if (dev->real_num_tx_queues > 1)
+				queue_index = skb_tx_hash(dev, skb);
+
+			if (sk && sk->sk_dst_cache)
+				sk_record_tx_queue(sk, queue_index);
+		}
+	}
 
 	skb_set_queue_mapping(skb, queue_index);
 	return netdev_get_tx_queue(dev, queue_index);


* [RFC] [PATCH 3/5] net: IPv6 changes
  2009-10-15  5:56 [RFC] [PATCH 0/5] net: Implement fast TX queue selection Krishna Kumar
  2009-10-15  5:56 ` [RFC] [PATCH 1/5] net: Introduce sk_tx_queue_mapping Krishna Kumar
  2009-10-15  5:57 ` [RFC] [PATCH 2/5] net: Use sk_tx_queue_mapping for connected sockets Krishna Kumar
@ 2009-10-15  5:57 ` Krishna Kumar
  2009-10-15  5:57 ` [RFC] [PATCH 4/5] net: Fix for dst_negative_advice Krishna Kumar
  2009-10-15  5:57 ` [RFC] [PATCH 5/5] net: Encapsulate inner code of __sk_dst_reset Krishna Kumar
  4 siblings, 0 replies; 10+ messages in thread
From: Krishna Kumar @ 2009-10-15  5:57 UTC (permalink / raw)
  To: davem; +Cc: netdev, herbert, Krishna Kumar, dada1

From: Krishna Kumar <krkumar2@in.ibm.com>

IPv6: Reset sk_tx_queue_mapping when dst_cache is reset.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 net/ipv6/inet6_connection_sock.c |    1 +
 1 file changed, 1 insertion(+)

diff -ruNp org/net/ipv6/inet6_connection_sock.c new/net/ipv6/inet6_connection_sock.c
--- org/net/ipv6/inet6_connection_sock.c	2009-10-14 17:59:58.000000000 +0530
+++ new/net/ipv6/inet6_connection_sock.c	2009-10-14 18:00:11.000000000 +0530
@@ -168,6 +168,7 @@ struct dst_entry *__inet6_csk_dst_check(
 	if (dst) {
 		struct rt6_info *rt = (struct rt6_info *)dst;
 		if (rt->rt6i_flow_cache_genid != atomic_read(&flow_cache_genid)) {
+			sk_record_tx_queue(sk, -1);
 			sk->sk_dst_cache = NULL;
 			dst_release(dst);
 			dst = NULL;


* [RFC] [PATCH 4/5] net: Fix for dst_negative_advice
  2009-10-15  5:56 [RFC] [PATCH 0/5] net: Implement fast TX queue selection Krishna Kumar
                   ` (2 preceding siblings ...)
  2009-10-15  5:57 ` [RFC] [PATCH 3/5] net: IPv6 changes Krishna Kumar
@ 2009-10-15  5:57 ` Krishna Kumar
  2009-10-15  5:57 ` [RFC] [PATCH 5/5] net: Encapsulate inner code of __sk_dst_reset Krishna Kumar
  4 siblings, 0 replies; 10+ messages in thread
From: Krishna Kumar @ 2009-10-15  5:57 UTC (permalink / raw)
  To: davem; +Cc: netdev, herbert, Krishna Kumar, dada1

From: Krishna Kumar <krkumar2@in.ibm.com>

dst_negative_advice() should check whether the dst changed and reset
sk_tx_queue_mapping accordingly. Make the callers of
dst_negative_advice pass in the sock.

(sk_reset_txq is defined just for use by dst_negative_advice. The
only way I could find to get around this is to move
dst_negative_advice() from dst.h to dst.c, include sock.h in dst.c,
etc.)

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 include/net/dst.h      |   12 ++++++++++--
 net/core/sock.c        |    6 ++++++
 net/dccp/timer.c       |    4 ++--
 net/decnet/af_decnet.c |    2 +-
 net/ipv4/tcp_timer.c   |    4 ++--
 5 files changed, 21 insertions(+), 7 deletions(-)

diff -ruNp org/include/net/dst.h new/include/net/dst.h
--- org/include/net/dst.h	2009-10-14 18:00:07.000000000 +0530
+++ new/include/net/dst.h	2009-10-14 18:00:22.000000000 +0530
@@ -222,11 +222,19 @@ static inline void dst_confirm(struct ds
 		neigh_confirm(dst->neighbour);
 }
 
-static inline void dst_negative_advice(struct dst_entry **dst_p)
+static inline void dst_negative_advice(struct dst_entry **dst_p,
+				       struct sock *sk)
 {
 	struct dst_entry * dst = *dst_p;
-	if (dst && dst->ops->negative_advice)
+	if (dst && dst->ops->negative_advice) {
 		*dst_p = dst->ops->negative_advice(dst);
+
+		if (dst != *dst_p) {
+			extern void sk_reset_txq(struct sock *sk);
+
+			sk_reset_txq(sk);
+		}
+	}
 }
 
 static inline void dst_link_failure(struct sk_buff *skb)
diff -ruNp org/net/core/sock.c new/net/core/sock.c
--- org/net/core/sock.c	2009-10-14 18:00:07.000000000 +0530
+++ new/net/core/sock.c	2009-10-14 18:00:22.000000000 +0530
@@ -353,6 +353,12 @@ discard_and_relse:
 }
 EXPORT_SYMBOL(sk_receive_skb);
 
+void sk_reset_txq(struct sock *sk)
+{
+	sk_record_tx_queue(sk, -1);
+}
+EXPORT_SYMBOL(sk_reset_txq);
+
 struct dst_entry *__sk_dst_check(struct sock *sk, u32 cookie)
 {
 	struct dst_entry *dst = sk->sk_dst_cache;
diff -ruNp org/net/dccp/timer.c new/net/dccp/timer.c
--- org/net/dccp/timer.c	2009-10-14 18:00:07.000000000 +0530
+++ new/net/dccp/timer.c	2009-10-14 18:00:22.000000000 +0530
@@ -38,7 +38,7 @@ static int dccp_write_timeout(struct soc
 
 	if (sk->sk_state == DCCP_REQUESTING || sk->sk_state == DCCP_PARTOPEN) {
 		if (icsk->icsk_retransmits != 0)
-			dst_negative_advice(&sk->sk_dst_cache);
+			dst_negative_advice(&sk->sk_dst_cache, sk);
 		retry_until = icsk->icsk_syn_retries ?
 			    : sysctl_dccp_request_retries;
 	} else {
@@ -63,7 +63,7 @@ static int dccp_write_timeout(struct soc
 			   Golden words :-).
 		   */
 
-			dst_negative_advice(&sk->sk_dst_cache);
+			dst_negative_advice(&sk->sk_dst_cache, sk);
 		}
 
 		retry_until = sysctl_dccp_retries2;
diff -ruNp org/net/decnet/af_decnet.c new/net/decnet/af_decnet.c
--- org/net/decnet/af_decnet.c	2009-10-14 18:00:07.000000000 +0530
+++ new/net/decnet/af_decnet.c	2009-10-14 18:00:22.000000000 +0530
@@ -1955,7 +1955,7 @@ static int dn_sendmsg(struct kiocb *iocb
 	}
 
 	if ((flags & MSG_TRYHARD) && sk->sk_dst_cache)
-		dst_negative_advice(&sk->sk_dst_cache);
+		dst_negative_advice(&sk->sk_dst_cache, sk);
 
 	mss = scp->segsize_rem;
 	fctype = scp->services_rem & NSP_FC_MASK;
diff -ruNp org/net/ipv4/tcp_timer.c new/net/ipv4/tcp_timer.c
--- org/net/ipv4/tcp_timer.c	2009-10-14 18:00:07.000000000 +0530
+++ new/net/ipv4/tcp_timer.c	2009-10-14 18:00:22.000000000 +0530
@@ -141,14 +141,14 @@ static int tcp_write_timeout(struct sock
 
 	if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
 		if (icsk->icsk_retransmits)
-			dst_negative_advice(&sk->sk_dst_cache);
+			dst_negative_advice(&sk->sk_dst_cache, sk);
 		retry_until = icsk->icsk_syn_retries ? : sysctl_tcp_syn_retries;
 	} else {
 		if (retransmits_timed_out(sk, sysctl_tcp_retries1)) {
 			/* Black hole detection */
 			tcp_mtu_probing(icsk, sk);
 
-			dst_negative_advice(&sk->sk_dst_cache);
+			dst_negative_advice(&sk->sk_dst_cache, sk);
 		}
 
 		retry_until = sysctl_tcp_retries2;


* [RFC] [PATCH 5/5] net: Encapsulate inner code of __sk_dst_reset
  2009-10-15  5:56 [RFC] [PATCH 0/5] net: Implement fast TX queue selection Krishna Kumar
                   ` (3 preceding siblings ...)
  2009-10-15  5:57 ` [RFC] [PATCH 4/5] net: Fix for dst_negative_advice Krishna Kumar
@ 2009-10-15  5:57 ` Krishna Kumar
  2009-10-15  9:53   ` Eric Dumazet
  4 siblings, 1 reply; 10+ messages in thread
From: Krishna Kumar @ 2009-10-15  5:57 UTC (permalink / raw)
  To: davem; +Cc: netdev, herbert, Krishna Kumar, dada1

From: Krishna Kumar <krkumar2@in.ibm.com>

Reuse code by adding ___sk_dst_reset() which is called by
__sk_dst_reset() and __sk_dst_check(). This new API is also
called from IPv6 to hide the internals of __sk_dst_reset
and the setting of sk_tx_queue_mapping.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 include/net/sock.h               |   13 +++++++++----
 net/core/sock.c                  |    4 +---
 net/ipv6/inet6_connection_sock.c |    4 +---
 3 files changed, 11 insertions(+), 10 deletions(-)

diff -ruNp org/include/net/sock.h new/include/net/sock.h
--- org/include/net/sock.h	2009-10-14 18:00:17.000000000 +0530
+++ new/include/net/sock.h	2009-10-14 18:00:30.000000000 +0530
@@ -1186,17 +1186,22 @@ sk_dst_set(struct sock *sk, struct dst_e
 }
 
 static inline void
-__sk_dst_reset(struct sock *sk)
+___sk_dst_reset(struct sock *sk, struct dst_entry *old_dst)
 {
-	struct dst_entry *old_dst;
-
 	sk_record_tx_queue(sk, -1);
-	old_dst = sk->sk_dst_cache;
 	sk->sk_dst_cache = NULL;
 	dst_release(old_dst);
 }
 
 static inline void
+__sk_dst_reset(struct sock *sk)
+{
+	struct dst_entry *old_dst = sk->sk_dst_cache;
+
+	___sk_dst_reset(sk, old_dst);
+}
+
+static inline void
 sk_dst_reset(struct sock *sk)
 {
 	write_lock(&sk->sk_dst_lock);
diff -ruNp org/net/core/sock.c new/net/core/sock.c
--- org/net/core/sock.c	2009-10-14 18:00:22.000000000 +0530
+++ new/net/core/sock.c	2009-10-14 18:00:30.000000000 +0530
@@ -364,9 +364,7 @@ struct dst_entry *__sk_dst_check(struct 
 	struct dst_entry *dst = sk->sk_dst_cache;
 
 	if (dst && dst->obsolete && dst->ops->check(dst, cookie) == NULL) {
-		sk_record_tx_queue(sk, -1);
-		sk->sk_dst_cache = NULL;
-		dst_release(dst);
+		___sk_dst_reset(sk, dst);
 		return NULL;
 	}
 
diff -ruNp org/net/ipv6/inet6_connection_sock.c new/net/ipv6/inet6_connection_sock.c
--- org/net/ipv6/inet6_connection_sock.c	2009-10-14 18:00:17.000000000 +0530
+++ new/net/ipv6/inet6_connection_sock.c	2009-10-14 18:00:30.000000000 +0530
@@ -168,9 +168,7 @@ struct dst_entry *__inet6_csk_dst_check(
 	if (dst) {
 		struct rt6_info *rt = (struct rt6_info *)dst;
 		if (rt->rt6i_flow_cache_genid != atomic_read(&flow_cache_genid)) {
-			sk_record_tx_queue(sk, -1);
-			sk->sk_dst_cache = NULL;
-			dst_release(dst);
+			___sk_dst_reset(sk, dst);
 			dst = NULL;
 		}
 	}


* Re: [RFC] [PATCH 2/5] net: Use sk_tx_queue_mapping for connected sockets
  2009-10-15  5:57 ` [RFC] [PATCH 2/5] net: Use sk_tx_queue_mapping for connected sockets Krishna Kumar
@ 2009-10-15  9:41   ` Eric Dumazet
  0 siblings, 0 replies; 10+ messages in thread
From: Eric Dumazet @ 2009-10-15  9:41 UTC (permalink / raw)
  To: Krishna Kumar; +Cc: davem, netdev, herbert

Krishna Kumar wrote:
> From: Krishna Kumar <krkumar2@in.ibm.com>
> 
> For connected sockets, the first run of dev_pick_tx saves the
> calculated txq in sk_tx_queue_mapping. The txq is not saved if
> the skb's rx queue was recorded, or if the device has a queue
> select handler. Subsequent runs of dev_pick_tx use the cached
> value of sk_tx_queue_mapping.
> 
> Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
> ---
>  net/core/dev.c |   24 ++++++++++++++++++------
>  1 file changed, 18 insertions(+), 6 deletions(-)
> 
> diff -ruNp org/net/core/dev.c new/net/core/dev.c
> --- org/net/core/dev.c	2009-10-14 17:59:40.000000000 +0530
> +++ new/net/core/dev.c	2009-10-14 18:00:04.000000000 +0530
> @@ -1791,13 +1791,25 @@ EXPORT_SYMBOL(skb_tx_hash);
>  static struct netdev_queue *dev_pick_tx(struct net_device *dev,
>  					struct sk_buff *skb)
>  {
> -	const struct net_device_ops *ops = dev->netdev_ops;
> -	u16 queue_index = 0;
> +	u16 queue_index;
> +	struct sock *sk = skb->sk;
> +
> +	if (sk_tx_queue_recorded(sk)) {
> +		queue_index = sk_get_tx_queue(sk);
> +	} else {
> +		const struct net_device_ops *ops = dev->netdev_ops;
>  
> -	if (ops->ndo_select_queue)
> -		queue_index = ops->ndo_select_queue(dev, skb);
> -	else if (dev->real_num_tx_queues > 1)
> -		queue_index = skb_tx_hash(dev, skb);
> +		if (ops->ndo_select_queue) {
> +			queue_index = ops->ndo_select_queue(dev, skb);
> +		} else {
> +			queue_index = 0;
> +			if (dev->real_num_tx_queues > 1)
> +				queue_index = skb_tx_hash(dev, skb);
> +
> +			if (sk && sk->sk_dst_cache)
> +				sk_record_tx_queue(sk, queue_index);
> +		}
> +	}
>  
>  	skb_set_queue_mapping(skb, queue_index);
>  	return netdev_get_tx_queue(dev, queue_index);
> 
> 

Hmm, why not cache the ops->ndo_select_queue(dev, skb) choice too?

		if (ops->ndo_select_queue)
			queue_index = ops->ndo_select_queue(dev, skb);
		else {
			queue_index = 0;
			if (dev->real_num_tx_queues > 1)
				queue_index = skb_tx_hash(dev, skb);
		}
		if (sk && sk->sk_dst_cache)
			sk_record_tx_queue(sk, queue_index);

Or should the ndo_select_queue() method take care of calling sk_record_tx_queue() itself?


* Re: [RFC] [PATCH 5/5] net: Encapsulate inner code of __sk_dst_reset
  2009-10-15  5:57 ` [RFC] [PATCH 5/5] net: Encapsulate inner code of __sk_dst_reset Krishna Kumar
@ 2009-10-15  9:53   ` Eric Dumazet
  2009-10-15 12:53     ` Krishna Kumar2
  0 siblings, 1 reply; 10+ messages in thread
From: Eric Dumazet @ 2009-10-15  9:53 UTC (permalink / raw)
  To: Krishna Kumar; +Cc: davem, netdev, herbert

Krishna Kumar wrote:
> From: Krishna Kumar <krkumar2@in.ibm.com>
> 
> Reuse code by adding ___sk_dst_reset() which is called by
> __sk_dst_reset() and __sk_dst_check(). This new API is also
> called from IPv6 to hide the internals of __sk_dst_reset
> and the setting of sk_tx_queue_mapping.
> 
> Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
> ---
>  include/net/sock.h               |   13 +++++++++----
>  net/core/sock.c                  |    4 +---
>  net/ipv6/inet6_connection_sock.c |    4 +---
>  3 files changed, 11 insertions(+), 10 deletions(-)
> 
> diff -ruNp org/include/net/sock.h new/include/net/sock.h
> --- org/include/net/sock.h	2009-10-14 18:00:17.000000000 +0530
> +++ new/include/net/sock.h	2009-10-14 18:00:30.000000000 +0530
> @@ -1186,17 +1186,22 @@ sk_dst_set(struct sock *sk, struct dst_e
>  }
>  
>  static inline void
> -__sk_dst_reset(struct sock *sk)
> +___sk_dst_reset(struct sock *sk, struct dst_entry *old_dst)
>  {
> -	struct dst_entry *old_dst;
> -
>  	sk_record_tx_queue(sk, -1);
> -	old_dst = sk->sk_dst_cache;
>  	sk->sk_dst_cache = NULL;
>  	dst_release(old_dst);
>  }
>  
>  static inline void
> +__sk_dst_reset(struct sock *sk)
> +{
> +	struct dst_entry *old_dst = sk->sk_dst_cache;
> +
> +	___sk_dst_reset(sk, old_dst);
> +}
> +
> +static inline void
>  sk_dst_reset(struct sock *sk)
>  {
>  	write_lock(&sk->sk_dst_lock);
> diff -ruNp org/net/core/sock.c new/net/core/sock.c
> --- org/net/core/sock.c	2009-10-14 18:00:22.000000000 +0530
> +++ new/net/core/sock.c	2009-10-14 18:00:30.000000000 +0530
> @@ -364,9 +364,7 @@ struct dst_entry *__sk_dst_check(struct 
>  	struct dst_entry *dst = sk->sk_dst_cache;
>  
>  	if (dst && dst->obsolete && dst->ops->check(dst, cookie) == NULL) {
> -		sk_record_tx_queue(sk, -1);
> -		sk->sk_dst_cache = NULL;
> -		dst_release(dst);
> +		___sk_dst_reset(sk, dst);
>  		return NULL;
>  	}
>  
> diff -ruNp org/net/ipv6/inet6_connection_sock.c new/net/ipv6/inet6_connection_sock.c
> --- org/net/ipv6/inet6_connection_sock.c	2009-10-14 18:00:17.000000000 +0530
> +++ new/net/ipv6/inet6_connection_sock.c	2009-10-14 18:00:30.000000000 +0530
> @@ -168,9 +168,7 @@ struct dst_entry *__inet6_csk_dst_check(
>  	if (dst) {
>  		struct rt6_info *rt = (struct rt6_info *)dst;
>  		if (rt->rt6i_flow_cache_genid != atomic_read(&flow_cache_genid)) {
> -			sk_record_tx_queue(sk, -1);
> -			sk->sk_dst_cache = NULL;
> -			dst_release(dst);
> +			___sk_dst_reset(sk, dst);
>  			dst = NULL;
>  		}
>  	}
> 
> 

Encapsulation seems unnecessary to me, since the only use cases are
___sk_dst_reset(sk, sk->sk_dst_cache):

static inline void __sk_dst_reset(struct sock *sk)
{
	struct dst_entry *old_dst = sk->sk_dst_cache;

	sk_record_tx_queue(sk, -1);
	sk->sk_dst_cache = NULL;
	dst_release(old_dst);
}



* Re: [RFC] [PATCH 1/5] net: Introduce sk_tx_queue_mapping
  2009-10-15  5:56 ` [RFC] [PATCH 1/5] net: Introduce sk_tx_queue_mapping Krishna Kumar
@ 2009-10-15 10:32   ` Eric Dumazet
  0 siblings, 0 replies; 10+ messages in thread
From: Eric Dumazet @ 2009-10-15 10:32 UTC (permalink / raw)
  To: Krishna Kumar; +Cc: davem, netdev, herbert

Krishna Kumar wrote:
> From: Krishna Kumar <krkumar2@in.ibm.com>
> 
> Introduce sk_tx_queue_mapping, and functions that set, test and get
> this value. Reset sk_tx_queue_mapping to -1 whenever the dst cache
> is set or reset, and in socket alloc & free (free probably doesn't
> need it).
> 
> Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
> ---
>  include/net/sock.h |   21 +++++++++++++++++++++
>  net/core/sock.c    |    7 ++++++-
>  2 files changed, 27 insertions(+), 1 deletion(-)
> 
> diff -ruNp org/include/net/sock.h new/include/net/sock.h
> --- org/include/net/sock.h	2009-10-14 10:36:52.000000000 +0530
> +++ new/include/net/sock.h	2009-10-14 17:59:44.000000000 +0530
> @@ -107,6 +107,7 @@ struct net;
>   *	@skc_node: main hash linkage for various protocol lookup tables
>   *	@skc_nulls_node: main hash linkage for UDP/UDP-Lite protocol
>   *	@skc_refcnt: reference count
> + *	@skc_tx_queue_mapping: tx queue number for this connection
>   *	@skc_hash: hash value used with various protocol lookup tables
>   *	@skc_family: network address family
>   *	@skc_state: Connection state
> @@ -128,6 +129,7 @@ struct sock_common {
>  		struct hlist_nulls_node skc_nulls_node;
>  	};
>  	atomic_t		skc_refcnt;
> +	int			skc_tx_queue_mapping;
>  

Hmm, two remarks:

1) It adds a 32-bit hole on 64-bit arches.
2) sk_tx_queue_mapping is only read in the tx path, but sits close to
skc_refcnt, which is now only read/written in the rx path (by socket lookups).

But since sock_common is small, 56 bytes on x86_64 (under a cache line),
there is nothing we can do at this moment.

My plan is to move skc_refcnt to the end of sock_common; I'll need to add
new generic fields into sock_common to make offsetof(skc_refcnt) = 64.

Next to sock_common will be placed the fields used in the rx path.

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>


* Re: [RFC] [PATCH 5/5] net: Encapsulate inner code of __sk_dst_reset
  2009-10-15  9:53   ` Eric Dumazet
@ 2009-10-15 12:53     ` Krishna Kumar2
  0 siblings, 0 replies; 10+ messages in thread
From: Krishna Kumar2 @ 2009-10-15 12:53 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: davem, herbert, netdev

Hi Eric,

Eric Dumazet <eric.dumazet@gmail.com> wrote on 10/15/2009 03:23:30 PM:

I am responding in one post for your convenience.

> Hmm, why not cache ops->ndo_select_queue(dev, skb) choice too ?
>
>       if (ops->ndo_select_queue)
>          queue_index = ops->ndo_select_queue(dev, skb);
>       else {
>          queue_index = 0;
>          if (dev->real_num_tx_queues > 1)
>             queue_index = skb_tx_hash(dev, skb);
>       }
>       if (sk && sk->sk_dst_cache)
>          sk_record_tx_queue(sk, queue_index);
>
> Or should the ndo_select_queue() method take care of calling
> sk_record_tx_queue() itself?

I initially cached the result for ndo_select_queue too, but felt
dev_pick_tx should not do that and should leave it to the driver for
each skb. But your idea of the driver caching it is good - I think
ixgbe_select_queue & mlx4_en_select_queue (but not iwm_select_queue)
can internally cache it if they are calling skb_tx_hash.
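
For example, a driver could cache its own choice like this (a
hypothetical sketch, not taken from the ixgbe/mlx4 sources; "foo" is a
made-up driver name):

static u16 foo_select_queue(struct net_device *dev, struct sk_buff *skb)
{
	struct sock *sk = skb->sk;
	u16 txq = skb_tx_hash(dev, skb);	/* driver's own policy here */

	/* cache the choice for connected sockets, as dev_pick_tx does */
	if (sk && sk->sk_dst_cache)
		sk_record_tx_queue(sk, txq);

	return txq;
}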

> > diff -ruNp org/net/ipv6/inet6_connection_sock.c new/net/ipv6/inet6_connection_sock.c
> > --- org/net/ipv6/inet6_connection_sock.c	2009-10-14 18:00:17.000000000 +0530
> > +++ new/net/ipv6/inet6_connection_sock.c	2009-10-14 18:00:30.000000000 +0530
> > @@ -168,9 +168,7 @@ struct dst_entry *__inet6_csk_dst_check(
> >     if (dst) {
> >        struct rt6_info *rt = (struct rt6_info *)dst;
> >        if (rt->rt6i_flow_cache_genid != atomic_read(&flow_cache_genid)) {
> > -         sk_record_tx_queue(sk, -1);
> > -         sk->sk_dst_cache = NULL;
> > -         dst_release(dst);
> > +         ___sk_dst_reset(sk, dst);
> >           dst = NULL;
> >        }
> >     }
>
> Encapsulation seems unnecessary to me, since the only use cases are
> ___sk_dst_reset(sk, sk->sk_dst_cache):
>
> static inline void __sk_dst_reset(struct sock *sk)
> {
>    struct dst_entry *old_dst = sk->sk_dst_cache;
>
>    sk_record_tx_queue(sk, -1);
>    sk->sk_dst_cache = NULL;
>    dst_release(old_dst);
> }

That's right. For the IPv6 case, I can simply call __sk_dst_reset(sk).
I will make this change and remove [patch 5/5].
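
The IPv6 check then reduces to something like this (a sketch of what I
plan for v2; __sk_dst_reset records -1 and releases the old dst):

	if (dst) {
		struct rt6_info *rt = (struct rt6_info *)dst;
		if (rt->rt6i_flow_cache_genid !=
		    atomic_read(&flow_cache_genid)) {
			/* dst == sk->sk_dst_cache at this point */
			__sk_dst_reset(sk);
			dst = NULL;
		}
	}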

> Hmm, two remarks:
>
> 1) It adds a 32-bit hole on 64-bit arches.
> 2) sk_tx_queue_mapping is only read in the tx path, but sits close to
> skc_refcnt, which is now only read/written in the rx path (by socket lookups).
>
> But since sock_common is small, 56 bytes on x86_64 (under a cache line),
> there is nothing we can do at this moment.
>
> My plan is to move skc_refcnt to the end of sock_common; I'll need to add
> new generic fields into sock_common to make offsetof(skc_refcnt) = 64.
>
> Next to sock_common will be placed the fields used in the rx path.

So that will fix this problem too, as the tx mapping will then be part
of the read cache line, I guess.

> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>

Thanks for your feedback and approval. I will wait for any more
comments (on the ndo_select_queue question too), and resubmit v2
tomorrow.

Thanks,

- KK

