* [PATCH 0/4 v4] net: Implement fast TX queue selection
From: Krishna Kumar @ 2009-10-20 9:46 UTC (permalink / raw)
To: davem; +Cc: netdev, herbert, Krishna Kumar, dada1
From: Krishna Kumar <krkumar2@in.ibm.com>
Notes:
------
1. Eric suggested:
- To use u16 for txq#, but I am using an "int" for now as that
avoids one unnecessary subtraction during tx.
- An improvement of caching the txq at connection establishment
time so as to use rxq# = txq# (TBD later).
- Drivers can call sk_tx_queue_set() to set the txq if they are
going to call skb_tx_hash() internally.
2. v3 & v4 patches stress tested with 1000 netperfs, reboots, etc.
Changelog [from v3]:
--------------------
1. Changed the order of patches so that the patch setting the
txq is moved to the end. This results in bisect-safe patches.
2. Fixed a build failure.
Changelog [from v2]:
--------------------
1. Changed names of functions setting, getting and returning the
txq#; and added a new one to reset the txq#.
2. Free sk doesn't need to reset txq#.
Changelog [from v1]:
--------------------
1. Changed IPv6 code to call __sk_dst_reset() directly.
2. Removed the patch re-arranging ("encapsulating") __sk_dst_reset()
Multiqueue cards on routers/firewalls set skb->queue_mapping on
input, which helps in faster xmit. Implement fast queue selection
for locally generated packets also, by saving the txq# for
connected sockets (in dev_pick_tx) and using it in subsequent
iterations. Locally generated packets for a connection will xmit
on the same txq, but routing & firewall loads should not be
affected by this patch. Tests show that the distribution across
txqs for 1-4 netperf sessions is similar to that of the existing
code.
Testing & results:
------------------
1. Cycles/Iter (C/I) used by dev_pick_tx:
(B -> Billion, M -> Million)
|--------------|------------------------|------------------------|
| | ORG | NEW |
| Test |--------|---------|-----|--------|---------|-----|
| | Cycles | Iters | C/I | Cycles | Iters | C/I |
|--------------|--------|---------|-----|--------|---------|-----|
| [TCP_STREAM, | 3.98 B | 12.47 M | 320 | 1.95 B | 12.92 M | 152 |
| UDP_STREAM, | | | | | | |
| TCP_RR, | | | | | | |
| UDP_RR] | | | | | | |
|--------------|--------|---------|-----|--------|---------|-----|
| [TCP_STREAM, | 8.92 B | 29.66 M | 300 | 3.82 B | 38.88 M | 98 |
| TCP_RR, | | | | | | |
| UDP_RR] | | | | | | |
|--------------|--------|---------|-----|--------|---------|-----|
2. Stress test (over 48 hours) : 1000 netperfs running combination
of TCP_STREAM/RR, UDP_STREAM/RR (v4/6, NODELAY/~NODELAY for all
tests), with some ssh sessions, reboots, modprobe -r driver, etc.
3. Performance test (10 hours): Single 10 hour netperf run of
TCP_STREAM/RR, TCP_STREAM + NO_DELAY and UDP_RR. Results show an
improvement in both performance and cpu utilization.
Tested on a 4-processor AMD Opteron 2.8 GHz system with 1GB memory,
10G Chelsio card. Each BW number is the sum of 3 iterations of
individual tests using 512, 16K, 64K & 128K I/O sizes, in Mb/s:
------------------------ TCP Tests -------------------------
#procs   Org BW    New BW    (%)      Org SD   New SD   (%)
------------------------------------------------------------
1        77777.7   81011.0   (4.15)     42.3     40.2   (-5.11)
4        91599.2   91878.8   (.30)     955.9    919.3   (-3.83)
6        89533.3   91792.2   (2.52)   2262.0   2143.0   (-5.25)
8        87507.5   89161.9   (1.89)   4363.4   4073.6   (-6.64)
10       85152.4   85607.8   (.53)    6890.4   6851.2   (-.56)
------------------------------------------------------------
------------------------- TCP NO_DELAY Tests ---------------
#procs   Org BW    New BW    (%)      Org SD   New SD   (%)
------------------------------------------------------------
1        57001.9   57888.0   (1.55)     67.7     70.2   (3.75)
4        69555.1   69957.4   (.57)     823.0    834.3   (1.36)
6        71359.3   71918.7   (.78)    1740.8   1724.5   (-.93)
8        72577.6   72496.1   (-.11)   2955.4   2937.7   (-.59)
10       70829.6   71444.2   (.86)    4826.1   4673.4   (-3.16)
------------------------------------------------------------
----------------------- Request Response Tests --------------------
#procs   Org TPS     New TPS     (%)      Org SD    New SD    (%)
(1-10)
-------------------------------------------------------------------
TCP      1019245.9   1042626.4   (2.29)   16352.9   16459.8   (.65)
UDP      934598.64   942956.9    (.89)    11607.3   11593.2   (-.12)
-------------------------------------------------------------------
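The derived columns in the tables above (C/I and the "(%)" deltas) can be rechecked with two small helpers. This is only a sketch: the function names are made up for illustration, and it assumes C/I is simply cycles divided by iterations and that each percentage is the change relative to the "Org" value.

```c
#include <assert.h>

/* Average cycles spent per call ("C/I" in the first table). */
static double cycles_per_iter(double cycles, double iters)
{
	assert(iters > 0.0);
	return cycles / iters;
}

/* Change of new_val relative to org_val, in percent. */
static double pct_change(double org_val, double new_val)
{
	assert(org_val != 0.0);
	return (new_val - org_val) * 100.0 / org_val;
}
```

For example, cycles_per_iter(3.98e9, 12.47e6) comes out at roughly 319, in line with the 320 reported for the original code. The table inputs are themselves rounded, so recomputed values can differ from the posted ones by a few units.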
Thanks,
- KK
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
---
* [PATCH 1/4 v4] net: Introduce sk_tx_queue_mapping
From: Krishna Kumar @ 2009-10-20 9:46 UTC (permalink / raw)
To: davem; +Cc: netdev, herbert, Krishna Kumar, dada1
From: Krishna Kumar <krkumar2@in.ibm.com>
Introduce sk_tx_queue_mapping, and functions that set, clear, test
and get this value. Reset sk_tx_queue_mapping to -1 whenever the dst
cache is set/reset, and in socket alloc. Using -1 for "not recorded"
and 0 to n-1 for valid txqs allows the tx path to use the value
of sk_tx_queue_mapping directly instead of subtracting 1 on every
tx.
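To illustrate the trade-off, here is a userspace sketch of the two encodings. The struct and function names are hypothetical, and the u16 variant (store txq + 1, with 0 meaning "unset") is only an assumption about what a u16 scheme would look like; it is not code from this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Encoding used by this patch: signed int, -1 means "nothing recorded". */
struct sock_int { int txq; };

static void int_clear(struct sock_int *sk) { sk->txq = -1; }
static void int_set(struct sock_int *sk, int txq) { assert(txq >= 0); sk->txq = txq; }
static bool int_recorded(const struct sock_int *sk) { return sk->txq >= 0; }
static int int_get(const struct sock_int *sk) { return sk->txq; } /* no arithmetic */

/* Assumed alternative: unsigned 16-bit, 0 means "unset", value is txq + 1. */
struct sock_u16 { uint16_t txq_plus_one; };

static void u16_clear(struct sock_u16 *sk) { sk->txq_plus_one = 0; }

static void u16_set(struct sock_u16 *sk, uint16_t txq)
{
	assert(txq < UINT16_MAX);
	sk->txq_plus_one = (uint16_t)(txq + 1);
}

static bool u16_recorded(const struct sock_u16 *sk) { return sk->txq_plus_one != 0; }

/* The subtraction below is the per-packet cost the int encoding avoids. */
static uint16_t u16_get(const struct sock_u16 *sk)
{
	return (uint16_t)(sk->txq_plus_one - 1);
}
```

The int version costs two extra bytes per socket compared with a u16 but keeps the read on the tx path a plain load.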
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
include/net/sock.h | 26 ++++++++++++++++++++++++++
net/core/sock.c | 5 ++++-
2 files changed, 30 insertions(+), 1 deletion(-)
diff -ruNp org/include/net/sock.h new/include/net/sock.h
--- org/include/net/sock.h 2009-10-16 18:53:40.000000000 +0530
+++ new/include/net/sock.h 2009-10-16 21:38:44.000000000 +0530
@@ -107,6 +107,7 @@ struct net;
* @skc_node: main hash linkage for various protocol lookup tables
* @skc_nulls_node: main hash linkage for UDP/UDP-Lite protocol
* @skc_refcnt: reference count
+ * @skc_tx_queue_mapping: tx queue number for this connection
* @skc_hash: hash value used with various protocol lookup tables
* @skc_family: network address family
* @skc_state: Connection state
@@ -128,6 +129,7 @@ struct sock_common {
struct hlist_nulls_node skc_nulls_node;
};
atomic_t skc_refcnt;
+ int skc_tx_queue_mapping;
unsigned int skc_hash;
unsigned short skc_family;
@@ -215,6 +217,7 @@ struct sock {
#define sk_node __sk_common.skc_node
#define sk_nulls_node __sk_common.skc_nulls_node
#define sk_refcnt __sk_common.skc_refcnt
+#define sk_tx_queue_mapping __sk_common.skc_tx_queue_mapping
#define sk_copy_start __sk_common.skc_hash
#define sk_hash __sk_common.skc_hash
@@ -1094,8 +1097,29 @@ static inline void sock_put(struct sock
extern int sk_receive_skb(struct sock *sk, struct sk_buff *skb,
const int nested);
+static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
+{
+ sk->sk_tx_queue_mapping = tx_queue;
+}
+
+static inline void sk_tx_queue_clear(struct sock *sk)
+{
+ sk->sk_tx_queue_mapping = -1;
+}
+
+static inline int sk_tx_queue_get(const struct sock *sk)
+{
+ return sk->sk_tx_queue_mapping;
+}
+
+static inline bool sk_tx_queue_recorded(const struct sock *sk)
+{
+ return (sk && sk->sk_tx_queue_mapping >= 0);
+}
+
static inline void sk_set_socket(struct sock *sk, struct socket *sock)
{
+ sk_tx_queue_clear(sk);
sk->sk_socket = sock;
}
@@ -1152,6 +1176,7 @@ __sk_dst_set(struct sock *sk, struct dst
{
struct dst_entry *old_dst;
+ sk_tx_queue_clear(sk);
old_dst = sk->sk_dst_cache;
sk->sk_dst_cache = dst;
dst_release(old_dst);
@@ -1170,6 +1195,7 @@ __sk_dst_reset(struct sock *sk)
{
struct dst_entry *old_dst;
+ sk_tx_queue_clear(sk);
old_dst = sk->sk_dst_cache;
sk->sk_dst_cache = NULL;
dst_release(old_dst);
diff -ruNp org/net/core/sock.c new/net/core/sock.c
--- org/net/core/sock.c 2009-10-16 18:53:40.000000000 +0530
+++ new/net/core/sock.c 2009-10-16 21:29:02.000000000 +0530
@@ -357,6 +357,7 @@ struct dst_entry *__sk_dst_check(struct
struct dst_entry *dst = sk->sk_dst_cache;
if (dst && dst->obsolete && dst->ops->check(dst, cookie) == NULL) {
+ sk_tx_queue_clear(sk);
sk->sk_dst_cache = NULL;
dst_release(dst);
return NULL;
@@ -953,7 +954,8 @@ static void sock_copy(struct sock *nsk,
void *sptr = nsk->sk_security;
#endif
BUILD_BUG_ON(offsetof(struct sock, sk_copy_start) !=
- sizeof(osk->sk_node) + sizeof(osk->sk_refcnt));
+ sizeof(osk->sk_node) + sizeof(osk->sk_refcnt) +
+ sizeof(osk->sk_tx_queue_mapping));
memcpy(&nsk->sk_copy_start, &osk->sk_copy_start,
osk->sk_prot->obj_size - offsetof(struct sock, sk_copy_start));
#ifdef CONFIG_SECURITY_NETWORK
@@ -997,6 +999,7 @@ static struct sock *sk_prot_alloc(struct
if (!try_module_get(prot->owner))
goto out_free_sec;
+ sk_tx_queue_clear(sk);
}
return sk;
* [PATCH 2/4 v4] net: IPv6 changes
From: Krishna Kumar @ 2009-10-20 9:46 UTC (permalink / raw)
To: davem; +Cc: netdev, herbert, Krishna Kumar, dada1
From: Krishna Kumar <krkumar2@in.ibm.com>
IPv6: Reset sk_tx_queue_mapping when dst_cache is reset. Use the
existing helper, __sk_dst_reset(), to do the work.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
net/ipv6/inet6_connection_sock.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff -ruNp org/net/ipv6/inet6_connection_sock.c new/net/ipv6/inet6_connection_sock.c
--- org/net/ipv6/inet6_connection_sock.c 2009-10-16 21:29:19.000000000 +0530
+++ new/net/ipv6/inet6_connection_sock.c 2009-10-16 21:31:00.000000000 +0530
@@ -168,8 +168,7 @@ struct dst_entry *__inet6_csk_dst_check(
if (dst) {
struct rt6_info *rt = (struct rt6_info *)dst;
if (rt->rt6i_flow_cache_genid != atomic_read(&flow_cache_genid)) {
- sk->sk_dst_cache = NULL;
- dst_release(dst);
+ __sk_dst_reset(sk);
dst = NULL;
}
}
* [PATCH 3/4 v4] net: Fix for dst_negative_advice
From: Krishna Kumar @ 2009-10-20 9:46 UTC (permalink / raw)
To: davem; +Cc: netdev, herbert, Krishna Kumar, dada1
From: Krishna Kumar <krkumar2@in.ibm.com>
dst_negative_advice() should check for a changed dst and reset
sk_tx_queue_mapping accordingly. Callers now pass the sock to
dst_negative_advice().
(sk_reset_txq is defined just for use by dst_negative_advice. The
only way I could find to get around this is to move dst_negative_()
from dst.h to dst.c, include sock.h in dst.c, etc)
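The check described above can be sketched in userspace as follows. The types and names here (struct route, struct conn, advice_fn) are hypothetical stand-ins for dst_entry, sock and the negative_advice op; the point is only the "compare before and after, then clear the cached queue" step.

```c
#include <assert.h>
#include <stddef.h>

struct route { int id; };          /* stand-in for struct dst_entry */

struct conn {                      /* stand-in for struct sock */
	struct route *cached_route;
	int txq;                   /* -1 means no queue recorded */
};

/* Advice callback: returns the route to keep using (same one, another, or NULL). */
typedef struct route *(*advice_fn)(struct route *);

static void negative_advice(struct conn *c, advice_fn advise)
{
	struct route *old;

	assert(c != NULL);
	old = c->cached_route;

	if (old && advise) {
		c->cached_route = advise(old);
		if (c->cached_route != old)
			c->txq = -1;   /* route changed: forget the cached queue */
	}
}

/* Two trivial callbacks for exercising both outcomes. */
static struct route *keep_route(struct route *r) { return r; }
static struct route *drop_route(struct route *r) { (void)r; return NULL; }
```

With keep_route the recorded queue survives; with drop_route the cached route becomes NULL and the queue is reset to -1, so the next transmit would pick a queue afresh.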
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
include/net/dst.h | 12 ++++++++++--
net/core/sock.c | 6 ++++++
net/dccp/timer.c | 4 ++--
net/decnet/af_decnet.c | 2 +-
net/ipv4/tcp_timer.c | 4 ++--
5 files changed, 21 insertions(+), 7 deletions(-)
diff -ruNp org/include/net/dst.h new/include/net/dst.h
--- org/include/net/dst.h 2009-10-16 21:30:56.000000000 +0530
+++ new/include/net/dst.h 2009-10-16 21:31:30.000000000 +0530
@@ -222,11 +222,19 @@ static inline void dst_confirm(struct ds
neigh_confirm(dst->neighbour);
}
-static inline void dst_negative_advice(struct dst_entry **dst_p)
+static inline void dst_negative_advice(struct dst_entry **dst_p,
+ struct sock *sk)
{
struct dst_entry * dst = *dst_p;
- if (dst && dst->ops->negative_advice)
+ if (dst && dst->ops->negative_advice) {
*dst_p = dst->ops->negative_advice(dst);
+
+ if (dst != *dst_p) {
+ extern void sk_reset_txq(struct sock *sk);
+
+ sk_reset_txq(sk);
+ }
+ }
}
static inline void dst_link_failure(struct sk_buff *skb)
diff -ruNp org/net/core/sock.c new/net/core/sock.c
--- org/net/core/sock.c 2009-10-16 21:30:56.000000000 +0530
+++ new/net/core/sock.c 2009-10-16 21:32:33.000000000 +0530
@@ -352,6 +352,12 @@ discard_and_relse:
}
EXPORT_SYMBOL(sk_receive_skb);
+void sk_reset_txq(struct sock *sk)
+{
+ sk_tx_queue_clear(sk);
+}
+EXPORT_SYMBOL(sk_reset_txq);
+
struct dst_entry *__sk_dst_check(struct sock *sk, u32 cookie)
{
struct dst_entry *dst = sk->sk_dst_cache;
diff -ruNp org/net/dccp/timer.c new/net/dccp/timer.c
--- org/net/dccp/timer.c 2009-10-16 21:30:56.000000000 +0530
+++ new/net/dccp/timer.c 2009-10-16 21:31:30.000000000 +0530
@@ -38,7 +38,7 @@ static int dccp_write_timeout(struct soc
if (sk->sk_state == DCCP_REQUESTING || sk->sk_state == DCCP_PARTOPEN) {
if (icsk->icsk_retransmits != 0)
- dst_negative_advice(&sk->sk_dst_cache);
+ dst_negative_advice(&sk->sk_dst_cache, sk);
retry_until = icsk->icsk_syn_retries ?
: sysctl_dccp_request_retries;
} else {
@@ -63,7 +63,7 @@ static int dccp_write_timeout(struct soc
Golden words :-).
*/
- dst_negative_advice(&sk->sk_dst_cache);
+ dst_negative_advice(&sk->sk_dst_cache, sk);
}
retry_until = sysctl_dccp_retries2;
diff -ruNp org/net/decnet/af_decnet.c new/net/decnet/af_decnet.c
--- org/net/decnet/af_decnet.c 2009-10-16 21:30:56.000000000 +0530
+++ new/net/decnet/af_decnet.c 2009-10-16 21:31:30.000000000 +0530
@@ -1955,7 +1955,7 @@ static int dn_sendmsg(struct kiocb *iocb
}
if ((flags & MSG_TRYHARD) && sk->sk_dst_cache)
- dst_negative_advice(&sk->sk_dst_cache);
+ dst_negative_advice(&sk->sk_dst_cache, sk);
mss = scp->segsize_rem;
fctype = scp->services_rem & NSP_FC_MASK;
diff -ruNp org/net/ipv4/tcp_timer.c new/net/ipv4/tcp_timer.c
--- org/net/ipv4/tcp_timer.c 2009-10-16 21:30:56.000000000 +0530
+++ new/net/ipv4/tcp_timer.c 2009-10-16 21:31:30.000000000 +0530
@@ -141,14 +141,14 @@ static int tcp_write_timeout(struct sock
if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
if (icsk->icsk_retransmits)
- dst_negative_advice(&sk->sk_dst_cache);
+ dst_negative_advice(&sk->sk_dst_cache, sk);
retry_until = icsk->icsk_syn_retries ? : sysctl_tcp_syn_retries;
} else {
if (retransmits_timed_out(sk, sysctl_tcp_retries1)) {
/* Black hole detection */
tcp_mtu_probing(icsk, sk);
- dst_negative_advice(&sk->sk_dst_cache);
+ dst_negative_advice(&sk->sk_dst_cache, sk);
}
retry_until = sysctl_tcp_retries2;
* [PATCH 4/4 v4] net: Use sk_tx_queue_mapping for connected sockets
From: Krishna Kumar @ 2009-10-20 9:50 UTC (permalink / raw)
To: davem; +Cc: netdev, herbert, Krishna Kumar, dada1
From: Krishna Kumar <krkumar2@in.ibm.com>
For connected sockets, the first run of dev_pick_tx saves the
calculated txq in sk_tx_queue_mapping. The txq is not saved if
either the device has its own queue select (ndo_select_queue) or
the socket is not connected. Subsequent runs of dev_pick_tx use
the cached value of sk_tx_queue_mapping.
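The selection order can be modelled in plain C as below. This is a simplified sketch, not the kernel code: struct conn, struct dev and hash_to_queue() are hypothetical, the modulo hash is a placeholder for skb_tx_hash(), and "connected" stands in for the sk->sk_dst_cache test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct conn {
	bool connected;   /* stands in for "sk->sk_dst_cache is set" */
	int txq;          /* -1 means no queue recorded yet */
};

struct dev {
	unsigned int num_txq;
	/* Optional driver hook; NULL when the driver has none. */
	unsigned int (*select_queue)(const struct dev *d, unsigned int flow_hash);
};

/* Placeholder hash, not the kernel's skb_tx_hash(). */
static unsigned int hash_to_queue(const struct dev *d, unsigned int flow_hash)
{
	return flow_hash % d->num_txq;
}

static unsigned int pick_tx(const struct dev *d, struct conn *c,
			    unsigned int flow_hash)
{
	unsigned int q;

	assert(d != NULL && d->num_txq > 0);

	if (c && c->txq >= 0)
		return (unsigned int)c->txq;          /* fast path: cached */

	if (d->select_queue)
		return d->select_queue(d, flow_hash); /* driver decides; not cached */

	q = d->num_txq > 1 ? hash_to_queue(d, flow_hash) : 0;
	if (c && c->connected)
		c->txq = (int)q;                      /* remember for later packets */
	return q;
}
```

Once a connected socket has a recorded queue, later calls return it without hashing, which is where the cycle savings in the cover letter would come from. Unconnected sockets and packets without a socket take the slow path every time.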
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
net/core/dev.c | 24 ++++++++++++++++++------
1 file changed, 18 insertions(+), 6 deletions(-)
diff -ruNp org/net/core/dev.c new/net/core/dev.c
--- org/net/core/dev.c 2009-10-19 15:43:30.000000000 +0530
+++ new/net/core/dev.c 2009-10-20 12:24:40.000000000 +0530
@@ -1791,13 +1791,25 @@ EXPORT_SYMBOL(skb_tx_hash);
static struct netdev_queue *dev_pick_tx(struct net_device *dev,
struct sk_buff *skb)
{
- const struct net_device_ops *ops = dev->netdev_ops;
- u16 queue_index = 0;
+ u16 queue_index;
+ struct sock *sk = skb->sk;
+
+ if (sk_tx_queue_recorded(sk)) {
+ queue_index = sk_tx_queue_get(sk);
+ } else {
+ const struct net_device_ops *ops = dev->netdev_ops;
- if (ops->ndo_select_queue)
- queue_index = ops->ndo_select_queue(dev, skb);
- else if (dev->real_num_tx_queues > 1)
- queue_index = skb_tx_hash(dev, skb);
+ if (ops->ndo_select_queue) {
+ queue_index = ops->ndo_select_queue(dev, skb);
+ } else {
+ queue_index = 0;
+ if (dev->real_num_tx_queues > 1)
+ queue_index = skb_tx_hash(dev, skb);
+
+ if (sk && sk->sk_dst_cache)
+ sk_tx_queue_set(sk, queue_index);
+ }
+ }
skb_set_queue_mapping(skb, queue_index);
return netdev_get_tx_queue(dev, queue_index);
* Re: [PATCH 0/4 v4] net: Implement fast TX queue selection
From: David Miller @ 2009-10-21 1:59 UTC (permalink / raw)
To: krkumar2; +Cc: netdev, herbert, dada1
I've applied this set to net-next-2.6, will push out to kernel.org
after some build tests, thanks!