* [PATCH v2 net-next 1/3] RDS: Use a single TCP socket for both send and receive.
2015-09-30 20:54 [PATCH v2 net-next 0/3] RDS: RDS-TCP perf enhancements Sowmini Varadhan
@ 2015-09-30 20:54 ` Sowmini Varadhan
2015-09-30 20:54 ` [PATCH v2 net-next 2/3] RDS-TCP: Do not bloat sndbuf/rcvbuf in rds_tcp_tune Sowmini Varadhan
` (3 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: Sowmini Varadhan @ 2015-09-30 20:54 UTC (permalink / raw)
To: netdev, linux-kernel
Cc: davem, rds-devel, ajaykumar.hotchandani, igor.maximov,
sowmini.varadhan, santosh.shilimkar, sergei.shtylyov
Commit f711a6ae062c ("net/rds: RDS-TCP: Always create a new rds_sock
for an incoming connection.") modified rds-tcp so that an incoming SYN
would ignore an existing "client" TCP connection which had the local
port set to the transient port. The motivation for ignoring the existing
"client" connection in f711a6ae was to avoid race conditions and an
endless duel of reconnect attempts triggered by a restart/abort of one
of the nodes in the TCP connection.
However, having separate sockets for the active and passive sides
is avoidable, and the simpler model of a single TCP socket for
both sends and receives of all RDS connections associated with
that TCP socket makes for easier observability. We avoid the race
conditions from f711a6ae by attempting reconnects in rds_conn_shutdown
if, and only if, the (new) c_outgoing bit is set for RDS_TRANS_TCP.
The c_outgoing bit is initialized in __rds_conn_create().
A side-effect of re-using the client rds_connection for an incoming
SYN is the potential of encountering duelling SYNs, i.e., we
have an outgoing RDS_CONN_CONNECTING socket when we get the incoming
SYN. The logic to arbitrate this criss-crossing SYN exchange in
rds_tcp_accept_one() has been modified to emulate the BGP state
machine: the smaller IP address should back off from the connection attempt.
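To make the arbitration rule explicit, here is a minimal sketch (the helper
name is made up for illustration; the real change is in the tcp_listen.c hunk
below): the freshly accepted socket is dropped when we already hold an
outgoing t_sock and the accepted socket's local address compares lower, in
host byte order, than its peer address, so that only one of the two
criss-crossing attempts survives.

static bool rds_tcp_reject_incoming_syn(__be32 saddr, __be32 daddr,
					bool have_outgoing_sock)
{
	/* saddr/daddr: local/peer addresses of the accepted socket;
	 * mirrors the ntohl() comparison in rds_tcp_accept_one().
	 */
	return have_outgoing_sock && ntohl(saddr) < ntohl(daddr);
}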
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
v2: fix kbuild-test-robot warning around __be32; modify subject line per
Santosh Shilimkar
net/rds/connection.c | 22 ++++++----------------
net/rds/rds.h | 4 +++-
net/rds/tcp_listen.c | 22 +++++++++-------------
3 files changed, 18 insertions(+), 30 deletions(-)
diff --git a/net/rds/connection.c b/net/rds/connection.c
index 49adeef..d456403 100644
--- a/net/rds/connection.c
+++ b/net/rds/connection.c
@@ -128,10 +128,7 @@ static struct rds_connection *__rds_conn_create(struct net *net,
struct rds_transport *loop_trans;
unsigned long flags;
int ret;
- struct rds_transport *otrans = trans;
- if (!is_outgoing && otrans->t_type == RDS_TRANS_TCP)
- goto new_conn;
rcu_read_lock();
conn = rds_conn_lookup(net, head, laddr, faddr, trans);
if (conn && conn->c_loopback && conn->c_trans != &rds_loop_transport &&
@@ -147,7 +144,6 @@ static struct rds_connection *__rds_conn_create(struct net *net,
if (conn)
goto out;
-new_conn:
conn = kmem_cache_zalloc(rds_conn_slab, gfp);
if (!conn) {
conn = ERR_PTR(-ENOMEM);
@@ -207,6 +203,7 @@ static struct rds_connection *__rds_conn_create(struct net *net,
atomic_set(&conn->c_state, RDS_CONN_DOWN);
conn->c_send_gen = 0;
+ conn->c_outgoing = (is_outgoing ? 1 : 0);
conn->c_reconnect_jiffies = 0;
INIT_DELAYED_WORK(&conn->c_send_w, rds_send_worker);
INIT_DELAYED_WORK(&conn->c_recv_w, rds_recv_worker);
@@ -243,22 +240,13 @@ static struct rds_connection *__rds_conn_create(struct net *net,
/* Creating normal conn */
struct rds_connection *found;
- if (!is_outgoing && otrans->t_type == RDS_TRANS_TCP)
- found = NULL;
- else
- found = rds_conn_lookup(net, head, laddr, faddr, trans);
+ found = rds_conn_lookup(net, head, laddr, faddr, trans);
if (found) {
trans->conn_free(conn->c_transport_data);
kmem_cache_free(rds_conn_slab, conn);
conn = found;
} else {
- if ((is_outgoing && otrans->t_type == RDS_TRANS_TCP) ||
- (otrans->t_type != RDS_TRANS_TCP)) {
- /* Only the active side should be added to
- * reconnect list for TCP.
- */
- hlist_add_head_rcu(&conn->c_hash_node, head);
- }
+ hlist_add_head_rcu(&conn->c_hash_node, head);
rds_cong_add_conn(conn);
rds_conn_count++;
}
@@ -337,7 +325,9 @@ void rds_conn_shutdown(struct rds_connection *conn)
rcu_read_lock();
if (!hlist_unhashed(&conn->c_hash_node)) {
rcu_read_unlock();
- rds_queue_reconnect(conn);
+ if (conn->c_trans->t_type != RDS_TRANS_TCP ||
+ conn->c_outgoing == 1)
+ rds_queue_reconnect(conn);
} else {
rcu_read_unlock();
}
diff --git a/net/rds/rds.h b/net/rds/rds.h
index afb4048..b4c7ac0 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -86,7 +86,9 @@ struct rds_connection {
struct hlist_node c_hash_node;
__be32 c_laddr;
__be32 c_faddr;
- unsigned int c_loopback:1;
+ unsigned int c_loopback:1,
+ c_outgoing:1,
+ c_pad_to_32:30;
struct rds_connection *c_passive;
struct rds_cong_map *c_lcong;
diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
index 444d78d..1d90240 100644
--- a/net/rds/tcp_listen.c
+++ b/net/rds/tcp_listen.c
@@ -110,28 +110,24 @@ int rds_tcp_accept_one(struct socket *sock)
goto out;
}
/* An incoming SYN request came in, and TCP just accepted it.
- * We always create a new conn for listen side of TCP, and do not
- * add it to the c_hash_list.
*
* If the client reboots, this conn will need to be cleaned up.
* rds_tcp_state_change() will do that cleanup
*/
rs_tcp = (struct rds_tcp_connection *)conn->c_transport_data;
- WARN_ON(!rs_tcp || rs_tcp->t_sock);
-
- /*
- * see the comment above rds_queue_delayed_reconnect()
- */
- if (!rds_conn_transition(conn, RDS_CONN_DOWN, RDS_CONN_CONNECTING)) {
- if (rds_conn_state(conn) == RDS_CONN_UP)
- rds_tcp_stats_inc(s_tcp_listen_closed_stale);
- else
- rds_tcp_stats_inc(s_tcp_connect_raced);
- rds_conn_drop(conn);
+ if (rs_tcp->t_sock &&
+ ntohl(inet->inet_saddr) < ntohl(inet->inet_daddr)) {
+ struct sock *nsk = new_sock->sk;
+
+ nsk->sk_user_data = NULL;
+ nsk->sk_prot->disconnect(nsk, 0);
+ tcp_done(nsk);
+ new_sock = NULL;
ret = 0;
goto out;
}
+ rds_conn_transition(conn, RDS_CONN_DOWN, RDS_CONN_CONNECTING);
rds_tcp_set_callbacks(new_sock, conn);
rds_connect_complete(conn);
new_sock = NULL;
--
1.7.1
* [PATCH v2 net-next 2/3] RDS-TCP: Do not bloat sndbuf/rcvbuf in rds_tcp_tune
2015-09-30 20:54 [PATCH v2 net-next 0/3] RDS: RDS-TCP perf enhancements Sowmini Varadhan
2015-09-30 20:54 ` [PATCH v2 net-next 1/3] RDS: Use a single TCP socket for both send and receive Sowmini Varadhan
@ 2015-09-30 20:54 ` Sowmini Varadhan
2015-09-30 20:54 ` [PATCH v2 net-next 3/3] RDS-TCP: Set up MSG_MORE and MSG_SENDPAGE_NOTLAST as appropriate in rds_tcp_xmit Sowmini Varadhan
` (2 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: Sowmini Varadhan @ 2015-09-30 20:54 UTC (permalink / raw)
To: netdev, linux-kernel
Cc: davem, rds-devel, ajaykumar.hotchandani, igor.maximov,
sowmini.varadhan, santosh.shilimkar, sergei.shtylyov
Using the value of RDS_TCP_DEFAULT_BUFSIZE (128K) for sndbuf/rcvbuf
clobbers efficient use of TSO because it inflates the size_goal
that is computed in tcp_sendmsg/tcp_sendpage and skews packet
latency; the default (auto-tuned) values for these parameters actually
result in significantly better performance.
Request-response tests using rds-stress with a packet size of
100K and 16 threads (test parameters -q 100000 -a 256 -t16 -d16)
between a single pair of IP addresses achieve a throughput of
6-8 Gbps with this patch. Without it, throughput maxes out at
2-3 Gbps under equivalent conditions on these platforms.
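As an aside, the comment added to rds_tcp_tune() below says that any
module-specific socket customization should be tunable via module_param()
rather than hard-coded. A hypothetical sketch of such a knob (not part of
this series; names invented for illustration, assuming the usual
linux/module.h machinery already available in net/rds/tcp.c):

static int rds_tcp_sndbuf;	/* 0 (default) leaves TCP autotuning in charge */
module_param(rds_tcp_sndbuf, int, 0444);
MODULE_PARM_DESC(rds_tcp_sndbuf, "fixed sk_sndbuf for RDS-TCP sockets (0 = autotune)");

static void rds_tcp_maybe_set_sndbuf(struct socket *sock)
{
	struct sock *sk = sock->sk;

	if (rds_tcp_sndbuf <= 0)
		return;
	lock_sock(sk);
	sk->sk_sndbuf = rds_tcp_sndbuf;
	sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
	release_sock(sk);
}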
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
net/rds/tcp.c | 16 ++++------------
1 files changed, 4 insertions(+), 12 deletions(-)
diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index c42b60b..9d6ddba 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -67,21 +67,13 @@ void rds_tcp_nonagle(struct socket *sock)
set_fs(oldfs);
}
+/* All module specific customizations to the RDS-TCP socket should be done in
+ * rds_tcp_tune() and applied after socket creation. In general these
+ * customizations should be tunable via module_param()
+ */
void rds_tcp_tune(struct socket *sock)
{
- struct sock *sk = sock->sk;
-
rds_tcp_nonagle(sock);
-
- /*
- * We're trying to saturate gigabit with the default,
- * see svc_sock_setbufsize().
- */
- lock_sock(sk);
- sk->sk_sndbuf = RDS_TCP_DEFAULT_BUFSIZE;
- sk->sk_rcvbuf = RDS_TCP_DEFAULT_BUFSIZE;
- sk->sk_userlocks |= SOCK_SNDBUF_LOCK|SOCK_RCVBUF_LOCK;
- release_sock(sk);
}
u32 rds_tcp_snd_nxt(struct rds_tcp_connection *tc)
--
1.7.1
* [PATCH v2 net-next 3/3] RDS-TCP: Set up MSG_MORE and MSG_SENDPAGE_NOTLAST as appropriate in rds_tcp_xmit
2015-09-30 20:54 [PATCH v2 net-next 0/3] RDS: RDS-TCP perf enhancements Sowmini Varadhan
2015-09-30 20:54 ` [PATCH v2 net-next 1/3] RDS: Use a single TCP socket for both send and receive Sowmini Varadhan
2015-09-30 20:54 ` [PATCH v2 net-next 2/3] RDS-TCP: Do not bloat sndbuf/rcvbuf in rds_tcp_tune Sowmini Varadhan
@ 2015-09-30 20:54 ` Sowmini Varadhan
2015-09-30 23:25 ` [PATCH v2 net-next 0/3] RDS: RDS-TCP perf enhancements santosh shilimkar
2015-10-05 10:35 ` David Miller
4 siblings, 0 replies; 6+ messages in thread
From: Sowmini Varadhan @ 2015-09-30 20:54 UTC (permalink / raw)
To: netdev, linux-kernel
Cc: davem, rds-devel, ajaykumar.hotchandani, igor.maximov,
sowmini.varadhan, santosh.shilimkar, sergei.shtylyov
For the same reasons as commit 2f5338442425 ("tcp: allow splice() to
build full TSO packets") and commit 35f9c09fe9c7 ("tcp: tcp_sendpages()
should call tcp_push() once"), rds_tcp_xmit may have multiple pages to
send, so use MSG_MORE and MSG_SENDPAGE_NOTLAST as hints to
tcp_sendpage().
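A minimal sketch of the hint pattern (the helper is invented for
illustration; the actual change is in the hunk below): every fragment
except the last is sent with MSG_MORE | MSG_SENDPAGE_NOTLAST so that TCP
keeps coalescing into full-sized TSO packets, and only the final
sendpage() call lets the stack push.

static int rds_tcp_sendpage_flags(unsigned int sg, unsigned int nents)
{
	int flags = MSG_DONTWAIT | MSG_NOSIGNAL;

	/* all but the final scatterlist entry hint that more data follows */
	if (sg < nents - 1)
		flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
	return flags;
}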
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
v2: address comments from Sergei Shtylyov and Santosh Shilimkar (some parens
retained for readability)
net/rds/tcp_send.c | 8 +++++++-
1 files changed, 7 insertions(+), 1 deletions(-)
diff --git a/net/rds/tcp_send.c b/net/rds/tcp_send.c
index 53b17ca..2894e60 100644
--- a/net/rds/tcp_send.c
+++ b/net/rds/tcp_send.c
@@ -83,6 +83,7 @@ int rds_tcp_xmit(struct rds_connection *conn, struct rds_message *rm,
struct rds_tcp_connection *tc = conn->c_transport_data;
int done = 0;
int ret = 0;
+ int more;
if (hdr_off == 0) {
/*
@@ -116,12 +117,15 @@ int rds_tcp_xmit(struct rds_connection *conn, struct rds_message *rm,
goto out;
}
+ more = rm->data.op_nents > 1 ? (MSG_MORE | MSG_SENDPAGE_NOTLAST) : 0;
while (sg < rm->data.op_nents) {
+ int flags = MSG_DONTWAIT | MSG_NOSIGNAL | more;
+
ret = tc->t_sock->ops->sendpage(tc->t_sock,
sg_page(&rm->data.op_sg[sg]),
rm->data.op_sg[sg].offset + off,
rm->data.op_sg[sg].length - off,
- MSG_DONTWAIT|MSG_NOSIGNAL);
+ flags);
rdsdebug("tcp sendpage %p:%u:%u ret %d\n", (void *)sg_page(&rm->data.op_sg[sg]),
rm->data.op_sg[sg].offset + off, rm->data.op_sg[sg].length - off,
ret);
@@ -134,6 +138,8 @@ int rds_tcp_xmit(struct rds_connection *conn, struct rds_message *rm,
off = 0;
sg++;
}
+ if (sg == rm->data.op_nents - 1)
+ more = 0;
}
out:
--
1.7.1
* Re: [PATCH v2 net-next 0/3] RDS: RDS-TCP perf enhancements
2015-09-30 20:54 [PATCH v2 net-next 0/3] RDS: RDS-TCP perf enhancements Sowmini Varadhan
` (2 preceding siblings ...)
2015-09-30 20:54 ` [PATCH v2 net-next 3/3] RDS-TCP: Set up MSG_MORE and MSG_SENDPAGE_NOTLAST as appropriate in rds_tcp_xmit Sowmini Varadhan
@ 2015-09-30 23:25 ` santosh shilimkar
2015-10-05 10:35 ` David Miller
4 siblings, 0 replies; 6+ messages in thread
From: santosh shilimkar @ 2015-09-30 23:25 UTC (permalink / raw)
To: Sowmini Varadhan, netdev, linux-kernel
Cc: davem, rds-devel, ajaykumar.hotchandani, igor.maximov,
sergei.shtylyov
On 9/30/2015 1:54 PM, Sowmini Varadhan wrote:
> A 3-part patchset that (a) improves current RDS-TCP perf
> by 2X-3X and (b) refactors earlier robustness code for
> better observability/scaling.
>
> Patch 1 is an enhancement of earlier robustness fixes
> that had used separate sockets for client and server endpoints to
> resolve race conditions. It is possible to have an equivalent
> solution that does not use 2 sockets. The benefit of a
> single socket solution is that it results in more predictable
> and observable behavior for the underlying TCP pipe of an
> RDS connection.
>
> Patches 2 and 3 are simple, straightforward perf bug fixes
> that align the RDS TCP socket with other parts of the kernel stack.
>
> v2: fix kbuild-test-robot warnings, comments from Sergei Shtylyov
> and Santosh Shilimkar.
>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
* Re: [PATCH v2 net-next 0/3] RDS: RDS-TCP perf enhancements
2015-09-30 20:54 [PATCH v2 net-next 0/3] RDS: RDS-TCP perf enhancements Sowmini Varadhan
` (3 preceding siblings ...)
2015-09-30 23:25 ` [PATCH v2 net-next 0/3] RDS: RDS-TCP perf enhancements santosh shilimkar
@ 2015-10-05 10:35 ` David Miller
4 siblings, 0 replies; 6+ messages in thread
From: David Miller @ 2015-10-05 10:35 UTC (permalink / raw)
To: sowmini.varadhan
Cc: netdev, linux-kernel, rds-devel, ajaykumar.hotchandani,
igor.maximov, santosh.shilimkar, sergei.shtylyov
From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Wed, 30 Sep 2015 16:54:06 -0400
> A 3-part patchset that (a) improves current RDS-TCP perf
> by 2X-3X and (b) refactors earlier robustness code for
> better observability/scaling.
>
> Patch 1 is an enhancement of earlier robustness fixes
> that had used separate sockets for client and server endpoints to
> resolve race conditions. It is possible to have an equivalent
> solution that does not use 2 sockets. The benefit of a
> single socket solution is that it results in more predictable
> and observable behavior for the underlying TCP pipe of an
> RDS connection.
>
> Patches 2 and 3 are simple, straightforward perf bug fixes
> that align the RDS TCP socket with other parts of the kernel stack.
>
> v2: fix kbuild-test-robot warnings, comments from Sergei Shtylyov
> and Santosh Shilimkar.
Series applied to net-next, thanks.