netdev.vger.kernel.org archive mirror
* [PATCH net-next 0/3]  RDS: RDS-TCP perf enhancements
@ 2015-09-30 13:45 Sowmini Varadhan
  2015-09-30 13:45 ` [PATCH net-next 1/3] net/rds: Use a single TCP socket for both send and receive Sowmini Varadhan
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Sowmini Varadhan @ 2015-09-30 13:45 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: davem, rds-devel, ajaykumar.hotchandani, igor.maximov,
	sowmini.varadhan, santosh.shilimkar

A 3-part patchset that (a) improves current RDS-TCP perf
by 2X-3X and (b) refactors earlier robustness code for
better observability/scaling.

Patch 1 is an enhancement of earlier robustness fixes
that had used separate sockets for client and server endpoints to
resolve race conditions. It is possible to have an equivalent
solution that does not use 2 sockets. The benefit of a
single socket solution is that it results in more predictable
and observable behavior for the underlying TCP pipe of an
RDS connection.

Patches 2 and 3 are simple, straightforward perf bug fixes
that align the RDS TCP socket with other parts of the kernel stack.

Sowmini Varadhan (3):
  Use a single TCP socket for both send and receive.
  Do not bloat sndbuf/rcvbuf in rds_tcp_tune
  Set up MSG_MORE and MSG_SENDPAGE_NOTLAST as appropriate in
    rds_tcp_xmit

 net/rds/connection.c |   22 ++++++----------------
 net/rds/rds.h        |    4 +++-
 net/rds/tcp.c        |   16 ++++------------
 net/rds/tcp_listen.c |   19 +++++++------------
 net/rds/tcp_send.c   |    8 +++++++-
 5 files changed, 27 insertions(+), 42 deletions(-)


* [PATCH net-next 1/3] net/rds: Use a single TCP socket for both send and receive.
  2015-09-30 13:45 [PATCH net-next 0/3] RDS: RDS-TCP perf enhancements Sowmini Varadhan
@ 2015-09-30 13:45 ` Sowmini Varadhan
  2015-09-30 14:45   ` kbuild test robot
  2015-09-30 15:50   ` santosh shilimkar
  2015-09-30 13:45 ` [PATCH net-next 2/3] RDS-TCP: Do not bloat sndbuf/rcvbuf in rds_tcp_tune Sowmini Varadhan
  2015-09-30 13:45 ` [PATCH net-next 3/3] RDS-TCP: Set up MSG_MORE and MSG_SENDPAGE_NOTLAST as appropriate in rds_tcp_xmit Sowmini Varadhan
  2 siblings, 2 replies; 14+ messages in thread
From: Sowmini Varadhan @ 2015-09-30 13:45 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: davem, rds-devel, ajaykumar.hotchandani, igor.maximov,
	sowmini.varadhan, santosh.shilimkar

Commit f711a6ae062c ("net/rds: RDS-TCP: Always create a new rds_sock
for an incoming connection.") modified rds-tcp so that an incoming SYN
would ignore an existing "client" TCP connection which had the local
port set to the transient port.  The motivation for ignoring the existing
"client" connection in f711a6ae was to avoid race conditions and an
endless duel of reconnect attempts triggered by a restart/abort of one
of the nodes in the TCP connection.

However, having separate sockets for active and passive sides
is avoidable, and the simpler model of a single TCP socket for
both sends and receives of all RDS connections associated with
that TCP socket makes for easier observability. We avoid the race
conditions from f711a6ae by attempting reconnects in rds_conn_shutdown
if, and only if, the (new) c_outgoing bit is set for RDS_TRANS_TCP.
The c_outgoing bit is initialized in __rds_conn_create().
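
The reconnect rule in this paragraph can be restated as a small user-space
predicate. This is only a sketch under stated assumptions, not the kernel
code: struct conn_sketch, enum trans_sketch and should_queue_reconnect() are
hypothetical stand-ins for struct rds_connection, the transport type and the
check made in rds_conn_shutdown().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the RDS transport types. */
enum trans_sketch { TRANS_SKETCH_IB, TRANS_SKETCH_TCP };

/* Hypothetical, reduced view of a connection: only what the rule needs. */
struct conn_sketch {
	enum trans_sketch t_type;
	unsigned int c_outgoing:1;	/* set when this end initiated the conn */
};

/* Non-TCP transports always reconnect; TCP reconnects only from the
 * active (outgoing) side, so the two peers do not keep duelling.
 */
static bool should_queue_reconnect(const struct conn_sketch *conn)
{
	assert(conn != NULL);
	return conn->t_type != TRANS_SKETCH_TCP || conn->c_outgoing == 1;
}
```

With this rule a passive TCP connection that goes down simply waits for the
peer to reconnect, while every other case keeps the previous behaviour.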

A side-effect of re-using the client rds_connection for an incoming
SYN is the potential of encountering duelling SYNs, i.e., we
have an outgoing RDS_CONN_CONNECTING socket when we get the incoming
SYN. The logic to arbitrate this criss-crossing SYN exchange in
rds_tcp_accept_one() has been modified to emulate the BGP state
machine: the smaller IP address should back off from the connection attempt.
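
The condition that the diff below adds to rds_tcp_accept_one() can be sketched
in user space as follows. The helper name drop_accepted_socket() and its
parameters are hypothetical. The sketch compares the addresses in host byte
order via ntohl(), which is an assumption on my part about how to make the
ordering independent of host endianness; the posted patch compares the
__be32 fields directly.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true when the freshly accepted socket should be
 * discarded because an outgoing socket already exists and the local
 * address is numerically smaller than the peer's.
 * Both addresses are in network byte order, as stored in struct in_addr.
 */
static bool drop_accepted_socket(bool have_outgoing_sock,
				 const struct in_addr *local,
				 const struct in_addr *peer)
{
	assert(local != NULL && peer != NULL);
	/* Convert to host order first: on a little-endian host a raw
	 * comparison of big-endian values is dominated by the last octet.
	 */
	return have_outgoing_sock &&
	       ntohl(local->s_addr) < ntohl(peer->s_addr);
}
```

Because both ends evaluate the same ordering on the same address pair, exactly
one of them sees local < peer, so the two sides cannot both discard.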

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 net/rds/connection.c |   22 ++++++----------------
 net/rds/rds.h        |    4 +++-
 net/rds/tcp_listen.c |   19 +++++++------------
 3 files changed, 16 insertions(+), 29 deletions(-)

diff --git a/net/rds/connection.c b/net/rds/connection.c
index 49adeef..d456403 100644
--- a/net/rds/connection.c
+++ b/net/rds/connection.c
@@ -128,10 +128,7 @@ static struct rds_connection *__rds_conn_create(struct net *net,
 	struct rds_transport *loop_trans;
 	unsigned long flags;
 	int ret;
-	struct rds_transport *otrans = trans;
 
-	if (!is_outgoing && otrans->t_type == RDS_TRANS_TCP)
-		goto new_conn;
 	rcu_read_lock();
 	conn = rds_conn_lookup(net, head, laddr, faddr, trans);
 	if (conn && conn->c_loopback && conn->c_trans != &rds_loop_transport &&
@@ -147,7 +144,6 @@ static struct rds_connection *__rds_conn_create(struct net *net,
 	if (conn)
 		goto out;
 
-new_conn:
 	conn = kmem_cache_zalloc(rds_conn_slab, gfp);
 	if (!conn) {
 		conn = ERR_PTR(-ENOMEM);
@@ -207,6 +203,7 @@ static struct rds_connection *__rds_conn_create(struct net *net,
 
 	atomic_set(&conn->c_state, RDS_CONN_DOWN);
 	conn->c_send_gen = 0;
+	conn->c_outgoing = (is_outgoing ? 1 : 0);
 	conn->c_reconnect_jiffies = 0;
 	INIT_DELAYED_WORK(&conn->c_send_w, rds_send_worker);
 	INIT_DELAYED_WORK(&conn->c_recv_w, rds_recv_worker);
@@ -243,22 +240,13 @@ static struct rds_connection *__rds_conn_create(struct net *net,
 		/* Creating normal conn */
 		struct rds_connection *found;
 
-		if (!is_outgoing && otrans->t_type == RDS_TRANS_TCP)
-			found = NULL;
-		else
-			found = rds_conn_lookup(net, head, laddr, faddr, trans);
+		found = rds_conn_lookup(net, head, laddr, faddr, trans);
 		if (found) {
 			trans->conn_free(conn->c_transport_data);
 			kmem_cache_free(rds_conn_slab, conn);
 			conn = found;
 		} else {
-			if ((is_outgoing && otrans->t_type == RDS_TRANS_TCP) ||
-			    (otrans->t_type != RDS_TRANS_TCP)) {
-				/* Only the active side should be added to
-				 * reconnect list for TCP.
-				 */
-				hlist_add_head_rcu(&conn->c_hash_node, head);
-			}
+			hlist_add_head_rcu(&conn->c_hash_node, head);
 			rds_cong_add_conn(conn);
 			rds_conn_count++;
 		}
@@ -337,7 +325,9 @@ void rds_conn_shutdown(struct rds_connection *conn)
 	rcu_read_lock();
 	if (!hlist_unhashed(&conn->c_hash_node)) {
 		rcu_read_unlock();
-		rds_queue_reconnect(conn);
+		if (conn->c_trans->t_type != RDS_TRANS_TCP ||
+		    conn->c_outgoing == 1)
+			rds_queue_reconnect(conn);
 	} else {
 		rcu_read_unlock();
 	}
diff --git a/net/rds/rds.h b/net/rds/rds.h
index afb4048..b4c7ac0 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -86,7 +86,9 @@ struct rds_connection {
 	struct hlist_node	c_hash_node;
 	__be32			c_laddr;
 	__be32			c_faddr;
-	unsigned int		c_loopback:1;
+	unsigned int		c_loopback:1,
+				c_outgoing:1,
+				c_pad_to_32:30;
 	struct rds_connection	*c_passive;
 
 	struct rds_cong_map	*c_lcong;
diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
index 444d78d..ee70d13 100644
--- a/net/rds/tcp_listen.c
+++ b/net/rds/tcp_listen.c
@@ -110,28 +110,23 @@ int rds_tcp_accept_one(struct socket *sock)
 		goto out;
 	}
 	/* An incoming SYN request came in, and TCP just accepted it.
-	 * We always create a new conn for listen side of TCP, and do not
-	 * add it to the c_hash_list.
 	 *
 	 * If the client reboots, this conn will need to be cleaned up.
 	 * rds_tcp_state_change() will do that cleanup
 	 */
 	rs_tcp = (struct rds_tcp_connection *)conn->c_transport_data;
-	WARN_ON(!rs_tcp || rs_tcp->t_sock);
+	if (rs_tcp->t_sock && inet->inet_saddr < inet->inet_daddr) {
+		struct sock *nsk = new_sock->sk;
 
-	/*
-	 * see the comment above rds_queue_delayed_reconnect()
-	 */
-	if (!rds_conn_transition(conn, RDS_CONN_DOWN, RDS_CONN_CONNECTING)) {
-		if (rds_conn_state(conn) == RDS_CONN_UP)
-			rds_tcp_stats_inc(s_tcp_listen_closed_stale);
-		else
-			rds_tcp_stats_inc(s_tcp_connect_raced);
-		rds_conn_drop(conn);
+		nsk->sk_user_data = NULL;
+		nsk->sk_prot->disconnect(nsk, 0);
+		tcp_done(nsk);
+		new_sock = NULL;
 		ret = 0;
 		goto out;
 	}
 
+	rds_conn_transition(conn, RDS_CONN_DOWN, RDS_CONN_CONNECTING);
 	rds_tcp_set_callbacks(new_sock, conn);
 	rds_connect_complete(conn);
 	new_sock = NULL;
-- 
1.7.1


* [PATCH net-next 2/3] RDS-TCP: Do not bloat sndbuf/rcvbuf in rds_tcp_tune
  2015-09-30 13:45 [PATCH net-next 0/3] RDS: RDS-TCP perf enhancements Sowmini Varadhan
  2015-09-30 13:45 ` [PATCH net-next 1/3] net/rds: Use a single TCP socket for both send and receive Sowmini Varadhan
@ 2015-09-30 13:45 ` Sowmini Varadhan
  2015-09-30 15:54   ` santosh shilimkar
  2015-09-30 13:45 ` [PATCH net-next 3/3] RDS-TCP: Set up MSG_MORE and MSG_SENDPAGE_NOTLAST as appropriate in rds_tcp_xmit Sowmini Varadhan
  2 siblings, 1 reply; 14+ messages in thread
From: Sowmini Varadhan @ 2015-09-30 13:45 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: davem, rds-devel, ajaykumar.hotchandani, igor.maximov,
	sowmini.varadhan, santosh.shilimkar

Using the value of RDS_TCP_DEFAULT_BUFSIZE (128K)
clobbers efficient use of TSO because it inflates the size_goal
that is computed in tcp_sendmsg/tcp_sendpage and skews packet
latency, and the default values for these parameters actually
result in significantly better performance.

In request-response tests using rds-stress with a packet size of
100K and 16 threads (test parameters -q 100000 -a 256 -t16 -d16)
between a single pair of IP addresses, this patch achieves a
throughput of 6-8 Gbps. Without this patch, throughput maxes out at
2-3 Gbps under equivalent conditions on these platforms.
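
As a user-space analogue of what this patch stops doing, the sketch below reads
the send buffer size of a fresh TCP socket, either untouched or after
requesting a fixed size. The helper tcp_sndbuf() is hypothetical. The kernel
code removed below writes sk->sk_sndbuf and sets SOCK_SNDBUF_LOCK directly;
from user space, setsockopt(SO_SNDBUF) is the comparable operation, and on
Linux it also turns off send-buffer autotuning for that socket.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns the kernel-reported SO_SNDBUF of a fresh TCP socket, or -1 on
 * error. If pin_bytes > 0 that size is requested explicitly first, which
 * pins the buffer instead of leaving it to the stack's defaults.
 */
static int tcp_sndbuf(int pin_bytes)
{
	int fd, val = -1;
	socklen_t len = sizeof(val);

	assert(pin_bytes >= 0);
	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;
	if (pin_bytes > 0 &&
	    setsockopt(fd, SOL_SOCKET, SO_SNDBUF,
		       &pin_bytes, sizeof(pin_bytes)) < 0) {
		close(fd);
		return -1;
	}
	if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &val, &len) < 0)
		val = -1;
	close(fd);
	return val;
}
```

Calling tcp_sndbuf(0) and tcp_sndbuf(128 * 1024) on the test machine shows the
default next to the pinned value; the numbers depend on the local sysctl
settings, so none are quoted here.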

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 net/rds/tcp.c |   16 ++++------------
 1 files changed, 4 insertions(+), 12 deletions(-)

diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index c42b60b..9d6ddba 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -67,21 +67,13 @@ void rds_tcp_nonagle(struct socket *sock)
 	set_fs(oldfs);
 }
 
+/* All module specific customizations to the RDS-TCP socket should be done in
+ * rds_tcp_tune() and applied after socket creation. In general these
+ * customizations should be tunable via module_param()
+ */
 void rds_tcp_tune(struct socket *sock)
 {
-	struct sock *sk = sock->sk;
-
 	rds_tcp_nonagle(sock);
-
-	/*
-	 * We're trying to saturate gigabit with the default,
-	 * see svc_sock_setbufsize().
-	 */
-	lock_sock(sk);
-	sk->sk_sndbuf = RDS_TCP_DEFAULT_BUFSIZE;
-	sk->sk_rcvbuf = RDS_TCP_DEFAULT_BUFSIZE;
-	sk->sk_userlocks |= SOCK_SNDBUF_LOCK|SOCK_RCVBUF_LOCK;
-	release_sock(sk);
 }
 
 u32 rds_tcp_snd_nxt(struct rds_tcp_connection *tc)
-- 
1.7.1


* [PATCH net-next 3/3] RDS-TCP: Set up MSG_MORE and MSG_SENDPAGE_NOTLAST as appropriate in rds_tcp_xmit
  2015-09-30 13:45 [PATCH net-next 0/3] RDS: RDS-TCP perf enhancements Sowmini Varadhan
  2015-09-30 13:45 ` [PATCH net-next 1/3] net/rds: Use a single TCP socket for both send and receive Sowmini Varadhan
  2015-09-30 13:45 ` [PATCH net-next 2/3] RDS-TCP: Do not bloat sndbuf/rcvbuf in rds_tcp_tune Sowmini Varadhan
@ 2015-09-30 13:45 ` Sowmini Varadhan
  2015-09-30 14:53   ` Sergei Shtylyov
  2015-09-30 15:56   ` santosh shilimkar
  2 siblings, 2 replies; 14+ messages in thread
From: Sowmini Varadhan @ 2015-09-30 13:45 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: davem, rds-devel, ajaykumar.hotchandani, igor.maximov,
	sowmini.varadhan, santosh.shilimkar

rds_tcp_xmit may have multiple pages to send, so, for the same
reasons as 2f53384424 and 35f9c09fe9, pass MSG_MORE and
MSG_SENDPAGE_NOTLAST as hints to tcp_sendpage().
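
The intent can be sketched as a per-entry flag computation. This is a
user-space restatement under assumptions, not the control flow of the diff
below: xmit_flags() is a hypothetical helper, and SKETCH_SENDPAGE_NOTLAST is a
stand-in for the kernel-internal MSG_SENDPAGE_NOTLAST, which user-space
headers do not provide.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>

/* Stand-in for the kernel-internal MSG_SENDPAGE_NOTLAST; the value is
 * used here for illustration only.
 */
#define SKETCH_SENDPAGE_NOTLAST 0x20000

/* Flags for sending entry `sg` (0-based) out of `nents` scatterlist
 * entries: every entry except the last one carries the "more data
 * follows" hints, so TCP can hold off pushing a partial segment.
 */
static int xmit_flags(size_t sg, size_t nents)
{
	int flags = MSG_DONTWAIT | MSG_NOSIGNAL;

	assert(sg < nents);
	if (sg + 1 < nents)
		flags |= MSG_MORE | SKETCH_SENDPAGE_NOTLAST;
	return flags;
}
```

A single-entry message gets no hint at all, and in a multi-entry message only
the final entry is sent without the hints.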

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 net/rds/tcp_send.c |    8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/net/rds/tcp_send.c b/net/rds/tcp_send.c
index 53b17ca..5f3e3fa 100644
--- a/net/rds/tcp_send.c
+++ b/net/rds/tcp_send.c
@@ -83,6 +83,7 @@ int rds_tcp_xmit(struct rds_connection *conn, struct rds_message *rm,
 	struct rds_tcp_connection *tc = conn->c_transport_data;
 	int done = 0;
 	int ret = 0;
+	int more;
 
 	if (hdr_off == 0) {
 		/*
@@ -116,12 +117,15 @@ int rds_tcp_xmit(struct rds_connection *conn, struct rds_message *rm,
 			goto out;
 	}
 
+	more = (rm->data.op_nents > 1 ? (MSG_MORE | MSG_SENDPAGE_NOTLAST) : 0);
 	while (sg < rm->data.op_nents) {
+		int flags = (MSG_DONTWAIT | MSG_NOSIGNAL | more);
+
 		ret = tc->t_sock->ops->sendpage(tc->t_sock,
 						sg_page(&rm->data.op_sg[sg]),
 						rm->data.op_sg[sg].offset + off,
 						rm->data.op_sg[sg].length - off,
-						MSG_DONTWAIT|MSG_NOSIGNAL);
+						flags);
 		rdsdebug("tcp sendpage %p:%u:%u ret %d\n", (void *)sg_page(&rm->data.op_sg[sg]),
 			 rm->data.op_sg[sg].offset + off, rm->data.op_sg[sg].length - off,
 			 ret);
@@ -134,6 +138,8 @@ int rds_tcp_xmit(struct rds_connection *conn, struct rds_message *rm,
 			off = 0;
 			sg++;
 		}
+		if (sg == rm->data.op_nents - 1)
+			more = 0;
 	}
 
 out:
-- 
1.7.1


* Re: [PATCH net-next 1/3] net/rds: Use a single TCP socket for both send and receive.
  2015-09-30 13:45 ` [PATCH net-next 1/3] net/rds: Use a single TCP socket for both send and receive Sowmini Varadhan
@ 2015-09-30 14:45   ` kbuild test robot
  2015-09-30 15:50   ` santosh shilimkar
  1 sibling, 0 replies; 14+ messages in thread
From: kbuild test robot @ 2015-09-30 14:45 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: kbuild-all, netdev, linux-kernel, davem, rds-devel,
	ajaykumar.hotchandani, igor.maximov, sowmini.varadhan,
	santosh.shilimkar

Hi Sowmini,

[auto build test results on v4.3-rc3 -- if it's inappropriate base, please ignore]

reproduce:
  # apt-get install sparse
  make ARCH=x86_64 allmodconfig
  make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

>> net/rds/tcp_listen.c:118:35: sparse: restricted __be32 degrades to integer
   net/rds/tcp_listen.c:118:56: sparse: restricted __be32 degrades to integer
   net/rds/tcp_listen.c:187:29: sparse: incorrect type in assignment (different base types)
   net/rds/tcp_listen.c:187:29:    expected restricted __be32 [assigned] [usertype] s_addr
   net/rds/tcp_listen.c:187:29:    got unsigned int [unsigned] [usertype] <noident>
   net/rds/tcp_listen.c:188:22: sparse: incorrect type in assignment (different base types)
   net/rds/tcp_listen.c:188:22:    expected restricted __be16 [assigned] [usertype] sin_port
   net/rds/tcp_listen.c:188:22:    got unsigned short [unsigned] [usertype] <noident>

vim +118 net/rds/tcp_listen.c

   102			 &inet->inet_saddr, ntohs(inet->inet_sport),
   103			 &inet->inet_daddr, ntohs(inet->inet_dport));
   104	
   105		conn = rds_conn_create(sock_net(sock->sk),
   106				       inet->inet_saddr, inet->inet_daddr,
   107				       &rds_tcp_transport, GFP_KERNEL);
   108		if (IS_ERR(conn)) {
   109			ret = PTR_ERR(conn);
   110			goto out;
   111		}
   112		/* An incoming SYN request came in, and TCP just accepted it.
   113		 *
   114		 * If the client reboots, this conn will need to be cleaned up.
   115		 * rds_tcp_state_change() will do that cleanup
   116		 */
   117		rs_tcp = (struct rds_tcp_connection *)conn->c_transport_data;
 > 118		if (rs_tcp->t_sock && inet->inet_saddr < inet->inet_daddr) {
   119			struct sock *nsk = new_sock->sk;
   120	
   121			nsk->sk_user_data = NULL;
   122			nsk->sk_prot->disconnect(nsk, 0);
   123			tcp_done(nsk);
   124			new_sock = NULL;
   125			ret = 0;
   126			goto out;

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation


* Re: [PATCH net-next 3/3] RDS-TCP: Set up MSG_MORE and MSG_SENDPAGE_NOTLAST as appropriate in rds_tcp_xmit
  2015-09-30 13:45 ` [PATCH net-next 3/3] RDS-TCP: Set up MSG_MORE and MSG_SENDPAGE_NOTLAST as appropriate in rds_tcp_xmit Sowmini Varadhan
@ 2015-09-30 14:53   ` Sergei Shtylyov
  2015-09-30 15:56   ` santosh shilimkar
  1 sibling, 0 replies; 14+ messages in thread
From: Sergei Shtylyov @ 2015-09-30 14:53 UTC (permalink / raw)
  To: Sowmini Varadhan, netdev, linux-kernel
  Cc: davem, rds-devel, ajaykumar.hotchandani, igor.maximov,
	santosh.shilimkar

Hello.

On 09/30/2015 04:45 PM, Sowmini Varadhan wrote:

> For the same reasons as 2f53384424 and 35f9c09fe9, rds_tcp_xmit
> may have multiple pages to send, so use the MSG_MORE and
> MSG_SENDPAGE_NOTLAST as hints to tcp_sendpage()
>
> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> ---
>   net/rds/tcp_send.c |    8 +++++++-
>   1 files changed, 7 insertions(+), 1 deletions(-)
>
> diff --git a/net/rds/tcp_send.c b/net/rds/tcp_send.c
> index 53b17ca..5f3e3fa 100644
> --- a/net/rds/tcp_send.c
> +++ b/net/rds/tcp_send.c
[...]
> @@ -116,12 +117,15 @@ int rds_tcp_xmit(struct rds_connection *conn, struct rds_message *rm,
>   			goto out;
>   	}
>
> +	more = (rm->data.op_nents > 1 ? (MSG_MORE | MSG_SENDPAGE_NOTLAST) : 0);

    No need for either inner or outer parens.

>   	while (sg < rm->data.op_nents) {
> +		int flags = (MSG_DONTWAIT | MSG_NOSIGNAL | more);

    Parens not needed as well.

[...]

MBR, Sergei


* Re: [PATCH net-next 1/3] net/rds: Use a single TCP socket for both send and receive.
  2015-09-30 13:45 ` [PATCH net-next 1/3] net/rds: Use a single TCP socket for both send and receive Sowmini Varadhan
  2015-09-30 14:45   ` kbuild test robot
@ 2015-09-30 15:50   ` santosh shilimkar
  2015-09-30 15:58     ` Sowmini Varadhan
  2015-09-30 16:09     ` Sowmini Varadhan
  1 sibling, 2 replies; 14+ messages in thread
From: santosh shilimkar @ 2015-09-30 15:50 UTC (permalink / raw)
  To: Sowmini Varadhan, netdev, linux-kernel
  Cc: davem, rds-devel, ajaykumar.hotchandani, igor.maximov

Minor nit, though not a strict rule; just to be consistent with
what we have been following.

- core RDS patches "RDS:"
- RDS IB patches "RDS: IB:" or "RDS/IB:"
- RDS IW patches "RDS: IW:" or
- RDS TCP can use "RDS: TCP" or "RDS/TCP:"

$subject
s/net/rds:/RDS:

On 9/30/2015 6:45 AM, Sowmini Varadhan wrote:
> Commit f711a6ae062c ("net/rds: RDS-TCP: Always create a new rds_sock
> for an incoming connection.") modified rds-tcp so that an incoming SYN
> would ignore an existing "client" TCP connection which had the local
> port set to the transient port.  The motivation for ignoring the existing
> "client" connection in f711a6ae was to avoid race conditions and an
> endless duel of reconnect attempts triggered by a restart/abort of one
> of the nodes in the TCP connection.
>
> However, having separate sockets for active and passive sides
> is avoidable, and the simpler model of a single TCP socket for
> both send and receives of all RDS connections associated with
> that tcp socket makes for easier observability. We avoid the race
> conditions from f711a6ae by attempting reconnects in rds_conn_shutdown
> if, and only if, the (new) c_outgoing bit is set for RDS_TRANS_TCP.
> The c_outgoing bit is initialized in __rds_conn_create().
>
> A side-effect of re-using the client rds_connection for an incoming
> SYN is the potential of encountering duelling SYNs, i.e., we
> have an outgoing RDS_CONN_CONNECTING socket when we get the incoming
> SYN. The logic to arbitrate this criss-crossing SYN exchange in
> rds_tcp_accept_one() has been modified to emulate the BGP state
> machine: the smaller IP address should back off from the connection attempt.
>
> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> ---
>   net/rds/connection.c |   22 ++++++----------------
>   net/rds/rds.h        |    4 +++-
>   net/rds/tcp_listen.c |   19 +++++++------------
>   3 files changed, 16 insertions(+), 29 deletions(-)
>

[...]

> diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
> index 444d78d..ee70d13 100644
> --- a/net/rds/tcp_listen.c
> +++ b/net/rds/tcp_listen.c
> @@ -110,28 +110,23 @@ int rds_tcp_accept_one(struct socket *sock)
>   		goto out;
>   	}
>   	/* An incoming SYN request came in, and TCP just accepted it.
> -	 * We always create a new conn for listen side of TCP, and do not
> -	 * add it to the c_hash_list.
>   	 *
>   	 * If the client reboots, this conn will need to be cleaned up.
>   	 * rds_tcp_state_change() will do that cleanup
>   	 */
>   	rs_tcp = (struct rds_tcp_connection *)conn->c_transport_data;
> -	WARN_ON(!rs_tcp || rs_tcp->t_sock);
> +	if (rs_tcp->t_sock && inet->inet_saddr < inet->inet_daddr) {
> +		struct sock *nsk = new_sock->sk;
>
Any reason you dropped the WARN_ON()? Note that until commit
74e98eb0 ("RDS: verify the underlying transport exists before creating
a connection") was merged, we had an issue. That commit guards it now.

I am curious about the WARN_ON(), hence the question.

Rest of the patch looks fine to me.
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>


* Re: [PATCH net-next 2/3] RDS-TCP: Do not bloat sndbuf/rcvbuf in rds_tcp_tune
  2015-09-30 13:45 ` [PATCH net-next 2/3] RDS-TCP: Do not bloat sndbuf/rcvbuf in rds_tcp_tune Sowmini Varadhan
@ 2015-09-30 15:54   ` santosh shilimkar
  0 siblings, 0 replies; 14+ messages in thread
From: santosh shilimkar @ 2015-09-30 15:54 UTC (permalink / raw)
  To: Sowmini Varadhan, netdev, linux-kernel
  Cc: davem, rds-devel, ajaykumar.hotchandani, igor.maximov

On 9/30/2015 6:45 AM, Sowmini Varadhan wrote:
> Using the value of RDS_TCP_DEFAULT_BUFSIZE (128K)
> clobbers efficient use of TSO because it inflates the size_goal
> that is computed in tcp_sendmsg/tcp_sendpage and skews packet
> latency, and the default values for these parameters actually
> results in significantly better performance.
>
> In request-response tests using rds-stress with a packet size of
> 100K with 16 threads (test parameters -q 100000 -a 256 -t16 -d16)
> between a single pair of IP addresses achieves a throughput of
> 6-8 Gbps. Without this patch, throughput maxes at 2-3 Gbps under
> equivalent conditions on these platforms.
>
> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> ---
>   net/rds/tcp.c |   16 ++++------------
>   1 files changed, 4 insertions(+), 12 deletions(-)
>
> diff --git a/net/rds/tcp.c b/net/rds/tcp.c
> index c42b60b..9d6ddba 100644
> --- a/net/rds/tcp.c
> +++ b/net/rds/tcp.c
> @@ -67,21 +67,13 @@ void rds_tcp_nonagle(struct socket *sock)
>   	set_fs(oldfs);
>   }
>
> +/* All module specific customizations to the RDS-TCP socket should be done in
> + * rds_tcp_tune() and applied after socket creation. In general these
> + * customizations should be tunable via module_param()
> + */
>   void rds_tcp_tune(struct socket *sock)
>   {
> -	struct sock *sk = sock->sk;
> -
>   	rds_tcp_nonagle(sock);
> -
> -	/*
> -	 * We're trying to saturate gigabit with the default,
> -	 * see svc_sock_setbufsize().
> -	 */
> -	lock_sock(sk);
> -	sk->sk_sndbuf = RDS_TCP_DEFAULT_BUFSIZE;
> -	sk->sk_rcvbuf = RDS_TCP_DEFAULT_BUFSIZE;
> -	sk->sk_userlocks |= SOCK_SNDBUF_LOCK|SOCK_RCVBUF_LOCK;
> -	release_sock(sk);
>   }
>
>   u32 rds_tcp_snd_nxt(struct rds_tcp_connection *tc)
>
We should at least start with sndbuf/rcvbuf parameters.
Nice work. Almost ~3X lift in RDS TCP performance.

Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>


* Re: [PATCH net-next 3/3] RDS-TCP: Set up MSG_MORE and MSG_SENDPAGE_NOTLAST as appropriate in rds_tcp_xmit
  2015-09-30 13:45 ` [PATCH net-next 3/3] RDS-TCP: Set up MSG_MORE and MSG_SENDPAGE_NOTLAST as appropriate in rds_tcp_xmit Sowmini Varadhan
  2015-09-30 14:53   ` Sergei Shtylyov
@ 2015-09-30 15:56   ` santosh shilimkar
  2015-09-30 16:00     ` Sowmini Varadhan
  1 sibling, 1 reply; 14+ messages in thread
From: santosh shilimkar @ 2015-09-30 15:56 UTC (permalink / raw)
  To: Sowmini Varadhan, netdev, linux-kernel
  Cc: davem, rds-devel, ajaykumar.hotchandani, igor.maximov

On 9/30/2015 6:45 AM, Sowmini Varadhan wrote:
> For the same reasons as 2f53384424 and 35f9c09fe9, rds_tcp_xmit
> may have multiple pages to send, so use the MSG_MORE and
> MSG_SENDPAGE_NOTLAST as hints to tcp_sendpage()
>
> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> ---
Your checkpatch.pl should have complained about the commit
reference in the changelog. You might want to fix that
for consistency.

Patch itself is fine.
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>


* Re: [PATCH net-next 1/3] net/rds: Use a single TCP socket for both send and receive.
  2015-09-30 15:50   ` santosh shilimkar
@ 2015-09-30 15:58     ` Sowmini Varadhan
  2015-09-30 16:04       ` santosh shilimkar
  2015-09-30 16:09     ` Sowmini Varadhan
  1 sibling, 1 reply; 14+ messages in thread
From: Sowmini Varadhan @ 2015-09-30 15:58 UTC (permalink / raw)
  To: santosh shilimkar
  Cc: netdev, linux-kernel, davem, rds-devel, ajaykumar.hotchandani,
	igor.maximov

On (09/30/15 08:50), santosh shilimkar wrote:
> minor nit though not a strict rule. Just to be consistent based on
> what we are following.
> 
> - core RDS patches "RDS:"
> - RDS IB patches "RDS: IB:" or "RDS/IB:"
> - RDS IW patches "RDS: IW:" or
> - RDS TCP can use "RDS: TCP" or "RDS/TCP:"

OK, but in the case of patch 1/3 the changes affect both the core and
rds-tcp modules.

I am working on patch v2, which will address Sergei's comments and the
kbuild-test-robot warning as well.

> 
> $subject
> s/net/rds:/RDS:
> 
> On 9/30/2015 6:45 AM, Sowmini Varadhan wrote:
> >Commit f711a6ae062c ("net/rds: RDS-TCP: Always create a new rds_sock
> >for an incoming connection.") modified rds-tcp so that an incoming SYN
> >would ignore an existing "client" TCP connection which had the local
> >port set to the transient port.  The motivation for ignoring the existing
> >"client" connection in f711a6ae was to avoid race conditions and an
> >endless duel of reconnect attempts triggered by a restart/abort of one
> >of the nodes in the TCP connection.
> >
> >However, having separate sockets for active and passive sides
> >is avoidable, and the simpler model of a single TCP socket for
> >both send and receives of all RDS connections associated with
> >that tcp socket makes for easier observability. We avoid the race
> >conditions from f711a6ae by attempting reconnects in rds_conn_shutdown
> >if, and only if, the (new) c_outgoing bit is set for RDS_TRANS_TCP.
> >The c_outgoing bit is initialized in __rds_conn_create().
> >
> >A side-effect of re-using the client rds_connection for an incoming
> >SYN is the potential of encountering duelling SYNs, i.e., we
> >have an outgoing RDS_CONN_CONNECTING socket when we get the incoming
> >SYN. The logic to arbitrate this criss-crossing SYN exchange in
> >rds_tcp_accept_one() has been modified to emulate the BGP state
> >machine: the smaller IP address should back off from the connection attempt.
> >
> >Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> >---
> >  net/rds/connection.c |   22 ++++++----------------
> >  net/rds/rds.h        |    4 +++-
> >  net/rds/tcp_listen.c |   19 +++++++------------
> >  3 files changed, 16 insertions(+), 29 deletions(-)
> >
> 
> [...]
> 
> >diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
> >index 444d78d..ee70d13 100644
> >--- a/net/rds/tcp_listen.c
> >+++ b/net/rds/tcp_listen.c
> >@@ -110,28 +110,23 @@ int rds_tcp_accept_one(struct socket *sock)
> >  		goto out;
> >  	}
> >  	/* An incoming SYN request came in, and TCP just accepted it.
> >-	 * We always create a new conn for listen side of TCP, and do not
> >-	 * add it to the c_hash_list.
> >  	 *
> >  	 * If the client reboots, this conn will need to be cleaned up.
> >  	 * rds_tcp_state_change() will do that cleanup
> >  	 */
> >  	rs_tcp = (struct rds_tcp_connection *)conn->c_transport_data;
> >-	WARN_ON(!rs_tcp || rs_tcp->t_sock);
> >+	if (rs_tcp->t_sock && inet->inet_saddr < inet->inet_daddr) {
> >+		struct sock *nsk = new_sock->sk;
> >
> Any reason you dropped the WARN_ON. Note that till we got commit
> 74e98eb0 (" RDS: verify the underlying transport exists before creating
> a connection") merged, we had an issue. That guards it now.
> 
> Am curious about WARN_ON() and hence the question.
> 
> Rest of the patch looks fine to me.
> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
> 


* Re: [PATCH net-next 3/3] RDS-TCP: Set up MSG_MORE and MSG_SENDPAGE_NOTLAST as appropriate in rds_tcp_xmit
  2015-09-30 15:56   ` santosh shilimkar
@ 2015-09-30 16:00     ` Sowmini Varadhan
  0 siblings, 0 replies; 14+ messages in thread
From: Sowmini Varadhan @ 2015-09-30 16:00 UTC (permalink / raw)
  To: santosh shilimkar
  Cc: netdev, linux-kernel, davem, rds-devel, ajaykumar.hotchandani,
	igor.maximov

On (09/30/15 08:56), santosh shilimkar wrote:
> Your checkpatch.pl should have complained about commit
> reference in the change-log. You might want to fix that
> for consistency.

It didn't. But OK, I'll fix this nit as well.


* Re: [PATCH net-next 1/3] net/rds: Use a single TCP socket for both send and receive.
  2015-09-30 15:58     ` Sowmini Varadhan
@ 2015-09-30 16:04       ` santosh shilimkar
  0 siblings, 0 replies; 14+ messages in thread
From: santosh shilimkar @ 2015-09-30 16:04 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: netdev, linux-kernel, davem, rds-devel, ajaykumar.hotchandani,
	igor.maximov

On 9/30/2015 8:58 AM, Sowmini Varadhan wrote:
> On (09/30/15 08:50), santosh shilimkar wrote:
>> minor nit though not a strict rule. Just to be consistent based on
>> what we are following.
>>
>> - core RDS patches "RDS:"
>> - RDS IB patches "RDS: IB:" or "RDS/IB:"
>> - RDS IW patches "RDS: IW:" or
>> - RDS TCP can use "RDS: TCP" or "RDS/TCP:"
>
> Ok, but in this case patch 1/3 the changes affect both core and rds-tcp
> modules.
>
As I said, these are not strict rules, just what has been followed.
I would use "RDS: TCP:" for the first patch as well, but I'll let you
make the call :-)

> Working on patchv2 that will address Sergei's comments and the
> kbuild-test-robot warning as well
>
OK. How about the dropped WARN_ON() question?

>>
>> $subject
>> s/net/rds:/RDS:
>>
>> On 9/30/2015 6:45 AM, Sowmini Varadhan wrote:
>>> Commit f711a6ae062c ("net/rds: RDS-TCP: Always create a new rds_sock
>>> for an incoming connection.") modified rds-tcp so that an incoming SYN
>>> would ignore an existing "client" TCP connection which had the local
>>> port set to the transient port.  The motivation for ignoring the existing
>>> "client" connection in f711a6ae was to avoid race conditions and an
>>> endless duel of reconnect attempts triggered by a restart/abort of one
>>> of the nodes in the TCP connection.
>>>
>>> However, having separate sockets for active and passive sides
>>> is avoidable, and the simpler model of a single TCP socket for
>>> both send and receives of all RDS connections associated with
>>> that tcp socket makes for easier observability. We avoid the race
>>> conditions from f711a6ae by attempting reconnects in rds_conn_shutdown
>>> if, and only if, the (new) c_outgoing bit is set for RDS_TRANS_TCP.
>>> The c_outgoing bit is initialized in __rds_conn_create().
>>>
>>> A side-effect of re-using the client rds_connection for an incoming
>>> SYN is the potential of encountering duelling SYNs, i.e., we
>>> have an outgoing RDS_CONN_CONNECTING socket when we get the incoming
>>> SYN. The logic to arbitrate this criss-crossing SYN exchange in
>>> rds_tcp_accept_one() has been modified to emulate the BGP state
>>> machine: the smaller IP address should back off from the connection attempt.
>>>
>>> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
>>> ---
>>>   net/rds/connection.c |   22 ++++++----------------
>>>   net/rds/rds.h        |    4 +++-
>>>   net/rds/tcp_listen.c |   19 +++++++------------
>>>   3 files changed, 16 insertions(+), 29 deletions(-)
>>>
>>
>> [...]
>>
>>> diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
>>> index 444d78d..ee70d13 100644
>>> --- a/net/rds/tcp_listen.c
>>> +++ b/net/rds/tcp_listen.c
>>> @@ -110,28 +110,23 @@ int rds_tcp_accept_one(struct socket *sock)
>>>   		goto out;
>>>   	}
>>>   	/* An incoming SYN request came in, and TCP just accepted it.
>>> -	 * We always create a new conn for listen side of TCP, and do not
>>> -	 * add it to the c_hash_list.
>>>   	 *
>>>   	 * If the client reboots, this conn will need to be cleaned up.
>>>   	 * rds_tcp_state_change() will do that cleanup
>>>   	 */
>>>   	rs_tcp = (struct rds_tcp_connection *)conn->c_transport_data;
>>> -	WARN_ON(!rs_tcp || rs_tcp->t_sock);
>>> +	if (rs_tcp->t_sock && inet->inet_saddr < inet->inet_daddr) {
>>> +		struct sock *nsk = new_sock->sk;
>>>
>> Any reason you dropped the WARN_ON. Note that till we got commit
>> 74e98eb0 (" RDS: verify the underlying transport exists before creating
>> a connection") merged, we had an issue. That guards it now.
>>
>> Am curious about WARN_ON() and hence the question.
>>
>> Rest of the patch looks fine to me.
>> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
>>

* Re: [PATCH net-next 1/3] net/rds: Use a single TCP socket for both send and receive.
  2015-09-30 15:50   ` santosh shilimkar
  2015-09-30 15:58     ` Sowmini Varadhan
@ 2015-09-30 16:09     ` Sowmini Varadhan
  2015-09-30 16:13       ` santosh shilimkar
  1 sibling, 1 reply; 14+ messages in thread
From: Sowmini Varadhan @ 2015-09-30 16:09 UTC (permalink / raw)
  To: santosh shilimkar
  Cc: netdev, linux-kernel, davem, rds-devel, ajaykumar.hotchandani,
	igor.maximov

On (09/30/15 08:50), santosh shilimkar wrote:
> >  	rs_tcp = (struct rds_tcp_connection *)conn->c_transport_data;
> >-	WARN_ON(!rs_tcp || rs_tcp->t_sock);
> >+	if (rs_tcp->t_sock && inet->inet_saddr < inet->inet_daddr) {
> >+		struct sock *nsk = new_sock->sk;
> >
> Any reason you dropped the WARN_ON. Note that till we got commit
> 74e98eb0 (" RDS: verify the underlying transport exists before creating
> a connection") merged, we had an issue. That guards it now.

That was done deliberately. Now that we have only one tcp socket,
we can run into an rds_tcp_connection for an outgoing connection
that we initiated, so rs_tcp->t_sock can legitimately be non-null.
That is why the patch replaces the WARN_ON with the new t_sock check.

* Re: [PATCH net-next 1/3] net/rds: Use a single TCP socket for both send and receive.
  2015-09-30 16:09     ` Sowmini Varadhan
@ 2015-09-30 16:13       ` santosh shilimkar
  0 siblings, 0 replies; 14+ messages in thread
From: santosh shilimkar @ 2015-09-30 16:13 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: netdev, linux-kernel, davem, rds-devel, ajaykumar.hotchandani,
	igor.maximov

On 9/30/2015 9:09 AM, Sowmini Varadhan wrote:
> On (09/30/15 08:50), santosh shilimkar wrote:
>>>   	rs_tcp = (struct rds_tcp_connection *)conn->c_transport_data;
>>> -	WARN_ON(!rs_tcp || rs_tcp->t_sock);
>>> +	if (rs_tcp->t_sock && inet->inet_saddr < inet->inet_daddr) {
>>> +		struct sock *nsk = new_sock->sk;
>>>
>> Any reason you dropped the WARN_ON. Note that till we got commit
>> 74e98eb0 (" RDS: verify the underlying transport exists before creating
>> a connection") merged, we had an issue. That guards it now.
>
> That was done deliberately. Now that we have only one tcp socket,
> we can run into an rds_tcp_connection for an outgoing connection
> that we initiated, so rs_tcp->t_sock can legitimately be non-null.
> That is why the patch replaces the WARN_ON with the new t_sock check.
>
Thanks for clarification.

Regards,
Santosh
