Netdev List
 help / color / mirror / Atom feed
* Re: TCP reaching to maximum throughput after a long time
From: Ben Greear @ 2016-04-12 15:04 UTC (permalink / raw)
  To: Machani, Yaniv, netdev
  Cc: Eric Dumazet, David S. Miller, Eric Dumazet, Neal Cardwell,
	Yuchung Cheng, Nandita Dukkipati, open list, Kama, Meirav
In-Reply-To: <1460472764.6473.589.camel@edumazet-glaptop3.roam.corp.google.com>

On 04/12/2016 07:52 AM, Eric Dumazet wrote:
> On Tue, 2016-04-12 at 12:17 +0000, Machani, Yaniv wrote:
>> Hi,
>> After updating from Kernel 3.14 to Kernel 4.4 we have seen a TCP performance degradation over Wi-Fi.
>> In 3.14 kernel, TCP got to its max throughout after less than a second, while in the 4.4  it is taking ~20-30 seconds.
>> UDP TX/RX and TCP RX performance is as expected.
>> We are using a Beagle Bone Black and a WiLink8 device.
>>
>> Were there any related changes that might cause such behavior ?
>> Kernel configuration and sysctl values were compared, but no significant differences have been found.

If you are using 'Cubic' TCP congestion control, then please try something different.
It was broken last I checked, at least when used with the ath10k driver.

https://marc.info/?l=linux-netdev&m=144405216005715&w=2

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply

* [PATCH 0/3] RxRPC: 2nd rewrite part 2
From: David Howells @ 2016-04-12 15:05 UTC (permalink / raw)
  To: linux-afs; +Cc: dhowells, netdev, linux-kernel


Here's the next part of the AF_RXRPC rewrite.  In this set I make some
changes to the user interface for AF_RXRPC:

 (1) connect() is no longer supported on an AF_RXRPC socket.  It is
     redundant given that sendmsg() can be given the target address;
     indeed, even on a connected client socket, sendmsg() can still be used
     with an address other than the connection address.

 (2) listen() is required to allow a service socket to begin accepting
     incoming calls.  Previously, bind() with a service ID set in the
     address caused the socket to begin listening.  Listen only adjusted
     the backlog parameter on the socket previously.

 (3) The maximum backlog size can be adjusted through a sysctl - though it
     is still limited to the range 4-32.  At some point I would like to
     have some preallocated rxrpc_call structs prepared for incoming calls,
     using the backlog to limit the preallocation.  Passing INT_MAX to
     listen() requests the maximum allowed.

 (4) Calling sendmsg() on a socket that is not yet bound shifts the socket
     to be a purely client socket and binds a random local UDP port.

 (5) sendmsg() with a RXRPC_ACCEPT control message must not also have an
     address specified in msg_name.  It doesn't make sense to supply an
     address here.

 (6) If sendmsg() is asked to make a call with a particular user call ID
     which doesn't yet exist, the user call ID must not come into existence
     whilst sendmsg() is off creating a new call.  Previously it would just
     add its data to the call.

I would also like to consider making further changes, but I think they'd
probably be too much of a change:

 (*) Require a control message of RXRPC_NEW_CALL to be passed to sendmsg()
     when beginning a new call to make it clear that we're instituting a
     new user call ID, not expecting the user call ID to already exist with
     the socket.  This would make (6) above cleaner.

 (*) Provide RXRPC_LOCALLY_ABORTED and RXRPC_REMOTELY_ABORTED control
     messages for recvmsg() to return instead of RXRPC_ABORT (which would
     then be for sendmsg() only).  Another way to do this is to return an
     additional control message that, say, indicates that the termination
     was remote.

 (*) Allow userspace to presupply user call IDs for incoming calls to use.
     These would be used instead of RXRPC_ACCEPT.  A control message would
     be required: one for sendmsg() to supply a user ID (RXRPC_PREACCEPT
     say) and then RXRPC_NEW_CALL would be given a parameter through
     recvmsg() to indicate the number of user call IDs available.


The patches can be found here also:

	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-rewrite

Tagged thusly:

	git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
	rxrpc-rewrite-20160412

This is based on net-next/master

David
---
David Howells (3):
      rxrpc: Don't permit use of connect() op and simplify sendmsg() op
      rxrpc: The RXRPC_ACCEPT control message should not have an address
      rxrpc: Use the listen() system call to move to listening state


 Documentation/networking/rxrpc.txt |    8 -
 fs/afs/rxrpc.c                     |   34 +++---
 include/linux/rxrpc.h              |   18 ++-
 net/rxrpc/af_rxrpc.c               |  209 ++++++++++--------------------------
 net/rxrpc/ar-call.c                |  158 +++++++++++----------------
 net/rxrpc/ar-connection.c          |   17 ---
 net/rxrpc/ar-internal.h            |   22 ++--
 net/rxrpc/ar-output.c              |  187 +++++++++++++++-----------------
 net/rxrpc/misc.c                   |    6 +
 net/rxrpc/sysctl.c                 |   10 ++
 10 files changed, 269 insertions(+), 400 deletions(-)

^ permalink raw reply

* [PATCH 1/3] rxrpc: Don't permit use of connect() op and simplify sendmsg() op
From: David Howells @ 2016-04-12 15:05 UTC (permalink / raw)
  To: linux-afs; +Cc: dhowells, netdev, linux-kernel
In-Reply-To: <20160412150533.20637.23952.stgit@warthog.procyon.org.uk>

Simplify the RxRPC user interface and remove the use of connect() to direct
client calls.  It is redundant given that sendmsg() can be given the target
address and calls to multiple targets are permitted from a client socket
and also from a service socket.

Simplify sendmsg() also.  If we can't find a call immediately, we create
one, as now, but if the call then exists when we try and add it, we give an
error rather than using the call we found at the second attempt.  We should
never see this situation unless two threads are racing, trying to create a
call with the same ID - which would be an error.

It also isn't required to provide sendmsg() with an address - provided the
control message data holds a user ID that maps to a currently active call.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 Documentation/networking/rxrpc.txt |    8 --
 include/linux/rxrpc.h              |   18 ++-
 net/rxrpc/af_rxrpc.c               |  185 +++++++++---------------------------
 net/rxrpc/ar-call.c                |  158 ++++++++++++-------------------
 net/rxrpc/ar-connection.c          |   17 ---
 net/rxrpc/ar-internal.h            |   21 ++--
 net/rxrpc/ar-output.c              |  186 +++++++++++++++++-------------------
 7 files changed, 219 insertions(+), 374 deletions(-)

diff --git a/Documentation/networking/rxrpc.txt b/Documentation/networking/rxrpc.txt
index 16a924c486bf..a1089f93e4ce 100644
--- a/Documentation/networking/rxrpc.txt
+++ b/Documentation/networking/rxrpc.txt
@@ -216,12 +216,8 @@ Interaction with the user of the RxRPC socket:
      be used in all other sendmsgs or recvmsgs associated with that call.  The
      tag is carried in the control data.
 
- (*) connect() is used to supply a default destination address for a client
-     socket.  This may be overridden by supplying an alternate address to the
-     first sendmsg() of a call (struct msghdr::msg_name).
-
- (*) If connect() is called on an unbound client, a random local port will
-     bound before the operation takes place.
+ (*) connect() is not used.  The target address for a client call must be
+     supplied to the first sendmsg() of a call (struct msghdr::msg_name).
 
  (*) A server socket may also be used to make client calls.  To do this, the
      first sendmsg() of the call must specify the target address.  The server's
diff --git a/include/linux/rxrpc.h b/include/linux/rxrpc.h
index a53915cd5581..e4182d3f8c8b 100644
--- a/include/linux/rxrpc.h
+++ b/include/linux/rxrpc.h
@@ -40,16 +40,18 @@ struct sockaddr_rxrpc {
 
 /*
  * RxRPC control messages
+ * - data type is specified by default (ie. not abort or accept)
  * - terminal messages mean that a user call ID tag can be recycled
+ * - s/r/- indicate whether these are applicable to sendmsg() and/or recvmsg()
  */
-#define RXRPC_USER_CALL_ID	1	/* user call ID specifier */
-#define RXRPC_ABORT		2	/* abort request / notification [terminal] */
-#define RXRPC_ACK		3	/* [Server] RPC op final ACK received [terminal] */
-#define RXRPC_NET_ERROR		5	/* network error received [terminal] */
-#define RXRPC_BUSY		6	/* server busy received [terminal] */
-#define RXRPC_LOCAL_ERROR	7	/* local error generated [terminal] */
-#define RXRPC_NEW_CALL		8	/* [Server] new incoming call notification */
-#define RXRPC_ACCEPT		9	/* [Server] accept request */
+#define RXRPC_USER_CALL_ID	1	/* sr: user call ID specifier */
+#define RXRPC_ABORT		2	/* sr: abort request / notification [terminal] */
+#define RXRPC_ACK		3	/* -r: [Service] RPC op final ACK received [terminal] */
+#define RXRPC_NET_ERROR		5	/* -r: network error received [terminal] */
+#define RXRPC_BUSY		6	/* -r: server busy received [terminal] */
+#define RXRPC_LOCAL_ERROR	7	/* -r: local error generated [terminal] */
+#define RXRPC_NEW_CALL		8	/* -r: [Service] new incoming call notification */
+#define RXRPC_ACCEPT		9	/* s-: [Service] accept request */
 
 /*
  * RxRPC security levels
diff --git a/net/rxrpc/af_rxrpc.c b/net/rxrpc/af_rxrpc.c
index e45e94ca030f..dd462352a79c 100644
--- a/net/rxrpc/af_rxrpc.c
+++ b/net/rxrpc/af_rxrpc.c
@@ -137,33 +137,33 @@ static int rxrpc_bind(struct socket *sock, struct sockaddr *saddr, int len)
 
 	lock_sock(&rx->sk);
 
-	if (rx->sk.sk_state != RXRPC_UNCONNECTED) {
+	if (rx->sk.sk_state != RXRPC_UNBOUND) {
 		ret = -EINVAL;
 		goto error_unlock;
 	}
 
 	memcpy(&rx->srx, srx, sizeof(rx->srx));
 
-	/* Find or create a local transport endpoint to use */
 	local = rxrpc_lookup_local(&rx->srx);
 	if (IS_ERR(local)) {
 		ret = PTR_ERR(local);
 		goto error_unlock;
 	}
 
-	rx->local = local;
-	if (srx->srx_service) {
+	if (rx->srx.srx_service) {
 		write_lock_bh(&local->services_lock);
 		list_for_each_entry(prx, &local->services, listen_link) {
-			if (prx->srx.srx_service == srx->srx_service)
+			if (prx->srx.srx_service == rx->srx.srx_service)
 				goto service_in_use;
 		}
 
+		rx->local = local;
 		list_add_tail(&rx->listen_link, &local->services);
 		write_unlock_bh(&local->services_lock);
 
 		rx->sk.sk_state = RXRPC_SERVER_BOUND;
 	} else {
+		rx->local = local;
 		rx->sk.sk_state = RXRPC_CLIENT_BOUND;
 	}
 
@@ -172,8 +172,9 @@ static int rxrpc_bind(struct socket *sock, struct sockaddr *saddr, int len)
 	return 0;
 
 service_in_use:
-	ret = -EADDRINUSE;
 	write_unlock_bh(&local->services_lock);
+	rxrpc_put_local(local);
+	ret = -EADDRINUSE;
 error_unlock:
 	release_sock(&rx->sk);
 error:
@@ -195,11 +196,11 @@ static int rxrpc_listen(struct socket *sock, int backlog)
 	lock_sock(&rx->sk);
 
 	switch (rx->sk.sk_state) {
-	case RXRPC_UNCONNECTED:
+	case RXRPC_UNBOUND:
 		ret = -EADDRNOTAVAIL;
 		break;
+	case RXRPC_CLIENT_UNBOUND:
 	case RXRPC_CLIENT_BOUND:
-	case RXRPC_CLIENT_CONNECTED:
 	default:
 		ret = -EBUSY;
 		break;
@@ -219,20 +220,18 @@ static int rxrpc_listen(struct socket *sock, int backlog)
 /*
  * find a transport by address
  */
-static struct rxrpc_transport *rxrpc_name_to_transport(struct socket *sock,
-						       struct sockaddr *addr,
-						       int addr_len, int flags,
-						       gfp_t gfp)
+struct rxrpc_transport *rxrpc_name_to_transport(struct rxrpc_sock *rx,
+						struct sockaddr *addr,
+						int addr_len, int flags,
+						gfp_t gfp)
 {
 	struct sockaddr_rxrpc *srx = (struct sockaddr_rxrpc *) addr;
 	struct rxrpc_transport *trans;
-	struct rxrpc_sock *rx = rxrpc_sk(sock->sk);
 	struct rxrpc_peer *peer;
 
 	_enter("%p,%p,%d,%d", rx, addr, addr_len, flags);
 
 	ASSERT(rx->local != NULL);
-	ASSERT(rx->sk.sk_state > RXRPC_UNCONNECTED);
 
 	if (rx->srx.transport_type != srx->transport_type)
 		return ERR_PTR(-ESOCKTNOSUPPORT);
@@ -254,7 +253,7 @@ static struct rxrpc_transport *rxrpc_name_to_transport(struct socket *sock,
 /**
  * rxrpc_kernel_begin_call - Allow a kernel service to begin a call
  * @sock: The socket on which to make the call
- * @srx: The address of the peer to contact (defaults to socket setting)
+ * @srx: The address of the peer to contact
  * @key: The security context to use (defaults to socket setting)
  * @user_call_ID: The ID to use
  *
@@ -280,25 +279,14 @@ struct rxrpc_call *rxrpc_kernel_begin_call(struct socket *sock,
 
 	lock_sock(&rx->sk);
 
-	if (srx) {
-		trans = rxrpc_name_to_transport(sock, (struct sockaddr *) srx,
-						sizeof(*srx), 0, gfp);
-		if (IS_ERR(trans)) {
-			call = ERR_CAST(trans);
-			trans = NULL;
-			goto out_notrans;
-		}
-	} else {
-		trans = rx->trans;
-		if (!trans) {
-			call = ERR_PTR(-ENOTCONN);
-			goto out_notrans;
-		}
-		atomic_inc(&trans->usage);
+	trans = rxrpc_name_to_transport(rx, (struct sockaddr *)srx,
+					sizeof(*srx), 0, gfp);
+	if (IS_ERR(trans)) {
+		call = ERR_CAST(trans);
+		trans = NULL;
+		goto out_notrans;
 	}
 
-	if (!srx)
-		srx = &rx->srx;
 	if (!key)
 		key = rx->key;
 	if (key && !key->payload.data[0])
@@ -310,8 +298,7 @@ struct rxrpc_call *rxrpc_kernel_begin_call(struct socket *sock,
 		goto out;
 	}
 
-	call = rxrpc_get_client_call(rx, trans, bundle, user_call_ID, true,
-				     gfp);
+	call = rxrpc_new_client_call(rx, trans, bundle, user_call_ID, gfp);
 	rxrpc_put_bundle(trans, bundle);
 out:
 	rxrpc_put_transport(trans);
@@ -360,69 +347,14 @@ void rxrpc_kernel_intercept_rx_messages(struct socket *sock,
 EXPORT_SYMBOL(rxrpc_kernel_intercept_rx_messages);
 
 /*
- * connect an RxRPC socket
- * - this just targets it at a specific destination; no actual connection
- *   negotiation takes place
+ * We don't permit connection of an RxRPC socket.  It's pointless since
+ * sendmsg() takes the target address for a new call and a socket can support
+ * calls to multiple servers simultaneously.
  */
 static int rxrpc_connect(struct socket *sock, struct sockaddr *addr,
 			 int addr_len, int flags)
 {
-	struct sockaddr_rxrpc *srx = (struct sockaddr_rxrpc *) addr;
-	struct sock *sk = sock->sk;
-	struct rxrpc_transport *trans;
-	struct rxrpc_local *local;
-	struct rxrpc_sock *rx = rxrpc_sk(sk);
-	int ret;
-
-	_enter("%p,%p,%d,%d", rx, addr, addr_len, flags);
-
-	ret = rxrpc_validate_address(rx, srx, addr_len);
-	if (ret < 0) {
-		_leave(" = %d [bad addr]", ret);
-		return ret;
-	}
-
-	lock_sock(&rx->sk);
-
-	switch (rx->sk.sk_state) {
-	case RXRPC_UNCONNECTED:
-		/* find a local transport endpoint if we don't have one already */
-		ASSERTCMP(rx->local, ==, NULL);
-		rx->srx.srx_family = AF_RXRPC;
-		rx->srx.srx_service = 0;
-		rx->srx.transport_type = srx->transport_type;
-		rx->srx.transport_len = sizeof(sa_family_t);
-		rx->srx.transport.family = srx->transport.family;
-		local = rxrpc_lookup_local(&rx->srx);
-		if (IS_ERR(local)) {
-			release_sock(&rx->sk);
-			return PTR_ERR(local);
-		}
-		rx->local = local;
-		rx->sk.sk_state = RXRPC_CLIENT_BOUND;
-	case RXRPC_CLIENT_BOUND:
-		break;
-	case RXRPC_CLIENT_CONNECTED:
-		release_sock(&rx->sk);
-		return -EISCONN;
-	default:
-		release_sock(&rx->sk);
-		return -EBUSY; /* server sockets can't connect as well */
-	}
-
-	trans = rxrpc_name_to_transport(sock, addr, addr_len, flags,
-					GFP_KERNEL);
-	if (IS_ERR(trans)) {
-		release_sock(&rx->sk);
-		_leave(" = %ld", PTR_ERR(trans));
-		return PTR_ERR(trans);
-	}
-
-	rx->trans = trans;
-	rx->sk.sk_state = RXRPC_CLIENT_CONNECTED;
-
-	release_sock(&rx->sk);
-	return 0;
+	return -EOPNOTSUPP;
 }
 
 /*
@@ -436,7 +368,7 @@ static int rxrpc_connect(struct socket *sock, struct sockaddr *addr,
  */
 static int rxrpc_sendmsg(struct socket *sock, struct msghdr *m, size_t len)
 {
-	struct rxrpc_transport *trans;
+	struct rxrpc_local *local;
 	struct rxrpc_sock *rx = rxrpc_sk(sock->sk);
 	int ret;
 
@@ -453,48 +385,33 @@ static int rxrpc_sendmsg(struct socket *sock, struct msghdr *m, size_t len)
 		}
 	}
 
-	trans = NULL;
 	lock_sock(&rx->sk);
 
-	if (m->msg_name) {
-		ret = -EISCONN;
-		trans = rxrpc_name_to_transport(sock, m->msg_name,
-						m->msg_namelen, 0, GFP_KERNEL);
-		if (IS_ERR(trans)) {
-			ret = PTR_ERR(trans);
-			trans = NULL;
-			goto out;
-		}
-	} else {
-		trans = rx->trans;
-		if (trans)
-			atomic_inc(&trans->usage);
-	}
-
 	switch (rx->sk.sk_state) {
-	case RXRPC_SERVER_LISTENING:
-		if (!m->msg_name) {
-			ret = rxrpc_server_sendmsg(rx, m, len);
-			break;
+	case RXRPC_UNBOUND:
+		local = rxrpc_lookup_local(&rx->srx);
+		if (IS_ERR(local)) {
+			ret = PTR_ERR(local);
+			goto error_unlock;
 		}
-	case RXRPC_SERVER_BOUND:
+
+		rx->local = local;
+		rx->sk.sk_state = RXRPC_CLIENT_UNBOUND;
+		/* Fall through */
+
+	case RXRPC_CLIENT_UNBOUND:
 	case RXRPC_CLIENT_BOUND:
-		if (!m->msg_name) {
-			ret = -ENOTCONN;
-			break;
-		}
-	case RXRPC_CLIENT_CONNECTED:
-		ret = rxrpc_client_sendmsg(rx, trans, m, len);
+	case RXRPC_SERVER_BOUND:
+	case RXRPC_SERVER_LISTENING:
+		ret = rxrpc_do_sendmsg(rx, m, len);
 		break;
 	default:
-		ret = -ENOTCONN;
+		ret = -EINVAL;
 		break;
 	}
 
-out:
+error_unlock:
 	release_sock(&rx->sk);
-	if (trans)
-		rxrpc_put_transport(trans);
 	_leave(" = %d", ret);
 	return ret;
 }
@@ -521,7 +438,7 @@ static int rxrpc_setsockopt(struct socket *sock, int level, int optname,
 			if (optlen != 0)
 				goto error;
 			ret = -EISCONN;
-			if (rx->sk.sk_state != RXRPC_UNCONNECTED)
+			if (rx->sk.sk_state != RXRPC_UNBOUND)
 				goto error;
 			set_bit(RXRPC_SOCK_EXCLUSIVE_CONN, &rx->flags);
 			goto success;
@@ -531,7 +448,7 @@ static int rxrpc_setsockopt(struct socket *sock, int level, int optname,
 			if (rx->key)
 				goto error;
 			ret = -EISCONN;
-			if (rx->sk.sk_state != RXRPC_UNCONNECTED)
+			if (rx->sk.sk_state != RXRPC_UNBOUND)
 				goto error;
 			ret = rxrpc_request_key(rx, optval, optlen);
 			goto error;
@@ -541,7 +458,7 @@ static int rxrpc_setsockopt(struct socket *sock, int level, int optname,
 			if (rx->key)
 				goto error;
 			ret = -EISCONN;
-			if (rx->sk.sk_state != RXRPC_UNCONNECTED)
+			if (rx->sk.sk_state != RXRPC_UNBOUND)
 				goto error;
 			ret = rxrpc_server_keyring(rx, optval, optlen);
 			goto error;
@@ -551,7 +468,7 @@ static int rxrpc_setsockopt(struct socket *sock, int level, int optname,
 			if (optlen != sizeof(unsigned int))
 				goto error;
 			ret = -EISCONN;
-			if (rx->sk.sk_state != RXRPC_UNCONNECTED)
+			if (rx->sk.sk_state != RXRPC_UNBOUND)
 				goto error;
 			ret = get_user(min_sec_level,
 				       (unsigned int __user *) optval);
@@ -630,7 +547,7 @@ static int rxrpc_create(struct net *net, struct socket *sock, int protocol,
 		return -ENOMEM;
 
 	sock_init_data(sock, sk);
-	sk->sk_state		= RXRPC_UNCONNECTED;
+	sk->sk_state		= RXRPC_UNBOUND;
 	sk->sk_write_space	= rxrpc_write_space;
 	sk->sk_max_ack_backlog	= sysctl_rxrpc_max_qlen;
 	sk->sk_destruct		= rxrpc_sock_destructor;
@@ -703,14 +620,6 @@ static int rxrpc_release_sock(struct sock *sk)
 		rx->conn = NULL;
 	}
 
-	if (rx->bundle) {
-		rxrpc_put_bundle(rx->trans, rx->bundle);
-		rx->bundle = NULL;
-	}
-	if (rx->trans) {
-		rxrpc_put_transport(rx->trans);
-		rx->trans = NULL;
-	}
 	if (rx->local) {
 		rxrpc_put_local(rx->local);
 		rx->local = NULL;
diff --git a/net/rxrpc/ar-call.c b/net/rxrpc/ar-call.c
index 571a41fd5a32..9296bdb26c24 100644
--- a/net/rxrpc/ar-call.c
+++ b/net/rxrpc/ar-call.c
@@ -194,6 +194,43 @@ struct rxrpc_call *rxrpc_find_call_hash(
 }
 
 /*
+ * find an extant server call
+ * - called in process context with IRQs enabled
+ */
+struct rxrpc_call *rxrpc_find_call_by_user_ID(struct rxrpc_sock *rx,
+					      unsigned long user_call_ID)
+{
+	struct rxrpc_call *call;
+	struct rb_node *p;
+
+	_enter("%p,%lx", rx, user_call_ID);
+
+	read_lock(&rx->call_lock);
+
+	p = rx->calls.rb_node;
+	while (p) {
+		call = rb_entry(p, struct rxrpc_call, sock_node);
+
+		if (user_call_ID < call->user_call_ID)
+			p = p->rb_left;
+		else if (user_call_ID > call->user_call_ID)
+			p = p->rb_right;
+		else
+			goto found_extant_call;
+	}
+
+	read_unlock(&rx->call_lock);
+	_leave(" = NULL");
+	return NULL;
+
+found_extant_call:
+	rxrpc_get_call(call);
+	read_unlock(&rx->call_lock);
+	_leave(" = %p [%d]", call, atomic_read(&call->usage));
+	return call;
+}
+
+/*
  * allocate a new call
  */
 static struct rxrpc_call *rxrpc_alloc_call(gfp_t gfp)
@@ -309,51 +346,27 @@ static struct rxrpc_call *rxrpc_alloc_client_call(
  * set up a call for the given data
  * - called in process context with IRQs enabled
  */
-struct rxrpc_call *rxrpc_get_client_call(struct rxrpc_sock *rx,
+struct rxrpc_call *rxrpc_new_client_call(struct rxrpc_sock *rx,
 					 struct rxrpc_transport *trans,
 					 struct rxrpc_conn_bundle *bundle,
 					 unsigned long user_call_ID,
-					 int create,
 					 gfp_t gfp)
 {
-	struct rxrpc_call *call, *candidate;
-	struct rb_node *p, *parent, **pp;
+	struct rxrpc_call *call, *xcall;
+	struct rb_node *parent, **pp;
 
-	_enter("%p,%d,%d,%lx,%d",
-	       rx, trans ? trans->debug_id : -1, bundle ? bundle->debug_id : -1,
-	       user_call_ID, create);
+	_enter("%p,%d,%d,%lx",
+	       rx, trans->debug_id, bundle ? bundle->debug_id : -1,
+	       user_call_ID);
 
-	/* search the extant calls first for one that matches the specified
-	 * user ID */
-	read_lock(&rx->call_lock);
-
-	p = rx->calls.rb_node;
-	while (p) {
-		call = rb_entry(p, struct rxrpc_call, sock_node);
-
-		if (user_call_ID < call->user_call_ID)
-			p = p->rb_left;
-		else if (user_call_ID > call->user_call_ID)
-			p = p->rb_right;
-		else
-			goto found_extant_call;
+	call = rxrpc_alloc_client_call(rx, trans, bundle, gfp);
+	if (IS_ERR(call)) {
+		_leave(" = %ld", PTR_ERR(call));
+		return call;
 	}
 
-	read_unlock(&rx->call_lock);
-
-	if (!create || !trans)
-		return ERR_PTR(-EBADSLT);
-
-	/* not yet present - create a candidate for a new record and then
-	 * redo the search */
-	candidate = rxrpc_alloc_client_call(rx, trans, bundle, gfp);
-	if (IS_ERR(candidate)) {
-		_leave(" = %ld", PTR_ERR(candidate));
-		return candidate;
-	}
-
-	candidate->user_call_ID = user_call_ID;
-	__set_bit(RXRPC_CALL_HAS_USERID, &candidate->flags);
+	call->user_call_ID = user_call_ID;
+	__set_bit(RXRPC_CALL_HAS_USERID, &call->flags);
 
 	write_lock(&rx->call_lock);
 
@@ -361,19 +374,16 @@ struct rxrpc_call *rxrpc_get_client_call(struct rxrpc_sock *rx,
 	parent = NULL;
 	while (*pp) {
 		parent = *pp;
-		call = rb_entry(parent, struct rxrpc_call, sock_node);
+		xcall = rb_entry(parent, struct rxrpc_call, sock_node);
 
-		if (user_call_ID < call->user_call_ID)
+		if (user_call_ID < xcall->user_call_ID)
 			pp = &(*pp)->rb_left;
-		else if (user_call_ID > call->user_call_ID)
+		else if (user_call_ID > xcall->user_call_ID)
 			pp = &(*pp)->rb_right;
 		else
-			goto found_extant_second;
+			goto found_user_ID_now_present;
 	}
 
-	/* second search also failed; add the new call */
-	call = candidate;
-	candidate = NULL;
 	rxrpc_get_call(call);
 
 	rb_link_node(&call->sock_node, parent, pp);
@@ -389,20 +399,16 @@ struct rxrpc_call *rxrpc_get_client_call(struct rxrpc_sock *rx,
 	_leave(" = %p [new]", call);
 	return call;
 
-	/* we found the call in the list immediately */
-found_extant_call:
-	rxrpc_get_call(call);
-	read_unlock(&rx->call_lock);
-	_leave(" = %p [extant %d]", call, atomic_read(&call->usage));
-	return call;
-
-	/* we found the call on the second time through the list */
-found_extant_second:
-	rxrpc_get_call(call);
+	/* We unexpectedly found the user ID in the list after taking
+	 * the call_lock.  This shouldn't happen unless the user races
+	 * with itself and tries to add the same user ID twice at the
+	 * same time in different threads.
+	 */
+found_user_ID_now_present:
 	write_unlock(&rx->call_lock);
-	rxrpc_put_call(candidate);
-	_leave(" = %p [second %d]", call, atomic_read(&call->usage));
-	return call;
+	rxrpc_put_call(call);
+	_leave(" = -EEXIST [%p]", call);
+	return ERR_PTR(-EEXIST);
 }
 
 /*
@@ -564,46 +570,6 @@ old_call:
 }
 
 /*
- * find an extant server call
- * - called in process context with IRQs enabled
- */
-struct rxrpc_call *rxrpc_find_server_call(struct rxrpc_sock *rx,
-					  unsigned long user_call_ID)
-{
-	struct rxrpc_call *call;
-	struct rb_node *p;
-
-	_enter("%p,%lx", rx, user_call_ID);
-
-	/* search the extant calls for one that matches the specified user
-	 * ID */
-	read_lock(&rx->call_lock);
-
-	p = rx->calls.rb_node;
-	while (p) {
-		call = rb_entry(p, struct rxrpc_call, sock_node);
-
-		if (user_call_ID < call->user_call_ID)
-			p = p->rb_left;
-		else if (user_call_ID > call->user_call_ID)
-			p = p->rb_right;
-		else
-			goto found_extant_call;
-	}
-
-	read_unlock(&rx->call_lock);
-	_leave(" = NULL");
-	return NULL;
-
-	/* we found the call in the list immediately */
-found_extant_call:
-	rxrpc_get_call(call);
-	read_unlock(&rx->call_lock);
-	_leave(" = %p [%d]", call, atomic_read(&call->usage));
-	return call;
-}
-
-/*
  * detach a call from a socket and set up for release
  */
 void rxrpc_release_call(struct rxrpc_call *call)
diff --git a/net/rxrpc/ar-connection.c b/net/rxrpc/ar-connection.c
index 97f4fae74bca..5307dba4a13a 100644
--- a/net/rxrpc/ar-connection.c
+++ b/net/rxrpc/ar-connection.c
@@ -78,11 +78,6 @@ struct rxrpc_conn_bundle *rxrpc_get_bundle(struct rxrpc_sock *rx,
 	_enter("%p{%x},%x,%hx,",
 	       rx, key_serial(key), trans->debug_id, service_id);
 
-	if (rx->trans == trans && rx->bundle) {
-		atomic_inc(&rx->bundle->usage);
-		return rx->bundle;
-	}
-
 	/* search the extant bundles first for one that matches the specified
 	 * user ID */
 	spin_lock(&trans->client_lock);
@@ -136,10 +131,6 @@ struct rxrpc_conn_bundle *rxrpc_get_bundle(struct rxrpc_sock *rx,
 	rb_insert_color(&bundle->node, &trans->bundles);
 	spin_unlock(&trans->client_lock);
 	_net("BUNDLE new on trans %d", trans->debug_id);
-	if (!rx->bundle && rx->sk.sk_state == RXRPC_CLIENT_CONNECTED) {
-		atomic_inc(&bundle->usage);
-		rx->bundle = bundle;
-	}
 	_leave(" = %p [new]", bundle);
 	return bundle;
 
@@ -148,10 +139,6 @@ found_extant_bundle:
 	atomic_inc(&bundle->usage);
 	spin_unlock(&trans->client_lock);
 	_net("BUNDLE old on trans %d", trans->debug_id);
-	if (!rx->bundle && rx->sk.sk_state == RXRPC_CLIENT_CONNECTED) {
-		atomic_inc(&bundle->usage);
-		rx->bundle = bundle;
-	}
 	_leave(" = %p [extant %d]", bundle, atomic_read(&bundle->usage));
 	return bundle;
 
@@ -161,10 +148,6 @@ found_extant_second:
 	spin_unlock(&trans->client_lock);
 	kfree(candidate);
 	_net("BUNDLE old2 on trans %d", trans->debug_id);
-	if (!rx->bundle && rx->sk.sk_state == RXRPC_CLIENT_CONNECTED) {
-		atomic_inc(&bundle->usage);
-		rx->bundle = bundle;
-	}
 	_leave(" = %p [second %d]", bundle, atomic_read(&bundle->usage));
 	return bundle;
 }
diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h
index f0b807a163fa..bbf2443af875 100644
--- a/net/rxrpc/ar-internal.h
+++ b/net/rxrpc/ar-internal.h
@@ -39,9 +39,9 @@ struct rxrpc_crypt {
  * sk_state for RxRPC sockets
  */
 enum {
-	RXRPC_UNCONNECTED = 0,
+	RXRPC_UNBOUND = 0,
+	RXRPC_CLIENT_UNBOUND,		/* Unbound socket used as client */
 	RXRPC_CLIENT_BOUND,		/* client local address bound */
-	RXRPC_CLIENT_CONNECTED,		/* client is connected */
 	RXRPC_SERVER_BOUND,		/* server local address bound */
 	RXRPC_SERVER_LISTENING,		/* server listening for connections */
 	RXRPC_CLOSE,			/* socket is being closed */
@@ -55,8 +55,6 @@ struct rxrpc_sock {
 	struct sock		sk;
 	rxrpc_interceptor_t	interceptor;	/* kernel service Rx interceptor function */
 	struct rxrpc_local	*local;		/* local endpoint */
-	struct rxrpc_transport	*trans;		/* transport handler */
-	struct rxrpc_conn_bundle *bundle;	/* virtual connection bundle */
 	struct rxrpc_connection	*conn;		/* exclusive virtual connection */
 	struct list_head	listen_link;	/* link in the local endpoint's listen list */
 	struct list_head	secureq;	/* calls awaiting connection security clearance */
@@ -477,6 +475,10 @@ extern u32 rxrpc_epoch;
 extern atomic_t rxrpc_debug_id;
 extern struct workqueue_struct *rxrpc_workqueue;
 
+struct rxrpc_transport *rxrpc_name_to_transport(struct rxrpc_sock *,
+						struct sockaddr *,
+						int, int, gfp_t);
+
 /*
  * ar-accept.c
  */
@@ -502,14 +504,15 @@ extern rwlock_t rxrpc_call_lock;
 
 struct rxrpc_call *rxrpc_find_call_hash(struct rxrpc_host_header *,
 					void *, sa_family_t, const void *);
-struct rxrpc_call *rxrpc_get_client_call(struct rxrpc_sock *,
+struct rxrpc_call *rxrpc_find_call_by_user_ID(struct rxrpc_sock *,
+					      unsigned long);
+struct rxrpc_call *rxrpc_new_client_call(struct rxrpc_sock *,
 					 struct rxrpc_transport *,
 					 struct rxrpc_conn_bundle *,
-					 unsigned long, int, gfp_t);
+					 unsigned long, gfp_t);
 struct rxrpc_call *rxrpc_incoming_call(struct rxrpc_sock *,
 				       struct rxrpc_connection *,
 				       struct rxrpc_host_header *);
-struct rxrpc_call *rxrpc_find_server_call(struct rxrpc_sock *, unsigned long);
 void rxrpc_release_call(struct rxrpc_call *);
 void rxrpc_release_calls_on_socket(struct rxrpc_sock *);
 void __rxrpc_put_call(struct rxrpc_call *);
@@ -581,9 +584,7 @@ int rxrpc_get_server_data_key(struct rxrpc_connection *, const void *, time_t,
 extern unsigned int rxrpc_resend_timeout;
 
 int rxrpc_send_packet(struct rxrpc_transport *, struct sk_buff *);
-int rxrpc_client_sendmsg(struct rxrpc_sock *, struct rxrpc_transport *,
-			 struct msghdr *, size_t);
-int rxrpc_server_sendmsg(struct rxrpc_sock *, struct msghdr *, size_t);
+int rxrpc_do_sendmsg(struct rxrpc_sock *, struct msghdr *, size_t);
 
 /*
  * ar-peer.c
diff --git a/net/rxrpc/ar-output.c b/net/rxrpc/ar-output.c
index 51cb10062a8d..b87fda075b45 100644
--- a/net/rxrpc/ar-output.c
+++ b/net/rxrpc/ar-output.c
@@ -30,13 +30,13 @@ static int rxrpc_send_data(struct rxrpc_sock *rx,
 /*
  * extract control messages from the sendmsg() control buffer
  */
-static int rxrpc_sendmsg_cmsg(struct rxrpc_sock *rx, struct msghdr *msg,
+static int rxrpc_sendmsg_cmsg(struct msghdr *msg,
 			      unsigned long *user_call_ID,
 			      enum rxrpc_command *command,
-			      u32 *abort_code,
-			      bool server)
+			      u32 *abort_code)
 {
 	struct cmsghdr *cmsg;
+	bool got_user_ID = false;
 	int len;
 
 	*command = RXRPC_CMD_SEND_DATA;
@@ -68,6 +68,7 @@ static int rxrpc_sendmsg_cmsg(struct rxrpc_sock *rx, struct msghdr *msg,
 					CMSG_DATA(cmsg);
 			}
 			_debug("User Call ID %lx", *user_call_ID);
+			got_user_ID = true;
 			break;
 
 		case RXRPC_ABORT:
@@ -88,8 +89,6 @@ static int rxrpc_sendmsg_cmsg(struct rxrpc_sock *rx, struct msghdr *msg,
 			*command = RXRPC_CMD_ACCEPT;
 			if (len != 0)
 				return -EINVAL;
-			if (!server)
-				return -EISCONN;
 			break;
 
 		default:
@@ -97,6 +96,8 @@ static int rxrpc_sendmsg_cmsg(struct rxrpc_sock *rx, struct msghdr *msg,
 		}
 	}
 
+	if (!got_user_ID)
+		return -EINVAL;
 	_leave(" = 0");
 	return 0;
 }
@@ -124,55 +125,96 @@ static void rxrpc_send_abort(struct rxrpc_call *call, u32 abort_code)
 }
 
 /*
+ * Create a new client call for sendmsg().
+ */
+static struct rxrpc_call *
+rxrpc_new_client_call_for_sendmsg(struct rxrpc_sock *rx, struct msghdr *msg,
+				  unsigned long user_call_ID)
+{
+	struct rxrpc_conn_bundle *bundle;
+	struct rxrpc_transport *trans;
+	struct rxrpc_call *call;
+	struct key *key;
+	long ret;
+
+	DECLARE_SOCKADDR(struct sockaddr_rxrpc *, srx, msg->msg_name);
+
+	_enter("");
+
+	if (!msg->msg_name)
+		return ERR_PTR(-EDESTADDRREQ);
+
+	trans = rxrpc_name_to_transport(rx, msg->msg_name, msg->msg_namelen, 0,
+					GFP_KERNEL);
+	if (IS_ERR(trans)) {
+		ret = PTR_ERR(trans);
+		goto out;
+	}
+
+	key = rx->key;
+	if (key && !rx->key->payload.data[0])
+		key = NULL;
+	bundle = rxrpc_get_bundle(rx, trans, key, srx->srx_service, GFP_KERNEL);
+	if (IS_ERR(bundle)) {
+		ret = PTR_ERR(bundle);
+		goto out_trans;
+	}
+
+	call = rxrpc_new_client_call(rx, trans, bundle, user_call_ID,
+				     GFP_KERNEL);
+	rxrpc_put_bundle(trans, bundle);
+	rxrpc_put_transport(trans);
+	if (IS_ERR(call)) {
+		ret = PTR_ERR(call);
+		goto out_trans;
+	}
+
+	_leave(" = %p\n", call);
+	return call;
+
+out_trans:
+	rxrpc_put_transport(trans);
+out:
+	_leave(" = %ld", ret);
+	return ERR_PTR(ret);
+}
+
+/*
  * send a message forming part of a client call through an RxRPC socket
  * - caller holds the socket locked
  * - the socket may be either a client socket or a server socket
  */
-int rxrpc_client_sendmsg(struct rxrpc_sock *rx, struct rxrpc_transport *trans,
-			 struct msghdr *msg, size_t len)
+int rxrpc_do_sendmsg(struct rxrpc_sock *rx, struct msghdr *msg, size_t len)
 {
-	struct rxrpc_conn_bundle *bundle;
 	enum rxrpc_command cmd;
 	struct rxrpc_call *call;
 	unsigned long user_call_ID = 0;
-	struct key *key;
-	u16 service_id;
 	u32 abort_code = 0;
 	int ret;
 
 	_enter("");
 
-	ASSERT(trans != NULL);
-
-	ret = rxrpc_sendmsg_cmsg(rx, msg, &user_call_ID, &cmd, &abort_code,
-				 false);
+	ret = rxrpc_sendmsg_cmsg(msg, &user_call_ID, &cmd, &abort_code);
 	if (ret < 0)
 		return ret;
 
-	bundle = NULL;
-	if (trans) {
-		service_id = rx->srx.srx_service;
-		if (msg->msg_name) {
-			DECLARE_SOCKADDR(struct sockaddr_rxrpc *, srx,
-					 msg->msg_name);
-			service_id = srx->srx_service;
-		}
-		key = rx->key;
-		if (key && !rx->key->payload.data[0])
-			key = NULL;
-		bundle = rxrpc_get_bundle(rx, trans, key, service_id,
-					  GFP_KERNEL);
-		if (IS_ERR(bundle))
-			return PTR_ERR(bundle);
+	if (cmd == RXRPC_CMD_ACCEPT) {
+		if (rx->sk.sk_state != RXRPC_SERVER_LISTENING)
+			return -EINVAL;
+		call = rxrpc_accept_call(rx, user_call_ID);
+		if (IS_ERR(call))
+			return PTR_ERR(call);
+		rxrpc_put_call(call);
+		return 0;
 	}
 
-	call = rxrpc_get_client_call(rx, trans, bundle, user_call_ID,
-				     abort_code == 0, GFP_KERNEL);
-	if (trans)
-		rxrpc_put_bundle(trans, bundle);
-	if (IS_ERR(call)) {
-		_leave(" = %ld", PTR_ERR(call));
-		return PTR_ERR(call);
+	call = rxrpc_find_call_by_user_ID(rx, user_call_ID);
+	if (!call) {
+		if (cmd != RXRPC_CMD_SEND_DATA)
+			return -EBADSLT;
+		call = rxrpc_new_client_call_for_sendmsg(rx, msg, user_call_ID);
+		if (IS_ERR(call))
+			return PTR_ERR(call);
 	}
 
 	_debug("CALL %d USR %lx ST %d on CONN %p",
@@ -180,14 +222,21 @@ int rxrpc_client_sendmsg(struct rxrpc_sock *rx, struct rxrpc_transport *trans,
 
 	if (call->state >= RXRPC_CALL_COMPLETE) {
 		/* it's too late for this call */
-		ret = -ESHUTDOWN;
+		ret = -ECONNRESET;
 	} else if (cmd == RXRPC_CMD_SEND_ABORT) {
 		rxrpc_send_abort(call, abort_code);
+		ret = 0;
 	} else if (cmd != RXRPC_CMD_SEND_DATA) {
 		ret = -EINVAL;
-	} else if (call->state != RXRPC_CALL_CLIENT_SEND_REQUEST) {
+	} else if (!call->in_clientflag &&
+		   call->state != RXRPC_CALL_CLIENT_SEND_REQUEST) {
 		/* request phase complete for this client call */
 		ret = -EPROTO;
+	} else if (call->in_clientflag &&
+		   call->state != RXRPC_CALL_SERVER_ACK_REQUEST &&
+		   call->state != RXRPC_CALL_SERVER_SEND_REPLY) {
+		/* Reply phase not begun or not complete for service call. */
+		ret = -EPROTO;
 	} else {
 		ret = rxrpc_send_data(rx, call, msg, len);
 	}
@@ -266,67 +315,6 @@ void rxrpc_kernel_abort_call(struct rxrpc_call *call, u32 abort_code)
 EXPORT_SYMBOL(rxrpc_kernel_abort_call);
 
 /*
- * send a message through a server socket
- * - caller holds the socket locked
- */
-int rxrpc_server_sendmsg(struct rxrpc_sock *rx, struct msghdr *msg, size_t len)
-{
-	enum rxrpc_command cmd;
-	struct rxrpc_call *call;
-	unsigned long user_call_ID = 0;
-	u32 abort_code = 0;
-	int ret;
-
-	_enter("");
-
-	ret = rxrpc_sendmsg_cmsg(rx, msg, &user_call_ID, &cmd, &abort_code,
-				 true);
-	if (ret < 0)
-		return ret;
-
-	if (cmd == RXRPC_CMD_ACCEPT) {
-		call = rxrpc_accept_call(rx, user_call_ID);
-		if (IS_ERR(call))
-			return PTR_ERR(call);
-		rxrpc_put_call(call);
-		return 0;
-	}
-
-	call = rxrpc_find_server_call(rx, user_call_ID);
-	if (!call)
-		return -EBADSLT;
-	if (call->state >= RXRPC_CALL_COMPLETE) {
-		ret = -ESHUTDOWN;
-		goto out;
-	}
-
-	switch (cmd) {
-	case RXRPC_CMD_SEND_DATA:
-		if (call->state != RXRPC_CALL_CLIENT_SEND_REQUEST &&
-		    call->state != RXRPC_CALL_SERVER_ACK_REQUEST &&
-		    call->state != RXRPC_CALL_SERVER_SEND_REPLY) {
-			/* Tx phase not yet begun for this call */
-			ret = -EPROTO;
-			break;
-		}
-
-		ret = rxrpc_send_data(rx, call, msg, len);
-		break;
-
-	case RXRPC_CMD_SEND_ABORT:
-		rxrpc_send_abort(call, abort_code);
-		break;
-	default:
-		BUG();
-	}
-
-	out:
-	rxrpc_put_call(call);
-	_leave(" = %d", ret);
-	return ret;
-}
-
-/*
  * send a packet through the transport endpoint
  */
 int rxrpc_send_packet(struct rxrpc_transport *trans, struct sk_buff *skb)

^ permalink raw reply related

* [PATCH 2/3] rxrpc: The RXRPC_ACCEPT control message should not have an address
From: David Howells @ 2016-04-12 15:05 UTC (permalink / raw)
  To: linux-afs; +Cc: dhowells, netdev, linux-kernel
In-Reply-To: <20160412150533.20637.23952.stgit@warthog.procyon.org.uk>

When sendmsg() is called with the RXRPC_ACCEPT control message, sendmsg()
shouldn't also be given an address in msg_name.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 net/rxrpc/ar-output.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/rxrpc/ar-output.c b/net/rxrpc/ar-output.c
index b87fda075b45..044de9bf34a4 100644
--- a/net/rxrpc/ar-output.c
+++ b/net/rxrpc/ar-output.c
@@ -199,7 +199,8 @@ int rxrpc_do_sendmsg(struct rxrpc_sock *rx, struct msghdr *msg, size_t len)
 		return ret;
 
 	if (cmd == RXRPC_CMD_ACCEPT) {
-		if (rx->sk.sk_state != RXRPC_SERVER_LISTENING)
+		if (rx->sk.sk_state != RXRPC_SERVER_LISTENING ||
+		    msg->msg_name)
 			return -EINVAL;
 		call = rxrpc_accept_call(rx, user_call_ID);
 		if (IS_ERR(call))

^ permalink raw reply related

* [PATCH 3/3] rxrpc: Use the listen() system call to move to listening state
From: David Howells @ 2016-04-12 15:05 UTC (permalink / raw)
  To: linux-afs; +Cc: dhowells, netdev, linux-kernel
In-Reply-To: <20160412150533.20637.23952.stgit@warthog.procyon.org.uk>

Use the listen() system call to move to listening state and to set the
socket backlog queue size.  A limit is placed on the maximum queue size by
way of:

	/proc/sys/net/rxrpc/max_backlog

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/afs/rxrpc.c          |   34 +++++++++++++++++++---------------
 net/rxrpc/af_rxrpc.c    |   26 ++++++++++++++------------
 net/rxrpc/ar-internal.h |    1 +
 net/rxrpc/misc.c        |    6 ++++++
 net/rxrpc/sysctl.c      |   10 ++++++++++
 5 files changed, 50 insertions(+), 27 deletions(-)

diff --git a/fs/afs/rxrpc.c b/fs/afs/rxrpc.c
index 63cd9f939f19..4832de84d52c 100644
--- a/fs/afs/rxrpc.c
+++ b/fs/afs/rxrpc.c
@@ -85,18 +85,14 @@ int afs_open_socket(void)
 
 	skb_queue_head_init(&afs_incoming_calls);
 
+	ret = -ENOMEM;
 	afs_async_calls = create_singlethread_workqueue("kafsd");
-	if (!afs_async_calls) {
-		_leave(" = -ENOMEM [wq]");
-		return -ENOMEM;
-	}
+	if (!afs_async_calls)
+		goto error_0;
 
 	ret = sock_create_kern(&init_net, AF_RXRPC, SOCK_DGRAM, PF_INET, &socket);
-	if (ret < 0) {
-		destroy_workqueue(afs_async_calls);
-		_leave(" = %d [socket]", ret);
-		return ret;
-	}
+	if (ret < 0)
+		goto error_1;
 
 	socket->sk->sk_allocation = GFP_NOFS;
 
@@ -111,18 +107,26 @@ int afs_open_socket(void)
 	       sizeof(srx.transport.sin.sin_addr));
 
 	ret = kernel_bind(socket, (struct sockaddr *) &srx, sizeof(srx));
-	if (ret < 0) {
-		sock_release(socket);
-		destroy_workqueue(afs_async_calls);
-		_leave(" = %d [bind]", ret);
-		return ret;
-	}
+	if (ret < 0)
+		goto error_2;
+
+	ret = kernel_listen(socket, INT_MAX);
+	if (ret < 0)
+		goto error_2;
 
 	rxrpc_kernel_intercept_rx_messages(socket, afs_rx_interceptor);
 
 	afs_socket = socket;
 	_leave(" = 0");
 	return 0;
+
+error_2:
+	sock_release(socket);
+error_1:
+	destroy_workqueue(afs_async_calls);
+error_0:
+	_leave(" = %d", ret);
+	return ret;
 }
 
 /*
diff --git a/net/rxrpc/af_rxrpc.c b/net/rxrpc/af_rxrpc.c
index dd462352a79c..7b1aedd79b7c 100644
--- a/net/rxrpc/af_rxrpc.c
+++ b/net/rxrpc/af_rxrpc.c
@@ -31,8 +31,6 @@ unsigned int rxrpc_debug; // = RXRPC_DEBUG_KPROTO;
 module_param_named(debug, rxrpc_debug, uint, S_IWUSR | S_IRUGO);
 MODULE_PARM_DESC(debug, "RxRPC debugging mask");
 
-static int sysctl_rxrpc_max_qlen __read_mostly = 10;
-
 static struct proto rxrpc_proto;
 static const struct proto_ops rxrpc_rpc_ops;
 
@@ -191,7 +189,7 @@ static int rxrpc_listen(struct socket *sock, int backlog)
 	struct rxrpc_sock *rx = rxrpc_sk(sk);
 	int ret;
 
-	_enter("%p,%d", rx, backlog);
+	_enter("%p{%d},%d", rx, rx->sk.sk_state, backlog);
 
 	lock_sock(&rx->sk);
 
@@ -199,16 +197,20 @@ static int rxrpc_listen(struct socket *sock, int backlog)
 	case RXRPC_UNBOUND:
 		ret = -EADDRNOTAVAIL;
 		break;
-	case RXRPC_CLIENT_UNBOUND:
-	case RXRPC_CLIENT_BOUND:
-	default:
-		ret = -EBUSY;
-		break;
 	case RXRPC_SERVER_BOUND:
 		ASSERT(rx->local != NULL);
-		sk->sk_max_ack_backlog = backlog;
-		rx->sk.sk_state = RXRPC_SERVER_LISTENING;
-		ret = 0;
+		if (backlog == INT_MAX)
+			backlog = rxrpc_max_backlog;
+		if (backlog > rxrpc_max_backlog) {
+			ret = -EINVAL;
+		} else {
+			sk->sk_max_ack_backlog = backlog;
+			rx->sk.sk_state = RXRPC_SERVER_LISTENING;
+			ret = 0;
+		}
+		break;
+	default:
+		ret = -EBUSY;
 		break;
 	}
 
@@ -549,7 +551,7 @@ static int rxrpc_create(struct net *net, struct socket *sock, int protocol,
 	sock_init_data(sock, sk);
 	sk->sk_state		= RXRPC_UNBOUND;
 	sk->sk_write_space	= rxrpc_write_space;
-	sk->sk_max_ack_backlog	= sysctl_rxrpc_max_qlen;
+	sk->sk_max_ack_backlog	= 0;
 	sk->sk_destruct		= rxrpc_sock_destructor;
 
 	rx = rxrpc_sk(sk);
diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h
index bbf2443af875..4c29cf236dea 100644
--- a/net/rxrpc/ar-internal.h
+++ b/net/rxrpc/ar-internal.h
@@ -640,6 +640,7 @@ extern const struct rxrpc_security rxrpc_no_security;
 /*
  * misc.c
  */
+extern unsigned int rxrpc_max_backlog __read_mostly;
 extern unsigned int rxrpc_requested_ack_delay;
 extern unsigned int rxrpc_soft_ack_delay;
 extern unsigned int rxrpc_idle_ack_delay;
diff --git a/net/rxrpc/misc.c b/net/rxrpc/misc.c
index 1afe9876e79f..bdc5e42fe600 100644
--- a/net/rxrpc/misc.c
+++ b/net/rxrpc/misc.c
@@ -15,6 +15,12 @@
 #include "ar-internal.h"
 
 /*
+ * The maximum listening backlog queue size that may be set on a socket by
+ * listen().
+ */
+unsigned int rxrpc_max_backlog __read_mostly = 10;
+
+/*
  * How long to wait before scheduling ACK generation after seeing a
  * packet with RXRPC_REQUEST_ACK set (in jiffies).
  */
diff --git a/net/rxrpc/sysctl.c b/net/rxrpc/sysctl.c
index d20ed575acf4..a99690a8a3da 100644
--- a/net/rxrpc/sysctl.c
+++ b/net/rxrpc/sysctl.c
@@ -18,6 +18,7 @@ static struct ctl_table_header *rxrpc_sysctl_reg_table;
 static const unsigned int zero = 0;
 static const unsigned int one = 1;
 static const unsigned int four = 4;
+static const unsigned int thirtytwo = 32;
 static const unsigned int n_65535 = 65535;
 static const unsigned int n_max_acks = RXRPC_MAXACKS;
 
@@ -100,6 +101,15 @@ static struct ctl_table rxrpc_sysctl_table[] = {
 
 	/* Non-time values */
 	{
+		.procname	= "max_backlog",
+		.data		= &rxrpc_max_backlog,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= (void *)&four,
+		.extra2		= (void *)&thirtytwo,
+	},
+	{
 		.procname	= "rx_window_size",
 		.data		= &rxrpc_rx_window_size,
 		.maxlen		= sizeof(unsigned int),

^ permalink raw reply related

* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
From: Alexander Duyck @ 2016-04-12 15:37 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: lsf@lists.linux-foundation.org, James Bottomley, Sagi Grimberg,
	Tom Herbert, Brenden Blanco, Christoph Hellwig, linux-mm,
	netdev@vger.kernel.org, Bart Van Assche,
	lsf-pc@lists.linux-foundation.org, Alexei Starovoitov
In-Reply-To: <20160412082838.4ce17c1a@redhat.com>

On Mon, Apr 11, 2016 at 11:28 PM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> On Mon, 11 Apr 2016 15:02:51 -0700 Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
>> Have you taken a look at possibly trying to optimize the DMA pool API
>> to work with pages?  It sounds like it is supposed to do something
>> similar to what you are wanting to do.
>
> Yes, I have looked at the mm/dmapool.c API. AFAIK this is for DMA
> coherent memory (see use of dma_alloc_coherent/dma_free_coherent).
>
> What we are doing is "streaming" DMA memory, when processing the RX
> ring.
>
> (NIC are only using DMA coherent memory for the descriptors, which are
> allocated on driver init)

Yes, I know that but it shouldn't take much to extend the API to
provide the option for a streaming DMA mapping.  That was why I
thought you might want to look in this direction.

- Alex

^ permalink raw reply

* Re: [PATCH net-next 00/11] FUJITSU Extended Socket driver version 1.1
From: David Miller @ 2016-04-12 15:43 UTC (permalink / raw)
  To: izumi.taku; +Cc: netdev
In-Reply-To: <E86EADE93E2D054CBCD4E708C38D364A734EF16E@G01JPEXMBYT01>

From: "Izumi, Taku" <izumi.taku@jp.fujitsu.com>
Date: Tue, 12 Apr 2016 08:35:09 +0000

>  But I'd like to keep some debugfs facility for status information
>  and some specific stats other thatn net_stats.

We have a facility for arbitrary driver stats, remove this debugfs crap
please.

I'm not going to say this again.

^ permalink raw reply

* Re: Configuring ethernet link fails with No such device
From: David Miller @ 2016-04-12 15:44 UTC (permalink / raw)
  To: bob.ham
  Cc: fabio.estevam, systemd-devel, netdev, bryan.wu, u.kleine-koenig,
	l.stach
In-Reply-To: <1460451492.6333.6.camel@collabora.com>

From: Bob Ham <bob.ham@collabora.com>
Date: Tue, 12 Apr 2016 09:58:12 +0100

> On Mon, 2016-04-11 at 15:46 -0700, Stefan Agner wrote:
> 
>> Or in other words: Is this a Kernel or systemd issue?
> 
> From what I recall, both; an issue with the FEC driver, and issues in
> systemd/udevd's handling of link-level settings.

This is my impression of the situation as well.
_______________________________________________
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel

^ permalink raw reply

* Re: pull request: bluetooth-next 2016-04-12
From: David Miller @ 2016-04-12 15:58 UTC (permalink / raw)
  To: johan.hedberg; +Cc: netdev, linux-bluetooth
In-Reply-To: <20160412080815.GA17516@t440s>

From: Johan Hedberg <johan.hedberg@gmail.com>
Date: Tue, 12 Apr 2016 11:08:15 +0300

> Here's a set of Bluetooth & 802.15.4 patches intended for the 4.7 kernel:
> 
>  - Fix for race condition in vhci driver
>  - Memory leak fix for ieee802154/adf7242 driver
>  - Improvements to deal with single-mode (LE-only) Bluetooth controllers
>  - Fix for allowing the BT_SECURITY_FIPS security level
>  - New BCM2E71 ACPI ID
>  - NULL pointer dereference fix fox hci_ldisc driver
> 
> Let me know if there are any issues pulling. Thanks.

Pulled, thanks Johan.

^ permalink raw reply

* Re: [RFC v5 0/5] Add virtio transport for AF_VSOCK
From: Ian Campbell @ 2016-04-12 16:07 UTC (permalink / raw)
  To: Michael S. Tsirkin, Stefan Hajnoczi
  Cc: marius vlad, Stefan Hajnoczi, kvm, netdev, virtualization,
	Claudio Imbrenda, Matt Benjamin, Greg Kurz, Christoffer Dall
In-Reply-To: <20160411154517-mutt-send-email-mst@redhat.com>

Some how Stefan's reply disapeared from my INBOX (although I did see
it) so replying here.

On Mon, 2016-04-11 at 15:54 +0300, Michael S. Tsirkin wrote:
> On Mon, Apr 11, 2016 at 11:45:48AM +0100, Stefan Hajnoczi wrote:
> > 
> > On Fri, Apr 08, 2016 at 04:35:05PM +0100, Ian Campbell wrote:
> > > 
> > > On Fri, 2016-04-01 at 15:23 +0100, Stefan Hajnoczi wrote:
> > > > 
> > > > This series is based on Michael Tsirkin's vhost branch (v4.5-rc6).
> > > > 
> > > > I'm about to process Claudio Imbrenda's locking fixes for virtio-vsock but
> > > > first I want to share the latest version of the code.  Several people are
> > > > playing with vsock now so sharing the latest code should avoid duplicate work.
> > > Thanks for this, I've been using it in my project and it mostly seems
> > > fine.
> > > 
> > > One wrinkle I came across, which I'm not sure if it is by design or a
> > > problem is that I can see this sequence coming from the guest (with
> > > other activity in between):
> > > 
> > >     1) OP_SHUTDOWN w/ flags == SHUTDOWN_RX
> > >     2) OP_SHUTDOWN w/ flags == SHUTDOWN_TX
> > >     3) OP_SHUTDOWN w/ flags == SHUTDOWN_TX|SHUTDOWN_RX
> > > 
> > > I orignally had my backend close things down at #2, however this meant
> > > that when #3 arrived it was for a non-existent socket (or, worse, an
> > > active one if the ports got reused). I checked v5 of the spec
> > > proposal[0] which says:
> > >     If these bits are set and there are no more virtqueue buffers
> > >     pending the socket is disconnected.
> > > 
> > > but I'm not entirely sure if this behaviour contradicts this or not
> > > (the bits have both been set at #2, but not at the same time).
> > > 
> > > BTW, how does one tell if there are no more virtqueue buffers pending
> > > or not while processing the op?
> > #2 is odd.  The shutdown bits are sticky so they cannot be cleared once
> > set.  I would have expected just #1 and #3.  The behavior you observe
> > look like a bug.
> > 
> > The spec text does not convey the meaning of OP_SHUTDOWN well.
> > OP_SHUTDOWN SHUTDOWN_TX|SHUTDOWN_RX means no further rx/tx is possible
> > for this connection.  "there are no more virtqueue buffers pending the
> > socket" really means that this isn't an immediate close from the
> > perspective of the application.  If the application still has unread rx
> > buffers then the socket stays readable until the rx data has been fully
> > read.

Thanks, distinguishing the local buffer to the application from the
vring would make that clearer. Perhaps by not talking about "virtqueue
buffers" since they sound like a vring thing.

However, as Michael observes I'm not sure that's the whole story.

> Yes but you also wrote:
> 	If these bits are set and there are no more virtqueue buffers
> 	pending the socket is disconnected.
> 
> how does remote know that there are no buffers pending and so it's safe
> to reuse the same source/destination address now?

Indeed this is one of the things I struggled with. e.g. If I send a
SHUTDOWN_RX to my peer am I supposed to wait for that buffer to come
back (so I know the peer has seen it) and then wait for an entire
"cycle" of the TX ring to know there is nothing still in flight? That's
some tricky book-keeping.

>   Maybe destination
> should send RST at that point?

i.e. upon receipt of SHUTDOWN_RX|SHUTDOWN_TX from the peer you are
expected to send a RST. When the peer observes that then they know
there is no further data in that connection on the ring?

That sounds like it would be helpful.

> > > Another thing I noticed, which is really more to do with the generic
> > > AF_VSOCK bits than anything to do with your patches is that there is no
> > > limitations on which vsock ports a non-privileged user can bind to and
> > > relatedly that there is no netns support so e.g. users in unproivileged
> > > containers can bind to any vsock port and talk to the host, which might
> > > be undesirable. For my use for now I just went with the big hammer
> > > approach of denying access from anything other than init_net
> > > namespace[1] while I consider what the right answer is.
> > From the vhost point of view each netns should have its own AF_VSOCK
> > namespace.  This way two containers could act as "the host" (CID 2) for
> > their respective guests.

When you say "should" you mean that's the intended design as opposed to
what the current code is actually doing, right?

Ian.
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* [RFC PATCH net-next] sunvnet: Add ethtool support for netdev stastics
From: Atish Patra @ 2016-04-12 16:19 UTC (permalink / raw)
  To: netdev; +Cc: aaron.young, sowmini.varadhan, atish.patra

Currently, sunvnet does not support statstics display
via sysfs entry using ethtool.

This patch enables the basic support in sunvnet that
displays the all the netdev stastics for the specific
interface using ethtool. Other sunvnet statstics and
support for ldmvsw will be added later.

Signed-off-by: Atish Patra <atish.patra@oracle.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 drivers/net/ethernet/sun/sunvnet.c        |   26 +++++++++++++
 drivers/net/ethernet/sun/sunvnet_common.c |   47 ++++++++++++++++++++++++
 drivers/net/ethernet/sun/sunvnet_common.h |   57 +++++++++++++++++++++++++++++
 3 files changed, 130 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/sun/sunvnet.c b/drivers/net/ethernet/sun/sunvnet.c
index a2f9b47..7c8aa98 100644
--- a/drivers/net/ethernet/sun/sunvnet.c
+++ b/drivers/net/ethernet/sun/sunvnet.c
@@ -77,10 +77,36 @@ static void vnet_set_msglevel(struct net_device *dev, u32 value)
 	vp->msg_enable = value;
 }
 
+static int vnet_get_sset_count(struct net_device *netdev, int sset)
+{
+	int scount;
+
+	scount = sunvnet_get_sset_count_common(netdev, sset,
+					       SUNVNET_ETHTOOL);
+	return scount;
+}
+
+static void vnet_get_ethtool_stats(struct net_device *netdev,
+				   struct ethtool_stats *stats, u64 *data)
+{
+	sunvnet_get_ethtool_stats_common(netdev, stats,
+					 data, SUNVNET_ETHTOOL);
+}
+
+static void vnet_get_strings(struct net_device *netdev,
+			     u32 stringset, u8 *data)
+{
+	sunvnet_get_strings_common(netdev, stringset,
+				   data, SUNVNET_ETHTOOL);
+}
+
 static const struct ethtool_ops vnet_ethtool_ops = {
 	.get_drvinfo		= vnet_get_drvinfo,
 	.get_msglevel		= vnet_get_msglevel,
 	.set_msglevel		= vnet_set_msglevel,
+	.get_strings            = vnet_get_strings,
+	.get_ethtool_stats      = vnet_get_ethtool_stats,
+	.get_sset_count         = vnet_get_sset_count,
 	.get_link		= ethtool_op_get_link,
 };
 
diff --git a/drivers/net/ethernet/sun/sunvnet_common.c b/drivers/net/ethernet/sun/sunvnet_common.c
index 904a5a1..2781740 100644
--- a/drivers/net/ethernet/sun/sunvnet_common.c
+++ b/drivers/net/ethernet/sun/sunvnet_common.c
@@ -1730,3 +1730,50 @@ void sunvnet_port_rm_txq_common(struct vnet_port *port)
 						port->q_index));
 }
 EXPORT_SYMBOL_GPL(sunvnet_port_rm_txq_common);
+
+int sunvnet_get_sset_count_common(struct net_device *netdev,
+				  int sset, unsigned int drv_type)
+{
+	switch (sset) {
+	case ETH_SS_STATS:
+		return VNET_NETDEV_STATS_LEN;
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+EXPORT_SYMBOL_GPL(sunvnet_get_sset_count_common);
+
+void sunvnet_get_ethtool_stats_common(struct net_device *netdev,
+				      struct ethtool_stats *stats, u64 *data,
+				      unsigned int drv_type)
+{
+	int i = 0;
+	int j;
+	char *p;
+
+	for (j = 0; j < VNET_NETDEV_STATS_LEN; j++) {
+		p = (char *)netdev + vnet_gstrings_net_stats[j].stat_offset;
+		data[i++] = (vnet_gstrings_net_stats[j].sizeof_stat ==
+			     sizeof(u64)) ? *(u64 *)p : *(u32 *)p;
+	}
+}
+EXPORT_SYMBOL_GPL(sunvnet_get_ethtool_stats_common);
+
+void sunvnet_get_strings_common(struct net_device *netdev,
+				u32 stringset, u8 *data,
+				unsigned int drv_type)
+{
+	u8 *p = data;
+	int i;
+
+	switch (stringset) {
+	case ETH_SS_STATS:
+		for (i = 0; i < VNET_NETDEV_STATS_LEN; i++) {
+			snprintf(p, ETH_GSTRING_LEN, "%s",
+				 vnet_gstrings_net_stats[i].stat_string);
+			p += ETH_GSTRING_LEN;
+		}
+		break;
+	}
+}
+EXPORT_SYMBOL_GPL(sunvnet_get_strings_common);
diff --git a/drivers/net/ethernet/sun/sunvnet_common.h b/drivers/net/ethernet/sun/sunvnet_common.h
index bd36528..9618d42 100644
--- a/drivers/net/ethernet/sun/sunvnet_common.h
+++ b/drivers/net/ethernet/sun/sunvnet_common.h
@@ -72,6 +72,55 @@ struct vnet_port {
 	u16			q_index;
 };
 
+/* Ethtool support strings and macros */
+enum {
+	SUNVNET_ETHTOOL,
+	VSW_ETHTOOL
+};
+
+struct vnet_stats {
+	char stat_string[ETH_GSTRING_LEN];
+	u8 sizeof_stat;
+	u64 stat_offset;
+};
+
+#define VNET_STAT(_type, _name, _stat) { \
+	.stat_string = _name, \
+	.sizeof_stat = FIELD_SIZEOF(_type, _stat), \
+	.stat_offset = offsetof(_type, _stat) \
+}
+
+#define VNET_NETDEV_STAT(_name, _net_stat) \
+		VNET_STAT(struct net_device, _name, _net_stat)
+
+static const struct vnet_stats vnet_gstrings_net_stats[] = {
+	VNET_NETDEV_STAT("rx_packets", stats.rx_packets),
+	VNET_NETDEV_STAT("tx_packets", stats.tx_packets),
+	VNET_NETDEV_STAT("rx_bytes", stats.rx_bytes),
+	VNET_NETDEV_STAT("tx_bytes", stats.tx_bytes),
+	VNET_NETDEV_STAT("rx_errors", stats.rx_errors),
+	VNET_NETDEV_STAT("tx_errors", stats.tx_errors),
+	VNET_NETDEV_STAT("rx_dropped", stats.rx_dropped),
+	VNET_NETDEV_STAT("tx_dropped", stats.tx_dropped),
+	VNET_NETDEV_STAT("multicast", stats.multicast),
+	VNET_NETDEV_STAT("collisions", stats.collisions),
+	VNET_NETDEV_STAT("rx_length_errors", stats.rx_length_errors),
+	VNET_NETDEV_STAT("rx_over_errors", stats.rx_over_errors),
+	VNET_NETDEV_STAT("rx_crc_errors", stats.rx_crc_errors),
+	VNET_NETDEV_STAT("rx_frame_errors", stats.rx_frame_errors),
+	VNET_NETDEV_STAT("rx_fifo_errors", stats.rx_fifo_errors),
+	VNET_NETDEV_STAT("rx_missed_errors", stats.rx_missed_errors),
+	VNET_NETDEV_STAT("tx_aborted_errors", stats.tx_aborted_errors),
+	VNET_NETDEV_STAT("tx_carrier_errors", stats.tx_carrier_errors),
+	VNET_NETDEV_STAT("tx_fifo_errors", stats.tx_fifo_errors),
+	VNET_NETDEV_STAT("tx_heartbeat_errors", stats.tx_heartbeat_errors),
+	VNET_NETDEV_STAT("tx_window_errors", stats.tx_window_errors),
+	VNET_NETDEV_STAT("rx_compressed", stats.rx_compressed),
+	VNET_NETDEV_STAT("tx_compressed", stats.tx_compressed),
+};
+
+#define VNET_NETDEV_STATS_LEN ARRAY_SIZE(vnet_gstrings_net_stats)
+
 static inline struct vnet_port *to_vnet_port(struct vio_driver_state *vio)
 {
 	return container_of(vio, struct vnet_port, vio);
@@ -141,5 +190,13 @@ void sunvnet_port_free_tx_bufs_common(struct vnet_port *port);
 bool sunvnet_port_is_up_common(struct vnet_port *vnet);
 void sunvnet_port_add_txq_common(struct vnet_port *port);
 void sunvnet_port_rm_txq_common(struct vnet_port *port);
+int sunvnet_get_sset_count_common(struct net_device *netdev, int sset,
+				  unsigned int drv_type);
+void sunvnet_get_ethtool_stats_common(struct net_device *netdev,
+				      struct ethtool_stats *stats, u64 *data,
+				      unsigned int drv_type);
+void sunvnet_get_strings_common(struct net_device *netdev,
+				u32 stringset, u8 *data,
+				unsigned int drv_type);
 
 #endif /* _SUNVNETCOMMON_H */
-- 
1.7.1

^ permalink raw reply related

* Re: [PATCH net v3] net: sched: do not requeue a NULL skb
From: Eric Dumazet @ 2016-04-12 16:23 UTC (permalink / raw)
  To: Lars Persson; +Cc: netdev, jhs, linux-kernel, xiyou.wangcong, Lars Persson
In-Reply-To: <1460443552-26710-1-git-send-email-larper@axis.com>

On Tue, 2016-04-12 at 08:45 +0200, Lars Persson wrote:
> A failure in validate_xmit_skb_list() triggered an unconditional call
> to dev_requeue_skb with skb=NULL. This slowly grows the queue
> discipline's qlen count until all traffic through the queue stops.

> 
> Fixes: 55a93b3ea780 ("qdisc: validate skb without holding lock")
> Signed-off-by: Lars Persson <larper@axis.com>
> ---


Acked-by: Eric Dumazet <edumazet@google.com>

Thanks !

^ permalink raw reply

* [PATCH 4.6 fix] bgmac: fix MAC soft-reset bit for corerev > 4
From: Rafał Miłecki @ 2016-04-12 16:27 UTC (permalink / raw)
  To: David S. Miller, netdev
  Cc: Hauke Mehrtens, Felix Fietkau, Rafał Miłecki

From: Felix Fietkau <nbd@openwrt.org>

Only core revisions older than 4 use BGMAC_CMDCFG_SR_REV0. This mainly
fixes support for BCM4708A0KF SoCs with Ethernet core rev 5 (it means
only some devices as most of BCM4708A0KF-s got core rev 4).
This was tested for regressions on BCM47094 which doesn't seem to care
which bit gets used.

Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: Rafał Miłecki <zajec5@gmail.com>
---
If it's OK for everyone, I suggest taking this patch into net.git repo.
It obviously can't regress old devices and was well tested on new ones
by OpenWrt users.
---
 drivers/net/ethernet/broadcom/bgmac.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bgmac.h b/drivers/net/ethernet/broadcom/bgmac.h
index 4fbb093..9a03c14 100644
--- a/drivers/net/ethernet/broadcom/bgmac.h
+++ b/drivers/net/ethernet/broadcom/bgmac.h
@@ -199,9 +199,9 @@
 #define  BGMAC_CMDCFG_TAI			0x00000200
 #define  BGMAC_CMDCFG_HD			0x00000400	/* Set if in half duplex mode */
 #define  BGMAC_CMDCFG_HD_SHIFT			10
-#define  BGMAC_CMDCFG_SR_REV0			0x00000800	/* Set to reset mode, for other revs */
-#define  BGMAC_CMDCFG_SR_REV4			0x00002000	/* Set to reset mode, only for core rev 4 */
-#define  BGMAC_CMDCFG_SR(rev)  ((rev == 4) ? BGMAC_CMDCFG_SR_REV4 : BGMAC_CMDCFG_SR_REV0)
+#define  BGMAC_CMDCFG_SR_REV0			0x00000800	/* Set to reset mode, for core rev 0-3 */
+#define  BGMAC_CMDCFG_SR_REV4			0x00002000	/* Set to reset mode, for core rev >= 4 */
+#define  BGMAC_CMDCFG_SR(rev)  ((rev >= 4) ? BGMAC_CMDCFG_SR_REV4 : BGMAC_CMDCFG_SR_REV0)
 #define  BGMAC_CMDCFG_ML			0x00008000	/* Set to activate mac loopback mode */
 #define  BGMAC_CMDCFG_AE			0x00400000
 #define  BGMAC_CMDCFG_CFE			0x00800000
-- 
1.8.4.5

^ permalink raw reply related

* Re: [RFC v5 0/5] Add virtio transport for AF_VSOCK
From: Ian Campbell @ 2016-04-12 16:37 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: marius vlad, kvm, Michael S. Tsirkin, ijc, virtualization,
	Claudio Imbrenda, Matt Benjamin, netdev, Greg Kurz,
	Christoffer Dall
In-Reply-To: <20160412135957.GB29405@stefanha-x1.localdomain>


[-- Attachment #1.1: Type: text/plain, Size: 6431 bytes --]

For some reason your mails in this thread only appear in the gmail web UI
and not on the IMAP version of my mailbox (my own and Michael's mails are
fine).

So I'm replying via the web interface, sorry for the inevitable formatting
mess :-/

I've CCd another mailbox in the hopes of getting your mails in that IMAP
folder instead/aswell so I can avoid this next time.

On 12 April 2016 at 14:59, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> > > > One wrinkle I came across, which I'm not sure if it is by design or a
> > > > problem is that I can see this sequence coming from the guest (with
> > > > other activity in between):
> > > >
> > > >     1) OP_SHUTDOWN w/ flags == SHUTDOWN_RX
> > > >     2) OP_SHUTDOWN w/ flags == SHUTDOWN_TX
> > > >     3) OP_SHUTDOWN w/ flags == SHUTDOWN_TX|SHUTDOWN_RX
>
> How did you trigger this sequence?  I'd like to reproduce it.
>

Nothing magic. I've written some logging into my backend and captured the
result for a simple backend initiated connection.

In the log "TX" and "RX" indicate the thread doing the processing (with
"TX" being the one which processes the guest's TX ring, i.e. data coming
from the guest to the host). "<=" indicates a buffer going from guest to
host and "=>" is from host to guest. NB that guest to host replies are
queued synchronously by the TX thread onto the RX ring which is why the
somewhat odd looking "TX: =>" combination can occur. A host initiated
connection also happens from the TX thread in the same way.

The trace is of a simple request response (which both fit in one buffer in
each direction), the lines without an "?X:" prefix are my
annotations/guesses as to what is going on:

TX: =>SRC:00000002.00010002 DST:00000003.00000948
TX:   LEN:00000000 TYPE:0001 OP:1=REQUEST
TX:  FLAGS:00000000 BUF_ALLOC:00008000 FWD_CNT:00000000

TX: <=SRC:00000003.00000948 DST:00000002.00010002
TX:   LEN:00000000 TYPE:0001 OP:2=RESPONSE
TX:  FLAGS:00000000 BUF_ALLOC:00040000 FWD_CNT:00000000

REQUEST + RESPONSE == Channel open successfully

RX: =>SRC:00000002.00010002 DST:00000003.00000948
RX:   LEN:0000005e TYPE:0001 OP:5=RW
RX:  FLAGS:00000000 BUF_ALLOC:00008000 FWD_CNT:00000000

Host sends a request to the guest

TX: <=SRC:00000003.00000948 DST:00000002.00010002
TX:   LEN:00000000 TYPE:0001 OP:6=CREDIT_UPDATE
TX:  FLAGS:00000000 BUF_ALLOC:00040000 FWD_CNT:0000005e

Guest replies with a credit update

TX: <=SRC:00000003.00000948 DST:00000002.00010002
TX:   LEN:00000091 TYPE:0001 OP:5=RW
TX:  FLAGS:00000000 BUF_ALLOC:00040000 FWD_CNT:0000005e

Guest replies with the answer to the request

RX: =>SRC:00000002.00010002 DST:00000003.00000948
RX:   LEN:00000000 TYPE:0001 OP:4=SHUTDOWN
RX:  FLAGS:00000002 BUF_ALLOC:00008000 FWD_CNT:00000091

Host has sent its only request, so host app must have done
shutdown(SHUT_WR) I suppose and host therefore sends SHUTDOWN_TX.

TX: <=SRC:00000003.00000948 DST:00000002.00010002
TX:   LEN:00000000 TYPE:0001 OP:4=SHUTDOWN
TX:  FLAGS:00000001 BUF_ALLOC:00040000 FWD_CNT:0000005e

Guest SHUTDOWN_RX. I'm not sure if this is a direct kernel response to the
SHUTDOWN_TX or if the application inside the guest saw an EOF when reading
the socket and did the corresponding shutdown(SHUT_RD).

TX: <=SRC:00000003.00000948 DST:00000002.00010002
TX:   LEN:00000000 TYPE:0001 OP:4=SHUTDOWN
TX:  FLAGS:00000002 BUF_ALLOC:00040000 FWD_CNT:0000005e

Guest SHUTDOWN_TX, I presume that having sent the only response it is going
to it then does shutdown(SHUT_WR).

TX: <=SRC:00000003.00000948 DST:00000002.00010002
TX:   LEN:00000000 TYPE:0001 OP:4=SHUTDOWN
TX:  FLAGS:00000003 BUF_ALLOC:00040000 FWD_CNT:0000005e

Guest shuts down both directions.

Perhaps the guest end is turning shutdown(foo) directly into a vsock
message without or-ing in the current state?

> > > I orignally had my backend close things down at #2, however this meant
> > > > that when #3 arrived it was for a non-existent socket (or, worse, an
> > > > active one if the ports got reused). I checked v5 of the spec
> > > > proposal[0] which says:
> > > >     If these bits are set and there are no more virtqueue buffers
> > > >     pending the socket is disconnected.
> > > >
> > > > but I'm not entirely sure if this behaviour contradicts this or not
> > > > (the bits have both been set at #2, but not at the same time).
> > > >
> > > > BTW, how does one tell if there are no more virtqueue buffers pending
> > > > or not while processing the op?
> > >
> > > #2 is odd.  The shutdown bits are sticky so they cannot be cleared once
> > > set.  I would have expected just #1 and #3.  The behavior you observe
> > > look like a bug.
> > >
> > > The spec text does not convey the meaning of OP_SHUTDOWN well.
> > > OP_SHUTDOWN SHUTDOWN_TX|SHUTDOWN_RX means no further rx/tx is possible
> > > for this connection.  "there are no more virtqueue buffers pending the
> > > socket" really means that this isn't an immediate close from the
> > > perspective of the application.  If the application still has unread rx
> > > buffers then the socket stays readable until the rx data has been fully
> > > read.
> >
> > Yes but you also wrote:
> >       If these bits are set and there are no more virtqueue buffers
> >       pending the socket is disconnected.
> >
> > how does remote know that there are no buffers pending and so it's safe
> > to reuse the same source/destination address now?  Maybe destination
> > should send RST at that point?
>
> You are right, the source/destination address could be reused while the
> remote still has the connection in their table.  Connection
> establishment would fail with a RST reply.
>
> I can think of two solutions:
>
> 1. Implementations must remove connections from their table as soon as
>    SHUTDOWN_TX|SHUTDOWN_RX is received.  This way the source/destination
>    address tuple can be reused immediately, i.e. new connections with
>    the same source/destination would be possible while an application is
>    still draining the receive buffers of an old connection.
>
> 2. Extend the connection lifecycle so that an A->B
>    SHUTDOWN_TX|SHUTDOWN_RX must be followed by a by a B->A RST to close
>    a connection.  This way the source/destination address is only in use
>    once at a time.
>
> Option #2 seems safer because there is no overlap in source/destination
> address usage.
>

I agree (I hadn't seen this when I replied to Michael similar suggestion).

Ian. (sorry again for the formatting)

[-- Attachment #1.2: Type: text/html, Size: 8558 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: Inter-VRF routing on a single machine
From: David Ahern @ 2016-04-12 16:48 UTC (permalink / raw)
  To: Darwin Dingel, netdev
In-Reply-To: <CAAJwJQkfVWkdL+tOvcu5C9XBwTpNKcB8qvGeDUO32Tq6nW=xMw@mail.gmail.com>

On 4/12/16 4:09 AM, Darwin Dingel wrote:
> Hi All,
>
> Have anyone tried the following setup on a single machine with 2 TCP
> sockets on different VRF's and succeeded?
>
> - client_socket on VRF1
> - server_socket on VRF2
> - ip rules and iproutes for inter-VRF set up
> - client_socket sends TCP connect to server_socket. skb was sent using
> VRF1 interface
> - skb received in loopback interface

That is the key problem there.

> - TCP code got SYN but cannot route back to VRF1 to send ACK.
>
> I was wondering if this is a known limitation of VRF as of the moment,
> or could work with proper iprules/iproute.

In general local (within a single system) routing does not work with top 
of tree. e.g., within a VRF you can not connect sockets or ping a VRF 
local address. Inter-vrf connections within a system also do not work.

I have patches from our 4.1 kernel that I have rebased to top of tree. I 
hope to test and send those out in the next week or so. It addresses the 
first problem -- connections within a VRF. While it does not resolve 
your problem of connecting across VRFs within a system I think it is the 
foundation for how to fix it.

^ permalink raw reply

* Re: TCP reaching to maximum throughput after a long time
From: Yuchung Cheng @ 2016-04-12 17:05 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Machani, Yaniv, netdev, David S. Miller, Eric Dumazet,
	Neal Cardwell, Nandita Dukkipati, open list, Kama, Meirav
In-Reply-To: <1460472764.6473.589.camel@edumazet-glaptop3.roam.corp.google.com>

On Tue, Apr 12, 2016 at 7:52 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> On Tue, 2016-04-12 at 12:17 +0000, Machani, Yaniv wrote:
> > Hi,
> > After updating from Kernel 3.14 to Kernel 4.4 we have seen a TCP performance degradation over Wi-Fi.
> > In 3.14 kernel, TCP got to its max throughout after less than a second, while in the 4.4  it is taking ~20-30 seconds.
> > UDP TX/RX and TCP RX performance is as expected.
> > We are using a Beagle Bone Black and a WiLink8 device.
> >
> > Were there any related changes that might cause such behavior ?
> > Kernel configuration and sysctl values were compared, but no significant differences have been found.
> >
> > See a log of the behavior below :
> > -----------------------------------------------------------
> > Client connecting to 10.2.46.5, TCP port 5001
> > TCP window size:  320 KByte (WARNING: requested  256 KByte)
> > ------------------------------------------------------------
> > [  3] local 10.2.46.6 port 49282 connected with 10.2.46.5 port 5001
> > [ ID] Interval       Transfer     Bandwidth
> > [  3]  0.0- 1.0 sec  5.75 MBytes  48.2 Mbits/sec
> > [  3]  1.0- 2.0 sec  6.50 MBytes  54.5 Mbits/sec
> > [  3]  2.0- 3.0 sec  6.50 MBytes  54.5 Mbits/sec
> > [  3]  3.0- 4.0 sec  6.50 MBytes  54.5 Mbits/sec
> > [  3]  4.0- 5.0 sec  6.75 MBytes  56.6 Mbits/sec
> > [  3]  5.0- 6.0 sec  3.38 MBytes  28.3 Mbits/sec
> > [  3]  6.0- 7.0 sec  6.38 MBytes  53.5 Mbits/sec
> > [  3]  7.0- 8.0 sec  6.88 MBytes  57.7 Mbits/sec
> > [  3]  8.0- 9.0 sec  7.12 MBytes  59.8 Mbits/sec
> > [  3]  9.0-10.0 sec  7.12 MBytes  59.8 Mbits/sec
> > [  3] 10.0-11.0 sec  7.12 MBytes  59.8 Mbits/sec
> > [  3] 11.0-12.0 sec  7.25 MBytes  60.8 Mbits/sec
> > [  3] 12.0-13.0 sec  7.12 MBytes  59.8 Mbits/sec
> > [  3] 13.0-14.0 sec  7.25 MBytes  60.8 Mbits/sec
> > [  3] 14.0-15.0 sec  7.62 MBytes  64.0 Mbits/sec
> > [  3] 15.0-16.0 sec  7.88 MBytes  66.1 Mbits/sec
> > [  3] 16.0-17.0 sec  8.12 MBytes  68.2 Mbits/sec
> > [  3] 17.0-18.0 sec  8.25 MBytes  69.2 Mbits/sec
> > [  3] 18.0-19.0 sec  8.50 MBytes  71.3 Mbits/sec
> > [  3] 19.0-20.0 sec  8.88 MBytes  74.4 Mbits/sec
> > [  3] 20.0-21.0 sec  8.75 MBytes  73.4 Mbits/sec
> > [  3] 21.0-22.0 sec  8.62 MBytes  72.4 Mbits/sec
> > [  3] 22.0-23.0 sec  8.75 MBytes  73.4 Mbits/sec
> > [  3] 23.0-24.0 sec  8.50 MBytes  71.3 Mbits/sec
> > [  3] 24.0-25.0 sec  8.62 MBytes  72.4 Mbits/sec
> > [  3] 25.0-26.0 sec  8.62 MBytes  72.4 Mbits/sec
> > [  3] 26.0-27.0 sec  8.62 MBytes  72.4 Mbits/sec
> >
>
> CC netdev, where this is better discussed.
>
> This could be a lot of different factors, and caused by a sender
> problem, a receiver problem, ...
>
> TCP behavior depends on the drivers, so maybe a change there can explain
> this.
>
> You could capture the first 5000 frames of the flow and post the pcap ?
> (-s 128 to capture only the headers)
pcap would be really helpful indeed. if possible please capture on
both 4.4 and 3.14 kernels.

>
> tcpdump -p -s 128 -i eth0 -c 5000 host 10.2.46.5 -w flow.pcap
>
>
> Also, while test is running, you could fetch
> ss -temoi dst 10.2.46.5:5001
>
>
>
>
>

^ permalink raw reply

* [PATCH net 0/2] Fixes for SO_REUSEPORT and mixed v4/v6 sockets
From: Craig Gallek @ 2016-04-12 17:11 UTC (permalink / raw)
  To: davem; +Cc: netdev

From: Craig Gallek <kraig@google.com>

Recent changes to the datastructures associated with SO_REUSEPORT broke
an existing behavior when equivalent SO_REUSEPORT sockets are created
using both AF_INET and AF_INET6.  This patch series restores the previous
behavior and includes a test to validate it.

This series should be a trivial merge to stable kernels (if deemed
necessary), but will have conflicts in net-next.  The following patches
recently replaced the use of hlist_nulls with hlists for UDP and TCP
socket lists:
ca065d0cf80f ("udp: no longer use SLAB_DESTROY_BY_RCU")
3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood")

If this series is accepted, I will send an RFC for the net-next change
to assist with the merge.

Craig Gallek (2):
  soreuseport: fix ordering for mixed v4/v6 sockets
  soreuseport: test mixed v4/v6 sockets

 include/linux/rculist_nulls.h                     |  39 ++++
 include/net/sock.h                                |   6 +-
 net/ipv4/udp.c                                    |   9 +-
 tools/testing/selftests/net/.gitignore            |   1 +
 tools/testing/selftests/net/Makefile              |   2 +-
 tools/testing/selftests/net/reuseport_dualstack.c | 208 ++++++++++++++++++++++
 6 files changed, 261 insertions(+), 4 deletions(-)
 create mode 100644 tools/testing/selftests/net/reuseport_dualstack.c

-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply

* [PATCH net 1/2] soreuseport: fix ordering for mixed v4/v6 sockets
From: Craig Gallek @ 2016-04-12 17:11 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <1460481086-12982-1-git-send-email-kraigatgoog@gmail.com>

From: Craig Gallek <kraig@google.com>

With the SO_REUSEPORT socket option, it is possible to create sockets
in the AF_INET and AF_INET6 domains which are bound to the same IPv4 address.
This is only possible with SO_REUSEPORT and when not using IPV6_V6ONLY on
the AF_INET6 sockets.

Prior to the commits referenced below, an incoming IPv4 packet would
always be routed to a socket of type AF_INET when this mixed-mode was used.
After those changes, the same packet would be routed to the most recently
bound socket (if this happened to be an AF_INET6 socket, it would
have an IPv4 mapped IPv6 address).

The change in behavior occurred because the recent SO_REUSEPORT optimizations
short-circuit the socket scoring logic as soon as they find a match.  They
did not take into account the scoring logic that favors AF_INET sockets
over AF_INET6 sockets in the event of a tie.

To fix this problem, this patch changes the insertion order of AF_INET
and AF_INET6 addresses in the TCP and UDP socket lists when the sockets
have SO_REUSEPORT set.  AF_INET sockets will be inserted at the head of the
list and AF_INET6 sockets with SO_REUSEPORT set will always be inserted at
the tail of the list.  This will force AF_INET sockets to always be
considered first.

Fixes: e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection")
Fixes: 125e80b88687 ("soreuseport: fast reuseport TCP socket selection")

Reported-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/rculist_nulls.h | 39 +++++++++++++++++++++++++++++++++++++++
 include/net/sock.h            |  6 +++++-
 net/ipv4/udp.c                |  9 +++++++--
 3 files changed, 51 insertions(+), 3 deletions(-)

diff --git a/include/linux/rculist_nulls.h b/include/linux/rculist_nulls.h
index 1c33dd7da4a7..4ae95f7e8597 100644
--- a/include/linux/rculist_nulls.h
+++ b/include/linux/rculist_nulls.h
@@ -98,6 +98,45 @@ static inline void hlist_nulls_add_head_rcu(struct hlist_nulls_node *n,
 	if (!is_a_nulls(first))
 		first->pprev = &n->next;
 }
+
+/**
+ * hlist_nulls_add_tail_rcu
+ * @n: the element to add to the hash list.
+ * @h: the list to add to.
+ *
+ * Description:
+ * Adds the specified element to the end of the specified hlist_nulls,
+ * while permitting racing traversals.  NOTE: tail insertion requires
+ * list traversal.
+ *
+ * The caller must take whatever precautions are necessary
+ * (such as holding appropriate locks) to avoid racing
+ * with another list-mutation primitive, such as hlist_nulls_add_head_rcu()
+ * or hlist_nulls_del_rcu(), running on this same list.
+ * However, it is perfectly legal to run concurrently with
+ * the _rcu list-traversal primitives, such as
+ * hlist_nulls_for_each_entry_rcu(), used to prevent memory-consistency
+ * problems on Alpha CPUs.  Regardless of the type of CPU, the
+ * list-traversal primitive must be guarded by rcu_read_lock().
+ */
+static inline void hlist_nulls_add_tail_rcu(struct hlist_nulls_node *n,
+					struct hlist_nulls_head *h)
+{
+	struct hlist_nulls_node *i, *last = NULL;
+
+	for (i = hlist_nulls_first_rcu(h); !is_a_nulls(i);
+	     i = hlist_nulls_next_rcu(i))
+		last = i;
+
+	if (last) {
+		n->next = last->next;
+		n->pprev = &last->next;
+		rcu_assign_pointer(hlist_nulls_next_rcu(last), n);
+	} else {
+		hlist_nulls_add_head_rcu(n, h);
+	}
+}
+
 /**
  * hlist_nulls_for_each_entry_rcu - iterate over rcu list of given type
  * @tpos:	the type * to use as a loop cursor.
diff --git a/include/net/sock.h b/include/net/sock.h
index 255d3e03727b..121ffc115c4f 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -630,7 +630,11 @@ static inline void sk_add_node_rcu(struct sock *sk, struct hlist_head *list)
 
 static inline void __sk_nulls_add_node_rcu(struct sock *sk, struct hlist_nulls_head *list)
 {
-	hlist_nulls_add_head_rcu(&sk->sk_nulls_node, list);
+	if (IS_ENABLED(CONFIG_IPV6) && sk->sk_reuseport &&
+	    sk->sk_family == AF_INET6)
+		hlist_nulls_add_tail_rcu(&sk->sk_nulls_node, list);
+	else
+		hlist_nulls_add_head_rcu(&sk->sk_nulls_node, list);
 }
 
 static inline void sk_nulls_add_node_rcu(struct sock *sk, struct hlist_nulls_head *list)
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 08eed5e16df0..a2e7f55a1f61 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -339,8 +339,13 @@ found:
 
 		hslot2 = udp_hashslot2(udptable, udp_sk(sk)->udp_portaddr_hash);
 		spin_lock(&hslot2->lock);
-		hlist_nulls_add_head_rcu(&udp_sk(sk)->udp_portaddr_node,
-					 &hslot2->head);
+		if (IS_ENABLED(CONFIG_IPV6) && sk->sk_reuseport &&
+			sk->sk_family == AF_INET6)
+			hlist_nulls_add_tail_rcu(&udp_sk(sk)->udp_portaddr_node,
+						 &hslot2->head);
+		else
+			hlist_nulls_add_head_rcu(&udp_sk(sk)->udp_portaddr_node,
+						 &hslot2->head);
 		hslot2->count++;
 		spin_unlock(&hslot2->lock);
 	}
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* [PATCH net 2/2] soreuseport: test mixed v4/v6 sockets
From: Craig Gallek @ 2016-04-12 17:11 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <1460481086-12982-1-git-send-email-kraigatgoog@gmail.com>

From: Craig Gallek <kraig@google.com>

Test to validate the behavior of SO_REUSEPORT sockets that are
created with both AF_INET and AF_INET6.  See the commit prior to this
for a description of this behavior.

Signed-off-by: Craig Gallek <kraig@google.com>
---
 tools/testing/selftests/net/.gitignore            |   1 +
 tools/testing/selftests/net/Makefile              |   2 +-
 tools/testing/selftests/net/reuseport_dualstack.c | 208 ++++++++++++++++++++++
 3 files changed, 210 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/net/reuseport_dualstack.c

diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore
index 69bb3fc38fb2..0840684deb7d 100644
--- a/tools/testing/selftests/net/.gitignore
+++ b/tools/testing/selftests/net/.gitignore
@@ -3,3 +3,4 @@ psock_fanout
 psock_tpacket
 reuseport_bpf
 reuseport_bpf_cpu
+reuseport_dualstack
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index c658792d47b4..0e5340742620 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -4,7 +4,7 @@ CFLAGS = -Wall -O2 -g
 
 CFLAGS += -I../../../../usr/include/
 
-NET_PROGS = socket psock_fanout psock_tpacket reuseport_bpf reuseport_bpf_cpu
+NET_PROGS = socket psock_fanout psock_tpacket reuseport_bpf reuseport_bpf_cpu reuseport_dualstack
 
 all: $(NET_PROGS)
 %: %.c
diff --git a/tools/testing/selftests/net/reuseport_dualstack.c b/tools/testing/selftests/net/reuseport_dualstack.c
new file mode 100644
index 000000000000..90958aaaafb9
--- /dev/null
+++ b/tools/testing/selftests/net/reuseport_dualstack.c
@@ -0,0 +1,208 @@
+/*
+ * It is possible to use SO_REUSEPORT to open multiple sockets bound to
+ * equivalent local addresses using AF_INET and AF_INET6 at the same time.  If
+ * the AF_INET6 socket has IPV6_V6ONLY set, it's clear which socket should
+ * receive a given incoming packet.  However, when it is not set, incoming v4
+ * packets should prefer the AF_INET socket(s).  This behavior was defined with
+ * the original SO_REUSEPORT implementation, but broke with
+ * e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection")
+ * This test creates these mixed AF_INET/AF_INET6 sockets and asserts the
+ * AF_INET preference for v4 packets.
+ */
+
+#define _GNU_SOURCE
+
+#include <arpa/inet.h>
+#include <errno.h>
+#include <error.h>
+#include <linux/in.h>
+#include <linux/unistd.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/epoll.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <unistd.h>
+
+static const int PORT = 8888;
+
+static void build_rcv_fd(int family, int proto, int *rcv_fds, int count)
+{
+	struct sockaddr_storage addr;
+	struct sockaddr_in  *addr4;
+	struct sockaddr_in6 *addr6;
+	int opt, i;
+
+	switch (family) {
+	case AF_INET:
+		addr4 = (struct sockaddr_in *)&addr;
+		addr4->sin_family = AF_INET;
+		addr4->sin_addr.s_addr = htonl(INADDR_ANY);
+		addr4->sin_port = htons(PORT);
+		break;
+	case AF_INET6:
+		addr6 = (struct sockaddr_in6 *)&addr;
+		addr6->sin6_family = AF_INET6;
+		addr6->sin6_addr = in6addr_any;
+		addr6->sin6_port = htons(PORT);
+		break;
+	default:
+		error(1, 0, "Unsupported family %d", family);
+	}
+
+	for (i = 0; i < count; ++i) {
+		rcv_fds[i] = socket(family, proto, 0);
+		if (rcv_fds[i] < 0)
+			error(1, errno, "failed to create receive socket");
+
+		opt = 1;
+		if (setsockopt(rcv_fds[i], SOL_SOCKET, SO_REUSEPORT, &opt,
+			       sizeof(opt)))
+			error(1, errno, "failed to set SO_REUSEPORT");
+
+		if (bind(rcv_fds[i], (struct sockaddr *)&addr, sizeof(addr)))
+			error(1, errno, "failed to bind receive socket");
+
+		if (proto == SOCK_STREAM && listen(rcv_fds[i], 10))
+			error(1, errno, "failed to listen on receive port");
+	}
+}
+
+static void send_from_v4(int proto)
+{
+	struct sockaddr_in  saddr, daddr;
+	int fd;
+
+	saddr.sin_family = AF_INET;
+	saddr.sin_addr.s_addr = htonl(INADDR_ANY);
+	saddr.sin_port = 0;
+
+	daddr.sin_family = AF_INET;
+	daddr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
+	daddr.sin_port = htons(PORT);
+
+	fd = socket(AF_INET, proto, 0);
+	if (fd < 0)
+		error(1, errno, "failed to create send socket");
+
+	if (bind(fd, (struct sockaddr *)&saddr, sizeof(saddr)))
+		error(1, errno, "failed to bind send socket");
+
+	if (connect(fd, (struct sockaddr *)&daddr, sizeof(daddr)))
+		error(1, errno, "failed to connect send socket");
+
+	if (send(fd, "a", 1, 0) < 0)
+		error(1, errno, "failed to send message");
+
+	close(fd);
+}
+
+static int receive_once(int epfd, int proto)
+{
+	struct epoll_event ev;
+	int i, fd;
+	char buf[8];
+
+	i = epoll_wait(epfd, &ev, 1, -1);
+	if (i < 0)
+		error(1, errno, "epoll_wait failed");
+
+	if (proto == SOCK_STREAM) {
+		fd = accept(ev.data.fd, NULL, NULL);
+		if (fd < 0)
+			error(1, errno, "failed to accept");
+		i = recv(fd, buf, sizeof(buf), 0);
+		close(fd);
+	} else {
+		i = recv(ev.data.fd, buf, sizeof(buf), 0);
+	}
+
+	if (i < 0)
+		error(1, errno, "failed to recv");
+
+	return ev.data.fd;
+}
+
+static void test(int *rcv_fds, int count, int proto)
+{
+	struct epoll_event ev;
+	int epfd, i, test_fd;
+	uint16_t test_family;
+	socklen_t len;
+
+	epfd = epoll_create(1);
+	if (epfd < 0)
+		error(1, errno, "failed to create epoll");
+
+	ev.events = EPOLLIN;
+	for (i = 0; i < count; ++i) {
+		ev.data.fd = rcv_fds[i];
+		if (epoll_ctl(epfd, EPOLL_CTL_ADD, rcv_fds[i], &ev))
+			error(1, errno, "failed to register sock epoll");
+	}
+
+	send_from_v4(proto);
+
+	test_fd = receive_once(epfd, proto);
+	if (getsockopt(test_fd, SOL_SOCKET, SO_DOMAIN, &test_family, &len))
+		error(1, errno, "failed to read socket domain");
+	if (test_family != AF_INET)
+		error(1, 0, "expected to receive on v4 socket but got v6 (%d)",
+		      test_family);
+
+	close(epfd);
+}
+
+int main(void)
+{
+	int rcv_fds[32], i;
+
+	fprintf(stderr, "---- UDP IPv4 created before IPv6 ----\n");
+	build_rcv_fd(AF_INET, SOCK_DGRAM, rcv_fds, 5);
+	build_rcv_fd(AF_INET6, SOCK_DGRAM, &(rcv_fds[5]), 5);
+	test(rcv_fds, 10, SOCK_DGRAM);
+	for (i = 0; i < 10; ++i)
+		close(rcv_fds[i]);
+
+	fprintf(stderr, "---- UDP IPv6 created before IPv4 ----\n");
+	build_rcv_fd(AF_INET6, SOCK_DGRAM, rcv_fds, 5);
+	build_rcv_fd(AF_INET, SOCK_DGRAM, &(rcv_fds[5]), 5);
+	test(rcv_fds, 10, SOCK_DGRAM);
+	for (i = 0; i < 10; ++i)
+		close(rcv_fds[i]);
+
+	/* NOTE: UDP socket lookups traverse a different code path when there
+	 * are > 10 sockets in a group.
+	 */
+	fprintf(stderr, "---- UDP IPv4 created before IPv6 (large) ----\n");
+	build_rcv_fd(AF_INET, SOCK_DGRAM, rcv_fds, 16);
+	build_rcv_fd(AF_INET6, SOCK_DGRAM, &(rcv_fds[16]), 16);
+	test(rcv_fds, 32, SOCK_DGRAM);
+	for (i = 0; i < 32; ++i)
+		close(rcv_fds[i]);
+
+	fprintf(stderr, "---- UDP IPv6 created before IPv4 (large) ----\n");
+	build_rcv_fd(AF_INET6, SOCK_DGRAM, rcv_fds, 16);
+	build_rcv_fd(AF_INET, SOCK_DGRAM, &(rcv_fds[16]), 16);
+	test(rcv_fds, 32, SOCK_DGRAM);
+	for (i = 0; i < 32; ++i)
+		close(rcv_fds[i]);
+
+	fprintf(stderr, "---- TCP IPv4 created before IPv6 ----\n");
+	build_rcv_fd(AF_INET, SOCK_STREAM, rcv_fds, 5);
+	build_rcv_fd(AF_INET6, SOCK_STREAM, &(rcv_fds[5]), 5);
+	test(rcv_fds, 10, SOCK_STREAM);
+	for (i = 0; i < 10; ++i)
+		close(rcv_fds[i]);
+
+	fprintf(stderr, "---- TCP IPv6 created before IPv4 ----\n");
+	build_rcv_fd(AF_INET6, SOCK_STREAM, rcv_fds, 5);
+	build_rcv_fd(AF_INET, SOCK_STREAM, &(rcv_fds[5]), 5);
+	test(rcv_fds, 10, SOCK_STREAM);
+	for (i = 0; i < 10; ++i)
+		close(rcv_fds[i]);
+
+	fprintf(stderr, "SUCCESS\n");
+	return 0;
+}
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
From: Alexei Starovoitov @ 2016-04-12 17:20 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: lsf@lists.linux-foundation.org, James Bottomley, Sagi Grimberg,
	Tom Herbert, Brenden Blanco, Christoph Hellwig, linux-mm,
	netdev@vger.kernel.org, Bart Van Assche,
	lsf-pc@lists.linux-foundation.org
In-Reply-To: <20160412081649.4cb4f9db@redhat.com>

On Tue, Apr 12, 2016 at 08:16:49AM +0200, Jesper Dangaard Brouer wrote:
> 
> On Mon, 11 Apr 2016 15:21:26 -0700
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> > On Mon, Apr 11, 2016 at 11:41:57PM +0200, Jesper Dangaard Brouer wrote:
> > > 
> > > On Sun, 10 Apr 2016 21:45:47 +0300 Sagi Grimberg <sagi@grimberg.me> wrote:
> > >   
> [...]
> > > > 
> > > > If we go down this road how about also attaching some driver opaques
> > > > to the page sets?  
> > > 
> > > That was the ultimate plan... to leave some opaques bytes left in the
> > > page struct that drivers could use.
> > > 
> > > In struct page I would need a pointer back to my page_pool struct and a
> > > page flag.  Then, I would need room to store the dma_unmap address.
> > > (And then some of the usual fields are still needed, like the refcnt,
> > > and reusing some of the list constructs).  And a zero-copy cross-domain
> > > id.  
> > 
> > I don't think we need to add anything to struct page.
> > This is supposed to be small cache of dma_mapped pages with lockless access.
> > It can be implemented as an array or link list where every element
> > is dma_addr and pointer to page. If it is full, dma_unmap_page+put_page to
> > send it to back to page allocator.
> 
> It sounds like the Intel drivers recycle facility, where they split the
> page into two parts, and keep page in RX-ring, by swapping to other
> half of page, if page_count(page) is <= 2.  Thus, they use the atomic
> page ref count to synchronize on.

actually I'm proposing the opposite. one page = one packet.
I'm perfectly happy to waste half a page, since number of such pages is small
and performance matter more. Typical performance vs memory tradeoff.

> Thus, we end-up having two atomic operations per RX packet, on the page
> refcnt.  Where DPDK have zero...

the page recycling cache should have zero atomic ops per packet
otherwise it's non starter.

> By fully taking over the page as an allocator, almost like slab. I can
> optimize the common case (of the packet-page getting allocated and
> free'ed on the same CPU), and remove these atomic operations.

slub is doing local cmpxchg. 40G networking cannot afford it per packet.
If it's amortized due to batching that will be ok.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [PATCH net] bpf/verifier: reject invalid LD_ABS | BPF_DW instruction
From: Alexei Starovoitov @ 2016-04-12 17:26 UTC (permalink / raw)
  To: David S . Miller; +Cc: Daniel Borkmann, netdev

verifier must check for reserved size bits in instruction opcode and
reject BPF_LD | BPF_ABS | BPF_DW and BPF_LD | BPF_IND | BPF_DW instructions,
otherwise interpreter will WARN_RATELIMIT on them during execution.

Fixes: ddd872bc3098 ("bpf: verifier: add checks for BPF_ABS | BPF_IND instructions")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
---
 kernel/bpf/verifier.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 2e08f8e9b771..618ef77c302a 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1374,6 +1374,7 @@ static int check_ld_abs(struct verifier_env *env, struct bpf_insn *insn)
 	}
 
 	if (insn->dst_reg != BPF_REG_0 || insn->off != 0 ||
+	    BPF_SIZE(insn->code) == BPF_DW ||
 	    (mode == BPF_ABS && insn->src_reg != BPF_REG_0)) {
 		verbose("BPF_LD_ABS uses reserved fields\n");
 		return -EINVAL;
-- 
2.8.0

^ permalink raw reply related

* Re: [PATCH] mwifiex: fix possible NULL dereference
From: Rustad, Mark D @ 2016-04-12 17:43 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Sudip Mukherjee, Amitkumar Karwar, Nishant Sarmukadam, Kalle Valo,
	linux-kernel@vger.kernel.org, open list:TI WILINK WIRELES...,
	netdev, Sudip Mukherjee
In-Reply-To: <CAHp75VfO3A+HBjkjc9WC9jzB9gtYrVM6izLmpiqLntChGOzjMA@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1971 bytes --]

Andy Shevchenko <andy.shevchenko@gmail.com> wrote:

> On Mon, Apr 11, 2016 at 6:27 PM, Sudip Mukherjee
> <sudipm.mukherjee@gmail.com> wrote:
>> From: Sudip Mukherjee <sudip.mukherjee@codethink.co.uk>
>>
>> We have a check for card just after dereferencing it. So if it is NULL
>> we have already dereferenced it before its check. Lets dereference it
>> after checking card for NULL.
>
> IIUC the code does nothing with dereference.
>
> I would have told NAK if I would have been maintainer.
>
>> Signed-off-by: Sudip Mukherjee <sudip.mukherjee@codethink.co.uk>
>> ---
>>  drivers/net/wireless/marvell/mwifiex/pcie.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/wireless/marvell/mwifiex/pcie.c  
>> b/drivers/net/wireless/marvell/mwifiex/pcie.c
>> index edf8b07..84562d0 100644
>> --- a/drivers/net/wireless/marvell/mwifiex/pcie.c
>> +++ b/drivers/net/wireless/marvell/mwifiex/pcie.c
>> @@ -2884,10 +2884,11 @@ static void mwifiex_unregister_dev(struct  
>> mwifiex_adapter *adapter)
>>  {
>>         struct pcie_service_card *card = adapter->card;
>
> Let's say it's 0.
>
>>        const struct mwifiex_pcie_card_reg *reg;
>> -       struct pci_dev *pdev = card->dev;
>
> This would be equal to offset of dev member in pcie_service_card struct.
>
> Nothing wrong here.

Actually, that is not true. The dereference of card tells the compiler that  
card can't be NULL, so it is free to eliminate the check below.  
Unbelievably, this can even happen for a reference such as &ptr->thing  
where the pointer isn't even actually dereferenced at all!

>> +       struct pci_dev *pdev;
>>         int i;
>>
>>         if (card) {
>> +               pdev = card->dev;
>>                 if (card->msix_enable) {
>>                         for (i = 0; i < MWIFIEX_NUM_MSIX_VECTORS; i++)
>>                                 synchronize_irq(card->msix_entries[i].vector);
>> --
>> 1.9.1
>
>
>
> --
> With Best Regards,
> Andy Shevchenko

^ permalink raw reply

* Issue with ping source address display
From: Daniele Orlandi @ 2016-04-12 17:52 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 4826 bytes --]


Hello,

More than one year ago I posted the following message but it hasn't
received a reply, now I've been stung by a similar issue, you may want
to investigate:


I noticed that when ping receives ICMP messages from different sources
the first IP address is always used and displayed:


vihai@seviolab:~$ ping -V
ping utility, iputils-s20121221

This is a (simulated) flapping route:

vihai@seviolab:~$ ping 10.254.10.140
PING 10.254.10.140 (10.254.10.140) 56(84) bytes of data.
From 192.168.1.1 icmp_seq=1 Destination Host Unreachable
From 192.168.1.1 icmp_seq=2 Destination Host Unreachable
From 192.168.1.1 icmp_seq=3 Destination Host Unreachable
64 bytes from 192.168.1.1: icmp_seq=4 ttl=61 time=24.7 ms
64 bytes from 192.168.1.1: icmp_seq=5 ttl=61 time=25.6 ms
64 bytes from 192.168.1.1: icmp_seq=6 ttl=61 time=69.6 ms
From 192.168.1.1 icmp_seq=7 Destination Host Unreachable
From 192.168.1.1 icmp_seq=8 Destination Host Unreachable
From 192.168.1.1 icmp_seq=9 Destination Host Unreachable
^C
--- 10.254.10.140 ping statistics ---
9 packets transmitted, 3 received, +6 errors, 66% packet loss, time 8001ms
rtt min/avg/max/mdev = 24.797/40.061/69.692/20.955 ms


The sources, however are different:

vihai@seviolab:~$ sudo tcpdump -n icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
^[OA^[OA^[OA17:09:55.932981 IP 192.168.1.21 > 10.254.10.140: ICMP echo
request, id 9278, seq 1, length 64
17:09:55.933234 IP 192.168.1.1 > 192.168.1.21: ICMP host 10.254.10.140
unreachable, length 92
17:09:56.933169 IP 192.168.1.21 > 10.254.10.140: ICMP echo request, id
9278, seq 2, length 64
17:09:56.933416 IP 192.168.1.1 > 192.168.1.21: ICMP host 10.254.10.140
unreachable, length 92
17:09:57.933160 IP 192.168.1.21 > 10.254.10.140: ICMP echo request, id
9278, seq 3, length 64
17:09:57.933404 IP 192.168.1.1 > 192.168.1.21: ICMP host 10.254.10.140
unreachable, length 92
17:09:58.933163 IP 192.168.1.21 > 10.254.10.140: ICMP echo request, id
9278, seq 4, length 64
17:09:58.957939 IP 10.254.10.140 > 192.168.1.21: ICMP echo reply, id
9278, seq 4, length 64
17:09:59.935050 IP 192.168.1.21 > 10.254.10.140: ICMP echo request, id
9278, seq 5, length 64
17:09:59.960724 IP 10.254.10.140 > 192.168.1.21: ICMP echo reply, id
9278, seq 5, length 64
17:10:00.936177 IP 192.168.1.21 > 10.254.10.140: ICMP echo request, id
9278, seq 6, length 64
17:10:01.005849 IP 10.254.10.140 > 192.168.1.21: ICMP echo reply, id
9278, seq 6, length 64
17:10:01.936313 IP 192.168.1.21 > 10.254.10.140: ICMP echo request, id
9278, seq 7, length 64
17:10:01.936626 IP 192.168.1.1 > 192.168.1.21: ICMP host 10.254.10.140
unreachable, length 92
17:10:02.935321 IP 192.168.1.21 > 10.254.10.140: ICMP echo request, id
9278, seq 8, length 64
17:10:02.935591 IP 192.168.1.1 > 192.168.1.21: ICMP host 10.254.10.140
unreachable, length 92
17:10:03.934322 IP 192.168.1.21 > 10.254.10.140: ICMP echo request, id
9278, seq 9, length 64
17:10:03.934613 IP 192.168.1.1 > 192.168.1.21: ICMP host 10.254.10.140
unreachable, length 92




Tried with a different ping implementation (RouterOS) and the behaviour
seems correct:

[vihai@SevioLab SW1] > ping 10.254.10.140
HOST                                     SIZE TTL TIME  STATUS
192.168.1.1                             84  64 0ms   host unreachable
192.168.1.1                             84  64 0ms   host unreachable
192.168.1.1                             84  64 0ms   host unreachable
10.254.10.140                           56  61 20ms
10.254.10.140                           56  61 46ms
10.254.10.140                           56  61 37ms
192.168.1.1                             84  64 0ms   host unreachable
192.168.1.1                             84  64 0ms   host unreachable
192.168.1.1                             84  64 0ms   host unreachable
    sent=9 received=3 packet-loss=66% min-rtt=20ms avg-rtt=34ms max-rtt=46ms


Recently I was pinging with IPv6, a router in between filtered the
packet, however the shown source address was not the right one:

root@monitor:~# ping6 -i 0.2 www.google.com
PING www.google.com(mil01s25-in-x04.1e100.net) 56 data bytes
From mil01s25-in-x04.1e100.net icmp_seq=1 Destination unreachable: No route
From mil01s25-in-x04.1e100.net icmp_seq=2 Destination unreachable: No route
From mil01s25-in-x04.1e100.net icmp_seq=3 Destination unreachable: No route

19:33:19.589285 IP6 2a01:2d8:aca0:fce:944e:c8ff:fe4d:96de >
mil01s25-in-x04.1e100.net: ICMP6, echo request, seq 1, length 64
19:33:19.611666 IP6 mix-br2.intercom.it >
2a01:2d8:aca0:fce:944e:c8ff:fe4d:96de: ICMP6, destination unreachable,
unreachable route mil01s25-in-x04.1e100.net, length 112


Bye,


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4326 bytes --]

^ permalink raw reply

* RE: TCP reaching to maximum throughput after a long time
From: Machani, Yaniv @ 2016-04-12 19:31 UTC (permalink / raw)
  To: Ben Greear, netdev
  Cc: Eric Dumazet, David S. Miller, Eric Dumazet, Neal Cardwell,
	Yuchung Cheng, Nandita Dukkipati, open list, Kama, Meirav
In-Reply-To: <570D0E94.3080006@candelatech.com>

On Tue, Apr 12, 2016 at 18:04:52, Ben Greear wrote:
> On 04/12/2016 07:52 AM, Eric Dumazet wrote:
> > On Tue, 2016-04-12 at 12:17 +0000, Machani, Yaniv wrote:
>>>
> 
> If you are using 'Cubic' TCP congestion control, then please try 
> something different.
> It was broken last I checked, at least when used with the ath10k driver.
> 

Thanks Ben, this indeed seems to be the issue !
Switching to reno got me to max throughput instantly.

I'm still looking through the thread you have shared, but from what I understand there is no planned fix for it ?

--
Thanks,
Yaniv


^ permalink raw reply

* [PATCH net-next] ibmvnic: Defer tx completion processing using a wait queue
From: John Allen @ 2016-04-12 19:38 UTC (permalink / raw)
  To: Thomas Falcon; +Cc: netdev, linuxppc-dev

Moves tx completion processing out of interrupt context, deferring work
using a wait queue. With this work now deferred, we must account for the
possibility that skbs can be sent faster than we can process completion
requests in which case the tx buffer will overflow. If the tx buffer is
full, ibmvnic_xmit will return NETDEV_TX_BUSY and stop the current tx
queue. Subsequently, the queue will be restarted in ibmvnic_complete_tx
when all pending tx completion requests have been cleared.

Signed-off-by: John Allen <jallen@linux.vnet.ibm.com>
---
diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index 864cb21..641e340 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -105,6 +105,7 @@ static int pending_scrq(struct ibmvnic_adapter *,
 			struct ibmvnic_sub_crq_queue *);
 static union sub_crq *ibmvnic_next_scrq(struct ibmvnic_adapter *,
 					struct ibmvnic_sub_crq_queue *);
+static int ibmvnic_tx_work(void *data);
 static int ibmvnic_poll(struct napi_struct *napi, int data);
 static void send_map_query(struct ibmvnic_adapter *adapter);
 static void send_request_map(struct ibmvnic_adapter *, dma_addr_t, __be32, u8);
@@ -437,6 +438,17 @@ static int ibmvnic_open(struct net_device *netdev)

 		tx_pool->consumer_index = 0;
 		tx_pool->producer_index = 0;
+
+		init_waitqueue_head(&tx_pool->ibmvnic_tx_comp_q);
+		tx_pool->work_thread =
+			kthread_run(ibmvnic_tx_work, adapter->tx_scrq[i],
+				    "%s_%s_%d",
+				    IBMVNIC_NAME, adapter->netdev->name, i);
+		if (IS_ERR(tx_pool->work_thread)) {
+			dev_err(dev, "Couldn't create kernel thread: %ld\n",
+				PTR_ERR(tx_pool->work_thread));
+			goto thread_failed;
+		}
 	}
 	adapter->bounce_buffer_size =
 	    (netdev->mtu + ETH_HLEN - 1) / PAGE_SIZE + 1;
@@ -477,6 +489,9 @@ bounce_map_failed:
 bounce_alloc_failed:
 	i = tx_subcrqs - 1;
 	kfree(adapter->tx_pool[i].free_map);
+thread_failed:
+	for (j = 0; j < i; j++)
+		kthread_stop(adapter->tx_pool[j].work_thread);
 tx_fm_alloc_failed:
 	free_long_term_buff(adapter, &adapter->tx_pool[i].long_term_buff);
 tx_ltb_alloc_failed:
@@ -731,6 +746,16 @@ static int ibmvnic_xmit(struct sk_buff *skb, struct net_device *netdev)
 	}

 	index = tx_pool->free_map[tx_pool->consumer_index];
+	/* tx queue full */
+	if ((index + 1) % adapter->max_tx_entries_per_subcrq ==
+	    tx_pool->free_map[tx_pool->producer_index]) {
+		netif_tx_stop_queue(netdev_get_tx_queue(netdev, queue_num));
+		tx_send_failed++;
+		tx_dropped++;
+		ret = NETDEV_TX_BUSY;
+		goto out;
+	}
+
 	offset = index * adapter->req_mtu;
 	dst = tx_pool->long_term_buff.buff + offset;
 	memset(dst, 0, adapter->req_mtu);
@@ -1314,6 +1339,7 @@ static int ibmvnic_complete_tx(struct ibmvnic_adapter *adapter,
 {
 	struct device *dev = &adapter->vdev->dev;
 	struct ibmvnic_tx_buff *txbuff;
+	struct netdev_queue *txq;
 	union sub_crq *next;
 	int index;
 	int i, j;
@@ -1361,6 +1387,10 @@ restart_loop:
 		next->tx_comp.first = 0;
 	}

+	txq = netdev_get_tx_queue(adapter->netdev, scrq->pool_index);
+	if (netif_tx_queue_stopped(txq))
+		netif_tx_wake_queue(txq);
+
 	enable_scrq_irq(adapter, scrq);

 	if (pending_scrq(adapter, scrq)) {
@@ -1371,13 +1401,35 @@ restart_loop:
 	return 0;
 }

+static int ibmvnic_tx_work(void *data)
+{
+	int rc;
+	struct ibmvnic_sub_crq_queue *scrq = data;
+	struct ibmvnic_adapter *adapter = scrq->adapter;
+
+	while (1) {
+		rc = wait_event_interruptible(adapter->
+					      tx_pool[scrq->pool_index].
+					      ibmvnic_tx_comp_q,
+					      pending_scrq(adapter, scrq));
+		BUG_ON(rc);
+
+		if (kthread_should_stop())
+			break;
+
+		disable_scrq_irq(adapter, scrq);
+
+		ibmvnic_complete_tx(adapter, scrq);
+	}
+	return 0;
+}
+
 static irqreturn_t ibmvnic_interrupt_tx(int irq, void *instance)
 {
 	struct ibmvnic_sub_crq_queue *scrq = instance;
 	struct ibmvnic_adapter *adapter = scrq->adapter;

-	disable_scrq_irq(adapter, scrq);
-	ibmvnic_complete_tx(adapter, scrq);
+	wake_up(&adapter->tx_pool[scrq->pool_index].ibmvnic_tx_comp_q);

 	return IRQ_HANDLED;
 }

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox