Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH net-next 04/10] tipc: change sk_receive_queue upper limit
From: Paul Gortmaker @ 2012-12-07 14:28 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Jon Maloy, Paul Gortmaker
In-Reply-To: <1354890498-6448-1-git-send-email-paul.gortmaker@windriver.com>

From: Jon Maloy <jon.maloy@ericsson.com>

The sk_recv_queue upper limit for connectionless sockets has empirically
turned out to be too low. When we double the current limit we get much
fewer rejected messages and no noticable negative side-effects.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
 net/tipc/socket.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 4d6a448..65a6bfc 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -1,7 +1,7 @@
 /*
  * net/tipc/socket.c: TIPC socket API
  *
- * Copyright (c) 2001-2007, Ericsson AB
+ * Copyright (c) 2001-2007, 2012 Ericsson AB
  * Copyright (c) 2004-2008, 2010-2012, Wind River Systems
  * All rights reserved.
  *
@@ -43,7 +43,7 @@
 #define SS_LISTENING	-1	/* socket is listening */
 #define SS_READY	-2	/* socket is connectionless */
 
-#define OVERLOAD_LIMIT_BASE	5000
+#define OVERLOAD_LIMIT_BASE	10000
 #define CONN_TIMEOUT_DEFAULT	8000	/* default connect timeout = 8s */
 
 struct tipc_sock {
-- 
1.7.12.1

^ permalink raw reply related

* [PATCH net-next 02/10] tipc: eliminate aggregate sk_receive_queue limit
From: Paul Gortmaker @ 2012-12-07 14:28 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Jon Maloy, Ying Xue, Paul Gortmaker
In-Reply-To: <1354890498-6448-1-git-send-email-paul.gortmaker@windriver.com>

From: Ying Xue <ying.xue@windriver.com>

As a complement to the per-socket sk_recv_queue limit, TIPC keeps a
global atomic counter for the sum of sk_recv_queue sizes across all
tipc sockets. When incremented, the counter is compared to an upper
threshold value, and if this is reached, the message is rejected
with error code TIPC_OVERLOAD.

This check was originally meant to protect the node against
buffer exhaustion and general CPU overload. However, all experience
indicates that the feature not only is redundant on Linux, but even
harmful. Users run into the limit very often, causing disturbances
for their applications, while removing it seems to have no negative
effects at all. We have also seen that overall performance is
boosted significantly when this bottleneck is removed.

Furthermore, we don't see any other network protocols maintaining
such a mechanism, something strengthening our conviction that this
control can be eliminated.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
 net/tipc/socket.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 1a720c8..a059ed0 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -2,7 +2,7 @@
  * net/tipc/socket.c: TIPC socket API
  *
  * Copyright (c) 2001-2007, Ericsson AB
- * Copyright (c) 2004-2008, 2010-2011, Wind River Systems
+ * Copyright (c) 2004-2008, 2010-2012, Wind River Systems
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
@@ -1241,11 +1241,6 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
 	}

 	/* Reject message if there isn't room to queue it */
-	recv_q_len = (u32)atomic_read(&tipc_queue_size);
-	if (unlikely(recv_q_len >= OVERLOAD_LIMIT_BASE)) {
-		if (rx_queue_full(msg, recv_q_len, OVERLOAD_LIMIT_BASE))
-			return TIPC_ERR_OVERLOAD;
-	}
 	recv_q_len = skb_queue_len(&sk->sk_receive_queue);
 	if (unlikely(recv_q_len >= (OVERLOAD_LIMIT_BASE / 2))) {
 		if (rx_queue_full(msg, recv_q_len, OVERLOAD_LIMIT_BASE / 2))
-- 
1.7.12.1

^ permalink raw reply related

* [PATCH net-next 05/10] tipc: standardize across connect/disconnect function naming
From: Paul Gortmaker @ 2012-12-07 14:28 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Jon Maloy, Paul Gortmaker
In-Reply-To: <1354890498-6448-1-git-send-email-paul.gortmaker@windriver.com>

Currently we have tipc_disconnect and tipc_disconnect_port.  It is
not clear from the names alone, what they do or how they differ.
It turns out that tipc_disconnect just deals with the port locking
and then calls tipc_disconnect_port which does all the work.

If we rename as follows: tipc_disconnect_port --> __tipc_disconnect
then we will be following typical linux convention, where:

   __tipc_disconnect: "raw" function that does all the work.

   tipc_disconnect: wrapper that deals with locking and then calls
		    the real core __tipc_disconnect function

With this, the difference is immediately evident, and locking
violations are more apt to be spotted by chance while working on,
or even just while reading the code.

On the connect side of things, we currently only have the single
"tipc_connect2port" function.  It does both the locking at enter/exit,
and the core of the work.  Pending changes will make it desireable to
have the connect be a two part locking wrapper + worker function,
just like the disconnect is already.

Here, we make the connect look just like the updated disconnect case,
for the above reason, and for consistency.  In the process, we also
get rid of the "2port" suffix that was on the original name, since
it adds no descriptive value.

On close examination, one might notice that the above connect
changes implicitly move the call to tipc_link_get_max_pkt() to be
within the scope of tipc_port_lock() protected region; when it was
not previously.  We don't see any issues with this, and it is in
keeping with __tipc_connect doing the work and tipc_connect just
handling the locking.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
 net/tipc/port.c   | 32 +++++++++++++++++++++++---------
 net/tipc/port.h   |  6 ++++--
 net/tipc/socket.c |  6 +++---
 net/tipc/subscr.c |  2 +-
 4 files changed, 31 insertions(+), 15 deletions(-)

diff --git a/net/tipc/port.c b/net/tipc/port.c
index 07c42fb..18098ca 100644
--- a/net/tipc/port.c
+++ b/net/tipc/port.c
@@ -726,7 +726,7 @@ static void port_dispatcher_sigh(void *dummy)
 				if (unlikely(!cb))
 					goto reject;
 				if (unlikely(!connected)) {
-					if (tipc_connect2port(dref, &orig))
+					if (tipc_connect(dref, &orig))
 						goto reject;
 				} else if (peer_invalid)
 					goto reject;
@@ -1036,15 +1036,30 @@ int tipc_withdraw(u32 ref, unsigned int scope, struct tipc_name_seq const *seq)
 	return res;
 }
 
-int tipc_connect2port(u32 ref, struct tipc_portid const *peer)
+int tipc_connect(u32 ref, struct tipc_portid const *peer)
 {
 	struct tipc_port *p_ptr;
-	struct tipc_msg *msg;
-	int res = -EINVAL;
+	int res;
 
 	p_ptr = tipc_port_lock(ref);
 	if (!p_ptr)
 		return -EINVAL;
+	res = __tipc_connect(ref, p_ptr, peer);
+	tipc_port_unlock(p_ptr);
+	return res;
+}
+
+/*
+ * __tipc_connect - connect to a remote peer
+ *
+ * Port must be locked.
+ */
+int __tipc_connect(u32 ref, struct tipc_port *p_ptr,
+			struct tipc_portid const *peer)
+{
+	struct tipc_msg *msg;
+	int res = -EINVAL;
+
 	if (p_ptr->published || p_ptr->connected)
 		goto exit;
 	if (!peer->ref)
@@ -1067,17 +1082,16 @@ int tipc_connect2port(u32 ref, struct tipc_portid const *peer)
 			  (net_ev_handler)port_handle_node_down);
 	res = 0;
 exit:
-	tipc_port_unlock(p_ptr);
 	p_ptr->max_pkt = tipc_link_get_max_pkt(peer->node, ref);
 	return res;
 }
 
-/**
- * tipc_disconnect_port - disconnect port from peer
+/*
+ * __tipc_disconnect - disconnect port from peer
  *
  * Port must be locked.
  */
-int tipc_disconnect_port(struct tipc_port *tp_ptr)
+int __tipc_disconnect(struct tipc_port *tp_ptr)
 {
 	int res;
 
@@ -1104,7 +1118,7 @@ int tipc_disconnect(u32 ref)
 	p_ptr = tipc_port_lock(ref);
 	if (!p_ptr)
 		return -EINVAL;
-	res = tipc_disconnect_port(p_ptr);
+	res = __tipc_disconnect(p_ptr);
 	tipc_port_unlock(p_ptr);
 	return res;
 }
diff --git a/net/tipc/port.h b/net/tipc/port.h
index 4660e30..fb66e2e 100644
--- a/net/tipc/port.h
+++ b/net/tipc/port.h
@@ -190,7 +190,7 @@ int tipc_publish(u32 portref, unsigned int scope,
 int tipc_withdraw(u32 portref, unsigned int scope,
 		struct tipc_name_seq const *name_seq);
 
-int tipc_connect2port(u32 portref, struct tipc_portid const *port);
+int tipc_connect(u32 portref, struct tipc_portid const *port);
 
 int tipc_disconnect(u32 portref);
 
@@ -200,7 +200,9 @@ int tipc_shutdown(u32 ref);
 /*
  * The following routines require that the port be locked on entry
  */
-int tipc_disconnect_port(struct tipc_port *tp_ptr);
+int __tipc_disconnect(struct tipc_port *tp_ptr);
+int __tipc_connect(u32 ref, struct tipc_port *p_ptr,
+		   struct tipc_portid const *peer);
 int tipc_port_peer_msg(struct tipc_port *p_ptr, struct tipc_msg *msg);
 
 /*
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 65a6bfc..4d56eae 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -791,7 +791,7 @@ static int auto_connect(struct socket *sock, struct tipc_msg *msg)
 
 	tsock->peer_name.ref = msg_origport(msg);
 	tsock->peer_name.node = msg_orignode(msg);
-	tipc_connect2port(tsock->p->ref, &tsock->peer_name);
+	tipc_connect(tsock->p->ref, &tsock->peer_name);
 	tipc_set_portimportance(tsock->p->ref, msg_importance(msg));
 	sock->state = SS_CONNECTED;
 	return 0;
@@ -1254,7 +1254,7 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
 	/* Initiate connection termination for an incoming 'FIN' */
 	if (unlikely(msg_errcode(msg) && (sock->state == SS_CONNECTED))) {
 		sock->state = SS_DISCONNECTING;
-		tipc_disconnect_port(tipc_sk_port(sk));
+		__tipc_disconnect(tipc_sk_port(sk));
 	}
 
 	sk->sk_data_ready(sk, 0);
@@ -1514,7 +1514,7 @@ static int accept(struct socket *sock, struct socket *new_sock, int flags)
 		/* Connect new socket to it's peer */
 		new_tsock->peer_name.ref = msg_origport(msg);
 		new_tsock->peer_name.node = msg_orignode(msg);
-		tipc_connect2port(new_ref, &new_tsock->peer_name);
+		tipc_connect(new_ref, &new_tsock->peer_name);
 		new_sock->state = SS_CONNECTED;
 
 		tipc_set_portimportance(new_ref, msg_importance(msg));
diff --git a/net/tipc/subscr.c b/net/tipc/subscr.c
index 0f7d0d0..6b42d47 100644
--- a/net/tipc/subscr.c
+++ b/net/tipc/subscr.c
@@ -462,7 +462,7 @@ static void subscr_named_msg_event(void *usr_handle,
 		kfree(subscriber);
 		return;
 	}
-	tipc_connect2port(subscriber->port_ref, orig);
+	tipc_connect(subscriber->port_ref, orig);
 
 	/* Lock server port (& save lock address for future use) */
 	subscriber->lock = tipc_port_lock(subscriber->port_ref)->lock;
-- 
1.7.12.1

^ permalink raw reply related

* [PATCH net-next 06/10] tipc: consolidate connection-oriented message reception in one function
From: Paul Gortmaker @ 2012-12-07 14:28 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Jon Maloy, Ying Xue, Paul Gortmaker
In-Reply-To: <1354890498-6448-1-git-send-email-paul.gortmaker@windriver.com>

From: Ying Xue <ying.xue@windriver.com>

Handling of connection-related message reception is currently scattered
around at different places in the code. This makes it harder to verify
that things are handled correctly in all possible scenarios.
So we consolidate the existing processing of connection-oriented
message reception in a single routine.  In the process, we convert the
chain of if/else into a switch/case for improved readability.

A cast on the socket_state in the switch is needed to avoid compile
warnings on 32 bit, like "net/tipc/socket.c:1252:2: warning: case value
‘4294967295’ not in enumerated type".  This happens because existing
tipc code pseudo extends the default linux socket state values with:

	#define SS_LISTENING    -1      /* socket is listening */
	#define SS_READY        -2      /* socket is connectionless */

It may make sense to add these as _positive_ values to the existing
socket state enum list someday, vs. these already existing defines.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
[PG: add cast to fix warning; remove returns from middle of switch]
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
 net/tipc/socket.c | 75 +++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 51 insertions(+), 24 deletions(-)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 4d56eae..fc0aa4f 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -1192,6 +1192,53 @@ static int rx_queue_full(struct tipc_msg *msg, u32 queue_size, u32 base)
 }
 
 /**
+ * filter_connect - Handle all incoming messages for a connection-based socket
+ * @tsock: TIPC socket
+ * @msg: message
+ *
+ * Returns TIPC error status code and socket error status code
+ * once it encounters some errors
+ */
+static u32 filter_connect(struct tipc_sock *tsock, struct sk_buff **buf)
+{
+	struct socket *sock = tsock->sk.sk_socket;
+	struct tipc_msg *msg = buf_msg(*buf);
+	u32 retval = TIPC_ERR_NO_PORT;
+
+	if (msg_mcast(msg))
+		return retval;
+
+	switch ((int)sock->state) {
+	case SS_CONNECTED:
+		/* Accept only connection-based messages sent by peer */
+		if (msg_connected(msg) && tipc_port_peer_msg(tsock->p, msg)) {
+			if (unlikely(msg_errcode(msg))) {
+				sock->state = SS_DISCONNECTING;
+				__tipc_disconnect(tsock->p);
+			}
+			retval = TIPC_OK;
+		}
+		break;
+	case SS_CONNECTING:
+		/* Accept only ACK or NACK message */
+		if (msg_connected(msg) || (msg_errcode(msg)))
+			retval = TIPC_OK;
+		break;
+	case SS_LISTENING:
+	case SS_UNCONNECTED:
+		/* Accept only SYN message */
+		if (!msg_connected(msg) && !(msg_errcode(msg)))
+			retval = TIPC_OK;
+		break;
+	case SS_DISCONNECTING:
+		break;
+	default:
+		pr_err("Unknown socket state %u\n", sock->state);
+	}
+	return retval;
+}
+
+/**
  * filter_rcv - validate incoming message
  * @sk: socket
  * @buf: message
@@ -1208,6 +1255,7 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
 	struct socket *sock = sk->sk_socket;
 	struct tipc_msg *msg = buf_msg(buf);
 	u32 recv_q_len;
+	u32 res = TIPC_OK;
 
 	/* Reject message if it is wrong sort of message for socket */
 	if (msg_type(msg) > TIPC_DIRECT_MSG)
@@ -1226,24 +1274,9 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
 				return TIPC_ERR_OVERLOAD;
 		}
 	} else {
-		if (msg_mcast(msg))
-			return TIPC_ERR_NO_PORT;
-		if (sock->state == SS_CONNECTED) {
-			if (!msg_connected(msg) ||
-			    !tipc_port_peer_msg(tipc_sk_port(sk), msg))
-				return TIPC_ERR_NO_PORT;
-		} else if (sock->state == SS_CONNECTING) {
-			if (!msg_connected(msg) && (msg_errcode(msg) == 0))
-				return TIPC_ERR_NO_PORT;
-		} else if (sock->state == SS_LISTENING) {
-			if (msg_connected(msg) || msg_errcode(msg))
-				return TIPC_ERR_NO_PORT;
-		} else if (sock->state == SS_DISCONNECTING) {
-			return TIPC_ERR_NO_PORT;
-		} else /* (sock->state == SS_UNCONNECTED) */ {
-			if (msg_connected(msg) || msg_errcode(msg))
-				return TIPC_ERR_NO_PORT;
-		}
+		res = filter_connect(tipc_sk(sk), &buf);
+		if (res != TIPC_OK || buf == NULL)
+			return res;
 	}
 
 	/* Enqueue message (finally!) */
@@ -1251,12 +1284,6 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
 	atomic_inc(&tipc_queue_size);
 	__skb_queue_tail(&sk->sk_receive_queue, buf);
 
-	/* Initiate connection termination for an incoming 'FIN' */
-	if (unlikely(msg_errcode(msg) && (sock->state == SS_CONNECTED))) {
-		sock->state = SS_DISCONNECTING;
-		__tipc_disconnect(tipc_sk_port(sk));
-	}
-
 	sk->sk_data_ready(sk, 0);
 	return TIPC_OK;
 }
-- 
1.7.12.1

^ permalink raw reply related

* [PATCH net-next 08/10] tipc: eliminate connection setup for implied connect in recv_msg()
From: Paul Gortmaker @ 2012-12-07 14:28 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Jon Maloy, Ying Xue, Paul Gortmaker
In-Reply-To: <1354890498-6448-1-git-send-email-paul.gortmaker@windriver.com>

From: Ying Xue <ying.xue@windriver.com>

As connection setup is now completed asynchronously in BH context,
in the function filter_connect(), the corresponding code in recv_msg()
becomes redundant.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
 net/tipc/socket.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 1ba3b6f..0df42fa 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -954,13 +954,6 @@ restart:
 	sz = msg_data_sz(msg);
 	err = msg_errcode(msg);
 
-	/* Complete connection setup for an implied connect */
-	if (unlikely(sock->state == SS_CONNECTING)) {
-		res = auto_connect(sock, msg);
-		if (res)
-			goto exit;
-	}
-
 	/* Discard an empty non-errored message & try again */
 	if ((!sz) && (!err)) {
 		advance_rx_queue(sk);
-- 
1.7.12.1

^ permalink raw reply related

* [PATCH net-next 07/10] tipc: introduce non-blocking socket connect
From: Paul Gortmaker @ 2012-12-07 14:28 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Jon Maloy, Ying Xue, Paul Gortmaker
In-Reply-To: <1354890498-6448-1-git-send-email-paul.gortmaker@windriver.com>

From: Ying Xue <ying.xue@windriver.com>

TIPC has so far only supported blocking connect(), meaning that a call
to connect() doesn't return until either the connection is fully
established, or an error occurs. This has proved insufficient for many
users, so we now introduce non-blocking connect(), analogous to how
this is done in TCP and other protocols.

With this feature, if a connection cannot be established instantly,
connect() will return the error code "-EINPROGRESS".
If the user later calls connect() again, he will either have the
return code "-EALREADY" or "-EISCONN", depending on whether the
connection has been established or not.

The user must have explicitly set the socket to be non-blocking
(SOCK_NONBLOCK or O_NONBLOCK, depending on method used), so unless
for some reason they had set this already (the socket would anyway
remain blocking in current TIPC) this change should be completely
backwards compatible.

It is also now possible to call select() or poll() to wait for the
completion of a connection.

An effect of the above is that the actual completion of a connection
may now be performed asynchronously, independent of the calls from
user space. Therefore, we now execute this code in BH context, in
the function filter_rcv(), which is executed upon reception of
messages in the socket.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
[PG: minor refactoring for improved connect/disconnect function names]
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
 net/tipc/socket.c | 158 ++++++++++++++++++++++++++++++++----------------------
 1 file changed, 93 insertions(+), 65 deletions(-)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index fc0aa4f..1ba3b6f 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -783,16 +783,19 @@ exit:
 static int auto_connect(struct socket *sock, struct tipc_msg *msg)
 {
 	struct tipc_sock *tsock = tipc_sk(sock->sk);
-
-	if (msg_errcode(msg)) {
-		sock->state = SS_DISCONNECTING;
-		return -ECONNREFUSED;
-	}
+	struct tipc_port *p_ptr;
 
 	tsock->peer_name.ref = msg_origport(msg);
 	tsock->peer_name.node = msg_orignode(msg);
-	tipc_connect(tsock->p->ref, &tsock->peer_name);
-	tipc_set_portimportance(tsock->p->ref, msg_importance(msg));
+	p_ptr = tipc_port_deref(tsock->p->ref);
+	if (!p_ptr)
+		return -EINVAL;
+
+	__tipc_connect(tsock->p->ref, p_ptr, &tsock->peer_name);
+
+	if (msg_importance(msg) > TIPC_CRITICAL_IMPORTANCE)
+		return -EINVAL;
+	msg_set_importance(&p_ptr->phdr, (u32)msg_importance(msg));
 	sock->state = SS_CONNECTED;
 	return 0;
 }
@@ -1203,7 +1206,9 @@ static u32 filter_connect(struct tipc_sock *tsock, struct sk_buff **buf)
 {
 	struct socket *sock = tsock->sk.sk_socket;
 	struct tipc_msg *msg = buf_msg(*buf);
+	struct sock *sk = &tsock->sk;
 	u32 retval = TIPC_ERR_NO_PORT;
+	int res;
 
 	if (msg_mcast(msg))
 		return retval;
@@ -1221,8 +1226,36 @@ static u32 filter_connect(struct tipc_sock *tsock, struct sk_buff **buf)
 		break;
 	case SS_CONNECTING:
 		/* Accept only ACK or NACK message */
-		if (msg_connected(msg) || (msg_errcode(msg)))
+		if (unlikely(msg_errcode(msg))) {
+			sock->state = SS_DISCONNECTING;
+			sk->sk_err = -ECONNREFUSED;
+			retval = TIPC_OK;
+			break;
+		}
+
+		if (unlikely(!msg_connected(msg)))
+			break;
+
+		res = auto_connect(sock, msg);
+		if (res) {
+			sock->state = SS_DISCONNECTING;
+			sk->sk_err = res;
 			retval = TIPC_OK;
+			break;
+		}
+
+		/* If an incoming message is an 'ACK-', it should be
+		 * discarded here because it doesn't contain useful
+		 * data. In addition, we should try to wake up
+		 * connect() routine if sleeping.
+		 */
+		if (msg_data_sz(msg) == 0) {
+			kfree_skb(*buf);
+			*buf = NULL;
+			if (waitqueue_active(sk_sleep(sk)))
+				wake_up_interruptible(sk_sleep(sk));
+		}
+		retval = TIPC_OK;
 		break;
 	case SS_LISTENING:
 	case SS_UNCONNECTED:
@@ -1369,8 +1402,6 @@ static int connect(struct socket *sock, struct sockaddr *dest, int destlen,
 	struct sock *sk = sock->sk;
 	struct sockaddr_tipc *dst = (struct sockaddr_tipc *)dest;
 	struct msghdr m = {NULL,};
-	struct sk_buff *buf;
-	struct tipc_msg *msg;
 	unsigned int timeout;
 	int res;
 
@@ -1382,26 +1413,6 @@ static int connect(struct socket *sock, struct sockaddr *dest, int destlen,
 		goto exit;
 	}
 
-	/* For now, TIPC does not support the non-blocking form of connect() */
-	if (flags & O_NONBLOCK) {
-		res = -EOPNOTSUPP;
-		goto exit;
-	}
-
-	/* Issue Posix-compliant error code if socket is in the wrong state */
-	if (sock->state == SS_LISTENING) {
-		res = -EOPNOTSUPP;
-		goto exit;
-	}
-	if (sock->state == SS_CONNECTING) {
-		res = -EALREADY;
-		goto exit;
-	}
-	if (sock->state != SS_UNCONNECTED) {
-		res = -EISCONN;
-		goto exit;
-	}
-
 	/*
 	 * Reject connection attempt using multicast address
 	 *
@@ -1413,49 +1424,66 @@ static int connect(struct socket *sock, struct sockaddr *dest, int destlen,
 		goto exit;
 	}
 
-	/* Reject any messages already in receive queue (very unlikely) */
-	reject_rx_queue(sk);
+	timeout = (flags & O_NONBLOCK) ? 0 : tipc_sk(sk)->conn_timeout;
+
+	switch (sock->state) {
+	case SS_UNCONNECTED:
+		/* Send a 'SYN-' to destination */
+		m.msg_name = dest;
+		m.msg_namelen = destlen;
+
+		/* If connect is in non-blocking case, set MSG_DONTWAIT to
+		 * indicate send_msg() is never blocked.
+		 */
+		if (!timeout)
+			m.msg_flags = MSG_DONTWAIT;
+
+		res = send_msg(NULL, sock, &m, 0);
+		if ((res < 0) && (res != -EWOULDBLOCK))
+			goto exit;
 
-	/* Send a 'SYN-' to destination */
-	m.msg_name = dest;
-	m.msg_namelen = destlen;
-	res = send_msg(NULL, sock, &m, 0);
-	if (res < 0)
+		/* Just entered SS_CONNECTING state; the only
+		 * difference is that return value in non-blocking
+		 * case is EINPROGRESS, rather than EALREADY.
+		 */
+		res = -EINPROGRESS;
+		break;
+	case SS_CONNECTING:
+		res = -EALREADY;
+		break;
+	case SS_CONNECTED:
+		res = -EISCONN;
+		break;
+	default:
+		res = -EINVAL;
 		goto exit;
+	}
 
-	/* Wait until an 'ACK' or 'RST' arrives, or a timeout occurs */
-	timeout = tipc_sk(sk)->conn_timeout;
-	release_sock(sk);
-	res = wait_event_interruptible_timeout(*sk_sleep(sk),
-			(!skb_queue_empty(&sk->sk_receive_queue) ||
-			(sock->state != SS_CONNECTING)),
-			timeout ? (long)msecs_to_jiffies(timeout)
-				: MAX_SCHEDULE_TIMEOUT);
-	lock_sock(sk);
+	if (sock->state == SS_CONNECTING) {
+		if (!timeout)
+			goto exit;
 
-	if (res > 0) {
-		buf = skb_peek(&sk->sk_receive_queue);
-		if (buf != NULL) {
-			msg = buf_msg(buf);
-			res = auto_connect(sock, msg);
-			if (!res) {
-				if (!msg_data_sz(msg))
-					advance_rx_queue(sk);
-			}
-		} else {
-			if (sock->state == SS_CONNECTED)
-				res = -EISCONN;
+		/* Wait until an 'ACK' or 'RST' arrives, or a timeout occurs */
+		release_sock(sk);
+		res = wait_event_interruptible_timeout(*sk_sleep(sk),
+				sock->state != SS_CONNECTING,
+				timeout ? (long)msecs_to_jiffies(timeout)
+					: MAX_SCHEDULE_TIMEOUT);
+		lock_sock(sk);
+		if (res <= 0) {
+			if (res == 0)
+				res = -ETIMEDOUT;
 			else
-				res = -ECONNREFUSED;
+				; /* leave "res" unchanged */
+			goto exit;
 		}
-	} else {
-		if (res == 0)
-			res = -ETIMEDOUT;
-		else
-			; /* leave "res" unchanged */
-		sock->state = SS_DISCONNECTING;
 	}
 
+	if (unlikely(sock->state == SS_DISCONNECTING))
+		res = sock_error(sk);
+	else
+		res = 0;
+
 exit:
 	release_sock(sk);
 	return res;
-- 
1.7.12.1

^ permalink raw reply related

* [PATCH net-next 09/10] tipc: add lock nesting notation to quiet lockdep warning
From: Paul Gortmaker @ 2012-12-07 14:28 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Jon Maloy, Ying Xue, Paul Gortmaker
In-Reply-To: <1354890498-6448-1-git-send-email-paul.gortmaker@windriver.com>

From: Ying Xue <ying.xue@windriver.com>

TIPC accept() call grabs the socket lock on a newly allocated
socket while holding the socket lock on an old socket. But lockdep
worries that this might be a recursive lock attempt:

  [ INFO: possible recursive locking detected ]
  ---------------------------------------------
  kworker/u:0/6 is trying to acquire lock:
  (sk_lock-AF_TIPC){+.+.+.}, at: [<c8c1226c>] accept+0x15c/0x310 [tipc]

  but task is already holding lock:
  (sk_lock-AF_TIPC){+.+.+.}, at: [<c8c12138>] accept+0x28/0x310 [tipc]

  other info that might help us debug this:
  Possible unsafe locking scenario:

          CPU0
          ----
          lock(sk_lock-AF_TIPC);
          lock(sk_lock-AF_TIPC);

          *** DEADLOCK ***

  May be due to missing lock nesting notation
  [...]

Tell lockdep that this locking is safe by using lock_sock_nested().
This is similar to what was done in commit 5131a184a3458d9 for
SCTP code ("SCTP: lock_sock_nested in sctp_sock_migrate").

Also note that this is isn't something that is seen normally,
as it was uncovered with some experimental work-in-progress
code not yet ready for mainline.  So no need for stable
backports or similar of this commit.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
 net/tipc/socket.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 0df42fa..38613cf 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -1551,7 +1551,8 @@ static int accept(struct socket *sock, struct socket *new_sock, int flags)
 		u32 new_ref = new_tport->ref;
 		struct tipc_msg *msg = buf_msg(buf);
 
-		lock_sock(new_sk);
+		/* we lock on new_sk; but lockdep sees the lock on sk */
+		lock_sock_nested(new_sk, SINGLE_DEPTH_NESTING);
 
 		/*
 		 * Reject any stray messages received by new socket
-- 
1.7.12.1

^ permalink raw reply related

* [PATCH net-next 10/10] tipc: refactor accept() code for improved readability
From: Paul Gortmaker @ 2012-12-07 14:28 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Jon Maloy, Paul Gortmaker
In-Reply-To: <1354890498-6448-1-git-send-email-paul.gortmaker@windriver.com>

In TIPC's accept() routine, there is a large block of code relating
to initialization of a new socket, all within an if condition checking
if the allocation succeeded.

Here, we simply factor out that init code within the accept() function
to its own separate function, which improves readability, and makes
it easier to track which lock_sock() calls are operating on existing
vs. new sockets.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
 net/tipc/socket.c | 93 +++++++++++++++++++++++++++++--------------------------
 1 file changed, 49 insertions(+), 44 deletions(-)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 38613cf..56661c8 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -1507,6 +1507,53 @@ static int listen(struct socket *sock, int len)
 	return res;
 }
 
+static void tipc_init_socket(struct sock *sk, struct socket *new_sock,
+			     struct sk_buff *buf)
+{
+
+	struct sock *new_sk = new_sock->sk;
+	struct tipc_sock *new_tsock = tipc_sk(new_sk);
+	struct tipc_port *new_tport = new_tsock->p;
+	u32 new_ref = new_tport->ref;
+	struct tipc_msg *msg = buf_msg(buf);
+
+	/* we lock on new_sk; but lockdep sees accept's lock on sk */
+	lock_sock_nested(new_sk, SINGLE_DEPTH_NESTING);
+
+	/*
+	 * Reject any stray messages received by new socket
+	 * before the socket lock was taken (very, very unlikely)
+	 */
+	reject_rx_queue(new_sk);
+
+	/* Connect new socket to it's peer */
+	new_tsock->peer_name.ref = msg_origport(msg);
+	new_tsock->peer_name.node = msg_orignode(msg);
+	tipc_connect(new_ref, &new_tsock->peer_name);
+	new_sock->state = SS_CONNECTED;
+
+	tipc_set_portimportance(new_ref, msg_importance(msg));
+	if (msg_named(msg)) {
+		new_tport->conn_type = msg_nametype(msg);
+		new_tport->conn_instance = msg_nameinst(msg);
+	}
+
+	/*
+	 * Respond to 'SYN-' by discarding it & returning 'ACK'-.
+	 * Respond to 'SYN+' by queuing it on new socket.
+	 */
+	if (!msg_data_sz(msg)) {
+		struct msghdr m = {NULL,};
+
+		advance_rx_queue(sk);
+		send_packet(NULL, new_sock, &m, 0);
+	} else {
+		__skb_dequeue(&sk->sk_receive_queue);
+		__skb_queue_head(&new_sk->sk_receive_queue, buf);
+	}
+	release_sock(new_sk);
+}
+
 /**
  * accept - wait for connection request
  * @sock: listening socket
@@ -1542,51 +1589,9 @@ static int accept(struct socket *sock, struct socket *new_sock, int flags)
 	}
 
 	buf = skb_peek(&sk->sk_receive_queue);
-
 	res = tipc_create(sock_net(sock->sk), new_sock, 0, 0);
-	if (!res) {
-		struct sock *new_sk = new_sock->sk;
-		struct tipc_sock *new_tsock = tipc_sk(new_sk);
-		struct tipc_port *new_tport = new_tsock->p;
-		u32 new_ref = new_tport->ref;
-		struct tipc_msg *msg = buf_msg(buf);
-
-		/* we lock on new_sk; but lockdep sees the lock on sk */
-		lock_sock_nested(new_sk, SINGLE_DEPTH_NESTING);
-
-		/*
-		 * Reject any stray messages received by new socket
-		 * before the socket lock was taken (very, very unlikely)
-		 */
-		reject_rx_queue(new_sk);
-
-		/* Connect new socket to it's peer */
-		new_tsock->peer_name.ref = msg_origport(msg);
-		new_tsock->peer_name.node = msg_orignode(msg);
-		tipc_connect(new_ref, &new_tsock->peer_name);
-		new_sock->state = SS_CONNECTED;
-
-		tipc_set_portimportance(new_ref, msg_importance(msg));
-		if (msg_named(msg)) {
-			new_tport->conn_type = msg_nametype(msg);
-			new_tport->conn_instance = msg_nameinst(msg);
-		}
-
-		/*
-		 * Respond to 'SYN-' by discarding it & returning 'ACK'-.
-		 * Respond to 'SYN+' by queuing it on new socket.
-		 */
-		if (!msg_data_sz(msg)) {
-			struct msghdr m = {NULL,};
-
-			advance_rx_queue(sk);
-			send_packet(NULL, new_sock, &m, 0);
-		} else {
-			__skb_dequeue(&sk->sk_receive_queue);
-			__skb_queue_head(&new_sk->sk_receive_queue, buf);
-		}
-		release_sock(new_sk);
-	}
+	if (!res)
+		tipc_init_socket(sk, new_sock, buf);
 exit:
 	release_sock(sk);
 	return res;
-- 
1.7.12.1

^ permalink raw reply related

* Re: [PATCH RFC 0/5] Containerize syslog
From: Glauber Costa @ 2012-12-07 14:30 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: Rui Xiang, netdev-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Eric W. Biederman
In-Reply-To: <20121207142331.GC4004@sergelap>

On 12/07/2012 06:23 PM, Serge Hallyn wrote:
> Quoting Andrew Morton (akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org):
>> On Mon, 19 Nov 2012 01:51:09 -0800 ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman) wrote:
>>
>>> Are there any kernel print statements besides networking stack printks
>>> that we want to move to show up in a new "kernel log" namespace?
>>
>> That's a good question, and afaict it remains unanswered.
> 
> There are some other (not *terribly* compelling) cases.  For instance
> selinux hooks, if you say mount an fs without xattr support or with
> unsupported options, will printk a warning.  Things like stat.c and
> capabilities and syslog print out warnings when userspace uses a
> deprecated somethingorother - old stat syscall or sys_syslog without
> CAP_SYSLOG.  That should go to the container.  Filesystems may give
> warnings (bad mount options for tmpfs, bad uid owner for many of them,
> etc) which belong in the container.  Obviously some belong on the host -
> if they show a corrupt superblock which may indicate an attempt by the
> container to crash the kernel.
> 
>> As so often happens, this patchset's changelogs forgot to describe the
>> reason for the existence of this patchset.  Via a bit of lwn reading
> 
> Not as a separate justification admittedly, but the description was
> meant to explain it:  right now /dev/kmsg and sys_syslog are not safe
> and useful in a container;  syslog messages from host and containers
> can be confusingly intermixed;  and helpful printks are not seen in
> the container.
> 
>> and my awesome telepathic skills, I divine that something in networking
>> is using syslog for kernel->userspace communications.
>>
>> wtf?
> 
> Well, syslog is the kernel->userspace channel of last resort.
> 
>> Wouldn't it be better to just stop doing that, and to implement a
>> respectable and reliable kernel->userspace messaging scheme?
> 
> Convenience functions on top of netlink?
> 
>> And leave syslog alone - it's a crude low-level thing for random
>> unexpected things which operators might want to know about.
> 
> That sentence is a result of not calling a container admin an operator.
> I can't argue it because I'm not sure whether to agree with that
> classification.
> 

I keep asking myself if it isn't the case of forwarding to a container
all messages printed in process context. That will obviously exclude all
messages resulting from kthreads - that will always be in the initial
namespace anyway, interrupts, etc. There is no harm, for instance, in
delivering the same message twice: one to the container, and the other
to the host system.

Isn't it natural that if the kernel printed something on behalf of a
process, whoever is the admin of the machine that process lives on
should see what it is about?

^ permalink raw reply

* Re: [PATCH net-next] rps: overflow prevention for saturated cpus
From: Ben Hutchings @ 2012-12-07 14:51 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: netdev, davem, edumazet, therbert
In-Reply-To: <1354826194-9289-1-git-send-email-willemb@google.com>

On Thu, 2012-12-06 at 15:36 -0500, Willem de Bruijn wrote:
> RPS and RFS balance load across cpus with flow affinity. This can
> cause local bottlenecks, where a small number or single large flow
> (DoS) can saturate one CPU while others are idle.
> 
> This patch maintains flow affinity in normal conditions, but
> trades it for throughput when a cpu becomes saturated. Then, packets
> destined to that cpu (only) are redirected to the lightest loaded cpu
> in the rxqueue's rps_map. This breaks flow affinity under high load
> for some flows, in favor of processing packets up to the capacity
> of the complete rps_map cpuset in all circumstances.
[...]
> --- a/Documentation/networking/scaling.txt
> +++ b/Documentation/networking/scaling.txt
> @@ -135,6 +135,18 @@ packets have been queued to their backlog queue. The IPI wakes backlog
>  processing on the remote CPU, and any queued packets are then processed
>  up the networking stack.
>  
> +==== RPS Overflow Protection
> +
> +By selecting the same cpu from the cpuset for each packet in the same
> +flow, RPS will cause load imbalance when input flows are not uniformly
> +random. In the extreme case, a single flow, all packets are handled on a
> +single CPU, which limits the throughput of the machine to the throughput
> +of that CPU. RPS has optional overflow protection, which disables flow
> +affinity when an RPS CPU becomes saturated: during overload, its packets
> +will be sent to the least loaded other CPU in the RPS cpuset. To enable
> +this option, set sysctl net.core.netdev_max_rps_backlog to be smaller than
> +net.core.netdev_max_backlog. Setting it to half is a reasonable heuristic.
[...]

This only seems to be suitable for specialised applications where a high
degree of reordering is tolerable.  This documentation should make that
very clear.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* [RFC PATCH] dynamic_queue_limit.h: Make the struct ___cacheline_aligned_on_smp
From: Joe Perches @ 2012-12-07 14:58 UTC (permalink / raw)
  To: Tom Herbert; +Cc: David Miller, Eric Dumazet, netdev

Given that the struct will always have limit at the start of
a cacheline, why not make  struct ___cacheline_aligned_on_smp
and make limit the first member?

It could make other structs that use struct dql a bit more
predictable or efficient to pack.

(netdev_queue is size reduced from 256 to 192 on x86-32)

---

 include/linux/dynamic_queue_limits.h |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/linux/dynamic_queue_limits.h b/include/linux/dynamic_queue_limits.h
index 5621547..2d20b22 100644
--- a/include/linux/dynamic_queue_limits.h
+++ b/include/linux/dynamic_queue_limits.h
@@ -38,6 +38,8 @@
 #ifdef __KERNEL__
 
 struct dql {
+	unsigned int	limit; /* Must be first field - Current limit */
+
 	/* Fields accessed in enqueue path (dql_queued) */
 	unsigned int	num_queued;		/* Total ever queued */
 	unsigned int	adj_limit;		/* limit + num_completed */
@@ -45,7 +47,6 @@ struct dql {
 
 	/* Fields accessed only by completion path (dql_completed) */
 
-	unsigned int	limit ____cacheline_aligned_in_smp; /* Current limit */
 	unsigned int	num_completed;		/* Total ever completed */
 
 	unsigned int	prev_ovlimit;		/* Previous over limit */
@@ -59,7 +60,7 @@ struct dql {
 	unsigned int	max_limit;		/* Max limit */
 	unsigned int	min_limit;		/* Minimum limit */
 	unsigned int	slack_hold_time;	/* Time to measure slack */
-};
+} ____cacheline_aligned_in_smp;
 
 /* Set some static maximums */
 #define DQL_MAX_OBJECT (UINT_MAX / 16)

^ permalink raw reply related

* Re: [patch net-next] net: call notifiers for mtu change even if iface is not up
From: Neil Horman @ 2012-12-07 15:07 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: netdev, davem, edumazet, bhutchings, psimerda
In-Reply-To: <20121207122920.GB1631@minipsycho.zyxel.com>

On Fri, Dec 07, 2012 at 01:29:20PM +0100, Jiri Pirko wrote:
> Mon, Dec 03, 2012 at 04:22:50PM CET, nhorman@tuxdriver.com wrote:
> >On Mon, Dec 03, 2012 at 03:22:29PM +0100, Jiri Pirko wrote:
> >> Mon, Dec 03, 2012 at 03:18:23PM CET, nhorman@tuxdriver.com wrote:
> >> >On Mon, Dec 03, 2012 at 12:16:32PM +0100, Jiri Pirko wrote:
> >> >> Do the same thing as in set mac. Call notifiers every time.
> >> >> 
> >> >> Signed-off-by: Jiri Pirko <jiri@resnulli.us>
> >> >> ---
> >> >>  net/core/dev.c | 2 +-
> >> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >> >> 
> >> >> diff --git a/net/core/dev.c b/net/core/dev.c
> >> >> index 2f94df2..0685a72 100644
> >> >> --- a/net/core/dev.c
> >> >> +++ b/net/core/dev.c
> >> >> @@ -4971,7 +4971,7 @@ int dev_set_mtu(struct net_device *dev, int new_mtu)
> >> >>  	else
> >> >>  		dev->mtu = new_mtu;
> >> >>  
> >> >> -	if (!err && dev->flags & IFF_UP)
> >> >> +	if (!err)
> >> >>  		call_netdevice_notifiers(NETDEV_CHANGEMTU, dev);
> >> >>  	return err;
> >> >>  }
> >> >
> >> >I'm not opposed to this change, but is there something that it expressly fixes?
> >> 
> >> This is about a consistency. To have the same behaviour as set_mac
> >> for example.
> >> 
> >> >While it doesn't hurt to send around mtu change events, one would presume that
> >> >listeners would pick up mtu changes when the NETDEV_UP event went' around.
> >> >
> >> >Neil
> >> >
> >I figured, I just wanted to be sure that there wasn't a bug fix that needed
> >documenting in the changelog, or that listeners for this event didn't expect the
> >interface to already be up.  It looks pretty clean, with the exception of the
> >TIPC protocol.  It appears that it might require enable_bearer to be called
> >prior to any netdevice CHANGEMTU events, so that the eth_bearers array gets
> >filled out, and it looks like that has to get triggered either by user space, or
> >the receipt of a message over the network.  Let me know what you think
> 
> I'm looking at the code. What is the difference between CHANGEMTU and
> CHANGEADDR here? It looks to me that it should work for both in the same
> way.
> 
Hmm, they do work in the same way, which likely means that tipc is probably
begging to oops there regardless of having your patch or not :)

I'll get around to fixing that shortly.  Since your patch then isn't going to
break anything that isn't already broken
Acked-by: Neil Horman <nhorman@tuxdriver.com>

> 
> >
> >Neil
> >
> >> >--
> >> >To unsubscribe from this list: send the line "unsubscribe netdev" in
> >> >the body of a message to majordomo@vger.kernel.org
> >> >More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe netdev" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> 
> 

^ permalink raw reply

* Re: [PATCH 2/4] net: remove obsolete simple_strto<foo>
From: Neil Horman @ 2012-12-07 15:22 UTC (permalink / raw)
  To: Abhijit Pawar
  Cc: David S. Miller, Pablo Neira Ayuso, Patrick McHardy,
	Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	John W. Linville, Johannes Berg, Cong Wang, Eric Dumazet,
	Joe Perches, netdev, linux-kernel, netfilter-devel, netfilter,
	coreteam, linux-wireless
In-Reply-To: <1354880998-23417-1-git-send-email-abhi.c.pawar@gmail.com>

On Fri, Dec 07, 2012 at 05:19:58PM +0530, Abhijit Pawar wrote:
> This patch replace the obsolete simple_strto<foo> with kstrto<foo>
> 
> Signed-off-by: Abhijit Pawar <abhi.c.pawar@gmail.com>
> ---
>  net/core/netpoll.c                 |    9 +++++++--
>  net/ipv4/netfilter/ipt_CLUSTERIP.c |    9 +++++++--
>  net/mac80211/debugfs_sta.c         |    4 +++-
>  net/netfilter/nf_conntrack_core.c  |    6 ++++--
>  4 files changed, 21 insertions(+), 7 deletions(-)
> 
> diff --git a/net/core/netpoll.c b/net/core/netpoll.c
> index 77a0388..596b127 100644
> --- a/net/core/netpoll.c
> +++ b/net/core/netpoll.c
> @@ -668,13 +668,16 @@ EXPORT_SYMBOL(netpoll_print_options);
>  
>  int netpoll_parse_options(struct netpoll *np, char *opt)
>  {
> +	int rc;
>  	char *cur=opt, *delim;
>  
>  	if (*cur != '@') {
>  		if ((delim = strchr(cur, '@')) == NULL)
>  			goto parse_failed;
>  		*delim = 0;
> -		np->local_port = simple_strtol(cur, NULL, 10);
> +		rc = kstrtol(cur, 10, &np->local_port);
> +		if (rc)
> +			goto parse_failed;
Perhaps consolidate this to:
if (kstrtol(cur, 10, &np->local_port)
	goto parse_failed

Then you don't have to declare the new stack variable

>  		cur = delim;
>  	}
>  	cur++;
> @@ -705,7 +708,9 @@ int netpoll_parse_options(struct netpoll *np, char *opt)
>  		*delim = 0;
>  		if (*cur == ' ' || *cur == '\t')
>  			np_info(np, "warning: whitespace is not allowed\n");
> -		np->remote_port = simple_strtol(cur, NULL, 10);
> +		rc = kstrtol(cur, 10, &np->remote_port);
> +		if (rc)
> +			goto parse_failed;
>  		cur = delim;
Ditto

>  	}
>  	cur++;
> diff --git a/net/ipv4/netfilter/ipt_CLUSTERIP.c b/net/ipv4/netfilter/ipt_CLUSTERIP.c
> index fe5daea..55e7b73 100644
> --- a/net/ipv4/netfilter/ipt_CLUSTERIP.c
> +++ b/net/ipv4/netfilter/ipt_CLUSTERIP.c
> @@ -661,6 +661,7 @@ static ssize_t clusterip_proc_write(struct file *file, const char __user *input,
>  #define PROC_WRITELEN	10
>  	char buffer[PROC_WRITELEN+1];
>  	unsigned long nodenum;
> +	int rc;
>  
>  	if (size > PROC_WRITELEN)
>  		return -EIO;
> @@ -669,11 +670,15 @@ static ssize_t clusterip_proc_write(struct file *file, const char __user *input,
>  	buffer[size] = 0;
>  
>  	if (*buffer == '+') {
> -		nodenum = simple_strtoul(buffer+1, NULL, 10);
> +		rc = kstrtoul(buffer+1, 10, &nodenum);
> +		if (rc)
> +			return -EINVAL;
>  		if (clusterip_add_node(c, nodenum))
>  			return -ENOMEM;
>  	} else if (*buffer == '-') {
> -		nodenum = simple_strtoul(buffer+1, NULL,10);
> +		rc = kstrtoul(buffer+1, 10, &nodenum);
> +		if (rc)
> +			return -EINVAL;
>  		if (clusterip_del_node(c, nodenum))
>  			return -ENOENT;
Same deal with the rc variable, although in this case it might make sense to
return rc if kstrtoul fails, instead of just filtering it all down to -EINVAL.

>  	} else
> diff --git a/net/mac80211/debugfs_sta.c b/net/mac80211/debugfs_sta.c
> index 89281d2..18754fd 100644
> --- a/net/mac80211/debugfs_sta.c
> +++ b/net/mac80211/debugfs_sta.c
> @@ -219,7 +219,9 @@ static ssize_t sta_agg_status_write(struct file *file, const char __user *userbu
>  	} else
>  		return -EINVAL;
>  
> -	tid = simple_strtoul(buf, NULL, 0);
> +	ret = kstrtoul(buf, 0, &tid);
> +	if (ret)
> +		return -EINVAL;
>  
>  	if (tid >= IEEE80211_NUM_TIDS)
>  		return -EINVAL;
> diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> index af17516..18ce24b 100644
> --- a/net/netfilter/nf_conntrack_core.c
> +++ b/net/netfilter/nf_conntrack_core.c
> @@ -1409,7 +1409,7 @@ EXPORT_SYMBOL_GPL(nf_ct_alloc_hashtable);
>  
>  int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp)
>  {
> -	int i, bucket;
> +	int i, bucket, rc;
>  	unsigned int hashsize, old_size;
>  	struct hlist_nulls_head *hash, *old_hash;
>  	struct nf_conntrack_tuple_hash *h;
> @@ -1422,7 +1422,9 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp)
>  	if (!nf_conntrack_htable_size)
>  		return param_set_uint(val, kp);
>  
> -	hashsize = simple_strtoul(val, NULL, 0);
> +	rc = kstrtouint(val, 0, &hashsize);
> +	if (rc)
> +		return -EINVAL;
>  	if (!hashsize)
>  		return -EINVAL;
>  
As above, these call points might benefit from returning rc rather than just
EINVAL.

Neil

> -- 
> 1.7.7.6
> 
> 

^ permalink raw reply

* Tx timestamp for packet mmap.
From: Paul Chavent @ 2012-12-07 15:28 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 1814 bytes --]

Hi.

I would like to be able to get tx timestamps of packets sent by the 
packet mmap interface...

Actually, I try to get them with the sample code below.

The problem is that it doesn't work without the joined patch.

I wonder if my current implementation is good. And if not, how should i 
get the timestamps ?

Wouldn't be a good idea to put timestamps in the ring buffer frame 
before give back the frame to the user ?

Thanks for your help and advices.

Paul.

8<-------------------------------


   struct timespec ts = {0,0};
   struct sockaddr from_addr;
   static uint8_t tmp_data[256];
   struct iovec msg_iov = {tmp_data, sizeof(tmp_data)};
   static uint8_t cmsg_buff[256];
   struct msghdr msghdr = {&from_addr, sizeof(from_addr),
                           &msg_iov, 1,
                           cmsg_buff, sizeof(cmsg_buff),
                           0};
   ssize_t err = recvmsg(itf->sock_fd, &msghdr, MSG_ERRQUEUE);
   if(err < 0)
     {
       perror("recvmsg failed");
       return -1;
     }

   struct cmsghdr *cmsg;
   for(cmsg = CMSG_FIRSTHDR(&msghdr); cmsg != NULL; cmsg = 
CMSG_NXTHDR(&msghdr, cmsg))
     {
       if(cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == 
SCM_TIMESTAMPING)
         {
           ts = *(struct timespec *)CMSG_DATA(cmsg);
#if !defined(NDEBUG)
           if(itf->debug)
             {
               fprintf(stderr, "SCM_TIMESTAMPING available\n");
             }
#else
           break;
#endif
         }
       else if (cmsg->cmsg_level == SOL_PACKET && cmsg->cmsg_type == 
PACKET_TX_TIMESTAMP)
         {
           ts = *(struct timespec *)CMSG_DATA(cmsg);
#if !defined(NDEBUG)
           if(itf->debug)
             {
               fprintf(stderr, "PACKET_TX_TIMESTAMP available\n");
             }
#else
           break;
#endif
         }
     }

[-- Attachment #2: 0002-net-add-tx-timestamp-to-packet-mmap.patch --]
[-- Type: text/x-patch, Size: 763 bytes --]

>From 762a8e89d1453e629bbe9c255c0ba4ec207cca25 Mon Sep 17 00:00:00 2001
From: Paul Chavent <paul.chavent@onera.fr>
Date: Fri, 7 Dec 2012 16:15:44 +0100
Subject: [PATCH 2/2] net : add tx timestamp to packet mmap.

Signed-off-by: Paul Chavent <paul.chavent@onera.fr>
---
 net/packet/af_packet.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index e639645..948748b 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1857,6 +1857,10 @@ static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
 	void *data;
 	int err;
 
+	err = sock_tx_timestamp(&po->sk, &skb_shinfo(skb)->tx_flags);
+	if (err < 0)
+		return err;
+
 	ph.raw = frame;
 
 	skb->protocol = proto;
-- 
1.7.12.1


^ permalink raw reply related

* [PATCH stable] ipv4: avoid passing NULL to inet_putpeer() in icmpv4_xrlim_allow()
From: CAI Qian @ 2012-12-07 15:46 UTC (permalink / raw)
  To: netdev; +Cc: Neal Cardwell, David S. Miller
In-Reply-To: <1411074275.5265862.1354894974886.JavaMail.root@redhat.com>

David, this patch looks applicable for the stable releases.

>From Neal Cardwell <ncardwell@google.com>

inet_getpeer_v4() can return NULL under OOM conditions, and while
inet_peer_xrlim_allow() is OK with a NULL peer, inet_putpeer() will
crash.

This code path now uses the same idiom as the others from:
1d861aa4b3fb08822055345f480850205ffe6170 ("inet: Minimize use of
cached route inetpeer.").

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Upstream-ID: e1a676424c290b1c8d757e3860170ac7ecd89af4
Stable-trees: 3.6.x
Signed-off-by: CAI Qian <caiqian@redhat.com>

diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index f2eccd5..17ff9fd 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -257,7 +257,8 @@ static inline bool icmpv4_xrlim_allow(struct net *net, struct rtable *rt,
 		struct inet_peer *peer = inet_getpeer_v4(net->ipv4.peers, fl4->daddr, 1);
 		rc = inet_peer_xrlim_allow(peer,
 					   net->ipv4.sysctl_icmp_ratelimit);
-		inet_putpeer(peer);
+		if (peer)
+			inet_putpeer(peer);
 	}
 out:
 	return rc;

^ permalink raw reply related

* Re: [RFC PATCH] dynamic_queue_limit.h: Make the struct ___cacheline_aligned_on_smp
From: Eric Dumazet @ 2012-12-07 15:55 UTC (permalink / raw)
  To: Joe Perches; +Cc: Tom Herbert, David Miller, netdev
In-Reply-To: <1354892334.29937.14.camel@joe-AO722>

2012/12/7 Joe Perches <joe@perches.com>:
> Given that the struct will always have limit at the start of
> a cacheline, why not make  struct ___cacheline_aligned_on_smp
> and make limit the first member?
>
> It could make other structs that use struct dql a bit more
> predictable or efficient to pack.
>
> (netdev_queue is size reduced from 256 to 192 on x86-32)
>

No, please.

Have you tested this on a range of hardware and check how it can hurt
performance ?

^ permalink raw reply

* [PATCH stable] ipv4: do not cache looped multicasts
From: CAI Qian @ 2012-12-07 15:58 UTC (permalink / raw)
  To: netdev; +Cc: Maxime Bizon, Julian Anastasov, David S. Miller
In-Reply-To: <2138496357.5269916.1354895709722.JavaMail.root@redhat.com>

David, this looks like applicable to the stable releases.

>From Maxime Bizon <mbizon@freebox.fr>,

	Starting from 3.6 we cache output routes for
multicasts only when using route to 224/4. For local receivers
we can set RTCF_LOCAL flag depending on the membership but
in such case we use maddr and saddr which are not caching
keys as before. Additionally, we can not use same place to
cache routes that differ in RTCF_LOCAL flag value.

	Fix it by caching only RTCF_MULTICAST entries
without RTCF_LOCAL (send-only, no loopback). As a side effect,
we avoid unneeded lookup for fnhe when not caching because
multicasts are not redirected and they do not learn PMTU.

	Thanks to Maxime Bizon for showing the caching
problems in __mkroute_output for 3.6 kernels: different
RTCF_LOCAL flag in cache can lead to wrong ip_mc_output or
ip_output call and the visible problem is that traffic can
not reach local receivers via loopback.

Reported-by: Maxime Bizon <mbizon@freebox.fr>
Tested-by: Maxime Bizon <mbizon@freebox.fr>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>

Upstream-ID: 636174219b52b5a8bc51bc23bbcba97cd30a65e3
Stable-trees: 3.6.x
Signed-off-by: CAI Qian <caiqian@redhat.com>

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 200d287..df25142 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1785,6 +1785,7 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 	if (dev_out->flags & IFF_LOOPBACK)
 		flags |= RTCF_LOCAL;
 
+	do_cache = true;
 	if (type == RTN_BROADCAST) {
 		flags |= RTCF_BROADCAST | RTCF_LOCAL;
 		fi = NULL;
@@ -1793,6 +1794,8 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 		if (!ip_check_mc_rcu(in_dev, fl4->daddr, fl4->saddr,
 				     fl4->flowi4_proto))
 			flags &= ~RTCF_LOCAL;
+		else
+			do_cache = false;
 		/* If multicast route do not exist use
 		 * default one, but do not gateway in this case.
 		 * Yes, it is hack.
@@ -1802,8 +1805,8 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 	}
 
 	fnhe = NULL;
-	do_cache = fi != NULL;
-	if (fi) {
+	do_cache &= fi != NULL;
+	if (do_cache) {
 		struct rtable __rcu **prth;
 		struct fib_nh *nh = &FIB_RES_NH(*res);

^ permalink raw reply related

* Re: [PATCH net-next] rps: overflow prevention for saturated cpus
From: Willem de Bruijn @ 2012-12-07 16:04 UTC (permalink / raw)
  To: Rick Jones; +Cc: netdev, David Miller, Eric Dumazet, Tom Herbert
In-Reply-To: <50C12E22.3030206@hp.com>

On Thu, Dec 6, 2012 at 6:45 PM, Rick Jones <rick.jones2@hp.com> wrote:
> On 12/06/2012 03:04 PM, Willem de Bruijn wrote:
>>
>> On Thu, Dec 6, 2012 at 5:25 PM, Rick Jones <rick.jones2@hp.com> wrote:
>>>
>>> I thought (one of) the ideas behind RFS at least was to give the CPU
>>> scheduler control over where network processing took place instead of it
>>> being dictated solely by the addressing. I would have expected the CPU
>>> scheduler to migrate some work off the saturated CPU.  Or will this only
>>> affect RPS and not RFS?
>>
>>
>> I wrote it with RPS in mind, indeed. With RFS, for sufficiently
>> multithreaded applications that are unpinned, the scheduler will
>> likely spread the threads across as many cpus as possible. In that
>> case, the mechanism will not kick in, or as quickly. Even with RFS,
>> pinned threads and single-threaded applications will likely also
>> benefit during high load from redirecting kernel receive
>> processing away from the cpu that runs the application thread. I
>> haven't tested that case independently.
>
>
> Unless that single-threaded application (or single receiving thread) is
> pinned to a CPU, isn't there a non-trivial chance that incoming traffic
> flowing up different CPUs will cause it to be bounced from one CPU to
> another, taking its cache lines with it and not just the "intra-stack" cache
> lines?

Yes. The patch restricts the offload cpus to rps_cpus, with the assumption
that this is a small subset of all cpus. In that case, other workloads will
eventually migrate to the remainder. I previously tested spreading across
all cpus, which indeed did interfere with the userspace threads.

> Long (?) ago and far away it was possible to say that a given IRQ should be
> potentially serviced by more than one CPU (if I recall though not phrase
> correctly).  Didn't that get taken away because it did such nasty things
> like reordering and such?  (Admittedly, I'm really stretching the limits of
> my dimm memory there)

Sounds familiar. Wasn't there a mechanism to periodically switch the
destination cpu? If at HZ granularity, that is very coarse grain compared to
Mpps, but out of order does seem likely. I assume that this patch will lead
to a steady state where userspace and kernel receive run on disjoint cpusets,
due to the rps_cpus set being hot with kernel receive processing. That said,
I can run a test with RFS enabled to see whether that actually holds.

>>> What kind of workload is this targeting that calls for
>>> such intra-flow parallelism?
>>
>>
>> Packet processing middeboxes that rather operate in degraded mode
>> (reordering) than drop packets. Intrusion detection systems and proxies,
>> for instance. These boxes are actually likely to have RPS enabled and
>> RFS disabled.
>>
>>> With respect to the examples given, what happens when it is TCP traffic
>>> rather than UDP?
>>
>>
>> That should be identical. RFS is supported for both protocols. In the
>> test, it is turned off to demonstrate the effect solely with RPS.
>
>
> Will it be identical with TCP?  If anything, I would think causing
> reordering of the TCP segments within flows would only further increase the
> workload of the middlebox because it will increase the ACK rates. Perhaps
> quite significantly if GRO was effective at the receivers before the
> reordering started.
>
> At least unless/until the reordering is bad enough to cause the sending TCPs
> to fast retransmit and so throttle back.  And unless we are talking about
> being overloaded by massive herds of "mice" I'd think that the TCP flows
> would be throttling back to what the single CPU in the middlebox could
> handle.

Agreed, I will try to get some data on the interaction with TCP flows. My
hunch is that they throttle down due to the reordering, but data is more useful.
The initial increase in ACKs, if any, will likely not increase rate beyond a
small factor.

The situations that this patch mean to address are more straightforward
DoS attacks, where a box can handle normal load with a big safety margin,
but falls over at a 10x or 100x flood of TCP SYN or similar packets.

> rick

^ permalink raw reply

* Re: [RFC PATCH] dynamic_queue_limit.h: Make the struct ___cacheline_aligned_on_smp
From: Joe Perches @ 2012-12-07 16:05 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Tom Herbert, David Miller, netdev
In-Reply-To: <CAL4Wiip12_Frb64WGk419_YApOuckTa3Jtd-nVROodLFTqvacg@mail.gmail.com>

On Fri, 2012-12-07 at 07:55 -0800, Eric Dumazet wrote:
> 2012/12/7 Joe Perches <joe@perches.com>:
> > Given that the struct will always have limit at the start of
> > a cacheline, why not make  struct ___cacheline_aligned_on_smp
> > and make limit the first member?
> >
> > It could make other structs that use struct dql a bit more
> > predictable or efficient to pack.
> >
> > (netdev_queue is size reduced from 256 to 192 on x86-32)
> >
> 
> No, please.
> 
> Have you tested this on a range of hardware and check how it can hurt
> performance ?

No.  Hence the RFC subject title and
unsigned patch.

I was wondering though about cacheline ping-pong 
effects.

I noted Tom's comment back in
http://patchwork.ozlabs.org/patch/108856/
"Also, the cache line containing the struct dql can ping-pong between
CPUs doing initiation and completion.  (I know we're aiming for these to
be the same, but we can't yet assume they will be.)"

So it seemed somewhat sensible to make the
entire struct in a single cacheline.

^ permalink raw reply

* Re: [PATCH V2 wireless-next] iwlwifi: iwlagn_request_scan: Fix check for priv->scan_request
From: Johannes Berg @ 2012-12-07 16:05 UTC (permalink / raw)
  To: Tim Gardner
  Cc: linux-kernel, Wey-Yi Guy, Intel Linux Wireless, John W. Linville,
	Emmanuel Grumbach, Don Fry, linux-wireless, netdev
In-Reply-To: <1354886914-7822-1-git-send-email-tim.gardner@canonical.com>

On Fri, 2012-12-07 at 06:28 -0700, Tim Gardner wrote:
> The WARN_ON_ONCE() check for scan_request will not correctly detect
> a NULL pointer for scan_type == IWL_SCAN_NORMAL. Make it explicit
> that the check only applies to normal scans.
> 
> Convert WARN_ON_ONCE to WARN_ON since priv->scan_request really _can't_
> be NULL for normal scans. If it is then we should emit frequent warnings.
> 
> This smatch warning led to scrutiny of iwlagn_request_scan():
> 
> drivers/net/wireless/iwlwifi/dvm/scan.c:894 iwlagn_request_scan() error: we previously assumed 'priv->scan_request' could be null (see line 792)
> 
> Cc: Johannes Berg <johannes.berg@intel.com>
> Cc: Wey-Yi Guy <wey-yi.w.guy@intel.com>
> Cc: Intel Linux Wireless <ilw@linux.intel.com>
> Cc: "John W. Linville" <linville@tuxdriver.com>
> Cc: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
> Cc: Don Fry <donald.h.fry@intel.com>
> Cc: linux-wireless@vger.kernel.org
> Cc: netdev@vger.kernel.org
> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
> ---
> 
> This patch does apply to 3.6.y, but it doesn't fix an existing
> bug so I don't think it qualifies. This patch simply makes
> the driver more robust for future development.
> 
> V2 - corrected indentation more like the rest of the source
> in this file.

Thanks, I've picked it up now, adding one space in the condition
still :)
It's in my internal tree for now, so it'll be a few days until it
trickles out to iwlwifi-next.

johannes

^ permalink raw reply

* Re: [patch v2] bridge: make buffer larger in br_setlink()
From: walter harms @ 2012-12-07 16:07 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: Stephen Hemminger, David S. Miller, bridge, netdev,
	kernel-janitors, Thomas Graf
In-Reply-To: <20121207111045.GA9676@elgon.mountain>



Am 07.12.2012 12:10, schrieb Dan Carpenter:
> We pass IFLA_BRPORT_MAX to nla_parse_nested() so we need
> IFLA_BRPORT_MAX + 1 elements.  Also Smatch complains that we read past
> the end of the array when in br_set_port_flag() when it's called with
> IFLA_BRPORT_FAST_LEAVE.
> 



I have no clue why nla_parse_nested() need IFLA_BRPORT_MAX elements.
but the majory of loop look like
for(i=0;i<max;++)
most programmers will think this way.
So it seems the place to fix is nla_parse_nested().
doing not so is asking for trouble (in the long run).
At least this function needs a big warning label that (max-1)
is actually needed.


just my two cents,
 wh


> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
> ---
> v2: Style tweak.
> 
> Only needed in linux-next.
> 
> diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
> index 850b7d1..cfc5cfe 100644
> --- a/net/bridge/br_netlink.c
> +++ b/net/bridge/br_netlink.c
> @@ -239,7 +239,7 @@ int br_setlink(struct net_device *dev, struct nlmsghdr *nlh)
>  	struct ifinfomsg *ifm;
>  	struct nlattr *protinfo;
>  	struct net_bridge_port *p;
> -	struct nlattr *tb[IFLA_BRPORT_MAX];
> +	struct nlattr *tb[IFLA_BRPORT_MAX + 1];
>  	int err;
>  
>  	ifm = nlmsg_data(nlh);
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kernel-janitors" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply

* Re: [PATCH net-next 02/10] tipc: eliminate aggregate sk_receive_queue limit
From: Neil Horman @ 2012-12-07 16:07 UTC (permalink / raw)
  To: Paul Gortmaker; +Cc: David Miller, netdev, Jon Maloy, Ying Xue
In-Reply-To: <1354890498-6448-3-git-send-email-paul.gortmaker@windriver.com>

On Fri, Dec 07, 2012 at 09:28:10AM -0500, Paul Gortmaker wrote:
> From: Ying Xue <ying.xue@windriver.com>
> 
> As a complement to the per-socket sk_recv_queue limit, TIPC keeps a
> global atomic counter for the sum of sk_recv_queue sizes across all
> tipc sockets. When incremented, the counter is compared to an upper
> threshold value, and if this is reached, the message is rejected
> with error code TIPC_OVERLOAD.
> 
> This check was originally meant to protect the node against
> buffer exhaustion and general CPU overload. However, all experience
> indicates that the feature not only is redundant on Linux, but even
> harmful. Users run into the limit very often, causing disturbances
> for their applications, while removing it seems to have no negative
> effects at all. We have also seen that overall performance is
> boosted significantly when this bottleneck is removed.
> 
> Furthermore, we don't see any other network protocols maintaining
> such a mechanism, something strengthening our conviction that this
> control can be eliminated.
> 
> Signed-off-by: Ying Xue <ying.xue@windriver.com>
> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
> ---
>  net/tipc/socket.c | 7 +------
>  1 file changed, 1 insertion(+), 6 deletions(-)
> 
> diff --git a/net/tipc/socket.c b/net/tipc/socket.c
> index 1a720c8..a059ed0 100644
> --- a/net/tipc/socket.c
> +++ b/net/tipc/socket.c
> @@ -2,7 +2,7 @@
>   * net/tipc/socket.c: TIPC socket API
>   *
>   * Copyright (c) 2001-2007, Ericsson AB
> - * Copyright (c) 2004-2008, 2010-2011, Wind River Systems
> + * Copyright (c) 2004-2008, 2010-2012, Wind River Systems
>   * All rights reserved.
>   *
>   * Redistribution and use in source and binary forms, with or without
> @@ -1241,11 +1241,6 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
>  	}
>  
>  	/* Reject message if there isn't room to queue it */
> -	recv_q_len = (u32)atomic_read(&tipc_queue_size);
> -	if (unlikely(recv_q_len >= OVERLOAD_LIMIT_BASE)) {
> -		if (rx_queue_full(msg, recv_q_len, OVERLOAD_LIMIT_BASE))
> -			return TIPC_ERR_OVERLOAD;
> -	}
If you're going to remove the one place that you read this variable, don't you
also want to remove the points where you increment/decrement the atomic as well,
and for that matter eliminate the definition itself?

Neil

^ permalink raw reply

* [PATCH net-next] bonding: Fix check for ethtool get_link operation support
From: Ben Hutchings @ 2012-12-07 16:15 UTC (permalink / raw)
  To: Jay Vosburgh, Andy Gospodarek; +Cc: netdev

Since commit 2c60db037034 ('net: provide a default dev->ethtool_ops')
all devices have a non-null ethtool_ops.  Test only
dev->ethtool_ops->get_link in both places where we care.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
Compile-tested only.

Ben.

 drivers/net/bonding/bond_main.c |   17 ++++++-----------
 1 files changed, 6 insertions(+), 11 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index c8bff3e..800a897 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -615,15 +615,9 @@ static int bond_check_dev_link(struct bonding *bond,
 		return netif_carrier_ok(slave_dev) ? BMSR_LSTATUS : 0;
 
 	/* Try to get link status using Ethtool first. */
-	if (slave_dev->ethtool_ops) {
-		if (slave_dev->ethtool_ops->get_link) {
-			u32 link;
-
-			link = slave_dev->ethtool_ops->get_link(slave_dev);
-
-			return link ? BMSR_LSTATUS : 0;
-		}
-	}
+	if (slave_dev->ethtool_ops->get_link)
+		return slave_dev->ethtool_ops->get_link(slave_dev) ?
+			BMSR_LSTATUS : 0;
 
 	/* Ethtool can't be used, fallback to MII ioctls. */
 	ioctl = slave_ops->ndo_do_ioctl;
@@ -1510,8 +1504,9 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 	int link_reporting;
 	int res = 0;
 
-	if (!bond->params.use_carrier && slave_dev->ethtool_ops == NULL &&
-		slave_ops->ndo_do_ioctl == NULL) {
+	if (!bond->params.use_carrier &&
+	    slave_dev->ethtool_ops->get_link == NULL &&
+	    slave_ops->ndo_do_ioctl == NULL) {
 		pr_warning("%s: Warning: no link monitoring support for %s\n",
 			   bond_dev->name, slave_dev->name);
 	}
-- 
1.7.7.6



-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply related

* [PATCH net-next 1/2] caif_usb: Check driver name before reading driver state in netdev notifier
From: Ben Hutchings @ 2012-12-07 16:17 UTC (permalink / raw)
  To: Sjur Braendeland; +Cc: netdev

In cfusbl_device_notify(), the usbnet and usbdev variables are
initialised before the driver name has been checked.  In case the
device's driver is not cdc_ncm, this may result in reading beyond the
end of the netdev private area.  Move the initialisation below the
driver name check.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
Compile-tested only.

Ben.

 net/caif/caif_usb.c |    7 +++++--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/caif/caif_usb.c b/net/caif/caif_usb.c
index fd7cbf5..582f80c 100644
--- a/net/caif/caif_usb.c
+++ b/net/caif/caif_usb.c
@@ -126,8 +126,8 @@ static int cfusbl_device_notify(struct notifier_block *me, unsigned long what,
 	struct net_device *dev = arg;
 	struct caif_dev_common common;
 	struct cflayer *layer, *link_support;
-	struct usbnet	*usbnet = netdev_priv(dev);
-	struct usb_device	*usbdev = usbnet->udev;
+	struct usbnet *usbnet;
+	struct usb_device *usbdev;
 	struct ethtool_drvinfo drvinfo;
 
 	/*
@@ -141,6 +141,9 @@ static int cfusbl_device_notify(struct notifier_block *me, unsigned long what,
 	if (strncmp(drvinfo.driver, "cdc_ncm", 7) != 0)
 		return 0;
 
+	usbnet = netdev_priv(dev);
+	usbdev = usbnet->udev;
+
 	pr_debug("USB CDC NCM device VID:0x%4x PID:0x%4x\n",
 		le16_to_cpu(usbdev->descriptor.idVendor),
 		le16_to_cpu(usbdev->descriptor.idProduct));
-- 
1.7.7.6



-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply related

* Re: [RFC PATCH] dynamic_queue_limit.h: Make the struct ___cacheline_aligned_on_smp
From: Eric Dumazet @ 2012-12-07 16:19 UTC (permalink / raw)
  To: Joe Perches; +Cc: Tom Herbert, David Miller, netdev
In-Reply-To: <1354896346.29937.43.camel@joe-AO722>

On Fri, 2012-12-07 at 08:05 -0800, Joe Perches wrote:

> So it seemed somewhat sensible to make the
> entire struct in a single cacheline.

Any layout change in an object used in network fast path need a complete
performance study.

Even if you provide such a study, we'll need to reproduce your numbers
here.

BQL/DQL is not on our radars, spending two cache lines on a critical
object is fine.

Thanks

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox