Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH v5] sctp: Implement quick failover draft from tsvwg
From: Neil Horman @ 2012-07-20 18:51 UTC (permalink / raw)
  To: netdev
  Cc: Neil Horman, Vlad Yasevich, Sridhar Samudrala, David S. Miller,
	linux-sctp, joe
In-Reply-To: <1342203998-24037-1-git-send-email-nhorman@tuxdriver.com>

I've seen several attempts recently made to do quick failover of sctp transports
by reducing various retransmit timers and counters.  While its possible to
implement a faster failover on multihomed sctp associations, its not
particularly robust, in that it can lead to unneeded retransmits, as well as
false connection failures due to intermittent latency on a network.

Instead, lets implement the new ietf quick failover draft found here:
http://tools.ietf.org/html/draft-nishida-tsvwg-sctp-failover-05

This will let the sctp stack identify transports that have had a small number of
errors, and avoid using them quickly until their reliability can be
re-established.  I've tested this out on two virt guests connected via multiple
isolated virt networks and believe its in compliance with the above draft and
works well.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Vlad Yasevich <vyasevich@gmail.com>
CC: Sridhar Samudrala <sri@us.ibm.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: linux-sctp@vger.kernel.org
CC: joe@perches.com

---
Change notes:

V2)
- Added socket option API from section 6.1 of the specification, as per
request from Vlad. Adding this socket option allows us to alter both the path
maximum retransmit value and the path partial failure threshold for each
transport and the association as a whole.

- Added a per transport pf_retrans value, and initialized it from the
association value.  This makes each transport independently configurable as per
the socket option above, and prevents changes in the sysctl from bleeding into
an already created association.

V3)
- Cleaned up some line spacing (Joe Perches)
- Fixed some socket option user data sanitization (Vlad Yasevich)

V4)
- Added additional documentation (Flavio Leitner)

V5)
- Modified setsockopt option to ignore 0 pathmaxrxt rather than return
  error (Vlad Yasevich)
- Modified getsocopt to return option length written (Vlad Y.)
---
 Documentation/networking/ip-sysctl.txt |   14 +++++
 include/net/sctp/constants.h           |    1 +
 include/net/sctp/structs.h             |   20 ++++++-
 include/net/sctp/user.h                |   11 ++++
 net/sctp/associola.c                   |   37 ++++++++++--
 net/sctp/outqueue.c                    |    6 +-
 net/sctp/sm_sideeffect.c               |   33 +++++++++-
 net/sctp/socket.c                      |  100 ++++++++++++++++++++++++++++++++
 net/sctp/sysctl.c                      |    9 +++
 net/sctp/transport.c                   |    4 +-
 10 files changed, 220 insertions(+), 15 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 47b6c79..c636f9c 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -1408,6 +1408,20 @@ path_max_retrans - INTEGER
 
 	Default: 5
 
+pf_retrans - INTEGER
+	The number of retransmissions that will be attempted on a given path
+	before traffic is redirected to an alternate transport (should one
+	exist).  Note this is distinct from path_max_retrans, as a path that
+	passes the pf_retrans threshold can still be used.  Its only
+	deprioritized when a transmission path is selected by the stack.  This
+	setting is primarily used to enable fast failover mechanisms without
+	having to reduce path_max_retrans to a very low value.  See:
+	http://www.ietf.org/id/draft-nishida-tsvwg-sctp-failover-05.txt
+	for details.  Note also that a value of pf_retrans > path_max_retrans
+	disables this feature
+
+	Default: 0
+
 rto_initial - INTEGER
 	The initial round trip timeout value in milliseconds that will be used
 	in calculating round trip times.  This is the initial time interval
diff --git a/include/net/sctp/constants.h b/include/net/sctp/constants.h
index 942b864..d053d2e 100644
--- a/include/net/sctp/constants.h
+++ b/include/net/sctp/constants.h
@@ -334,6 +334,7 @@ typedef enum {
 typedef enum {
 	SCTP_TRANSPORT_UP,
 	SCTP_TRANSPORT_DOWN,
+	SCTP_TRANSPORT_PF,
 } sctp_transport_cmd_t;
 
 /* These are the address scopes defined mainly for IPv4 addresses
diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index e4652fe..cee0678 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -161,6 +161,12 @@ extern struct sctp_globals {
 	int max_retrans_path;
 	int max_retrans_init;
 
+	/* Potentially-Failed.Max.Retrans sysctl value
+	 * taken from:
+	 * http://tools.ietf.org/html/draft-nishida-tsvwg-sctp-failover-05
+	 */
+	int pf_retrans;
+
 	/*
 	 * Policy for preforming sctp/socket accounting
 	 * 0   - do socket level accounting, all assocs share sk_sndbuf
@@ -258,6 +264,7 @@ extern struct sctp_globals {
 #define sctp_sndbuf_policy	 	(sctp_globals.sndbuf_policy)
 #define sctp_rcvbuf_policy	 	(sctp_globals.rcvbuf_policy)
 #define sctp_max_retrans_path		(sctp_globals.max_retrans_path)
+#define sctp_pf_retrans			(sctp_globals.pf_retrans)
 #define sctp_max_retrans_init		(sctp_globals.max_retrans_init)
 #define sctp_sack_timeout		(sctp_globals.sack_timeout)
 #define sctp_hb_interval		(sctp_globals.hb_interval)
@@ -987,10 +994,15 @@ struct sctp_transport {
 
 	/* This is the max_retrans value for the transport and will
 	 * be initialized from the assocs value.  This can be changed
-	 * using SCTP_SET_PEER_ADDR_PARAMS socket option.
+	 * using the SCTP_SET_PEER_ADDR_PARAMS socket option.
 	 */
 	__u16 pathmaxrxt;
 
+	/* This is the partially failed retrans value for the transport
+	 * and will be initialized from the assocs value.  This can be changed
+	 * using the SCTP_PEER_ADDR_THLDS socket option
+	 */
+	int pf_retrans;
 	/* PMTU	      : The current known path MTU.  */
 	__u32 pathmtu;
 
@@ -1660,6 +1672,12 @@ struct sctp_association {
 	 */
 	int max_retrans;
 
+	/* This is the partially failed retrans value for the transport
+	 * and will be initialized from the assocs value.  This can be
+	 * changed using the SCTP_PEER_ADDR_THLDS socket option
+	 */
+	int pf_retrans;
+
 	/* Maximum number of times the endpoint will retransmit INIT  */
 	__u16 max_init_attempts;
 
diff --git a/include/net/sctp/user.h b/include/net/sctp/user.h
index 0842ef0..1b02d7a 100644
--- a/include/net/sctp/user.h
+++ b/include/net/sctp/user.h
@@ -93,6 +93,7 @@ typedef __s32 sctp_assoc_t;
 #define SCTP_GET_ASSOC_NUMBER	28	/* Read only */
 #define SCTP_GET_ASSOC_ID_LIST	29	/* Read only */
 #define SCTP_AUTO_ASCONF       30
+#define SCTP_PEER_ADDR_THLDS	31
 
 /* Internal Socket Options. Some of the sctp library functions are
  * implemented using these socket options.
@@ -649,6 +650,7 @@ struct sctp_paddrinfo {
  */
 enum sctp_spinfo_state {
 	SCTP_INACTIVE,
+	SCTP_PF,
 	SCTP_ACTIVE,
 	SCTP_UNCONFIRMED,
 	SCTP_UNKNOWN = 0xffff  /* Value used for transport state unknown */
@@ -741,4 +743,13 @@ typedef struct {
 	int sd;
 } sctp_peeloff_arg_t;
 
+/*
+ *  Peer Address Thresholds socket option
+ */
+struct sctp_paddrthlds {
+	sctp_assoc_t spt_assoc_id;
+	struct sockaddr_storage spt_address;
+	__u16 spt_pathmaxrxt;
+	__u16 spt_pathpfthld;
+};
 #endif /* __net_sctp_user_h__ */
diff --git a/net/sctp/associola.c b/net/sctp/associola.c
index 5bc9ab1..90fe36b 100644
--- a/net/sctp/associola.c
+++ b/net/sctp/associola.c
@@ -124,6 +124,8 @@ static struct sctp_association *sctp_association_init(struct sctp_association *a
 	 * socket values.
 	 */
 	asoc->max_retrans = sp->assocparams.sasoc_asocmaxrxt;
+	asoc->pf_retrans  = sctp_pf_retrans;
+
 	asoc->rto_initial = msecs_to_jiffies(sp->rtoinfo.srto_initial);
 	asoc->rto_max = msecs_to_jiffies(sp->rtoinfo.srto_max);
 	asoc->rto_min = msecs_to_jiffies(sp->rtoinfo.srto_min);
@@ -685,6 +687,9 @@ struct sctp_transport *sctp_assoc_add_peer(struct sctp_association *asoc,
 	/* Set the path max_retrans.  */
 	peer->pathmaxrxt = asoc->pathmaxrxt;
 
+	/* And the partial failure retrnas threshold */
+	peer->pf_retrans = asoc->pf_retrans;
+
 	/* Initialize the peer's SACK delay timeout based on the
 	 * association configured value.
 	 */
@@ -840,6 +845,7 @@ void sctp_assoc_control_transport(struct sctp_association *asoc,
 	struct sctp_ulpevent *event;
 	struct sockaddr_storage addr;
 	int spc_state = 0;
+	bool ulp_notify = true;
 
 	/* Record the transition on the transport.  */
 	switch (command) {
@@ -853,6 +859,14 @@ void sctp_assoc_control_transport(struct sctp_association *asoc,
 			spc_state = SCTP_ADDR_CONFIRMED;
 		else
 			spc_state = SCTP_ADDR_AVAILABLE;
+		/* Don't inform ULP about transition from PF to
+		 * active state and set cwnd to 1, see SCTP
+		 * Quick failover draft section 5.1, point 5
+		 */
+		if (transport->state == SCTP_PF) {
+			ulp_notify = false;
+			transport->cwnd = 1;
+		}
 		transport->state = SCTP_ACTIVE;
 		break;
 
@@ -871,6 +885,11 @@ void sctp_assoc_control_transport(struct sctp_association *asoc,
 		spc_state = SCTP_ADDR_UNREACHABLE;
 		break;
 
+	case SCTP_TRANSPORT_PF:
+		transport->state = SCTP_PF;
+		ulp_notify = false;
+		break;
+
 	default:
 		return;
 	}
@@ -878,12 +897,15 @@ void sctp_assoc_control_transport(struct sctp_association *asoc,
 	/* Generate and send a SCTP_PEER_ADDR_CHANGE notification to the
 	 * user.
 	 */
-	memset(&addr, 0, sizeof(struct sockaddr_storage));
-	memcpy(&addr, &transport->ipaddr, transport->af_specific->sockaddr_len);
-	event = sctp_ulpevent_make_peer_addr_change(asoc, &addr,
-				0, spc_state, error, GFP_ATOMIC);
-	if (event)
-		sctp_ulpq_tail_event(&asoc->ulpq, event);
+	if (ulp_notify) {
+		memset(&addr, 0, sizeof(struct sockaddr_storage));
+		memcpy(&addr, &transport->ipaddr,
+		       transport->af_specific->sockaddr_len);
+		event = sctp_ulpevent_make_peer_addr_change(asoc, &addr,
+					0, spc_state, error, GFP_ATOMIC);
+		if (event)
+			sctp_ulpq_tail_event(&asoc->ulpq, event);
+	}
 
 	/* Select new active and retran paths. */
 
@@ -899,7 +921,8 @@ void sctp_assoc_control_transport(struct sctp_association *asoc,
 			transports) {
 
 		if ((t->state == SCTP_INACTIVE) ||
-		    (t->state == SCTP_UNCONFIRMED))
+		    (t->state == SCTP_UNCONFIRMED) ||
+		    (t->state == SCTP_PF))
 			continue;
 		if (!first || t->last_time_heard > first->last_time_heard) {
 			second = first;
diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index a0fa19f..e7aa177c 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -792,7 +792,8 @@ static int sctp_outq_flush(struct sctp_outq *q, int rtx_timeout)
 			if (!new_transport)
 				new_transport = asoc->peer.active_path;
 		} else if ((new_transport->state == SCTP_INACTIVE) ||
-			   (new_transport->state == SCTP_UNCONFIRMED)) {
+			   (new_transport->state == SCTP_UNCONFIRMED) ||
+			   (new_transport->state == SCTP_PF)) {
 			/* If the chunk is Heartbeat or Heartbeat Ack,
 			 * send it to chunk->transport, even if it's
 			 * inactive.
@@ -987,7 +988,8 @@ static int sctp_outq_flush(struct sctp_outq *q, int rtx_timeout)
 			new_transport = chunk->transport;
 			if (!new_transport ||
 			    ((new_transport->state == SCTP_INACTIVE) ||
-			     (new_transport->state == SCTP_UNCONFIRMED)))
+			     (new_transport->state == SCTP_UNCONFIRMED) ||
+			     (new_transport->state == SCTP_PF)))
 				new_transport = asoc->peer.active_path;
 			if (new_transport->state == SCTP_UNCONFIRMED)
 				continue;
diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c
index c96d1a8..285e26a 100644
--- a/net/sctp/sm_sideeffect.c
+++ b/net/sctp/sm_sideeffect.c
@@ -76,6 +76,8 @@ static int sctp_side_effects(sctp_event_t event_type, sctp_subtype_t subtype,
 			     sctp_cmd_seq_t *commands,
 			     gfp_t gfp);
 
+static void sctp_cmd_hb_timer_update(sctp_cmd_seq_t *cmds,
+				     struct sctp_transport *t);
 /********************************************************************
  * Helper functions
  ********************************************************************/
@@ -470,7 +472,8 @@ sctp_timer_event_t *sctp_timer_events[SCTP_NUM_TIMEOUT_TYPES] = {
  * notification SHOULD be sent to the upper layer.
  *
  */
-static void sctp_do_8_2_transport_strike(struct sctp_association *asoc,
+static void sctp_do_8_2_transport_strike(sctp_cmd_seq_t *commands,
+					 struct sctp_association *asoc,
 					 struct sctp_transport *transport,
 					 int is_hb)
 {
@@ -495,6 +498,23 @@ static void sctp_do_8_2_transport_strike(struct sctp_association *asoc,
 			transport->error_count++;
 	}
 
+	/* If the transport error count is greater than the pf_retrans
+	 * threshold, and less than pathmaxrtx, then mark this transport
+	 * as Partially Failed, ee SCTP Quick Failover Draft, secon 5.1,
+	 * point 1
+	 */
+	if ((transport->state != SCTP_PF) &&
+	   (asoc->pf_retrans < transport->pathmaxrxt) &&
+	   (transport->error_count > asoc->pf_retrans)) {
+
+		sctp_assoc_control_transport(asoc, transport,
+					     SCTP_TRANSPORT_PF,
+					     0);
+
+		/* Update the hb timer to resend a heartbeat every rto */
+		sctp_cmd_hb_timer_update(commands, transport);
+	}
+
 	if (transport->state != SCTP_INACTIVE &&
 	    (transport->error_count > transport->pathmaxrxt)) {
 		SCTP_DEBUG_PRINTK_IPADDR("transport_strike:association %p",
@@ -699,6 +719,10 @@ static void sctp_cmd_transport_on(sctp_cmd_seq_t *cmds,
 					     SCTP_HEARTBEAT_SUCCESS);
 	}
 
+	if (t->state == SCTP_PF)
+		sctp_assoc_control_transport(asoc, t, SCTP_TRANSPORT_UP,
+					     SCTP_HEARTBEAT_SUCCESS);
+
 	/* The receiver of the HEARTBEAT ACK should also perform an
 	 * RTT measurement for that destination transport address
 	 * using the time value carried in the HEARTBEAT ACK chunk.
@@ -1565,8 +1589,8 @@ static int sctp_cmd_interpreter(sctp_event_t event_type,
 
 		case SCTP_CMD_STRIKE:
 			/* Mark one strike against a transport.  */
-			sctp_do_8_2_transport_strike(asoc, cmd->obj.transport,
-						    0);
+			sctp_do_8_2_transport_strike(commands, asoc,
+						    cmd->obj.transport, 0);
 			break;
 
 		case SCTP_CMD_TRANSPORT_IDLE:
@@ -1576,7 +1600,8 @@ static int sctp_cmd_interpreter(sctp_event_t event_type,
 
 		case SCTP_CMD_TRANSPORT_HB_SENT:
 			t = cmd->obj.transport;
-			sctp_do_8_2_transport_strike(asoc, t, 1);
+			sctp_do_8_2_transport_strike(commands, asoc,
+						     t, 1);
 			t->hb_sent = 1;
 			break;
 
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index b3b8a8d..bba551f 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -3470,6 +3470,56 @@ static int sctp_setsockopt_auto_asconf(struct sock *sk, char __user *optval,
 }
 
 
+/*
+ * SCTP_PEER_ADDR_THLDS
+ *
+ * This option allows us to alter the partially failed threshold for one or all
+ * transports in an association.  See Section 6.1 of:
+ * http://www.ietf.org/id/draft-nishida-tsvwg-sctp-failover-05.txt
+ */
+static int sctp_setsockopt_paddr_thresholds(struct sock *sk,
+					    char __user *optval,
+					    unsigned int optlen)
+{
+	struct sctp_paddrthlds val;
+	struct sctp_transport *trans;
+	struct sctp_association *asoc;
+
+	if (optlen < sizeof(struct sctp_paddrthlds))
+		return -EINVAL;
+	if (copy_from_user(&val, (struct sctp_paddrthlds __user *)optval,
+			   sizeof(struct sctp_paddrthlds)))
+		return -EFAULT;
+
+
+	if (sctp_is_any(sk, (const union sctp_addr *)&val.spt_address)) {
+		asoc = sctp_id2assoc(sk, val.spt_assoc_id);
+		if (!asoc)
+			return -ENOENT;
+		list_for_each_entry(trans, &asoc->peer.transport_addr_list,
+				    transports) {
+			if (val.spt_pathmaxrxt)
+				trans->pathmaxrxt = val.spt_pathmaxrxt;
+			trans->pf_retrans = val.spt_pathpfthld;
+		}
+
+		if (val.spt_pathmaxrxt)
+			asoc->pathmaxrxt = val.spt_pathmaxrxt;
+		asoc->pf_retrans = val.spt_pathpfthld;
+	} else {
+		trans = sctp_addr_id2transport(sk, &val.spt_address,
+					       val.spt_assoc_id);
+		if (!trans)
+			return -ENOENT;
+
+		if (val.spt_pathmaxrxt)
+			trans->pathmaxrxt = val.spt_pathmaxrxt;
+		trans->pf_retrans = val.spt_pathpfthld;
+	}
+
+	return 0;
+}
+
 /* API 6.2 setsockopt(), getsockopt()
  *
  * Applications use setsockopt() and getsockopt() to set or retrieve
@@ -3619,6 +3669,9 @@ SCTP_STATIC int sctp_setsockopt(struct sock *sk, int level, int optname,
 	case SCTP_AUTO_ASCONF:
 		retval = sctp_setsockopt_auto_asconf(sk, optval, optlen);
 		break;
+	case SCTP_PEER_ADDR_THLDS:
+		retval = sctp_setsockopt_paddr_thresholds(sk, optval, optlen);
+		break;
 	default:
 		retval = -ENOPROTOOPT;
 		break;
@@ -5490,6 +5543,50 @@ static int sctp_getsockopt_assoc_ids(struct sock *sk, int len,
 	return 0;
 }
 
+/*
+ * SCTP_PEER_ADDR_THLDS
+ *
+ * This option allows us to fetch the partially failed threshold for one or all
+ * transports in an association.  See Section 6.1 of:
+ * http://www.ietf.org/id/draft-nishida-tsvwg-sctp-failover-05.txt
+ */
+static int sctp_getsockopt_paddr_thresholds(struct sock *sk,
+					    char __user *optval,
+					    int optlen)
+{
+	struct sctp_paddrthlds val;
+	struct sctp_transport *trans;
+	struct sctp_association *asoc;
+
+	if (optlen < sizeof(struct sctp_paddrthlds))
+		return -EINVAL;
+	optlen = sizeof(struct sctp_paddrthlds);
+	if (copy_from_user(&val, (struct sctp_paddrthlds __user *)optval, optlen))
+		return -EFAULT;
+
+	if (sctp_is_any(sk, (const union sctp_addr *)&val.spt_address)) {
+		asoc = sctp_id2assoc(sk, val.spt_assoc_id);
+		if (!asoc)
+			return -ENOENT;
+
+		val.spt_pathpfthld = asoc->pf_retrans;
+		val.spt_pathmaxrxt = asoc->pathmaxrxt;
+	} else {
+		trans = sctp_addr_id2transport(sk, &val.spt_address,
+					       val.spt_assoc_id);
+		if (!trans)
+			return -ENOENT;
+
+		val.spt_pathmaxrxt = trans->pathmaxrxt;
+		val.spt_pathpfthld = trans->pf_retrans;
+	}
+
+	if (copy_to_user(optval, &val, optlen))
+		return -EFAULT;
+
+	return optlen;
+}
+
 SCTP_STATIC int sctp_getsockopt(struct sock *sk, int level, int optname,
 				char __user *optval, int __user *optlen)
 {
@@ -5628,6 +5725,9 @@ SCTP_STATIC int sctp_getsockopt(struct sock *sk, int level, int optname,
 	case SCTP_AUTO_ASCONF:
 		retval = sctp_getsockopt_auto_asconf(sk, len, optval, optlen);
 		break;
+	case SCTP_PEER_ADDR_THLDS:
+		retval = sctp_getsockopt_paddr_thresholds(sk, optval, len);
+		break;
 	default:
 		retval = -ENOPROTOOPT;
 		break;
diff --git a/net/sctp/sysctl.c b/net/sctp/sysctl.c
index e5fe639..2b2bfe9 100644
--- a/net/sctp/sysctl.c
+++ b/net/sctp/sysctl.c
@@ -141,6 +141,15 @@ static ctl_table sctp_table[] = {
 		.extra2		= &int_max
 	},
 	{
+		.procname	= "pf_retrans",
+		.data		= &sctp_pf_retrans,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &int_max
+	},
+	{
 		.procname	= "max_init_retransmits",
 		.data		= &sctp_max_retrans_init,
 		.maxlen		= sizeof(int),
diff --git a/net/sctp/transport.c b/net/sctp/transport.c
index b026ba0..194d0f3 100644
--- a/net/sctp/transport.c
+++ b/net/sctp/transport.c
@@ -85,6 +85,7 @@ static struct sctp_transport *sctp_transport_init(struct sctp_transport *peer,
 
 	/* Initialize the default path max_retrans.  */
 	peer->pathmaxrxt  = sctp_max_retrans_path;
+	peer->pf_retrans  = sctp_pf_retrans;
 
 	INIT_LIST_HEAD(&peer->transmitted);
 	INIT_LIST_HEAD(&peer->send_ready);
@@ -585,7 +586,8 @@ unsigned long sctp_transport_timeout(struct sctp_transport *t)
 {
 	unsigned long timeout;
 	timeout = t->rto + sctp_jitter(t->rto);
-	if (t->state != SCTP_UNCONFIRMED)
+	if ((t->state != SCTP_UNCONFIRMED) &&
+	    (t->state != SCTP_PF))
 		timeout += t->hbinterval;
 	timeout += jiffies;
 	return timeout;
-- 
1.7.7.6

^ permalink raw reply related

* Re: [PATCH] b44: add 64 bit stats
From: Kevin Groeneveld @ 2012-07-20 18:56 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1342761865.2626.5572.camel@edumazet-glaptop>

On Fri, Jul 20, 2012 at 1:24 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> Absolutely. You can argue that probably nobody use this driver on a
>> 32bit UP machine, but technically speaking the current implementation is
>> racy.
>
> In fact all network drivers should use the _bh version.

Okay, thanks for clarifying.

> Could you send a patch for all of them, based on net-next tree ?

Sure, I can work on that.  It should be a relatively easy thing to
update.  I can probably send a patch within the next couple days.


Kevin

^ permalink raw reply

* Re: [PATCH v5] sctp: Implement quick failover draft from tsvwg
From: Flavio Leitner @ 2012-07-20 19:10 UTC (permalink / raw)
  To: Neil Horman
  Cc: netdev, Vlad Yasevich, Sridhar Samudrala, David S. Miller,
	linux-sctp, joe
In-Reply-To: <1342810319-27457-1-git-send-email-nhorman@tuxdriver.com>

On Fri, 20 Jul 2012 14:51:59 -0400
Neil Horman <nhorman@tuxdriver.com> wrote:

> I've seen several attempts recently made to do quick failover of sctp transports
> by reducing various retransmit timers and counters.  While its possible to
> implement a faster failover on multihomed sctp associations, its not
> particularly robust, in that it can lead to unneeded retransmits, as well as
> false connection failures due to intermittent latency on a network.
> 
> Instead, lets implement the new ietf quick failover draft found here:
> http://tools.ietf.org/html/draft-nishida-tsvwg-sctp-failover-05
> 
> This will let the sctp stack identify transports that have had a small number of
> errors, and avoid using them quickly until their reliability can be
> re-established.  I've tested this out on two virt guests connected via multiple
> isolated virt networks and believe its in compliance with the above draft and
> works well.
> 
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> CC: Vlad Yasevich <vyasevich@gmail.com>
> CC: Sridhar Samudrala <sri@us.ibm.com>
> CC: "David S. Miller" <davem@davemloft.net>
> CC: linux-sctp@vger.kernel.org
> CC: joe@perches.com
> 
> ---
> Change notes:
> 
> V2)
> - Added socket option API from section 6.1 of the specification, as per
> request from Vlad. Adding this socket option allows us to alter both the path
> maximum retransmit value and the path partial failure threshold for each
> transport and the association as a whole.
> 
> - Added a per transport pf_retrans value, and initialized it from the
> association value.  This makes each transport independently configurable as per
> the socket option above, and prevents changes in the sysctl from bleeding into
> an already created association.
> 
> V3)
> - Cleaned up some line spacing (Joe Perches)
> - Fixed some socket option user data sanitization (Vlad Yasevich)
> 
> V4)
> - Added additional documentation (Flavio Leitner)
> 
> V5)
> - Modified setsockopt option to ignore 0 pathmaxrxt rather than return
>   error (Vlad Yasevich)
> - Modified getsocopt to return option length written (Vlad Y.)
> ---
>  Documentation/networking/ip-sysctl.txt |   14 +++++
>  include/net/sctp/constants.h           |    1 +
>  include/net/sctp/structs.h             |   20 ++++++-
>  include/net/sctp/user.h                |   11 ++++
>  net/sctp/associola.c                   |   37 ++++++++++--
>  net/sctp/outqueue.c                    |    6 +-
>  net/sctp/sm_sideeffect.c               |   33 +++++++++-
>  net/sctp/socket.c                      |  100 ++++++++++++++++++++++++++++++++
>  net/sctp/sysctl.c                      |    9 +++
>  net/sctp/transport.c                   |    4 +-
>  10 files changed, 220 insertions(+), 15 deletions(-)
> 
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index 47b6c79..c636f9c 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -1408,6 +1408,20 @@ path_max_retrans - INTEGER
>  
>  	Default: 5
>  
> +pf_retrans - INTEGER
> +	The number of retransmissions that will be attempted on a given path
> +	before traffic is redirected to an alternate transport (should one
> +	exist).  Note this is distinct from path_max_retrans, as a path that
> +	passes the pf_retrans threshold can still be used.  Its only
> +	deprioritized when a transmission path is selected by the stack.  This
> +	setting is primarily used to enable fast failover mechanisms without
> +	having to reduce path_max_retrans to a very low value.  See:
> +	http://www.ietf.org/id/draft-nishida-tsvwg-sctp-failover-05.txt
> +	for details.  Note also that a value of pf_retrans > path_max_retrans
> +	disables this feature
> +
> +	Default: 0
> +
>  rto_initial - INTEGER
>  	The initial round trip timeout value in milliseconds that will be used
>  	in calculating round trip times.  This is the initial time interval
> diff --git a/include/net/sctp/constants.h b/include/net/sctp/constants.h
> index 942b864..d053d2e 100644
> --- a/include/net/sctp/constants.h
> +++ b/include/net/sctp/constants.h
> @@ -334,6 +334,7 @@ typedef enum {
>  typedef enum {
>  	SCTP_TRANSPORT_UP,
>  	SCTP_TRANSPORT_DOWN,
> +	SCTP_TRANSPORT_PF,
>  } sctp_transport_cmd_t;
>  
>  /* These are the address scopes defined mainly for IPv4 addresses
> diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
> index e4652fe..cee0678 100644
> --- a/include/net/sctp/structs.h
> +++ b/include/net/sctp/structs.h
> @@ -161,6 +161,12 @@ extern struct sctp_globals {
>  	int max_retrans_path;
>  	int max_retrans_init;
>  
> +	/* Potentially-Failed.Max.Retrans sysctl value
> +	 * taken from:
> +	 * http://tools.ietf.org/html/draft-nishida-tsvwg-sctp-failover-05
> +	 */
> +	int pf_retrans;
> +
>  	/*
>  	 * Policy for preforming sctp/socket accounting
>  	 * 0   - do socket level accounting, all assocs share sk_sndbuf
> @@ -258,6 +264,7 @@ extern struct sctp_globals {
>  #define sctp_sndbuf_policy	 	(sctp_globals.sndbuf_policy)
>  #define sctp_rcvbuf_policy	 	(sctp_globals.rcvbuf_policy)
>  #define sctp_max_retrans_path		(sctp_globals.max_retrans_path)
> +#define sctp_pf_retrans			(sctp_globals.pf_retrans)
>  #define sctp_max_retrans_init		(sctp_globals.max_retrans_init)
>  #define sctp_sack_timeout		(sctp_globals.sack_timeout)
>  #define sctp_hb_interval		(sctp_globals.hb_interval)
> @@ -987,10 +994,15 @@ struct sctp_transport {
>  
>  	/* This is the max_retrans value for the transport and will
>  	 * be initialized from the assocs value.  This can be changed
> -	 * using SCTP_SET_PEER_ADDR_PARAMS socket option.
> +	 * using the SCTP_SET_PEER_ADDR_PARAMS socket option.
>  	 */
>  	__u16 pathmaxrxt;
>  
> +	/* This is the partially failed retrans value for the transport
> +	 * and will be initialized from the assocs value.  This can be changed
> +	 * using the SCTP_PEER_ADDR_THLDS socket option
> +	 */
> +	int pf_retrans;
>  	/* PMTU	      : The current known path MTU.  */
>  	__u32 pathmtu;
>  
> @@ -1660,6 +1672,12 @@ struct sctp_association {
>  	 */
>  	int max_retrans;
>  
> +	/* This is the partially failed retrans value for the transport
> +	 * and will be initialized from the assocs value.  This can be
> +	 * changed using the SCTP_PEER_ADDR_THLDS socket option
> +	 */
> +	int pf_retrans;
> +
>  	/* Maximum number of times the endpoint will retransmit INIT  */
>  	__u16 max_init_attempts;
>  
> diff --git a/include/net/sctp/user.h b/include/net/sctp/user.h
> index 0842ef0..1b02d7a 100644
> --- a/include/net/sctp/user.h
> +++ b/include/net/sctp/user.h
> @@ -93,6 +93,7 @@ typedef __s32 sctp_assoc_t;
>  #define SCTP_GET_ASSOC_NUMBER	28	/* Read only */
>  #define SCTP_GET_ASSOC_ID_LIST	29	/* Read only */
>  #define SCTP_AUTO_ASCONF       30
> +#define SCTP_PEER_ADDR_THLDS	31
>  
>  /* Internal Socket Options. Some of the sctp library functions are
>   * implemented using these socket options.
> @@ -649,6 +650,7 @@ struct sctp_paddrinfo {
>   */
>  enum sctp_spinfo_state {
>  	SCTP_INACTIVE,
> +	SCTP_PF,
>  	SCTP_ACTIVE,
>  	SCTP_UNCONFIRMED,
>  	SCTP_UNKNOWN = 0xffff  /* Value used for transport state unknown */
> @@ -741,4 +743,13 @@ typedef struct {
>  	int sd;
>  } sctp_peeloff_arg_t;
>  
> +/*
> + *  Peer Address Thresholds socket option
> + */
> +struct sctp_paddrthlds {
> +	sctp_assoc_t spt_assoc_id;
> +	struct sockaddr_storage spt_address;
> +	__u16 spt_pathmaxrxt;
> +	__u16 spt_pathpfthld;
> +};
>  #endif /* __net_sctp_user_h__ */
> diff --git a/net/sctp/associola.c b/net/sctp/associola.c
> index 5bc9ab1..90fe36b 100644
> --- a/net/sctp/associola.c
> +++ b/net/sctp/associola.c
> @@ -124,6 +124,8 @@ static struct sctp_association *sctp_association_init(struct sctp_association *a
>  	 * socket values.
>  	 */
>  	asoc->max_retrans = sp->assocparams.sasoc_asocmaxrxt;
> +	asoc->pf_retrans  = sctp_pf_retrans;
> +
>  	asoc->rto_initial = msecs_to_jiffies(sp->rtoinfo.srto_initial);
>  	asoc->rto_max = msecs_to_jiffies(sp->rtoinfo.srto_max);
>  	asoc->rto_min = msecs_to_jiffies(sp->rtoinfo.srto_min);
> @@ -685,6 +687,9 @@ struct sctp_transport *sctp_assoc_add_peer(struct sctp_association *asoc,
>  	/* Set the path max_retrans.  */
>  	peer->pathmaxrxt = asoc->pathmaxrxt;
>  
> +	/* And the partial failure retrnas threshold */
> +	peer->pf_retrans = asoc->pf_retrans;
> +
>  	/* Initialize the peer's SACK delay timeout based on the
>  	 * association configured value.
>  	 */
> @@ -840,6 +845,7 @@ void sctp_assoc_control_transport(struct sctp_association *asoc,
>  	struct sctp_ulpevent *event;
>  	struct sockaddr_storage addr;
>  	int spc_state = 0;
> +	bool ulp_notify = true;
>  
>  	/* Record the transition on the transport.  */
>  	switch (command) {
> @@ -853,6 +859,14 @@ void sctp_assoc_control_transport(struct sctp_association *asoc,
>  			spc_state = SCTP_ADDR_CONFIRMED;
>  		else
>  			spc_state = SCTP_ADDR_AVAILABLE;
> +		/* Don't inform ULP about transition from PF to
> +		 * active state and set cwnd to 1, see SCTP
> +		 * Quick failover draft section 5.1, point 5
> +		 */
> +		if (transport->state == SCTP_PF) {
> +			ulp_notify = false;
> +			transport->cwnd = 1;
> +		}
>  		transport->state = SCTP_ACTIVE;
>  		break;
>  
> @@ -871,6 +885,11 @@ void sctp_assoc_control_transport(struct sctp_association *asoc,
>  		spc_state = SCTP_ADDR_UNREACHABLE;
>  		break;
>  
> +	case SCTP_TRANSPORT_PF:
> +		transport->state = SCTP_PF;
> +		ulp_notify = false;
> +		break;
> +
>  	default:
>  		return;
>  	}
> @@ -878,12 +897,15 @@ void sctp_assoc_control_transport(struct sctp_association *asoc,
>  	/* Generate and send a SCTP_PEER_ADDR_CHANGE notification to the
>  	 * user.
>  	 */
> -	memset(&addr, 0, sizeof(struct sockaddr_storage));
> -	memcpy(&addr, &transport->ipaddr, transport->af_specific->sockaddr_len);
> -	event = sctp_ulpevent_make_peer_addr_change(asoc, &addr,
> -				0, spc_state, error, GFP_ATOMIC);
> -	if (event)
> -		sctp_ulpq_tail_event(&asoc->ulpq, event);
> +	if (ulp_notify) {
> +		memset(&addr, 0, sizeof(struct sockaddr_storage));
> +		memcpy(&addr, &transport->ipaddr,
> +		       transport->af_specific->sockaddr_len);
> +		event = sctp_ulpevent_make_peer_addr_change(asoc, &addr,
> +					0, spc_state, error, GFP_ATOMIC);
> +		if (event)
> +			sctp_ulpq_tail_event(&asoc->ulpq, event);
> +	}
>  
>  	/* Select new active and retran paths. */
>  
> @@ -899,7 +921,8 @@ void sctp_assoc_control_transport(struct sctp_association *asoc,
>  			transports) {
>  
>  		if ((t->state == SCTP_INACTIVE) ||
> -		    (t->state == SCTP_UNCONFIRMED))
> +		    (t->state == SCTP_UNCONFIRMED) ||
> +		    (t->state == SCTP_PF))
>  			continue;
>  		if (!first || t->last_time_heard > first->last_time_heard) {
>  			second = first;
> diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
> index a0fa19f..e7aa177c 100644
> --- a/net/sctp/outqueue.c
> +++ b/net/sctp/outqueue.c
> @@ -792,7 +792,8 @@ static int sctp_outq_flush(struct sctp_outq *q, int rtx_timeout)
>  			if (!new_transport)
>  				new_transport = asoc->peer.active_path;
>  		} else if ((new_transport->state == SCTP_INACTIVE) ||
> -			   (new_transport->state == SCTP_UNCONFIRMED)) {
> +			   (new_transport->state == SCTP_UNCONFIRMED) ||
> +			   (new_transport->state == SCTP_PF)) {
>  			/* If the chunk is Heartbeat or Heartbeat Ack,
>  			 * send it to chunk->transport, even if it's
>  			 * inactive.
> @@ -987,7 +988,8 @@ static int sctp_outq_flush(struct sctp_outq *q, int rtx_timeout)
>  			new_transport = chunk->transport;
>  			if (!new_transport ||
>  			    ((new_transport->state == SCTP_INACTIVE) ||
> -			     (new_transport->state == SCTP_UNCONFIRMED)))
> +			     (new_transport->state == SCTP_UNCONFIRMED) ||
> +			     (new_transport->state == SCTP_PF)))
>  				new_transport = asoc->peer.active_path;
>  			if (new_transport->state == SCTP_UNCONFIRMED)
>  				continue;
> diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c
> index c96d1a8..285e26a 100644
> --- a/net/sctp/sm_sideeffect.c
> +++ b/net/sctp/sm_sideeffect.c
> @@ -76,6 +76,8 @@ static int sctp_side_effects(sctp_event_t event_type, sctp_subtype_t subtype,
>  			     sctp_cmd_seq_t *commands,
>  			     gfp_t gfp);
>  
> +static void sctp_cmd_hb_timer_update(sctp_cmd_seq_t *cmds,
> +				     struct sctp_transport *t);
>  /********************************************************************
>   * Helper functions
>   ********************************************************************/
> @@ -470,7 +472,8 @@ sctp_timer_event_t *sctp_timer_events[SCTP_NUM_TIMEOUT_TYPES] = {
>   * notification SHOULD be sent to the upper layer.
>   *
>   */
> -static void sctp_do_8_2_transport_strike(struct sctp_association *asoc,
> +static void sctp_do_8_2_transport_strike(sctp_cmd_seq_t *commands,
> +					 struct sctp_association *asoc,
>  					 struct sctp_transport *transport,
>  					 int is_hb)
>  {
> @@ -495,6 +498,23 @@ static void sctp_do_8_2_transport_strike(struct sctp_association *asoc,
>  			transport->error_count++;
>  	}
>  
> +	/* If the transport error count is greater than the pf_retrans
> +	 * threshold, and less than pathmaxrtx, then mark this transport
> +	 * as Partially Failed, ee SCTP Quick Failover Draft, secon 5.1,
> +	 * point 1
> +	 */
> +	if ((transport->state != SCTP_PF) &&
> +	   (asoc->pf_retrans < transport->pathmaxrxt) &&
> +	   (transport->error_count > asoc->pf_retrans)) {
> +
> +		sctp_assoc_control_transport(asoc, transport,
> +					     SCTP_TRANSPORT_PF,
> +					     0);
> +
> +		/* Update the hb timer to resend a heartbeat every rto */
> +		sctp_cmd_hb_timer_update(commands, transport);
> +	}
> +
>  	if (transport->state != SCTP_INACTIVE &&
>  	    (transport->error_count > transport->pathmaxrxt)) {
>  		SCTP_DEBUG_PRINTK_IPADDR("transport_strike:association %p",
> @@ -699,6 +719,10 @@ static void sctp_cmd_transport_on(sctp_cmd_seq_t *cmds,
>  					     SCTP_HEARTBEAT_SUCCESS);
>  	}
>  
> +	if (t->state == SCTP_PF)
> +		sctp_assoc_control_transport(asoc, t, SCTP_TRANSPORT_UP,
> +					     SCTP_HEARTBEAT_SUCCESS);
> +
>  	/* The receiver of the HEARTBEAT ACK should also perform an
>  	 * RTT measurement for that destination transport address
>  	 * using the time value carried in the HEARTBEAT ACK chunk.
> @@ -1565,8 +1589,8 @@ static int sctp_cmd_interpreter(sctp_event_t event_type,
>  
>  		case SCTP_CMD_STRIKE:
>  			/* Mark one strike against a transport.  */
> -			sctp_do_8_2_transport_strike(asoc, cmd->obj.transport,
> -						    0);
> +			sctp_do_8_2_transport_strike(commands, asoc,
> +						    cmd->obj.transport, 0);
>  			break;
>  
>  		case SCTP_CMD_TRANSPORT_IDLE:
> @@ -1576,7 +1600,8 @@ static int sctp_cmd_interpreter(sctp_event_t event_type,
>  
>  		case SCTP_CMD_TRANSPORT_HB_SENT:
>  			t = cmd->obj.transport;
> -			sctp_do_8_2_transport_strike(asoc, t, 1);
> +			sctp_do_8_2_transport_strike(commands, asoc,
> +						     t, 1);
>  			t->hb_sent = 1;
>  			break;
>  
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index b3b8a8d..bba551f 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -3470,6 +3470,56 @@ static int sctp_setsockopt_auto_asconf(struct sock *sk, char __user *optval,
>  }
>  
>  
> +/*
> + * SCTP_PEER_ADDR_THLDS
> + *
> + * This option allows us to alter the partially failed threshold for one or all
> + * transports in an association.  See Section 6.1 of:
> + * http://www.ietf.org/id/draft-nishida-tsvwg-sctp-failover-05.txt
> + */
> +static int sctp_setsockopt_paddr_thresholds(struct sock *sk,
> +					    char __user *optval,
> +					    unsigned int optlen)
> +{
> +	struct sctp_paddrthlds val;
> +	struct sctp_transport *trans;
> +	struct sctp_association *asoc;
> +
> +	if (optlen < sizeof(struct sctp_paddrthlds))
> +		return -EINVAL;
> +	if (copy_from_user(&val, (struct sctp_paddrthlds __user *)optval,
> +			   sizeof(struct sctp_paddrthlds)))
> +		return -EFAULT;
> +
> +
> +	if (sctp_is_any(sk, (const union sctp_addr *)&val.spt_address)) {
> +		asoc = sctp_id2assoc(sk, val.spt_assoc_id);
> +		if (!asoc)
> +			return -ENOENT;
> +		list_for_each_entry(trans, &asoc->peer.transport_addr_list,
> +				    transports) {
> +			if (val.spt_pathmaxrxt)
> +				trans->pathmaxrxt = val.spt_pathmaxrxt;
> +			trans->pf_retrans = val.spt_pathpfthld;
> +		}
> +
> +		if (val.spt_pathmaxrxt)
> +			asoc->pathmaxrxt = val.spt_pathmaxrxt;
> +		asoc->pf_retrans = val.spt_pathpfthld;
> +	} else {
> +		trans = sctp_addr_id2transport(sk, &val.spt_address,
> +					       val.spt_assoc_id);
> +		if (!trans)
> +			return -ENOENT;
> +
> +		if (val.spt_pathmaxrxt)
> +			trans->pathmaxrxt = val.spt_pathmaxrxt;
> +		trans->pf_retrans = val.spt_pathpfthld;
> +	}
> +
> +	return 0;
> +}
> +
>  /* API 6.2 setsockopt(), getsockopt()
>   *
>   * Applications use setsockopt() and getsockopt() to set or retrieve
> @@ -3619,6 +3669,9 @@ SCTP_STATIC int sctp_setsockopt(struct sock *sk, int level, int optname,
>  	case SCTP_AUTO_ASCONF:
>  		retval = sctp_setsockopt_auto_asconf(sk, optval, optlen);
>  		break;
> +	case SCTP_PEER_ADDR_THLDS:
> +		retval = sctp_setsockopt_paddr_thresholds(sk, optval, optlen);
> +		break;
>  	default:
>  		retval = -ENOPROTOOPT;
>  		break;
> @@ -5490,6 +5543,50 @@ static int sctp_getsockopt_assoc_ids(struct sock *sk, int len,
>  	return 0;
>  }
>  
> +/*
> + * SCTP_PEER_ADDR_THLDS
> + *
> + * This option allows us to fetch the partially failed threshold for one or all
> + * transports in an association.  See Section 6.1 of:
> + * http://www.ietf.org/id/draft-nishida-tsvwg-sctp-failover-05.txt
> + */
> +static int sctp_getsockopt_paddr_thresholds(struct sock *sk,
> +					    char __user *optval,
> +					    int optlen)
> +{
> +	struct sctp_paddrthlds val;
> +	struct sctp_transport *trans;
> +	struct sctp_association *asoc;
> +
> +	if (optlen < sizeof(struct sctp_paddrthlds))
> +		return -EINVAL;
> +	optlen = sizeof(struct sctp_paddrthlds);
> +	if (copy_from_user(&val, (struct sctp_paddrthlds __user *)optval, optlen))
> +		return -EFAULT;
> +
> +	if (sctp_is_any(sk, (const union sctp_addr *)&val.spt_address)) {
> +		asoc = sctp_id2assoc(sk, val.spt_assoc_id);
> +		if (!asoc)
> +			return -ENOENT;
> +
> +		val.spt_pathpfthld = asoc->pf_retrans;
> +		val.spt_pathmaxrxt = asoc->pathmaxrxt;
> +	} else {
> +		trans = sctp_addr_id2transport(sk, &val.spt_address,
> +					       val.spt_assoc_id);
> +		if (!trans)
> +			return -ENOENT;
> +
> +		val.spt_pathmaxrxt = trans->pathmaxrxt;
> +		val.spt_pathpfthld = trans->pf_retrans;
> +	}
> +
> +	if (copy_to_user(optval, &val, optlen))
> +		return -EFAULT;
> +
> +	return optlen;
> +}
> +
>  SCTP_STATIC int sctp_getsockopt(struct sock *sk, int level, int optname,
>  				char __user *optval, int __user *optlen)
>  {
> @@ -5628,6 +5725,9 @@ SCTP_STATIC int sctp_getsockopt(struct sock *sk, int level, int optname,
>  	case SCTP_AUTO_ASCONF:
>  		retval = sctp_getsockopt_auto_asconf(sk, len, optval, optlen);
>  		break;
> +	case SCTP_PEER_ADDR_THLDS:
> +		retval = sctp_getsockopt_paddr_thresholds(sk, optval, len);
> +		break;
>  	default:
>  		retval = -ENOPROTOOPT;
>  		break;
> diff --git a/net/sctp/sysctl.c b/net/sctp/sysctl.c
> index e5fe639..2b2bfe9 100644
> --- a/net/sctp/sysctl.c
> +++ b/net/sctp/sysctl.c
> @@ -141,6 +141,15 @@ static ctl_table sctp_table[] = {
>  		.extra2		= &int_max
>  	},
>  	{
> +		.procname	= "pf_retrans",
> +		.data		= &sctp_pf_retrans,
> +		.maxlen		= sizeof(int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec_minmax,
> +		.extra1		= &zero,
> +		.extra2		= &int_max
> +	},
> +	{
>  		.procname	= "max_init_retransmits",
>  		.data		= &sctp_max_retrans_init,
>  		.maxlen		= sizeof(int),
> diff --git a/net/sctp/transport.c b/net/sctp/transport.c
> index b026ba0..194d0f3 100644
> --- a/net/sctp/transport.c
> +++ b/net/sctp/transport.c
> @@ -85,6 +85,7 @@ static struct sctp_transport *sctp_transport_init(struct sctp_transport *peer,
>  
>  	/* Initialize the default path max_retrans.  */
>  	peer->pathmaxrxt  = sctp_max_retrans_path;
> +	peer->pf_retrans  = sctp_pf_retrans;
>  
>  	INIT_LIST_HEAD(&peer->transmitted);
>  	INIT_LIST_HEAD(&peer->send_ready);
> @@ -585,7 +586,8 @@ unsigned long sctp_transport_timeout(struct sctp_transport *t)
>  {
>  	unsigned long timeout;
>  	timeout = t->rto + sctp_jitter(t->rto);
> -	if (t->state != SCTP_UNCONFIRMED)
> +	if ((t->state != SCTP_UNCONFIRMED) &&
> +	    (t->state != SCTP_PF))
>  		timeout += t->hbinterval;
>  	timeout += jiffies;
>  	return timeout;

Reviewed-by: Flavio Leitner <fbl@redhat.com>

fbl

^ permalink raw reply

* [PATCHv3 0/6] tun zerocopy support
From: Michael S. Tsirkin @ 2012-07-20 19:23 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, eric.dumazet, netdev, linux-kernel, ebiederm, davem

This adds support for experimental zero copy transmit to tun.

This includes some patches from Ian's patchset to support zerocopy with tun,
so it should help that work progress: we are still trying to figure out
how to make everything work properly with tcp but tun seems easier, and
it's helpful by itself since not everyone can use macvtap.

Same as with macvtap, I get single-percentage wins in CPU utilization
on guest to external from this patchset, and a performance regression on
guest to host, so more work is needed until this feature can move out of
experimental status, but I think it's useful for some people already.

Pls review and consider for 3.6.

There's some code duplication between tun and macvtap now: common code
could move to net/core/datagram.c, this patch does not do this yet.

Changes from v2:
	Fixed some bugs so it's stable now

Michael S. Tsirkin (6):
  skbuff: add an api to orphan frags
  skbuff: convert to skb_orphan_frags
  skbuff: export skb_copy_ubufs
  tun: orphan frags on xmit
  net: orphan frags on receive
  tun: experimental zero copy tx support

 drivers/net/tun.c      | 148 +++++++++++++++++++++++++++++++++++++++++++++----
 include/linux/skbuff.h |  16 ++++++
 net/core/dev.c         |   7 ++-
 net/core/skbuff.c      |  24 +++-----
 4 files changed, 167 insertions(+), 28 deletions(-)

-- 
MST

^ permalink raw reply

* [PATCHv3 1/6] skbuff: add an api to orphan frags
From: Michael S. Tsirkin @ 2012-07-20 19:23 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, eric.dumazet, netdev, linux-kernel, ebiederm, davem
In-Reply-To: <cover.1342812067.git.mst@redhat.com>

Many places do
       if ((skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY))
		skb_copy_ubufs(skb, gfp_mask);
to copy and invoke frag destructors if necessary.
Add an inline helper for this.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/skbuff.h | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 642cb73..d205c4b 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1667,6 +1667,22 @@ static inline void skb_orphan(struct sk_buff *skb)
 }
 
 /**
+ *	skb_orphan_frags - orphan the frags contained in a buffer
+ *	@skb: buffer to orphan frags from
+ *	@gfp_mask: allocation mask for replacement pages
+ *
+ *	For each frag in the SKB which needs a destructor (i.e. has an
+ *	owner) create a copy of that frag and release the original
+ *	page by calling the destructor.
+ */
+static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
+{
+	if (likely(!(skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY)))
+		return 0;
+	return skb_copy_ubufs(skb, gfp_mask);
+}
+
+/**
  *	__skb_queue_purge - empty a list
  *	@list: list to empty
  *
-- 
MST

^ permalink raw reply related

* [PATCHv3 2/6] skbuff: convert to skb_orphan_frags
From: Michael S. Tsirkin @ 2012-07-20 19:23 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, eric.dumazet, netdev, linux-kernel, ebiederm, davem
In-Reply-To: <cover.1342812067.git.mst@redhat.com>

Reduce code duplication a bit using the new helper.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 net/core/skbuff.c | 22 ++++++++--------------
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index ccfcb7d..438bbc5 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -804,10 +804,8 @@ struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
 {
 	struct sk_buff *n;
 
-	if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) {
-		if (skb_copy_ubufs(skb, gfp_mask))
-			return NULL;
-	}
+	if (skb_orphan_frags(skb, gfp_mask))
+		return NULL;
 
 	n = skb + 1;
 	if (skb->fclone == SKB_FCLONE_ORIG &&
@@ -927,12 +925,10 @@ struct sk_buff *__pskb_copy(struct sk_buff *skb, int headroom, gfp_t gfp_mask)
 	if (skb_shinfo(skb)->nr_frags) {
 		int i;
 
-		if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) {
-			if (skb_copy_ubufs(skb, gfp_mask)) {
-				kfree_skb(n);
-				n = NULL;
-				goto out;
-			}
+		if (skb_orphan_frags(skb, gfp_mask)) {
+			kfree_skb(n);
+			n = NULL;
+			goto out;
 		}
 		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 			skb_shinfo(n)->frags[i] = skb_shinfo(skb)->frags[i];
@@ -1005,10 +1001,8 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 	 */
 	if (skb_cloned(skb)) {
 		/* copy this zero copy skb frags */
-		if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) {
-			if (skb_copy_ubufs(skb, gfp_mask))
-				goto nofrags;
-		}
+		if (skb_orphan_frags(skb, gfp_mask))
+			goto nofrags;
 		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
 			skb_frag_ref(skb, i);
 
-- 
MST

^ permalink raw reply related

* [PATCHv3 4/6] tun: orphan frags on xmit
From: Michael S. Tsirkin @ 2012-07-20 19:23 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, eric.dumazet, netdev, linux-kernel, ebiederm, davem
In-Reply-To: <cover.1342812067.git.mst@redhat.com>

tun xmit is actually receive of the internal tun
socket. Orphan the frags same as we do for normal rx path.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/net/tun.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index f3a454c..b95a7f4 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -416,6 +416,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)

 	/* Orphan the skb - required as we might hang on to it
 	 * for indefinite time. */
+	if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
+		goto drop;
 	skb_orphan(skb);

 	/* Enqueue packet */
-- 
MST

^ permalink raw reply related

* [PATCHv3 5/6] net: orphan frags on receive
From: Michael S. Tsirkin @ 2012-07-20 19:23 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, eric.dumazet, netdev, linux-kernel, ebiederm, davem
In-Reply-To: <cover.1342812067.git.mst@redhat.com>

zero copy packets are normally sent to the outside
network, but bridging, tun etc might loop them
back to host networking stack. If this happens
destructors will never be called, so orphan
the frags immediately on receive.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 net/core/dev.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index d70e4a3..cca02ae 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1632,6 +1632,8 @@ static inline int deliver_skb(struct sk_buff *skb,
 			      struct packet_type *pt_prev,
 			      struct net_device *orig_dev)
 {
+	if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
+		return -ENOMEM;
 	atomic_inc(&skb->users);
 	return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
 }
@@ -3262,7 +3264,10 @@ ncls:
 	}

 	if (pt_prev) {
-		ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
+		if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
+			ret = -ENOMEM;
+		else
+			ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
 	} else {
 		atomic_long_inc(&skb->dev->rx_dropped);
 		kfree_skb(skb);
-- 
MST

^ permalink raw reply related

* [PATCHv3 3/6] skbuff: export skb_copy_ubufs
From: Michael S. Tsirkin @ 2012-07-20 19:23 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, eric.dumazet, netdev, linux-kernel, ebiederm, davem
In-Reply-To: <cover.1342812067.git.mst@redhat.com>

Export skb_copy_ubufs so that modules can orphan frags.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 net/core/skbuff.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 438bbc5..368f65c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -784,7 +784,7 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 	skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
 	return 0;
 }
-
+EXPORT_SYMBOL_GPL(skb_copy_ubufs);

 /**
  *	skb_clone	-	duplicate an sk_buff
-- 
MST

^ permalink raw reply related

* [PATCHv3 6/6] tun: experimental zero copy tx support
From: Michael S. Tsirkin @ 2012-07-20 19:23 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, eric.dumazet, netdev, linux-kernel, ebiederm, davem
In-Reply-To: <cover.1342812067.git.mst@redhat.com>

Let vhost-net utilize zero copy tx when used with tun.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/net/tun.c | 146 +++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 134 insertions(+), 12 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index b95a7f4..c62163e 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -100,6 +100,8 @@ do {								\
 } while (0)
 #endif
 
+#define GOODCOPY_LEN 128
+
 #define FLT_EXACT_COUNT 8
 struct tap_filter {
 	unsigned int    count;    /* Number of addrs. Zero means disabled */
@@ -604,19 +606,100 @@ static struct sk_buff *tun_alloc_skb(struct tun_struct *tun,
 	return skb;
 }
 
+/* set skb frags from iovec, this can move to core network code for reuse */
+static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
+				  int offset, size_t count)
+{
+	int len = iov_length(from, count) - offset;
+	int copy = skb_headlen(skb);
+	int size, offset1 = 0;
+	int i = 0;
+
+	/* Skip over from offset */
+	while (count && (offset >= from->iov_len)) {
+		offset -= from->iov_len;
+		++from;
+		--count;
+	}
+
+	/* copy up to skb headlen */
+	while (count && (copy > 0)) {
+		size = min_t(unsigned int, copy, from->iov_len - offset);
+		if (copy_from_user(skb->data + offset1, from->iov_base + offset,
+				   size))
+			return -EFAULT;
+		if (copy > size) {
+			++from;
+			--count;
+			offset = 0;
+		} else
+			offset += size;
+		copy -= size;
+		offset1 += size;
+	}
+
+	if (len == offset1)
+		return 0;
+
+	while (count--) {
+		struct page *page[MAX_SKB_FRAGS];
+		int num_pages;
+		unsigned long base;
+		unsigned long truesize;
+
+		len = from->iov_len - offset;
+		if (!len) {
+			offset = 0;
+			++from;
+			continue;
+		}
+		base = (unsigned long)from->iov_base + offset;
+		size = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
+		if (i + size > MAX_SKB_FRAGS)
+			return -EMSGSIZE;
+		num_pages = get_user_pages_fast(base, size, 0, &page[i]);
+		if (num_pages != size) {
+			for (i = 0; i < num_pages; i++)
+				put_page(page[i]);
+			return -EFAULT;
+		}
+		truesize = size * PAGE_SIZE;
+		skb->data_len += len;
+		skb->len += len;
+		skb->truesize += truesize;
+		atomic_add(truesize, &skb->sk->sk_wmem_alloc);
+		while (len) {
+			int off = base & ~PAGE_MASK;
+			int size = min_t(int, len, PAGE_SIZE - off);
+			__skb_fill_page_desc(skb, i, page[i], off, size);
+			skb_shinfo(skb)->nr_frags++;
+			/* increase sk_wmem_alloc */
+			base += size;
+			len -= size;
+			i++;
+		}
+		offset = 0;
+		++from;
+	}
+	return 0;
+}
+
 /* Get packet from user space buffer */
-static ssize_t tun_get_user(struct tun_struct *tun,
-			    const struct iovec *iv, size_t count,
-			    int noblock)
+static ssize_t tun_get_user(struct tun_struct *tun, void *msg_control,
+			    const struct iovec *iv, size_t total_len,
+			    size_t count, int noblock)
 {
 	struct tun_pi pi = { 0, cpu_to_be16(ETH_P_IP) };
 	struct sk_buff *skb;
-	size_t len = count, align = NET_SKB_PAD;
+	size_t len = total_len, align = NET_SKB_PAD;
 	struct virtio_net_hdr gso = { 0 };
 	int offset = 0;
+	int copylen;
+	bool zerocopy = false;
+	int err;
 
 	if (!(tun->flags & TUN_NO_PI)) {
-		if ((len -= sizeof(pi)) > count)
+		if ((len -= sizeof(pi)) > total_len)
 			return -EINVAL;
 
 		if (memcpy_fromiovecend((void *)&pi, iv, 0, sizeof(pi)))
@@ -625,7 +708,7 @@ static ssize_t tun_get_user(struct tun_struct *tun,
 	}
 
 	if (tun->flags & TUN_VNET_HDR) {
-		if ((len -= tun->vnet_hdr_sz) > count)
+		if ((len -= tun->vnet_hdr_sz) > total_len)
 			return -EINVAL;
 
 		if (memcpy_fromiovecend((void *)&gso, iv, offset, sizeof(gso)))
@@ -647,14 +730,46 @@ static ssize_t tun_get_user(struct tun_struct *tun,
 			return -EINVAL;
 	}
 
-	skb = tun_alloc_skb(tun, align, len, gso.hdr_len, noblock);
+	if (msg_control)
+		zerocopy = true;
+
+	if (zerocopy) {
+		/* Userspace may produce vectors with count greater than
+		 * MAX_SKB_FRAGS, so we need to linearize parts of the skb
+		 * to let the rest of data to be fit in the frags.
+		 */
+		if (count > MAX_SKB_FRAGS) {
+			copylen = iov_length(iv, count - MAX_SKB_FRAGS);
+			if (copylen < offset)
+				copylen = 0;
+			else
+				copylen -= offset;
+		} else
+				copylen = 0;
+		/* There are 256 bytes to be copied in skb, so there is enough
+		 * room for skb expand head in case it is used.
+		 * The rest of the buffer is mapped from userspace.
+		 */
+		if (copylen < gso.hdr_len)
+			copylen = gso.hdr_len;
+		if (!copylen)
+			copylen = GOODCOPY_LEN;
+	} else
+		copylen = len;
+
+	skb = tun_alloc_skb(tun, align, copylen, gso.hdr_len, noblock);
 	if (IS_ERR(skb)) {
 		if (PTR_ERR(skb) != -EAGAIN)
 			tun->dev->stats.rx_dropped++;
 		return PTR_ERR(skb);
 	}
 
-	if (skb_copy_datagram_from_iovec(skb, 0, iv, offset, len)) {
+	if (zerocopy)
+		err = zerocopy_sg_from_iovec(skb, iv, offset, count);
+	else
+		err = skb_copy_datagram_from_iovec(skb, 0, iv, offset, len);
+
+	if (err) {
 		tun->dev->stats.rx_dropped++;
 		kfree_skb(skb);
 		return -EFAULT;
@@ -728,12 +843,18 @@ static ssize_t tun_get_user(struct tun_struct *tun,
 		skb_shinfo(skb)->gso_segs = 0;
 	}
 
+	/* copy skb_ubuf_info for callback when skb has no error */
+	if (zerocopy) {
+		skb_shinfo(skb)->destructor_arg = msg_control;
+		skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
+	}
+
 	netif_rx_ni(skb);
 
 	tun->dev->stats.rx_packets++;
 	tun->dev->stats.rx_bytes += len;
 
-	return count;
+	return total_len;
 }
 
 static ssize_t tun_chr_aio_write(struct kiocb *iocb, const struct iovec *iv,
@@ -748,7 +869,7 @@ static ssize_t tun_chr_aio_write(struct kiocb *iocb, const struct iovec *iv,
 
 	tun_debug(KERN_INFO, tun, "tun_chr_write %ld\n", count);
 
-	result = tun_get_user(tun, iv, iov_length(iv, count),
+	result = tun_get_user(tun, NULL, iv, iov_length(iv, count), count,
 			      file->f_flags & O_NONBLOCK);
 
 	tun_put(tun);
@@ -962,8 +1083,8 @@ static int tun_sendmsg(struct kiocb *iocb, struct socket *sock,
 		       struct msghdr *m, size_t total_len)
 {
 	struct tun_struct *tun = container_of(sock, struct tun_struct, socket);
-	return tun_get_user(tun, m->msg_iov, total_len,
-			    m->msg_flags & MSG_DONTWAIT);
+	return tun_get_user(tun, m->msg_control, m->msg_iov, total_len,
+			    m->msg_iovlen, m->msg_flags & MSG_DONTWAIT);
 }
 
 static int tun_recvmsg(struct kiocb *iocb, struct socket *sock,
@@ -1133,6 +1254,7 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
 		sock_init_data(&tun->socket, sk);
 		sk->sk_write_space = tun_sock_write_space;
 		sk->sk_sndbuf = INT_MAX;
+		sock_set_flag(sk, SOCK_ZEROCOPY);
 
 		tun_sk(sk)->tun = tun;
 
-- 
MST

^ permalink raw reply related

* Re: New commands to configure IOV features
From: Chris Friesen @ 2012-07-20 19:29 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Don Dutile, David Miller, yuvalmin, gregory.v.rose, netdev,
	linux-pci
In-Reply-To: <1342806146.2678.31.camel@bwh-desktop.uk.solarflarecom.com>

On 07/20/2012 11:42 AM, Ben Hutchings wrote:
>
> The ethtool API is typically used for net device operations that can be
> largely devolved to individual drivers, and which the network stack can
> mostly ignore (though offload features are an historical exception to
> this).  It started with Ethernet link settings, but many operations are
> applicable (and implemented by) other types of network device.

That (potentially) accounts for all network devices, but it leaves all 
the other devices that could export virtual functions.

Why should I need to use a different API to enable virtual functions on 
my network device and my storage controller?  (And why should "ethtool" 
or "ip" care that it's a virtual function?)

What Don and I are suggesting is that the concept of virtual functions 
is a PCI thing, so it should be dealt with at the PCI layer.  Regardless 
of the type of device the export of virtual functions is conceptually 
the same thing, so it should use the same API.

Once the device exists, then domain-specific APIs would be used to 
configure it the same way that they would configure a physical device.

Chris

^ permalink raw reply

* Re: [PATCH v5] sctp: Implement quick failover draft from tsvwg
From: David Miller @ 2012-07-20 19:31 UTC (permalink / raw)
  To: fbl; +Cc: nhorman, netdev, vyasevich, sri, linux-sctp, joe
In-Reply-To: <20120720161001.1cc1bb10@obelix.rh>

Please DO NOT quote an entire large patch just to add your
signoff.  It's a waste of bandwith, and folks like me have
to scroll down the entire thing to see if you actually
have real patch feedback or not.

Just quote the commit message or similar.

^ permalink raw reply

* [RFC] net: further seperate dst_entry.__refcnt from cache contention
From: Nathan Zimmer @ 2012-07-20 19:46 UTC (permalink / raw)
  To: linux-kernel, netdev; +Cc: Robin.Holt, Nathan Zimmer, David S. Miller

After some investigation on large machines I found that
dst_entry.__refcnt particpates in false cache sharing issues that show
when scaling past 12 threads who communicate via tcp with loopback addresses.
I adjusted refcnt to be on its own cache line and that helped quite a bit.
But perhaps a bit of a waste of space?  Is there some better way?

Here is some preliminary data I had gathered.  It shows nicely improved scaling.

Threads baseline   afterchange
2       1328.03    1340.67
4       2430.31    2282.09
6       3087.65    3258.12
8       3560.34    4165.88
10      3900.34    4962.28
12      3933.38    5613.76
14      3876.98    6113.85
16      3706.01    6338.00
18      3742.48    6634.77
20      3670.15    6641.25
22      3660.98    6799.55
24      3503.13    6613.45
26      3525.73    6702.67
28      3440.16    6801.27
30      3497.59    6911.52
32      3498.25    6540.06

I should say something about this test.  It is a dead simple test in which a
pair of threads simply pass data to each other.  They were placed in the same
socket to avoid cross node overhead.

CC: "David S. Miller" <davem@davemloft.net> 
Signed-off-by: Nathan Zimmer <nzimmer@sgi.com>

---
 include/net/dst.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/net/dst.h b/include/net/dst.h
index 8197ead..3898643 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -84,7 +84,7 @@ struct dst_entry {
 	 * input/output/ops or performance tanks badly
 	 */
 	atomic_t		__refcnt;	/* client references	*/
-	int			__use;
+	int			__use	____cacheline_aligned;
 	unsigned long		lastuse;
 	union {
 		struct dst_entry	*next;
-- 
1.6.0.2

^ permalink raw reply related

* Re: New commands to configure IOV features
From: Ben Hutchings @ 2012-07-20 20:01 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Don Dutile, David Miller, yuvalmin, gregory.v.rose, netdev,
	linux-pci
In-Reply-To: <5009B186.6000806@genband.com>

On Fri, 2012-07-20 at 13:29 -0600, Chris Friesen wrote:
> On 07/20/2012 11:42 AM, Ben Hutchings wrote:
> >
> > The ethtool API is typically used for net device operations that can be
> > largely devolved to individual drivers, and which the network stack can
> > mostly ignore (though offload features are an historical exception to
> > this).  It started with Ethernet link settings, but many operations are
> > applicable (and implemented by) other types of network device.
> 
> That (potentially) accounts for all network devices, but it leaves all 
> the other devices that could export virtual functions.
> 
> Why should I need to use a different API to enable virtual functions on 
> my network device and my storage controller?

Indeed; I was merely making the point that it would be quite valid to
use that means for setting VF network parameters for any network device
that supports IOV.

> (And why should "ethtool" or "ip" care that it's a virtual function?)

VFs may be assigned to a guest which is not fully trusted by the
hypervisor or privileged domain.  (This can sometimes be true for PFs
too, depending on the capabilities of the hypervisor and guest OS.)
Some configuration may therefore need to be done via a trusted PF.

> What Don and I are suggesting is that the concept of virtual functions 
> is a PCI thing, so it should be dealt with at the PCI layer.  Regardless 
> of the type of device the export of virtual functions is conceptually 
> the same thing, so it should use the same API.
> 
> Once the device exists, then domain-specific APIs would be used to 
> configure it the same way that they would configure a physical device.

To an extent, but not entirely.

Currently, the assigned MAC address and (optional) VLAN tag for each
networking VF are configured via the PF net device (though this is done
though the rtnetlink API rather than ethtool).

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* [PATCH] mlx4: Add support for EEH error recovery
From: Kleber Sacilotto de Souza @ 2012-07-20 19:55 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Jack Morgenstein, Yevgeny Petrilin, Or Gerlitz, cascardo,
	brking, Kleber Sacilotto de Souza

Currently the mlx4 drivers don't have the necessary callbacks to
implement EEH errors detection and recovery, so the PCI layer uses the
probe and remove callbacks to try to recover the device after an error on
the bus. However, these callbacks have race conditions with the internal
catastrophic error recovery functions, which will also detect the error
and this can cause the system to crash if both EEH and catas functions
try to reset the device.

This patch adds the necessary error recovery callbacks and makes sure
that the internal catastrophic error functions will not try to reset the
device in such scenarios. It also adds some calls to
pci_channel_offline() to suppress reads/writes on the bus when the slot
cannot accept I/O operations so we prevent unnecessary accesses to the
bus and speed up the device removal.

Signed-off-by: Kleber Sacilotto de Souza <klebers@linux.vnet.ibm.com>
---
 drivers/net/ethernet/mellanox/mlx4/catas.c |   25 ++++++++++----
 drivers/net/ethernet/mellanox/mlx4/cmd.c   |   49 ++++++++++++++++++++++++++-
 drivers/net/ethernet/mellanox/mlx4/main.c  |   30 ++++++++++++++++-
 3 files changed, 93 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/catas.c b/drivers/net/ethernet/mellanox/mlx4/catas.c
index 915e947..9c656fe 100644
--- a/drivers/net/ethernet/mellanox/mlx4/catas.c
+++ b/drivers/net/ethernet/mellanox/mlx4/catas.c
@@ -69,16 +69,21 @@ static void poll_catas(unsigned long dev_ptr)
 	struct mlx4_priv *priv = mlx4_priv(dev);
 
 	if (readl(priv->catas_err.map)) {
-		dump_err_buf(dev);
-
-		mlx4_dispatch_event(dev, MLX4_DEV_EVENT_CATASTROPHIC_ERROR, 0);
+		/* If the device is off-line, we cannot try to recover it */
+		if (pci_channel_offline(dev->pdev))
+			mod_timer(&priv->catas_err.timer,
+				  round_jiffies(jiffies + MLX4_CATAS_POLL_INTERVAL));
+		else {
+			dump_err_buf(dev);
+			mlx4_dispatch_event(dev, MLX4_DEV_EVENT_CATASTROPHIC_ERROR, 0);
 
-		if (internal_err_reset) {
-			spin_lock(&catas_lock);
-			list_add(&priv->catas_err.list, &catas_list);
-			spin_unlock(&catas_lock);
+			if (internal_err_reset) {
+				spin_lock(&catas_lock);
+				list_add(&priv->catas_err.list, &catas_list);
+				spin_unlock(&catas_lock);
 
-			queue_work(mlx4_wq, &catas_work);
+				queue_work(mlx4_wq, &catas_work);
+			}
 		}
 	} else
 		mod_timer(&priv->catas_err.timer,
@@ -100,6 +105,10 @@ static void catas_reset(struct work_struct *work)
 	list_for_each_entry_safe(priv, tmppriv, &tlist, catas_err.list) {
 		struct pci_dev *pdev = priv->dev.pdev;
 
+		/* If the device is off-line, we cannot reset it */
+		if (pci_channel_offline(pdev))
+			continue;
+
 		ret = mlx4_restart_one(priv->dev.pdev);
 		/* 'priv' now is not valid */
 		if (ret)
diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c b/drivers/net/ethernet/mellanox/mlx4/cmd.c
index 7e94987..c8fef43 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
@@ -296,7 +296,12 @@ int mlx4_comm_cmd(struct mlx4_dev *dev, u8 cmd, u16 param,
 
 static int cmd_pending(struct mlx4_dev *dev)
 {
-	u32 status = readl(mlx4_priv(dev)->cmd.hcr + HCR_STATUS_OFFSET);
+	u32 status;
+
+	if (pci_channel_offline(dev->pdev))
+		return -EIO;
+
+	status = readl(mlx4_priv(dev)->cmd.hcr + HCR_STATUS_OFFSET);
 
 	return (status & swab32(1 << HCR_GO_BIT)) ||
 		(mlx4_priv(dev)->cmd.toggle ==
@@ -314,11 +319,29 @@ static int mlx4_cmd_post(struct mlx4_dev *dev, u64 in_param, u64 out_param,
 
 	mutex_lock(&cmd->hcr_mutex);
 
+	if (pci_channel_offline(dev->pdev)) {
+		/*
+		 * Device is going through error recovery
+		 * and cannot accept commands.
+		 */
+		ret = -EIO;
+		goto out;
+	}
+
 	end = jiffies;
 	if (event)
 		end += msecs_to_jiffies(GO_BIT_TIMEOUT_MSECS);
 
 	while (cmd_pending(dev)) {
+		if (pci_channel_offline(dev->pdev)) {
+			/*
+			 * Device is going through error recovery
+			 * and cannot accept commands.
+			 */
+			ret = -EIO;
+			goto out;
+		}
+
 		if (time_after_eq(jiffies, end)) {
 			mlx4_err(dev, "%s:cmd_pending failed\n", __func__);
 			goto out;
@@ -431,14 +454,33 @@ static int mlx4_cmd_poll(struct mlx4_dev *dev, u64 in_param, u64 *out_param,
 
 	down(&priv->cmd.poll_sem);
 
+	if (pci_channel_offline(dev->pdev)) {
+		/*
+		 * Device is going through error recovery
+		 * and cannot accept commands.
+		 */
+		err = -EIO;
+		goto out;
+	}
+
 	err = mlx4_cmd_post(dev, in_param, out_param ? *out_param : 0,
 			    in_modifier, op_modifier, op, CMD_POLL_TOKEN, 0);
 	if (err)
 		goto out;
 
 	end = msecs_to_jiffies(timeout) + jiffies;
-	while (cmd_pending(dev) && time_before(jiffies, end))
+	while (cmd_pending(dev) && time_before(jiffies, end)) {
+		if (pci_channel_offline(dev->pdev)) {
+			/*
+			 * Device is going through error recovery
+			 * and cannot accept commands.
+			 */
+			err = -EIO;
+			goto out;
+		}
+
 		cond_resched();
+	}
 
 	if (cmd_pending(dev)) {
 		err = -ETIMEDOUT;
@@ -532,6 +574,9 @@ int __mlx4_cmd(struct mlx4_dev *dev, u64 in_param, u64 *out_param,
 	       int out_is_imm, u32 in_modifier, u8 op_modifier,
 	       u16 op, unsigned long timeout, int native)
 {
+	if (pci_channel_offline(dev->pdev))
+		return -EIO;
+
 	if (!mlx4_is_mfunc(dev) || (native && mlx4_is_master(dev))) {
 		if (mlx4_priv(dev)->cmd.use_events)
 			return mlx4_cmd_wait(dev, in_param, out_param,
diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
index 4264516..e717091 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -1775,6 +1775,9 @@ static int mlx4_get_ownership(struct mlx4_dev *dev)
 	void __iomem *owner;
 	u32 ret;
 
+	if (pci_channel_offline(dev->pdev))
+		return -EIO;
+
 	owner = ioremap(pci_resource_start(dev->pdev, 0) + MLX4_OWNER_BASE,
 			MLX4_OWNER_SIZE);
 	if (!owner) {
@@ -1791,6 +1794,9 @@ static void mlx4_free_ownership(struct mlx4_dev *dev)
 {
 	void __iomem *owner;
 
+	if (pci_channel_offline(dev->pdev))
+		return;
+
 	owner = ioremap(pci_resource_start(dev->pdev, 0) + MLX4_OWNER_BASE,
 			MLX4_OWNER_SIZE);
 	if (!owner) {
@@ -2237,11 +2243,33 @@ static DEFINE_PCI_DEVICE_TABLE(mlx4_pci_table) = {
 
 MODULE_DEVICE_TABLE(pci, mlx4_pci_table);
 
+static pci_ers_result_t mlx4_pci_err_detected(struct pci_dev *pdev,
+					      pci_channel_state_t state)
+{
+	mlx4_remove_one(pdev);
+
+	return state == pci_channel_io_perm_failure ?
+		PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_NEED_RESET;
+}
+
+static pci_ers_result_t mlx4_pci_slot_reset(struct pci_dev *pdev)
+{
+	int ret = __mlx4_init_one(pdev, NULL);
+
+	return ret ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_RECOVERED;
+}
+
+static struct pci_error_handlers mlx4_err_handler = {
+	.error_detected = mlx4_pci_err_detected,
+	.slot_reset     = mlx4_pci_slot_reset,
+};
+
 static struct pci_driver mlx4_driver = {
 	.name		= DRV_NAME,
 	.id_table	= mlx4_pci_table,
 	.probe		= mlx4_init_one,
-	.remove		= __devexit_p(mlx4_remove_one)
+	.remove		= __devexit_p(mlx4_remove_one),
+	.err_handler    = &mlx4_err_handler,
 };
 
 static int __init mlx4_verify_params(void)
-- 
1.7.1

^ permalink raw reply related

* Re: New commands to configure IOV features
From: Don Dutile @ 2012-07-20 20:15 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Chris Friesen, David Miller, yuvalmin, gregory.v.rose, netdev,
	linux-pci
In-Reply-To: <1342814473.2678.65.camel@bwh-desktop.uk.solarflarecom.com>

On 07/20/2012 04:01 PM, Ben Hutchings wrote:
> On Fri, 2012-07-20 at 13:29 -0600, Chris Friesen wrote:
>> On 07/20/2012 11:42 AM, Ben Hutchings wrote:
>>>
>>> The ethtool API is typically used for net device operations that can be
>>> largely devolved to individual drivers, and which the network stack can
>>> mostly ignore (though offload features are an historical exception to
>>> this).  It started with Ethernet link settings, but many operations are
>>> applicable (and implemented by) other types of network device.
>>
>> That (potentially) accounts for all network devices, but it leaves all
>> the other devices that could export virtual functions.
>>
>> Why should I need to use a different API to enable virtual functions on
>> my network device and my storage controller?
>
> Indeed; I was merely making the point that it would be quite valid to
> use that means for setting VF network parameters for any network device
> that supports IOV.
>
Yes, I read Ben's reply as supporting the proposition of VF enablement
at the PCI level.

>> (And why should "ethtool" or "ip" care that it's a virtual function?)
>
> VFs may be assigned to a guest which is not fully trusted by the
> hypervisor or privileged domain.  (This can sometimes be true for PFs
> too, depending on the capabilities of the hypervisor and guest OS.)
> Some configuration may therefore need to be done via a trusted PF.
>
Correct!  The security domain (for KVM) is the host, thus, the host
assignes VF attributes *before* they are given to the guest.... The guest
is just a consumer, at least that's been my experience with VF devices to date,
but I could see how an improper VF design could allow untrusted/guest
(ethtool/netlink) ops on the VF.

>> What Don and I are suggesting is that the concept of virtual functions
>> is a PCI thing, so it should be dealt with at the PCI layer.  Regardless
>> of the type of device the export of virtual functions is conceptually
>> the same thing, so it should use the same API.
>>
>> Once the device exists, then domain-specific APIs would be used to
>> configure it the same way that they would configure a physical device.
>
> To an extent, but not entirely.
>
> Currently, the assigned MAC address and (optional) VLAN tag for each
> networking VF are configured via the PF net device (though this is done
> though the rtnetlink API rather than ethtool).
Yes, through the PF, which is suppose to remain in the trusted host/hypervisor
domain.  (Do a 'man ip' on RHEL6 and look at 'ip link set'  where it then mentions
the parameter 'vf'.).

>
> Ben.
>

^ permalink raw reply

* Re: [RFC] net: further seperate dst_entry.__refcnt from cache contention
From: Eric Dumazet @ 2012-07-20 20:16 UTC (permalink / raw)
  To: Nathan Zimmer; +Cc: linux-kernel, netdev, Robin.Holt, David S. Miller
In-Reply-To: <1342813587-31601-1-git-send-email-nzimmer@sgi.com>

On Fri, 2012-07-20 at 14:46 -0500, Nathan Zimmer wrote:
> After some investigation on large machines I found that
> dst_entry.__refcnt particpates in false cache sharing issues that show
> when scaling past 12 threads who communicate via tcp with loopback addresses.
> I adjusted refcnt to be on its own cache line and that helped quite a bit.
> But perhaps a bit of a waste of space?  Is there some better way?
> 
> Here is some preliminary data I had gathered.  It shows nicely improved scaling.
> 
> Threads baseline   afterchange
> 2       1328.03    1340.67
> 4       2430.31    2282.09
> 6       3087.65    3258.12
> 8       3560.34    4165.88
> 10      3900.34    4962.28
> 12      3933.38    5613.76
> 14      3876.98    6113.85
> 16      3706.01    6338.00
> 18      3742.48    6634.77
> 20      3670.15    6641.25
> 22      3660.98    6799.55
> 24      3503.13    6613.45
> 26      3525.73    6702.67
> 28      3440.16    6801.27
> 30      3497.59    6911.52
> 32      3498.25    6540.06
> 
> I should say something about this test.  It is a dead simple test in which a
> pair of threads simply pass data to each other.  They were placed in the same
> socket to avoid cross node overhead.
> 
> CC: "David S. Miller" <davem@davemloft.net> 
> Signed-off-by: Nathan Zimmer <nzimmer@sgi.com>
> 
> ---
>  include/net/dst.h |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/include/net/dst.h b/include/net/dst.h
> index 8197ead..3898643 100644
> --- a/include/net/dst.h
> +++ b/include/net/dst.h
> @@ -84,7 +84,7 @@ struct dst_entry {
>  	 * input/output/ops or performance tanks badly
>  	 */
>  	atomic_t		__refcnt;	/* client references	*/
> -	int			__use;
> +	int			__use	____cacheline_aligned;
>  	unsigned long		lastuse;
>  	union {
>  		struct dst_entry	*next;

Its a known problem, and we are waiting IP cache removal to address it.

Before the cache removal, a machine can have million of dst

Another idea concerning very hot dst would be to clone them on demand.

^ permalink raw reply

* Re: [RFC] net: further seperate dst_entry.__refcnt from cache contention
From: David Miller @ 2012-07-20 20:29 UTC (permalink / raw)
  To: eric.dumazet; +Cc: nzimmer, linux-kernel, netdev, Robin.Holt
In-Reply-To: <1342815411.2626.7936.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 20 Jul 2012 22:16:51 +0200

> Another idea concerning very hot dst would be to clone them on demand.

The FIB info cached dsts could also be made per-cpu.

And yes, the only reason I'm entertaining that idea now is the fact
that we'll have the number of dsts in the system under control.

^ permalink raw reply

* Re: [PATCH 14/16] ipv4: Kill rt->rt_oif
From: David Miller @ 2012-07-20 20:31 UTC (permalink / raw)
  To: ja; +Cc: netdev, kaber
In-Reply-To: <alpine.LFD.2.00.1207200219010.5516@ja.ssi.bg>

From: Julian Anastasov <ja@ssi.bg>
Date: Fri, 20 Jul 2012 04:10:12 +0300 (EEST)

>> @@ -1795,7 +1795,6 @@ static struct mr_table *ipmr_rt_fib_lookup(struct net *net, struct sk_buff *skb)
>>  		.daddr = iph->daddr,
>>  		.saddr = iph->saddr,
>>  		.flowi4_tos = RT_TOS(iph->tos),
>> -		.flowi4_oif = rt->rt_oif,
>>  		.flowi4_iif = rt->rt_iif,
>>  		.flowi4_mark = skb->mark,
> 
> 	But it was wrong at first place to use rt_iif
> here. May be we should provide devices to "mrule" just
> like we do for "rule", with the only difference that
> oif now is possible to match outdev instead of preferred
> device, i.e. oif is always set for output routes.
> 
> 	.flowi4_oif = rt_is_output_route(rt) ?
> 			skb->dev->ifindex : 0,
> 	.flowi4_iif = rt_is_output_route(rt) ?
> 			net->loopback_dev->ifindex :
> 			skb->dev->ifindex;

Ok.

> 	Let me know if patch is needed

I'll take care of it myself today, thanks.

^ permalink raw reply

* Re: [ethtool 1/2] ethtool.h: implement new MDI-X set defines
From: Jeff Kirsher @ 2012-07-20 20:50 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: Jesse Brandeburg, netdev, davem
In-Reply-To: <1342797849.2678.19.camel@bwh-desktop.uk.solarflarecom.com>

[-- Attachment #1: Type: text/plain, Size: 840 bytes --]

On Fri, 2012-07-20 at 16:24 +0100, Ben Hutchings wrote:
> On Thu, 2012-07-19 at 23:25 -0700, Jeff Kirsher wrote:
> > From: Jesse Brandeburg <jesse.brandeburg@intel.com>
> > 
> > These changes implement the kernel side of the interface
> > for allowing drivers to set MDI-X state on twisted pair.
> > 
> > Changes implemented as suggested by Ben Hutchings, thanks Ben!
> > 
> > see ethtool patches titled:
> > ethtool: allow setting MDI-X state
> [...]
> 
> I don't see the corresponding kernel changes, but I'll hold onto these
> for a while in the hope that they show up.
> 
> Ben.
> 

I wanted to wait for Dave to pull the patches currently sitting in my
net-next tree before pushing out the corresponding kernel patches.

I will be sending them out here shortly now that Dave has pulled from
me.

Cheers,
Jeff

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* [PATCH] net-next: minor cleanups for bonding documentation
From: Rick Jones @ 2012-07-20 20:51 UTC (permalink / raw)
  To: netdev; +Cc: Jay Vosburgh, Andy Gospodarek

From: Rick Jones <rick.jones2@hp.com>

The section titled "Configuring Bonding for Maximum Throughput" is
actually section twelve not thirteen, and there are a couple of words
spelled incorrectly.

Signed-off-by: Rick Jones <rick.jones2@hp.com>

---

diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt
index bfea8a3..6b1c711 100644
--- a/Documentation/networking/bonding.txt
+++ b/Documentation/networking/bonding.txt
@@ -1210,7 +1210,7 @@ options, you may wish to use the "max_bonds" module parameter,
 documented above.
 
 	To create multiple bonding devices with differing options, it is
-preferrable to use bonding parameters exported by sysfs, documented in the
+preferable to use bonding parameters exported by sysfs, documented in the
 section below.
 
 	For versions of bonding without sysfs support, the only means to
@@ -1950,7 +1950,7 @@ access to fail over to.  Additionally, the bonding load balance modes
 support link monitoring of their members, so if individual links fail,
 the load will be rebalanced across the remaining devices.
 
-	See Section 13, "Configuring Bonding for Maximum Throughput"
+	See Section 12, "Configuring Bonding for Maximum Throughput"
 for information on configuring bonding with one peer device.
 
 11.2 High Availability in a Multiple Switch Topology
@@ -2620,7 +2620,7 @@ be found at:
 
 https://lists.sourceforge.net/lists/listinfo/bonding-devel
 
-	Discussions regarding the developpement of the bonding driver take place
+	Discussions regarding the development of the bonding driver take place
 on the main Linux network mailing list, hosted at vger.kernel.org. The list
 address is:
 

^ permalink raw reply related

* [net-next PATCH v1] net: netprio_cgroup: rework update socket logic
From: John Fastabend @ 2012-07-20 20:39 UTC (permalink / raw)
  To: davem, nhorman; +Cc: netdev, eric.dumazet, gaofeng, lizefan

Instead of updating the sk_cgrp_prioidx struct field on every send
this only updates the field when a task is moved via cgroup
infrastructure.

This allows sockets that may be used by a kernel worker thread
to be managed. For example in the iscsi case today a user can
put iscsid in a netprio cgroup and control traffic will be sent
with the correct sk_cgrp_prioidx value set but as soon as data
is sent the kernel worker thread isssues a send and sk_cgrp_prioidx
is updated with the kernel worker threads value which is the
default case.

It seems more correct to only update the field when the user
explicitly sets it via control group infrastructure. This allows
the users to manage sockets that may be used with other threads.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---

 include/linux/net.h          |    1 +
 include/net/netprio_cgroup.h |    4 ++-
 net/core/netprio_cgroup.c    |   53 ++++++++++++++++++++++++++++++++++++++++++
 net/core/sock.c              |    6 ++---
 net/socket.c                 |    5 ++--
 5 files changed, 61 insertions(+), 8 deletions(-)

diff --git a/include/linux/net.h b/include/linux/net.h
index dc95700..99276c3 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -248,6 +248,7 @@ extern int	     sock_recvmsg(struct socket *sock, struct msghdr *msg,
 				  size_t size, int flags);
 extern int 	     sock_map_fd(struct socket *sock, int flags);
 extern struct socket *sockfd_lookup(int fd, int *err);
+extern struct socket *sock_from_file(struct file *file, int *err);
 #define		     sockfd_put(sock) fput(sock->file)
 extern int	     net_ratelimit(void);
 
diff --git a/include/net/netprio_cgroup.h b/include/net/netprio_cgroup.h
index d58fdec..2719dec 100644
--- a/include/net/netprio_cgroup.h
+++ b/include/net/netprio_cgroup.h
@@ -35,7 +35,7 @@ struct cgroup_netprio_state {
 extern int net_prio_subsys_id;
 #endif
 
-extern void sock_update_netprioidx(struct sock *sk);
+extern void sock_update_netprioidx(struct sock *sk, struct task_struct *task);
 
 #if IS_BUILTIN(CONFIG_NETPRIO_CGROUP)
 
@@ -82,7 +82,7 @@ static inline u32 task_netprioidx(struct task_struct *p)
 #endif /* CONFIG_NETPRIO_CGROUP */
 
 #else
-#define sock_update_netprioidx(sk)
+#define sock_update_netprioidx(sk, task)
 #endif
 
 #endif  /* _NET_CLS_CGROUP_H */
diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
index b2e9caa..63d15e8 100644
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -25,6 +25,8 @@
 #include <net/sock.h>
 #include <net/netprio_cgroup.h>
 
+#include <linux/fdtable.h>
+
 #define PRIOIDX_SZ 128
 
 static unsigned long prioidx_map[PRIOIDX_SZ];
@@ -272,6 +274,56 @@ out_free_devname:
 	return ret;
 }
 
+void net_prio_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
+{
+	struct task_struct *p;
+	char *tmp = kzalloc(sizeof(char) * PATH_MAX, GFP_KERNEL);
+
+	if (!tmp) {
+		pr_warn("Unable to attach cgrp due to alloc failure!\n");
+		return;
+	}
+
+	cgroup_taskset_for_each(p, cgrp, tset) {
+		unsigned int fd;
+		struct fdtable *fdt;
+		struct files_struct *files;
+
+		task_lock(p);
+		files = p->files;
+		if (!files) {
+			task_unlock(p);
+			continue;
+		}
+
+		rcu_read_lock();
+		fdt = files_fdtable(files);
+		for (fd = 0; fd < fdt->max_fds; fd++) {
+			char *path;
+			struct file *file;
+			struct socket *sock;
+			unsigned long s;
+			int rv, err = 0;
+
+			file = fcheck_files(files, fd);
+			if (!file)
+				continue;
+
+			path = d_path(&file->f_path, tmp, PAGE_SIZE);
+			rv = sscanf(path, "socket:[%lu]", &s);
+			if (rv <= 0)
+				continue;
+
+			sock = sock_from_file(file, &err);
+			if (!err)
+				sock_update_netprioidx(sock->sk, p);
+		}
+		rcu_read_unlock();
+		task_unlock(p);
+	}
+	kfree(tmp);
+}
+
 static struct cftype ss_files[] = {
 	{
 		.name = "prioidx",
@@ -289,6 +341,7 @@ struct cgroup_subsys net_prio_subsys = {
 	.name		= "net_prio",
 	.create		= cgrp_create,
 	.destroy	= cgrp_destroy,
+	.attach		= net_prio_attach,
 #ifdef CONFIG_NETPRIO_CGROUP
 	.subsys_id	= net_prio_subsys_id,
 #endif
diff --git a/net/core/sock.c b/net/core/sock.c
index 24039ac..2676a88 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1180,12 +1180,12 @@ void sock_update_classid(struct sock *sk)
 }
 EXPORT_SYMBOL(sock_update_classid);
 
-void sock_update_netprioidx(struct sock *sk)
+void sock_update_netprioidx(struct sock *sk, struct task_struct *task)
 {
 	if (in_interrupt())
 		return;
 
-	sk->sk_cgrp_prioidx = task_netprioidx(current);
+	sk->sk_cgrp_prioidx = task_netprioidx(task);
 }
 EXPORT_SYMBOL_GPL(sock_update_netprioidx);
 #endif
@@ -1215,7 +1215,7 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
 		atomic_set(&sk->sk_wmem_alloc, 1);
 
 		sock_update_classid(sk);
-		sock_update_netprioidx(sk);
+		sock_update_netprioidx(sk, current);
 	}
 
 	return sk;
diff --git a/net/socket.c b/net/socket.c
index 0452dca..dfe5b66 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -398,7 +398,7 @@ int sock_map_fd(struct socket *sock, int flags)
 }
 EXPORT_SYMBOL(sock_map_fd);
 
-static struct socket *sock_from_file(struct file *file, int *err)
+struct socket *sock_from_file(struct file *file, int *err)
 {
 	if (file->f_op == &socket_file_ops)
 		return file->private_data;	/* set in sock_map_fd */
@@ -406,6 +406,7 @@ static struct socket *sock_from_file(struct file *file, int *err)
 	*err = -ENOTSOCK;
 	return NULL;
 }
+EXPORT_SYMBOL(sock_from_file);
 
 /**
  *	sockfd_lookup - Go from a file number to its socket slot
@@ -554,8 +555,6 @@ static inline int __sock_sendmsg_nosec(struct kiocb *iocb, struct socket *sock,
 
 	sock_update_classid(sock->sk);
 
-	sock_update_netprioidx(sock->sk);
-
 	si->sock = sock;
 	si->scm = NULL;
 	si->msg = msg;

^ permalink raw reply related

* Re: [RFC PATCH] net: Add support for virtual machine device queues (VMDQ)
From: John Fastabend @ 2012-07-20 20:58 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Jiri Pirko, or.gerlitz, davem, roland, netdev, ali, sean.hefty,
	shlomop, Ronciak, John
In-Reply-To: <1342807280.2678.33.camel@bwh-desktop.uk.solarflarecom.com>

On 7/20/2012 11:01 AM, Ben Hutchings wrote:
> On Fri, 2012-07-20 at 09:30 -0700, John Fastabend wrote:
>> On 7/18/2012 11:42 PM, Jiri Pirko wrote:
>>> Thu, Jul 19, 2012 at 12:05:44AM CEST, john.r.fastabend@intel.com wrote:
>>>> This adds support to allow virtual net devices to be created. These
>>>> devices can be managed independtly of the physical function but
>>>> use the same physical link.
>>
>> [...]
>>
>>>> +
>>>> +size_t vmdq_getpriv_size(struct net *src_net, struct nlattr *tb[])
>>>> +{
>>>> +	struct net_device *lowerdev;
>>>> +
>>>> +	if (!tb[IFLA_LINK])
>>>> +		return -EINVAL;
>>>> +
>>>> +	lowerdev = __dev_get_by_index(src_net, nla_get_u32(tb[IFLA_LINK]));
>>>> +	if (!lowerdev)
>>>> +		return -ENODEV;
>>>> +
>>>> +	return sizeof(netdev_priv(lowerdev));
>>>> +}
>>>
>>> Why exactly do you need to have the priv of same size as lowerdev? I do
>>> not see you use that anywhere...
>>>
>>
>> When we add a child device the hardware/sw may have some private data
>> it needs to manage this device.
>>
>> I made an assumption here that the priv space for child devices is the
>> same as the lowerdev but this might be a bad assumption.
>
> The code assumes that it is the size of a single pointer...
>
> Ben.
>

Right I'll fix it. Worked for me because my local unfinished
driver implementation only stored a single pointer. Thanks Ben.

^ permalink raw reply

* Re: [RFC] r8169 : why SG / TX checksum are default disabled
From: Francois Romieu @ 2012-07-20 21:01 UTC (permalink / raw)
  To: hayeswang; +Cc: 'David Miller', eric.dumazet, netdev
In-Reply-To: <A4FD513AF09D40849B8A75C371F1D431@realtek.com.tw>

hayeswang <hayeswang@realtek.com> :
[...]
> If the hw only fills in the checksum fields of IP header, UDP header, and TCP
> header, the patch would work. However, the hw would also fill in the total
> length field of IP header, so it causes problems.

Thanks, I missed this part. The patch below should be getting better now.

Btw, would there be any point in setting the TcpHeaderOffset field for
the post-8168c chipsets ?

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index be4e00f..2420af6 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -5740,7 +5740,7 @@ err_out:
 	return -EIO;
 }
 
-static inline void rtl8169_tso_csum(struct rtl8169_private *tp,
+static inline bool rtl8169_tso_csum(struct rtl8169_private *tp,
 				    struct sk_buff *skb, u32 *opts)
 {
 	const struct rtl_tx_desc_info *info = tx_desc_info + tp->txd_version;
@@ -5753,6 +5753,15 @@ static inline void rtl8169_tso_csum(struct rtl8169_private *tp,
 	} else if (skb->ip_summed == CHECKSUM_PARTIAL) {
 		const struct iphdr *ip = ip_hdr(skb);
 
+		if (unlikely(skb->len < ETH_ZLEN &&
+		    (tp->mac_version == RTL_GIGA_MAC_VER_34))) {
+			if (skb_padto(skb, ETH_ZLEN))
+				return false;
+			skb_checksum_help(skb);
+			skb_put(skb, ETH_ZLEN - skb->len);
+			return true;
+		}
+
 		if (ip->protocol == IPPROTO_TCP)
 			opts[offset] |= info->checksum.tcp;
 		else if (ip->protocol == IPPROTO_UDP)
@@ -5760,6 +5769,7 @@ static inline void rtl8169_tso_csum(struct rtl8169_private *tp,
 		else
 			WARN_ON_ONCE(1);
 	}
+	return true;
 }
 
 static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
@@ -5783,25 +5793,26 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
 	if (unlikely(le32_to_cpu(txd->opts1) & DescOwn))
 		goto err_stop_0;
 
+	opts[1] = cpu_to_le32(rtl8169_tx_vlan_tag(tp, skb));
+	opts[0] = DescOwn;
+
+	if (!rtl8169_tso_csum(tp, skb, opts))
+		goto err_update_stats_0;
+
 	len = skb_headlen(skb);
 	mapping = dma_map_single(d, skb->data, len, DMA_TO_DEVICE);
 	if (unlikely(dma_mapping_error(d, mapping))) {
 		if (net_ratelimit())
 			netif_err(tp, drv, dev, "Failed to map TX DMA!\n");
-		goto err_dma_0;
+		goto err_free_skb_1;
 	}
 
 	tp->tx_skb[entry].len = len;
 	txd->addr = cpu_to_le64(mapping);
 
-	opts[1] = cpu_to_le32(rtl8169_tx_vlan_tag(tp, skb));
-	opts[0] = DescOwn;
-
-	rtl8169_tso_csum(tp, skb, opts);
-
 	frags = rtl8169_xmit_frags(tp, skb, opts);
 	if (frags < 0)
-		goto err_dma_1;
+		goto err_unmap_2;
 	else if (frags)
 		opts[0] |= FirstFrag;
 	else {
@@ -5849,10 +5860,11 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
 
 	return NETDEV_TX_OK;
 
-err_dma_1:
+err_unmap_2:
 	rtl8169_unmap_tx_skb(d, tp->tx_skb + entry, txd);
-err_dma_0:
+err_free_skb_1:
 	dev_kfree_skb(skb);
+err_update_stats_0:
 	dev->stats.tx_dropped++;
 	return NETDEV_TX_OK;
 

^ permalink raw reply related

* Re: [PATCH] net-next: minor cleanups for bonding documentation
From: Nicolas de Pesloüan @ 2012-07-20 21:25 UTC (permalink / raw)
  To: Rick Jones; +Cc: netdev, Jay Vosburgh, Andy Gospodarek
In-Reply-To: <20120720205137.84E3D290043B@tardy>

On 20/07/2012 22:51, Rick Jones wrote:

> -	Discussions regarding the developpement of the bonding driver take place
> +	Discussions regarding the development of the bonding driver take place

Thanks for catching my typo :-)

Reviewed-by: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox