Netdev List


* [PATCH 0/3] net: Byte queue limit patch series
From: Tom Herbert @ 2011-04-26  4:38 UTC (permalink / raw)
  To: davem, netdev

This patch series implements byte queue limits (bql) for NIC TX queues.

Byte queue limits are a mechanism to limit the size of the transmit
hardware queue on a NIC by number of bytes. The goal of these byte
limits is to reduce latency caused by excessive queuing in hardware
without sacrificing throughput.

Hardware queuing limits are typically specified in terms of a number
of hardware descriptors, each of which has a variable size. The
variability of the size of individual queued items can have a very wide
range. For instance with the e1000 NIC the size could range from 64
bytes to 4K (with TSO enabled). This variability makes it next to
impossible to choose a single queue limit that prevents starvation and
provides the lowest possible latency.
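
(Illustration, not part of the posted series.) A rough sketch of why a
descriptor count is a poor proxy for bytes in flight. The 64-byte and
4K per-descriptor sizes are the e1000 figures quoted above; the ring
size used in the note below is an assumed value, and ring_bytes() is a
hypothetical helper.

```c
/* Bytes held by a completely full TX ring when every descriptor
 * carries bytes_per_desc bytes.  Purely arithmetic; no driver state. */
static unsigned long ring_bytes(unsigned long descriptors,
				unsigned long bytes_per_desc)
{
	return descriptors * bytes_per_desc;
}

/* Ratio between the largest and smallest amount of data the same
 * descriptor limit can represent. */
static unsigned long ring_spread(unsigned long min_bytes_per_desc,
				 unsigned long max_bytes_per_desc)
{
	return max_bytes_per_desc / min_bytes_per_desc;
}
```

With an assumed ring of 256 descriptors, ring_bytes(256, 64) is 16384
bytes and ring_bytes(256, 4096) is 1048576 bytes, so the same
descriptor limit spans a 64x range of queued data.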

The objective of byte queue limits is to set the limit to be the
minimum needed to prevent starvation between successive transmissions to
the hardware. The latency between two transmissions can be variable in a
system. It is dependent on interrupt frequency, NAPI polling latencies,
scheduling of the queuing discipline, lock contention, etc. Therefore we
propose that byte queue limits should be dynamic and change in
accordance with the networking stack latencies a system encounters.

Patches to implement this:
Patch 1: Dynamic queue limits (dql) library.  This provides the general
queuing algorithm.
Patch 2: netdev changes that use dql to support byte queue limits.
Patch 3: Support in forcedeth driver for byte queue limits.

The effects of BQL are demonstrated in the benchmark results below.
These were made running 200 streams of netperf RR tests:

140000 rr size
BQL: 80-215K bytes in queue, 856 tps, 3.26% cpu
No BQL: 2700-2930K bytes in queue, 854 tps, 3.71% cpu

14000 rr size
BQL: 25-55K bytes in queue, 8500 tps
No BQL: 1500-1622K bytes in queue, 8523 tps, 4.53% cpu

1400 rr size
BQL: 20-38K bytes in queue, 86582 tps, 7.38% cpu
No BQL: 29-117K bytes in queue, 85738 tps, 7.67% cpu

140 rr size
BQL: 1-10K bytes in queue, 320540 tps, 34.6% cpu
No BQL: 1-13K bytes in queue, 323158 tps, 37.16% cpu

1 rr size
BQL: 0-3K bytes in queue, 338811 tps, 41.41% cpu
No BQL: 0-3K bytes in queue, 339947 tps, 42.36% cpu

The amount of queuing in the NIC is reduced by up to 90%, and I haven't
yet seen a consistent negative impact in terms of throughput or
CPU utilization.

^ permalink raw reply

* [PATCH 1/3] dql: Dynamic queue limits
From: Tom Herbert @ 2011-04-26  4:38 UTC (permalink / raw)
  To: davem, netdev

Implementation of dynamic queue limits (dql).  This is a library which
allows a queue limit to be dynamically managed.  The goal of dql is
to keep the queue limit (the number of objects that may be queued) as
low as possible without allowing the queue to be starved.

dql would be used with a queue whose use has these properties:

1) Objects are queued up to some limit which can be expressed as a
   count of objects.
2) Periodically a completion process executes which retires consumed
   objects.
3) Starvation occurs when limit has been reached, all queued data has
   actually been consumed but completion processing has not yet run,
   so queuing new data is blocked.
4) Minimizing the amount of queued data is desirable.

A canonical example of such a queue would be a NIC HW transmit queue.

The queue limit is dynamic; it will increase or decrease over time
depending on the workload.  The queue limit is recalculated each time
completion processing is done.  Increases occur when the queue is
starved and can exponentially increase over successive intervals.
Decreases occur when more data is being maintained in the queue than
needed to prevent starvation.  The number of extra objects, or "slack",
is measured over successive intervals, and to avoid hysteresis the
limit is only reduced by the minimum slack seen over a configurable
time period.
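
(Illustration, not part of the posted series.) A userspace sketch of
the increase/decrease rule described above, modeled on dql_completed()
in this patch. The dqlm_* names and the explicit "now" argument (a
stand-in for jiffies) are assumptions made so the logic can be run
outside the kernel; the BUG_ON checks are omitted.

```c
#include <limits.h>

#define POSDIFF(A, B) ((A) > (B) ? (A) - (B) : 0)

/* Userspace model of struct dql; field names follow the patch. */
struct dql_model {
	unsigned long limit, prev_ovlimit;
	unsigned long num_queued, prev_num_queued, num_completed;
	unsigned long last_obj_cnt, prev_last_obj_cnt;
	unsigned long lowest_slack, slack_start_time;
	unsigned long max_limit, min_limit, slack_hold_time;
};

static void dqlm_init(struct dql_model *d, unsigned long hold_time,
		      unsigned long now)
{
	*d = (struct dql_model){ 0 };
	d->max_limit = ULONG_MAX / 2;
	d->lowest_slack = ULONG_MAX;
	d->slack_hold_time = hold_time;
	d->slack_start_time = now;
}

/* Record objects queued. */
static void dqlm_queued(struct dql_model *d, unsigned long count)
{
	d->num_queued += count;
	d->last_obj_cnt = count;
}

/* Room left under the limit; negative means over limit. */
static long dqlm_avail(const struct dql_model *d)
{
	return (long)(d->limit - (d->num_queued - d->num_completed));
}

/* Record completions and recalculate the limit. */
static void dqlm_completed(struct dql_model *d, unsigned long count,
			   unsigned long now)
{
	unsigned long completed = d->num_completed + count;
	unsigned long limit = d->limit;
	unsigned long ovlimit =
		POSDIFF(d->num_queued - d->num_completed, limit);
	unsigned long inprogress = d->num_queued - completed;
	unsigned long prev_inprogress =
		d->prev_num_queued - d->num_completed;
	unsigned long all_prev_completed =
		POSDIFF(completed, d->prev_num_queued);

	if ((ovlimit && !inprogress) ||
	    (d->prev_ovlimit && all_prev_completed)) {
		/* Starved: grow by what was sent and completed in this
		 * interval plus any previous over-limit. */
		limit += all_prev_completed + d->prev_ovlimit;
		d->slack_start_time = now;
		d->lowest_slack = ULONG_MAX;
	} else if (inprogress && prev_inprogress && !all_prev_completed) {
		/* Busy for the whole interval: track the lowest slack. */
		unsigned long slack =
			POSDIFF(limit + d->prev_ovlimit, 2 * count);
		unsigned long slack_last = d->prev_ovlimit ?
			POSDIFF(d->prev_last_obj_cnt, d->prev_ovlimit) : 0;

		if (slack_last > slack)
			slack = slack_last;
		if (slack < d->lowest_slack)
			d->lowest_slack = slack;
		/* The kernel code uses time_after(), which tolerates
		 * jiffies wraparound; this plain comparison does not. */
		if (now > d->slack_start_time + d->slack_hold_time) {
			limit = POSDIFF(limit, d->lowest_slack);
			d->slack_start_time = now;
			d->lowest_slack = ULONG_MAX;
		}
	}

	if (limit < d->min_limit)
		limit = d->min_limit;
	if (limit > d->max_limit)
		limit = d->max_limit;
	if (limit != d->limit) {
		d->limit = limit;
		ovlimit = 0;
	}
	d->prev_ovlimit = ovlimit;
	d->prev_last_obj_cnt = d->last_obj_cnt;
	d->num_completed = completed;
	d->prev_num_queued = d->num_queued;
}
```

Starting from a limit of 0, queuing 1500 and then completing 1500
counts as starvation and raises the limit to 1500.  With a large limit
and a queue that stays busy across two completions, the limit drops by
the lowest slack once the hold time has passed.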

The dql API provides routines to manage the queue:
- dql_init is called to initialize the dql structure
- dql_reset is called to reset dynamic structures
- dql_queued is called when objects are being enqueued
- dql_avail returns availability in the queue
- dql_completed is called when objects have been consumed in the queue

Configuration consists of:
- max_limit, maximum limit
- min_limit, minimum limit
- slack_hold_time, time to measure instances of slack before reducing
  queue limit.

Signed-off-by: Tom Herbert <therbert@google.com>
---
 include/linux/dynamic_queue_limits.h |   80 ++++++++++++++++++++
 lib/Makefile                         |    3 +-
 lib/dynamic_queue_limits.c           |  132 ++++++++++++++++++++++++++++++++++
 3 files changed, 214 insertions(+), 1 deletions(-)
 create mode 100644 include/linux/dynamic_queue_limits.h
 create mode 100644 lib/dynamic_queue_limits.c

diff --git a/include/linux/dynamic_queue_limits.h b/include/linux/dynamic_queue_limits.h
new file mode 100644
index 0000000..3ffc591
--- /dev/null
+++ b/include/linux/dynamic_queue_limits.h
@@ -0,0 +1,80 @@
+/*
+ * Dynamic queue limits (dql) - Definitions
+ *
+ * Author: Tom Herbert (therbert@google.com)
+ *
+ * This header file contains the definitions for dynamic queue limits (dql).
+ * dql would be used in conjunction with a producer/consumer type queue
+ * (possibly a HW queue).  Such a queue would have these general properties:
+ *
+ *   1) Objects are queued up to some limit.
+ *   2) Periodically a completion process executes which retires consumed
+ *      objects.
+ *   3) Starvation occurs when limit has been reached, all queued data has
+ *      actually been consumed but completion processing has not yet run
+ *      so queuing new data is blocked.
+ *   4) Minimizing the amount of queued data is desirable.
+ *
+ * The goal of dql is to calculate the limit as the minimum number of objects
+ * needed to prevent starvation.
+ *
+ * The dql implementation does not implement any locking for the dql data
+ * structures; the higher layer should provide this.
+ */
+
+#ifndef _LINUX_DQL_H
+#define _LINUX_DQL_H
+
+#ifdef __KERNEL__
+
+struct dql {
+	unsigned long	limit;			/* Current limit */
+	unsigned long	prev_ovlimit;		/* Previous over limit */
+
+	unsigned long	num_queued;		/* Total ever queued */
+	unsigned long	prev_num_queued;	/* Previous queue total */
+	unsigned long	num_completed;		/* Total ever completed */
+
+	unsigned long	last_obj_cnt;		/* Count at last queuing */
+	unsigned long	prev_last_obj_cnt;	/* Previous queuing cnt */
+
+	unsigned long	lowest_slack;		/* Lowest slack found */
+	unsigned long	slack_start_time;	/* Time slacks seen */
+
+	unsigned long	max_limit;		/* Maximum limit */
+	unsigned long	min_limit;		/* Minimum limit */
+	unsigned	slack_hold_time;	/* Time to measure slack */
+};
+
+/* Set some static maximums */
+#define	DQL_MAX_OBJECT (-1UL / 16)
+#define	DQL_MAX_LIMIT ((-1UL / 2) - DQL_MAX_OBJECT)
+
+/* Record number of objects queued. */
+static inline void dql_queued(struct dql *dql, unsigned long count)
+{
+	BUG_ON(count > DQL_MAX_OBJECT);
+	BUG_ON(dql->num_queued - dql->num_completed > DQL_MAX_LIMIT);
+
+	dql->num_queued += count;
+	dql->last_obj_cnt = count;
+}
+
+/* Returns how many objects can be queued, < 0 indicates over limit.  */
+static inline long dql_avail(struct dql *dql)
+{
+	return dql->limit - (dql->num_queued - dql->num_completed);
+}
+
+/* Record number of completed objects and recalculate the limit. */
+extern void dql_completed(struct dql *dql, unsigned long count);
+
+/* Reset dql state */
+extern void dql_reset(struct dql *dql);
+
+/* Initialize dql state */
+extern int dql_init(struct dql *dql, unsigned hold_time);
+
+#endif /* __KERNEL__ */
+
+#endif /* _LINUX_DQL_H */
diff --git a/lib/Makefile b/lib/Makefile
index ef0f285..47b3605 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -21,7 +21,8 @@ lib-y	+= kobject.o kref.o klist.o
 
 obj-y += bcd.o div64.o sort.o parser.o halfmd4.o debug_locks.o random32.o \
 	 bust_spinlocks.o hexdump.o kasprintf.o bitmap.o scatterlist.o \
-	 string_helpers.o gcd.o lcm.o list_sort.o uuid.o flex_array.o
+	 string_helpers.o gcd.o lcm.o list_sort.o uuid.o flex_array.o \
+	 dynamic_queue_limits.o
 obj-y += kstrtox.o
 obj-$(CONFIG_TEST_KSTRTOX) += test-kstrtox.o
 
diff --git a/lib/dynamic_queue_limits.c b/lib/dynamic_queue_limits.c
new file mode 100644
index 0000000..6a1f5b9
--- /dev/null
+++ b/lib/dynamic_queue_limits.c
@@ -0,0 +1,132 @@
+/*
+ * Dynamic byte queue limits.  See include/linux/dynamic_queue_limits.h
+ *
+ * Author: Tom Herbert (therbert@google.com)
+ */
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/ctype.h>
+#include <linux/kernel.h>
+#include <linux/dynamic_queue_limits.h>
+
+#define POSDIFF(A, B) ((A) > (B) ? (A) - (B) : 0)
+
+/* Records completed count and recalculates the queue limit */
+void dql_completed(struct dql *dql, unsigned long count)
+{
+	unsigned long inprogress, prev_inprogress, limit;
+	unsigned long ovlimit, all_prev_completed, completed;
+
+	/* Can't complete more than what's in queue */
+	BUG_ON(count > dql->num_queued - dql->num_completed);
+
+	completed = dql->num_completed + count;
+	limit = dql->limit;
+	ovlimit = POSDIFF(dql->num_queued - dql->num_completed, limit);
+	inprogress = dql->num_queued - completed;
+	prev_inprogress = dql->prev_num_queued - dql->num_completed;
+	all_prev_completed = POSDIFF(completed, dql->prev_num_queued);
+
+	if ((ovlimit && !inprogress) ||
+	    (dql->prev_ovlimit && all_prev_completed)) {
+		/*
+		 * Queue considered starved if:
+		 *   - The queue was over-limit in the last interval,
+		 *     and there is no more data in the queue.
+		 *  OR
+		 *   - The queue was over-limit in the previous interval and
+		 *     when enqueuing it was possible that all queued data
+		 *     had been consumed.  This covers the case when queue
+		 *     may have become starved between completion processing
+		 *     running and next time enqueue was scheduled.
+		 *
+		 *     When queue is starved increase the limit by the amount
+		 *     of bytes both sent and completed in the last interval,
+		 *     plus any previous over-limit.
+		 */
+		limit += POSDIFF(completed, dql->prev_num_queued) +
+		     dql->prev_ovlimit;
+		dql->slack_start_time = jiffies;
+		dql->lowest_slack = -1UL;
+	} else if (inprogress && prev_inprogress && !all_prev_completed) {
+		/*
+		 * Queue was not starved, check if the limit can be decreased.
+		 * A decrease is only considered if the queue has been busy in
+		 * the whole interval (the check above).
+		 *
+		 * If there is slack, the amount of excess data queued above the
+		 * amount needed to prevent starvation, the queue limit can
+		 * be decreased.  To avoid hysteresis we consider the
+		 * minimum amount of slack found over several iterations of the
+		 * completion routine.
+		 */
+		unsigned long slack, slack_last_objs;
+
+		/*
+		 * Slack is the maximum of
+		 *   - The queue limit plus previous over-limit minus twice
+		 *     the number of objects completed.  Note that two times
+		 *     number of completed bytes is basis for upper bound
+		 *     of the limit.
+		 *   - Portion of objects in the last queuing operation that
+		 *     was not part of non-zero previous over-limit.  That is
+		 *     "round down" by non-overlimit portion of the last
+		 *     queueing operation.
+		 */
+		slack = POSDIFF(limit + dql->prev_ovlimit,
+		    2 * (completed - dql->num_completed));
+		slack_last_objs = dql->prev_ovlimit ?
+		    POSDIFF(dql->prev_last_obj_cnt, dql->prev_ovlimit) : 0;
+
+		slack = max(slack, slack_last_objs);
+
+		if (slack < dql->lowest_slack)
+			dql->lowest_slack = slack;
+
+		if (time_after(jiffies,
+			       dql->slack_start_time + dql->slack_hold_time)) {
+			limit = POSDIFF(limit, dql->lowest_slack);
+			dql->slack_start_time = jiffies;
+			dql->lowest_slack = -1UL;
+		}
+	}
+
+	/* Enforce bounds on limit */
+	limit = clamp(limit, dql->min_limit, dql->max_limit);
+
+	if (limit != dql->limit) {
+		dql->limit = limit;
+		ovlimit = 0;
+	}
+
+	dql->prev_ovlimit = ovlimit;
+	dql->prev_last_obj_cnt = dql->last_obj_cnt;
+	dql->num_completed = completed;
+	dql->prev_num_queued = dql->num_queued;
+}
+EXPORT_SYMBOL(dql_completed);
+
+void dql_reset(struct dql *dql)
+{
+	/* Reset all dynamic values */
+	dql->limit = 0;
+	dql->num_queued = 0;
+	dql->num_completed = 0;
+	dql->last_obj_cnt = 0;
+	dql->prev_num_queued = 0;
+	dql->prev_last_obj_cnt = 0;
+	dql->prev_ovlimit = 0;
+	dql->lowest_slack = -1UL;
+	dql->slack_start_time = jiffies;
+}
+EXPORT_SYMBOL(dql_reset);
+
+int dql_init(struct dql *dql, unsigned hold_time)
+{
+	dql->max_limit = DQL_MAX_LIMIT;
+	dql->min_limit = 0;
+	dql->slack_hold_time = hold_time;
+	dql_reset(dql);
+	return 0;
+}
+EXPORT_SYMBOL(dql_init);
-- 
1.7.3.1


^ permalink raw reply related

* [PATCH 2/3] bql: Byte queue limits
From: Tom Herbert @ 2011-04-26  4:38 UTC (permalink / raw)
  To: davem, netdev

Networking stack support for byte queue limits, using the dynamic queue
limits library.  Byte queue limits are maintained per transmit queue,
and a dql structure has been added to the netdev_queue structure for
this purpose.

Configuration of bql is in the tx-<n> sysfs directory for the queue
under the byte_queue_limits directory.  Configuration includes:
limit_min, bql minimum limit
limit_max, bql maximum limit
hold_time, bql slack hold time

Also under the directory are:
limit, current byte limit
inflight, current number of bytes on the queue
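
(Illustration, not part of the posted series.) A small userspace
sketch for locating and reading these attributes. It assumes the
per-queue kobjects appear under /sys/class/net/<dev>/queues/, and the
bql_attr_path() and bql_read() helpers are hypothetical names, not
part of any existing tool.

```c
#include <stdio.h>

/* Build the path of one byte_queue_limits attribute for TX queue txq
 * of interface ifname.  Returns 0 on success, -1 if buf is too small.
 * The /sys/class/net/<dev>/queues/ prefix is an assumption. */
static int bql_attr_path(char *buf, size_t len, const char *ifname,
			 unsigned int txq, const char *attr)
{
	int n = snprintf(buf, len,
			 "/sys/class/net/%s/queues/tx-%u/byte_queue_limits/%s",
			 ifname, txq, attr);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Read an unsigned value (limit, limit_min, limit_max, hold_time or
 * inflight) from a BQL attribute.  Returns 0 on success, -1 on error. */
static int bql_read(const char *ifname, unsigned int txq,
		    const char *attr, unsigned long *val)
{
	char path[256];
	FILE *f;
	int ok;

	if (bql_attr_path(path, sizeof(path), ifname, txq, attr))
		return -1;
	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = fscanf(f, "%lu", val) == 1;
	fclose(f);
	return ok ? 0 : -1;
}
```

For example, bql_read("eth0", 0, "inflight", &v) would fetch the bytes
currently in flight on tx-0, provided the device exists and the queue
has BQL enabled.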

Signed-off-by: Tom Herbert <therbert@google.com>
---
 include/linux/netdevice.h |   46 +++++++++++++++-
 net/core/net-sysfs.c      |  137 +++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 177 insertions(+), 6 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index cb8178a..0a76b88 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -44,6 +44,7 @@
 #include <linux/rculist.h>
 #include <linux/dmaengine.h>
 #include <linux/workqueue.h>
+#include <linux/dynamic_queue_limits.h>
 
 #include <linux/ethtool.h>
 #include <net/net_namespace.h>
@@ -556,8 +557,10 @@ struct netdev_queue {
 	struct Qdisc		*qdisc;
 	unsigned long		state;
 	struct Qdisc		*qdisc_sleeping;
-#ifdef CONFIG_RPS
+#ifdef CONFIG_XPS
 	struct kobject		kobj;
+	bool			do_bql;
+	struct dql		dql;
 #endif
 #if defined(CONFIG_XPS) && defined(CONFIG_NUMA)
 	int			numa_node;
@@ -589,6 +592,47 @@ static inline void netdev_queue_numa_node_write(struct netdev_queue *q, int node
 #endif
 }
 
+/*
+ * Definitions for byte queue limits for TX queue.
+ */
+static inline void netdev_queue_bql_init(struct netdev_queue *q)
+{
+#ifdef CONFIG_XPS
+	dql_init(&q->dql, 1000);
+	q->do_bql = true;
+#endif
+}
+
+static inline void netdev_queue_bql_reset(struct netdev_queue *q)
+{
+#ifdef CONFIG_XPS
+	dql_reset(&q->dql);
+#endif
+}
+
+static inline bool netdev_queue_bytes_avail(struct netdev_queue *q)
+{
+#ifdef CONFIG_XPS
+	return dql_avail(&q->dql) >= 0;
+#endif
+}
+
+static inline void netdev_queue_bytes_sent(struct netdev_queue *q,
+					   unsigned count)
+{
+#ifdef CONFIG_XPS
+	dql_queued(&q->dql, count);
+#endif
+}
+
+static inline void netdev_queue_bytes_completed(struct netdev_queue *q,
+						unsigned count)
+{
+#ifdef CONFIG_XPS
+	dql_completed(&q->dql, count);
+#endif
+}
+
 #ifdef CONFIG_RPS
 /*
  * This structure holds an RPS map which can be of variable length.  The
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 5ceb257..5f29b8e 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -20,6 +20,7 @@
 #include <linux/rtnetlink.h>
 #include <linux/wireless.h>
 #include <linux/vmalloc.h>
+#include <linux/jiffies.h>
 #include <net/wext.h>
 
 #include "net-sysfs.h"
@@ -852,6 +853,119 @@ static inline unsigned int get_netdev_queue_index(struct netdev_queue *queue)
 	return i;
 }
 
+static ssize_t bql_show(char *buf, unsigned long value)
+{
+	int p = 0;
+
+	p = sprintf(buf, "%lu\n", value);
+	return p;
+}
+
+static ssize_t bql_set(const char *buf, const size_t count,
+		       unsigned long *pvalue)
+{
+	unsigned long value;
+	int err;
+
+	if (!strcmp(buf, "max") || !strcmp(buf, "max\n"))
+		value = DQL_MAX_LIMIT;
+	else {
+		err = kstrtoul(buf, 10, &value);
+		if (err < 0)
+			return err;
+		if (value > DQL_MAX_LIMIT)
+			return -EINVAL;
+	}
+
+	*pvalue = value;
+
+	return count;
+}
+
+static ssize_t bql_show_hold_time(struct netdev_queue *queue,
+				  struct netdev_queue_attribute *attr,
+				  char *buf)
+{
+	struct dql *dql = &queue->dql;
+	int p = 0;
+
+	p = sprintf(buf, "%u\n", jiffies_to_msecs(dql->slack_hold_time));
+
+	return p;
+}
+
+static ssize_t bql_set_hold_time(struct netdev_queue *queue,
+				 struct netdev_queue_attribute *attribute,
+				 const char *buf, size_t len)
+{
+	struct dql *dql = &queue->dql;
+	unsigned value;
+	int err;
+
+	err = kstrtouint(buf, 10, &value);
+	if (err < 0)
+		return err;
+
+	dql->slack_hold_time = msecs_to_jiffies(value);
+
+	return len;
+}
+
+static struct netdev_queue_attribute bql_hold_time_attribute =
+	__ATTR(hold_time, S_IRUGO | S_IWUSR, bql_show_hold_time,
+	    bql_set_hold_time);
+
+static ssize_t bql_show_inflight(struct netdev_queue *queue,
+				 struct netdev_queue_attribute *attr,
+				 char *buf)
+{
+	struct dql *dql = &queue->dql;
+	int p = 0;
+
+	p = sprintf(buf, "%lu\n", dql->num_queued - dql->num_completed);
+
+	return p;
+}
+
+static struct netdev_queue_attribute bql_inflight_attribute =
+	__ATTR(inflight, S_IRUGO, bql_show_inflight, NULL);
+
+#define BQL_ATTR(NAME, FIELD)						\
+static ssize_t bql_show_ ## NAME(struct netdev_queue *queue,		\
+				 struct netdev_queue_attribute *attr,	\
+				 char *buf)				\
+{									\
+	return bql_show(buf, queue->dql.FIELD);				\
+}									\
+									\
+static ssize_t bql_set_ ## NAME(struct netdev_queue *queue,		\
+				struct netdev_queue_attribute *attr,	\
+				const char *buf, size_t len)		\
+{									\
+	return bql_set(buf, len, &queue->dql.FIELD);			\
+}									\
+									\
+static struct netdev_queue_attribute bql_ ## NAME ## _attribute =	\
+	__ATTR(NAME, S_IRUGO | S_IWUSR, bql_show_ ## NAME,		\
+	    bql_set_ ## NAME);
+
+BQL_ATTR(limit, limit)
+BQL_ATTR(limit_max, max_limit)
+BQL_ATTR(limit_min, min_limit)
+
+static struct attribute *dql_attrs[] = {
+	&bql_limit_attribute.attr,
+	&bql_limit_max_attribute.attr,
+	&bql_limit_min_attribute.attr,
+	&bql_hold_time_attribute.attr,
+	&bql_inflight_attribute.attr,
+	NULL
+};
+
+static struct attribute_group dql_group = {
+	.name  = "byte_queue_limits",
+	.attrs  = dql_attrs,
+};
 
 static ssize_t show_xps_map(struct netdev_queue *queue,
 			    struct netdev_queue_attribute *attribute, char *buf)
@@ -1119,14 +1233,22 @@ static int netdev_queue_add_kobject(struct net_device *net, int index)
 	kobj->kset = net->queues_kset;
 	error = kobject_init_and_add(kobj, &netdev_queue_ktype, NULL,
 	    "tx-%u", index);
-	if (error) {
-		kobject_put(kobj);
-		return error;
+	if (error)
+		goto exit;
+
+	if (queue->do_bql) {
+		error = sysfs_create_group(kobj, &dql_group);
+		if (error)
+			goto kset_exit;
 	}
 
 	kobject_uevent(kobj, KOBJ_ADD);
 	dev_hold(queue->dev);
 
+	return 0;
+kset_exit:
+	kobject_put(kobj);
+exit:
 	return error;
 }
 #endif /* CONFIG_XPS */
@@ -1146,8 +1268,13 @@ netdev_queue_update_kobjects(struct net_device *net, int old_num, int new_num)
 		}
 	}
 
-	while (--i >= new_num)
-		kobject_put(&net->_tx[i].kobj);
+	while (--i >= new_num) {
+		struct netdev_queue *queue = net->_tx + i;
+
+		if (queue->do_bql)
+			sysfs_remove_group(&queue->kobj, &dql_group);
+		kobject_put(&queue->kobj);
+	}
 
 	return error;
 #else
-- 
1.7.3.1


^ permalink raw reply related

* [PATCH 3/3] forcedeth: Support for byte queue limits
From: Tom Herbert @ 2011-04-26  4:38 UTC (permalink / raw)
  To: davem, netdev

Changes to forcedeth to use byte queue limits.
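
(Illustration, not part of the posted patch.) The driver-side pattern
is: refuse to queue when the byte limit is exceeded, account bytes
when a packet is handed to the hardware, and account completed bytes
before deciding whether to wake the queue.  The toy_* names below are
hypothetical, and the limit is held fixed to keep the sketch short,
whereas the real series recomputes it on every completion.

```c
#include <stdbool.h>

/* Toy stand-in for one TX queue with a fixed byte limit. */
struct toy_txq {
	unsigned long limit;		/* byte limit */
	unsigned long queued;		/* total bytes ever queued */
	unsigned long completed;	/* total bytes ever completed */
	bool stopped;			/* queue stopped by the driver */
};

/* Mirrors the ">= 0" test in netdev_queue_bytes_avail(). */
static bool toy_bytes_avail(const struct toy_txq *q)
{
	return (long)(q->limit - (q->queued - q->completed)) >= 0;
}

/* Transmit path: stop the queue when over the byte limit, otherwise
 * account the packet length as sent. */
static bool toy_xmit(struct toy_txq *q, unsigned int len)
{
	if (!toy_bytes_avail(q)) {
		q->stopped = true;
		return false;
	}
	q->queued += len;
	return true;
}

/* Completion path: account cleaned bytes first, then wake the queue
 * only if there is room under the limit again. */
static void toy_tx_done(struct toy_txq *q, unsigned long bytes_cleaned)
{
	if (bytes_cleaned)
		q->completed += bytes_cleaned;
	if (q->stopped && toy_bytes_avail(q))
		q->stopped = false;
}
```

Because the check is "avail >= 0", a packet is still accepted when the
queue sits exactly at the limit, so the bytes in flight can exceed the
limit by at most one packet.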

Signed-off-by: Tom Herbert <therbert@google.com>
---
 drivers/net/forcedeth.c |   38 ++++++++++++++++++++++++++++++++++----
 1 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/drivers/net/forcedeth.c b/drivers/net/forcedeth.c
index 0e1c76a..00f9f99 100644
--- a/drivers/net/forcedeth.c
+++ b/drivers/net/forcedeth.c
@@ -1827,6 +1827,11 @@ static void nv_init_rx(struct net_device *dev)
 	}
 }
 
+static struct netdev_queue *nv_netdev_queue(struct fe_priv *np)
+{
+	return netdev_get_tx_queue(np->dev, 0);
+}
+
 static void nv_init_tx(struct net_device *dev)
 {
 	struct fe_priv *np = netdev_priv(dev);
@@ -1843,6 +1848,7 @@ static void nv_init_tx(struct net_device *dev)
 	np->tx_pkts_in_progress = 0;
 	np->tx_change_owner = NULL;
 	np->tx_end_flip = NULL;
+	netdev_queue_bql_reset(nv_netdev_queue(np));
 	np->tx_stop = 0;
 
 	for (i = 0; i < np->tx_ring_size; i++) {
@@ -2107,7 +2113,8 @@ static netdev_tx_t nv_start_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	spin_lock_irqsave(&np->lock, flags);
 	empty_slots = nv_get_empty_tx_slots(np);
-	if (unlikely(empty_slots <= entries)) {
+	if (unlikely(empty_slots <= entries ||
+	    !netdev_queue_bytes_avail(nv_netdev_queue(np)))) {
 		netif_stop_queue(dev);
 		np->tx_stop = 1;
 		spin_unlock_irqrestore(&np->lock, flags);
@@ -2180,6 +2187,9 @@ static netdev_tx_t nv_start_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	/* set tx flags */
 	start_tx->flaglen |= cpu_to_le32(tx_flags | tx_flags_extra);
+
+	netdev_queue_bytes_sent(nv_netdev_queue(np), skb->len);
+
 	np->put_tx.orig = put_tx;
 
 	spin_unlock_irqrestore(&np->lock, flags);
@@ -2216,7 +2226,8 @@ static netdev_tx_t nv_start_xmit_optimized(struct sk_buff *skb,
 
 	spin_lock_irqsave(&np->lock, flags);
 	empty_slots = nv_get_empty_tx_slots(np);
-	if (unlikely(empty_slots <= entries)) {
+	if (unlikely(empty_slots <= entries ||
+	    !netdev_queue_bytes_avail(nv_netdev_queue(np)))) {
 		netif_stop_queue(dev);
 		np->tx_stop = 1;
 		spin_unlock_irqrestore(&np->lock, flags);
@@ -2319,6 +2330,9 @@ static netdev_tx_t nv_start_xmit_optimized(struct sk_buff *skb,
 
 	/* set tx flags */
 	start_tx->flaglen |= cpu_to_le32(tx_flags | tx_flags_extra);
+
+	netdev_queue_bytes_sent(nv_netdev_queue(np), skb->len);
+
 	np->put_tx.ex = put_tx;
 
 	spin_unlock_irqrestore(&np->lock, flags);
@@ -2356,6 +2370,7 @@ static int nv_tx_done(struct net_device *dev, int limit)
 	u32 flags;
 	int tx_work = 0;
 	struct ring_desc *orig_get_tx = np->get_tx.orig;
+	unsigned long bytes_cleaned = 0;
 
 	while ((np->get_tx.orig != np->put_tx.orig) &&
 	       !((flags = le32_to_cpu(np->get_tx.orig->flaglen)) & NV_TX_VALID) &&
@@ -2395,6 +2410,7 @@ static int nv_tx_done(struct net_device *dev, int limit)
 					dev->stats.tx_packets++;
 					dev->stats.tx_bytes += np->get_tx_ctx->skb->len;
 				}
+				bytes_cleaned += np->get_tx_ctx->skb->len;
 				dev_kfree_skb_any(np->get_tx_ctx->skb);
 				np->get_tx_ctx->skb = NULL;
 				tx_work++;
@@ -2405,7 +2421,12 @@ static int nv_tx_done(struct net_device *dev, int limit)
 		if (unlikely(np->get_tx_ctx++ == np->last_tx_ctx))
 			np->get_tx_ctx = np->first_tx_ctx;
 	}
-	if (unlikely((np->tx_stop == 1) && (np->get_tx.orig != orig_get_tx))) {
+
+	if (bytes_cleaned)
+		netdev_queue_bytes_completed(nv_netdev_queue(np),
+		    bytes_cleaned);
+	if (unlikely((np->tx_stop == 1) && (np->get_tx.orig != orig_get_tx) &&
+	    netdev_queue_bytes_avail(nv_netdev_queue(np)))) {
 		np->tx_stop = 0;
 		netif_wake_queue(dev);
 	}
@@ -2418,6 +2439,7 @@ static int nv_tx_done_optimized(struct net_device *dev, int limit)
 	u32 flags;
 	int tx_work = 0;
 	struct ring_desc_ex *orig_get_tx = np->get_tx.ex;
+	unsigned long bytes_cleaned = 0;
 
 	while ((np->get_tx.ex != np->put_tx.ex) &&
 	       !((flags = le32_to_cpu(np->get_tx.ex->flaglen)) & NV_TX2_VALID) &&
@@ -2437,6 +2459,7 @@ static int nv_tx_done_optimized(struct net_device *dev, int limit)
 				}
 			}
 
+			bytes_cleaned += np->get_tx_ctx->skb->len;
 			dev_kfree_skb_any(np->get_tx_ctx->skb);
 			np->get_tx_ctx->skb = NULL;
 			tx_work++;
@@ -2449,7 +2472,12 @@ static int nv_tx_done_optimized(struct net_device *dev, int limit)
 		if (unlikely(np->get_tx_ctx++ == np->last_tx_ctx))
 			np->get_tx_ctx = np->first_tx_ctx;
 	}
-	if (unlikely((np->tx_stop == 1) && (np->get_tx.ex != orig_get_tx))) {
+
+	if (bytes_cleaned)
+		netdev_queue_bytes_completed(nv_netdev_queue(np),
+		    bytes_cleaned);
+	if (unlikely((np->tx_stop == 1) && (np->get_tx.ex != orig_get_tx) &&
+	    netdev_queue_bytes_avail(nv_netdev_queue(np)))) {
 		np->tx_stop = 0;
 		netif_wake_queue(dev);
 	}
@@ -5263,6 +5291,8 @@ static int __devinit nv_probe(struct pci_dev *pci_dev, const struct pci_device_i
 	np->stats_poll.data = (unsigned long) dev;
 	np->stats_poll.function = nv_do_stats_poll;	/* timer handler */
 
+	netdev_queue_bql_init(nv_netdev_queue(np));
+
 	err = pci_enable_device(pci_dev);
 	if (err)
 		goto out_free;
-- 
1.7.3.1


^ permalink raw reply related

* Re: how to set vlan filter for intel 82599
From: zhou rui @ 2011-04-26  4:39 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: netdev
In-Reply-To: <1303789868.3032.347.camel@localhost>

On Tue, Apr 26, 2011 at 11:51 AM, Ben Hutchings
<bhutchings@solarflare.com> wrote:
> On Tue, 2011-04-26 at 11:39 +0800, zhou rui wrote:
>> On Tue, Apr 26, 2011 at 10:57 AM, Ben Hutchings
>> <bhutchings@solarflare.com> wrote:
>> > On Tue, 2011-04-26 at 10:19 +0800, zhou rui wrote:
>> >> hi
>> >> here is the problem troubles me,how to set vlan filter for intel
>> >> 82599? for example
>> >> I want vlan id 0~31 will go to queue 0, vlan id 32-63 will go to queue
>> >> 1...below is my setting,but doesn't work
>> >>
>> >> don't know the exact meanning of the vlan-mask and vlan,how are they calculated?
>> >>
>> >> ./ethtool -K eth5 ntuple on
>> >>
>> >> ./ethtool -U eth5 flow-type udp4 src-ip 0x0 src-ip-mask 0x0 dst-ip 0x0
>> >> dst-ip-mask 0x0 src-port 0x0 src-port-mask 0x0 dst-port 0x0
>> >> dst-port-mask 0x0 vlan 0x0000 vlan-mask 0x00E0 user-def 0x0
>> >> user-def-mask 0x0 action 0
>> > [...]
>> >
>> > This specifies a filter for UDP/IPv4 packets, and the masks are wrong.
>> > If you actually wanted to filter only UDP/IPv4 packets for VID 0-31 then
>> > the correct syntax would be:
>> >
>> >    ethtool -U eth5 flow-type udp4 vlan 0 vlan-mask 0xf01f
>> >
>> > If you don't care about the layer 3/4 protocols then you would need to
>> > use 'flow-type ether', but no driver implements that yet.  (Well, sfc
>> > implements the *type*, but not filtering by VID only.)
> [...]
>> hi ben,thanks for your help,would you mind tell me "32~63" VID filter?
>> still can not understand the vlan-mask
>
> The vlan-mask specifies tag bits to be ignored.  You want to ignore the
> priority, CFI and lower 5 bits of the VID, hence 0xf01f.  For a
> different group of 32 VIDs you would only change the vlan argument, not
> the vlan-mask argument.
>
> Ben.
>
> --
> Ben Hutchings, Senior Software Engineer, Solarflare
> Not speaking for my employer; that's the marketing department's job.
> They asked us to note that Solarflare product names are trademarked.
>
>
I set the filters as shown below. For VLAN ID 50, traffic always matches
the last rule (action 7):

./ethtool -K eth5 ntuple off
./ethtool -K eth5 ntuple on
./ethtool -U eth5 flow-type tcp4 vlan 32 vlan-mask 0xF01F action 1
./ethtool -U eth5 flow-type udp4 vlan 32 vlan-mask 0xF01F action 1
./ethtool -U eth5 flow-type udp4 vlan 64 vlan-mask 0xF01F action 7
./ethtool -U eth5 flow-type tcp4 vlan 64 vlan-mask 0xF01F action 7

I tried the latest ixgbe driver 3.3.9, it reports:

Cannot add new RX n-tuple filter: Operation not permitted

./ethtool -V
ethtool version 2.6.36

thanks!
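
(Illustration, not part of the original message.) Under the semantics
Ben describes in the quoted text, where vlan-mask marks tag bits to be
ignored, a match test can be sketched as below.  The function name
vlan_filter_matches() is hypothetical, and this models only the
described mask semantics, not what the ixgbe hardware actually does.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if tci (the 16-bit VLAN tag: 3 priority bits, 1 CFI bit and
 * 12 VID bits) equals value on every bit NOT set in ignore_mask. */
static bool vlan_filter_matches(uint16_t tci, uint16_t value,
				uint16_t ignore_mask)
{
	return ((tci ^ value) & (uint16_t)~ignore_mask) == 0;
}
```

With vlan-mask 0xf01f, "vlan 32" covers VIDs 32-63 and "vlan 64"
covers VIDs 64-95, so by this reading a tag with VID 50 would match
the "vlan 32" rules rather than the "vlan 64" ones.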

^ permalink raw reply

* Re: [PATCH] netfilter/IPv6: initialize TOS field in REJECT target module
From: David Miller @ 2011-04-26  5:17 UTC (permalink / raw)
  To: pablo; +Cc: eric.dumazet, fernando, netfilter-devel, netdev, yoshfuji,
	jengelh
In-Reply-To: <4DB61C2C.7060508@netfilter.org>

From: Pablo Neira Ayuso <pablo@netfilter.org>
Date: Tue, 26 Apr 2011 03:13:16 +0200

> On 22/04/11 10:37, Eric Dumazet wrote:
>> Le vendredi 22 avril 2011 à 17:11 +0900, Fernando Luis Vazquez Cao a
>> écrit :
>> 
>>> Thank you!
>>>
>>> Should we send these two patches to -stable too?
>> 
>> David takes care of stable submissions for netdev stuff, thanks.
> 
> If the patch follows the netfilter path, we'll take care of sending
> stable submissions.

Right.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] netfilter/IPv6: initialize TOS field in REJECT target module
From: David Miller @ 2011-04-26  5:17 UTC (permalink / raw)
  To: fernando; +Cc: pablo, eric.dumazet, netfilter-devel, netdev, yoshfuji, jengelh
In-Reply-To: <1303781180.2874.13.camel@nausicaa>

From: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Date: Tue, 26 Apr 2011 10:26:20 +0900

> On Tue, 2011-04-26 at 03:13 +0200, Pablo Neira Ayuso wrote:
>> On 22/04/11 10:37, Eric Dumazet wrote:
>> > Le vendredi 22 avril 2011 à 17:11 +0900, Fernando Luis Vazquez Cao a
>> > écrit :
>> > 
>> >> Thank you!
>> >>
>> >> Should we send these two patches to -stable too?
>> > 
>> > David takes care of stable submissions for netdev stuff, thanks.
>> 
>> If the patch follows the netfilter path, we'll take care of sending
>> stable submissions.
> 
> David, will you take care of these two patches or should they go through
> the netfilter tree?

Netfilter, as usual.

^ permalink raw reply

* Re: [PATCH] netfilter/IPv6: initialize TOS field in REJECT target module
From: Fernando Luis Vazquez Cao @ 2011-04-26  5:25 UTC (permalink / raw)
  To: David Miller
  Cc: pablo, eric.dumazet, netfilter-devel, netdev, yoshfuji, jengelh
In-Reply-To: <20110425.221744.226775617.davem@davemloft.net>

On Mon, 2011-04-25 at 22:17 -0700, David Miller wrote:
> From: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
> Date: Tue, 26 Apr 2011 10:26:20 +0900
> 
> > On Tue, 2011-04-26 at 03:13 +0200, Pablo Neira Ayuso wrote:
> >> On 22/04/11 10:37, Eric Dumazet wrote:
> >> > Le vendredi 22 avril 2011 à 17:11 +0900, Fernando Luis Vazquez Cao a
> >> > écrit :
> >> > 
> >> >> Thank you!
> >> >>
> >> >> Should we send these two patches to -stable too?
> >> > 
> >> > David takes care of stable submissions for netdev stuff, thanks.
> >> 
> >> If the patch follows the netfilter path, we'll take care of sending
> >> stable submissions.
> > 
> > David, will you take care of these two patches or should they go through
> > the netfilter tree?
> 
> Netfilter, as usual.

Thank you for the clarification. I really appreciate it.

Pablo, could you pull in the two patches below? They have already been
acked by Eric. It would be great if we could get them merged for the
next -rc and stable releases.

[PATCH] netfilter/IPv6: fix DSCP mangle code
[PATCH] netfilter/IPv6: initialize TOS field in REJECT target module

- Fernando


^ permalink raw reply

* Re: [PATCH net-next-2.6 v6 1/5] sctp: Add Auto-ASCONF support
From: Wei Yongjun @ 2011-04-26  5:30 UTC (permalink / raw)
  To: Michio Honda; +Cc: netdev, lksctp-developers
In-Reply-To: <5D4D424A-5545-44C4-909A-4DDE594BD5D4@sfc.wide.ad.jp>


> SCTP reconfigure the IP addresses in the association by using ASCONF chunks as mentioned in RFC5061.  
> For example, we can start to use the newly configured IP address in the existing association.  
> ASCONF operation is invoked in two ways: 
> First is done by the application to call sctp_bindx() system call.  
> Second is automatic operation in the SCTP stack with address events in the host computer (called auto_asconf) .  
> The former is already implemented, and this patch implement the latter.  
>
> Signed-off-by: Michio Honda <micchie@sfc.wide.ad.jp>
>

Acked-by: Wei Yongjun <yjwei@cn.fujitsu.com>


^ permalink raw reply

* Re: [PATCH net-next-2.6 v6 2/5] sctp: Add sysctl support for Auto-ASCONF
From: Wei Yongjun @ 2011-04-26  5:30 UTC (permalink / raw)
  To: Michio Honda; +Cc: netdev, lksctp-developers
In-Reply-To: <4B304B0D-35AC-4372-84F3-EFBC5A4C7BF2@sfc.wide.ad.jp>


> This patch allows the system administrator to change the default Auto-ASCONF on/off behavior via a sysctl value.
>
> Signed-off-by: Michio Honda <micchie@sfc.wide.ad.jp>
>

Acked-by: Wei Yongjun <yjwei@cn.fujitsu.com>


^ permalink raw reply

* Re: [PATCH net-next-2.6 v6 3/5] sctp: Add socket option operation for Auto-ASCONF
From: Wei Yongjun @ 2011-04-26  5:31 UTC (permalink / raw)
  To: Michio Honda; +Cc: netdev, lksctp-developers
In-Reply-To: <0B9100AB-44C5-49E7-AA03-8B99180BE7E3@sfc.wide.ad.jp>



> This patch allows the application to operate Auto-ASCONF on/off behavior via setsockopt() and getsockopt().  
>
> Signed-off-by: Michio Honda <micchie@sfc.wide.ad.jp>
> ---
>

Acked-by: Wei Yongjun <yjwei@cn.fujitsu.com>


^ permalink raw reply

* Re: [PATCH net-next-2.6 v6 4/5] sctp: Add ADD/DEL ASCONF handling at the receiver
From: Wei Yongjun @ 2011-04-26  5:31 UTC (permalink / raw)
  To: Michio Honda; +Cc: netdev, lksctp-developers
In-Reply-To: <BECB6CDC-BC4F-4BC1-B67D-B9F3F02E8D87@sfc.wide.ad.jp>


> This patch fixes the problem that the original code cannot delete the remote address where the corresponding transport is currently directed, even when the ASCONF is sent from the other address (this situation happens when the single-homed sender transmits  ASCONF with ADD and DEL.)  
>
> Signed-off-by: Michio Honda <micchie@sfc.wide.ad.jp>
> ---
>
Acked-by: Wei Yongjun <yjwei@cn.fujitsu.com>


^ permalink raw reply

* Re: [PATCH net-next-2.6 v6 5/5] sctp: Add ASCONF operation on the single-homed host
From: Wei Yongjun @ 2011-04-26  5:33 UTC (permalink / raw)
  To: Michio Honda; +Cc: netdev, lksctp-developers
In-Reply-To: <856CB69B-767A-4F6C-9DBF-26EEAFCC3B56@sfc.wide.ad.jp>


> SCTP can change the IP address on the single-homed host.  
> In this case, the SCTP association transmits an ASCONF packet including addition of the new IP address and deletion of the old address.  This patch implements this functionality.  
> In this case, the ASCONF chunk is added to the beginning of the queue, because the other chunks cannot be transmitted in this state.  
>
> Signed-off-by: Michio Honda <micchie@sfc.wide.ad.jp>
> ---
>
Acked-by: Wei Yongjun <yjwei@cn.fujitsu.com>


^ permalink raw reply

* Re: [PATCH 2/3] bql: Byte queue limits
From: Eric Dumazet @ 2011-04-26  5:38 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev
In-Reply-To: <alpine.DEB.2.00.1104252128290.5895@pokey.mtv.corp.google.com>

On Monday, 25 April 2011 at 21:38 -0700, Tom Herbert wrote:
> Networking stack support for byte queue limits, uses dynamic queue
> limits library.  Byte queue limits are maintained per transmit queue,
> and a bql structure has been added to netdev_queue structure for this
> purpose.
> 
> Configuration of bql is in the tx-<n> sysfs directory for the queue
> under the byte_queue_limits directory.  Configuration includes:
> limit_min, bql minimum limit
> limit_max, bql maximum limit
> hold_time, bql slack hold time
> 
> Also under the directory are:
> limit, current byte limit
> inflight, current number of bytes on the queue
> 

Wow... magic values and very little advice on how to tune them.

Tom, this reminds me you were supposed to provide Documentation/files to
describe RPS, RFS, XPS ...

We receive many questions about these features...

> Signed-off-by: Tom Herbert <therbert@google.com>
> ---
>  include/linux/netdevice.h |   46 +++++++++++++++-
>  net/core/net-sysfs.c      |  137 +++++++++++++++++++++++++++++++++++++++++++--
>  2 files changed, 177 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index cb8178a..0a76b88 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -44,6 +44,7 @@
>  #include <linux/rculist.h>
>  #include <linux/dmaengine.h>
>  #include <linux/workqueue.h>
> +#include <linux/dynamic_queue_limits.h>
>  
>  #include <linux/ethtool.h>
>  #include <net/net_namespace.h>
> @@ -556,8 +557,10 @@ struct netdev_queue {
>  	struct Qdisc		*qdisc;
>  	unsigned long		state;
>  	struct Qdisc		*qdisc_sleeping;
> -#ifdef CONFIG_RPS
> +#ifdef CONFIG_XPS
>  	struct kobject		kobj;
> +	bool			do_bql;
> +	struct dql		dql;
>  #endif

I have no idea why you use CONFIG_XPS for BQL (how is BQL related to
SMP ???), and why kobj is now guarded by CONFIG_XPS instead of
CONFIG_RPS.




^ permalink raw reply

* [PATCH net-2.6 1/4] xfrm: Fix replay window size calculation on initialization
From: Steffen Klassert @ 2011-04-26  5:39 UTC (permalink / raw)
  To: David Miller, Herbert Xu; +Cc: netdev

On replay initialization, we compute the size of the replay
buffer to see if the replay window fits into the buffer.
This computation lacks a multiplication by 8 because we need
the size in bits, not in bytes. So we might return an error
even though the replay window would fit into the buffer.
This patch fixes this issue.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
 net/xfrm/xfrm_replay.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/xfrm/xfrm_replay.c b/net/xfrm/xfrm_replay.c
index f218385..e8a7814 100644
--- a/net/xfrm/xfrm_replay.c
+++ b/net/xfrm/xfrm_replay.c
@@ -532,7 +532,7 @@ int xfrm_init_replay(struct xfrm_state *x)
 
 	if (replay_esn) {
 		if (replay_esn->replay_window >
-		    replay_esn->bmp_len * sizeof(__u32))
+		    replay_esn->bmp_len * sizeof(__u32) * 8)
 			return -EINVAL;
 
 	if ((x->props.flags & XFRM_STATE_ESN) && x->replay_esn)
-- 
1.7.0.4
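
As an illustration of the unit mismatch this patch describes, here is a
minimal user-space sketch. It is not the kernel code: struct replay_bitmap
and the helper names are hypothetical stand-ins, and only the fields
bmp_len and replay_window are taken from the diff above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the bitmap part of the ESN replay state. */
struct replay_bitmap {
	uint32_t bmp_len;       /* number of 32-bit words in the bitmap */
	uint32_t replay_window; /* window size in packets, one bit each */
};

/* Capacity of the bitmap in bits: words * bytes per word * 8. */
static uint64_t bitmap_capacity_bits(const struct replay_bitmap *r)
{
	assert(r != NULL);
	return (uint64_t)r->bmp_len * sizeof(uint32_t) * 8;
}

/* Post-patch comparison: returns 1 if the window fits, 0 otherwise. */
static int window_fits(const struct replay_bitmap *r)
{
	return r->replay_window <= bitmap_capacity_bits(r);
}

/* Pre-patch comparison, which measured the bitmap in bytes. */
static int window_fits_bytes(const struct replay_bitmap *r)
{
	assert(r != NULL);
	return r->replay_window <= (uint64_t)r->bmp_len * sizeof(uint32_t);
}
```

With bmp_len = 4 the bitmap holds 16 bytes, i.e. 128 bits, so a window of
128 is accepted by window_fits() but rejected by window_fits_bytes().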


^ permalink raw reply related

* [PATCH net-2.6 2/4] esp6: Fix scatterlist initialization
From: Steffen Klassert @ 2011-04-26  5:40 UTC (permalink / raw)
  To: David Miller, Herbert Xu; +Cc: netdev
In-Reply-To: <20110426053923.GF5495@secunet.com>

When we use IPsec extended sequence numbers, we may overwrite
the last scatterlist of the associated data by the scatterlist
for the skb. This patch fixes this by placing the scatterlist
for the skb right behind the last scatterlist of the associated
data. esp4 does it already like that.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
 net/ipv6/esp6.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c
index 5aa8ec8..59dccfb 100644
--- a/net/ipv6/esp6.c
+++ b/net/ipv6/esp6.c
@@ -371,7 +371,7 @@ static int esp6_input(struct xfrm_state *x, struct sk_buff *skb)
 	iv = esp_tmp_iv(aead, tmp, seqhilen);
 	req = esp_tmp_req(aead, iv);
 	asg = esp_req_sg(aead, req);
-	sg = asg + 1;
+	sg = asg + sglists;
 
 	skb->ip_summed = CHECKSUM_NONE;
 
-- 
1.7.0.4
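
The overwrite described above can be pictured with a toy model. This is
only a sketch under assumptions: struct sg_slot is a hypothetical stand-in
for a scatterlist entry, and the count of three associated-data entries
used in the note below is an assumed value chosen for illustration, not
something stated in this patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a scatterlist entry; only ownership matters. */
struct sg_slot {
	int owner;
};

enum { OWNER_NONE = 0, OWNER_ASSOC = 1, OWNER_SKB = 2 };

/*
 * Fill a flat array the way the commit message describes: the first
 * `sglists` slots hold the associated data and the skb slots start at
 * index `skb_start`. Returns how many associated-data slots were
 * overwritten by skb slots.
 */
static int fill_slots(struct sg_slot *slots, size_t total,
		      size_t sglists, size_t skb_start)
{
	size_t i;
	int clobbered = 0;

	for (i = 0; i < total; i++)
		slots[i].owner = OWNER_NONE;
	for (i = 0; i < sglists && i < total; i++)
		slots[i].owner = OWNER_ASSOC;
	for (i = skb_start; i < total; i++) {
		if (slots[i].owner == OWNER_ASSOC)
			clobbered++;
		slots[i].owner = OWNER_SKB;
	}
	return clobbered;
}
```

Starting the skb slots at index 1 (the "asg + 1" form) with three
associated-data slots overwrites two of them in this model, while starting
at index sglists (the "asg + sglists" form) overwrites none.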


^ permalink raw reply related

* [PATCH net-2.6 3/4] xfrm: Check for the new replay implementation if an esn state is inserted
From: Steffen Klassert @ 2011-04-26  5:41 UTC (permalink / raw)
  To: David Miller, Herbert Xu; +Cc: netdev
In-Reply-To: <20110426053923.GF5495@secunet.com>

IPsec extended sequence numbers can be used only with the new
anti-replay window implementation. So check if the new implementation
is used if an esn state is inserted and return an error if it is not.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
 net/xfrm/xfrm_user.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index 5d1d60d..c658cb3 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -124,6 +124,9 @@ static inline int verify_replay(struct xfrm_usersa_info *p,
 {
 	struct nlattr *rt = attrs[XFRMA_REPLAY_ESN_VAL];
 
+	if ((p->flags & XFRM_STATE_ESN) && !rt)
+		return -EINVAL;
+
 	if (!rt)
 		return 0;
 
-- 
1.7.0.4


^ permalink raw reply related

* Re: [PATCH net-2.6 1/4] xfrm: Fix replay window size calculation on initialization
From: Herbert Xu @ 2011-04-26  5:41 UTC (permalink / raw)
  To: Steffen Klassert; +Cc: David Miller, netdev
In-Reply-To: <20110426053923.GF5495@secunet.com>

On Tue, Apr 26, 2011 at 07:39:24AM +0200, Steffen Klassert wrote:
> On replay initialization, we compute the size of the replay
> buffer to see if the replay window fits into the buffer.
> This computation lacks a multiplication by 8 because we need
> the size in bits, not in bytes. So we might return an error
> even though the replay window would fit into the buffer.
> This patch fixes this issue.
> 
> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* [PATCH net-2.6 4/4] xfrm: Fix integer underrun on zero sized replay windows
From: Steffen Klassert @ 2011-04-26  5:42 UTC (permalink / raw)
  To: David Miller, Herbert Xu; +Cc: netdev
In-Reply-To: <20110426053923.GF5495@secunet.com>

The check whether the replay window is contained within one subspace or
spans two subspaces causes an unwanted integer underrun on zero sized
replay windows when we subtract one. We fix this by changing the check
to avoid the subtraction.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
 net/xfrm/xfrm_replay.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/xfrm/xfrm_replay.c b/net/xfrm/xfrm_replay.c
index e8a7814..19f94bb 100644
--- a/net/xfrm/xfrm_replay.c
+++ b/net/xfrm/xfrm_replay.c
@@ -32,7 +32,7 @@ u32 xfrm_replay_seqhi(struct xfrm_state *x, __be32 net_seq)
 	seq_hi = replay_esn->seq_hi;
 	bottom = replay_esn->seq - replay_esn->replay_window + 1;
 
-	if (likely(replay_esn->seq >= replay_esn->replay_window - 1)) {
+	if (likely(replay_esn->seq > replay_esn->replay_window)) {
 		/* A. same subspace */
 		if (unlikely(seq < bottom))
 			seq_hi++;
-- 
1.7.0.4
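
To make the wraparound concrete, here is a small user-space sketch of the
two comparisons from the diff above. The function names are hypothetical
and the values are plain uint32_t rather than fields of the kernel's
replay state.

```c
#include <assert.h>
#include <stdint.h>

/* Pre-patch form of the check: subtracts one from the window size. */
static int same_subspace_old(uint32_t seq, uint32_t replay_window)
{
	/* With replay_window == 0 the right-hand side wraps to UINT32_MAX. */
	return seq >= (uint32_t)(replay_window - 1);
}

/* Post-patch form: no subtraction, so nothing can wrap. */
static int same_subspace_new(uint32_t seq, uint32_t replay_window)
{
	return seq > replay_window;
}
```

For a zero sized window and seq = 100, same_subspace_old() compares 100
against UINT32_MAX and takes the "two subspaces" branch, while
same_subspace_new() stays in the "same subspace" branch.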


^ permalink raw reply related

* Re: [PATCH net-2.6 2/4] esp6: Fix scatterlist initialization
From: Herbert Xu @ 2011-04-26  5:41 UTC (permalink / raw)
  To: Steffen Klassert; +Cc: David Miller, netdev
In-Reply-To: <20110426054023.GG5495@secunet.com>

On Tue, Apr 26, 2011 at 07:40:23AM +0200, Steffen Klassert wrote:
> When we use IPsec extended sequence numbers, we may overwrite
> the last scatterlist of the associated data by the scatterlist
> for the skb. This patch fixes this by placing the scatterlist
> for the skb right behind the last scatterlist of the associated
> data. esp4 does it already like that.
> 
> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH net-2.6 3/4] xfrm: Check for the new replay implementation if an esn state is inserted
From: Herbert Xu @ 2011-04-26  5:43 UTC (permalink / raw)
  To: Steffen Klassert; +Cc: David Miller, netdev
In-Reply-To: <20110426054121.GH5495@secunet.com>

On Tue, Apr 26, 2011 at 07:41:21AM +0200, Steffen Klassert wrote:
> IPsec extended sequence numbers can be used only with the new
> anti-replay window implementation. So check if the new implementation
> is used if an esn state is inserted and return an error if it is not.
> 
> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH 0/3] net: Byte queue limit patch series
From: Bill Fink @ 2011-04-26  5:56 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev
In-Reply-To: <alpine.DEB.2.00.1104252128001.5889@pokey.mtv.corp.google.com>

On Mon, 25 Apr 2011, Tom Herbert wrote:

> This patch series implements byte queue limits (bql) for NIC TX queues.
> 
> Byte queue limits are a mechanism to limit the size of the transmit
> hardware queue on a NIC by number of bytes. The goal of these byte
> limits is to reduce latency caused by excessive queuing in hardware
> without sacrificing throughput.
> 
> Hardware queuing limits are typically specified in terms of a number of
> hardware descriptors, each of which has a variable size. The variability
> of the size of individual queued items can have a very wide range. For
> instance with the e1000 NIC the size could range from 64 bytes to 4K
> (with TSO enabled). This variability makes it next to impossible to
> choose a single queue limit that prevents starvation and provides lowest
> possible latency.
> 
> The objective of byte queue limits is to set the limit to be the
> minimum needed to prevent starvation between successive transmissions to
> the hardware. The latency between two transmissions can be variable in a
> system. It is dependent on interrupt frequency, NAPI polling latencies,
> scheduling of the queuing discipline, lock contention, etc. Therefore we
> propose that byte queue limits should be dynamic and change in
> accordance with the networking stack latencies a system encounters.
> 
> Patches to implement this:
> Patch 1: Dynamic queue limits (dql) library.  This provides the general
> queuing algorithm.
> Patch 2: netdev changes that use dql to support byte queue limits.
> Patch 3: Support in forcedeth driver for byte queue limits.
> 
> The effects of BQL are demonstrated in the benchmark results below.
> These were made running 200 streams of netperf RR tests:
> 
> 140000 rr size
> BQL: 80-215K bytes in queue, 856 tps, 3.26% cpu
> No BQL: 2700-2930K bytes in queue, 854 tps, 3.71% cpu

	tps	+0.23 %

> 14000 rr size
> BQL: 25-55K bytes in queue, 8500 tps
> No BQL: 1500-1622K bytes in queue,  8523 tps, 4.53% cpu

	tps	-0.27 %

> 1400 rr size
> BQL: 20-38K bytes in queue, 86582 tps, 7.38% cpu
> No BQL: 29-117K bytes in queue, 85738 tps, 7.67% cpu

	tps	+0.98 %

> 140 rr size
> BQL: 1-10K bytes in queue, 320540 tps, 34.6% cpu
> No BQL: 1-13K bytes in queue, 323158 tps, 37.16% cpu

	tps	-0.81 %

> 1 rr size
> BQL: 0-3K in queue, 338811 tps, 41.41% cpu
> No BQL: 0-3K in queue, 339947 tps, 42.36% cpu

	tps	-0.33 %

> The amount of queuing in the NIC is reduced up to 90%, and I haven't
> yet seen a consistent negative impact in terms of throughput or
> CPU utilization.

I don't quite follow your conclusion from your data.
While there was a sweet spot for the 1400 rr size, other
smaller rr took a hit.  Now all the tps changes were
within 1 %, so perhaps that isn't considered significant
(I'm not qualified to make that call).  But if that's
the case, then the effective latency change seen by the
user isn't significant either, although the amount of
queuing in the NIC is admittedly significantly reduced
for a rr size of 1400 or larger.

					-Bill
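
The percentages annotated above follow from the quoted tps figures if the
no-BQL run is taken as the baseline. A minimal sketch of that arithmetic;
the struct and function names are made up for this illustration, and the
numbers are the ones quoted in this thread:

```c
#include <assert.h>

/* One row of the quoted netperf RR results. */
struct rr_result {
	int rr_size;
	double bql_tps;
	double no_bql_tps;
};

/* tps figures as quoted in the cover letter. */
static const struct rr_result results[] = {
	{ 140000,    856.0,    854.0 },
	{  14000,   8500.0,   8523.0 },
	{   1400,  86582.0,  85738.0 },
	{    140, 320540.0, 323158.0 },
	{      1, 338811.0, 339947.0 },
};

/* Relative change of the BQL run against the no-BQL baseline, in percent. */
static double tps_change_percent(const struct rr_result *r)
{
	return (r->bql_tps - r->no_bql_tps) / r->no_bql_tps * 100.0;
}
```

Calling tps_change_percent() on each row gives values that round to the
+0.23, -0.27, +0.98, -0.81 and -0.33 percent shown above.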

^ permalink raw reply

* Re: [PATCH net-2.6 4/4] xfrm: Fix integer underrun on zero sized replay windows
From: Herbert Xu @ 2011-04-26  6:01 UTC (permalink / raw)
  To: Steffen Klassert; +Cc: David Miller, netdev
In-Reply-To: <20110426054232.GI5495@secunet.com>

On Tue, Apr 26, 2011 at 07:42:32AM +0200, Steffen Klassert wrote:
> The check whether the replay window is contained within one subspace or
> spans two subspaces causes an unwanted integer underrun on zero sized
> replay windows when we subtract one. We fix this by changing the check
> to avoid the subtraction.
> 
> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH 0/3] net: Byte queue limit patch series
From: Eric Dumazet @ 2011-04-26  6:14 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev
In-Reply-To: <alpine.DEB.2.00.1104252128001.5889@pokey.mtv.corp.google.com>

On Monday, 25 April 2011 at 21:38 -0700, Tom Herbert wrote:
> This patch series implements byte queue limits (bql) for NIC TX queues.
> 
> Byte queue limits are a mechanism to limit the size of the transmit
> hardware queue on a NIC by number of bytes. The goal of these byte
> limits is to reduce latency caused by excessive queuing in hardware
> without sacrificing throughput.
> 
> Hardware queuing limits are typically specified in terms of a number of
> hardware descriptors, each of which has a variable size. The variability
> of the size of individual queued items can have a very wide range. For
> instance with the e1000 NIC the size could range from 64 bytes to 4K
> (with TSO enabled). This variability makes it next to impossible to
> choose a single queue limit that prevents starvation and provides lowest
> possible latency.
> 
> The objective of byte queue limits is to set the limit to be the
> minimum needed to prevent starvation between successive transmissions to
> the hardware. The latency between two transmissions can be variable in a
> system. It is dependent on interrupt frequency, NAPI polling latencies,
> scheduling of the queuing discipline, lock contention, etc. Therefore we
> propose that byte queue limits should be dynamic and change in
> accordance with the networking stack latencies a system encounters.
> 
> Patches to implement this:
> Patch 1: Dynamic queue limits (dql) library.  This provides the general
> queuing algorithm.
> Patch 2: netdev changes that use dql to support byte queue limits.
> Patch 3: Support in forcedeth driver for byte queue limits.
> 
> The effects of BQL are demonstrated in the benchmark results below.
> These were made running 200 streams of netperf RR tests:
> 
> 140000 rr size
> BQL: 80-215K bytes in queue, 856 tps, 3.26% cpu
> No BQL: 2700-2930K bytes in queue, 854 tps, 3.71% cpu
> 
> 14000 rr size
> BQL: 25-55K bytes in queue, 8500 tps
> No BQL: 1500-1622K bytes in queue,  8523 tps, 4.53% cpu
> 
> 1400 rr size
> BQL: 20-38K bytes in queue, 86582 tps, 7.38% cpu
> No BQL: 29-117K bytes in queue, 85738 tps, 7.67% cpu
> 
> 140 rr size
> BQL: 1-10K bytes in queue, 320540 tps, 34.6% cpu
> No BQL: 1-13K bytes in queue, 323158 tps, 37.16% cpu
> 
> 1 rr size
> BQL: 0-3K in queue, 338811 tps, 41.41% cpu
> No BQL: 0-3K in queue, 339947 tps, 42.36% cpu
> 
> The amount of queuing in the NIC is reduced up to 90%, and I haven't
> yet seen a consistent negative impact in terms of throughput or
> CPU utilization.

Hi Tom

That's a focus on throughput, adding some extra latency (because of new
fields to access/dirty in the tx path and tx completion path), especially
on setups where many cpus are sending data on one device. I suspect this
is the price to pay to fight bufferbloat.

We can try to make this less expensive.

Maybe try to separate the DQL structure into two parts, one used on the
TX path (inside the already dirtied cache line in the netdev_queue
structure (_xmit_lock, xmit_lock_owner, trans_start)), and the other one
used on the TX completion path?
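
A rough user-space sketch of what such a split could look like. This is
not the layout from the patch: the field names are invented for
illustration, the 64-byte cache line size is an assumption, and it only
shows the two groups being kept on different cache lines, not the
placement inside netdev_queue.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64 /* assumed cache line size in bytes */

/* Hypothetical layout; field names are illustrative, not from the patch. */
struct dql_sketch {
	/* Group 1: written on the transmit path. */
	unsigned int num_queued;
	unsigned int limit;

	/* Group 2: written on the TX completion path, on its own line. */
	_Alignas(CACHE_LINE) unsigned int num_completed;
	unsigned int prev_num_queued;
};

/* Index of the cache line that holds a given byte offset. */
static size_t cache_line_of(size_t offset)
{
	return offset / CACHE_LINE;
}
```

With this layout the transmit path dirties only the first cache line and
the completion path only the second, so cpus doing one kind of work do not
invalidate the line used by the other.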


This new limit scheme also favors streams using super packets. Your
workload uses 200 identical clients; it would be nice to mix DNS traffic
(small UDP frames) in with them, and check how they behave when the queue
is full, while it was almost never full before...




^ permalink raw reply

* Re: [PATCH 0/3] net: Byte queue limit patch series
From: Eric Dumazet @ 2011-04-26  6:17 UTC (permalink / raw)
  To: Bill Fink; +Cc: Tom Herbert, davem, netdev
In-Reply-To: <20110426015645.c2d19cfe.billfink@mindspring.com>

On Tuesday, 26 April 2011 at 01:56 -0400, Bill Fink wrote:

> I don't quite follow your conclusion from your data.
> While there was a sweet spot for the 1400 rr size, other
> smaller rr took a hit.  Now all the tps changes were
> within 1 %, so perhaps that isn't considered significant
> (I'm not qualified to make that call).  But if that's
> the case, then the effective latency change seen by the
> user isn't significant either, although the amount of
> queuing in the NIC is admittedly significantly reduced
> for a rr size of 1400 or larger.

Tom's point was to show that we can reduce latency (because the size of
the netdevice queue is smaller) without changing tps ;)




^ permalink raw reply
