Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [RFC PATCH 04/12] net: introduce NETIF_F_RXCSUM
From: Michał Mirosław @ 2011-01-04 18:33 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: netdev
In-Reply-To: <20101219005715.GC12005@rere.qmqm.pl>

On Sun, Dec 19, 2010 at 01:57:15AM +0100, Michał Mirosław wrote:
> On Thu, Dec 16, 2010 at 11:27:29PM +0000, Ben Hutchings wrote:
> > On Wed, 2010-12-15 at 23:24 +0100, Michał Mirosław wrote:
> > > Introduce NETIF_F_RXCSUM to replace device-private flags for RX checksum
> > > offload. Integrate it with ndo_fix_features.
> > > 
> > > ethtool_op_get_rx_csum() is removed altogether as nothing in-tree uses it.
> > Not explicitly, but what about drivers that turn TX and RX checksumming
> > on and off together?  They are implicitly covered by the case which
> > you're removing:
> > [...]
> > > -	case ETHTOOL_GRXCSUM:
> > > -		rc = ethtool_get_value(dev, useraddr, ethcmd,
> > > -				       (dev->ethtool_ops->get_rx_csum ?
> > > -					dev->ethtool_ops->get_rx_csum :
> > > -					ethtool_op_get_rx_csum));
> > [...]
> I had an impression that drivers not defining get_rx_csum() had no
> RX checksumming or didn't allow it to be toggled. I'll have a closer look
> on this.

All drivers that have set_rx_csum() also define matching get_rx_csum().

Anyway, the code this is removing is broken in that it assumes TX csum
capability implying RX csum capability.

As a side note, GRO receive functions (TCPv4 and v6 - no other protocols
use it) are checking skb->ip_summed != CHECKSUM_NONE before allowing the
packet to be passed on, so it's not even necessary to force disabling GRO
when RX csum is off.

Best Regards,
Michał Mirosław

^ permalink raw reply

* [net-next-2.6 PATCH v5 1/2] net: implement mechanism for HW based QOS
From: John Fastabend @ 2011-01-04 18:56 UTC (permalink / raw)
  To: davem, jarkao2
  Cc: john.r.fastabend, hadi, shemminger, tgraf, eric.dumazet,
	bhutchings, nhorman, netdev

This patch provides a mechanism for lower layer devices to
steer traffic using skb->priority to tx queues. This allows
for hardware based QOS schemes to use the default qdisc without
incurring the penalties related to global state and the qdisc
lock. While reliably receiving skbs on the correct tx ring
to avoid head of line blocking resulting from shuffling in
the LLD. Finally, all the goodness from txq caching and xps/rps
can still be leveraged.

Many drivers and hardware exist with the ability to implement
QOS schemes in the hardware but currently these drivers tend
to rely on firmware to reroute specific traffic, a driver
specific select_queue or the queue_mapping action in the
qdisc.

By using select_queue for this drivers need to be updated for
each and every traffic type and we lose the goodness of much
of the upstream work. Firmware solutions are inherently
inflexible. And finally if admins are expected to build a
qdisc and filter rules to steer traffic this requires knowledge
of how the hardware is currently configured. The number of tx
queues and the queue offsets may change depending on resources.
Also this approach incurs all the overhead of a qdisc with filters.

With the mechanism in this patch users can set skb priority using
expected methods ie setsockopt() or the stack can set the priority
directly. Then the skb will be steered to the correct tx queues
aligned with hardware QOS traffic classes. In the normal case with
a single traffic class and all queues in this class everything
works as is until the LLD enables multiple tcs.

To steer the skb we mask out the lower 4 bits of the priority
and allow the hardware to configure upto 15 distinct classes
of traffic. This is expected to be sufficient for most applications
at any rate it is more then the 8021Q spec designates and is
equal to the number of prio bands currently implemented in
the default qdisc.

This in conjunction with a userspace application such as
lldpad can be used to implement 8021Q transmission selection
algorithms one of these algorithms being the extended transmission
selection algorithm currently being used for DCB.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---

 include/linux/netdevice.h |   62 +++++++++++++++++++++++++++++++++++++++++++++
 net/core/dev.c            |   10 +++++++
 2 files changed, 71 insertions(+), 1 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 0f6b1c9..ae51323 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -646,6 +646,14 @@ struct xps_dev_maps {
     (nr_cpu_ids * sizeof(struct xps_map *)))
 #endif /* CONFIG_XPS */

+#define TC_MAX_QUEUE	16
+#define TC_BITMASK	15
+/* HW offloaded queuing disciplines txq count and offset maps */
+struct netdev_tc_txq {
+	u16 count;
+	u16 offset;
+};
+
 /*
  * This structure defines the management hooks for network devices.
  * The following hooks can be defined; unless noted otherwise, they are
@@ -1146,6 +1154,9 @@ struct net_device {
 	/* Data Center Bridging netlink ops */
 	const struct dcbnl_rtnl_ops *dcbnl_ops;
 #endif
+	u8 num_tc;
+	struct netdev_tc_txq tc_to_txq[TC_MAX_QUEUE];
+	u8 prio_tc_map[TC_BITMASK+1];

 #if defined(CONFIG_FCOE) || defined(CONFIG_FCOE_MODULE)
 	/* max exchange id for FCoE LRO by ddp */
@@ -1162,6 +1173,57 @@ struct net_device {
 #define	NETDEV_ALIGN		32

 static inline
+int netdev_get_prio_tc_map(const struct net_device *dev, u32 prio)
+{
+	return dev->prio_tc_map[prio & TC_BITMASK];
+}
+
+static inline
+int netdev_set_prio_tc_map(struct net_device *dev, u8 prio, u8 tc)
+{
+	if (tc >= dev->num_tc)
+		return -EINVAL;
+
+	dev->prio_tc_map[prio & TC_BITMASK] = tc & TC_BITMASK;
+	return 0;
+}
+
+static inline
+void netdev_reset_tc(struct net_device *dev)
+{
+	dev->num_tc = 0;
+	memset(dev->tc_to_txq, 0, sizeof(dev->tc_to_txq));
+	memset(dev->prio_tc_map, 0, sizeof(dev->prio_tc_map));
+}
+
+static inline
+int netdev_set_tc_queue(struct net_device *dev, u8 tc, u16 count, u16 offset)
+{
+	if (tc >= dev->num_tc)
+		return -EINVAL;
+
+	dev->tc_to_txq[tc].count = count;
+	dev->tc_to_txq[tc].offset = offset;
+	return 0;
+}
+
+static inline
+int netdev_set_num_tc(struct net_device *dev, u8 num_tc)
+{
+	if (num_tc > TC_MAX_QUEUE)
+		return -EINVAL;
+
+	dev->num_tc = num_tc;
+	return 0;
+}
+
+static inline
+u8 netdev_get_num_tc(const struct net_device *dev)
+{
+	return dev->num_tc;
+}
+
+static inline
 struct netdev_queue *netdev_get_tx_queue(const struct net_device *dev,
 					 unsigned int index)
 {
diff --git a/net/core/dev.c b/net/core/dev.c
index a215269..38318e2 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2165,6 +2165,8 @@ u16 __skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb,
 		  unsigned int num_tx_queues)
 {
 	u32 hash;
+	u16 qoffset = 0;
+	u16 qcount = num_tx_queues;

 	if (skb_rx_queue_recorded(skb)) {
 		hash = skb_get_rx_queue(skb);
@@ -2173,13 +2175,19 @@ u16 __skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb,
 		return hash;
 	}

+	if (dev->num_tc) {
+		u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
+		qoffset = dev->tc_to_txq[tc].offset;
+		qcount = dev->tc_to_txq[tc].count;
+	}
+
 	if (skb->sk && skb->sk->sk_hash)
 		hash = skb->sk->sk_hash;
 	else
 		hash = (__force u16) skb->protocol ^ skb->rxhash;
 	hash = jhash_1word(hash, hashrnd);

-	return (u16) (((u64) hash * num_tx_queues) >> 32);
+	return (u16) (((u64) hash * qcount) >> 32) + qoffset;
 }
 EXPORT_SYMBOL(__skb_tx_hash);

^ permalink raw reply related

* [net-next-2.6 PATCH v5 2/2] net_sched: implement a root container qdisc sch_mqprio
From: John Fastabend @ 2011-01-04 18:56 UTC (permalink / raw)
  To: davem, jarkao2
  Cc: john.r.fastabend, hadi, shemminger, tgraf, eric.dumazet,
	bhutchings, nhorman, netdev
In-Reply-To: <20110104185600.13692.47967.stgit@jf-dev1-dcblab>

This implements a mqprio queueing discipline that by default creates
a pfifo_fast qdisc per tx queue and provides the needed configuration
interface.

Using the mqprio qdisc the number of tcs currently in use along
with the range of queues alloted to each class can be configured. By
default skbs are mapped to traffic classes using the skb priority.
This mapping is configurable.

Configurable parameters,

struct tc_mqprio_qopt {
        __u8    num_tc;
        __u8    prio_tc_map[TC_BITMASK + 1];
        __u8    hw;
        __u16   count[TC_MAX_QUEUE];
        __u16   offset[TC_MAX_QUEUE];
};

Here the count/offset pairing give the queue alignment and the
prio_tc_map gives the mapping from skb->priority to tc.

The hw bit determines if the hardware should configure the count
and offset values. If the hardware bit is set then the operation
will fail if the hardware does not implement the ndo_setup_tc
operation. This is to avoid undetermined states where the hardware
may or may not control the queue mapping. Also minimal bounds
checking is done on the count/offset to verify a queue does not
exceed num_tx_queues and that queue ranges do not overlap. Otherwise
it is left to user policy or hardware configuration to create
useful mappings.

It is expected that hardware QOS schemes can be implemented by
creating appropriate mappings of queues in ndo_tc_setup().

One expected use case is drivers will use the ndo_setup_tc to map
queue ranges onto 802.1Q traffic classes. This provides a generic
mechanism to map network traffic onto these traffic classes and
removes the need for lower layer drivers to know specifics about
traffic types.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---

 include/linux/netdevice.h |    3 
 include/linux/pkt_sched.h |   10 +
 net/sched/Kconfig         |   12 +
 net/sched/Makefile        |    1 
 net/sched/sch_generic.c   |    4 
 net/sched/sch_mqprio.c    |  413 +++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 443 insertions(+), 0 deletions(-)
 create mode 100644 net/sched/sch_mqprio.c

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ae51323..19a855b 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -764,6 +764,8 @@ struct netdev_tc_txq {
  * int (*ndo_set_vf_port)(struct net_device *dev, int vf,
  *			  struct nlattr *port[]);
  * int (*ndo_get_vf_port)(struct net_device *dev, int vf, struct sk_buff *skb);
+ *
+ * int (*ndo_setup_tc)(struct net_device *dev, int tc);
  */
 #define HAVE_NET_DEVICE_OPS
 struct net_device_ops {
@@ -822,6 +824,7 @@ struct net_device_ops {
 						   struct nlattr *port[]);
 	int			(*ndo_get_vf_port)(struct net_device *dev,
 						   int vf, struct sk_buff *skb);
+	int			(*ndo_setup_tc)(struct net_device *dev, u8 tc);
 #if defined(CONFIG_FCOE) || defined(CONFIG_FCOE_MODULE)
 	int			(*ndo_fcoe_enable)(struct net_device *dev);
 	int			(*ndo_fcoe_disable)(struct net_device *dev);
diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index 2cfa4bc..1c5310a 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -2,6 +2,7 @@
 #define __LINUX_PKT_SCHED_H
 
 #include <linux/types.h>
+#include <linux/netdevice.h>
 
 /* Logical priority bands not depending on specific packet scheduler.
    Every scheduler will map them to real traffic classes, if it has
@@ -481,4 +482,13 @@ struct tc_drr_stats {
 	__u32	deficit;
 };
 
+/* MQPRIO */
+struct tc_mqprio_qopt {
+	__u8	num_tc;
+	__u8	prio_tc_map[TC_BITMASK + 1];
+	__u8	hw;
+	__u16	count[TC_MAX_QUEUE];
+	__u16	offset[TC_MAX_QUEUE];
+};
+
 #endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index a36270a..f52f5eb 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -205,6 +205,18 @@ config NET_SCH_DRR
 
 	  If unsure, say N.
 
+config NET_SCH_MQPRIO
+	tristate "Multi-queue priority scheduler (MQPRIO)"
+	help
+	  Say Y here if you want to use the Multi-queue Priority scheduler.
+	  This scheduler allows QOS to be offloaded on NICs that have support
+	  for offloading QOS schedulers.
+
+	  To compile this driver as a module, choose M here: the module will
+	  be called sch_mqprio.
+
+	  If unsure, say N.
+
 config NET_SCH_INGRESS
 	tristate "Ingress Qdisc"
 	depends on NET_CLS_ACT
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 960f5db..26ce681 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -32,6 +32,7 @@ obj-$(CONFIG_NET_SCH_MULTIQ)	+= sch_multiq.o
 obj-$(CONFIG_NET_SCH_ATM)	+= sch_atm.o
 obj-$(CONFIG_NET_SCH_NETEM)	+= sch_netem.o
 obj-$(CONFIG_NET_SCH_DRR)	+= sch_drr.o
+obj-$(CONFIG_NET_SCH_MQPRIO)	+= sch_mqprio.o
 obj-$(CONFIG_NET_CLS_U32)	+= cls_u32.o
 obj-$(CONFIG_NET_CLS_ROUTE4)	+= cls_route.o
 obj-$(CONFIG_NET_CLS_FW)	+= cls_fw.o
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 34dc598..723b278 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -540,6 +540,7 @@ struct Qdisc_ops pfifo_fast_ops __read_mostly = {
 	.dump		=	pfifo_fast_dump,
 	.owner		=	THIS_MODULE,
 };
+EXPORT_SYMBOL(pfifo_fast_ops);
 
 struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue,
 			  struct Qdisc_ops *ops)
@@ -674,6 +675,7 @@ struct Qdisc *dev_graft_qdisc(struct netdev_queue *dev_queue,
 
 	return oqdisc;
 }
+EXPORT_SYMBOL(dev_graft_qdisc);
 
 static void attach_one_default_qdisc(struct net_device *dev,
 				     struct netdev_queue *dev_queue,
@@ -761,6 +763,7 @@ void dev_activate(struct net_device *dev)
 		dev_watchdog_up(dev);
 	}
 }
+EXPORT_SYMBOL(dev_activate);
 
 static void dev_deactivate_queue(struct net_device *dev,
 				 struct netdev_queue *dev_queue,
@@ -840,6 +843,7 @@ void dev_deactivate(struct net_device *dev)
 	list_add(&dev->unreg_list, &single);
 	dev_deactivate_many(&single);
 }
+EXPORT_SYMBOL(dev_deactivate);
 
 static void dev_init_scheduler_queue(struct net_device *dev,
 				     struct netdev_queue *dev_queue,
diff --git a/net/sched/sch_mqprio.c b/net/sched/sch_mqprio.c
new file mode 100644
index 0000000..b16dc2c
--- /dev/null
+++ b/net/sched/sch_mqprio.c
@@ -0,0 +1,413 @@
+/*
+ * net/sched/sch_mqprio.c
+ *
+ * Copyright (c) 2010 John Fastabend <john.r.fastabend@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * version 2 as published by the Free Software Foundation.
+ */
+
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/skbuff.h>
+#include <net/netlink.h>
+#include <net/pkt_sched.h>
+#include <net/sch_generic.h>
+
+struct mqprio_sched {
+	struct Qdisc		**qdiscs;
+	int hw_owned;
+};
+
+static void mqprio_destroy(struct Qdisc *sch)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	struct mqprio_sched *priv = qdisc_priv(sch);
+	unsigned int ntx;
+
+	if (!priv->qdiscs)
+		return;
+
+	for (ntx = 0; ntx < dev->num_tx_queues && priv->qdiscs[ntx]; ntx++)
+		qdisc_destroy(priv->qdiscs[ntx]);
+
+	if (priv->hw_owned && dev->netdev_ops->ndo_setup_tc)
+		dev->netdev_ops->ndo_setup_tc(dev, 0);
+	else
+		netdev_set_num_tc(dev, 0);
+
+	kfree(priv->qdiscs);
+}
+
+static int mqprio_parse_opt(struct net_device *dev, struct tc_mqprio_qopt *qopt)
+{
+	int i, j;
+
+	/* Verify num_tc is not out of max range */
+	if (qopt->num_tc > TC_MAX_QUEUE)
+		return -EINVAL;
+
+	for (i = 0; i < qopt->num_tc; i++) {
+		unsigned int last = qopt->offset[i] + qopt->count[i];
+		/* Verify the queue offset is in the num tx range */
+		if (qopt->offset[i] >= dev->num_tx_queues)
+			return -EINVAL;
+		/* Verify the queue count is in tx range being equal to the
+		 * num_tx_queues indicates the last queue is in use.
+		 */
+		else if (last > dev->num_tx_queues)
+			return -EINVAL;
+
+		/* Verify that the offset and counts do not overlap */
+		for (j = i + 1; j < qopt->num_tc; j++) {
+			if (last > qopt->offset[j])
+				return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
+static int mqprio_init(struct Qdisc *sch, struct nlattr *opt)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	struct mqprio_sched *priv = qdisc_priv(sch);
+	struct netdev_queue *dev_queue;
+	struct Qdisc *qdisc;
+	int i, err = -EOPNOTSUPP;
+	struct tc_mqprio_qopt *qopt = NULL;
+
+	/* Unwind attributes on failure */
+	u8 unwnd_tc = dev->num_tc;
+	u8 unwnd_map[TC_BITMASK + 1];
+	struct netdev_tc_txq unwnd_txq[TC_MAX_QUEUE];
+
+	if (sch->parent != TC_H_ROOT)
+		return -EOPNOTSUPP;
+
+	if (!netif_is_multiqueue(dev))
+		return -EOPNOTSUPP;
+
+	if (nla_len(opt) < sizeof(*qopt))
+		return -EINVAL;
+	qopt = nla_data(opt);
+
+	memcpy(unwnd_map, dev->prio_tc_map, sizeof(unwnd_map));
+	memcpy(unwnd_txq, dev->tc_to_txq, sizeof(unwnd_txq));
+
+	/* If the mqprio options indicate that hardware should own
+	 * the queue mapping then run ndo_setup_tc if this can not
+	 * be done fail immediately.
+	 */
+	if (qopt->hw && dev->netdev_ops->ndo_setup_tc) {
+		priv->hw_owned = 1;
+		err = dev->netdev_ops->ndo_setup_tc(dev, qopt->num_tc);
+		if (err)
+			return err;
+	} else if (!qopt->hw) {
+		if (mqprio_parse_opt(dev, qopt))
+			return -EINVAL;
+
+		if (netdev_set_num_tc(dev, qopt->num_tc))
+			return -EINVAL;
+
+		for (i = 0; i < qopt->num_tc; i++)
+			netdev_set_tc_queue(dev, i,
+					    qopt->count[i], qopt->offset[i]);
+	} else {
+		return -EINVAL;
+	}
+
+	/* Always use supplied priority mappings */
+	for (i = 0; i < TC_BITMASK + 1; i++) {
+		if (netdev_set_prio_tc_map(dev, i, qopt->prio_tc_map[i])) {
+			err = -EINVAL;
+			goto tc_err;
+		}
+	}
+
+	/* pre-allocate qdisc, attachment can't fail */
+	priv->qdiscs = kcalloc(dev->num_tx_queues, sizeof(priv->qdiscs[0]),
+			       GFP_KERNEL);
+	if (priv->qdiscs == NULL) {
+		err = -ENOMEM;
+		goto tc_err;
+	}
+
+	for (i = 0; i < dev->num_tx_queues; i++) {
+		dev_queue = netdev_get_tx_queue(dev, i);
+		qdisc = qdisc_create_dflt(dev_queue, &pfifo_fast_ops,
+					  TC_H_MAKE(TC_H_MAJ(sch->handle),
+						    TC_H_MIN(i + 1)));
+		if (qdisc == NULL) {
+			err = -ENOMEM;
+			goto err;
+		}
+		qdisc->flags |= TCQ_F_CAN_BYPASS;
+		priv->qdiscs[i] = qdisc;
+	}
+
+	sch->flags |= TCQ_F_MQROOT;
+	return 0;
+
+err:
+	mqprio_destroy(sch);
+tc_err:
+	if (priv->hw_owned)
+		dev->netdev_ops->ndo_setup_tc(dev, unwnd_tc);
+	else
+		netdev_set_num_tc(dev, unwnd_tc);
+
+	memcpy(dev->prio_tc_map, unwnd_map, sizeof(unwnd_map));
+	memcpy(dev->tc_to_txq, unwnd_txq, sizeof(unwnd_txq));
+
+	return err;
+}
+
+static void mqprio_attach(struct Qdisc *sch)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	struct mqprio_sched *priv = qdisc_priv(sch);
+	struct Qdisc *qdisc;
+	unsigned int ntx;
+
+	/* Attach underlying qdisc */
+	for (ntx = 0; ntx < dev->num_tx_queues; ntx++) {
+		qdisc = priv->qdiscs[ntx];
+		qdisc = dev_graft_qdisc(qdisc->dev_queue, qdisc);
+		if (qdisc)
+			qdisc_destroy(qdisc);
+	}
+	kfree(priv->qdiscs);
+	priv->qdiscs = NULL;
+}
+
+static struct netdev_queue *mqprio_queue_get(struct Qdisc *sch,
+					     unsigned long cl)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	unsigned long ntx = cl - 1 - netdev_get_num_tc(dev);
+
+	if (ntx >= dev->num_tx_queues)
+		return NULL;
+	return netdev_get_tx_queue(dev, ntx);
+}
+
+static int mqprio_graft(struct Qdisc *sch, unsigned long cl, struct Qdisc *new,
+		    struct Qdisc **old)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	struct netdev_queue *dev_queue = mqprio_queue_get(sch, cl);
+
+	if (dev->flags & IFF_UP)
+		dev_deactivate(dev);
+
+	*old = dev_graft_qdisc(dev_queue, new);
+
+	if (dev->flags & IFF_UP)
+		dev_activate(dev);
+
+	return 0;
+}
+
+static int mqprio_dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	struct mqprio_sched *priv = qdisc_priv(sch);
+	unsigned char *b = skb_tail_pointer(skb);
+	struct tc_mqprio_qopt opt;
+	struct Qdisc *qdisc;
+	unsigned int i;
+
+	sch->q.qlen = 0;
+	memset(&sch->bstats, 0, sizeof(sch->bstats));
+	memset(&sch->qstats, 0, sizeof(sch->qstats));
+
+	for (i = 0; i < dev->num_tx_queues; i++) {
+		qdisc = netdev_get_tx_queue(dev, i)->qdisc;
+		spin_lock_bh(qdisc_lock(qdisc));
+		sch->q.qlen		+= qdisc->q.qlen;
+		sch->bstats.bytes	+= qdisc->bstats.bytes;
+		sch->bstats.packets	+= qdisc->bstats.packets;
+		sch->qstats.qlen	+= qdisc->qstats.qlen;
+		sch->qstats.backlog	+= qdisc->qstats.backlog;
+		sch->qstats.drops	+= qdisc->qstats.drops;
+		sch->qstats.requeues	+= qdisc->qstats.requeues;
+		sch->qstats.overlimits	+= qdisc->qstats.overlimits;
+		spin_unlock_bh(qdisc_lock(qdisc));
+	}
+
+	opt.num_tc = dev->num_tc;
+	memcpy(opt.prio_tc_map, dev->prio_tc_map, sizeof(opt.prio_tc_map));
+	opt.hw = priv->hw_owned;
+
+	for (i = 0; i < dev->num_tc; i++) {
+		opt.count[i] = dev->tc_to_txq[i].count;
+		opt.offset[i] = dev->tc_to_txq[i].offset;
+	}
+
+	NLA_PUT(skb, TCA_OPTIONS, sizeof(opt), &opt);
+
+	return skb->len;
+nla_put_failure:
+	nlmsg_trim(skb, b);
+	return -1;
+}
+
+static struct Qdisc *mqprio_leaf(struct Qdisc *sch, unsigned long cl)
+{
+	struct netdev_queue *dev_queue = mqprio_queue_get(sch, cl);
+
+	return dev_queue->qdisc_sleeping;
+}
+
+static unsigned long mqprio_get(struct Qdisc *sch, u32 classid)
+{
+	unsigned int ntx = TC_H_MIN(classid);
+
+	if (!mqprio_queue_get(sch, ntx))
+		return 0;
+	return ntx;
+}
+
+static void mqprio_put(struct Qdisc *sch, unsigned long cl)
+{
+}
+
+static int mqprio_dump_class(struct Qdisc *sch, unsigned long cl,
+			 struct sk_buff *skb, struct tcmsg *tcm)
+{
+	struct net_device *dev = qdisc_dev(sch);
+
+	if (cl <= dev->num_tc) {
+		tcm->tcm_parent = TC_H_ROOT;
+		tcm->tcm_info = 0;
+	} else {
+		int i;
+		struct netdev_queue *dev_queue;
+		dev_queue = mqprio_queue_get(sch, cl);
+
+		tcm->tcm_parent = 0;
+		for (i = 0; i < netdev_get_num_tc(dev); i++) {
+			struct netdev_tc_txq tc = dev->tc_to_txq[i];
+			int q_idx = cl - dev->num_tc;
+			if (q_idx >= tc.offset &&
+			    q_idx < tc.offset + tc.count) {
+				tcm->tcm_parent =
+					TC_H_MAKE(TC_H_MAJ(sch->handle),
+						  TC_H_MIN(i + 1));
+				break;
+			}
+		}
+		tcm->tcm_info = dev_queue->qdisc_sleeping->handle;
+	}
+	tcm->tcm_handle |= TC_H_MIN(cl);
+	return 0;
+}
+
+static int mqprio_dump_class_stats(struct Qdisc *sch, unsigned long cl,
+			       struct gnet_dump *d)
+{
+	struct net_device *dev = qdisc_dev(sch);
+
+	if (cl <= netdev_get_num_tc(dev)) {
+		int i;
+		struct Qdisc *qdisc;
+		struct gnet_stats_queue qstats = {0};
+		struct gnet_stats_basic_packed bstats = {0};
+		struct netdev_tc_txq tc = dev->tc_to_txq[cl - 1];
+
+		/* Drop lock here it will be reclaimed before touching
+		 * statistics this is required because the d->lock we
+		 * hold here is the look on dev_queue->qdisc_sleeping
+		 * also acquired below.
+		 */
+		spin_unlock_bh(d->lock);
+
+		for (i = tc.offset; i < tc.offset + tc.count; i++) {
+			qdisc = netdev_get_tx_queue(dev, i)->qdisc;
+			spin_lock_bh(qdisc_lock(qdisc));
+			bstats.bytes      += qdisc->bstats.bytes;
+			bstats.packets    += qdisc->bstats.packets;
+			qstats.qlen       += qdisc->qstats.qlen;
+			qstats.backlog    += qdisc->qstats.backlog;
+			qstats.drops      += qdisc->qstats.drops;
+			qstats.requeues   += qdisc->qstats.requeues;
+			qstats.overlimits += qdisc->qstats.overlimits;
+			spin_unlock_bh(qdisc_lock(qdisc));
+		}
+		/* Reclaim root sleeping lock before completing stats */
+		spin_lock_bh(d->lock);
+		if (gnet_stats_copy_basic(d, &bstats) < 0 ||
+		    gnet_stats_copy_queue(d, &qstats) < 0)
+			return -1;
+	} else {
+		struct netdev_queue *dev_queue = mqprio_queue_get(sch, cl);
+		sch = dev_queue->qdisc_sleeping;
+		sch->qstats.qlen = sch->q.qlen;
+		if (gnet_stats_copy_basic(d, &sch->bstats) < 0 ||
+		    gnet_stats_copy_queue(d, &sch->qstats) < 0)
+			return -1;
+	}
+	return 0;
+}
+
+static void mqprio_walk(struct Qdisc *sch, struct qdisc_walker *arg)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	unsigned long ntx;
+	u8 num_tc = netdev_get_num_tc(dev);
+
+	if (arg->stop)
+		return;
+
+	/* Walk hierarchy with a virtual class per tc */
+	arg->count = arg->skip;
+	for (ntx = arg->skip; ntx < dev->num_tx_queues + num_tc; ntx++) {
+		if (arg->fn(sch, ntx + 1, arg) < 0) {
+			arg->stop = 1;
+			break;
+		}
+		arg->count++;
+	}
+}
+
+static const struct Qdisc_class_ops mqprio_class_ops = {
+	.graft		= mqprio_graft,
+	.leaf		= mqprio_leaf,
+	.get		= mqprio_get,
+	.put		= mqprio_put,
+	.walk		= mqprio_walk,
+	.dump		= mqprio_dump_class,
+	.dump_stats	= mqprio_dump_class_stats,
+};
+
+struct Qdisc_ops mqprio_qdisc_ops __read_mostly = {
+	.cl_ops		= &mqprio_class_ops,
+	.id		= "mqprio",
+	.priv_size	= sizeof(struct mqprio_sched),
+	.init		= mqprio_init,
+	.destroy	= mqprio_destroy,
+	.attach		= mqprio_attach,
+	.dump		= mqprio_dump,
+	.owner		= THIS_MODULE,
+};
+
+static int __init mqprio_module_init(void)
+{
+	return register_qdisc(&mqprio_qdisc_ops);
+}
+
+static void __exit mqprio_module_exit(void)
+{
+	unregister_qdisc(&mqprio_qdisc_ops);
+}
+
+module_init(mqprio_module_init);
+module_exit(mqprio_module_exit);
+
+MODULE_LICENSE("GPL");


^ permalink raw reply related

* [PATCH v2] net: typos in comments in include/linux/igmp.h
From: Francois-Xavier Le Bail @ 2011-01-04 19:10 UTC (permalink / raw)
  To: netdev

There are typos in comments in include/linux/igmp.h:

83 #define IGMP_HOST_MEMBERSHIP_QUERY      0x11    /* From RFC1112 */
84 #define IGMP_HOST_MEMBERSHIP_REPORT     0x12    /* Ditto */
[snip]
88 #define IGMPV2_HOST_MEMBERSHIP_REPORT   0x16    /* V2 version of 0x11 */
89 #define IGMP_HOST_LEAVE_MESSAGE         0x17
90 #define IGMPV3_HOST_MEMBERSHIP_REPORT   0x22    /* V3 version of 0x11 */

The line 88 and 90 are about REPORT messages.
The IGMP_HOST_MEMBERSHIP_REPORT (IGMP V1) value is 0x12.
So the comment on line 88 must be /* V2 version of 0x12 */,
and the comment on line 90 must be /* V3 version of 0x12 */.

Signed-off-by: Francois-Xavier Le Bail <fx.lebail@orange.fr>

---

diff -ru a/include/linux/igmp.h b/include/linux/igmp.h
--- a/include/linux/igmp.h	2010-08-27 01:47:12.000000000 +0200
+++ b/include/linux/igmp.h	2010-12-15 09:50:47.808363144 +0100
@@ -85,9 +85,9 @@
 #define IGMP_DVMRP			0x13	/* DVMRP routing */
 #define IGMP_PIM			0x14	/* PIM routing */
 #define IGMP_TRACE			0x15
-#define IGMPV2_HOST_MEMBERSHIP_REPORT	0x16	/* V2 version of 0x11 */
+#define IGMPV2_HOST_MEMBERSHIP_REPORT	0x16	/* V2 version of 0x12 */
 #define IGMP_HOST_LEAVE_MESSAGE 	0x17
-#define IGMPV3_HOST_MEMBERSHIP_REPORT	0x22	/* V3 version of 0x11 */
+#define IGMPV3_HOST_MEMBERSHIP_REPORT	0x22	/* V3 version of 0x12 */
 
 #define IGMP_MTRACE_RESP		0x1e
 #define IGMP_MTRACE			0x1f

^ permalink raw reply

* Re: [PATCH v2] net: typos in comments in include/linux/igmp.h
From: David Miller @ 2011-01-04 19:30 UTC (permalink / raw)
  To: fx.lebail; +Cc: netdev
In-Reply-To: <4D23709C.1040408@orange.fr>

From: Francois-Xavier Le Bail <fx.lebail@orange.fr>
Date: Tue, 04 Jan 2011 20:10:20 +0100

> There are typos in comments in include/linux/igmp.h:
> 
> 83 #define IGMP_HOST_MEMBERSHIP_QUERY      0x11    /* From RFC1112 */
> 84 #define IGMP_HOST_MEMBERSHIP_REPORT     0x12    /* Ditto */
> [snip]
> 88 #define IGMPV2_HOST_MEMBERSHIP_REPORT   0x16    /* V2 version of 0x11 */
> 89 #define IGMP_HOST_LEAVE_MESSAGE         0x17
> 90 #define IGMPV3_HOST_MEMBERSHIP_REPORT   0x22    /* V3 version of 0x11 */
> 
> The line 88 and 90 are about REPORT messages.
> The IGMP_HOST_MEMBERSHIP_REPORT (IGMP V1) value is 0x12.
> So the comment on line 88 must be /* V2 version of 0x12 */,
> and the comment on line 90 must be /* V3 version of 0x12 */.
> 
> Signed-off-by: Francois-Xavier Le Bail <fx.lebail@orange.fr>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH] ipv4/route.c: respect prefsrc for local routes
From: David Miller @ 2011-01-04 19:35 UTC (permalink / raw)
  To: jsing; +Cc: netdev
In-Reply-To: <1294122260-13245-1-git-send-email-jsing@google.com>

From: Joel Sing <jsing@google.com>
Date: Tue,  4 Jan 2011 17:24:20 +1100

> The preferred source address is currently ignored for local routes,
> which results in all local connections having a src address that is the
> same as the local dst address. Fix this by respecting the preferred source
> address when it is provided for local routes.
> 
> This bug can be demonstrated as follows:
> 
>  # ifconfig dummy0 192.168.0.1
>  # ip route show table local | grep local.*dummy0
>  local 192.168.0.1 dev dummy0  proto kernel  scope host  src 192.168.0.1
>  # ip route change table local local 192.168.0.1 dev dummy0 \
>      proto kernel scope host src 127.0.0.1
>  # ip route show table local | grep local.*dummy0
>  local 192.168.0.1 dev dummy0  proto kernel  scope host  src 127.0.0.1
> 
> We now establish a local connection and verify the source IP
> address selection:
> 
>  # nc -l 192.168.0.1 3128 &
>  # nc 192.168.0.1 3128 &
>  # netstat -ant | grep 192.168.0.1:3128.*EST
>  tcp        0      0 192.168.0.1:3128        192.168.0.1:33228 ESTABLISHED
>  tcp        0      0 192.168.0.1:33228       192.168.0.1:3128  ESTABLISHED
> 
> Signed-off-by: Joel Sing <jsing@google.com>

Applied to net-2.6, thanks Joel.

If you guys want to mess with ternary operators and new macros,
please do that in net-next-2.6 the next time I merge or similar.

Thanks.

^ permalink raw reply

* [PATCH 0/2] IRQ affinity reverse-mapping
From: Ben Hutchings @ 2011-01-04 19:37 UTC (permalink / raw)
  To: Thomas Gleixner, David Miller
  Cc: Tom Herbert, linux-kernel, netdev, linux-net-drivers

This patch series is intended to support queue selection on multiqueue
IRQ-per-queue network devices (accelerated RFS and XPS-MQ) and
potentially queue selection for other classes of multiqueue device.

The first patch implements IRQ affinity notifiers, based on the outline
that Thomas wrote in response to my earlier patch series for accelerated RFS.

The second patch is a generalisation of the CPU affinity reverse-
mapping, plus functions to maintain such a mapping based on the new IRQ
affinity notifiers.

I would like to be able to use this functionality in networking for
2.6.38.  Thomas, if you are happy with this, could these changes go
through net-next-2.6?  Alternately, if Linus pulls from linux-2.6-tip
and David pulls from Linus during the merge window, I can (re-)submit
the dependent changes after that.

Ben.

Ben Hutchings (2):
  genirq: Add IRQ affinity notifiers
  lib: cpu_rmap: CPU affinity reverse-mapping

 include/linux/cpu_rmap.h  |   73 +++++++++++++
 include/linux/interrupt.h |   41 +++++++
 include/linux/irqdesc.h   |    3 +
 kernel/irq/manage.c       |   81 ++++++++++++++
 lib/Kconfig               |    4 +
 lib/Makefile              |    2 +
 lib/cpu_rmap.c            |  262 +++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 466 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/cpu_rmap.h
 create mode 100644 lib/cpu_rmap.c

-- 
1.7.3.4

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [net-next-2.6 #2 00/08] r8169 driver updates
From: David Miller @ 2011-01-04 19:38 UTC (permalink / raw)
  To: romieu; +Cc: netdev, hayeswang, benh
In-Reply-To: <20110104010716.GA5957@electric-eye.fr.zoreil.com>

From: Francois Romieu <romieu@fr.zoreil.com>
Date: Tue, 4 Jan 2011 02:07:16 +0100

> The following series adds on top of the previous one :
> - missing TxPacketMax and FW_LOADER (Ben Hutchings)
> - wrong OCPDR_GPHY_REG offset and mask.
>   Hayes, are you ok with this one ?
> 
> The RTL_GIGA_MAC_VER_27 changes suggested by Hayes at the end of the
> previous series are not included yet (feedback anyone ?).
> 
> The previous series included :
> - Hayes'firmware removal. The firmware is already included in linux-firmware.
> - my sauce to diminish the gap between the in-kernel r8169 driver
>   and Realtek's one(s).
> 
> The series is available as:
> git://git.kernel.org/pub/scm/linux/kernel/git/romieu/netdev-2.6.git r8169-davem

I applied all of these patches, thanks Francois.

Because I am an idiot, I didn't notice you had a GIT tree with this
stuff prepared for me until after I pushed the patches out to
kernel.org :-/

Don't let this discourage you, using a GIT tree helps me a lot and I
just didn't expect it from you for whatever reason.  :-) So please
keep doing this and I will pull directly from your GIT tree next time.

^ permalink raw reply

* [PATCH 1/2] genirq: Add IRQ affinity notifiers
From: Ben Hutchings @ 2011-01-04 19:38 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: David Miller, Tom Herbert, linux-kernel, netdev,
	linux-net-drivers
In-Reply-To: <1294169842.3636.31.camel@bwh-desktop>

When initiating I/O on a multiqueue and multi-IRQ device, we may want
to select a queue for which the response will be handled on the same
or a nearby CPU.  This requires a reverse-map of IRQ affinity.  Add a
notification mechanism to support this.

This is based closely on work by Thomas Gleixner <tglx@linutronix.de>.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 include/linux/interrupt.h |   41 +++++++++++++++++++++++
 include/linux/irqdesc.h   |    3 ++
 kernel/irq/manage.c       |   81 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 125 insertions(+), 0 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 55e0d42..09d6039 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -14,6 +14,8 @@
 #include <linux/smp.h>
 #include <linux/percpu.h>
 #include <linux/hrtimer.h>
+#include <linux/kref.h>
+#include <linux/workqueue.h>
 
 #include <asm/atomic.h>
 #include <asm/ptrace.h>
@@ -231,6 +233,28 @@ static inline void resume_device_irqs(void) { };
 static inline int check_wakeup_irqs(void) { return 0; }
 #endif
 
+/**
+ * struct irq_affinity_notify - context for notification of IRQ affinity changes
+ * @irq:		Interrupt to which notification applies
+ * @kref:		Reference count, for internal use
+ * @work:		Work item, for internal use
+ * @notify:		Function to be called on change.  This will be
+ *			called in process context.
+ * @release:		Function to be called on release.  This will be
+ *			called in process context.  Once registered, the
+ *			structure must only be freed when this function is
+ *			called or later.
+ */
+struct irq_affinity_notify {
+        unsigned int irq;
+        struct kref kref;
+#if defined(CONFIG_SMP) && defined(CONFIG_GENERIC_HARDIRQS)
+        struct work_struct work;
+#endif
+        void (*notify)(struct irq_affinity_notify *, const cpumask_t *mask);
+        void (*release)(struct kref *ref);
+};
+
 #if defined(CONFIG_SMP) && defined(CONFIG_GENERIC_HARDIRQS)
 
 extern cpumask_var_t irq_default_affinity;
@@ -240,6 +264,13 @@ extern int irq_can_set_affinity(unsigned int irq);
 extern int irq_select_affinity(unsigned int irq);
 
 extern int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m);
+extern int
+irq_set_affinity_notifier(unsigned int irq, struct irq_affinity_notify *notify);
+
+static inline void irq_run_affinity_notifiers(void)
+{
+	flush_scheduled_work();
+}
 #else /* CONFIG_SMP */
 
 static inline int irq_set_affinity(unsigned int irq, const struct cpumask *m)
@@ -259,6 +290,16 @@ static inline int irq_set_affinity_hint(unsigned int irq,
 {
 	return -EINVAL;
 }
+
+static inline int
+irq_set_affinity_notifier(unsigned int irq, struct irq_affinity_notify *notify)
+{
+	return 0;
+}
+
+static inline void irq_run_affinity_notifiers(void)
+{
+}
 #endif /* CONFIG_SMP && CONFIG_GENERIC_HARDIRQS */
 
 #ifdef CONFIG_GENERIC_HARDIRQS
diff --git a/include/linux/irqdesc.h b/include/linux/irqdesc.h
index 979c68c..5e0d2e4 100644
--- a/include/linux/irqdesc.h
+++ b/include/linux/irqdesc.h
@@ -8,6 +8,7 @@
  * For now it's included from <linux/irq.h>
  */
 
+struct irq_affinity_notify;
 struct proc_dir_entry;
 struct timer_rand_state;
 /**
@@ -24,6 +25,7 @@ struct timer_rand_state;
  * @last_unhandled:	aging timer for unhandled count
  * @irqs_unhandled:	stats field for spurious unhandled interrupts
  * @lock:		locking for SMP
+ * @affinity_notify:	context for notification of affinity changes
  * @pending_mask:	pending rebalanced interrupts
  * @threads_active:	number of irqaction threads currently running
  * @wait_for_threads:	wait queue for sync_irq to wait for threaded handlers
@@ -70,6 +72,7 @@ struct irq_desc {
 	raw_spinlock_t		lock;
 #ifdef CONFIG_SMP
 	const struct cpumask	*affinity_hint;
+	struct irq_affinity_notify *affinity_notify;
 #ifdef CONFIG_GENERIC_PENDING_IRQ
 	cpumask_var_t		pending_mask;
 #endif
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 91a5fa2..fb6525a 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -134,6 +134,10 @@ int irq_set_affinity(unsigned int irq, const struct cpumask *cpumask)
 		irq_set_thread_affinity(desc);
 	}
 #endif
+	if (desc->affinity_notify) {
+		kref_get(&desc->affinity_notify->kref);
+		schedule_work(&desc->affinity_notify->work);
+	}
 	desc->status |= IRQ_AFFINITY_SET;
 	raw_spin_unlock_irqrestore(&desc->lock, flags);
 	return 0;
@@ -155,6 +159,79 @@ int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m)
 }
 EXPORT_SYMBOL_GPL(irq_set_affinity_hint);
 
+static void irq_affinity_notify(struct work_struct *work)
+{
+	struct irq_affinity_notify *notify =
+		container_of(work, struct irq_affinity_notify, work);
+	struct irq_desc *desc = irq_to_desc(notify->irq);
+	cpumask_var_t cpumask;
+	unsigned long flags;
+
+	if (!desc)
+		goto out;
+
+	if (!alloc_cpumask_var(&cpumask, GFP_KERNEL))
+		goto out;
+
+	raw_spin_lock_irqsave(&desc->lock, flags);
+#ifdef CONFIG_GENERIC_PENDING_IRQ
+	if (desc->status & IRQ_MOVE_PENDING)
+		cpumask_copy(cpumask, desc->pending_mask);
+	else
+#endif
+		cpumask_copy(cpumask, desc->affinity);
+	raw_spin_unlock_irqrestore(&desc->lock, flags);
+
+	notify->notify(notify, cpumask);
+
+	free_cpumask_var(cpumask);
+out:
+	kref_put(&notify->kref, notify->release);
+}
+
+/**
+ *	irq_set_affinity_notifier - control notification of IRQ affinity changes
+ *	@irq:		Interrupt for which to enable/disable notification
+ *	@notify:	Context for notification, or %NULL to disable
+ *			notification.  Function pointers must be initialised;
+ *			the other fields will be initialised by this function.
+ *
+ *	Must be called in process context.  Notification may only be enabled
+ *	after the IRQ is allocated but before it is bound with request_irq()
+ *	and must be disabled before the IRQ is freed using free_irq().
+ */
+int
+irq_set_affinity_notifier(unsigned int irq, struct irq_affinity_notify *notify)
+{
+	struct irq_desc *desc = irq_to_desc(irq);
+	struct irq_affinity_notify *old_notify;
+	unsigned long flags;
+
+	/* The release function is promised process context */
+	might_sleep();
+
+	if (!desc)
+		return -EINVAL;
+
+	/* Complete initialisation of *notify */
+	if (notify) {
+		notify->irq = irq;
+		kref_init(&notify->kref);
+		INIT_WORK(&notify->work, irq_affinity_notify);
+	}
+
+	raw_spin_lock_irqsave(&desc->lock, flags);
+	old_notify = desc->affinity_notify;
+	desc->affinity_notify = notify;
+	raw_spin_unlock_irqrestore(&desc->lock, flags);
+
+	if (old_notify)
+		kref_put(&old_notify->kref, old_notify->release);
+	
+	return 0;
+}
+EXPORT_SYMBOL_GPL(irq_set_affinity_notifier);
+
 #ifndef CONFIG_AUTO_IRQ_AFFINITY
 /*
  * Generic version of the affinity autoselector.
@@ -1004,6 +1081,10 @@ void free_irq(unsigned int irq, void *dev_id)
 	if (!desc)
 		return;
 
+#ifdef CONFIG_SMP
+	BUG_ON(desc->affinity_notify);
+#endif
+
 	chip_bus_lock(desc);
 	kfree(__free_irq(irq, dev_id));
 	chip_bus_sync_unlock(desc);
-- 
1.7.3.4


-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply related

* [PATCH 2/2] lib: cpu_rmap: CPU affinity reverse-mapping
From: Ben Hutchings @ 2011-01-04 19:39 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: David Miller, Tom Herbert, linux-kernel, netdev,
	linux-net-drivers
In-Reply-To: <1294169842.3636.31.camel@bwh-desktop>

When initiating I/O on a multiqueue and multi-IRQ device, we may want
to select a queue for which the response will be handled on the same
or a nearby CPU.  This requires a reverse-map of IRQ affinity.  Add
library functions to support a generic reverse-mapping from CPUs to
objects with affinity and the specific case where the objects are
IRQs.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 include/linux/cpu_rmap.h |   73 +++++++++++++
 lib/Kconfig              |    4 +
 lib/Makefile             |    2 +
 lib/cpu_rmap.c           |  262 ++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 341 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/cpu_rmap.h
 create mode 100644 lib/cpu_rmap.c

diff --git a/include/linux/cpu_rmap.h b/include/linux/cpu_rmap.h
new file mode 100644
index 0000000..6e2f5ff
--- /dev/null
+++ b/include/linux/cpu_rmap.h
@@ -0,0 +1,73 @@
+/*
+ * cpu_rmap.c: CPU affinity reverse-map support
+ * Copyright 2010 Solarflare Communications Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation, incorporated herein by reference.
+ */
+
+#include <linux/cpumask.h>
+#include <linux/gfp.h>
+#include <linux/slab.h>
+
+/**
+ * struct cpu_rmap - CPU affinity reverse-map
+ * @near: For each CPU, the index and distance to the nearest object,
+ *      based on affinity masks
+ * @size: Number of objects to be reverse-mapped
+ * @used: Number of objects added
+ * @obj: Array of object pointers
+ */
+struct cpu_rmap {
+	struct {
+		u16     index;
+		u16     dist;
+	} near[NR_CPUS];
+	u16		size, used;
+	void		*obj[0];
+};
+#define CPU_RMAP_DIST_INF 0xffff
+
+extern struct cpu_rmap *alloc_cpu_rmap(unsigned int size, gfp_t flags);
+
+/**
+ * free_cpu_rmap - free CPU affinity reverse-map
+ * @rmap: Reverse-map allocated with alloc_cpu_rmap(), or %NULL
+ */
+static inline void free_cpu_rmap(struct cpu_rmap *rmap)
+{
+	kfree(rmap);
+}
+
+extern int cpu_rmap_add(struct cpu_rmap *rmap, void *obj);
+extern int cpu_rmap_update(struct cpu_rmap *rmap, u16 index,
+			   const struct cpumask *affinity);
+
+static inline u16 cpu_rmap_lookup_index(struct cpu_rmap *rmap, unsigned int cpu)
+{
+	return rmap->near[cpu].index;
+}
+
+static inline void *cpu_rmap_lookup_obj(struct cpu_rmap *rmap, unsigned int cpu)
+{
+	return rmap->obj[rmap->near[cpu].index];
+}
+
+#ifdef CONFIG_GENERIC_HARDIRQS
+
+/**
+ * alloc_irq_cpu_rmap - allocate CPU affinity reverse-map for IRQs
+ * @size: Number of objects to be mapped
+ *
+ * Must be called in process context.
+ */
+static inline struct cpu_rmap *alloc_irq_cpu_rmap(unsigned int size)
+{
+	return alloc_cpu_rmap(size, GFP_KERNEL);
+}
+extern void free_irq_cpu_rmap(struct cpu_rmap *rmap);
+
+extern int irq_cpu_rmap_add(struct cpu_rmap *rmap, int irq);
+
+#endif
diff --git a/lib/Kconfig b/lib/Kconfig
index 3d498b2..f43cb2e 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -195,6 +195,10 @@ config DISABLE_OBSOLETE_CPUMASK_FUNCTIONS
        bool "Disable obsolete cpumask functions" if DEBUG_PER_CPU_MAPS
        depends on EXPERIMENTAL && BROKEN
 
+config CPU_RMAP
+	bool
+	depends on SMP
+
 #
 # Netlink attribute parsing support is select'ed if needed
 #
diff --git a/lib/Makefile b/lib/Makefile
index 0248767..001b528 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -110,6 +110,8 @@ obj-$(CONFIG_ATOMIC64_SELFTEST) += atomic64_test.o
 
 obj-$(CONFIG_AVERAGE) += average.o
 
+obj-$(CONFIG_CPU_RMAP) += cpu_rmap.o
+
 hostprogs-y	:= gen_crc32table
 clean-files	:= crc32table.h
 
diff --git a/lib/cpu_rmap.c b/lib/cpu_rmap.c
new file mode 100644
index 0000000..8f7f6c9
--- /dev/null
+++ b/lib/cpu_rmap.c
@@ -0,0 +1,262 @@
+/*
+ * cpu_rmap.c: CPU affinity reverse-map support
+ * Copyright 2010 Solarflare Communications Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation, incorporated herein by reference.
+ */
+
+#include <linux/cpu_rmap.h>
+#ifdef CONFIG_GENERIC_HARDIRQS
+#include <linux/interrupt.h>
+#endif
+#include <linux/module.h>
+
+/*
+ * These functions maintain a mapping from CPUs to some ordered set of
+ * objects with CPU affinities.  This can be seen as a reverse-map of
+ * CPU affinity.  However, we do not assume that the object affinities
+ * cover all CPUs in the system.  For those CPUs not directly covered
+ * by object affinities, we attempt to find a nearest object based on
+ * CPU topology.
+ */
+
+/**
+ * alloc_cpu_rmap - allocate CPU affinity reverse-map
+ * @size: Number of objects to be mapped
+ * @flags: Allocation flags e.g. %GFP_KERNEL
+ */
+struct cpu_rmap *alloc_cpu_rmap(unsigned int size, gfp_t flags)
+{
+	struct cpu_rmap *rmap;
+	unsigned int cpu;
+
+	/* This is a silly number of objects, and we use u16 indices. */
+	if (size > 0xffff)
+		return NULL;
+
+	rmap = kzalloc(sizeof(*rmap) + size * sizeof(rmap->obj[0]), flags);
+	if (!rmap)
+		return NULL;
+
+	/* Initially assign CPUs to objects on a rota, since we have
+	 * no idea where the objects are.  Use infinite distance, so
+	 * any object with known distance is preferable.  Include the
+	 * CPUs that are not present/online, since we definitely want
+	 * any newly-hotplugged CPUs to have some object assigned.
+	 */
+	for_each_possible_cpu(cpu) {
+		rmap->near[cpu].index = cpu % size;
+		rmap->near[cpu].dist = CPU_RMAP_DIST_INF;
+	}
+
+	rmap->size = size;
+	return rmap;
+}
+EXPORT_SYMBOL(alloc_cpu_rmap);
+
+/* Reevaluate nearest object for given CPU, comparing with the given
+ * neighbours at the given distance.
+ */
+static bool cpu_rmap_copy_neigh(struct cpu_rmap *rmap, unsigned int cpu,
+				const struct cpumask *mask, u16 dist)
+{
+	int neigh;
+
+	for_each_cpu(neigh, mask) {
+		if (rmap->near[cpu].dist > dist &&
+		    rmap->near[neigh].dist <= dist) {
+			rmap->near[cpu].index = rmap->near[neigh].index;
+			rmap->near[cpu].dist = dist;
+			return true;
+		}
+	}
+	return false;
+}
+
+#ifdef DEBUG
+static void debug_print_rmap(const struct cpu_rmap *rmap, const char *prefix)
+{
+	unsigned index;
+	unsigned int cpu;
+
+	pr_info("cpu_rmap %p, %s:\n", rmap, prefix);
+
+	for_each_possible_cpu(cpu) {
+		index = rmap->near[cpu].index;
+		pr_info("cpu %d -> obj %u (distance %u)\n",
+			cpu, index, rmap->near[cpu].dist);
+	}
+}
+#else
+static inline void
+debug_print_rmap(const struct cpu_rmap *rmap, const char *prefix)
+{
+}
+#endif
+
+/**
+ * cpu_rmap_add - add object to a rmap
+ * @rmap: CPU rmap allocated with alloc_cpu_rmap()
+ * @obj: Object to add to rmap
+ *
+ * Return index of object.
+ */
+int cpu_rmap_add(struct cpu_rmap *rmap, void *obj)
+{
+	u16 index;
+
+	BUG_ON(rmap->used >= rmap->size);
+	index = rmap->used++;
+	rmap->obj[index] = obj;
+	return index;
+}
+EXPORT_SYMBOL(cpu_rmap_add);
+
+/**
+ * cpu_rmap_update - update CPU rmap following a change of object affinity
+ * @rmap: CPU rmap to update
+ * @index: Index of object whose affinity changed
+ * @affinity: New CPU affinity of object
+ */
+int cpu_rmap_update(struct cpu_rmap *rmap, u16 index,
+		    const struct cpumask *affinity)
+{
+	cpumask_var_t update_mask;
+	unsigned int cpu;
+
+	if (unlikely(!zalloc_cpumask_var(&update_mask, GFP_KERNEL)))
+		return -ENOMEM;
+
+	/* Invalidate distance for all CPUs for which this used to be
+	 * the nearest object.  Mark those CPUs for update.
+	 */
+	for_each_online_cpu(cpu) {
+		if (rmap->near[cpu].index == index) {
+			rmap->near[cpu].dist = CPU_RMAP_DIST_INF;
+			cpumask_set_cpu(cpu, update_mask);
+		}
+	}
+
+	debug_print_rmap(rmap, "after invalidating old distances");
+
+	/* Set distance to 0 for all CPUs in the new affinity mask.
+	 * Mark all CPUs within their NUMA nodes for update.
+	 */
+	for_each_cpu(cpu, affinity) {
+		rmap->near[cpu].index = index;
+		rmap->near[cpu].dist = 0;
+		cpumask_or(update_mask, update_mask,
+			   cpumask_of_node(cpu_to_node(cpu)));
+	}
+
+	debug_print_rmap(rmap, "after updating neighbours");
+
+	/* Update distances based on topology */
+	for_each_cpu(cpu, update_mask) {
+		if (cpu_rmap_copy_neigh(rmap, cpu,
+					topology_thread_cpumask(cpu), 1))
+			continue;
+		if (cpu_rmap_copy_neigh(rmap, cpu,
+					topology_core_cpumask(cpu), 2))
+			continue;
+		if (cpu_rmap_copy_neigh(rmap, cpu,
+					cpumask_of_node(cpu_to_node(cpu)), 3))
+			continue;
+		/* We could continue into NUMA node distances, but for now
+		 * we give up.
+		 */
+	}
+
+	debug_print_rmap(rmap, "after copying neighbours");
+
+	free_cpumask_var(update_mask);
+	return 0;
+}
+EXPORT_SYMBOL(cpu_rmap_update);
+
+#ifdef CONFIG_GENERIC_HARDIRQS
+
+/* Glue between IRQ affinity notifiers and CPU rmaps */
+
+struct irq_glue {
+	struct irq_affinity_notify notify;
+	struct cpu_rmap *rmap;
+	u16 index;
+};
+
+/**
+ * free_irq_cpu_rmap - free a CPU affinity reverse-map used for IRQs
+ * @rmap: Reverse-map allocated with alloc_irq_cpu_map(), or %NULL
+ *
+ * Must be called in process context, before freeing the IRQs, and
+ * without holding any locks required by global workqueue items.
+ */
+void free_irq_cpu_rmap(struct cpu_rmap *rmap)
+{
+	struct irq_glue *glue;
+	u16 index;
+
+	if (!rmap)
+		return;
+
+	for (index = 0; index < rmap->used; index++) {
+		glue = rmap->obj[index];
+		irq_set_affinity_notifier(glue->notify.irq, NULL);
+	}
+	irq_run_affinity_notifiers();
+
+	kfree(rmap);
+}
+EXPORT_SYMBOL(free_irq_cpu_rmap);
+
+static void
+irq_cpu_rmap_notify(struct irq_affinity_notify *notify, const cpumask_t *mask)
+{
+	struct irq_glue *glue =
+		container_of(notify, struct irq_glue, notify);
+	int rc;
+
+	rc = cpu_rmap_update(glue->rmap, glue->index, mask);
+	if (rc)
+		pr_warning("irq_cpu_rmap_notify: update failed: %d\n", rc);
+}
+
+static void irq_cpu_rmap_release(struct kref *ref)
+{
+	struct irq_glue *glue =
+		container_of(ref, struct irq_glue, notify.kref);
+	kfree(glue);
+}
+
+/**
+ * irq_cpu_rmap_add - add an IRQ to a CPU affinity reverse-map
+ * @rmap: The reverse-map
+ * @irq: The IRQ number
+ *
+ * This adds an IRQ affinity notifier that will update the reverse-map
+ * automatically.
+ *
+ * Must be called in process context, after the IRQ is allocated but
+ * before it is bound with request_irq().
+ */
+int irq_cpu_rmap_add(struct cpu_rmap *rmap, int irq)
+{
+	struct irq_glue *glue = kzalloc(sizeof(*glue), GFP_KERNEL);
+	int rc;
+
+	if (!glue)
+		return -ENOMEM;
+	glue->notify.notify = irq_cpu_rmap_notify;
+	glue->notify.release = irq_cpu_rmap_release;
+	glue->rmap = rmap;
+	glue->index = cpu_rmap_add(rmap, glue);
+	rc = irq_set_affinity_notifier(irq, &glue->notify);
+	if (rc)
+		kfree(glue);
+	return rc;
+}
+EXPORT_SYMBOL(irq_cpu_rmap_add);
+
+#endif /* CONFIG_GENERIC_HARDIRQS */
-- 
1.7.3.4


-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply related

* [GIT] Networking
From: David Miller @ 2011-01-04 19:56 UTC (permalink / raw)
  To: torvalds; +Cc: akpm, netdev, linux-kernel

Some stragglers, the most important bit is the bridging multicast
crash cure.

1) MIPS build of starfire fails on some 64-bit platforms, fix from Ben
   Hutchings.

2) VLAN flags of ehea inadvertantly changed when modifying other
   feature flags.  Fix from Breno Leitao.

3) Don't expose kernel addresses in CAN proc files, from Dan
   Rosenberg.

4) skfp driver probe checks wrong return value for error, from Dan
   Carpenter.

5) Bridging netfilter expects SKB mac header to be initialized
   properly, a simplification made to the STP code inadvertantly broke
   that.  Fix from Florian Westphal.

6) tg3_read_vpd() checks return value incorrectly, fix from David
   Sterba.

7) Bridging code doesn't handle non-linear SKBs properly when parsing
   through ipv6 extension headers to get at the IGMP message bits.
   Fix from Tomas Winkler.

8) There's a rather pervasive CISCO ppp implementation bug regarding a
   corner case of compression and protocol IDs, add a sysctl to work
   around this so people can at least function while waiting for
   various CISCO kit to get updated.  From Stephen Hemminger.

9) Memory leaks in ISDN gigaset and broadcom CNIC drivers, from Jesper
   Juhl.

10) Changing ring parameters causes oops in atl1, fix from J. K. Cliburn.

11) In ipv4 we ignore preferred source address setting for local routes,
    fix from Joel Sing.

12) Device leak in atmtcp.c, fix from Julia Lawall.

Please pull, thanks a lot.

The following changes since commit 989d873fc5b6a96695b97738dea8d9f02a60f8ab:

  Merge master.kernel.org:/home/rmk/linux-2.6-arm (2011-01-03 16:37:01 -0800)

are available in the git repository at:

  master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6.git master

Ben Hutchings (1):
      starfire: Fix dma_addr_t size test for MIPS

Breno Leitao (1):
      ehea: Avoid changing vlan flags

Dan Carpenter (1):
      skfp: testing the wrong variable in skfp_driver_init()

Dan Rosenberg (1):
      CAN: Use inode instead of kernel address for /proc file

Dan Williams (1):
      ueagle-atm: fix PHY signal initialization race

David Sterba (1):
      tg3: fix return value check in tg3_read_vpd()

Florian Westphal (1):
      bridge: stp: ensure mac header is set

J. K. Cliburn (1):
      atl1: fix oops when changing tx/rx ring params

Jesper Juhl (2):
      ISDN, Gigaset: Fix memory leak in do_disconnect_req()
      Broadcom CNIC core network driver: fix mem leak on allocation failures in cnic_alloc_uio_rings()

Joel Sing (1):
      ipv4/route.c: respect prefsrc for local routes

Julia Lawall (1):
      drivers/atm/atmtcp.c: add missing atm_dev_put

Tomas Winkler (1):
      bridge: fix br_multicast_ipv6_rcv for paged skbs

stephen hemminger (1):
      ppp: allow disabling multilink protocol ID compression

 drivers/atm/atmtcp.c            |    5 ++++-
 drivers/isdn/gigaset/capi.c     |    1 +
 drivers/net/atlx/atl1.c         |   10 ++++++++++
 drivers/net/cnic.c              |   10 ++++++++--
 drivers/net/ehea/ehea_ethtool.c |    7 +++++++
 drivers/net/ppp_generic.c       |    9 +++++++--
 drivers/net/skfp/skfddi.c       |    2 +-
 drivers/net/starfire.c          |    2 +-
 drivers/net/tg3.c               |    2 +-
 drivers/usb/atm/ueagle-atm.c    |   22 +++++++++++++++++++---
 net/bridge/br_multicast.c       |   28 ++++++++++++++++++----------
 net/bridge/br_stp_bpdu.c        |    2 ++
 net/can/bcm.c                   |    4 ++--
 net/ipv4/route.c                |    8 ++++++--
 14 files changed, 87 insertions(+), 25 deletions(-)

^ permalink raw reply

* Re: [PATCH 2/2] lib: cpu_rmap: CPU affinity reverse-mapping
From: Eric Dumazet @ 2011-01-04 21:17 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Thomas Gleixner, David Miller, Tom Herbert, linux-kernel, netdev,
	linux-net-drivers
In-Reply-To: <1294169967.3636.34.camel@bwh-desktop>

Le mardi 04 janvier 2011 à 19:39 +0000, Ben Hutchings a écrit :
> When initiating I/O on a multiqueue and multi-IRQ device, we may want
> to select a queue for which the response will be handled on the same
> or a nearby CPU.  This requires a reverse-map of IRQ affinity.  Add
> library functions to support a generic reverse-mapping from CPUs to
> objects with affinity and the specific case where the objects are
> IRQs.
> 
> Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
> ---
>  include/linux/cpu_rmap.h |   73 +++++++++++++
>  lib/Kconfig              |    4 +
>  lib/Makefile             |    2 +
>  lib/cpu_rmap.c           |  262 ++++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 341 insertions(+), 0 deletions(-)
>  create mode 100644 include/linux/cpu_rmap.h
>  create mode 100644 lib/cpu_rmap.c
> 
> diff --git a/include/linux/cpu_rmap.h b/include/linux/cpu_rmap.h
> new file mode 100644
> index 0000000..6e2f5ff
> --- /dev/null
> +++ b/include/linux/cpu_rmap.h
> @@ -0,0 +1,73 @@
> +/*
> + * cpu_rmap.c: CPU affinity reverse-map support
> + * Copyright 2010 Solarflare Communications Inc.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published
> + * by the Free Software Foundation, incorporated herein by reference.
> + */
> +
> +#include <linux/cpumask.h>
> +#include <linux/gfp.h>
> +#include <linux/slab.h>
> +
> +/**
> + * struct cpu_rmap - CPU affinity reverse-map
> + * @near: For each CPU, the index and distance to the nearest object,
> + *      based on affinity masks
> + * @size: Number of objects to be reverse-mapped
> + * @used: Number of objects added
> + * @obj: Array of object pointers
> + */
> +struct cpu_rmap {
> +	struct {
> +		u16     index;
> +		u16     dist;
> +	} near[NR_CPUS];

This [NR_CPUS] is highly suspect.

Are you sure you cant use a per_cpu allocation here ?

> +	u16		size, used;
> +	void		*obj[0];
> +};
> +#define CPU_RMAP_DIST_INF 0xffff
> +


> +
> +/**
> + * alloc_cpu_rmap - allocate CPU affinity reverse-map
> + * @size: Number of objects to be mapped
> + * @flags: Allocation flags e.g. %GFP_KERNEL
> + */

I really doubt you need other than GFP_KERNEL. (Especially if you switch
to per_cpu alloc ;) )

> +struct cpu_rmap *alloc_cpu_rmap(unsigned int size, gfp_t flags)
> +{
> +	struct cpu_rmap *rmap;
> +	unsigned int cpu;
> +
> +	/* This is a silly number of objects, and we use u16 indices. */
> +	if (size > 0xffff)
> +		return NULL;
> +
> +	rmap = kzalloc(sizeof(*rmap) + size * sizeof(rmap->obj[0]), flags);
> +	if (!rmap)
> +		return NULL;

^ permalink raw reply

* Re: [PATCH 2/2] lib: cpu_rmap: CPU affinity reverse-mapping
From: Ben Hutchings @ 2011-01-04 21:23 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Thomas Gleixner, David Miller, Tom Herbert, linux-kernel, netdev,
	linux-net-drivers
In-Reply-To: <1294175823.3420.7.camel@edumazet-laptop>

On Tue, 2011-01-04 at 22:17 +0100, Eric Dumazet wrote:
> Le mardi 04 janvier 2011 à 19:39 +0000, Ben Hutchings a écrit :
> > When initiating I/O on a multiqueue and multi-IRQ device, we may want
> > to select a queue for which the response will be handled on the same
> > or a nearby CPU.  This requires a reverse-map of IRQ affinity.  Add
> > library functions to support a generic reverse-mapping from CPUs to
> > objects with affinity and the specific case where the objects are
> > IRQs.
[...]
> > +/**
> > + * struct cpu_rmap - CPU affinity reverse-map
> > + * @near: For each CPU, the index and distance to the nearest object,
> > + *      based on affinity masks
> > + * @size: Number of objects to be reverse-mapped
> > + * @used: Number of objects added
> > + * @obj: Array of object pointers
> > + */
> > +struct cpu_rmap {
> > +	struct {
> > +		u16     index;
> > +		u16     dist;
> > +	} near[NR_CPUS];
> 
> This [NR_CPUS] is highly suspect.
> 
> Are you sure you cant use a per_cpu allocation here ?

I think that would be a waste of space in shared caches, as this is
read-mostly.

> > +	u16		size, used;
> > +	void		*obj[0];
> > +};
> > +#define CPU_RMAP_DIST_INF 0xffff
> > +
> 
> 
> > +
> > +/**
> > + * alloc_cpu_rmap - allocate CPU affinity reverse-map
> > + * @size: Number of objects to be mapped
> > + * @flags: Allocation flags e.g. %GFP_KERNEL
> > + */
> 
> I really doubt you need other than GFP_KERNEL. (Especially if you switch
> to per_cpu alloc ;) )
[...]

I agree, but this is consistent with ~all other allocation functions.

Ben.


-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [Bugme-new] [Bug 25062] New: Bonding packet deduplication doesn't work properly anymore
From: Andrew Morton @ 2011-01-04 21:39 UTC (permalink / raw)
  To: netdev
  Cc: bugzilla-daemon, bugme-new, bugme-daemon, Jay Vosburgh,
	kevin.lapagna
In-Reply-To: <bug-25062-10286@https.bugzilla.kernel.org/>


(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Fri, 17 Dec 2010 11:45:18 GMT
bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=25062
> 
>            Summary: Bonding packet deduplication doesn't work properly
>                     anymore
>            Product: Networking
>            Version: 2.5
>     Kernel Version: > 2.6.33
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: high
>           Priority: P1
>          Component: Other
>         AssignedTo: acme@ghostprotocols.net
>         ReportedBy: kevin.lapagna@bigtag.ch
>         Regression: No
> 
> 
> Here's the setup:
> 
> switch: ordinary cisco switch
> eth0: NIC with kernel module tg3
> eth1: NIC with kernel module e1000e
> bond0: bond with slaves eth0,eth1 in mode 1 (or 5)
> bond0.100: vlan device created with vconfig
> bridge100: bridge created with brctl
> tap1: tap device created with tunctl
> vguest: qemu-kvm vguest whit emulated e1000 NIC
> 
> 
> |________________|-- eth0 \                                               |________________|
> | switch |          -- bond0 -- bond0.100 -- bridge100 -- tap1 -- | vguest |
> |________|-- eth1 /                                               |________|
> 
> When the vguest emits an ethernet broadcast (DHCP-request), it's forwarded all
> the way up to the switch, through eth0. The switch forwards the broadcast -
> also to eth1. The packet travels then all the way back to bridge100. So the
> last status known for bridge100, regarding the mac address of the vgeust is,
> that it is behind bond0.110 (instead of tap1). If a DHCP-server responds to the
> request, the packet travels to bridge100, which has now a faulty
> MAC-address-table and the packet will be rejected and never reaches tap1 and
> therefor not the vguest.
> 
> I witnessed this wrong behavior in kernel 2.6.37-rc5 (debian package), 2.6.36.2
> and 2.6.35.9 (self compiled -  vanilla). The setup has worked with kernels <=
> 2.6.33.7. I've never tried 2.6.34.
> 
> I assume the setup above is a common way for the separation of virtual guests
> on a network level. So this could become a major issue for a lot of people when
> upgrading their kernels.
> 


^ permalink raw reply

* Re: [PATCH 2/2] lib: cpu_rmap: CPU affinity reverse-mapping
From: Eric Dumazet @ 2011-01-04 21:45 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Thomas Gleixner, David Miller, Tom Herbert, linux-kernel, netdev,
	linux-net-drivers
In-Reply-To: <1294176216.3636.38.camel@bwh-desktop>

Le mardi 04 janvier 2011 à 21:23 +0000, Ben Hutchings a écrit :
> On Tue, 2011-01-04 at 22:17 +0100, Eric Dumazet wrote:
> > Le mardi 04 janvier 2011 à 19:39 +0000, Ben Hutchings a écrit :
> > > When initiating I/O on a multiqueue and multi-IRQ device, we may want
> > > to select a queue for which the response will be handled on the same
> > > or a nearby CPU.  This requires a reverse-map of IRQ affinity.  Add
> > > library functions to support a generic reverse-mapping from CPUs to
> > > objects with affinity and the specific case where the objects are
> > > IRQs.
> [...]
> > > +/**
> > > + * struct cpu_rmap - CPU affinity reverse-map
> > > + * @near: For each CPU, the index and distance to the nearest object,
> > > + *      based on affinity masks
> > > + * @size: Number of objects to be reverse-mapped
> > > + * @used: Number of objects added
> > > + * @obj: Array of object pointers
> > > + */
> > > +struct cpu_rmap {
> > > +	struct {
> > > +		u16     index;
> > > +		u16     dist;
> > > +	} near[NR_CPUS];
> > 
> > This [NR_CPUS] is highly suspect.
> > 
> > Are you sure you cant use a per_cpu allocation here ?
> 
> I think that would be a waste of space in shared caches, as this is
> read-mostly.

This is slow path, unless I dont understood the intent.

Cache lines dont matter. I was not concerned about speed but memory
needs.

NR_CPUS can be 4096 on some distros, that means a 32Kbyte allocation.

Really, you'll have to have very strong arguments to introduce an
[NR_CPUS] array in the kernel today.

^ permalink raw reply

* Re: [PATCH 2/2] lib: cpu_rmap: CPU affinity reverse-mapping
From: Ben Hutchings @ 2011-01-04 22:04 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Thomas Gleixner, David Miller, Tom Herbert, linux-kernel, netdev,
	linux-net-drivers
In-Reply-To: <1294177548.3420.11.camel@edumazet-laptop>

On Tue, 2011-01-04 at 22:45 +0100, Eric Dumazet wrote:
> Le mardi 04 janvier 2011 à 21:23 +0000, Ben Hutchings a écrit :
> > On Tue, 2011-01-04 at 22:17 +0100, Eric Dumazet wrote:
> > > Le mardi 04 janvier 2011 à 19:39 +0000, Ben Hutchings a écrit :
> > > > When initiating I/O on a multiqueue and multi-IRQ device, we may want
> > > > to select a queue for which the response will be handled on the same
> > > > or a nearby CPU.  This requires a reverse-map of IRQ affinity.  Add
> > > > library functions to support a generic reverse-mapping from CPUs to
> > > > objects with affinity and the specific case where the objects are
> > > > IRQs.
> > [...]
> > > > +/**
> > > > + * struct cpu_rmap - CPU affinity reverse-map
> > > > + * @near: For each CPU, the index and distance to the nearest object,
> > > > + *      based on affinity masks
> > > > + * @size: Number of objects to be reverse-mapped
> > > > + * @used: Number of objects added
> > > > + * @obj: Array of object pointers
> > > > + */
> > > > +struct cpu_rmap {
> > > > +	struct {
> > > > +		u16     index;
> > > > +		u16     dist;
> > > > +	} near[NR_CPUS];
> > > 
> > > This [NR_CPUS] is highly suspect.
> > > 
> > > Are you sure you cant use a per_cpu allocation here ?
> > 
> > I think that would be a waste of space in shared caches, as this is
> > read-mostly.
> 
> This is slow path, unless I dont understood the intent.

get_rps_cpu() will need to read from an arbitrary entry in cpu_rmap (not
the current CPU's entry) for each new flow and for each flow that went
idle for a while.  That's not fast path but it is part of the data path,
not the control path.

> Cache lines dont matter. I was not concerned about speed but memory
> needs.
> 
> NR_CPUS can be 4096 on some distros, that means a 32Kbyte allocation.
> 
> Really, you'll have to have very strong arguments to introduce an
> [NR_CPUS] array in the kernel today.

I could replace this with a pointer to an array of size
num_possible_cpus().  But I think per_cpu is wrong here.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH 2/2] lib: cpu_rmap: CPU affinity reverse-mapping
From: Eric Dumazet @ 2011-01-04 22:19 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Thomas Gleixner, David Miller, Tom Herbert, linux-kernel, netdev,
	linux-net-drivers
In-Reply-To: <1294178690.3636.49.camel@bwh-desktop>

Le mardi 04 janvier 2011 à 22:04 +0000, Ben Hutchings a écrit :

> get_rps_cpu() will need to read from an arbitrary entry in cpu_rmap (not
> the current CPU's entry) for each new flow and for each flow that went
> idle for a while.  That's not fast path but it is part of the data path,
> not the control path.
> 

Hmm, I call this fast path :(

> > Cache lines dont matter. I was not concerned about speed but memory
> > needs.
> > 
> > NR_CPUS can be 4096 on some distros, that means a 32Kbyte allocation.
> > 
> > Really, you'll have to have very strong arguments to introduce an
> > [NR_CPUS] array in the kernel today.
> 
> I could replace this with a pointer to an array of size
> num_possible_cpus().  But I think per_cpu is wrong here.

Yes, an dynamic array is acceptable

You probably mean nr_cpu_ids 

^ permalink raw reply

* Re: [PATCH] r8169: support control of advertising
From: Francois Romieu @ 2011-01-04 22:38 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: Oliver Neukum, davem, netdev, machen, Hayes Wang
In-Reply-To: <1294161621.3636.17.camel@bwh-desktop>

Ben Hutchings <bhutchings@solarflare.com> :
[...]
> Should be ADVERTISE_10HALF.

Fixed.

[...]
> > +		if (adv & ADVERTISED_10baseT_Half)
> > +			auto_nego |= ADVERTISE_10HALF;
> > +		if (adv & ADVERTISED_10baseT_Full)
> > +			auto_nego |= ADVERTISE_10FULL;
> > +		if (adv & ADVERTISED_100baseT_Half)
> > +			auto_nego |= ADVERTISE_100HALF;
> > +		if (adv & ADVERTISED_100baseT_Full)
> > +			auto_nego |=  ADVERTISE_100FULL;
> > +
> >  		auto_nego |= ADVERTISE_PAUSE_CAP | ADVERTISE_PAUSE_ASYM;
> [...]
> 
> Pause advertising should also be controllable through ethtool, if flow
> control can be altered in the MAC.  (It's not clear whether it can.)

It would be gluttony.

> This should also check for unsupported advertising flags (e.g.
> ADVERTISED_1000baseT_Full when the MAC doesn't support 1000M) and return
> -EINVAL if they're set.

Fixed.

Oliver, are you ok with the patch below (against davem's net-next or you
will experience rejects) ?

diff --git a/drivers/net/r8169.c b/drivers/net/r8169.c
index 27a7c20..4175a14 100644
--- a/drivers/net/r8169.c
+++ b/drivers/net/r8169.c
@@ -540,7 +540,7 @@ struct rtl8169_private {
 		void (*up)(struct rtl8169_private *);
 	} pll_power_ops;
 
-	int (*set_speed)(struct net_device *, u8 autoneg, u16 speed, u8 duplex);
+	int (*set_speed)(struct net_device *, u8 aneg, u16 sp, u8 dpx, u32 adv);
 	int (*get_settings)(struct net_device *, struct ethtool_cmd *);
 	void (*phy_reset_enable)(struct rtl8169_private *tp);
 	void (*hw_start)(struct net_device *);
@@ -1093,7 +1093,7 @@ static int rtl8169_get_regs_len(struct net_device *dev)
 }
 
 static int rtl8169_set_speed_tbi(struct net_device *dev,
-				 u8 autoneg, u16 speed, u8 duplex)
+				 u8 autoneg, u16 speed, u8 duplex, u32 ignored)
 {
 	struct rtl8169_private *tp = netdev_priv(dev);
 	void __iomem *ioaddr = tp->mmio_addr;
@@ -1116,17 +1116,28 @@ static int rtl8169_set_speed_tbi(struct net_device *dev,
 }
 
 static int rtl8169_set_speed_xmii(struct net_device *dev,
-				  u8 autoneg, u16 speed, u8 duplex)
+				  u8 autoneg, u16 speed, u8 duplex, u32 adv)
 {
 	struct rtl8169_private *tp = netdev_priv(dev);
 	int giga_ctrl, bmcr;
+	int rc = -EINVAL;
 
 	if (autoneg == AUTONEG_ENABLE) {
 		int auto_nego;
 
 		auto_nego = rtl_readphy(tp, MII_ADVERTISE);
-		auto_nego |= (ADVERTISE_10HALF | ADVERTISE_10FULL |
-			      ADVERTISE_100HALF | ADVERTISE_100FULL);
+		auto_nego &= ~(ADVERTISE_10HALF | ADVERTISE_10FULL |
+				ADVERTISE_100HALF | ADVERTISE_100FULL);
+
+		if (adv & ADVERTISED_10baseT_Half)
+			auto_nego |= ADVERTISE_10HALF;
+		if (adv & ADVERTISED_10baseT_Full)
+			auto_nego |= ADVERTISE_10FULL;
+		if (adv & ADVERTISED_100baseT_Half)
+			auto_nego |= ADVERTISE_100HALF;
+		if (adv & ADVERTISED_100baseT_Full)
+			auto_nego |= ADVERTISE_100FULL;
+
 		auto_nego |= ADVERTISE_PAUSE_CAP | ADVERTISE_PAUSE_ASYM;
 
 		giga_ctrl = rtl_readphy(tp, MII_CTRL1000);
@@ -1141,10 +1152,15 @@ static int rtl8169_set_speed_xmii(struct net_device *dev,
 		    (tp->mac_version != RTL_GIGA_MAC_VER_14) &&
 		    (tp->mac_version != RTL_GIGA_MAC_VER_15) &&
 		    (tp->mac_version != RTL_GIGA_MAC_VER_16)) {
-			giga_ctrl |= ADVERTISE_1000FULL | ADVERTISE_1000HALF;
-		} else {
+			if (adv & ADVERTISED_1000baseT_Half)
+				giga_ctrl |= ADVERTISE_1000HALF;
+			if (adv & ADVERTISED_1000baseT_Full)
+				giga_ctrl |= ADVERTISE_1000FULL;
+		} else if (adv & (ADVERTISED_1000baseT_Half |
+				  ADVERTISED_1000baseT_Full)) {
 			netif_info(tp, link, dev,
 				   "PHY does not support 1000Mbps\n");
+			goto out;
 		}
 
 		bmcr = BMCR_ANENABLE | BMCR_ANRESTART;
@@ -1171,7 +1187,7 @@ static int rtl8169_set_speed_xmii(struct net_device *dev,
 		else if (speed == SPEED_100)
 			bmcr = BMCR_SPEED100;
 		else
-			return -EINVAL;
+			goto out;
 
 		if (duplex == DUPLEX_FULL)
 			bmcr |= BMCR_FULLDPLX;
@@ -1194,16 +1210,18 @@ static int rtl8169_set_speed_xmii(struct net_device *dev,
 		}
 	}
 
-	return 0;
+	rc = 0;
+out:
+	return rc;
 }
 
 static int rtl8169_set_speed(struct net_device *dev,
-			     u8 autoneg, u16 speed, u8 duplex)
+			     u8 autoneg, u16 speed, u8 duplex, u32 advertising)
 {
 	struct rtl8169_private *tp = netdev_priv(dev);
 	int ret;
 
-	ret = tp->set_speed(dev, autoneg, speed, duplex);
+	ret = tp->set_speed(dev, autoneg, speed, duplex, advertising);
 
 	if (netif_running(dev) && (tp->phy_1000_ctrl_reg & ADVERTISE_1000FULL))
 		mod_timer(&tp->timer, jiffies + RTL8169_PHY_TIMEOUT);
@@ -1218,7 +1236,8 @@ static int rtl8169_set_settings(struct net_device *dev, struct ethtool_cmd *cmd)
 	int ret;
 
 	spin_lock_irqsave(&tp->lock, flags);
-	ret = rtl8169_set_speed(dev, cmd->autoneg, cmd->speed, cmd->duplex);
+	ret = rtl8169_set_speed(dev,
+		cmd->autoneg, cmd->speed, cmd->duplex, cmd->advertising);
 	spin_unlock_irqrestore(&tp->lock, flags);
 
 	return ret;
@@ -2517,11 +2536,11 @@ static void rtl8169_init_phy(struct net_device *dev, struct rtl8169_private *tp)
 
 	rtl8169_phy_reset(dev, tp);
 
-	/*
-	 * rtl8169_set_speed_xmii takes good care of the Fast Ethernet
-	 * only 8101. Don't panic.
-	 */
-	rtl8169_set_speed(dev, AUTONEG_ENABLE, SPEED_1000, DUPLEX_FULL);
+	rtl8169_set_speed(dev, AUTONEG_ENABLE, SPEED_1000, DUPLEX_FULL,
+		ADVERTISED_10baseT_Half | ADVERTISED_10baseT_Full |
+		ADVERTISED_100baseT_Half | ADVERTISED_100baseT_Full |
+		tp->mii.supports_gmii ? 0 :
+			ADVERTISED_1000baseT_Half | ADVERTISED_1000baseT_Full);
 
 	if (RTL_R8(PHYstatus) & TBI_Enable)
 		netif_info(tp, link, dev, "TBI auto-negotiating\n");

^ permalink raw reply related

* Re: [net-next-2.6 PATCH v5 2/2] net_sched: implement a root container qdisc sch_mqprio
From: Jarek Poplawski @ 2011-01-04 22:59 UTC (permalink / raw)
  To: John Fastabend
  Cc: davem, hadi, shemminger, tgraf, eric.dumazet, bhutchings, nhorman,
	netdev
In-Reply-To: <20110104185646.13692.68146.stgit@jf-dev1-dcblab>

On Tue, Jan 04, 2011 at 10:56:46AM -0800, John Fastabend wrote:
> This implements a mqprio queueing discipline that by default creates
> a pfifo_fast qdisc per tx queue and provides the needed configuration
> interface.
> 
> Using the mqprio qdisc the number of tcs currently in use along
> with the range of queues alloted to each class can be configured. By
> default skbs are mapped to traffic classes using the skb priority.
> This mapping is configurable.
> 
> Configurable parameters,
> 
> struct tc_mqprio_qopt {
>         __u8    num_tc;
>         __u8    prio_tc_map[TC_BITMASK + 1];
>         __u8    hw;
>         __u16   count[TC_MAX_QUEUE];
>         __u16   offset[TC_MAX_QUEUE];
> };
> 
> Here the count/offset pairing give the queue alignment and the
> prio_tc_map gives the mapping from skb->priority to tc.
> 
> The hw bit determines if the hardware should configure the count
> and offset values. If the hardware bit is set then the operation
> will fail if the hardware does not implement the ndo_setup_tc
> operation. This is to avoid undetermined states where the hardware
> may or may not control the queue mapping. Also minimal bounds
> checking is done on the count/offset to verify a queue does not
> exceed num_tx_queues and that queue ranges do not overlap. Otherwise
> it is left to user policy or hardware configuration to create
> useful mappings.
> 
> It is expected that hardware QOS schemes can be implemented by
> creating appropriate mappings of queues in ndo_tc_setup().
> 
> One expected use case is drivers will use the ndo_setup_tc to map
> queue ranges onto 802.1Q traffic classes. This provides a generic
> mechanism to map network traffic onto these traffic classes and
> removes the need for lower layer drivers to know specifics about
> traffic types.
> 
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> ---
> 
>  include/linux/netdevice.h |    3 
>  include/linux/pkt_sched.h |   10 +
>  net/sched/Kconfig         |   12 +
>  net/sched/Makefile        |    1 
>  net/sched/sch_generic.c   |    4 
>  net/sched/sch_mqprio.c    |  413 +++++++++++++++++++++++++++++++++++++++++++++
>  6 files changed, 443 insertions(+), 0 deletions(-)
>  create mode 100644 net/sched/sch_mqprio.c
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index ae51323..19a855b 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -764,6 +764,8 @@ struct netdev_tc_txq {
>   * int (*ndo_set_vf_port)(struct net_device *dev, int vf,
>   *			  struct nlattr *port[]);
>   * int (*ndo_get_vf_port)(struct net_device *dev, int vf, struct sk_buff *skb);
> + *
> + * int (*ndo_setup_tc)(struct net_device *dev, int tc);

 * int (*ndo_setup_tc)(struct net_device *dev, u8 tc);

>   */
>  #define HAVE_NET_DEVICE_OPS
>  struct net_device_ops {
> @@ -822,6 +824,7 @@ struct net_device_ops {
>  						   struct nlattr *port[]);
>  	int			(*ndo_get_vf_port)(struct net_device *dev,
>  						   int vf, struct sk_buff *skb);
> +	int			(*ndo_setup_tc)(struct net_device *dev, u8 tc);
>  #if defined(CONFIG_FCOE) || defined(CONFIG_FCOE_MODULE)
>  	int			(*ndo_fcoe_enable)(struct net_device *dev);
>  	int			(*ndo_fcoe_disable)(struct net_device *dev);
> diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
> index 2cfa4bc..1c5310a 100644
> --- a/include/linux/pkt_sched.h
> +++ b/include/linux/pkt_sched.h
> @@ -2,6 +2,7 @@
>  #define __LINUX_PKT_SCHED_H
>  
>  #include <linux/types.h>
> +#include <linux/netdevice.h>

This should better be consulted with Stephen wrt. iproute patch.

>  
>  /* Logical priority bands not depending on specific packet scheduler.
>     Every scheduler will map them to real traffic classes, if it has
> @@ -481,4 +482,13 @@ struct tc_drr_stats {
>  	__u32	deficit;
>  };
>  
> +/* MQPRIO */
> +struct tc_mqprio_qopt {
> +	__u8	num_tc;
> +	__u8	prio_tc_map[TC_BITMASK + 1];
> +	__u8	hw;
> +	__u16	count[TC_MAX_QUEUE];
> +	__u16	offset[TC_MAX_QUEUE];
> +};
> +
>  #endif
> diff --git a/net/sched/Kconfig b/net/sched/Kconfig
> index a36270a..f52f5eb 100644
> --- a/net/sched/Kconfig
> +++ b/net/sched/Kconfig
> @@ -205,6 +205,18 @@ config NET_SCH_DRR
>  
>  	  If unsure, say N.
>  
> +config NET_SCH_MQPRIO
> +	tristate "Multi-queue priority scheduler (MQPRIO)"
> +	help
> +	  Say Y here if you want to use the Multi-queue Priority scheduler.
> +	  This scheduler allows QOS to be offloaded on NICs that have support
> +	  for offloading QOS schedulers.
> +
> +	  To compile this driver as a module, choose M here: the module will
> +	  be called sch_mqprio.
> +
> +	  If unsure, say N.
> +
>  config NET_SCH_INGRESS
>  	tristate "Ingress Qdisc"
>  	depends on NET_CLS_ACT
> diff --git a/net/sched/Makefile b/net/sched/Makefile
> index 960f5db..26ce681 100644
> --- a/net/sched/Makefile
> +++ b/net/sched/Makefile
> @@ -32,6 +32,7 @@ obj-$(CONFIG_NET_SCH_MULTIQ)	+= sch_multiq.o
>  obj-$(CONFIG_NET_SCH_ATM)	+= sch_atm.o
>  obj-$(CONFIG_NET_SCH_NETEM)	+= sch_netem.o
>  obj-$(CONFIG_NET_SCH_DRR)	+= sch_drr.o
> +obj-$(CONFIG_NET_SCH_MQPRIO)	+= sch_mqprio.o
>  obj-$(CONFIG_NET_CLS_U32)	+= cls_u32.o
>  obj-$(CONFIG_NET_CLS_ROUTE4)	+= cls_route.o
>  obj-$(CONFIG_NET_CLS_FW)	+= cls_fw.o
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 34dc598..723b278 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -540,6 +540,7 @@ struct Qdisc_ops pfifo_fast_ops __read_mostly = {
>  	.dump		=	pfifo_fast_dump,
>  	.owner		=	THIS_MODULE,
>  };
> +EXPORT_SYMBOL(pfifo_fast_ops);
>  
>  struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue,
>  			  struct Qdisc_ops *ops)
> @@ -674,6 +675,7 @@ struct Qdisc *dev_graft_qdisc(struct netdev_queue *dev_queue,
>  
>  	return oqdisc;
>  }
> +EXPORT_SYMBOL(dev_graft_qdisc);
>  
>  static void attach_one_default_qdisc(struct net_device *dev,
>  				     struct netdev_queue *dev_queue,
> @@ -761,6 +763,7 @@ void dev_activate(struct net_device *dev)
>  		dev_watchdog_up(dev);
>  	}
>  }
> +EXPORT_SYMBOL(dev_activate);
>  
>  static void dev_deactivate_queue(struct net_device *dev,
>  				 struct netdev_queue *dev_queue,
> @@ -840,6 +843,7 @@ void dev_deactivate(struct net_device *dev)
>  	list_add(&dev->unreg_list, &single);
>  	dev_deactivate_many(&single);
>  }
> +EXPORT_SYMBOL(dev_deactivate);
>  
>  static void dev_init_scheduler_queue(struct net_device *dev,
>  				     struct netdev_queue *dev_queue,
> diff --git a/net/sched/sch_mqprio.c b/net/sched/sch_mqprio.c
> new file mode 100644
> index 0000000..b16dc2c
> --- /dev/null
> +++ b/net/sched/sch_mqprio.c
> @@ -0,0 +1,413 @@
> +/*
> + * net/sched/sch_mqprio.c
> + *
> + * Copyright (c) 2010 John Fastabend <john.r.fastabend@intel.com>
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * version 2 as published by the Free Software Foundation.
> + */
> +
> +#include <linux/types.h>
> +#include <linux/slab.h>
> +#include <linux/kernel.h>
> +#include <linux/string.h>
> +#include <linux/errno.h>
> +#include <linux/skbuff.h>
> +#include <net/netlink.h>
> +#include <net/pkt_sched.h>
> +#include <net/sch_generic.h>
> +
> +struct mqprio_sched {
> +	struct Qdisc		**qdiscs;
> +	int hw_owned;
> +};
> +
> +static void mqprio_destroy(struct Qdisc *sch)
> +{
> +	struct net_device *dev = qdisc_dev(sch);
> +	struct mqprio_sched *priv = qdisc_priv(sch);
> +	unsigned int ntx;
> +
> +	if (!priv->qdiscs)
> +		return;
> +
> +	for (ntx = 0; ntx < dev->num_tx_queues && priv->qdiscs[ntx]; ntx++)
> +		qdisc_destroy(priv->qdiscs[ntx]);
> +
> +	if (priv->hw_owned && dev->netdev_ops->ndo_setup_tc)
> +		dev->netdev_ops->ndo_setup_tc(dev, 0);
> +	else
> +		netdev_set_num_tc(dev, 0);
> +
> +	kfree(priv->qdiscs);
> +}
> +
> +static int mqprio_parse_opt(struct net_device *dev, struct tc_mqprio_qopt *qopt)
> +{
> +	int i, j;
> +
> +	/* Verify num_tc is not out of max range */
> +	if (qopt->num_tc > TC_MAX_QUEUE)
> +		return -EINVAL;
> +
> +	for (i = 0; i < qopt->num_tc; i++) {
> +		unsigned int last = qopt->offset[i] + qopt->count[i];

(empty line after declarations)

> +		/* Verify the queue offset is in the num tx range */
> +		if (qopt->offset[i] >= dev->num_tx_queues)
> +			return -EINVAL;
> +		/* Verify the queue count is in tx range being equal to the
> +		 * num_tx_queues indicates the last queue is in use.
> +		 */
> +		else if (last > dev->num_tx_queues)
> +			return -EINVAL;
> +
> +		/* Verify that the offset and counts do not overlap */
> +		for (j = i + 1; j < qopt->num_tc; j++) {
> +			if (last > qopt->offset[j])
> +				return -EINVAL;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +static int mqprio_init(struct Qdisc *sch, struct nlattr *opt)
> +{
> +	struct net_device *dev = qdisc_dev(sch);
> +	struct mqprio_sched *priv = qdisc_priv(sch);
> +	struct netdev_queue *dev_queue;
> +	struct Qdisc *qdisc;
> +	int i, err = -EOPNOTSUPP;
> +	struct tc_mqprio_qopt *qopt = NULL;
> +
> +	/* Unwind attributes on failure */
> +	u8 unwnd_tc = dev->num_tc;
> +	u8 unwnd_map[TC_BITMASK + 1];
> +	struct netdev_tc_txq unwnd_txq[TC_MAX_QUEUE];
> +
> +	if (sch->parent != TC_H_ROOT)
> +		return -EOPNOTSUPP;
> +
> +	if (!netif_is_multiqueue(dev))
> +		return -EOPNOTSUPP;
> +
> +	if (nla_len(opt) < sizeof(*qopt))
> +		return -EINVAL;
> +	qopt = nla_data(opt);
> +
> +	memcpy(unwnd_map, dev->prio_tc_map, sizeof(unwnd_map));
> +	memcpy(unwnd_txq, dev->tc_to_txq, sizeof(unwnd_txq));
> +
> +	/* If the mqprio options indicate that hardware should own
> +	 * the queue mapping then run ndo_setup_tc if this can not
> +	 * be done fail immediately.
> +	 */
> +	if (qopt->hw && dev->netdev_ops->ndo_setup_tc) {
> +		priv->hw_owned = 1;
> +		err = dev->netdev_ops->ndo_setup_tc(dev, qopt->num_tc);
> +		if (err)
> +			return err;
> +	} else if (!qopt->hw) {
> +		if (mqprio_parse_opt(dev, qopt))
> +			return -EINVAL;
> +
> +		if (netdev_set_num_tc(dev, qopt->num_tc))
> +			return -EINVAL;
> +
> +		for (i = 0; i < qopt->num_tc; i++)
> +			netdev_set_tc_queue(dev, i,
> +					    qopt->count[i], qopt->offset[i]);
> +	} else {
> +		return -EINVAL;
> +	}
> +
> +	/* Always use supplied priority mappings */
> +	for (i = 0; i < TC_BITMASK + 1; i++) {
> +		if (netdev_set_prio_tc_map(dev, i, qopt->prio_tc_map[i])) {
> +			err = -EINVAL;

This would probably trigger if we try qopt->num_tc == 0. Is it expected?

> +			goto tc_err;
> +		}
> +	}
> +
> +	/* pre-allocate qdisc, attachment can't fail */
> +	priv->qdiscs = kcalloc(dev->num_tx_queues, sizeof(priv->qdiscs[0]),
> +			       GFP_KERNEL);
> +	if (priv->qdiscs == NULL) {
> +		err = -ENOMEM;
> +		goto tc_err;
> +	}
> +
> +	for (i = 0; i < dev->num_tx_queues; i++) {
> +		dev_queue = netdev_get_tx_queue(dev, i);
> +		qdisc = qdisc_create_dflt(dev_queue, &pfifo_fast_ops,
> +					  TC_H_MAKE(TC_H_MAJ(sch->handle),
> +						    TC_H_MIN(i + 1)));
> +		if (qdisc == NULL) {
> +			err = -ENOMEM;
> +			goto err;
> +		}
> +		qdisc->flags |= TCQ_F_CAN_BYPASS;
> +		priv->qdiscs[i] = qdisc;
> +	}
> +
> +	sch->flags |= TCQ_F_MQROOT;
> +	return 0;
> +
> +err:
> +	mqprio_destroy(sch);
> +tc_err:
> +	if (priv->hw_owned)
> +		dev->netdev_ops->ndo_setup_tc(dev, unwnd_tc);

Setting here (again) to unwind a bit later looks strange.
Why not this 'else' only?

> +	else
> +		netdev_set_num_tc(dev, unwnd_tc);
> +
> +	memcpy(dev->prio_tc_map, unwnd_map, sizeof(unwnd_map));
> +	memcpy(dev->tc_to_txq, unwnd_txq, sizeof(unwnd_txq));
> +
> +	return err;
> +}
> +
> +static void mqprio_attach(struct Qdisc *sch)
> +{
> +	struct net_device *dev = qdisc_dev(sch);
> +	struct mqprio_sched *priv = qdisc_priv(sch);
> +	struct Qdisc *qdisc;
> +	unsigned int ntx;
> +
> +	/* Attach underlying qdisc */
> +	for (ntx = 0; ntx < dev->num_tx_queues; ntx++) {
> +		qdisc = priv->qdiscs[ntx];
> +		qdisc = dev_graft_qdisc(qdisc->dev_queue, qdisc);
> +		if (qdisc)
> +			qdisc_destroy(qdisc);
> +	}
> +	kfree(priv->qdiscs);
> +	priv->qdiscs = NULL;
> +}
> +
> +static struct netdev_queue *mqprio_queue_get(struct Qdisc *sch,
> +					     unsigned long cl)
> +{
> +	struct net_device *dev = qdisc_dev(sch);
> +	unsigned long ntx = cl - 1 - netdev_get_num_tc(dev);
> +
> +	if (ntx >= dev->num_tx_queues)
> +		return NULL;
> +	return netdev_get_tx_queue(dev, ntx);
> +}
> +
> +static int mqprio_graft(struct Qdisc *sch, unsigned long cl, struct Qdisc *new,
> +		    struct Qdisc **old)
> +{
> +	struct net_device *dev = qdisc_dev(sch);
> +	struct netdev_queue *dev_queue = mqprio_queue_get(sch, cl);
> +
> +	if (dev->flags & IFF_UP)
> +		dev_deactivate(dev);
> +
> +	*old = dev_graft_qdisc(dev_queue, new);
> +
> +	if (dev->flags & IFF_UP)
> +		dev_activate(dev);
> +
> +	return 0;
> +}
> +
> +static int mqprio_dump(struct Qdisc *sch, struct sk_buff *skb)
> +{
> +	struct net_device *dev = qdisc_dev(sch);
> +	struct mqprio_sched *priv = qdisc_priv(sch);
> +	unsigned char *b = skb_tail_pointer(skb);
> +	struct tc_mqprio_qopt opt;
> +	struct Qdisc *qdisc;
> +	unsigned int i;
> +
> +	sch->q.qlen = 0;
> +	memset(&sch->bstats, 0, sizeof(sch->bstats));
> +	memset(&sch->qstats, 0, sizeof(sch->qstats));
> +
> +	for (i = 0; i < dev->num_tx_queues; i++) {
> +		qdisc = netdev_get_tx_queue(dev, i)->qdisc;
> +		spin_lock_bh(qdisc_lock(qdisc));
> +		sch->q.qlen		+= qdisc->q.qlen;
> +		sch->bstats.bytes	+= qdisc->bstats.bytes;
> +		sch->bstats.packets	+= qdisc->bstats.packets;
> +		sch->qstats.qlen	+= qdisc->qstats.qlen;
> +		sch->qstats.backlog	+= qdisc->qstats.backlog;
> +		sch->qstats.drops	+= qdisc->qstats.drops;
> +		sch->qstats.requeues	+= qdisc->qstats.requeues;
> +		sch->qstats.overlimits	+= qdisc->qstats.overlimits;
> +		spin_unlock_bh(qdisc_lock(qdisc));
> +	}
> +
> +	opt.num_tc = dev->num_tc;
> +	memcpy(opt.prio_tc_map, dev->prio_tc_map, sizeof(opt.prio_tc_map));
> +	opt.hw = priv->hw_owned;
> +
> +	for (i = 0; i < dev->num_tc; i++) {
> +		opt.count[i] = dev->tc_to_txq[i].count;
> +		opt.offset[i] = dev->tc_to_txq[i].offset;
> +	}
> +
> +	NLA_PUT(skb, TCA_OPTIONS, sizeof(opt), &opt);
> +
> +	return skb->len;
> +nla_put_failure:
> +	nlmsg_trim(skb, b);
> +	return -1;
> +}
> +
> +static struct Qdisc *mqprio_leaf(struct Qdisc *sch, unsigned long cl)
> +{
> +	struct netdev_queue *dev_queue = mqprio_queue_get(sch, cl);
> +
> +	return dev_queue->qdisc_sleeping;
> +}
> +
> +static unsigned long mqprio_get(struct Qdisc *sch, u32 classid)
> +{
> +	unsigned int ntx = TC_H_MIN(classid);

We need to 'get' tc classes too, eg for individual dumps. Then we
should omit them in .leaf, .graft etc.

> +
> +	if (!mqprio_queue_get(sch, ntx))
> +		return 0;
> +	return ntx;
> +}
> +
> +static void mqprio_put(struct Qdisc *sch, unsigned long cl)
> +{
> +}
> +
> +static int mqprio_dump_class(struct Qdisc *sch, unsigned long cl,
> +			 struct sk_buff *skb, struct tcmsg *tcm)
> +{
> +	struct net_device *dev = qdisc_dev(sch);
> +
> +	if (cl <= dev->num_tc) {
> +		tcm->tcm_parent = TC_H_ROOT;
> +		tcm->tcm_info = 0;
> +	} else {
> +		int i;
> +		struct netdev_queue *dev_queue;
> +		dev_queue = mqprio_queue_get(sch, cl);
> +
> +		tcm->tcm_parent = 0;
> +		for (i = 0; i < netdev_get_num_tc(dev); i++) {


Why dev->num_tc above, netdev_get_num_tc(dev) here, and dev->num_tc
below?

> +			struct netdev_tc_txq tc = dev->tc_to_txq[i];
> +			int q_idx = cl - dev->num_tc;

(empty line after declarations)

> +			if (q_idx >= tc.offset &&
> +			    q_idx < tc.offset + tc.count) {

cl == 17, tc.offset == 0, tc.count == 1, num_tc = 16, q_idx = 1,
!(1 < 0 + 1), doesn't belong to the parent #1?

> +				tcm->tcm_parent =
> +					TC_H_MAKE(TC_H_MAJ(sch->handle),
> +						  TC_H_MIN(i + 1));
> +				break;
> +			}
> +		}
> +		tcm->tcm_info = dev_queue->qdisc_sleeping->handle;
> +	}
> +	tcm->tcm_handle |= TC_H_MIN(cl);
> +	return 0;
> +}
> +
> +static int mqprio_dump_class_stats(struct Qdisc *sch, unsigned long cl,
> +			       struct gnet_dump *d)
> +{
> +	struct net_device *dev = qdisc_dev(sch);
> +
> +	if (cl <= netdev_get_num_tc(dev)) {
> +		int i;
> +		struct Qdisc *qdisc;
> +		struct gnet_stats_queue qstats = {0};
> +		struct gnet_stats_basic_packed bstats = {0};
> +		struct netdev_tc_txq tc = dev->tc_to_txq[cl - 1];
> +
> +		/* Drop lock here it will be reclaimed before touching
> +		 * statistics this is required because the d->lock we
> +		 * hold here is the look on dev_queue->qdisc_sleeping
> +		 * also acquired below.
> +		 */
> +		spin_unlock_bh(d->lock);
> +
> +		for (i = tc.offset; i < tc.offset + tc.count; i++) {
> +			qdisc = netdev_get_tx_queue(dev, i)->qdisc;
> +			spin_lock_bh(qdisc_lock(qdisc));
> +			bstats.bytes      += qdisc->bstats.bytes;
> +			bstats.packets    += qdisc->bstats.packets;
> +			qstats.qlen       += qdisc->qstats.qlen;
> +			qstats.backlog    += qdisc->qstats.backlog;
> +			qstats.drops      += qdisc->qstats.drops;
> +			qstats.requeues   += qdisc->qstats.requeues;
> +			qstats.overlimits += qdisc->qstats.overlimits;
> +			spin_unlock_bh(qdisc_lock(qdisc));
> +		}
> +		/* Reclaim root sleeping lock before completing stats */
> +		spin_lock_bh(d->lock);
> +		if (gnet_stats_copy_basic(d, &bstats) < 0 ||
> +		    gnet_stats_copy_queue(d, &qstats) < 0)
> +			return -1;
> +	} else {
> +		struct netdev_queue *dev_queue = mqprio_queue_get(sch, cl);

(empty line after declarations)

> +		sch = dev_queue->qdisc_sleeping;
> +		sch->qstats.qlen = sch->q.qlen;
> +		if (gnet_stats_copy_basic(d, &sch->bstats) < 0 ||
> +		    gnet_stats_copy_queue(d, &sch->qstats) < 0)
> +			return -1;
> +	}
> +	return 0;
> +}
> +
> +static void mqprio_walk(struct Qdisc *sch, struct qdisc_walker *arg)
> +{
> +	struct net_device *dev = qdisc_dev(sch);
> +	unsigned long ntx;
> +	u8 num_tc = netdev_get_num_tc(dev);
> +
> +	if (arg->stop)
> +		return;
> +
> +	/* Walk hierarchy with a virtual class per tc */
> +	arg->count = arg->skip;
> +	for (ntx = arg->skip; ntx < dev->num_tx_queues + num_tc; ntx++) {

Should we report possibly unused/unconfigured tx_queues?

Jarek P.

> +		if (arg->fn(sch, ntx + 1, arg) < 0) {
> +			arg->stop = 1;
> +			break;
> +		}
> +		arg->count++;
> +	}
> +}
> +
> +static const struct Qdisc_class_ops mqprio_class_ops = {
> +	.graft		= mqprio_graft,
> +	.leaf		= mqprio_leaf,
> +	.get		= mqprio_get,
> +	.put		= mqprio_put,
> +	.walk		= mqprio_walk,
> +	.dump		= mqprio_dump_class,
> +	.dump_stats	= mqprio_dump_class_stats,
> +};
> +
> +struct Qdisc_ops mqprio_qdisc_ops __read_mostly = {
> +	.cl_ops		= &mqprio_class_ops,
> +	.id		= "mqprio",
> +	.priv_size	= sizeof(struct mqprio_sched),
> +	.init		= mqprio_init,
> +	.destroy	= mqprio_destroy,
> +	.attach		= mqprio_attach,
> +	.dump		= mqprio_dump,
> +	.owner		= THIS_MODULE,
> +};
> +
> +static int __init mqprio_module_init(void)
> +{
> +	return register_qdisc(&mqprio_qdisc_ops);
> +}
> +
> +static void __exit mqprio_module_exit(void)
> +{
> +	unregister_qdisc(&mqprio_qdisc_ops);
> +}
> +
> +module_init(mqprio_module_init);
> +module_exit(mqprio_module_exit);
> +
> +MODULE_LICENSE("GPL");
> 

^ permalink raw reply

* [net-next-2.6 PATCH] ethtool: update get_rx_ntuple to correctly interpret string count
From: Alexander Duyck @ 2011-01-04 23:29 UTC (permalink / raw)
  To: davem, bhutchings; +Cc: netdev

Currently any strings returned via the get_rx_ntuple call will just be
dropped because the num_strings will be zero.  In order to correct this I
am updating things so that the return value of get_rx_ntuple is the number
of strings that were written, or a negative value if there was an error.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

 net/core/ethtool.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 1774178..7ade13b 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -587,6 +587,9 @@ static int ethtool_get_rx_ntuple(struct net_device *dev, void __user *useraddr)
 	if (ops->get_rx_ntuple) {
 		/* driver-specific filter grab */
 		ret = ops->get_rx_ntuple(dev, gstrings.string_set, data);
+		if (ret < 0)
+			goto out;
+		num_strings = ret;
 		goto copy;
 	}
 


^ permalink raw reply related

* Re: [net-next-2.6 08/08] r8169: more 8168dp support.
From: Francois Romieu @ 2011-01-04 23:53 UTC (permalink / raw)
  To: hayeswang; +Cc: davem, netdev, 'Ben Hutchings'
In-Reply-To: <FFFEF1293F9B4E3B81D3777F91F21B01@realtek.com.tw>

hayeswang <hayeswang@realtek.com> :
[...]
> When doing reset, the nic has to notify the embedded system and wait a response.
> However, maybe the system accesses the same register at the same time, so the
> embedded system and the driver implement the same method of software mutex to
> prevent this situation.

Thanks for the information.

I have found no 8168dp in my collection. A volunteer to test this patch with
the relevant hardware would be welcome.

-- 
Ueimor

^ permalink raw reply

* Re: [net-next-2.6 PATCH] ethtool: update get_rx_ntuple to correctly interpret string count
From: Ben Hutchings @ 2011-01-05  0:01 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: davem, netdev
In-Reply-To: <20110104232947.3451.91153.stgit@gitlad.jf.intel.com>

On Tue, 2011-01-04 at 15:29 -0800, Alexander Duyck wrote:
> Currently any strings returned via the get_rx_ntuple call will just be
> dropped because the num_strings will be zero.  In order to correct this I
> am updating things so that the return value of get_rx_ntuple is the number
> of strings that were written, or a negative value if there was an error.
[...]

Nothing implements ethtool_ops::get_rx_ntuple, anyway.

The fallback implementation is totally bogus, too.  Maximum of 1024
filters?  Erm, sfc can handle more than that.  And doing complex string
formatting in the kernel, even though all the parsing is in ethtool?

Please, let's write off ETHTOOL_GRXNTUPLE as a failed experiment and
replace it with a command that behaves more like ETHTOOL_GRXCLSRLALL.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* [RFC] ECN and IP defragmentation
From: Eric Dumazet @ 2011-01-05  0:13 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

Hi David

It seems ip_fragment.c doesnt comply with RFC3168, section 5.3

5.3.  Fragmentation

   ECN-capable packets MAY have the DF (Don't Fragment) bit set.
   Reassembly of a fragmented packet MUST NOT lose indications of
   congestion.  In other words, if any fragment of an IP packet to be
   reassembled has the CE codepoint set, then one of two actions MUST be
   taken:

      * Set the CE codepoint on the reassembled packet.  However, this
        MUST NOT occur if any of the other fragments contributing to
        this reassembly carries the Not-ECT codepoint.

      * The packet is dropped, instead of being reassembled, for any
        other reason.

Should we fix this ?

(I tried to use ECN on UDP messages and noticed this problem)

Thanks

^ permalink raw reply

* [RFC] sched: CHOKe packet scheduler
From: Stephen Hemminger @ 2011-01-05  0:29 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

This implements the CHOKe packet scheduler based on the existing
Linux RED scheduler based on the algorithm described in the paper.
Configuration is the same as RED; only the name changes.

The core idea is:
  For every packet arrival:
  	Calculate Qave
	if (Qave < minth) {
	   Queue the new packet
	}
	Else {
	     Select randomly a packet from the queue for their flow id
	     Compare arriving packet with a randomly selected packet.
	     If they have the same flow id {
	     	Drop both the packets
	     }
	     Else {
	     	  if (Qave ≥ maxth) {
		     Calculate the dropping probability pa
		     Drop the packet with probability pa
		  }
		  Else {
		     Drop the new packet
		  }
	     }
       }

This an early access version.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

---
 net/sched/Kconfig     |   11 +
 net/sched/Makefile    |    1 
 net/sched/sch_choke.c |  364 ++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 376 insertions(+)

--- a/net/sched/Kconfig	2011-01-04 16:25:18.000000000 -0800
+++ b/net/sched/Kconfig	2011-01-04 16:26:02.335973715 -0800
@@ -205,6 +205,17 @@ config NET_SCH_DRR
 
 	  If unsure, say N.
 
+config NET_SCH_CHOKE
+	tristate "CHOose and Keep responsive flow scheduler (CHOKE)"
+	help
+	  Say Y here if you want to use the CHOKe packet scheduler (CHOose
+	  and Keep for responsive flows, CHOose and Kill for unresponsive
+	  flows). This is a variation of RED which trys to penalize flows
+	  that monopolize the queue.
+
+	  To compile this code as a module, choose M here: the
+	  module will be called sch_choke.
+
 config NET_SCH_INGRESS
 	tristate "Ingress Qdisc"
 	depends on NET_CLS_ACT
--- a/net/sched/Makefile	2011-01-04 16:25:18.000000000 -0800
+++ b/net/sched/Makefile	2011-01-04 16:26:16.048938937 -0800
@@ -32,6 +32,7 @@ obj-$(CONFIG_NET_SCH_MULTIQ)	+= sch_mult
 obj-$(CONFIG_NET_SCH_ATM)	+= sch_atm.o
 obj-$(CONFIG_NET_SCH_NETEM)	+= sch_netem.o
 obj-$(CONFIG_NET_SCH_DRR)	+= sch_drr.o
+obj-$(CONFIG_NET_SCH_CHOKE)	+= sch_choke.o
 obj-$(CONFIG_NET_CLS_U32)	+= cls_u32.o
 obj-$(CONFIG_NET_CLS_ROUTE4)	+= cls_route.o
 obj-$(CONFIG_NET_CLS_FW)	+= cls_fw.o
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ b/net/sched/sch_choke.c	2011-01-04 16:25:33.913971468 -0800
@@ -0,0 +1,364 @@
+/*
+ * net/sched/sch_choke.c	CHOKE scheduler
+ *
+ * Copyright (c) 2011 Stephen Hemminger <shemminger@vyatta.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * version 2 as published by the Free Software Foundation.
+ *
+ */
+
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/skbuff.h>
+#include <linux/ipv6.h>
+#include <linux/jhash.h>
+#include <net/pkt_sched.h>
+#include <net/ip.h>
+#include <net/red.h>
+#include <net/ipv6.h>
+
+/*	CHOKe stateless AQM for fair bandwidth allocation
+        =================================================
+
+	Source:
+	R. Pan, B. Prabhakar, and K. Psounis, "CHOKe, A Stateless
+	Active Queue Management Scheme for Approximating Fair Bandwidth Allocation",
+	IEEE INFOCOM, 2000.
+
+	A. Tang, J. Wang, S. Low, "Understanding CHOKe: Throughput and Spatial
+	Characteristics", IEEE/ACM Transactions on Networking, 2004
+
+	ADVANTAGE:
+	- Penalizes unfair flows
+	- Random drop provide gradual feedback
+
+	DRAWBACKS:
+	- Small queue for single flow
+	- Can be gamed by opening lots of connections
+	- Hard to get correct paremeters (same problem as RED)
+
+ */
+
+struct choke_sched_data
+{
+	u32		  limit;
+	unsigned char	  flags;
+
+	struct red_parms  parms;
+	struct red_stats  stats;
+};
+
+/* Select a packet at random from the list.
+ * Same caveats as skb_peek.
+ */
+static struct sk_buff *skb_peek_random(struct sk_buff_head *list)
+{
+	struct sk_buff *skb = list->next;
+	unsigned int idx = net_random() % list->qlen;
+
+	while (skb && idx-- > 0)
+		skb = skb->next;
+
+	return skb;
+}
+
+/* Given IP header and size find src/dst port pair */
+static inline u32 get_ports(const void *hdr, size_t hdr_size, int offset)
+{
+	return *(u32 *)(hdr + hdr_size + offset);
+}
+
+
+static bool same_flow(struct sk_buff *nskb, const struct sk_buff *oskb)
+{
+	if (nskb->protocol != oskb->protocol)
+		return false;
+
+	switch (nskb->protocol) {
+	case htons(ETH_P_IP):
+	{
+		const struct iphdr *iph1, *iph2;
+		int poff;
+
+		if (!pskb_network_may_pull(nskb, sizeof(*iph1)))
+			return false;
+
+		iph1 = ip_hdr(nskb);
+		iph2 = ip_hdr(oskb);
+
+		if (iph1->protocol != iph2->protocol ||
+		    iph1->daddr != iph2->daddr ||
+		    iph1->saddr != iph2->saddr)
+			return false;
+
+		/* Be hostile to new fragmented packets */
+		if (iph1->frag_off & htons(IP_MF|IP_OFFSET))
+			return true;
+
+		if (iph2->frag_off & htons(IP_MF|IP_OFFSET))
+			return false;
+
+		poff = proto_ports_offset(iph1->protocol);
+		if (poff >= 0 &&
+		    pskb_network_may_pull(nskb, iph1->ihl * 4 + 4 + poff)) {
+			iph1 = ip_hdr(nskb);
+
+			return get_ports(iph1, iph1->ihl * 4, poff)
+				== get_ports(iph2, iph2->ihl * 4, poff);
+		}
+
+		return false;
+	}
+
+	case htons(ETH_P_IPV6):
+	{
+		const struct ipv6hdr *iph1, *iph2;
+		int poff;
+
+		if (!pskb_network_may_pull(nskb, sizeof(*iph1)))
+			return false;
+
+		iph1 = ipv6_hdr(nskb);
+		iph2 = ipv6_hdr(oskb);
+
+		if (iph1->nexthdr != iph2->nexthdr ||
+		    ipv6_addr_cmp(&iph1->daddr, &iph2->daddr) != 0 ||
+		    ipv6_addr_cmp(&iph1->saddr, &iph2->saddr) != 0)
+			return false;
+
+		poff = proto_ports_offset(iph1->nexthdr);
+		if (poff >= 0 &&
+		    pskb_network_may_pull(nskb, sizeof(*iph1) + 4 + poff)) {
+			iph1 = ipv6_hdr(nskb);
+
+			return get_ports(iph1, sizeof(*iph1), poff)
+				== get_ports(iph2, sizeof(*iph2), poff);
+		}
+		return false;
+	}
+	default:
+		return false;
+	}
+
+}
+
+/*
+ * Decide what to do with new packet based on queue size.
+ * returns 1 if packet should be admitted
+ *         0 if packet should be dropped
+ */
+static int choke_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+	struct red_parms *p = &q->parms;
+
+	p->qavg = red_calc_qavg(p, skb_queue_len(&sch->q));
+	if (red_is_idling(p))
+		red_end_of_idle_period(p);
+
+	if (p->qavg <= p->qth_min)
+		p->qcount = -1;
+	else {
+		struct sk_buff *oskb;
+
+		/* Draw a packet at random from queue */
+		oskb = skb_peek_random(&sch->q);
+
+		/* Both packets from same flow? */
+		if (same_flow(skb, oskb)) {
+			/* Drop both packets */
+			__skb_unlink(oskb, &sch->q);
+			qdisc_drop(oskb, sch);
+			goto congestion_drop;
+		}
+
+		if (p->qavg > p->qth_max) {
+			p->qcount = -1;
+
+			sch->qstats.overlimits++;
+			q->stats.forced_drop++;
+			goto congestion_drop;
+		}
+
+		if (++p->qcount) {
+			if (red_mark_probability(p, p->qavg)) {
+				p->qcount = 0;
+				p->qR = red_random(p);
+
+				sch->qstats.overlimits++;
+				q->stats.prob_drop++;
+				goto congestion_drop;
+			}
+		} else
+			p->qR = red_random(p);
+	}
+
+	/* Admit new packet */
+	if (likely(skb_queue_len(&sch->q) < q->limit))
+		return qdisc_enqueue_tail(skb, sch);
+
+	q->stats.pdrop++;
+	sch->qstats.drops++;
+	kfree_skb(skb);
+	return NET_XMIT_DROP;
+
+ congestion_drop:
+	qdisc_drop(skb, sch);
+	return NET_XMIT_CN;
+}
+
+static struct sk_buff *choke_dequeue(struct Qdisc* sch)
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+	struct sk_buff *skb;
+
+	skb = qdisc_dequeue_head(sch);
+	if (!skb) {
+		if (!red_is_idling(&q->parms))
+			red_start_of_idle_period(&q->parms);
+	}
+
+	return skb;
+}
+
+static unsigned int choke_drop(struct Qdisc* sch)
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+	unsigned int len;
+
+	len = qdisc_queue_drop(sch);
+
+	if (len > 0)
+		q->stats.other++;
+	else {
+		if (!red_is_idling(&q->parms))
+			red_start_of_idle_period(&q->parms);
+	}
+
+	return len;
+}
+
+static void choke_reset(struct Qdisc* sch)
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+
+	red_restart(&q->parms);
+}
+
+static const struct nla_policy choke_policy[TCA_RED_MAX + 1] = {
+	[TCA_RED_PARMS]	= { .len = sizeof(struct tc_red_qopt) },
+	[TCA_RED_STAB]	= { .len = RED_STAB_SIZE },
+};
+
+static int choke_change(struct Qdisc *sch, struct nlattr *opt)
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+	struct nlattr *tb[TCA_RED_MAX + 1];
+	struct tc_red_qopt *ctl;
+	int err;
+
+	if (opt == NULL)
+		return -EINVAL;
+
+	err = nla_parse_nested(tb, TCA_RED_MAX, opt, choke_policy);
+	if (err < 0)
+		return err;
+
+	if (tb[TCA_RED_PARMS] == NULL ||
+	    tb[TCA_RED_STAB] == NULL)
+		return -EINVAL;
+
+	ctl = nla_data(tb[TCA_RED_PARMS]);
+
+	sch_tree_lock(sch);
+	q->flags = ctl->flags;
+	q->limit = ctl->limit;
+
+	red_set_parms(&q->parms, ctl->qth_min, ctl->qth_max, ctl->Wlog,
+		      ctl->Plog, ctl->Scell_log,
+		      nla_data(tb[TCA_RED_STAB]));
+
+	if (skb_queue_empty(&sch->q))
+		red_end_of_idle_period(&q->parms);
+
+	sch_tree_unlock(sch);
+	return 0;
+}
+
+static int choke_init(struct Qdisc* sch, struct nlattr *opt)
+{
+	return choke_change(sch, opt);
+}
+
+static int choke_dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+	struct nlattr *opts = NULL;
+	struct tc_red_qopt opt = {
+		.limit		= q->limit,
+		.flags		= q->flags,
+		.qth_min	= q->parms.qth_min >> q->parms.Wlog,
+		.qth_max	= q->parms.qth_max >> q->parms.Wlog,
+		.Wlog		= q->parms.Wlog,
+		.Plog		= q->parms.Plog,
+		.Scell_log	= q->parms.Scell_log,
+	};
+
+	opts = nla_nest_start(skb, TCA_OPTIONS);
+	if (opts == NULL)
+		goto nla_put_failure;
+
+	NLA_PUT(skb, TCA_RED_PARMS, sizeof(opt), &opt);
+	return nla_nest_end(skb, opts);
+
+nla_put_failure:
+	nla_nest_cancel(skb, opts);
+	return -EMSGSIZE;
+}
+
+static int choke_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+	struct tc_red_xstats st = {
+		.early	= q->stats.prob_drop + q->stats.forced_drop,
+		.pdrop	= q->stats.pdrop,
+		.other	= q->stats.other,
+		.marked	= q->stats.prob_mark + q->stats.forced_mark,
+	};
+
+	return gnet_stats_copy_app(d, &st, sizeof(st));
+}
+
+static struct Qdisc_ops choke_qdisc_ops __read_mostly = {
+	.id		=	"choke",
+	.priv_size	=	sizeof(struct choke_sched_data),
+
+	.enqueue	=	choke_enqueue,
+	.dequeue	=	choke_dequeue,
+	.peek		=	qdisc_peek_head,
+	.drop		=	choke_drop,
+	.init		=	choke_init,
+	.reset		=	choke_reset,
+	.change		=	choke_change,
+	.dump		=	choke_dump,
+	.dump_stats	=	choke_dump_stats,
+	.owner		=	THIS_MODULE,
+};
+
+static int __init choke_module_init(void)
+{
+	return register_qdisc(&choke_qdisc_ops);
+}
+
+static void __exit choke_module_exit(void)
+{
+	unregister_qdisc(&choke_qdisc_ops);
+}
+
+module_init(choke_module_init)
+module_exit(choke_module_exit)
+
+MODULE_LICENSE("GPL");

^ permalink raw reply

* [PATCH v2] net: Allow ethtool to set interface in loopback mode.
From: Mahesh Bandewar @ 2011-01-05  0:30 UTC (permalink / raw)
  To: David Miller, Ben Hutchings, Laurent Chavey, Tom Herbert
  Cc: netdev, Mahesh Bandewar
In-Reply-To: <AANLkTikW24AcLEAioYtCwuQOoVnTrqVoAaZw3sksZ_jU@mail.gmail.com>

This patch enables ethtool to set the loopback mode on a given interface.
By configuring the interface in loopback mode in conjunction with a policy
route / rule, a userland application can stress the egress / ingress path
exposing the flows of the change in progress and potentially help developer(s)
understand the impact of those changes without even sending a packet out
on the network.

Following set of commands illustrates one such example -
	a) ip -4 addr add 192.168.1.1/24 dev eth1
	b) ip -4 rule add from all iif eth1 lookup 250
	c) ip -4 route add local 0/0 dev lo proto kernel scope host table 250
	d) arp -Ds 192.168.1.100 eth1
	e) arp -Ds 192.168.1.200 eth1
	f) sysctl -w net.ipv4.ip_nonlocal_bind=1
	g) sysctl -w net.ipv4.conf.all.accept_local=1
	# Assuming that the machine has 8 cores
	h) taskset 000f netserver -L 192.168.1.200
	i) taskset 00f0 netperf -t TCP_CRR -L 192.168.1.100 -H 192.168.1.200 -l 30

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>
---
 include/linux/ethtool.h |   16 ++++++++++++++++
 net/core/ethtool.c      |   39 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 55 insertions(+), 0 deletions(-)

diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index 6628a50..c036347 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -616,6 +616,18 @@ void ethtool_ntuple_flush(struct net_device *dev);
  *	Should validate the magic field.  Don't need to check len for zero
  *	or wraparound.  Update len to the amount written.  Returns an error
  *	or zero.
+ *
+ * get_loopback:
+ * set_loopback:
+ *	These are the driver specific get / set methods to report / enable-
+ *	disable loopback mode. The idea is to stress test the ingress/egress
+ *	paths by enabling this mode. There are multiple places this could be
+ *	done and choice of place will most likely be affected by the device
+ *	capabilities. So as a guiding principle; select a place to implement 
+ *	loopback mode as close to the host as possible. This would maximize
+ *	the soft-path length and maintain parity in terms of comparison with
+ *	different set of drivers.
+ *
  */
 struct ethtool_ops {
 	int	(*get_settings)(struct net_device *, struct ethtool_cmd *);
@@ -678,6 +690,8 @@ struct ethtool_ops {
 				  struct ethtool_rxfh_indir *);
 	int	(*set_rxfh_indir)(struct net_device *,
 				  const struct ethtool_rxfh_indir *);
+	int	(*get_loopback)(struct net_device *, u32 *);
+	int	(*set_loopback)(struct net_device *, u32);
 };
 #endif /* __KERNEL__ */
 
@@ -741,6 +755,8 @@ struct ethtool_ops {
 #define ETHTOOL_GSSET_INFO	0x00000037 /* Get string set info */
 #define ETHTOOL_GRXFHINDIR	0x00000038 /* Get RX flow hash indir'n table */
 #define ETHTOOL_SRXFHINDIR	0x00000039 /* Set RX flow hash indir'n table */
+#define ETHTOOL_SLOOPBACK	0x0000003a /* Enable / Disable Loopback */
+#define ETHTOOL_GLOOPBACK	0x0000003b /* Get Loopback status */
 
 /* compatibility with older code */
 #define SPARC_ETH_GSET		ETHTOOL_GSET
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 956a9f4..5c87c93 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -1434,6 +1434,39 @@ static noinline_for_stack int ethtool_flash_device(struct net_device *dev,
 	return dev->ethtool_ops->flash_device(dev, &efl);
 }
 
+static int ethtool_set_loopback(struct net_device *dev, void __user *useraddr)
+{
+	struct ethtool_value edata;
+	const struct ethtool_ops *ops = dev->ethtool_ops;
+
+	if (!ops || !ops->set_loopback)
+		return -EOPNOTSUPP;
+
+	if (copy_from_user(&edata, useraddr, sizeof(edata)))
+		return -EFAULT;
+
+	return ops->set_loopback(dev, edata.data);
+}
+
+static int ethtool_get_loopback(struct net_device *dev, void __user *useraddr)
+{
+	struct ethtool_value edata;
+	const struct ethtool_ops *ops = dev->ethtool_ops;
+	int err;
+
+	if (!ops || !ops->get_loopback)
+		return -EOPNOTSUPP;
+
+	err = ops->get_loopback(dev, &edata.data);	
+	if (err)
+		return (err);
+
+	if (copy_to_user(useraddr, &edata, sizeof(edata)))
+		return -EFAULT;
+
+	return 0;
+}
+
 /* The main entry point in this file.  Called from net/core/dev.c */
 
 int dev_ethtool(struct net *net, struct ifreq *ifr)
@@ -1678,6 +1711,12 @@ int dev_ethtool(struct net *net, struct ifreq *ifr)
 	case ETHTOOL_SRXFHINDIR:
 		rc = ethtool_set_rxfh_indir(dev, useraddr);
 		break;
+	case ETHTOOL_SLOOPBACK:
+		rc = ethtool_set_loopback(dev, useraddr);
+		break;
+	case ETHTOOL_GLOOPBACK:
+		rc = ethtool_get_loopback(dev, useraddr);
+		break;
 	default:
 		rc = -EOPNOTSUPP;
 	}
-- 
1.7.3.1


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox