Netdev List
 help / color / mirror / Atom feed
* [PATCH] ipv6: Silence privacy extensions initialization
From: Romain Francoise @ 2011-01-17 17:59 UTC (permalink / raw)
  To: David S. Miller
  Cc: Alexey Kuznetsov, Pekka Savola (ipv6), James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, netdev

When a network namespace is created (via CLONE_NEWNET), the loopback
interface is automatically added to the new namespace, triggering a
printk in ipv6_add_dev() if CONFIG_IPV6_PRIVACY is set.

This is problematic for applications which use CLONE_NEWNET as
part of a sandbox, like Chromium's suid sandbox or recent versions of
vsftpd. On a busy machine, it can lead to thousands of useless
"lo: Disabled Privacy Extensions" messages appearing in dmesg.

It's easy enough to check the status of privacy extensions via the
use_tempaddr sysctl, so just removing the printk seems like the most
sensible solution.

Signed-off-by: Romain Francoise <romain@orebokech.com>
---
 net/ipv6/addrconf.c |    3 ---
 1 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 5b189c9..24a1cf1 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -420,9 +420,6 @@ static struct inet6_dev * ipv6_add_dev(struct net_device *dev)
 	    dev->type == ARPHRD_TUNNEL6 ||
 	    dev->type == ARPHRD_SIT ||
 	    dev->type == ARPHRD_NONE) {
-		printk(KERN_INFO
-		       "%s: Disabled Privacy Extensions\n",
-		       dev->name);
 		ndev->cnf.use_tempaddr = -1;
 	} else {
 		in6_dev_hold(ndev);
-- 
1.7.2.3


^ permalink raw reply related

* [net-next-2.6 PATCH v8 0/2] Series short description
From: John Fastabend @ 2011-01-17 18:05 UTC (permalink / raw)
  To: davem
  Cc: bhutchings, jarkao2, hadi, eric.dumazet, shemminger, tgraf,
	nhorman, netdev

Changed from v7, Ben Hutchings wants to actively manage queues
from the ndo_setup_tc() routine. Presumably to change the number
of tx queues creating n queues per traffic class. I agreed so
this version updates the patch to work correctly in this case.

Previously I was calling ndo_setup_tc from netif_set_num_tx_queues(),
to verify the queue offset/count but this would break any
queue management so this is removed. Now the mappings are
invalidated if the mapping requires it and it is expected
that the netdevice configured a valid mapping so calling
back into the driver is not needed. Validation is still
required in the case of a netdevice that does not
implement ndo_setup_tc() or to recover in cases where
ndo_open is adjusting number of queues do to resource
constraints.

Finally added some documentation to netdevice.h regarding
the new ops routine.

---

John Fastabend (2):
      net_sched: implement a root container qdisc sch_mqprio
      net: implement mechanism for HW based QOS


 include/linux/netdevice.h |   68 +++++++
 include/linux/pkt_sched.h |   12 +
 net/core/dev.c            |   55 ++++++
 net/sched/Kconfig         |   12 +
 net/sched/Makefile        |    1 
 net/sched/sch_generic.c   |    4 
 net/sched/sch_mqprio.c    |  417 +++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 568 insertions(+), 1 deletions(-)
 create mode 100644 net/sched/sch_mqprio.c

-- 
Signature

^ permalink raw reply

* [net-next-2.6 PATCH v8 1/2] net: implement mechanism for HW based QOS
From: John Fastabend @ 2011-01-17 18:06 UTC (permalink / raw)
  To: davem
  Cc: bhutchings, jarkao2, hadi, eric.dumazet, shemminger, tgraf,
	nhorman, netdev
In-Reply-To: <20110117175542.29543.38690.stgit@jf-dev1-dcblab>

This patch provides a mechanism for lower layer devices to
steer traffic using skb->priority to tx queues. This allows
for hardware based QOS schemes to use the default qdisc without
incurring the penalties related to global state and the qdisc
lock. While reliably receiving skbs on the correct tx ring
to avoid head of line blocking resulting from shuffling in
the LLD. Finally, all the goodness from txq caching and xps/rps
can still be leveraged.

Many drivers and hardware exist with the ability to implement
QOS schemes in the hardware but currently these drivers tend
to rely on firmware to reroute specific traffic, a driver
specific select_queue or the queue_mapping action in the
qdisc.

By using select_queue for this drivers need to be updated for
each and every traffic type and we lose the goodness of much
of the upstream work. Firmware solutions are inherently
inflexible. And finally if admins are expected to build a
qdisc and filter rules to steer traffic this requires knowledge
of how the hardware is currently configured. The number of tx
queues and the queue offsets may change depending on resources.
Also this approach incurs all the overhead of a qdisc with filters.

With the mechanism in this patch users can set skb priority using
expected methods ie setsockopt() or the stack can set the priority
directly. Then the skb will be steered to the correct tx queues
aligned with hardware QOS traffic classes. In the normal case with
single traffic class and all queues in this class everything
works as is until the LLD enables multiple tcs.

To steer the skb we mask out the lower 4 bits of the priority
and allow the hardware to configure upto 15 distinct classes
of traffic. This is expected to be sufficient for most applications
at any rate it is more then the 8021Q spec designates and is
equal to the number of prio bands currently implemented in
the default qdisc.

This in conjunction with a userspace application such as
lldpad can be used to implement 8021Q transmission selection
algorithms one of these algorithms being the extended transmission
selection algorithm currently being used for DCB.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---

 include/linux/netdevice.h |   68 +++++++++++++++++++++++++++++++++++++++++++++
 net/core/dev.c            |   55 ++++++++++++++++++++++++++++++++++++
 2 files changed, 122 insertions(+), 1 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 0f6b1c9..c973582 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -646,6 +646,14 @@ struct xps_dev_maps {
     (nr_cpu_ids * sizeof(struct xps_map *)))
 #endif /* CONFIG_XPS */
 
+#define TC_MAX_QUEUE	16
+#define TC_BITMASK	15
+/* HW offloaded queuing disciplines txq count and offset maps */
+struct netdev_tc_txq {
+	u16 count;
+	u16 offset;
+};
+
 /*
  * This structure defines the management hooks for network devices.
  * The following hooks can be defined; unless noted otherwise, they are
@@ -756,6 +764,11 @@ struct xps_dev_maps {
  * int (*ndo_set_vf_port)(struct net_device *dev, int vf,
  *			  struct nlattr *port[]);
  * int (*ndo_get_vf_port)(struct net_device *dev, int vf, struct sk_buff *skb);
+ * int (*ndo_setup_tc)(struct net_device *dev, u8 tc)
+ * 	Called to setup 'tc' number of traffic classes in the net device. This
+ * 	is always called from the stack with the rtnl lock held and netif tx
+ * 	queues stopped. This allows the netdevice to perform queue management
+ * 	safely.
  */
 #define HAVE_NET_DEVICE_OPS
 struct net_device_ops {
@@ -814,6 +827,7 @@ struct net_device_ops {
 						   struct nlattr *port[]);
 	int			(*ndo_get_vf_port)(struct net_device *dev,
 						   int vf, struct sk_buff *skb);
+	int			(*ndo_setup_tc)(struct net_device *dev, u8 tc);
 #if defined(CONFIG_FCOE) || defined(CONFIG_FCOE_MODULE)
 	int			(*ndo_fcoe_enable)(struct net_device *dev);
 	int			(*ndo_fcoe_disable)(struct net_device *dev);
@@ -1146,6 +1160,9 @@ struct net_device {
 	/* Data Center Bridging netlink ops */
 	const struct dcbnl_rtnl_ops *dcbnl_ops;
 #endif
+	u8 num_tc;
+	struct netdev_tc_txq tc_to_txq[TC_MAX_QUEUE];
+	u8 prio_tc_map[TC_BITMASK + 1];
 
 #if defined(CONFIG_FCOE) || defined(CONFIG_FCOE_MODULE)
 	/* max exchange id for FCoE LRO by ddp */
@@ -1162,6 +1179,57 @@ struct net_device {
 #define	NETDEV_ALIGN		32
 
 static inline
+int netdev_get_prio_tc_map(const struct net_device *dev, u32 prio)
+{
+	return dev->prio_tc_map[prio & TC_BITMASK];
+}
+
+static inline
+int netdev_set_prio_tc_map(struct net_device *dev, u8 prio, u8 tc)
+{
+	if (tc >= dev->num_tc)
+		return -EINVAL;
+
+	dev->prio_tc_map[prio & TC_BITMASK] = tc & TC_BITMASK;
+	return 0;
+}
+
+static inline
+void netdev_reset_tc(struct net_device *dev)
+{
+	dev->num_tc = 0;
+	memset(dev->tc_to_txq, 0, sizeof(dev->tc_to_txq));
+	memset(dev->prio_tc_map, 0, sizeof(dev->prio_tc_map));
+}
+
+static inline
+int netdev_set_tc_queue(struct net_device *dev, u8 tc, u16 count, u16 offset)
+{
+	if (tc >= dev->num_tc)
+		return -EINVAL;
+
+	dev->tc_to_txq[tc].count = count;
+	dev->tc_to_txq[tc].offset = offset;
+	return 0;
+}
+
+static inline
+int netdev_set_num_tc(struct net_device *dev, u8 num_tc)
+{
+	if (num_tc > TC_MAX_QUEUE)
+		return -EINVAL;
+
+	dev->num_tc = num_tc;
+	return 0;
+}
+
+static inline
+int netdev_get_num_tc(struct net_device *dev)
+{
+	return dev->num_tc;
+}
+
+static inline
 struct netdev_queue *netdev_get_tx_queue(const struct net_device *dev,
 					 unsigned int index)
 {
diff --git a/net/core/dev.c b/net/core/dev.c
index a215269..2348b98 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1593,6 +1593,48 @@ static void dev_queue_xmit_nit(struct sk_buff *skb, struct net_device *dev)
 	rcu_read_unlock();
 }
 
+/* netif_setup_tc - Handle tc mappings on real_num_tx_queues change
+ * @dev: Network device
+ * @txq: number of queues available
+ *
+ * If real_num_tx_queues is changed the tc mappings may no longer be
+ * valid. To resolve this verify the tc mapping remains valid and if
+ * not NULL the mapping. With no priorities mapping to this
+ * offset/count pair it will no longer be used. In the worst case TC0
+ * is invalid nothing can be done so disable priority mappings. If is
+ * expected that drivers will fix this mapping if they can before
+ * calling netif_set_real_num_tx_queues.
+ */
+void netif_setup_tc(struct net_device *dev, unsigned int txq)
+{
+	int i;
+	struct netdev_tc_txq *tc = &dev->tc_to_txq[0];
+
+	/* If TC0 is invalidated disable TC mapping */
+	if (tc->offset + tc->count > txq) {
+		pr_warning("Number of in use tx queues changed "
+			   "invalidating tc mappings. Priority "
+			   "traffic classification disabled!\n");
+		dev->num_tc = 0;
+		return;
+	}
+
+	/* Invalidated prio to tc mappings set to TC0 */
+	for (i = 1; i < TC_BITMASK + 1; i++) {
+		int q = netdev_get_prio_tc_map(dev, i);
+
+		tc = &dev->tc_to_txq[q];
+		if (tc->offset + tc->count > txq) {
+			pr_warning("Number of in use tx queues "
+				   "changed. Priority %i to tc "
+				   "mapping %i is no longer valid "
+				   "setting map to 0\n",
+				   i, q);
+			netdev_set_prio_tc_map(dev, i, 0);
+		}
+	}
+}
+
 /*
  * Routine to help set real_num_tx_queues. To avoid skbs mapped to queues
  * greater then real_num_tx_queues stale skbs on the qdisc must be flushed.
@@ -1612,6 +1654,9 @@ int netif_set_real_num_tx_queues(struct net_device *dev, unsigned int txq)
 		if (rc)
 			return rc;
 
+		if (dev->num_tc)
+			netif_setup_tc(dev, txq);
+
 		if (txq < dev->real_num_tx_queues)
 			qdisc_reset_all_tx_gt(dev, txq);
 	}
@@ -2165,6 +2210,8 @@ u16 __skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb,
 		  unsigned int num_tx_queues)
 {
 	u32 hash;
+	u16 qoffset = 0;
+	u16 qcount = num_tx_queues;
 
 	if (skb_rx_queue_recorded(skb)) {
 		hash = skb_get_rx_queue(skb);
@@ -2173,13 +2220,19 @@ u16 __skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb,
 		return hash;
 	}
 
+	if (dev->num_tc) {
+		u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
+		qoffset = dev->tc_to_txq[tc].offset;
+		qcount = dev->tc_to_txq[tc].count;
+	}
+
 	if (skb->sk && skb->sk->sk_hash)
 		hash = skb->sk->sk_hash;
 	else
 		hash = (__force u16) skb->protocol ^ skb->rxhash;
 	hash = jhash_1word(hash, hashrnd);
 
-	return (u16) (((u64) hash * num_tx_queues) >> 32);
+	return (u16) (((u64) hash * qcount) >> 32) + qoffset;
 }
 EXPORT_SYMBOL(__skb_tx_hash);
 


^ permalink raw reply related

* [net-next-2.6 PATCH v8 2/2] net_sched: implement a root container qdisc sch_mqprio
From: John Fastabend @ 2011-01-17 18:06 UTC (permalink / raw)
  To: davem
  Cc: bhutchings, jarkao2, hadi, eric.dumazet, shemminger, tgraf,
	nhorman, netdev
In-Reply-To: <20110117175542.29543.38690.stgit@jf-dev1-dcblab>

This implements a mqprio queueing discipline that by default creates
a pfifo_fast qdisc per tx queue and provides the needed configuration
interface.

Using the mqprio qdisc the number of tcs currently in use along
with the range of queues alloted to each class can be configured. By
default skbs are mapped to traffic classes using the skb priority.
This mapping is configurable.

Configurable parameters,

struct tc_mqprio_qopt {
	__u8    num_tc;
	__u8    prio_tc_map[TC_BITMASK + 1];
	__u8    hw;
	__u16   count[TC_MAX_QUEUE];
	__u16   offset[TC_MAX_QUEUE];
};

Here the count/offset pairing give the queue alignment and the
prio_tc_map gives the mapping from skb->priority to tc.

The hw bit determines if the hardware should configure the count
and offset values. If the hardware bit is set then the operation
will fail if the hardware does not implement the ndo_setup_tc
operation. This is to avoid undetermined states where the hardware
may or may not control the queue mapping. Also minimal bounds
checking is done on the count/offset to verify a queue does not
exceed num_tx_queues and that queue ranges do not overlap. Otherwise
it is left to user policy or hardware configuration to create
useful mappings.

It is expected that hardware QOS schemes can be implemented by
creating appropriate mappings of queues in ndo_tc_setup().

One expected use case is drivers will use the ndo_setup_tc to map
queue ranges onto 802.1Q traffic classes. This provides a generic
mechanism to map network traffic onto these traffic classes and
removes the need for lower layer drivers to know specifics about
traffic types.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---

 include/linux/pkt_sched.h |   12 +
 net/sched/Kconfig         |   12 +
 net/sched/Makefile        |    1 
 net/sched/sch_generic.c   |    4 
 net/sched/sch_mqprio.c    |  417 +++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 446 insertions(+), 0 deletions(-)
 create mode 100644 net/sched/sch_mqprio.c

diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index 2cfa4bc..776cd93 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -481,4 +481,16 @@ struct tc_drr_stats {
 	__u32	deficit;
 };
 
+/* MQPRIO */
+#define TC_QOPT_BITMASK 15
+#define TC_QOPT_MAX_QUEUE 16
+
+struct tc_mqprio_qopt {
+	__u8	num_tc;
+	__u8	prio_tc_map[TC_QOPT_BITMASK + 1];
+	__u8	hw;
+	__u16	count[TC_QOPT_MAX_QUEUE];
+	__u16	offset[TC_QOPT_MAX_QUEUE];
+};
+
 #endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index a36270a..f52f5eb 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -205,6 +205,18 @@ config NET_SCH_DRR
 
 	  If unsure, say N.
 
+config NET_SCH_MQPRIO
+	tristate "Multi-queue priority scheduler (MQPRIO)"
+	help
+	  Say Y here if you want to use the Multi-queue Priority scheduler.
+	  This scheduler allows QOS to be offloaded on NICs that have support
+	  for offloading QOS schedulers.
+
+	  To compile this driver as a module, choose M here: the module will
+	  be called sch_mqprio.
+
+	  If unsure, say N.
+
 config NET_SCH_INGRESS
 	tristate "Ingress Qdisc"
 	depends on NET_CLS_ACT
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 960f5db..26ce681 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -32,6 +32,7 @@ obj-$(CONFIG_NET_SCH_MULTIQ)	+= sch_multiq.o
 obj-$(CONFIG_NET_SCH_ATM)	+= sch_atm.o
 obj-$(CONFIG_NET_SCH_NETEM)	+= sch_netem.o
 obj-$(CONFIG_NET_SCH_DRR)	+= sch_drr.o
+obj-$(CONFIG_NET_SCH_MQPRIO)	+= sch_mqprio.o
 obj-$(CONFIG_NET_CLS_U32)	+= cls_u32.o
 obj-$(CONFIG_NET_CLS_ROUTE4)	+= cls_route.o
 obj-$(CONFIG_NET_CLS_FW)	+= cls_fw.o
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 34dc598..723b278 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -540,6 +540,7 @@ struct Qdisc_ops pfifo_fast_ops __read_mostly = {
 	.dump		=	pfifo_fast_dump,
 	.owner		=	THIS_MODULE,
 };
+EXPORT_SYMBOL(pfifo_fast_ops);
 
 struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue,
 			  struct Qdisc_ops *ops)
@@ -674,6 +675,7 @@ struct Qdisc *dev_graft_qdisc(struct netdev_queue *dev_queue,
 
 	return oqdisc;
 }
+EXPORT_SYMBOL(dev_graft_qdisc);
 
 static void attach_one_default_qdisc(struct net_device *dev,
 				     struct netdev_queue *dev_queue,
@@ -761,6 +763,7 @@ void dev_activate(struct net_device *dev)
 		dev_watchdog_up(dev);
 	}
 }
+EXPORT_SYMBOL(dev_activate);
 
 static void dev_deactivate_queue(struct net_device *dev,
 				 struct netdev_queue *dev_queue,
@@ -840,6 +843,7 @@ void dev_deactivate(struct net_device *dev)
 	list_add(&dev->unreg_list, &single);
 	dev_deactivate_many(&single);
 }
+EXPORT_SYMBOL(dev_deactivate);
 
 static void dev_init_scheduler_queue(struct net_device *dev,
 				     struct netdev_queue *dev_queue,
diff --git a/net/sched/sch_mqprio.c b/net/sched/sch_mqprio.c
new file mode 100644
index 0000000..8620c65
--- /dev/null
+++ b/net/sched/sch_mqprio.c
@@ -0,0 +1,417 @@
+/*
+ * net/sched/sch_mqprio.c
+ *
+ * Copyright (c) 2010 John Fastabend <john.r.fastabend@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * version 2 as published by the Free Software Foundation.
+ */
+
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/skbuff.h>
+#include <net/netlink.h>
+#include <net/pkt_sched.h>
+#include <net/sch_generic.h>
+
+struct mqprio_sched {
+	struct Qdisc		**qdiscs;
+	int hw_owned;
+};
+
+static void mqprio_destroy(struct Qdisc *sch)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	struct mqprio_sched *priv = qdisc_priv(sch);
+	unsigned int ntx;
+
+	if (!priv->qdiscs)
+		return;
+
+	for (ntx = 0; ntx < dev->num_tx_queues && priv->qdiscs[ntx]; ntx++)
+		qdisc_destroy(priv->qdiscs[ntx]);
+
+	if (priv->hw_owned && dev->netdev_ops->ndo_setup_tc)
+		dev->netdev_ops->ndo_setup_tc(dev, 0);
+	else
+		netdev_set_num_tc(dev, 0);
+
+	kfree(priv->qdiscs);
+}
+
+static int mqprio_parse_opt(struct net_device *dev, struct tc_mqprio_qopt *qopt)
+{
+	int i, j;
+
+	/* Verify num_tc is not out of max range */
+	if (qopt->num_tc > TC_MAX_QUEUE)
+		return -EINVAL;
+
+	/* Verify priority mapping uses valid tcs */
+	for (i = 0; i < TC_BITMASK + 1; i++) {
+		if (qopt->prio_tc_map[i] >= qopt->num_tc)
+			return -EINVAL;
+	}
+
+	/* net_device does not support requested operation */
+	if (qopt->hw && !dev->netdev_ops->ndo_setup_tc)
+		return -EINVAL;
+
+	/* if hw owned qcount and qoffset are taken from LLD so
+	 * no reason to verify them here
+	 */
+	if (qopt->hw)
+		return 0;
+
+	for (i = 0; i < qopt->num_tc; i++) {
+		unsigned int last = qopt->offset[i] + qopt->count[i];
+
+		/* Verify the queue count is in tx range being equal to the
+		 * real_num_tx_queues indicates the last queue is in use.
+		 */
+		if (qopt->offset[i] >= dev->real_num_tx_queues ||
+		    !qopt->count[i] ||
+		    last > dev->real_num_tx_queues)
+			return -EINVAL;
+
+		/* Verify that the offset and counts do not overlap */
+		for (j = i + 1; j < qopt->num_tc; j++) {
+			if (last > qopt->offset[j])
+				return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
+static int mqprio_init(struct Qdisc *sch, struct nlattr *opt)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	struct mqprio_sched *priv = qdisc_priv(sch);
+	struct netdev_queue *dev_queue;
+	struct Qdisc *qdisc;
+	int i, err = -EOPNOTSUPP;
+	struct tc_mqprio_qopt *qopt = NULL;
+
+	BUILD_BUG_ON(TC_MAX_QUEUE != TC_QOPT_MAX_QUEUE);
+	BUILD_BUG_ON(TC_BITMASK != TC_QOPT_BITMASK);
+
+	if (sch->parent != TC_H_ROOT)
+		return -EOPNOTSUPP;
+
+	if (!netif_is_multiqueue(dev))
+		return -EOPNOTSUPP;
+
+	if (nla_len(opt) < sizeof(*qopt))
+		return -EINVAL;
+
+	qopt = nla_data(opt);
+	if (mqprio_parse_opt(dev, qopt))
+		return -EINVAL;
+
+	/* pre-allocate qdisc, attachment can't fail */
+	priv->qdiscs = kcalloc(dev->num_tx_queues, sizeof(priv->qdiscs[0]),
+			       GFP_KERNEL);
+	if (priv->qdiscs == NULL) {
+		err = -ENOMEM;
+		goto err;
+	}
+
+	for (i = 0; i < dev->num_tx_queues; i++) {
+		dev_queue = netdev_get_tx_queue(dev, i);
+		qdisc = qdisc_create_dflt(dev_queue, &pfifo_fast_ops,
+					  TC_H_MAKE(TC_H_MAJ(sch->handle),
+						    TC_H_MIN(i + 1)));
+		if (qdisc == NULL) {
+			err = -ENOMEM;
+			goto err;
+		}
+		qdisc->flags |= TCQ_F_CAN_BYPASS;
+		priv->qdiscs[i] = qdisc;
+	}
+
+	/* If the mqprio options indicate that hardware should own
+	 * the queue mapping then run ndo_setup_tc otherwise use the
+	 * supplied and verified mapping
+	 */
+	if (qopt->hw) {
+		priv->hw_owned = 1;
+		err = dev->netdev_ops->ndo_setup_tc(dev, qopt->num_tc);
+		if (err)
+			goto err;
+	} else {
+		netdev_set_num_tc(dev, qopt->num_tc);
+		for (i = 0; i < qopt->num_tc; i++)
+			netdev_set_tc_queue(dev, i,
+					    qopt->count[i], qopt->offset[i]);
+	}
+
+	/* Always use supplied priority mappings */
+	for (i = 0; i < TC_BITMASK + 1; i++)
+		netdev_set_prio_tc_map(dev, i, qopt->prio_tc_map[i]);
+
+	sch->flags |= TCQ_F_MQROOT;
+	return 0;
+
+err:
+	mqprio_destroy(sch);
+	return err;
+}
+
+static void mqprio_attach(struct Qdisc *sch)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	struct mqprio_sched *priv = qdisc_priv(sch);
+	struct Qdisc *qdisc;
+	unsigned int ntx;
+
+	/* Attach underlying qdisc */
+	for (ntx = 0; ntx < dev->num_tx_queues; ntx++) {
+		qdisc = priv->qdiscs[ntx];
+		qdisc = dev_graft_qdisc(qdisc->dev_queue, qdisc);
+		if (qdisc)
+			qdisc_destroy(qdisc);
+	}
+	kfree(priv->qdiscs);
+	priv->qdiscs = NULL;
+}
+
+static struct netdev_queue *mqprio_queue_get(struct Qdisc *sch,
+					     unsigned long cl)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	unsigned long ntx = cl - 1 - netdev_get_num_tc(dev);
+
+	if (ntx >= dev->num_tx_queues)
+		return NULL;
+	return netdev_get_tx_queue(dev, ntx);
+}
+
+static int mqprio_graft(struct Qdisc *sch, unsigned long cl, struct Qdisc *new,
+		    struct Qdisc **old)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	struct netdev_queue *dev_queue = mqprio_queue_get(sch, cl);
+
+	if (!dev_queue)
+		return -EINVAL;
+
+	if (dev->flags & IFF_UP)
+		dev_deactivate(dev);
+
+	*old = dev_graft_qdisc(dev_queue, new);
+
+	if (dev->flags & IFF_UP)
+		dev_activate(dev);
+
+	return 0;
+}
+
+static int mqprio_dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	struct mqprio_sched *priv = qdisc_priv(sch);
+	unsigned char *b = skb_tail_pointer(skb);
+	struct tc_mqprio_qopt opt;
+	struct Qdisc *qdisc;
+	unsigned int i;
+
+	sch->q.qlen = 0;
+	memset(&sch->bstats, 0, sizeof(sch->bstats));
+	memset(&sch->qstats, 0, sizeof(sch->qstats));
+
+	for (i = 0; i < dev->num_tx_queues; i++) {
+		qdisc = netdev_get_tx_queue(dev, i)->qdisc;
+		spin_lock_bh(qdisc_lock(qdisc));
+		sch->q.qlen		+= qdisc->q.qlen;
+		sch->bstats.bytes	+= qdisc->bstats.bytes;
+		sch->bstats.packets	+= qdisc->bstats.packets;
+		sch->qstats.qlen	+= qdisc->qstats.qlen;
+		sch->qstats.backlog	+= qdisc->qstats.backlog;
+		sch->qstats.drops	+= qdisc->qstats.drops;
+		sch->qstats.requeues	+= qdisc->qstats.requeues;
+		sch->qstats.overlimits	+= qdisc->qstats.overlimits;
+		spin_unlock_bh(qdisc_lock(qdisc));
+	}
+
+	opt.num_tc = netdev_get_num_tc(dev);
+	memcpy(opt.prio_tc_map, dev->prio_tc_map, sizeof(opt.prio_tc_map));
+	opt.hw = priv->hw_owned;
+
+	for (i = 0; i < netdev_get_num_tc(dev); i++) {
+		opt.count[i] = dev->tc_to_txq[i].count;
+		opt.offset[i] = dev->tc_to_txq[i].offset;
+	}
+
+	NLA_PUT(skb, TCA_OPTIONS, sizeof(opt), &opt);
+
+	return skb->len;
+nla_put_failure:
+	nlmsg_trim(skb, b);
+	return -1;
+}
+
+static struct Qdisc *mqprio_leaf(struct Qdisc *sch, unsigned long cl)
+{
+	struct netdev_queue *dev_queue = mqprio_queue_get(sch, cl);
+
+	if (!dev_queue)
+		return NULL;
+
+	return dev_queue->qdisc_sleeping;
+}
+
+static unsigned long mqprio_get(struct Qdisc *sch, u32 classid)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	unsigned int ntx = TC_H_MIN(classid);
+
+	if (ntx > dev->num_tx_queues + netdev_get_num_tc(dev))
+		return 0;
+	return ntx;
+}
+
+static void mqprio_put(struct Qdisc *sch, unsigned long cl)
+{
+}
+
+static int mqprio_dump_class(struct Qdisc *sch, unsigned long cl,
+			 struct sk_buff *skb, struct tcmsg *tcm)
+{
+	struct net_device *dev = qdisc_dev(sch);
+
+	if (cl <= netdev_get_num_tc(dev)) {
+		tcm->tcm_parent = TC_H_ROOT;
+		tcm->tcm_info = 0;
+	} else {
+		int i;
+		struct netdev_queue *dev_queue;
+
+		dev_queue = mqprio_queue_get(sch, cl);
+		tcm->tcm_parent = 0;
+		for (i = 0; i < netdev_get_num_tc(dev); i++) {
+			struct netdev_tc_txq tc = dev->tc_to_txq[i];
+			int q_idx = cl - netdev_get_num_tc(dev);
+
+			if (q_idx > tc.offset &&
+			    q_idx <= tc.offset + tc.count) {
+				tcm->tcm_parent =
+					TC_H_MAKE(TC_H_MAJ(sch->handle),
+						  TC_H_MIN(i + 1));
+				break;
+			}
+		}
+		tcm->tcm_info = dev_queue->qdisc_sleeping->handle;
+	}
+	tcm->tcm_handle |= TC_H_MIN(cl);
+	return 0;
+}
+
+static int mqprio_dump_class_stats(struct Qdisc *sch, unsigned long cl,
+			       struct gnet_dump *d)
+{
+	struct net_device *dev = qdisc_dev(sch);
+
+	if (cl <= netdev_get_num_tc(dev)) {
+		int i;
+		struct Qdisc *qdisc;
+		struct gnet_stats_queue qstats = {0};
+		struct gnet_stats_basic_packed bstats = {0};
+		struct netdev_tc_txq tc = dev->tc_to_txq[cl - 1];
+
+		/* Drop lock here it will be reclaimed before touching
+		 * statistics this is required because the d->lock we
+		 * hold here is the look on dev_queue->qdisc_sleeping
+		 * also acquired below.
+		 */
+		spin_unlock_bh(d->lock);
+
+		for (i = tc.offset; i < tc.offset + tc.count; i++) {
+			qdisc = netdev_get_tx_queue(dev, i)->qdisc;
+			spin_lock_bh(qdisc_lock(qdisc));
+			bstats.bytes      += qdisc->bstats.bytes;
+			bstats.packets    += qdisc->bstats.packets;
+			qstats.qlen       += qdisc->qstats.qlen;
+			qstats.backlog    += qdisc->qstats.backlog;
+			qstats.drops      += qdisc->qstats.drops;
+			qstats.requeues   += qdisc->qstats.requeues;
+			qstats.overlimits += qdisc->qstats.overlimits;
+			spin_unlock_bh(qdisc_lock(qdisc));
+		}
+		/* Reclaim root sleeping lock before completing stats */
+		spin_lock_bh(d->lock);
+		if (gnet_stats_copy_basic(d, &bstats) < 0 ||
+		    gnet_stats_copy_queue(d, &qstats) < 0)
+			return -1;
+	} else {
+		struct netdev_queue *dev_queue = mqprio_queue_get(sch, cl);
+
+		sch = dev_queue->qdisc_sleeping;
+		sch->qstats.qlen = sch->q.qlen;
+		if (gnet_stats_copy_basic(d, &sch->bstats) < 0 ||
+		    gnet_stats_copy_queue(d, &sch->qstats) < 0)
+			return -1;
+	}
+	return 0;
+}
+
+static void mqprio_walk(struct Qdisc *sch, struct qdisc_walker *arg)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	unsigned long ntx;
+
+	if (arg->stop)
+		return;
+
+	/* Walk hierarchy with a virtual class per tc */
+	arg->count = arg->skip;
+	for (ntx = arg->skip;
+	     ntx < dev->num_tx_queues + netdev_get_num_tc(dev);
+	     ntx++) {
+		if (arg->fn(sch, ntx + 1, arg) < 0) {
+			arg->stop = 1;
+			break;
+		}
+		arg->count++;
+	}
+}
+
+static const struct Qdisc_class_ops mqprio_class_ops = {
+	.graft		= mqprio_graft,
+	.leaf		= mqprio_leaf,
+	.get		= mqprio_get,
+	.put		= mqprio_put,
+	.walk		= mqprio_walk,
+	.dump		= mqprio_dump_class,
+	.dump_stats	= mqprio_dump_class_stats,
+};
+
+struct Qdisc_ops mqprio_qdisc_ops __read_mostly = {
+	.cl_ops		= &mqprio_class_ops,
+	.id		= "mqprio",
+	.priv_size	= sizeof(struct mqprio_sched),
+	.init		= mqprio_init,
+	.destroy	= mqprio_destroy,
+	.attach		= mqprio_attach,
+	.dump		= mqprio_dump,
+	.owner		= THIS_MODULE,
+};
+
+static int __init mqprio_module_init(void)
+{
+	return register_qdisc(&mqprio_qdisc_ops);
+}
+
+static void __exit mqprio_module_exit(void)
+{
+	unregister_qdisc(&mqprio_qdisc_ops);
+}
+
+module_init(mqprio_module_init);
+module_exit(mqprio_module_exit);
+
+MODULE_LICENSE("GPL");


^ permalink raw reply related

* Re: 2.6.37 regression: adding main interface to a bridge breaks vlan interface RX
From: Simon Arlott @ 2011-01-17 18:17 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: netdev, Linux Kernel Mailing List, jesse, Herbert Xu
In-Reply-To: <1295280044.6264.5.camel@bwh-desktop>

On 17/01/11 16:00, Ben Hutchings wrote:
> On Sun, 2011-01-16 at 14:09 +0000, Simon Arlott wrote:
>> [    1.666706] forcedeth 0000:00:08.0: ifname eth0, PHY OUI 0x5043 @ 16, addr 00:e0:81:4d:2b:ec
>> [    1.666767] forcedeth 0000:00:08.0: highdma csum vlan pwrctl mgmt gbit lnktim msi desc-v3
>> 
>> I have eth0 and eth0.3840 which works until I add eth0 to a bridge.
>> While eth0 is in a bridge (the bridge device is up), eth0.3840 is unable
>> to receive packets. Using tcpdump on eth0 shows the packets being
>> received with a VLAN tag but they don't appear on eth0.3840. They appear
>> with the VLAN tag on the bridge interface.
> [...]
> 
> This means the behaviour is now consistent, whether or not hardware VLAN
> tag stripping is enabled.  (I previously pointed out the inconsistent
> behaviour in <http://thread.gmane.org/gmane.linux.network/149864>.)  I
> would consider this an improvement.

Shouldn't the kernel also prevent a device from being both part of a
bridge and having VLANs? Instead everything appears to work except
incoming traffic.

-- 
Simon Arlott

^ permalink raw reply

* Re: [PATCH v2] net: add Faraday FTMAC100 10/100 Ethernet driver
From: Eric Dumazet @ 2011-01-17 18:21 UTC (permalink / raw)
  To: Po-Yu Chuang; +Cc: netdev, linux-kernel, ratbert, bhutchings, joe, dilinger
In-Reply-To: <1295256060-2091-1-git-send-email-ratbert.chuang@gmail.com>

Le lundi 17 janvier 2011 à 17:21 +0800, Po-Yu Chuang a écrit :


> +
> +	spin_lock_irqsave(&priv->tx_lock, flags);
> +	ftmac100_txdes_set_skb(txdes, skb);
> +	ftmac100_txdes_set_dma_addr(txdes, map);
> +
> +	ftmac100_txdes_set_first_segment(txdes);
> +	ftmac100_txdes_set_last_segment(txdes);
> +	ftmac100_txdes_set_txint(txdes);
> +	ftmac100_txdes_set_buffer_size(txdes, len);
> +

I wonder if its not too expensive to read/modify/write txdes->txdes1

Maybe you should use a temporary u32 var and perform one final write on
txdes->txdes1 (with the set_dma_own)

> +	priv->tx_pending++;
> +	if (priv->tx_pending == TX_QUEUE_ENTRIES) {
> +		if (net_ratelimit())
> +			netdev_info(netdev, "tx queue full\n");
> +
> +		netif_stop_queue(netdev);
> +	}
> +
> +	/* start transmit */
> +	ftmac100_txdes_set_dma_own(txdes);

	txdes->txdes1 = txdes1;

BTW, shouldnt you use cpu_to_be32() or cpu_to_le32(), if this driver is
multi platform ?

^ permalink raw reply

* Re: [PATCH v2] net: add Faraday FTMAC100 10/100 Ethernet driver
From: Ben Hutchings @ 2011-01-17 18:58 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Po-Yu Chuang, netdev, linux-kernel, ratbert, joe, dilinger
In-Reply-To: <1295285354.3335.10.camel@edumazet-laptop>

On Mon, 2011-01-17 at 18:29 +0100, Eric Dumazet wrote:
> Le lundi 17 janvier 2011 à 17:21 +0800, Po-Yu Chuang a écrit :
> 
> 
> > +static int ftmac100_rx_packet(struct ftmac100 *priv, int *processed)
> > +{
> > +	struct net_device *netdev = priv->netdev;
> > +	struct ftmac100_rxdes *rxdes;
> > +	struct sk_buff *skb;
> > +	int length;
> > +	int copied = 0;
> > +	int done = 0;
> > +
> > +	rxdes = ftmac100_rx_locate_first_segment(priv);
> > +	if (!rxdes)
> > +		return 0;
> > +
> > +	length = ftmac100_rxdes_frame_length(rxdes);
> > +
> > +	netdev->stats.rx_packets++;
> > +	netdev->stats.rx_bytes += length;
> > +
> > +	if (unlikely(ftmac100_rx_packet_error(priv, rxdes))) {
> > +		ftmac100_rx_drop_packet(priv);
> > +		return 1;
> > +	}
> > +
> > +	/* start processing */
> > +	skb = netdev_alloc_skb_ip_align(netdev, length);
> > +	if (unlikely(!skb)) {
> > +		if (net_ratelimit())
> > +			netdev_err(netdev, "rx skb alloc failed\n");
> > +
> > +		ftmac100_rx_drop_packet(priv);
> > +		return 1;
> > +	}
> > +
> 
> Please dont increase rx_packets/rx_bytes before the
> netdev_alloc_skb_ip_align().
> 
> In case of mem allocation failure, it would be better not pretending we
> handled a packet.
>
> drivers/net/r8169.c for example does the rx_packets/rx_bytes only if
> packet is delivered to upper stack.

That's news to me.  I specifically advised Po-Yu Chuang to increment
these earlier because my understanding is that all packets/bytes should
be counted.  And drivers which use hardware MAC stats will generally do
that, so I really don't think it makes sense to make other drivers
different deliberately.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [PATCH v2 1/3] can: at91_can: clean up usage of AT91_MB_RX_FIRST and AT91_MB_RX_NUM
From: Marc Kleine-Budde @ 2011-01-17 19:50 UTC (permalink / raw)
  To: Marc Kleine-Budde
  Cc: Socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1294752085-30151-2-git-send-email-mkl-bIcnvbaLZ9MEGnE8C9+IrQ@public.gmane.org>


[-- Attachment #1.1: Type: text/plain, Size: 817 bytes --]

On 01/11/2011 02:21 PM, Marc Kleine-Budde wrote:
> This patch cleans up the usage of two macros which specify the mailbox
> usage. AT91_MB_RX_FIRST and AT91_MB_RX_NUM define the first and the
> number of RX mailboxes. The current driver uses these variables in an
> unclean way; assuming that AT91_MB_RX_FIRST is 0;
> 
> This patch cleans up the usage of these macros, no longer assuming
> AT91_MB_RX_FIRST == 0.
> 
> Signed-off-by: Marc Kleine-Budde <mkl-bIcnvbaLZ9MEGnE8C9+IrQ@public.gmane.org>

Any comments on this?

Marc

-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #1.2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

[-- Attachment #2: Type: text/plain, Size: 188 bytes --]

_______________________________________________
Socketcan-core mailing list
Socketcan-core-0fE9KPoRgkgATYTw5x5z8w@public.gmane.org
https://lists.berlios.de/mailman/listinfo/socketcan-core

^ permalink raw reply

* [PATCH] ns83820: Avoid bad pointer deref in ns83820_init_one().
From: Jesper Juhl @ 2011-01-17 20:24 UTC (permalink / raw)
  To: netdev
  Cc: linux-ns83820, linux-kernel, Tejun Heo, Tejun Heo,
	Kulikov Vasiliy, Denis Kirjanov, David S. Miller,
	Benjamin LaHaise

In drivers/net/ns83820.c::ns83820_init_one() we dynamically allocate 
memory via alloc_etherdev(). We then call PRIV() on the returned storage 
which is 'return netdev_priv()'. netdev_priv() takes the pointer it is 
passed and adds 'ALIGN(sizeof(struct net_device), NETDEV_ALIGN)' to it and 
returns it. Then we test the resulting pointer for NULL, which it is 
unlikely to be at this point, and later dereference it. This will go bad 
if alloc_etherdev() actually returned NULL.

This patch reworks the code slightly so that we test for a NULL pointer 
(and return -ENOMEM) directly after calling alloc_etherdev().

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
---
 ns83820.c |    5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

  Compile tested only. I have no way to test this for real.

diff --git a/drivers/net/ns83820.c b/drivers/net/ns83820.c
index 84134c7..a41b2cf 100644
--- a/drivers/net/ns83820.c
+++ b/drivers/net/ns83820.c
@@ -1988,12 +1988,11 @@ static int __devinit ns83820_init_one(struct pci_dev *pci_dev,
 	}
 
 	ndev = alloc_etherdev(sizeof(struct ns83820));
-	dev = PRIV(ndev);
-
 	err = -ENOMEM;
-	if (!dev)
+	if (!ndev)
 		goto out;
 
+	dev = PRIV(ndev);
 	dev->ndev = ndev;
 
 	spin_lock_init(&dev->rx_info.lock);


-- 
Jesper Juhl <jj@chaosbits.net>            http://www.chaosbits.net/
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please.

^ permalink raw reply related

* Re: [PATCH v2] net: add Faraday FTMAC100 10/100 Ethernet driver
From: Eric Dumazet @ 2011-01-17 20:39 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: Po-Yu Chuang, netdev, linux-kernel, ratbert, joe, dilinger
In-Reply-To: <1295290718.6264.19.camel@bwh-desktop>

Le lundi 17 janvier 2011 à 18:58 +0000, Ben Hutchings a écrit :
> On Mon, 2011-01-17 at 18:29 +0100, Eric Dumazet wrote:
> > Le lundi 17 janvier 2011 à 17:21 +0800, Po-Yu Chuang a écrit :
> > 
> > 
> > > +static int ftmac100_rx_packet(struct ftmac100 *priv, int *processed)
> > > +{
> > > +	struct net_device *netdev = priv->netdev;
> > > +	struct ftmac100_rxdes *rxdes;
> > > +	struct sk_buff *skb;
> > > +	int length;
> > > +	int copied = 0;
> > > +	int done = 0;
> > > +
> > > +	rxdes = ftmac100_rx_locate_first_segment(priv);
> > > +	if (!rxdes)
> > > +		return 0;
> > > +
> > > +	length = ftmac100_rxdes_frame_length(rxdes);
> > > +
> > > +	netdev->stats.rx_packets++;
> > > +	netdev->stats.rx_bytes += length;
> > > +
> > > +	if (unlikely(ftmac100_rx_packet_error(priv, rxdes))) {
> > > +		ftmac100_rx_drop_packet(priv);
> > > +		return 1;
> > > +	}
> > > +
> > > +	/* start processing */
> > > +	skb = netdev_alloc_skb_ip_align(netdev, length);
> > > +	if (unlikely(!skb)) {
> > > +		if (net_ratelimit())
> > > +			netdev_err(netdev, "rx skb alloc failed\n");
> > > +
> > > +		ftmac100_rx_drop_packet(priv);
> > > +		return 1;
> > > +	}
> > > +
> > 
> > Please dont increase rx_packets/rx_bytes before the
> > netdev_alloc_skb_ip_align().
> > 
> > In case of mem allocation failure, it would be better not pretending we
> > handled a packet.
> >
> > drivers/net/r8169.c for example does the rx_packets/rx_bytes only if
> > packet is delivered to upper stack.
> 
> That's news to me.  I specifically advised Po-Yu Chuang to increment
> these earlier because my understanding is that all packets/bytes should
> be counted.  And drivers which use hardware MAC stats will generally do
> that, so I really don't think it makes sense to make other drivers
> different deliberately.
> 

I see, but when one frame is dropped because of RX ring buffer
under/overflow we dont account for the lost packet/bytes.

Thats probably not very important, but would be good if all drivers
behave the same.




^ permalink raw reply

* Re: NETLINK: Failed to browse: Invalid argument from avahi-daemon in F14 since 2.6.37-git8
From: Alessandro Suardi @ 2011-01-17 21:23 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: linux-kernel, netdev
In-Reply-To: <20110117083208.GA7569@ff.dom.local>

On Mon, Jan 17, 2011 at 9:32 AM, Jarek Poplawski <jarkao2@gmail.com> wrote:
> On 2011-01-15 19:34, Alessandro Suardi wrote:
>> /var/log/messages says:
>>
>> Jan 13 12:42:02 duff avahi-daemon[2771]: Found user 'avahi' (UID 70)
>> and group 'avahi' (GID 70).
>> Jan 13 12:42:02 duff avahi-daemon[2771]: Successfully dropped root privileges.
>> Jan 13 12:42:02 duff avahi-daemon[2771]: avahi-daemon 0.6.27 starting up.
>> Jan 13 12:42:02 duff avahi-daemon[2771]: Successfully called chroot().
>> Jan 13 12:42:02 duff avahi-daemon[2771]: Successfully dropped remaining capabilities.
>> Jan 13 12:42:02 duff avahi-daemon[2771]: Loading service file /services/ssh.service.
>> Jan 13 12:42:02 duff avahi-daemon[2771]: Loading service file /services/udisks.service.
>> Jan 13 12:42:02 duff avahi-daemon[2771]: NETLINK: Failed to browse: Invalid argument
>> Jan 13 12:42:23 duff acpid: starting up with netlink and the input layer
>>
>>
>> Happens both at boot and after boot if restarting avahi-daemon service,
>>  and there is a 10-15" wait before the service start script prints a [FAILED]
>>  red tag; avahi-daemon process does start up anyways.
>>
>> -git7 is the latest "good" kernel
>> -git8, -git9, -git11, -git13 have been reproducing the issue
>>
>>
>> thanks, ciao,
>
> Seems to be resolved in this thread:
>
> http://www.spinics.net/lists/netdev/msg152762.html

Thanks Jarek - filed Fedora bug 670316 for this one.

https://bugzilla.redhat.com/show_bug.cgi?id=670316

--alessandro

 "There's always a siren singing you to shipwreck"

   (Radiohead, "There There")

^ permalink raw reply

* Re: [PATCH] net_sched: factorize qdisc stats handling
From: Jesper Dangaard Brouer @ 2011-01-17 21:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Stephen Hemminger, Changli Gao, David Miller, Fabio Checconi,
	netdev, Luigi Rizzo, hawk
In-Reply-To: <1294478789.2709.79.camel@edumazet-laptop>



On Sat, 8 Jan 2011, Eric Dumazet wrote:
>> Changli Gao <xiaosuo@gmail.com> wrote:
>>
>>>> +       cl->bstats.packets += skb_is_gso(skb)?skb_shinfo(skb)->gso_segs:1;

What about the qlen when we enqueue a GSO packet? Should we account for 
the number of internal GSO packets, or just account a GSO packet as a 
single packet?

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 0918834..1a8c0a3 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -453,7 +453,8 @@ static int pfifo_fast_enqueue(struct sk_buff *skb, 
struct Qdisc* qdisc)
                 struct sk_buff_head *list = band2list(priv, band);

                 priv->bitmap |= (1 << band);
-               qdisc->q.qlen++;
+               /* Should the number of GSO packets be taken into account?*/
+               qdisc->q.qlen += skb_is_gso(skb)?skb_shinfo(skb)->gso_segs:1;
                 return __qdisc_enqueue_tail(skb, qdisc, list);
         }

Is is at all legal to modify the q.qlen this way?

Cheers
   Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------

^ permalink raw reply related

* Re: [PATCH] net_sched: factorize qdisc stats handling
From: Eric Dumazet @ 2011-01-17 22:04 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Stephen Hemminger, Changli Gao, David Miller, Fabio Checconi,
	netdev, Luigi Rizzo, hawk
In-Reply-To: <Pine.LNX.4.64.1101172243470.15896@ask.diku.dk>

Le lundi 17 janvier 2011 à 22:55 +0100, Jesper Dangaard Brouer a écrit :
> 
> On Sat, 8 Jan 2011, Eric Dumazet wrote:
> >> Changli Gao <xiaosuo@gmail.com> wrote:
> >>
> >>>> +       cl->bstats.packets += skb_is_gso(skb)?skb_shinfo(skb)->gso_segs:1;
> 
> What about the qlen when we enqueue a GSO packet? Should we account for 
> the number of internal GSO packets, or just account a GSO packet as a 
> single packet?
> 
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 0918834..1a8c0a3 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -453,7 +453,8 @@ static int pfifo_fast_enqueue(struct sk_buff *skb, 
> struct Qdisc* qdisc)
>                  struct sk_buff_head *list = band2list(priv, band);
> 
>                  priv->bitmap |= (1 << band);
> -               qdisc->q.qlen++;
> +               /* Should the number of GSO packets be taken into account?*/
> +               qdisc->q.qlen += skb_is_gso(skb)?skb_shinfo(skb)->gso_segs:1;
>                  return __qdisc_enqueue_tail(skb, qdisc, list);
>          }
> 
> Is is at all legal to modify the q.qlen this way?

nope ;)

q.qlen is really number of skbs here.




^ permalink raw reply

* [BUG] bnx2 + vlan + TSO : doesnt work
From: Eric Dumazet @ 2011-01-17 23:41 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Jesse Gross

Oh well, my dev machine, with bnx2 and tg3 NICs, isnt working at all
with current linux-2.6 tree (I need vlans on my setup)

tg3 : vlan not working

bnx2 : vlan + TSO not working (or very slowly, since only non GSO frames
are OK)

I suspect recent commits from Jesse are the problem.
(bnx2_xmit() is feeded with zeroed vlan_tci skbs)

Maybe f01a5236bd4b1401 (net offloading: Generalize
netif_get_vlan_features().) ?

I dont see NETIF_F_HW_VLAN_TX being set in vlan_features in net drivers.
They only set this flag in dev->features

I dont think changing all drivers to also set vlan_features makes sense.

Is following patch the right path ? (It does solve my problem)

diff --git a/net/core/dev.c b/net/core/dev.c
index 54277df..5c13e55 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5265,6 +5265,9 @@ int register_netdevice(struct net_device *dev)
 	 */
 	dev->vlan_features |= (NETIF_F_GRO | NETIF_F_HIGHDMA);
 
+	/* allow netif_skb_features() not mask this bit from dev->features */
+	dev->vlan_features |= NETIF_F_HW_VLAN_TX;
+
 	ret = call_netdevice_notifiers(NETDEV_POST_INIT, dev);
 	ret = notifier_to_errno(ret);
 	if (ret)



^ permalink raw reply related

* Re: [BUG] bnx2 + vlan + TSO : doesnt work
From: Ben Hutchings @ 2011-01-17 23:47 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Jesse Gross
In-Reply-To: <1295307669.3362.106.camel@edumazet-laptop>

On Tue, 2011-01-18 at 00:41 +0100, Eric Dumazet wrote:
> Oh well, my dev machine, with bnx2 and tg3 NICs, isnt working at all
> with current linux-2.6 tree (I need vlans on my setup)
> 
> tg3 : vlan not working
> 
> bnx2 : vlan + TSO not working (or very slowly, since only non GSO frames
> are OK)
> 
> I suspect recent commits from Jesse are the problem.
> (bnx2_xmit() is feeded with zeroed vlan_tci skbs)
> 
> Maybe f01a5236bd4b1401 (net offloading: Generalize
> netif_get_vlan_features().) ?
> 
> I dont see NETIF_F_HW_VLAN_TX being set in vlan_features in net drivers.
> They only set this flag in dev->features
> 
> I dont think changing all drivers to also set vlan_features makes sense.
> 
> Is following patch the right path ? (It does solve my problem)

This isn't right.  NETIF_F_HW_VLAN_TX in vlan_features would mean that
the hardware can do two levels of VLAN tag insertion, which is generally
not true.

Ben.

> diff --git a/net/core/dev.c b/net/core/dev.c
> index 54277df..5c13e55 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -5265,6 +5265,9 @@ int register_netdevice(struct net_device *dev)
>  	 */
>  	dev->vlan_features |= (NETIF_F_GRO | NETIF_F_HIGHDMA);
>  
> +	/* allow netif_skb_features() not mask this bit from dev->features */
> +	dev->vlan_features |= NETIF_F_HW_VLAN_TX;
> +
>  	ret = call_netdevice_notifiers(NETDEV_POST_INIT, dev);
>  	ret = notifier_to_errno(ret);
>  	if (ret)

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [BUG] bnx2 + vlan + TSO : doesnt work
From: Eric Dumazet @ 2011-01-18  0:00 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: David Miller, netdev, Jesse Gross
In-Reply-To: <1295308041.6264.22.camel@bwh-desktop>

Le lundi 17 janvier 2011 à 23:47 +0000, Ben Hutchings a écrit :
> On Tue, 2011-01-18 at 00:41 +0100, Eric Dumazet wrote:
> > Oh well, my dev machine, with bnx2 and tg3 NICs, isnt working at all
> > with current linux-2.6 tree (I need vlans on my setup)
> > 
> > tg3 : vlan not working
> > 
> > bnx2 : vlan + TSO not working (or very slowly, since only non GSO frames
> > are OK)
> > 
> > I suspect recent commits from Jesse are the problem.
> > (bnx2_xmit() is feeded with zeroed vlan_tci skbs)
> > 
> > Maybe f01a5236bd4b1401 (net offloading: Generalize
> > netif_get_vlan_features().) ?
> > 
> > I dont see NETIF_F_HW_VLAN_TX being set in vlan_features in net drivers.
> > They only set this flag in dev->features
> > 
> > I dont think changing all drivers to also set vlan_features makes sense.
> > 
> > Is following patch the right path ? (It does solve my problem)
> 
> This isn't right.  NETIF_F_HW_VLAN_TX in vlan_features would mean that
> the hardware can do two levels of VLAN tag insertion, which is generally
> not true.
> 

So should we revert part of Jesse patch ? I want my vlan back ;)

diff --git a/net/core/dev.c b/net/core/dev.c
index 54277df..8209d93 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2076,7 +2076,7 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 		features = netif_skb_features(skb);
 
 		if (vlan_tx_tag_present(skb) &&
-		    !(features & NETIF_F_HW_VLAN_TX)) {
+		    !(dev->features & NETIF_F_HW_VLAN_TX)) {
 			skb = __vlan_put_tag(skb, vlan_tx_tag_get(skb));
 			if (unlikely(!skb))
 				goto out;



^ permalink raw reply related

* Re: [BUG] bnx2 + vlan + TSO : doesnt work
From: Ben Hutchings @ 2011-01-18  0:09 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Jesse Gross
In-Reply-To: <1295308844.3362.135.camel@edumazet-laptop>

On Tue, 2011-01-18 at 01:00 +0100, Eric Dumazet wrote:
> Le lundi 17 janvier 2011 à 23:47 +0000, Ben Hutchings a écrit :
> > On Tue, 2011-01-18 at 00:41 +0100, Eric Dumazet wrote:
> > > Oh well, my dev machine, with bnx2 and tg3 NICs, isnt working at all
> > > with current linux-2.6 tree (I need vlans on my setup)
> > > 
> > > tg3 : vlan not working
> > > 
> > > bnx2 : vlan + TSO not working (or very slowly, since only non GSO frames
> > > are OK)
> > > 
> > > I suspect recent commits from Jesse are the problem.
> > > (bnx2_xmit() is feeded with zeroed vlan_tci skbs)
> > > 
> > > Maybe f01a5236bd4b1401 (net offloading: Generalize
> > > netif_get_vlan_features().) ?
> > > 
> > > I dont see NETIF_F_HW_VLAN_TX being set in vlan_features in net drivers.
> > > They only set this flag in dev->features
> > > 
> > > I dont think changing all drivers to also set vlan_features makes sense.
> > > 
> > > Is following patch the right path ? (It does solve my problem)
> > 
> > This isn't right.  NETIF_F_HW_VLAN_TX in vlan_features would mean that
> > the hardware can do two levels of VLAN tag insertion, which is generally
> > not true.
> > 
> 
> So should we revert part of Jesse patch ? I want my vlan back ;)

Yeah, that looks right.

Ben.

> diff --git a/net/core/dev.c b/net/core/dev.c
> index 54277df..8209d93 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2076,7 +2076,7 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
>  		features = netif_skb_features(skb);
>  
>  		if (vlan_tx_tag_present(skb) &&
> -		    !(features & NETIF_F_HW_VLAN_TX)) {
> +		    !(dev->features & NETIF_F_HW_VLAN_TX)) {
>  			skb = __vlan_put_tag(skb, vlan_tx_tag_get(skb));
>  			if (unlikely(!skb))
>  				goto out;
> 
> 

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [BUG] bnx2 + vlan + TSO : doesnt work
From: Jesse Gross @ 2011-01-18  0:13 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: Eric Dumazet, David Miller, netdev
In-Reply-To: <1295309365.6264.23.camel@bwh-desktop>

On Mon, Jan 17, 2011 at 4:09 PM, Ben Hutchings
<bhutchings@solarflare.com> wrote:
> On Tue, 2011-01-18 at 01:00 +0100, Eric Dumazet wrote:
>> Le lundi 17 janvier 2011 à 23:47 +0000, Ben Hutchings a écrit :
>> > On Tue, 2011-01-18 at 00:41 +0100, Eric Dumazet wrote:
>> > > Oh well, my dev machine, with bnx2 and tg3 NICs, isnt working at all
>> > > with current linux-2.6 tree (I need vlans on my setup)
>> > >
>> > > tg3 : vlan not working
>> > >
>> > > bnx2 : vlan + TSO not working (or very slowly, since only non GSO frames
>> > > are OK)
>> > >
>> > > I suspect recent commits from Jesse are the problem.
>> > > (bnx2_xmit() is feeded with zeroed vlan_tci skbs)
>> > >
>> > > Maybe f01a5236bd4b1401 (net offloading: Generalize
>> > > netif_get_vlan_features().) ?
>> > >
>> > > I dont see NETIF_F_HW_VLAN_TX being set in vlan_features in net drivers.
>> > > They only set this flag in dev->features
>> > >
>> > > I dont think changing all drivers to also set vlan_features makes sense.
>> > >
>> > > Is following patch the right path ? (It does solve my problem)
>> >
>> > This isn't right.  NETIF_F_HW_VLAN_TX in vlan_features would mean that
>> > the hardware can do two levels of VLAN tag insertion, which is generally
>> > not true.
>> >
>>
>> So should we revert part of Jesse patch ? I want my vlan back ;)
>
> Yeah, that looks right.

I think it is better for netif_skb_features() to actually return the
correct features rather than bypass it here.  NETIF_F_HW_VLAN_TX
doesn't depend on any other offloads, so we can always include it if
it is in dev->features.

Separately, this means there is a problem with bnx2 because it allows
vlan insertion to be turned off, which would have the same effect.
Maybe it is looking directly at skb->protocol or similar for TSO.

^ permalink raw reply

* [PATCH] scm: provide full privilege set via SCM_PRIVILEGE
From: Casey Schaufler @ 2011-01-18  0:34 UTC (permalink / raw)
  To: netdev; +Cc: Casey Schaufler


Subject: [PATCH] scm: provide full privilege set via SCM_PRIVILEGE

The SCM mechanism currently provides interfaces for delivering
the uid/gid and the "security context" (LSM information) of the
peer on a UDS socket. All of the security credential information
is available, but there is no interface available to obtain it.
Further, the existing interfaces require that the user chose
between the uid/gid and the context as the existing interfaces
are exclusive.

This patch introduces an additional interface that provides
a complete set of security information from the peer credential.
No additional work is required to provide the information
internally, it is all being passed, just not exposed.

Also sent to LKML and LSM lists. 

Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
---

 include/asm-generic/socket.h |    1 +
 include/linux/net.h          |    1 +
 include/linux/socket.h       |    1 +
 include/net/scm.h            |   80 +++++++++++++++++++++++++++++++++++++++++-
 net/core/sock.c              |   11 ++++++
 5 files changed, 93 insertions(+), 1 deletions(-)
diff --git a/include/asm-generic/socket.h b/include/asm-generic/socket.h
index 9a6115e..7aa8e84 100644
--- a/include/asm-generic/socket.h
+++ b/include/asm-generic/socket.h
@@ -64,4 +64,5 @@
 #define SO_DOMAIN		39
 
 #define SO_RXQ_OVFL             40
+#define SO_PASSPRIV		41
 #endif /* __ASM_GENERIC_SOCKET_H */
diff --git a/include/linux/net.h b/include/linux/net.h
index 16faa13..159a929 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -71,6 +71,7 @@ struct net;
 #define SOCK_NOSPACE		2
 #define SOCK_PASSCRED		3
 #define SOCK_PASSSEC		4
+#define SOCK_PASSPRIV		5
 
 #ifndef ARCH_HAS_SOCKET_TYPES
 /**
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 86b652f..e9cfd68 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -147,6 +147,7 @@ static inline struct cmsghdr * cmsg_nxthdr (struct msghdr *__msg, struct cmsghdr
 #define	SCM_RIGHTS	0x01		/* rw: access rights (array of int) */
 #define SCM_CREDENTIALS 0x02		/* rw: struct ucred		*/
 #define SCM_SECURITY	0x03		/* rw: security label		*/
+#define SCM_PRIVILEGES  0x04		/* rw: privilege set		*/
 
 struct ucred {
 	__u32	pid;
diff --git a/include/net/scm.h b/include/net/scm.h
index 3165650..4b8db21 100644
--- a/include/net/scm.h
+++ b/include/net/scm.h
@@ -101,6 +101,83 @@ static inline void scm_passec(struct socket *sock, struct msghdr *msg, struct sc
 { }
 #endif /* CONFIG_SECURITY_NETWORK */
 
+static __inline__ void scm_passpriv(struct socket *sock, struct msghdr *msg,
+				struct scm_cookie *scm)
+{
+	const struct cred *credp = scm->cred;
+	const struct group_info *gip;
+	char *result;
+	char *cp;
+	int i;
+#ifdef CONFIG_SECURITY_NETWORK
+	char *secdata;
+	u32 seclen;
+	int err;
+#endif /* CONFIG_SECURITY_NETWORK */
+
+	if (!test_bit(SOCK_PASSPRIV, &sock->flags))
+		return;
+
+	gip = credp->group_info;
+
+	/*
+	 * uid + euid + gid + egid + group-list + capabilities
+	 *     + "uid=" + "euid=" + "gid=" + "egid=" + "grps="
+	 *     + "cap-e=" + "cap-p=" + "cap-i="
+	 * 10  + 10   + 10  + 10   + (ngrps * 10) + ecap + pcap + icap
+	 *     + 4 + 5 + 4 + 5 + 5 + 6 + 6 + 6
+	 */
+	i = ((4 + gip->ngroups) * 11) + (3 * (_KERNEL_CAPABILITY_U32S * 8 + 1))
+		+ 41;
+
+#ifdef CONFIG_SECURITY_NETWORK
+	err = security_secid_to_secctx(scm->secid, &secdata, &seclen);
+	if (!err)
+		/*
+		 * " context="
+		 */
+		i += seclen + 10;
+#endif /* CONFIG_SECURITY_NETWORK */
+
+	result = kzalloc(i, GFP_KERNEL);
+	if (result == NULL)
+		return;
+
+	cp = result + sprintf(result, "euid=%d uid=%d egid=%d gid=%d",
+				credp->euid, credp->uid,
+				credp->egid, credp->gid);
+
+	if (gip != NULL && gip->ngroups > 0) {
+		cp += sprintf(cp, " grps=%d", GROUP_AT(gip, 0));
+		for (i = 1 ; i < gip->ngroups; i++)
+			cp += sprintf(cp, ",%d", GROUP_AT(gip, i));
+	}
+
+	cp += sprintf(cp, " cap-e=");
+	CAP_FOR_EACH_U32(i)
+		cp += sprintf(cp, "%08x", credp->cap_effective.cap[i]);
+	cp += sprintf(cp, " cap-p=");
+	CAP_FOR_EACH_U32(i)
+		cp += sprintf(cp, "%08x", credp->cap_permitted.cap[i]);
+	cp += sprintf(cp, " cap-i=");
+	CAP_FOR_EACH_U32(i)
+		cp += sprintf(cp, "%08x", credp->cap_inheritable.cap[i]);
+
+#ifdef CONFIG_SECURITY_NETWORK
+	cp += sprintf(cp, " context=");
+	strncpy(cp, secdata, seclen);
+	cp += seclen;
+	*cp = '\0';
+
+	security_release_secctx(secdata, seclen);
+#endif /* CONFIG_SECURITY_NETWORK */
+
+	put_cmsg(msg, SOL_SOCKET, SCM_PRIVILEGES, strlen(result)+1, result);
+
+	kfree(result);
+}
+
+
 static __inline__ void scm_recv(struct socket *sock, struct msghdr *msg,
 				struct scm_cookie *scm, int flags)
 {
@@ -114,6 +191,8 @@ static __inline__ void scm_recv(struct socket *sock, struct msghdr *msg,
 	if (test_bit(SOCK_PASSCRED, &sock->flags))
 		put_cmsg(msg, SOL_SOCKET, SCM_CREDENTIALS, sizeof(scm->creds), &scm->creds);
 
+	scm_passpriv(sock, msg, scm);
+
 	scm_destroy_cred(scm);
 
 	scm_passec(sock, msg, scm);
@@ -124,6 +203,5 @@ static __inline__ void scm_recv(struct socket *sock, struct msghdr *msg,
 	scm_detach_fds(msg, scm);
 }
 
-
 #endif /* __LINUX_NET_SCM_H */
 
diff --git a/net/core/sock.c b/net/core/sock.c
index fb60801..f134126 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -725,6 +725,13 @@ set_rcvbuf:
 		else
 			clear_bit(SOCK_PASSSEC, &sock->flags);
 		break;
+
+	case SO_PASSPRIV:
+		if (valbool)
+			set_bit(SOCK_PASSPRIV, &sock->flags);
+		else
+			clear_bit(SOCK_PASSPRIV, &sock->flags);
+		break;
 	case SO_MARK:
 		if (!capable(CAP_NET_ADMIN))
 			ret = -EPERM;
@@ -950,6 +957,10 @@ int sock_getsockopt(struct socket *sock, int level, int optname,
 		v.val = test_bit(SOCK_PASSSEC, &sock->flags) ? 1 : 0;
 		break;
 
+	case SO_PASSPRIV:
+		v.val = test_bit(SOCK_PASSPRIV, &sock->flags) ? 1 : 0;
+		break;
+
 	case SO_PEERSEC:
 		return security_socket_getpeersec_stream(sock, optval, optlen, len);
 




^ permalink raw reply related

* Re: Realtek r8168C / r8169 driver VLAN TAG stripping
From: Francois Romieu @ 2011-01-18  1:21 UTC (permalink / raw)
  To: Anand Raj Manickam; +Cc: netdev, Hayes
In-Reply-To: <AANLkTikoXfgHrJyMbe6A_HmvXT_QRo55w76st5Wo0hSv@mail.gmail.com>

Anand Raj Manickam <anandrm@gmail.com> :
> On Mon, Jan 17, 2011 at 11:52 AM, Anand Raj Manickam <anandrm@gmail.com> wrote:
[...]
> > This is the dmesg  for XID
> >
> > eth0: RTL8168c/8111c at 0xf9628000, 00:17:54:00:f6:62, XID 1c4000c0 IRQ 31
> > r8169: mac_version = 0x16

I do not have one of those (RTL_GIGA_MAC_VER_22) to check if it handles vlan
correctly yet.

> > r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
> >
> > Unfortunately , i m not able to upgrade my kernel now . If there is a
> > Fix for it , that would be great !!

I doubt there is a lot of glamour/fortune/fame in backporting the 89 r8169
patches between v2.6.30 and v2.6.37 but you may help yourself and give it
a try.

-- 
Ueimor

^ permalink raw reply

* Re: [PATCH] ethtool : Add option -L | --set-common to set common flags.
From: Mahesh Bandewar @ 2011-01-18  2:17 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: David Miller, Tom Herbert, Laurent Chavey, netdev
In-Reply-To: <1295039984.5386.19.camel@bwh-desktop>

On Fri, Jan 14, 2011 at 1:19 PM, Ben Hutchings
<bhutchings@solarflare.com> wrote:
> On Thu, 2011-01-13 at 16:11 -0800, Mahesh Bandewar wrote:
>> This patch adds -L | --set-common option to add / remove common flags which
>> includes loopback flag. The -l | --show-common displays the current values
>> for these common flags.
>>
>> Signed-off-by: Mahesh Bandewar <maheshb@google.com>
>> ---
>>  ethtool-copy.h |    1 +
>>  ethtool.8      |   16 ++++++++++
>>  ethtool.c      |   90 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>>  3 files changed, 105 insertions(+), 2 deletions(-)
>>
>> diff --git a/ethtool-copy.h b/ethtool-copy.h
>> index 75c3ae7..5fd18c7 100644
>> --- a/ethtool-copy.h
>> +++ b/ethtool-copy.h
>> @@ -309,6 +309,7 @@ struct ethtool_perm_addr {
>>   * flag differs from the read-only value.
>>   */
>>  enum ethtool_flags {
>> +     ETH_FLAG_LOOPBACK       = (1 << 2),     /* Loopback enable / disable */
>>       ETH_FLAG_TXVLAN         = (1 << 7),     /* TX VLAN offload enabled */
>>       ETH_FLAG_RXVLAN         = (1 << 8),     /* RX VLAN offload enabled */
>>       ETH_FLAG_LRO            = (1 << 15),    /* LRO is enabled */
>> diff --git a/ethtool.8 b/ethtool.8
>> index 1760924..cf7128f 100644
>> --- a/ethtool.8
>> +++ b/ethtool.8
>> @@ -174,6 +174,13 @@ ethtool \- Display or change ethernet card settings
>>  .B2 txvlan on off
>>  .B2 rxhash on off
>>
>> +.B ethtool \-l|\-\-show\-common
>> +.I ethX
>> +
>> +.B ethtool \-L|\-\-set\-common
>> +.I ethX
>> +.B2 loopback on off
>> +
>>  .B ethtool \-p|\-\-identify
>>  .I ethX
>>  .RI [ N ]
>> @@ -406,6 +413,15 @@ Specifies whether TX VLAN acceleration should be enabled
>>  .A2 rxhash on off
>>  Specifies whether receive hashing offload should be enabled
>>  .TP
>> +.B \-l \-\-show\-common
>> +Queries the specified ethernet device for common flag settings.
>> +.TP
>> +.B \-L \-\-set\-common
>> +Changes the common parameters of the specified ethernet device.
>> +.TP
>> +.A2 loopback on off
>> +Specifies whether loopback should be enabled.
>> +.TP
>
> I've just gone through the manual page and changed 'ethernet device' to
> 'network device' for all generic operations; please follow that.  The
> source for the manual page was also renamed to ethtool.8.in as it now
> goes through autoconf substitution.
>
OK. I'll make those changes.

>>  .B \-p \-\-identify
>>  Initiates adapter-specific action intended to enable an operator to
>>  easily identify the adapter by sight.  Typically this involves
>> diff --git a/ethtool.c b/ethtool.c
>> index 63e0ead..1a0c10c 100644
>> --- a/ethtool.c
>> +++ b/ethtool.c
> [...]
>> @@ -1905,6 +1932,13 @@ static int dump_offload(int rx, int tx, int sg, int tso, int ufo, int gso,
>>       return 0;
>>  }
>>
>> +static int dump_common_flags(int loopback)
>> +{
>> +     fprintf(stdout, "loopback: %s\n", loopback ? "on" : "off");
>> +
>> +     return 0;
>> +}
>> +
>>  static int dump_rxfhash(int fhash, u64 val)
>>  {
>>       switch (fhash) {
> [...]
>> @@ -2219,6 +2257,53 @@ static int do_scoalesce(int fd, struct ifreq *ifr)
>>       return 0;
>>  }
>>
>> +static int do_gcommon(int fd, struct ifreq *ifr)
>> +{
>> +     struct ethtool_value eval;
>> +     int loopback = 0;
>> +
>> +     fprintf(stdout, "Common flags for %s:\n", devname);
>> +
>> +     eval.cmd = ETHTOOL_GFLAGS;
>> +     ifr->ifr_data = (caddr_t)&eval;
>> +     if (ioctl(fd, SIOCETHTOOL, ifr)) {
>> +             perror("Cannot get device flags");
>> +     } else {
>> +             loopback = (eval.data & ETH_FLAG_LOOPBACK) != 0;
>> +     }
>> +
>> +     return dump_common_flags(loopback);
>
> Breaking up a bitmask into a list of flag parameters is fairly
> pointless.  I realise do_goffload() and dump_offload() do that but I am
> just waiting for Michał Mirosław's changes to offload flags to be
> settled before I fix them.
>
>> +}
>> +
>> +static int do_scommon(int fd, struct ifreq *ifr)
>> +{
>> +     struct ethtool_value eval;
>> +
>> +     if (common_flags_mask) {
>> +             eval.cmd = ETHTOOL_GFLAGS;
>> +             eval.data = 0;
>> +             ifr->ifr_data = (caddr_t)&eval;
>> +             if (ioctl(fd, SIOCETHTOOL, ifr)) {
>> +                     perror("Cannot get device common flags");
>> +                     return 1;
>> +             }
>> +
>> +             eval.cmd = ETHTOOL_SFLAGS;
>> +             eval.data =
>> +                 ((eval.data & ~(common_flags_mask | off_flags_mask)) |
>> +                  (common_flags_wanted | off_flags_wanted));
>
> Why should this use off_flags_mask and off_flags_wanted?  They should
> both be 0 if this function is called.
>
That is right! Actually the get (ETHTOOL_GFLAGS) operation confused
me. I thought the values are fetched and *preserved* while setting the
new value. But when looked at it carefully, that is not the case.
Actually why that ioctl() with ETHTOOL_GFLAGS required?

>> +             if (ioctl(fd, SIOCETHTOOL, ifr)) {
>> +                     perror("Cannot set device common flags");
>> +                     return 1;
>> +             }
>> +     } else {
>> +             fprintf(stdout, "No common settings changed\n");
>> +     }
>> +
>> +     return 0;
>> +}
>> +
>>  static int do_goffload(int fd, struct ifreq *ifr)
>>  {
>>       struct ethtool_value eval;
>> @@ -2407,8 +2492,9 @@ static int do_soffload(int fd, struct ifreq *ifr)
>>               }
>>
>>               eval.cmd = ETHTOOL_SFLAGS;
>> -             eval.data = ((eval.data & ~off_flags_mask) |
>> -                          off_flags_wanted);
>> +             eval.data =
>> +                 ((eval.data & ~(off_flags_mask | common_flags_mask)) |
>> +                  (off_flags_wanted | common_flags_wanted));
>
> Similarly, why should this use common_flags_mask and
> common_flags_wanted?

For the same (wrong) reason mentioned above. I'll correct it and post
the new patch.

Thanks,
--mahesh..
>
> Ben.
>
>>
>>               err = ioctl(fd, SIOCETHTOOL, ifr);
>>               if (err) {
>
> --
> Ben Hutchings, Senior Software Engineer, Solarflare Communications
> Not speaking for my employer; that's the marketing department's job.
> They asked us to note that Solarflare product names are trademarked.
>
>

^ permalink raw reply

* Re: [PATCH] ethtool : Add option -L | --set-common to set common flags.
From: Ben Hutchings @ 2011-01-18  2:59 UTC (permalink / raw)
  To: Mahesh Bandewar; +Cc: David Miller, Tom Herbert, Laurent Chavey, netdev
In-Reply-To: <AANLkTiko=12CLgG_MMPshfwSRt7mZA+3DW+GRtxhaxh=@mail.gmail.com>

On Mon, 2011-01-17 at 18:17 -0800, Mahesh Bandewar wrote:
> On Fri, Jan 14, 2011 at 1:19 PM, Ben Hutchings
> <bhutchings@solarflare.com> wrote:
> > On Thu, 2011-01-13 at 16:11 -0800, Mahesh Bandewar wrote:
[...]
> >> +static int do_scommon(int fd, struct ifreq *ifr)
> >> +{
> >> +     struct ethtool_value eval;
> >> +
> >> +     if (common_flags_mask) {
> >> +             eval.cmd = ETHTOOL_GFLAGS;
> >> +             eval.data = 0;
> >> +             ifr->ifr_data = (caddr_t)&eval;
> >> +             if (ioctl(fd, SIOCETHTOOL, ifr)) {
> >> +                     perror("Cannot get device common flags");
> >> +                     return 1;
> >> +             }
> >> +
> >> +             eval.cmd = ETHTOOL_SFLAGS;
> >> +             eval.data =
> >> +                 ((eval.data & ~(common_flags_mask | off_flags_mask)) |
> >> +                  (common_flags_wanted | off_flags_wanted));
> >
> > Why should this use off_flags_mask and off_flags_wanted?  They should
> > both be 0 if this function is called.
> >
> That is right! Actually the get (ETHTOOL_GFLAGS) operation confused
> me. I thought the values are fetched and *preserved* while setting the
> new value. But when looked at it carefully, that is not the case.
> Actually why that ioctl() with ETHTOOL_GFLAGS required?
[...]

This is a read-modify-write operation.  We have to:

1. Parse the options to find out which flags are to be changed
(common_flags_mask) and the wanted values (common_flags_wanted).
2. Read the current flags (ETHTOOL_GFLAGS reads them into eval.data).
3. Modify the flags (eval.data = ...).
4. Write the new flags (ETHTOOL_SFLAGS).

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [PATCH 2/3] vhost-net: Unify the code of mergeable and big buffer handling
From: Jason Wang @ 2011-01-18  3:05 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Jason Wang, virtualization, netdev, kvm, linux-kernel
In-Reply-To: <20110117083644.GA23479@redhat.com>

Michael S. Tsirkin writes:
 > On Mon, Jan 17, 2011 at 04:11:08PM +0800, Jason Wang wrote:
 > > Codes duplication were found between the handling of mergeable and big
 > > buffers, so this patch tries to unify them. This could be easily done
 > > by adding a quota to the get_rx_bufs() which is used to limit the
 > > number of buffers it returns (for mergeable buffer, the quota is
 > > simply UIO_MAXIOV, for big buffers, the quota is just 1), and then the
 > > previous handle_rx_mergeable() could be resued also for big buffers.
 > > 
 > > Signed-off-by: Jason Wang <jasowang@redhat.com>
 > 
 > We actually started this way. However the code turned out
 > to have measureable overhead when handle_rx_mergeable
 > is called with quota 1.
 > So before applying this I'd like to see some data
 > to show this is not the case anymore.
 > 

I've run a round of test (Host->Guest) for these three patches on my desktop:

Without these patches

mergeable buffers:
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.66.91.42 (10.66.91.42) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384     64    60.00       575.87   69.20    26.36    39.375  7.499  
 87380  16384    256    60.01      1123.57   73.16    34.73    21.335  5.064  
 87380  16384    512    60.00      1351.12   75.26    35.80    18.251  4.341  
 87380  16384   1024    60.00      1955.31   74.73    37.67    12.523  3.156  
 87380  16384   2048    60.01      3411.92   74.82    39.49    7.186   1.896  

bug buffers:
Netperf test results
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.66.91.109 (10.66.91.109) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384     64    60.00       567.10   72.06    26.13    41.638  7.550  
 87380  16384    256    60.00      1143.69   74.66    32.58    21.392  4.667  
 87380  16384    512    60.00      1460.92   73.94    33.40    16.585  3.746  
 87380  16384   1024    60.00      3454.85   77.49    33.89    7.349   1.607  
 87380  16384   2048    60.00      3781.11   76.51    38.38    6.631   1.663  

With these patches:

mergeable buffers:
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.66.91.236 (10.66.91.236) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384     64    60.00       657.53   71.27    26.42    35.517  6.583  
 87380  16384    256    60.00      1217.73   74.34    34.67    20.004  4.665  
 87380  16384    512    60.00      1575.25   75.27    37.12    15.658  3.861  
 87380  16384   1024    60.00      2416.07   74.77    37.20    10.140  2.522  
 87380  16384   2048    60.00      3702.29   77.31    36.29    6.842   1.606  

big buffers:
Netperf test results
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.66.91.202 (10.66.91.202) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384     64    60.00       647.67   71.86    27.26    36.356  6.895  
 87380  16384    256    60.00      1265.82   76.19    36.54    19.724  4.729  
 87380  16384    512    60.00      1796.64   76.06    39.48    13.872  3.601  
 87380  16384   1024    60.00      4008.37   77.05    36.47    6.299   1.491  
 87380  16384   2048    60.00      4468.56   75.18    41.79    5.513   1.532 

Looks like the unification does not hurt the performance, and with those patches
we can get some improvement. BTW, the regression of mergeable buffer still
exist.

 > > ---
 > >  drivers/vhost/net.c |  128 +++------------------------------------------------
 > >  1 files changed, 7 insertions(+), 121 deletions(-)
 > > 
 > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
 > > index 95e49de..c32a2e4 100644
 > > --- a/drivers/vhost/net.c
 > > +++ b/drivers/vhost/net.c
 > > @@ -227,6 +227,7 @@ static int peek_head_len(struct sock *sk)
 > >   * @iovcount	- returned count of io vectors we fill
 > >   * @log		- vhost log
 > >   * @log_num	- log offset
 > > + * @quota       - headcount quota, 1 for big buffer
 > >   *	returns number of buffer heads allocated, negative on error
 > >   */
 > >  static int get_rx_bufs(struct vhost_virtqueue *vq,
 > > @@ -234,7 +235,8 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
 > >  		       int datalen,
 > >  		       unsigned *iovcount,
 > >  		       struct vhost_log *log,
 > > -		       unsigned *log_num)
 > > +		       unsigned *log_num,
 > > +		       unsigned int quota)
 > >  {
 > >  	unsigned int out, in;
 > >  	int seg = 0;
 > > @@ -242,7 +244,7 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
 > >  	unsigned d;
 > >  	int r, nlogs = 0;
 > >  
 > > -	while (datalen > 0) {
 > > +	while (datalen > 0 && headcount < quota) {
 > >  		if (unlikely(seg >= UIO_MAXIOV)) {
 > >  			r = -ENOBUFS;
 > >  			goto err;
 > > @@ -282,116 +284,7 @@ err:
 > >  
 > >  /* Expects to be always run from workqueue - which acts as
 > >   * read-size critical section for our kind of RCU. */
 > > -static void handle_rx_big(struct vhost_net *net)
 > > -{
 > > -	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
 > > -	unsigned out, in, log, s;
 > > -	int head;
 > > -	struct vhost_log *vq_log;
 > > -	struct msghdr msg = {
 > > -		.msg_name = NULL,
 > > -		.msg_namelen = 0,
 > > -		.msg_control = NULL, /* FIXME: get and handle RX aux data. */
 > > -		.msg_controllen = 0,
 > > -		.msg_iov = vq->iov,
 > > -		.msg_flags = MSG_DONTWAIT,
 > > -	};
 > > -
 > > -	struct virtio_net_hdr hdr = {
 > > -		.flags = 0,
 > > -		.gso_type = VIRTIO_NET_HDR_GSO_NONE
 > > -	};
 > > -
 > > -	size_t len, total_len = 0;
 > > -	int err;
 > > -	size_t hdr_size;
 > > -	struct socket *sock = rcu_dereference(vq->private_data);
 > > -	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
 > > -		return;
 > > -
 > > -	mutex_lock(&vq->mutex);
 > > -	vhost_disable_notify(vq);
 > > -	hdr_size = vq->vhost_hlen;
 > > -
 > > -	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
 > > -		vq->log : NULL;
 > > -
 > > -	for (;;) {
 > > -		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
 > > -					 ARRAY_SIZE(vq->iov),
 > > -					 &out, &in,
 > > -					 vq_log, &log);
 > > -		/* On error, stop handling until the next kick. */
 > > -		if (unlikely(head < 0))
 > > -			break;
 > > -		/* OK, now we need to know about added descriptors. */
 > > -		if (head == vq->num) {
 > > -			if (unlikely(vhost_enable_notify(vq))) {
 > > -				/* They have slipped one in as we were
 > > -				 * doing that: check again. */
 > > -				vhost_disable_notify(vq);
 > > -				continue;
 > > -			}
 > > -			/* Nothing new?  Wait for eventfd to tell us
 > > -			 * they refilled. */
 > > -			break;
 > > -		}
 > > -		/* We don't need to be notified again. */
 > > -		if (out) {
 > > -			vq_err(vq, "Unexpected descriptor format for RX: "
 > > -			       "out %d, int %d\n",
 > > -			       out, in);
 > > -			break;
 > > -		}
 > > -		/* Skip header. TODO: support TSO/mergeable rx buffers. */
 > > -		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, in);
 > > -		msg.msg_iovlen = in;
 > > -		len = iov_length(vq->iov, in);
 > > -		/* Sanity check */
 > > -		if (!len) {
 > > -			vq_err(vq, "Unexpected header len for RX: "
 > > -			       "%zd expected %zd\n",
 > > -			       iov_length(vq->hdr, s), hdr_size);
 > > -			break;
 > > -		}
 > > -		err = sock->ops->recvmsg(NULL, sock, &msg,
 > > -					 len, MSG_DONTWAIT | MSG_TRUNC);
 > > -		/* TODO: Check specific error and bomb out unless EAGAIN? */
 > > -		if (err < 0) {
 > > -			vhost_discard_vq_desc(vq, 1);
 > > -			break;
 > > -		}
 > > -		/* TODO: Should check and handle checksum. */
 > > -		if (err > len) {
 > > -			pr_debug("Discarded truncated rx packet: "
 > > -				 " len %d > %zd\n", err, len);
 > > -			vhost_discard_vq_desc(vq, 1);
 > > -			continue;
 > > -		}
 > > -		len = err;
 > > -		err = memcpy_toiovec(vq->hdr, (unsigned char *)&hdr, hdr_size);
 > > -		if (err) {
 > > -			vq_err(vq, "Unable to write vnet_hdr at addr %p: %d\n",
 > > -			       vq->iov->iov_base, err);
 > > -			break;
 > > -		}
 > > -		len += hdr_size;
 > > -		vhost_add_used_and_signal(&net->dev, vq, head, len);
 > > -		if (unlikely(vq_log))
 > > -			vhost_log_write(vq, vq_log, log, len);
 > > -		total_len += len;
 > > -		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 > > -			vhost_poll_queue(&vq->poll);
 > > -			break;
 > > -		}
 > > -	}
 > > -
 > > -	mutex_unlock(&vq->mutex);
 > > -}
 > > -
 > > -/* Expects to be always run from workqueue - which acts as
 > > - * read-size critical section for our kind of RCU. */
 > > -static void handle_rx_mergeable(struct vhost_net *net)
 > > +static void handle_rx(struct vhost_net *net)
 > >  {
 > >  	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
 > >  	unsigned uninitialized_var(in), log;
 > > @@ -431,7 +324,8 @@ static void handle_rx_mergeable(struct vhost_net *net)
 > >  		sock_len += sock_hlen;
 > >  		vhost_len = sock_len + vhost_hlen;
 > >  		headcount = get_rx_bufs(vq, vq->heads, vhost_len,
 > > -					&in, vq_log, &log);
 > > +					&in, vq_log, &log,
 > > +					likely(mergeable) ? UIO_MAXIOV : 1);
 > >  		/* On error, stop handling until the next kick. */
 > >  		if (unlikely(headcount < 0))
 > >  			break;
 > > @@ -497,14 +391,6 @@ static void handle_rx_mergeable(struct vhost_net *net)
 > >  	mutex_unlock(&vq->mutex);
 > >  }
 > >  
 > > -static void handle_rx(struct vhost_net *net)
 > > -{
 > > -	if (vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF))
 > > -		handle_rx_mergeable(net);
 > > -	else
 > > -		handle_rx_big(net);
 > > -}
 > > -
 > >  static void handle_tx_kick(struct vhost_work *work)
 > >  {
 > >  	struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,

^ permalink raw reply

* Re: [PATCH v2] net: add Faraday FTMAC100 10/100 Ethernet driver
From: Po-Yu Chuang @ 2011-01-18  3:08 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, linux-kernel, ratbert, bhutchings, joe, dilinger
In-Reply-To: <1295288462.3335.55.camel@edumazet-laptop>

Dear Eric,

On Tue, Jan 18, 2011 at 2:21 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> BTW, shouldnt you use cpu_to_be32() or cpu_to_le32(), if this driver is
> multi platform ?

This reminds me another thing. Should I use u32 instead of unsigned int for all
hardware related variables (registers, descriptors) ?
Not quite sure about these cross-platform issues.

best regards,
Po-Yu Chuang

^ permalink raw reply

* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
From: John Fastabend @ 2011-01-18  3:16 UTC (permalink / raw)
  To: Jay Vosburgh; +Cc: Oleg V. Ukhno, netdev@vger.kernel.org, David S. Miller
In-Reply-To: <26330.1295049912@death>

On 1/14/2011 4:05 PM, Jay Vosburgh wrote:
> Oleg V. Ukhno <olegu@yandex-team.ru> wrote:
>> Jay Vosburgh wrote:
>>
>>> 	This is a violation of the 802.3ad (now 802.1ax) standard, 5.2.1
>>> (f), which requires that all frames of a given "conversation" are passed
>>> to a single port.
>>>
>>> 	The existing layer3+4 hash has a similar problem (that it may
>>> send packets from a conversation to multiple ports), but for that case
>>> it's an unlikely exception (only in the case of IP fragmentation), but
>>> here it's the norm.  At a minimum, this must be clearly documented.
>>>
>>> 	Also, what does a round robin in 802.3ad provide that the
>>> existing round robin does not?  My presumption is that you're looking to
>>> get the aggregator autoconfiguration that 802.3ad provides, but you
>>> don't say.
> 
> 	I'm still curious about this question.  Given the rather
> intricate setup of your particular network (described below), I'm not
> sure why 802.3ad is of benefit over traditional etherchannel
> (balance-rr / balance-xor).
> 
>>> 	I don't necessarily think this is a bad cheat (round robining on
>>> 802.3ad as an explicit non-standard extension), since everybody wants to
>>> stripe their traffic across multiple slaves.  I've given some thought to
>>> making round robin into just another hash mode, but this also does some
>>> magic to the MAC addresses of the outgoing frames (more on that below).
>> Yes, I am resetting MAC addresses when transmitting packets to have switch
>> to put packets into different ports of the receiving etherchannel.
> 
> 	By "etherchannel" do you really mean "Cisco switch with a
> port-channel group using LACP"?
> 
>> I am using this patch to provide full-mesh ISCSI connectivity between at
>> least 4 hosts (all hosts of course are in same ethernet segment) and every
>> host is connected with aggregate link with 4 slaves(usually).
>> Using round-robin I provide near-equal load striping when transmitting,
>> using MAC address magic I force switch to stripe packets over all slave
>> links in destination port-channel(when number of rx-ing slaves is equal to
>> number ot tx-ing slaves and is even).
> 
> 	By "MAC address magic" do you mean that you're assigning
> specifically chosen MAC addresses to the slaves so that the switch's
> hash is essentially "assigning" the bonding slaves to particular ports
> on the outgoing port-channel group?
> 
> 	Assuming that this is the case, it's an interesting idea, but
> I'm unconvinced that it's better on 802.3ad vs. balance-rr.  Unless I'm
> missing something, you can get everything you need from an option to
> have balance-rr / balance-xor utilize the slave's permanent address as
> the source address for outgoing traffic.
> 
>> [...] So I am able to utilize all slaves
>> for tx and for rx up to maximum capacity; besides I am getting L2 link
>> failure detection (and load rebalancing), which is (in my opinion) much
>> faster and robust than L3 or than dm-multipath provides.
>> It's my idea with the patch
> 
> 	Can somebody (John?) more knowledgable than I about dm-multipath
> comment on the above?

Here I'll give it a go.

I don't think detecting L2 link failure this way is very robust. If there
is a failure farther away then your immediate link your going to break
completely? Your bonding hash will continue to round robin the iscsi
packets and half them will get dropped on the floor. dm-multipath handles
this reasonably gracefully. Also in this bonding environment you seem to
be very sensitive to RTT times on the network. Maybe not bad out right but
I wouldn't consider this robust either.

You could tweak your scsi timeout values and fail_fast values, set the io
retry to 0 to cause the fail over to occur faster. I suspect you already
did this and still it is too slow? Maybe adding a checker in multipathd to
listen for link events would be fast enough. The checker could then fail
the path immediately.

I'll try to address your comments from the other thread here. In general I
wonder if it would be better to solve the problems in dm-multipath rather than
add another bonding mode?

OVU - it is slow(I am using ISCSI for Oracle , so I need to minimize latency)

The dm-multipath layer is adding latency? How much? If this is really true
maybe its best to the address the real issue here and not avoid it by
using the bonding layer.

OVU - it handles any link failures bad, because of it's command queue 
limitation(all queued commands above 32 are discarded in case of path 
failure, as I remember)

Maybe true but only link failures with the immediate peer are handled
with a bonding strategy. By working at the block layer we can detect
failures throughout the path. I would need to look into this again I
know when we were looking at this sometime ago there was some talk about
improving this behavior. I need to take some time to go back through the
error recovery stuff to remember how this works.

OVU - it performs very bad when there are many devices and maтy paths(I was 
unable to utilize more that 2Gbps of 4 even with 100 disks with 4 paths 
per each disk)

Hmm well that seems like something is broken. I'll try this setup when
I get some time next few days. This really shouldn't be the case dm-multipath
should not add a bunch of extra latency or effect throughput significantly.
By the way what are you seeing without mpio?

Thanks,
John

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox