* [PATCH] NET: Multiple queue hardware support
@ 2007-06-18 18:42 PJ Waskiewicz
2007-06-18 18:42 ` [PATCH 1/3] NET: [DOC] Multiqueue hardware support documentation PJ Waskiewicz
` (2 more replies)
0 siblings, 3 replies; 24+ messages in thread
From: PJ Waskiewicz @ 2007-06-18 18:42 UTC (permalink / raw)
To: davem; +Cc: netdev, jeff, auke-jan.h.kok, hadi, kaber
Please consider these patches for 2.6.23 inclusion.
This patchset is an updated version of the previous multiqueue network
device support patches. The general approach of introducing a new API for
multiqueue network devices to register with the stack remains unchanged.
The changes include a new round-robin qdisc, heavily based on sch_prio,
which allows queueing to hardware with no OS-enforced queueing policy.
sch_prio still contains the multiqueue code, but gains a Kconfig option to
compile it out of the qdisc. This lets people whose hardware implements its
own scheduling policy use sch_rr (round-robin), while those without
hardware scheduling policies can continue using sch_prio if they want some
notion of scheduling priority.
The patches being sent are split into Documentation, Qdisc changes, and
core stack changes. The requested e1000 changes are still being resolved,
and will be sent at a later date.
I did not modify other users of netif_queue_stopped() in net/core/netpoll.c,
net/core/dev.c, or net/core/pktgen.c, since no classification occurs for
the skb being sent to the device. Therefore, packets should always end up
in queue 0, and there is no need to check the subqueue status either.
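As a rough sketch (not taken verbatim from the patches), this is the shape of
the transmit-time check that qdisc_restart() gains in patch 2/3: both the
device-wide queue state and the subqueue recorded in skb->queue_mapping must
be running before the skb is handed to the driver, while paths that always use
queue 0, such as the netpoll and pktgen cases above, can keep the plain
netif_queue_stopped() test.

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Illustrative helper only; in the patches this logic lives inline in
 * qdisc_restart() in net/sched/sch_generic.c.
 */
static int can_hard_start_xmit(struct net_device *dev, struct sk_buff *skb)
{
        /* Device-wide stop (e.g. single-queue drivers) still applies. */
        if (netif_queue_stopped(dev))
                return 0;

        /* The qdisc stored its chosen Tx queue in skb->queue_mapping;
         * honour that subqueue's state as well.
         */
        if (netif_subqueue_stopped(dev, skb->queue_mapping))
                return 0;

        return 1;       /* safe to call dev_hard_start_xmit() */
}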
The iproute2 patches adding sch_rr support to tc will be sent separately.
--
PJ Waskiewicz <peter.p.waskiewicz.jr@intel.com>
^ permalink raw reply [flat|nested] 24+ messages in thread* [PATCH 1/3] NET: [DOC] Multiqueue hardware support documentation 2007-06-18 18:42 [PATCH] NET: Multiple queue hardware support PJ Waskiewicz @ 2007-06-18 18:42 ` PJ Waskiewicz 2007-06-18 18:42 ` [PATCH 2/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue PJ Waskiewicz 2007-06-18 18:42 ` [PATCH 3/3] NET: [CORE] Stack changes to add multiqueue hardware support API PJ Waskiewicz 2 siblings, 0 replies; 24+ messages in thread From: PJ Waskiewicz @ 2007-06-18 18:42 UTC (permalink / raw) To: davem; +Cc: netdev, jeff, auke-jan.h.kok, hadi, kaber Add a brief howto to Documentation/networking for multiqueue. It explains how to use the multiqueue API in a driver to support multiqueue paths from the stack, as well as the qdiscs to use for feeding a multiqueue device. Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> --- Documentation/networking/multiqueue.txt | 98 +++++++++++++++++++++++++++++++ 1 files changed, 98 insertions(+), 0 deletions(-) diff --git a/Documentation/networking/multiqueue.txt b/Documentation/networking/multiqueue.txt new file mode 100644 index 0000000..8201767 --- /dev/null +++ b/Documentation/networking/multiqueue.txt @@ -0,0 +1,98 @@ + + HOWTO for multiqueue network device support + =========================================== + +Section 1: Base driver requirements for implementing multiqueue support +Section 2: Qdisc support for multiqueue devices +Section 3: Brief howto using PRIO for multiqueue devices + + +Intro: Kernel support for multiqueue devices +--------------------------------------------------------- + +Kernel support for multiqueue devices is only an API that is presented to the +netdevice layer for base drivers to implement. This feature is part of the +core networking stack, and all network devices will be running on the +multiqueue-aware stack. If a base driver only has one queue, then these +changes are transparent to that driver. + + +Section 2: Base driver requirements for implementing multiqueue support +----------------------------------------------------------------------- + +Base drivers are required to use the new alloc_etherdev_mq() or +alloc_netdev_mq() functions to allocate the subqueues for the device. The +underlying kernel API will take care of the allocation and deallocation of +the subqueue memory, as well as netdev configuration of where the queues +exist in memory. + +The base driver will also need to manage the queues as it does the global +netdev->queue_lock today. Therefore base drivers should use the +netif_{start|stop|wake}_subqueue() functions to manage each queue while the +device is still operational. netdev->queue_lock is still used when the device +comes online or when it's completely shut down (unregister_netdev(), etc.). + +Finally, the base driver should indicate that it is a multiqueue device. The +feature flag NETIF_F_MULTI_QUEUE should be added to the netdev->features +bitmap on device initialization. Below is an example from e1000: + +#ifdef CONFIG_E1000_MQ + if ( (adapter->hw.mac.type == e1000_82571) || + (adapter->hw.mac.type == e1000_82572) || + (adapter->hw.mac.type == e1000_80003es2lan)) + netdev->features |= NETIF_F_MULTI_QUEUE; +#endif + + +Section 3: Qdisc support for multiqueue devices +----------------------------------------------- + +Currently two qdiscs support multiqueue devices. A new round-robin qdisc, +sch_rr, and sch_prio. 
The qdisc is responsible for classifying the skb's to
+bands and queues, and will store the queue mapping into skb->queue_mapping.
+Use this field in the base driver to determine which queue to send the skb
+to.
+
+sch_rr has been added for hardware that doesn't want scheduling policies from
+software, so it's a straight round-robin qdisc. It uses the same syntax and
+classification priomap that sch_prio uses, so it should be intuitive to
+configure for people who've used sch_prio.
+
+The PRIO qdisc naturally plugs into a multiqueue device. Upon load of the
+qdisc, PRIO will make a best-effort assignment of queue to PRIO band to evenly
+distribute traffic flows. The algorithm can be found in prio_tune() in
+net/sched/sch_prio.c. Once the association is made, any skb that is
+classified will have skb->queue_mapping set, which will allow the driver to
+properly queue skb's to multiple queues. sch_prio can have these features
+compiled in or out of the module.
+
+
+Section 4: Brief howto using PRIO for multiqueue devices
+--------------------------------------------------------
+
+The userspace command 'tc', part of the iproute2 package, is used to configure
+qdiscs. To add the PRIO qdisc to your network device, assuming the device is
+called eth0, run the following command:
+
+# tc qdisc add dev eth0 root handle 1: prio
+
+This will create 3 bands, with band 0 being the highest priority, and will
+associate those bands with the queues on your NIC. Assuming eth0 has 2 Tx
+queues, the band mapping would look like:
+
+band 0 => queue 0
+band 1 => queue 1
+band 2 => queue 1
+
+Traffic will begin flowing through each queue if your TOS values are assigning
+traffic across the various bands. For example, ssh traffic will always try to
+go out band 0 based on TOS -> Linux priority conversion (realtime traffic),
+so it will be sent out queue 0. ICMP traffic (pings) falls into the "normal"
+traffic classification, which is band 1. Therefore pings will be sent out
+queue 1 on the NIC.
+
+The behavior of tc filters remains the same: a matching filter will override
+the TOS-based priority classification.
+
+
+Author: Peter P. Waskiewicz Jr. <peter.p.waskiewicz.jr@intel.com>

^ permalink raw reply related [flat|nested] 24+ messages in thread
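To make the driver-side contract in the documentation above concrete, the
following is a minimal sketch of how a hypothetical multiqueue driver might
use this API. Only alloc_etherdev_mq(), NETIF_F_MULTI_QUEUE,
skb->queue_mapping and the netif_{stop,wake}_subqueue() helpers come from the
patches (assuming the subqueue helpers take the device and a queue index, as
the qdisc code's use of netif_subqueue_stopped(dev, skb->queue_mapping)
suggests); every my_* name and MY_NUM_TX_QUEUES is made up for illustration.

#include <linux/etherdevice.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

#define MY_NUM_TX_QUEUES 4      /* hypothetical hardware Tx queue count */

struct my_priv {
        int dummy;              /* per-queue ring state would live here */
};

static bool my_ring_full(struct net_device *netdev, int queue); /* hypothetical */

/* Allocation: one subqueue per hardware Tx queue, and advertise
 * multiqueue support to the stack.
 */
static struct net_device *my_mq_alloc(void)
{
        struct net_device *netdev;

        netdev = alloc_etherdev_mq(sizeof(struct my_priv), MY_NUM_TX_QUEUES);
        if (!netdev)
                return NULL;

        netdev->features |= NETIF_F_MULTI_QUEUE;
        return netdev;
}

/* Transmit: the qdisc has already chosen a queue and stored it in
 * skb->queue_mapping; the driver posts to that ring and flow-controls
 * only the matching subqueue.
 */
static int my_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
{
        int queue = skb->queue_mapping;

        /* ... post skb to the hardware ring backing 'queue' ... */

        if (my_ring_full(netdev, queue))
                netif_stop_subqueue(netdev, queue);

        return NETDEV_TX_OK;
}

/* Tx clean-up: once descriptors for 'queue' have been reclaimed,
 * reopen just that subqueue.
 */
static void my_tx_clean(struct net_device *netdev, int queue)
{
        /* ... reclaim completed descriptors ... */
        netif_wake_subqueue(netdev, queue);
}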
* [PATCH 2/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue 2007-06-18 18:42 [PATCH] NET: Multiple queue hardware support PJ Waskiewicz 2007-06-18 18:42 ` [PATCH 1/3] NET: [DOC] Multiqueue hardware support documentation PJ Waskiewicz @ 2007-06-18 18:42 ` PJ Waskiewicz 2007-06-18 19:05 ` Patrick McHardy 2007-06-18 18:42 ` [PATCH 3/3] NET: [CORE] Stack changes to add multiqueue hardware support API PJ Waskiewicz 2 siblings, 1 reply; 24+ messages in thread From: PJ Waskiewicz @ 2007-06-18 18:42 UTC (permalink / raw) To: davem; +Cc: netdev, jeff, auke-jan.h.kok, hadi, kaber Add the new sch_rr qdisc for multiqueue network device support. Allow sch_prio to be compiled with or without multiqueue hardware support. Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> --- include/linux/pkt_sched.h | 11 + net/sched/Kconfig | 22 ++ net/sched/Makefile | 1 net/sched/sch_generic.c | 3 net/sched/sch_prio.c | 64 +++++- net/sched/sch_rr.c | 516 +++++++++++++++++++++++++++++++++++++++++++++ 6 files changed, 610 insertions(+), 7 deletions(-) diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h index d10f353..0d1adaf 100644 --- a/include/linux/pkt_sched.h +++ b/include/linux/pkt_sched.h @@ -22,6 +22,7 @@ #define TC_PRIO_CONTROL 7 #define TC_PRIO_MAX 15 +#define TC_RR_MAX 15 /* Generic queue statistics, available for all the elements. Particular schedulers may have also their private records. @@ -90,6 +91,16 @@ struct tc_fifo_qopt __u32 limit; /* Queue length: bytes for bfifo, packets for pfifo */ }; +/* RR section */ +#define TCQ_RR_BANDS 16 +#define TCQ_MIN_RR_BANDS 2 + +struct tc_rr_qopt +{ + int bands; /* Number of bands */ + __u8 priomap[TC_RR_MAX+1]; /* Map: Linux priority -> RR band */ +}; + /* PRIO section */ #define TCQ_PRIO_BANDS 16 diff --git a/net/sched/Kconfig b/net/sched/Kconfig index 475df84..a532554 100644 --- a/net/sched/Kconfig +++ b/net/sched/Kconfig @@ -111,6 +111,28 @@ config NET_SCH_PRIO To compile this code as a module, choose M here: the module will be called sch_prio. +config NET_SCH_PRIO_MQ + bool "Multiple hardware queue support for PRIO" + depends on NET_SCH_PRIO + ---help--- + Say Y here if you want to allow the PRIO qdisc to assign + flows to multiple hardware queues on an ethernet device. This + will still work on devices with 1 queue. + + Consider this scheduler for devices that do not use + hardware-based scheduling policies. Otherwise, use NET_SCH_RR. + + Most people will say N here. + +config NET_SCH_RR + tristate "Multi Band Round Robin Queuing (RR)" + ---help--- + Say Y here if you want to use an n-band round robin packet + scheduler. + + To compile this code as a module, choose M here: the + module will be caleld sch_rr. 
+ config NET_SCH_RED tristate "Random Early Detection (RED)" ---help--- diff --git a/net/sched/Makefile b/net/sched/Makefile index 020767a..d3ed44e 100644 --- a/net/sched/Makefile +++ b/net/sched/Makefile @@ -26,6 +26,7 @@ obj-$(CONFIG_NET_SCH_SFQ) += sch_sfq.o obj-$(CONFIG_NET_SCH_TBF) += sch_tbf.o obj-$(CONFIG_NET_SCH_TEQL) += sch_teql.o obj-$(CONFIG_NET_SCH_PRIO) += sch_prio.o +obj-$(CONFIG_NET_SCH_RR) += sch_rr.o obj-$(CONFIG_NET_SCH_ATM) += sch_atm.o obj-$(CONFIG_NET_SCH_NETEM) += sch_netem.o obj-$(CONFIG_NET_CLS_U32) += cls_u32.o diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index 9461e8a..203d5c4 100644 --- a/net/sched/sch_generic.c +++ b/net/sched/sch_generic.c @@ -168,7 +168,8 @@ static inline int qdisc_restart(struct net_device *dev) spin_unlock(&dev->queue_lock); ret = NETDEV_TX_BUSY; - if (!netif_queue_stopped(dev)) + if (!netif_queue_stopped(dev) && + !netif_subqueue_stopped(dev, skb->queue_mapping)) /* churn baby churn .. */ ret = dev_hard_start_xmit(skb, dev); diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c index 6d7542c..44ecdc6 100644 --- a/net/sched/sch_prio.c +++ b/net/sched/sch_prio.c @@ -43,6 +43,7 @@ struct prio_sched_data struct tcf_proto *filter_list; u8 prio2band[TC_PRIO_MAX+1]; struct Qdisc *queues[TCQ_PRIO_BANDS]; + u16 band2queue[TC_PRIO_MAX + 1]; }; @@ -70,14 +71,25 @@ prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr) #endif if (TC_H_MAJ(band)) band = 0; +#ifdef CONFIG_NET_SCH_PRIO_MQ + skb->queue_mapping = + q->band2queue[q->prio2band[band&TC_PRIO_MAX]]; +#endif return q->queues[q->prio2band[band&TC_PRIO_MAX]]; } band = res.classid; } band = TC_H_MIN(band) - 1; - if (band >= q->bands) + if (band >= q->bands) { +#ifdef CONFIG_NET_SCH_PRIO_MQ + skb->queue_mapping = q->band2queue[q->prio2band[0]]; +#endif return q->queues[q->prio2band[0]]; + } +#ifdef CONFIG_NET_SCH_PRIO_MQ + skb->queue_mapping = q->band2queue[band]; +#endif return q->queues[band]; } @@ -144,12 +156,22 @@ prio_dequeue(struct Qdisc* sch) struct Qdisc *qdisc; for (prio = 0; prio < q->bands; prio++) { - qdisc = q->queues[prio]; - skb = qdisc->dequeue(qdisc); - if (skb) { - sch->q.qlen--; - return skb; +#ifdef CONFIG_NET_SCH_PRIO_MQ + /* Check if the target subqueue is available before + * pulling an skb. This way we avoid excessive requeues + * for slower queues. + */ + if (!netif_subqueue_stopped(sch->dev, q->band2queue[prio])) { +#endif + qdisc = q->queues[prio]; + skb = qdisc->dequeue(qdisc); + if (skb) { + sch->q.qlen--; + return skb; + } +#ifdef CONFIG_NET_SCH_PRIO_MQ } +#endif } return NULL; @@ -200,6 +222,10 @@ static int prio_tune(struct Qdisc *sch, struct rtattr *opt) struct prio_sched_data *q = qdisc_priv(sch); struct tc_prio_qopt *qopt = RTA_DATA(opt); int i; + int queue; + int qmapoffset; + int offset; + int mod; if (opt->rta_len < RTA_LENGTH(sizeof(*qopt))) return -EINVAL; @@ -242,6 +268,32 @@ static int prio_tune(struct Qdisc *sch, struct rtattr *opt) } } } +#ifdef CONFIG_NET_SCH_PRIO_MQ + /* setup queue to band mapping */ + if (q->bands < sch->dev->egress_subqueue_count) { + qmapoffset = 1; + mod = sch->dev->egress_subqueue_count; + } else { + mod = q->bands % sch->dev->egress_subqueue_count; + qmapoffset = q->bands / sch->dev->egress_subqueue_count + + ((mod) ? 1 : 0); + } + + queue = 0; + offset = 0; + for (i = 0; i < q->bands; i++) { + q->band2queue[i] = queue; + if ( ((i + 1) - offset) == qmapoffset) { + queue++; + offset += qmapoffset; + if (mod) + mod--; + qmapoffset = q->bands / + sch->dev->egress_subqueue_count + + ((mod) ? 
1 : 0); + } + } +#endif return 0; } diff --git a/net/sched/sch_rr.c b/net/sched/sch_rr.c new file mode 100644 index 0000000..ce9f237 --- /dev/null +++ b/net/sched/sch_rr.c @@ -0,0 +1,516 @@ +/* + * net/sched/sch_rr.c Simple n-band round-robin scheduler. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + * + * The core part of this qdisc is based on sch_prio. ->dequeue() is where + * this scheduler functionally differs. + * + * Author: PJ Waskiewicz, <peter.p.waskiewicz.jr@intel.com> + * + * Original Authors (from PRIO): Alexey Kuznetsov, <kuznet@ms2.inr.ac.ru> + * Fixes: 19990609: J Hadi Salim <hadi@nortelnetworks.com>: + * Init -- EINVAL when opt undefined + */ + +#include <linux/module.h> +#include <asm/uaccess.h> +#include <asm/system.h> +#include <linux/bitops.h> +#include <linux/types.h> +#include <linux/kernel.h> +#include <linux/string.h> +#include <linux/mm.h> +#include <linux/socket.h> +#include <linux/sockios.h> +#include <linux/in.h> +#include <linux/errno.h> +#include <linux/interrupt.h> +#include <linux/if_ether.h> +#include <linux/inet.h> +#include <linux/netdevice.h> +#include <linux/etherdevice.h> +#include <linux/notifier.h> +#include <net/ip.h> +#include <net/route.h> +#include <linux/skbuff.h> +#include <net/netlink.h> +#include <net/sock.h> +#include <net/pkt_sched.h> + + +struct rr_sched_data +{ + int bands; + int curband; + struct tcf_proto *filter_list; + u8 prio2band[TC_RR_MAX + 1]; + struct Qdisc *queues[TCQ_RR_BANDS]; + u16 band2queue[TC_RR_MAX + 1]; +}; + + +static struct Qdisc *rr_classify(struct sk_buff *skb, struct Qdisc *sch, + int *qerr) +{ + struct rr_sched_data *q = qdisc_priv(sch); + u32 band = skb->priority; + struct tcf_result res; + + *qerr = NET_XMIT_BYPASS; + if (TC_H_MAJ(skb->priority) != sch->handle) { +#ifdef CONFIG_NET_CLS_ACT + switch (tc_classify(skb, q->filter_list, &res)) { + case TC_ACT_STOLEN: + case TC_ACT_QUEUED: + *qerr = NET_XMIT_SUCCESS; + case TC_ACT_SHOT: + return NULL; + } + + if (!q->filter_list ) { +#else + if (!q->filter_list || tc_classify(skb, q->filter_list, &res)) { +#endif + if (TC_H_MAJ(band)) + band = 0; + skb->queue_mapping = + q->band2queue[q->prio2band[band&TC_RR_MAX]]; + + return q->queues[q->prio2band[band&TC_RR_MAX]]; + } + band = res.classid; + } + band = TC_H_MIN(band) - 1; + if (band > q->bands) { + skb->queue_mapping = q->band2queue[q->prio2band[0]]; + return q->queues[q->prio2band[0]]; + } + + skb->queue_mapping = q->band2queue[band]; + + return q->queues[band]; +} + +static int rr_enqueue(struct sk_buff *skb, struct Qdisc *sch) +{ + struct Qdisc *qdisc; + int ret; + + qdisc = rr_classify(skb, sch, &ret); +#ifdef CONFIG_NET_CLS_ACT + if (qdisc == NULL) { + + if (ret == NET_XMIT_BYPASS) + sch->qstats.drops++; + kfree_skb(skb); + return ret; + } +#endif + + if ((ret = qdisc->enqueue(skb, qdisc)) == NET_XMIT_SUCCESS) { + sch->bstats.bytes += skb->len; + sch->bstats.packets++; + sch->q.qlen++; + return NET_XMIT_SUCCESS; + } + sch->qstats.drops++; + return ret; +} + + +static int rr_requeue(struct sk_buff *skb, struct Qdisc* sch) +{ + struct Qdisc *qdisc; + int ret; + + qdisc = rr_classify(skb, sch, &ret); +#ifdef CONFIG_NET_CLS_ACT + if (qdisc == NULL) { + if (ret == NET_XMIT_BYPASS) + sch->qstats.drops++; + kfree_skb(skb); + return ret; + } +#endif + + if ((ret = qdisc->ops->requeue(skb, qdisc)) == 
NET_XMIT_SUCCESS) { + sch->q.qlen++; + sch->qstats.requeues++; + return 0; + } + sch->qstats.drops++; + return NET_XMIT_DROP; +} + + +static struct sk_buff *rr_dequeue(struct Qdisc* sch) +{ + struct sk_buff *skb; + struct rr_sched_data *q = qdisc_priv(sch); + struct Qdisc *qdisc; + int bandcount; + + /* Only take one pass through the queues. If nothing is available, + * return nothing. + */ + for (bandcount = 0; bandcount < q->bands; bandcount++) { + /* Check if the target subqueue is available before + * pulling an skb. This way we avoid excessive requeues + * for slower queues. If the queue is stopped, try the + * next queue. + */ + if (!netif_subqueue_stopped(sch->dev, q->band2queue[q->curband])) { + qdisc = q->queues[q->curband]; + skb = qdisc->dequeue(qdisc); + if (skb) { + sch->q.qlen--; + q->curband++; + if (q->curband >= q->bands) + q->curband = 0; + return skb; + } + } + q->curband++; + if (q->curband >= q->bands) + q->curband = 0; + } + return NULL; +} + +static unsigned int rr_drop(struct Qdisc* sch) +{ + struct rr_sched_data *q = qdisc_priv(sch); + int band; + unsigned int len; + struct Qdisc *qdisc; + + for (band = q->bands - 1; band >= 0; band--) { + qdisc = q->queues[band]; + if (qdisc->ops->drop && (len = qdisc->ops->drop(qdisc)) != 0) { + sch->q.qlen--; + return len; + } + } + return 0; +} + + +static void rr_reset(struct Qdisc* sch) +{ + int band; + struct rr_sched_data *q = qdisc_priv(sch); + + for (band = 0; band < q->bands; band++) + qdisc_reset(q->queues[band]); + sch->q.qlen = 0; +} + +static void rr_destroy(struct Qdisc* sch) +{ + int band; + struct rr_sched_data *q = qdisc_priv(sch); + + tcf_destroy_chain(q->filter_list); + for (band = 0; band < q->bands; band++) + qdisc_destroy(q->queues[band]); +} + +static int rr_tune(struct Qdisc *sch, struct rtattr *opt) +{ + struct rr_sched_data *q = qdisc_priv(sch); + struct tc_rr_qopt *qopt = RTA_DATA(opt); + int i; + int queue; + int qmapoffset; + int offset; + int mod; + + if (opt->rta_len < RTA_LENGTH(sizeof(*qopt))) + return -EINVAL; + if (qopt->bands > TCQ_RR_BANDS || qopt->bands < 2) + return -EINVAL; + + for (i = 0; i <= TC_RR_MAX; i++) { + if (qopt->priomap[i] >= qopt->bands) + return -EINVAL; + } + + sch_tree_lock(sch); + q->bands = qopt->bands; + memcpy(q->prio2band, qopt->priomap, TC_PRIO_MAX+1); + q->curband = 0; + + for (i = q->bands; i < TCQ_RR_BANDS; i++) { + struct Qdisc *child = xchg(&q->queues[i], &noop_qdisc); + if (child != &noop_qdisc) { + qdisc_tree_decrease_qlen(child, child->q.qlen); + qdisc_destroy(child); + } + } + sch_tree_unlock(sch); + + for (i = 0; i < q->bands; i++) { + if (q->queues[i] == &noop_qdisc) { + struct Qdisc *child; + child = qdisc_create_dflt(sch->dev, &pfifo_qdisc_ops, + TC_H_MAKE(sch->handle, i + 1)); + if (child) { + sch_tree_lock(sch); + child = xchg(&q->queues[i], child); + + if (child != &noop_qdisc) { + qdisc_tree_decrease_qlen(child, + child->q.qlen); + qdisc_destroy(child); + } + sch_tree_unlock(sch); + } + } + } + /* setup queue to band mapping - best effort to map into available + * hardware queues + */ + if (q->bands < sch->dev->egress_subqueue_count) { + qmapoffset = 1; + mod = sch->dev->egress_subqueue_count; + } else { + mod = q->bands % sch->dev->egress_subqueue_count; + qmapoffset = q->bands / sch->dev->egress_subqueue_count + + ((mod) ? 
1 : 0); + } + + queue = 0; + offset = 0; + for (i = 0; i < q->bands; i++) { + q->band2queue[i] = queue; + if ( ((i + 1) - offset) == qmapoffset) { + queue++; + offset += qmapoffset; + if (mod) + mod--; + qmapoffset = q->bands / + sch->dev->egress_subqueue_count + + ((mod) ? 1 : 0); + } + } + + return 0; +} + +static int rr_init(struct Qdisc *sch, struct rtattr *opt) +{ + struct rr_sched_data *q = qdisc_priv(sch); + int i; + + for (i = 0; i < TCQ_RR_BANDS; i++) + q->queues[i] = &noop_qdisc; + + if (opt == NULL) { + return -EINVAL; + } else { + int err; + + if ((err = rr_tune(sch, opt)) != 0) + return err; + } + return 0; +} + +static int rr_dump(struct Qdisc *sch, struct sk_buff *skb) +{ + struct rr_sched_data *q = qdisc_priv(sch); + unsigned char *b = skb_tail_pointer(skb); + struct tc_rr_qopt opt; + + opt.bands = q->bands; + memcpy(&opt.priomap, q->prio2band, TC_RR_MAX + 1); + RTA_PUT(skb, TCA_OPTIONS, sizeof(opt), &opt); + return skb->len; + +rtattr_failure: + nlmsg_trim(skb, b); + return -1; +} + +static int rr_graft(struct Qdisc *sch, unsigned long arg, struct Qdisc *new, + struct Qdisc **old) +{ + struct rr_sched_data *q = qdisc_priv(sch); + unsigned long band = arg - 1; + + if (band >= q->bands) + return -EINVAL; + + if (new == NULL) + new = &noop_qdisc; + + sch_tree_lock(sch); + *old = q->queues[band]; + q->queues[band] = new; + qdisc_tree_decrease_qlen(*old, (*old)->q.qlen); + qdisc_reset(*old); + sch_tree_unlock(sch); + + return 0; +} + +static struct Qdisc *rr_leaf(struct Qdisc *sch, unsigned long arg) +{ + struct rr_sched_data *q = qdisc_priv(sch); + unsigned long band = arg - 1; + + if (band >= q->bands) + return NULL; + + return q->queues[band]; +} + +static unsigned long rr_get(struct Qdisc *sch, u32 classid) +{ + struct rr_sched_data *q = qdisc_priv(sch); + unsigned long band = TC_H_MIN(classid); + + if (band - 1 >= q->bands) + return 0; + return band; +} + +static unsigned long rr_bind(struct Qdisc *sch, unsigned long parent, + u32 classid) +{ + return rr_get(sch, classid); +} + + +static void rr_put(struct Qdisc *q, unsigned long cl) +{ + return; +} + +static int rr_change(struct Qdisc *sch, u32 handle, u32 parent, + struct rtattr **tca, unsigned long *arg) +{ + unsigned long cl = *arg; + struct rr_sched_data *q = qdisc_priv(sch); + + if (cl - 1 > q->bands) + return -ENOENT; + return 0; +} + +static int rr_delete(struct Qdisc *sch, unsigned long cl) +{ + struct rr_sched_data *q = qdisc_priv(sch); + if (cl - 1 > q->bands) + return -ENOENT; + return 0; +} + + +static int rr_dump_class(struct Qdisc *sch, unsigned long cl, + struct sk_buff *skb, struct tcmsg *tcm) +{ + struct rr_sched_data *q = qdisc_priv(sch); + + if (cl - 1 > q->bands) + return -ENOENT; + tcm->tcm_handle |= TC_H_MIN(cl); + if (q->queues[cl - 1]) + tcm->tcm_info = q->queues[cl - 1]->handle; + return 0; +} + +static int rr_dump_class_stats(struct Qdisc *sch, unsigned long cl, + struct gnet_dump *d) +{ + struct rr_sched_data *q = qdisc_priv(sch); + struct Qdisc *cl_q; + + cl_q = q->queues[cl - 1]; + if (gnet_stats_copy_basic(d, &cl_q->bstats) < 0 || + gnet_stats_copy_queue(d, &cl_q->qstats) < 0) + return -1; + + return 0; +} + +static void rr_walk(struct Qdisc *sch, struct qdisc_walker *arg) +{ + struct rr_sched_data *q = qdisc_priv(sch); + int band; + + if (arg->stop) + return; + + for (band = 0; band < q->bands; band++) { + if (arg->count < arg->skip) { + arg->count++; + continue; + } + if (arg->fn(sch, band + 1, arg) < 0) { + arg->stop = 1; + break; + } + arg->count++; + } +} + +static struct tcf_proto 
**rr_find_tcf(struct Qdisc *sch, unsigned long cl) +{ + struct rr_sched_data *q = qdisc_priv(sch); + + if (cl) + return NULL; + return &q->filter_list; +} + +static struct Qdisc_class_ops rr_class_ops = { + .graft = rr_graft, + .leaf = rr_leaf, + .get = rr_get, + .put = rr_put, + .change = rr_change, + .delete = rr_delete, + .walk = rr_walk, + .tcf_chain = rr_find_tcf, + .bind_tcf = rr_bind, + .unbind_tcf = rr_put, + .dump = rr_dump_class, + .dump_stats = rr_dump_class_stats, +}; + +static struct Qdisc_ops rr_qdisc_ops = { + .next = NULL, + .cl_ops = &rr_class_ops, + .id = "rr", + .priv_size = sizeof(struct rr_sched_data), + .enqueue = rr_enqueue, + .dequeue = rr_dequeue, + .requeue = rr_requeue, + .drop = rr_drop, + .init = rr_init, + .reset = rr_reset, + .destroy = rr_destroy, + .change = rr_tune, + .dump = rr_dump, + .owner = THIS_MODULE, +}; + +static int __init rr_module_init(void) +{ + return register_qdisc(&rr_qdisc_ops); +} + +static void __exit rr_module_exit(void) +{ + unregister_qdisc(&rr_qdisc_ops); +} + +module_init(rr_module_init) +module_exit(rr_module_exit) + +MODULE_LICENSE("GPL"); ^ permalink raw reply related [flat|nested] 24+ messages in thread
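An aside for readers skimming the diff: the behaviour of rr_dequeue() can be
seen with a small userspace simulation. This is not kernel code and mimics
only the band walk: at most one packet is returned per call, a single pass is
made over the bands starting at curband, bands whose subqueue is stopped (or
empty) are skipped, and curband always advances past the band just examined.

#include <stdio.h>
#include <stdbool.h>

#define BANDS 4

/* Per-band state for the simulation: queued packet count and whether
 * the hardware subqueue backing the band is currently stopped.
 */
static int  backlog[BANDS] = { 3, 3, 3, 3 };
static bool stopped[BANDS] = { false, true, false, false };
static int  curband;

/* Mimics rr_dequeue(): one pass, skip stopped subqueues, advance
 * curband past whichever band was examined.
 */
static int rr_dequeue_sim(void)
{
        int pass;

        for (pass = 0; pass < BANDS; pass++) {
                int band = curband;

                curband = (curband + 1) % BANDS;
                if (!stopped[band] && backlog[band] > 0) {
                        backlog[band]--;
                        return band;
                }
        }
        return -1;      /* nothing available on this pass */
}

int main(void)
{
        int i, band;

        for (i = 0; i < 10; i++) {
                band = rr_dequeue_sim();
                if (band < 0)
                        printf("call %2d: nothing to send\n", i);
                else
                        printf("call %2d: sent from band %d\n", i, band);
        }
        return 0;
}

With band 1's subqueue stopped, the first calls serve bands 0, 2, 3, 0, 2, 3,
and so on; band 1's traffic simply waits until its subqueue is woken, without
blocking the other bands.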
* Re: [PATCH 2/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue 2007-06-18 18:42 ` [PATCH 2/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue PJ Waskiewicz @ 2007-06-18 19:05 ` Patrick McHardy 2007-06-18 20:36 ` Waskiewicz Jr, Peter P 0 siblings, 1 reply; 24+ messages in thread From: Patrick McHardy @ 2007-06-18 19:05 UTC (permalink / raw) To: PJ Waskiewicz; +Cc: davem, netdev, jeff, auke-jan.h.kok, hadi PJ Waskiewicz wrote: > > diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c > index 6d7542c..44ecdc6 100644 > --- a/net/sched/sch_prio.c > +++ b/net/sched/sch_prio.c > } > +#ifdef CONFIG_NET_SCH_PRIO_MQ > + /* setup queue to band mapping */ > + if (q->bands < sch->dev->egress_subqueue_count) { > + qmapoffset = 1; > + mod = sch->dev->egress_subqueue_count; > + } else { > + mod = q->bands % sch->dev->egress_subqueue_count; > + qmapoffset = q->bands / sch->dev->egress_subqueue_count > + + ((mod) ? 1 : 0); > + } > + > + queue = 0; > + offset = 0; > + for (i = 0; i < q->bands; i++) { > + q->band2queue[i] = queue; > + if ( ((i + 1) - offset) == qmapoffset) { > + queue++; > + offset += qmapoffset; > + if (mod) > + mod--; > + qmapoffset = q->bands / > + sch->dev->egress_subqueue_count + > + ((mod) ? 1 : 0); > + } > + } > +#endif This should really go, its not only ugly, it also makes no sense to use more bands than queues since that means multiple bands of different priorities are controlled through a single queue state, so lower priority bands can stop the queue for higher priority ones. The user should enable multiqueue behaviour and using it with a non-matching parameters should simply return an error. > return 0; > } > > diff --git a/net/sched/sch_rr.c b/net/sched/sch_rr.c > new file mode 100644 > index 0000000..ce9f237 > --- /dev/null > +++ b/net/sched/sch_rr.c For which multiqueue capable device is this? Jamal mentioned that e1000 uses drr. > @@ -0,0 +1,516 @@ > +/* > + * net/sched/sch_rr.c Simple n-band round-robin scheduler. > + * > + * This program is free software; you can redistribute it and/or > + * modify it under the terms of the GNU General Public License > + * as published by the Free Software Foundation; either version > + * 2 of the License, or (at your option) any later version. > + * > + * The core part of this qdisc is based on sch_prio. ->dequeue() is where > + * this scheduler functionally differs. > + * > + * Author: PJ Waskiewicz, <peter.p.waskiewicz.jr@intel.com> > + * > + * Original Authors (from PRIO): Alexey Kuznetsov, <kuznet@ms2.inr.ac.ru> > + * Fixes: 19990609: J Hadi Salim <hadi@nortelnetworks.com>: > + * Init -- EINVAL when opt undefined > + */ > + > +#include <linux/module.h> > +#include <asm/uaccess.h> > +#include <asm/system.h> > +#include <linux/bitops.h> > +#include <linux/types.h> > +#include <linux/kernel.h> > +#include <linux/string.h> > +#include <linux/mm.h> > +#include <linux/socket.h> > +#include <linux/sockios.h> > +#include <linux/in.h> > +#include <linux/errno.h> > +#include <linux/interrupt.h> > +#include <linux/if_ether.h> > +#include <linux/inet.h> > +#include <linux/netdevice.h> > +#include <linux/etherdevice.h> > +#include <linux/notifier.h> > +#include <net/ip.h> > +#include <net/route.h> > +#include <linux/skbuff.h> > +#include <net/netlink.h> > +#include <net/sock.h> > +#include <net/pkt_sched.h> Lots os unnecessary includes. 
I have a patch that cleans this up for net/sched, this is the relevant sch_prio part where you copied this from: --- a/net/sched/sch_prio.c +++ b/net/sched/sch_prio.c @@ -12,28 +12,12 @@ */ #include <linux/module.h> -#include <asm/uaccess.h> -#include <asm/system.h> -#include <linux/bitops.h> #include <linux/types.h> #include <linux/kernel.h> #include <linux/string.h> -#include <linux/mm.h> -#include <linux/socket.h> -#include <linux/sockios.h> -#include <linux/in.h> #include <linux/errno.h> -#include <linux/interrupt.h> -#include <linux/if_ether.h> -#include <linux/inet.h> -#include <linux/netdevice.h> -#include <linux/etherdevice.h> -#include <linux/notifier.h> -#include <net/ip.h> -#include <net/route.h> #include <linux/skbuff.h> #include <net/netlink.h> -#include <net/sock.h> #include <net/pkt_sched.h> > + > + > +struct rr_sched_data > +{ > + int bands; > + int curband; > + struct tcf_proto *filter_list; > + u8 prio2band[TC_RR_MAX + 1]; > + struct Qdisc *queues[TCQ_RR_BANDS]; > + u16 band2queue[TC_RR_MAX + 1]; > +}; > + > + > +static struct Qdisc *rr_classify(struct sk_buff *skb, struct Qdisc *sch, > + int *qerr) > +{ > + struct rr_sched_data *q = qdisc_priv(sch); > + u32 band = skb->priority; > + struct tcf_result res; > + > + *qerr = NET_XMIT_BYPASS; > + if (TC_H_MAJ(skb->priority) != sch->handle) { > +#ifdef CONFIG_NET_CLS_ACT > + switch (tc_classify(skb, q->filter_list, &res)) { > + case TC_ACT_STOLEN: > + case TC_ACT_QUEUED: > + *qerr = NET_XMIT_SUCCESS; > + case TC_ACT_SHOT: > + return NULL; > + } > + > + if (!q->filter_list ) { > +#else > + if (!q->filter_list || tc_classify(skb, q->filter_list, &res)) { > +#endif > + if (TC_H_MAJ(band)) > + band = 0; > + skb->queue_mapping = > + q->band2queue[q->prio2band[band&TC_RR_MAX]]; > + > + return q->queues[q->prio2band[band&TC_RR_MAX]]; > + } > + band = res.classid; > + } > + band = TC_H_MIN(band) - 1; > + if (band > q->bands) { You copied an off-by-one from an old sch_prio version here. > > +static int rr_tune(struct Qdisc *sch, struct rtattr *opt) > +{ > + struct rr_sched_data *q = qdisc_priv(sch); > + struct tc_rr_qopt *qopt = RTA_DATA(opt); Nested attributes please, don't repeat sch_prio's mistake. > ... > + /* setup queue to band mapping - best effort to map into available > + * hardware queues > + */ > + if (q->bands < sch->dev->egress_subqueue_count) { > + qmapoffset = 1; > + mod = sch->dev->egress_subqueue_count; > + } else { > + mod = q->bands % sch->dev->egress_subqueue_count; > + qmapoffset = q->bands / sch->dev->egress_subqueue_count > + + ((mod) ? 1 : 0); > + } > + > + queue = 0; > + offset = 0; > + for (i = 0; i < q->bands; i++) { > + q->band2queue[i] = queue; > + if ( ((i + 1) - offset) == qmapoffset) { > + queue++; > + offset += qmapoffset; > + if (mod) > + mod--; > + qmapoffset = q->bands / > + sch->dev->egress_subqueue_count + > + ((mod) ? 1 : 0); > + } > + } Should go as well. ^ permalink raw reply [flat|nested] 24+ messages in thread
* RE: [PATCH 2/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue 2007-06-18 19:05 ` Patrick McHardy @ 2007-06-18 20:36 ` Waskiewicz Jr, Peter P 2007-06-18 20:54 ` Patrick McHardy 0 siblings, 1 reply; 24+ messages in thread From: Waskiewicz Jr, Peter P @ 2007-06-18 20:36 UTC (permalink / raw) To: Patrick McHardy; +Cc: davem, netdev, jeff, Kok, Auke-jan H, hadi > PJ Waskiewicz wrote: > > > > diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c index > > 6d7542c..44ecdc6 100644 > > --- a/net/sched/sch_prio.c > > +++ b/net/sched/sch_prio.c > > } > > +#ifdef CONFIG_NET_SCH_PRIO_MQ > > + /* setup queue to band mapping */ > > + if (q->bands < sch->dev->egress_subqueue_count) { > > + qmapoffset = 1; > > + mod = sch->dev->egress_subqueue_count; > > + } else { > > + mod = q->bands % sch->dev->egress_subqueue_count; > > + qmapoffset = q->bands / sch->dev->egress_subqueue_count > > + + ((mod) ? 1 : 0); > > + } > > + > > + queue = 0; > > + offset = 0; > > + for (i = 0; i < q->bands; i++) { > > + q->band2queue[i] = queue; > > + if ( ((i + 1) - offset) == qmapoffset) { > > + queue++; > > + offset += qmapoffset; > > + if (mod) > > + mod--; > > + qmapoffset = q->bands / > > + sch->dev->egress_subqueue_count + > > + ((mod) ? 1 : 0); > > + } > > + } > > +#endif > > This should really go, its not only ugly, it also makes no > sense to use more bands than queues since that means multiple > bands of different priorities are controlled through a single > queue state, so lower priority bands can stop the queue for > higher priority ones. > > The user should enable multiqueue behaviour and using it with > a non-matching parameters should simply return an error. That sounds fine to me. I'll clean this and sch_prio up. > > diff --git a/net/sched/sch_rr.c b/net/sched/sch_rr.c new file mode > > 100644 index 0000000..ce9f237 > > --- /dev/null > > +++ b/net/sched/sch_rr.c > > For which multiqueue capable device is this? Jamal mentioned > that e1000 > uses drr. E1000 is capable of doing DRR and WRR, however the way our drivers are written, they're straight round-robin. But this qdisc would be useful for devices such as wireless where they have their own scheduler in the MAC, and do not want the stack to prioritize traffic beyond that. This way a user can classify multiple flows into multiple bands, but the driver will control the prioritization of the traffic beyond that. For MAC's that have no strict scheduler (such as e1000 as it is today), they can use sch_prio to achieve multiple queue support, but have scheduling priority from the stack. This qdisc can also be useful at the physical netdev layer for virtualization I would think, but perhaps I haven't thought that hard about that yet. > Lots os unnecessary includes. I have a patch that cleans this up for > net/sched, > this is the relevant sch_prio part where you copied this from: I'll clean the includes up. Thanks! > > + band = TC_H_MIN(band) - 1; > > + if (band > q->bands) { > > You copied an off-by-one from an old sch_prio version here. Hmm. This is the sch_prio from the first 2.6.23-dev tree. I'll resync and make sure it's the correct one. > > +static int rr_tune(struct Qdisc *sch, struct rtattr *opt) > > +{ > > + struct rr_sched_data *q = qdisc_priv(sch); > > + struct tc_rr_qopt *qopt = RTA_DATA(opt); > > > Nested attributes please, don't repeat sch_prio's mistake. I'm not sure I understand what you mean here about nested attributes. > > > ... 
> > + /* setup queue to band mapping - best effort to map > into available > > + * hardware queues > > + */ > > + if (q->bands < sch->dev->egress_subqueue_count) { > > + qmapoffset = 1; > > + mod = sch->dev->egress_subqueue_count; > > + } else { > > + mod = q->bands % sch->dev->egress_subqueue_count; > > + qmapoffset = q->bands / sch->dev->egress_subqueue_count > > + + ((mod) ? 1 : 0); > > + } > > + > > + queue = 0; > > + offset = 0; > > + for (i = 0; i < q->bands; i++) { > > + q->band2queue[i] = queue; > > + if ( ((i + 1) - offset) == qmapoffset) { > > + queue++; > > + offset += qmapoffset; > > + if (mod) > > + mod--; > > + qmapoffset = q->bands / > > + sch->dev->egress_subqueue_count + > > + ((mod) ? 1 : 0); > > + } > > + } > > Should go as well. > Works for me. I'll clean this up as well. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue 2007-06-18 20:36 ` Waskiewicz Jr, Peter P @ 2007-06-18 20:54 ` Patrick McHardy 2007-06-18 21:04 ` Waskiewicz Jr, Peter P 2007-06-21 17:55 ` Waskiewicz Jr, Peter P 0 siblings, 2 replies; 24+ messages in thread From: Patrick McHardy @ 2007-06-18 20:54 UTC (permalink / raw) To: Waskiewicz Jr, Peter P; +Cc: davem, netdev, jeff, Kok, Auke-jan H, hadi Waskiewicz Jr, Peter P wrote: >>>+ band = TC_H_MIN(band) - 1; >>>+ if (band > q->bands) { >> >>You copied an off-by-one from an old sch_prio version here. > > > Hmm. This is the sch_prio from the first 2.6.23-dev tree. I'll resync > and make sure it's the correct one. Current 2.6.22-rc and net-2.6.23 have if (band >= q->bands) >>>+static int rr_tune(struct Qdisc *sch, struct rtattr *opt) >>>+{ >>>+ struct rr_sched_data *q = qdisc_priv(sch); >>>+ struct tc_rr_qopt *qopt = RTA_DATA(opt); >> >> >>Nested attributes please, don't repeat sch_prio's mistake. > > > I'm not sure I understand what you mean here about nested attributes. Nested netlink attributes, like most qdisc use, instead of struct tc_rr_qopt (or additionally). The way you've done it makes it hard to add further attributes later. BTw, couldn't you just merge sch_rr with prio? AFAICT you only need a new dequeue function, a new struct Qdisc_ops and a MODULE_ALIAS. ^ permalink raw reply [flat|nested] 24+ messages in thread
* RE: [PATCH 2/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue 2007-06-18 20:54 ` Patrick McHardy @ 2007-06-18 21:04 ` Waskiewicz Jr, Peter P 2007-06-18 21:11 ` Patrick McHardy 2007-06-21 17:55 ` Waskiewicz Jr, Peter P 1 sibling, 1 reply; 24+ messages in thread From: Waskiewicz Jr, Peter P @ 2007-06-18 21:04 UTC (permalink / raw) To: Patrick McHardy; +Cc: davem, netdev, jeff, Kok, Auke-jan H, hadi > > Hmm. This is the sch_prio from the first 2.6.23-dev tree. I'll > > resync and make sure it's the correct one. > > Current 2.6.22-rc and net-2.6.23 have > > if (band >= q->bands) I just pulled 2.6.23 down, and see that is true. I must have had that left over. I'll fix that. > > I'm not sure I understand what you mean here about nested > attributes. > > > Nested netlink attributes, like most qdisc use, instead of > struct tc_rr_qopt (or additionally). The way you've done it > makes it hard to add further attributes later. I'm going to need to think about this more, since I'm not immediately getting what you're referring to. I see the qdisc using tc_prio_qopt as a single member; do you have an example outside of the qdiscs I can look at and see what you're referring to? Please bear with me: my netlink skills are still very green. > > BTw, couldn't you just merge sch_rr with prio? AFAICT you > only need a new dequeue function, a new struct Qdisc_ops and > a MODULE_ALIAS. Are you suggesting a module that can determine RR or PRIO at runtime? Because the two are so similar, I definitely thought about combining them, but because of the dequeue difference, you'd need a load-time switch to determine which mode to run the module in. That would break ABI for sch_prio, which I was trying to avoid. Thanks Patrick, -PJ ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue 2007-06-18 21:04 ` Waskiewicz Jr, Peter P @ 2007-06-18 21:11 ` Patrick McHardy 0 siblings, 0 replies; 24+ messages in thread From: Patrick McHardy @ 2007-06-18 21:11 UTC (permalink / raw) To: Waskiewicz Jr, Peter P; +Cc: davem, netdev, jeff, Kok, Auke-jan H, hadi Waskiewicz Jr, Peter P wrote: >> Nested netlink attributes, like most qdisc use, instead of >> struct tc_rr_qopt (or additionally). The way you've done it >> makes it hard to add further attributes later. >> > > I'm going to need to think about this more, since I'm not immediately > getting what you're referring to. I see the qdisc using tc_prio_qopt as > a single member; do you have an example outside of the qdiscs I can look > at and see what you're referring to? Please bear with me: my netlink > skills are still very green. > Qdisc private parameters are within the TCA_OPTION attribute. The data under that attribute can either be a structure (which you used) or more netlink attributes specific to a single qdisc, which allows to easily add new attributes. For a simple qdisc example look at sch_red or grep for rta_parse_nested and nla_parse_nested. > >> BTw, couldn't you just merge sch_rr with prio? AFAICT you >> only need a new dequeue function, a new struct Qdisc_ops and >> a MODULE_ALIAS. >> > > Are you suggesting a module that can determine RR or PRIO at runtime? > Because the two are so similar, I definitely thought about combining > them, but because of the dequeue difference, you'd need a load-time > switch to determine which mode to run the module in. That would break > ABI for sch_prio, which I was trying to avoid. Yes, all you need to do is register two different struct Qdisc_ops and provide an alias for autoloading. ^ permalink raw reply [flat|nested] 24+ messages in thread
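To make the nested-attribute suggestion concrete: the idea is that TCA_OPTIONS
carries further netlink attributes rather than a bare struct, so new
parameters can be added later without breaking the ABI. Below is a rough
fragment of what that could look like for sch_rr with the rtattr helpers of
this era (compare sch_red and sch_netem). The TCA_RR_* names are hypothetical
and do not exist in the posted patches, and the fragment assumes it sits in
sch_rr.c next to the existing includes.

/* Hypothetical nested attribute layout for sch_rr. */
enum {
        TCA_RR_UNSPEC,
        TCA_RR_PARMS,           /* struct tc_rr_qopt */
        TCA_RR_SOMETHING_NEW,   /* room for future parameters */
        __TCA_RR_MAX,
};
#define TCA_RR_MAX (__TCA_RR_MAX - 1)

static int rr_tune_nested(struct Qdisc *sch, struct rtattr *opt)
{
        struct rtattr *tb[TCA_RR_MAX];
        struct tc_rr_qopt *qopt;

        /* Parse the attributes nested inside TCA_OPTIONS instead of
         * treating its payload as a single fixed-size struct.
         */
        if (opt == NULL || rtattr_parse_nested(tb, TCA_RR_MAX, opt))
                return -EINVAL;

        if (tb[TCA_RR_PARMS - 1] == NULL ||
            RTA_PAYLOAD(tb[TCA_RR_PARMS - 1]) < sizeof(*qopt))
                return -EINVAL;
        qopt = RTA_DATA(tb[TCA_RR_PARMS - 1]);

        /* ... validate qopt->bands and qopt->priomap as rr_tune() does ... */
        return 0;
}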
* RE: [PATCH 2/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue 2007-06-18 20:54 ` Patrick McHardy 2007-06-18 21:04 ` Waskiewicz Jr, Peter P @ 2007-06-21 17:55 ` Waskiewicz Jr, Peter P 2007-06-21 18:04 ` Patrick McHardy 1 sibling, 1 reply; 24+ messages in thread From: Waskiewicz Jr, Peter P @ 2007-06-21 17:55 UTC (permalink / raw) To: Patrick McHardy; +Cc: davem, netdev, jeff, Kok, Auke-jan H, hadi > BTw, couldn't you just merge sch_rr with prio? AFAICT you > only need a new dequeue function, a new struct Qdisc_ops and > a MODULE_ALIAS. Ok, I have this somewhat working, but need to poll for some help from the community. I used MODULE_ALIAS("sch_rr") in sch_prio.c, and modprobe is happily loading sch_prio.ko when I ask for sch_rr.ko. It also recognizes the correct ops struct to associate with the instance of the module. However, when I try to load the qdisc via tc (modified version that knows sch_rr), I'm getting No Such File or Directory from RTNETLINK. It's looking for sch_rr.ko, and is bailing. I've scoured the code looking for a reason why, and am drawing a blank. I'll continue looking, but if this sounds familiar to someone who knows how to get around this, please reply and let me know. Thanks, -PJ Waskiewicz ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue 2007-06-21 17:55 ` Waskiewicz Jr, Peter P @ 2007-06-21 18:04 ` Patrick McHardy 2007-06-21 18:12 ` Waskiewicz Jr, Peter P 0 siblings, 1 reply; 24+ messages in thread From: Patrick McHardy @ 2007-06-21 18:04 UTC (permalink / raw) To: Waskiewicz Jr, Peter P; +Cc: davem, netdev, jeff, Kok, Auke-jan H, hadi Waskiewicz Jr, Peter P wrote: >> BTw, couldn't you just merge sch_rr with prio? AFAICT you >> only need a new dequeue function, a new struct Qdisc_ops and >> a MODULE_ALIAS. > > Ok, I have this somewhat working, but need to poll for some help from > the community. I used MODULE_ALIAS("sch_rr") in sch_prio.c, and > modprobe is happily loading sch_prio.ko when I ask for sch_rr.ko. It > also recognizes the correct ops struct to associate with the instance of > the module. However, when I try to load the qdisc via tc (modified > version that knows sch_rr), I'm getting No Such File or Directory from > RTNETLINK. It's looking for sch_rr.ko, and is bailing. I've scoured > the code looking for a reason why, and am drawing a blank. I'll > continue looking, but if this sounds familiar to someone who knows how > to get around this, please reply and let me know. Please post the code. ^ permalink raw reply [flat|nested] 24+ messages in thread
* RE: [PATCH 2/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue 2007-06-21 18:04 ` Patrick McHardy @ 2007-06-21 18:12 ` Waskiewicz Jr, Peter P 2007-06-21 18:17 ` Patrick McHardy 0 siblings, 1 reply; 24+ messages in thread From: Waskiewicz Jr, Peter P @ 2007-06-21 18:12 UTC (permalink / raw) To: Patrick McHardy; +Cc: davem, netdev, jeff, Kok, Auke-jan H, hadi [-- Attachment #1: Type: text/plain, Size: 188 bytes --] > Please post the code. > Code is attached. Please forgive the attachment and any whitespace damage...currently using Doubtlook to send this (cringe). Thanks, -PJ Waskiewicz [-- Attachment #2: sch_prio.c --] [-- Type: application/octet-stream, Size: 12463 bytes --] /* * net/sched/sch_prio.c Simple 3-band priority "scheduler". * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License * as published by the Free Software Foundation; either version * 2 of the License, or (at your option) any later version. * * Authors: Alexey Kuznetsov, <kuznet@ms2.inr.ac.ru> * Fixes: 19990609: J Hadi Salim <hadi@nortelnetworks.com>: * Init -- EINVAL when opt undefined * Additions: Peter P. Waskiewicz Jr. <peter.p.waskiewicz.jr@intel.com> * Added round-robin scheduling for selection at load-time */ #include <linux/module.h> #include <asm/uaccess.h> #include <asm/system.h> #include <linux/bitops.h> #include <linux/types.h> #include <linux/kernel.h> #include <linux/string.h> #include <linux/mm.h> #include <linux/socket.h> #include <linux/sockios.h> #include <linux/in.h> #include <linux/errno.h> #include <linux/interrupt.h> #include <linux/if_ether.h> #include <linux/inet.h> #include <linux/netdevice.h> #include <linux/etherdevice.h> #include <linux/notifier.h> #include <net/ip.h> #include <net/route.h> #include <linux/skbuff.h> #include <net/netlink.h> #include <net/sock.h> #include <net/pkt_sched.h> struct prio_sched_data { int bands; #ifdef CONFIG_NET_SCH_RR int curband; /* for round-robin */ #endif struct tcf_proto *filter_list; u8 prio2band[TC_PRIO_MAX+1]; struct Qdisc *queues[TCQ_PRIO_BANDS]; u16 band2queue[TC_PRIO_MAX + 1]; }; static struct Qdisc * prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr) { struct prio_sched_data *q = qdisc_priv(sch); u32 band = skb->priority; struct tcf_result res; *qerr = NET_XMIT_BYPASS; if (TC_H_MAJ(skb->priority) != sch->handle) { #ifdef CONFIG_NET_CLS_ACT switch (tc_classify(skb, q->filter_list, &res)) { case TC_ACT_STOLEN: case TC_ACT_QUEUED: *qerr = NET_XMIT_SUCCESS; case TC_ACT_SHOT: return NULL; } if (!q->filter_list ) { #else if (!q->filter_list || tc_classify(skb, q->filter_list, &res)) { #endif if (TC_H_MAJ(band)) band = 0; #ifdef CONFIG_NET_SCH_PRIO_MQ skb->queue_mapping = q->band2queue[q->prio2band[band&TC_PRIO_MAX]]; #endif return q->queues[q->prio2band[band&TC_PRIO_MAX]]; } band = res.classid; } band = TC_H_MIN(band) - 1; if (band >= q->bands) { #ifdef CONFIG_NET_SCH_PRIO_MQ skb->queue_mapping = q->band2queue[q->prio2band[0]]; #endif return q->queues[q->prio2band[0]]; } #ifdef CONFIG_NET_SCH_PRIO_MQ skb->queue_mapping = q->band2queue[band]; #endif return q->queues[band]; } static int prio_enqueue(struct sk_buff *skb, struct Qdisc *sch) { struct Qdisc *qdisc; int ret; qdisc = prio_classify(skb, sch, &ret); #ifdef CONFIG_NET_CLS_ACT if (qdisc == NULL) { if (ret == NET_XMIT_BYPASS) sch->qstats.drops++; kfree_skb(skb); return ret; } #endif if ((ret = qdisc->enqueue(skb, qdisc)) == NET_XMIT_SUCCESS) { sch->bstats.bytes += skb->len; sch->bstats.packets++; 
sch->q.qlen++; return NET_XMIT_SUCCESS; } sch->qstats.drops++; return ret; } static int prio_requeue(struct sk_buff *skb, struct Qdisc* sch) { struct Qdisc *qdisc; int ret; qdisc = prio_classify(skb, sch, &ret); #ifdef CONFIG_NET_CLS_ACT if (qdisc == NULL) { if (ret == NET_XMIT_BYPASS) sch->qstats.drops++; kfree_skb(skb); return ret; } #endif if ((ret = qdisc->ops->requeue(skb, qdisc)) == NET_XMIT_SUCCESS) { sch->q.qlen++; sch->qstats.requeues++; return 0; } sch->qstats.drops++; return NET_XMIT_DROP; } static struct sk_buff * prio_dequeue(struct Qdisc* sch) { struct sk_buff *skb; struct prio_sched_data *q = qdisc_priv(sch); int prio; struct Qdisc *qdisc; for (prio = 0; prio < q->bands; prio++) { #ifdef CONFIG_NET_SCH_PRIO_MQ /* Check if the target subqueue is available before * pulling an skb. This way we avoid excessive requeues * for slower queues. */ if (!netif_subqueue_stopped(sch->dev, q->band2queue[prio])) { #endif qdisc = q->queues[prio]; skb = qdisc->dequeue(qdisc); if (skb) { sch->q.qlen--; return skb; } #ifdef CONFIG_NET_SCH_PRIO_MQ } #endif } return NULL; } #ifdef CONFIG_NET_SCH_RR static struct sk_buff *rr_dequeue(struct Qdisc* sch) { struct sk_buff *skb; struct prio_sched_data *q = qdisc_priv(sch); struct Qdisc *qdisc; int bandcount; /* Only take one pass through the queues. If nothing is available, * return nothing. */ for (bandcount = 0; bandcount < q->bands; bandcount++) { /* Check if the target subqueue is available before * pulling an skb. This way we avoid excessive requeues * for slower queues. If the queue is stopped, try the * next queue. */ if (!netif_subqueue_stopped(sch->dev, q->band2queue[q->curband])) { qdisc = q->queues[q->curband]; skb = qdisc->dequeue(qdisc); if (skb) { sch->q.qlen--; q->curband++; if (q->curband >= q->bands) q->curband = 0; return skb; } } q->curband++; if (q->curband >= q->bands) q->curband = 0; } return NULL; } #endif static unsigned int prio_drop(struct Qdisc* sch) { struct prio_sched_data *q = qdisc_priv(sch); int prio; unsigned int len; struct Qdisc *qdisc; for (prio = q->bands-1; prio >= 0; prio--) { qdisc = q->queues[prio]; if (qdisc->ops->drop && (len = qdisc->ops->drop(qdisc)) != 0) { sch->q.qlen--; return len; } } return 0; } static void prio_reset(struct Qdisc* sch) { int prio; struct prio_sched_data *q = qdisc_priv(sch); for (prio=0; prio<q->bands; prio++) qdisc_reset(q->queues[prio]); sch->q.qlen = 0; } static void prio_destroy(struct Qdisc* sch) { int prio; struct prio_sched_data *q = qdisc_priv(sch); tcf_destroy_chain(q->filter_list); for (prio=0; prio<q->bands; prio++) qdisc_destroy(q->queues[prio]); } static int prio_tune(struct Qdisc *sch, struct rtattr *opt) { struct prio_sched_data *q = qdisc_priv(sch); struct tc_prio_qopt *qopt = RTA_DATA(opt); int i; int queue; if (opt->rta_len < RTA_LENGTH(sizeof(*qopt))) return -EINVAL; if (qopt->bands > TCQ_PRIO_BANDS || qopt->bands < 2) return -EINVAL; for (i=0; i<=TC_PRIO_MAX; i++) { if (qopt->priomap[i] >= qopt->bands) return -EINVAL; } /* If we're prio multiqueue or are using round-robin, make * sure the number of incoming bands matches the number of * queues on the device we're associating with. 
*/ #ifdef CONFIG_NET_SCH_RR if (strcmp("rr", sch->ops->id) == 0) if (qopt->bands != sch->dev->egress_subqueue_count) return -EINVAL; #endif #ifdef CONFIG_NET_SCH_PRIO_MQ if (strcmp("prio", sch->ops->id) == 0) if (qopt->bands != sch->dev->egress_subqueue_count) return -EINVAL; #endif sch_tree_lock(sch); q->bands = qopt->bands; memcpy(q->prio2band, qopt->priomap, TC_PRIO_MAX+1); for (i=q->bands; i<TCQ_PRIO_BANDS; i++) { struct Qdisc *child = xchg(&q->queues[i], &noop_qdisc); if (child != &noop_qdisc) { qdisc_tree_decrease_qlen(child, child->q.qlen); qdisc_destroy(child); } } sch_tree_unlock(sch); for (i=0; i<q->bands; i++) { if (q->queues[i] == &noop_qdisc) { struct Qdisc *child; child = qdisc_create_dflt(sch->dev, &pfifo_qdisc_ops, TC_H_MAKE(sch->handle, i + 1)); if (child) { sch_tree_lock(sch); child = xchg(&q->queues[i], child); if (child != &noop_qdisc) { qdisc_tree_decrease_qlen(child, child->q.qlen); qdisc_destroy(child); } sch_tree_unlock(sch); } } } /* setup queue to band mapping */ for (i = 0, queue = 0; i < q->bands; i++, queue++) q->band2queue[i] = queue; return 0; } static int prio_init(struct Qdisc *sch, struct rtattr *opt) { struct prio_sched_data *q = qdisc_priv(sch); int i; for (i=0; i<TCQ_PRIO_BANDS; i++) q->queues[i] = &noop_qdisc; if (opt == NULL) { return -EINVAL; } else { int err; if ((err= prio_tune(sch, opt)) != 0) return err; } return 0; } static int prio_dump(struct Qdisc *sch, struct sk_buff *skb) { struct prio_sched_data *q = qdisc_priv(sch); unsigned char *b = skb_tail_pointer(skb); struct tc_prio_qopt opt; opt.bands = q->bands; memcpy(&opt.priomap, q->prio2band, TC_PRIO_MAX+1); RTA_PUT(skb, TCA_OPTIONS, sizeof(opt), &opt); return skb->len; rtattr_failure: nlmsg_trim(skb, b); return -1; } static int prio_graft(struct Qdisc *sch, unsigned long arg, struct Qdisc *new, struct Qdisc **old) { struct prio_sched_data *q = qdisc_priv(sch); unsigned long band = arg - 1; if (band >= q->bands) return -EINVAL; if (new == NULL) new = &noop_qdisc; sch_tree_lock(sch); *old = q->queues[band]; q->queues[band] = new; qdisc_tree_decrease_qlen(*old, (*old)->q.qlen); qdisc_reset(*old); sch_tree_unlock(sch); return 0; } static struct Qdisc * prio_leaf(struct Qdisc *sch, unsigned long arg) { struct prio_sched_data *q = qdisc_priv(sch); unsigned long band = arg - 1; if (band >= q->bands) return NULL; return q->queues[band]; } static unsigned long prio_get(struct Qdisc *sch, u32 classid) { struct prio_sched_data *q = qdisc_priv(sch); unsigned long band = TC_H_MIN(classid); if (band - 1 >= q->bands) return 0; return band; } static unsigned long prio_bind(struct Qdisc *sch, unsigned long parent, u32 classid) { return prio_get(sch, classid); } static void prio_put(struct Qdisc *q, unsigned long cl) { return; } static int prio_change(struct Qdisc *sch, u32 handle, u32 parent, struct rtattr **tca, unsigned long *arg) { unsigned long cl = *arg; struct prio_sched_data *q = qdisc_priv(sch); if (cl - 1 > q->bands) return -ENOENT; return 0; } static int prio_delete(struct Qdisc *sch, unsigned long cl) { struct prio_sched_data *q = qdisc_priv(sch); if (cl - 1 > q->bands) return -ENOENT; return 0; } static int prio_dump_class(struct Qdisc *sch, unsigned long cl, struct sk_buff *skb, struct tcmsg *tcm) { struct prio_sched_data *q = qdisc_priv(sch); if (cl - 1 > q->bands) return -ENOENT; tcm->tcm_handle |= TC_H_MIN(cl); if (q->queues[cl-1]) tcm->tcm_info = q->queues[cl-1]->handle; return 0; } static int prio_dump_class_stats(struct Qdisc *sch, unsigned long cl, struct gnet_dump *d) { struct 
prio_sched_data *q = qdisc_priv(sch); struct Qdisc *cl_q; cl_q = q->queues[cl - 1]; if (gnet_stats_copy_basic(d, &cl_q->bstats) < 0 || gnet_stats_copy_queue(d, &cl_q->qstats) < 0) return -1; return 0; } static void prio_walk(struct Qdisc *sch, struct qdisc_walker *arg) { struct prio_sched_data *q = qdisc_priv(sch); int prio; if (arg->stop) return; for (prio = 0; prio < q->bands; prio++) { if (arg->count < arg->skip) { arg->count++; continue; } if (arg->fn(sch, prio+1, arg) < 0) { arg->stop = 1; break; } arg->count++; } } static struct tcf_proto ** prio_find_tcf(struct Qdisc *sch, unsigned long cl) { struct prio_sched_data *q = qdisc_priv(sch); if (cl) return NULL; return &q->filter_list; } static struct Qdisc_class_ops prio_class_ops = { .graft = prio_graft, .leaf = prio_leaf, .get = prio_get, .put = prio_put, .change = prio_change, .delete = prio_delete, .walk = prio_walk, .tcf_chain = prio_find_tcf, .bind_tcf = prio_bind, .unbind_tcf = prio_put, .dump = prio_dump_class, .dump_stats = prio_dump_class_stats, }; static struct Qdisc_ops prio_qdisc_ops = { .next = NULL, .cl_ops = &prio_class_ops, .id = "prio", .priv_size = sizeof(struct prio_sched_data), .enqueue = prio_enqueue, .dequeue = prio_dequeue, .requeue = prio_requeue, .drop = prio_drop, .init = prio_init, .reset = prio_reset, .destroy = prio_destroy, .change = prio_tune, .dump = prio_dump, .owner = THIS_MODULE, }; #ifdef CONFIG_NET_SCH_RR static struct Qdisc_ops rr_qdisc_ops = { .next = NULL, .cl_ops = &prio_class_ops, .id = "rr", .priv_size = sizeof(struct prio_sched_data), .enqueue = prio_enqueue, .dequeue = rr_dequeue, .requeue = prio_requeue, .drop = prio_drop, .init = prio_init, .reset = prio_reset, .destroy = prio_destroy, .change = prio_tune, .dump = prio_dump, .owner = THIS_MODULE, }; #endif static int __init prio_module_init(void) { register_qdisc(&prio_qdisc_ops); #ifdef CONFIG_NET_SCH_RR register_qdisc(&rr_qdisc_ops); #endif return 0; } static void __exit prio_module_exit(void) { unregister_qdisc(&prio_qdisc_ops); #ifdef CONFIG_NET_SCH_RR unregister_qdisc(&rr_qdisc_ops); #endif } module_init(prio_module_init) module_exit(prio_module_exit) MODULE_LICENSE("GPL"); MODULE_ALIAS("sch_rr"); ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue 2007-06-21 18:12 ` Waskiewicz Jr, Peter P @ 2007-06-21 18:17 ` Patrick McHardy 2007-06-21 18:23 ` Waskiewicz Jr, Peter P 0 siblings, 1 reply; 24+ messages in thread From: Patrick McHardy @ 2007-06-21 18:17 UTC (permalink / raw) To: Waskiewicz Jr, Peter P; +Cc: davem, netdev, jeff, Kok, Auke-jan H, hadi Waskiewicz Jr, Peter P wrote: >> Please post the code. >> >> > > Code is attached. Please forgive the attachment and any whitespace > damage...currently using Doubtlook to send this (cringe). > The code looks correct. Are you sure you had the config option enabled during your test? ^ permalink raw reply [flat|nested] 24+ messages in thread
* RE: [PATCH 2/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-21 18:17               ` Patrick McHardy
@ 2007-06-21 18:23                 ` Waskiewicz Jr, Peter P
  2007-06-21 19:10                   ` Patrick McHardy
  0 siblings, 1 reply; 24+ messages in thread
From: Waskiewicz Jr, Peter P @ 2007-06-21 18:23 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: davem, netdev, jeff, Kok, Auke-jan H, hadi

[-- Attachment #1: Type: text/plain, Size: 1166 bytes --]

> Waskiewicz Jr, Peter P wrote:
> >> Please post the code.
> >>
> >
> > Code is attached.  Please forgive the attachment and any whitespace
> > damage...currently using Doubtlook to send this (cringe).
> >
>
> The code looks correct. Are you sure you had the config
> option enabled during your test?
>

Yes.  This is the tc command I used to configure the qdisc (with q_rr.c
attached from my patched iproute2 package):

# tc qdisc add dev eth2 root handle 1: rr bands 8
RTNETLINK answers: No such file or directory

At this point, sch_prio gets loaded correctly, but it obviously fails to
finish loading the qdisc.  Using prio works though:

# tc qdisc add dev eth2 root handle 1: prio bands 8

And yes, the NIC I'm working with has 8 queues, just to be clear.

Any help is definitely appreciated; I'm going to keep this copy of the
code for now, but am going to get the separate module written back up
just in case this can't be solved in the short-term.  This is the only
piece keeping me from sending these patches back for consideration, so
I'll keep the parallel effort going.

Thanks Patrick,

-PJ Waskiewicz

[-- Attachment #2: q_rr.c --]
[-- Type: application/octet-stream, Size: 2664 bytes --]

/*
 * q_rr.c		RR.
 *
 *		This program is free software; you can redistribute it and/or
 *		modify it under the terms of the GNU General Public License
 *		as published by the Free Software Foundation; either version
 *		2 of the License, or (at your option) any later version.
 *
 * Authors:	PJ Waskiewicz, <peter.p.waskiewicz.jr@intel.com>
 *		Original Authors: Alexey Kuznetsov, <kuznet@ms2.inr.ac.ru> (from PRIO)
 *
 * Changes:
 *
 * Ole Husgaard <sparre@login.dknet.dk>: 990513: prio2band map was always reset.
 * J Hadi Salim <hadi@cyberus.ca>: 990609: priomap fix.
 */

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <syslog.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>

#include "utils.h"
#include "tc_util.h"

static void explain(void)
{
	fprintf(stderr, "Usage: ... rr bands NUMBER priomap P1 P2...\n");
}

#define usage() return(-1)

static int rr_parse_opt(struct qdisc_util *qu, int argc, char **argv,
			struct nlmsghdr *n)
{
	int ok = 0;
	int pmap_mode = 0;
	int idx = 0;
	struct tc_prio_qopt opt={3,{ 1, 2, 2, 2, 1, 2, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1 }};

	while (argc > 0) {
		if (strcmp(*argv, "bands") == 0) {
			if (pmap_mode)
				explain();
			NEXT_ARG();
			if (get_integer(&opt.bands, *argv, 10)) {
				fprintf(stderr, "Illegal \"bands\"\n");
				return -1;
			}
			ok++;
		} else if (strcmp(*argv, "priomap") == 0) {
			if (pmap_mode) {
				fprintf(stderr, "Error: duplicate priomap\n");
				return -1;
			}
			pmap_mode = 1;
		} else if (strcmp(*argv, "help") == 0) {
			explain();
			return -1;
		} else {
			unsigned band;
			if (!pmap_mode) {
				fprintf(stderr, "What is \"%s\"?\n", *argv);
				explain();
				return -1;
			}
			if (get_unsigned(&band, *argv, 10)) {
				fprintf(stderr, "Illegal \"priomap\" element\n");
				return -1;
			}
			if (band > opt.bands) {
				fprintf(stderr, "\"priomap\" element is out of bands\n");
				return -1;
			}
			if (idx > TC_PRIO_MAX) {
				fprintf(stderr, "\"priomap\" index > TC_RR_MAX=%u\n",
					TC_PRIO_MAX);
				return -1;
			}
			opt.priomap[idx++] = band;
		}
		argc--;
		argv++;
	}

	addattr_l(n, 1024, TCA_OPTIONS, &opt, sizeof(opt));
	return 0;
}

int rr_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
{
	int i;
	struct tc_prio_qopt *qopt;

	if (opt == NULL)
		return 0;
	if (RTA_PAYLOAD(opt) < sizeof(*qopt))
		return -1;
	qopt = RTA_DATA(opt);

	fprintf(f, "bands %u priomap ", qopt->bands);
	for (i = 0; i <= TC_PRIO_MAX; i++)
		fprintf(f, " %d", qopt->priomap[i]);

	return 0;
}

struct qdisc_util rr_qdisc_util = {
	.id		= "rr",
	.parse_qopt	= rr_parse_opt,
	.print_qopt	= rr_print_opt,
};

^ permalink raw reply	[flat|nested] 24+ messages in thread
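A quick usage note on the parser above: as with prio, the optional priomap
assigns one of the bands to each of the 16 Linux priority values, and the
prio default map is used when it is omitted.  A hypothetical invocation
(the device name and band count are examples only, not taken from the
thread's test setup) would be:

    # tc qdisc add dev eth2 root handle 1: rr bands 4 priomap 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3

and rr_print_opt() would render the configuration back through
"tc qdisc show dev eth2" as, roughly, "bands 4 priomap 0 1 2 3 ...".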
* Re: [PATCH 2/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-21 18:23                 ` Waskiewicz Jr, Peter P
@ 2007-06-21 19:10                   ` Patrick McHardy
  2007-06-21 20:15                     ` Waskiewicz Jr, Peter P
  0 siblings, 1 reply; 24+ messages in thread
From: Patrick McHardy @ 2007-06-21 19:10 UTC (permalink / raw)
  To: Waskiewicz Jr, Peter P; +Cc: davem, netdev, jeff, Kok, Auke-jan H, hadi

Waskiewicz Jr, Peter P wrote:
>> The code looks correct. Are you sure you had the config
>> option enabled during your test?
>>
>
> Yes.  This is the tc command I used to configure the qdisc (with q_rr.c
> attached from my patches iproute2 package):
>
> # tc qdisc add dev eth2 root handle 1: rr bands 8
> RTNETLINK answers: No such file or directory

Again, I bet you don't have CONFIG_NET_SCH_RR enabled:

# lsmod|grep prio
# tc qdisc add dev dummy0 root handle 1: rr bands 8
# lsmod|grep prio
sch_prio                5760  1

^ permalink raw reply	[flat|nested] 24+ messages in thread
* RE: [PATCH 2/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-21 19:10                   ` Patrick McHardy
@ 2007-06-21 20:15                     ` Waskiewicz Jr, Peter P
  0 siblings, 0 replies; 24+ messages in thread
From: Waskiewicz Jr, Peter P @ 2007-06-21 20:15 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: davem, netdev, jeff, Kok, Auke-jan H, hadi

> > Yes.  This is the tc command I used to configure the qdisc (with
> > q_rr.c attached from my patched iproute2 package):
> >
> > # tc qdisc add dev eth2 root handle 1: rr bands 8
> > RTNETLINK answers: No such file or directory
>
> Again, I bet you don't have CONFIG_NET_SCH_RR enabled:

Chalk this up to serious user error.  Having CONFIG_NET_SCH_RR=m doesn't
define CONFIG_NET_SCH_RR at compile time...I'm not sure why I thought it
would.  Thanks for bearing with me, Patrick.  Working as intended now. :-)

-PJ

^ permalink raw reply	[flat|nested] 24+ messages in thread
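The pitfall above is worth spelling out.  MODULE_ALIAS("sch_rr") in
sch_prio.c means a request for the rr qdisc still autoloads sch_prio, but
with a tristate option set to =m only CONFIG_NET_SCH_RR_MODULE is defined
by the preprocessor, so every #ifdef CONFIG_NET_SCH_RR block in sch_prio.c
is compiled out and the qdisc creation fails exactly as shown.  If the
option were to stay a tristate rather than become a bool, the usual guard
of that era covers both symbols.  This is only a sketch around the
registration fragment from the patch earlier in the thread; making
NET_SCH_RR a plain bool is the simpler alternative:

    static int __init prio_module_init(void)
    {
            register_qdisc(&prio_qdisc_ops);
    #if defined(CONFIG_NET_SCH_RR) || defined(CONFIG_NET_SCH_RR_MODULE)
            /* the same guard would wrap rr_dequeue() and rr_qdisc_ops */
            register_qdisc(&rr_qdisc_ops);
    #endif
            return 0;
    }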
* [PATCH 3/3] NET: [CORE] Stack changes to add multiqueue hardware support API 2007-06-18 18:42 [PATCH] NET: Multiple queue hardware support PJ Waskiewicz 2007-06-18 18:42 ` [PATCH 1/3] NET: [DOC] Multiqueue hardware support documentation PJ Waskiewicz 2007-06-18 18:42 ` [PATCH 2/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue PJ Waskiewicz @ 2007-06-18 18:42 ` PJ Waskiewicz 2007-06-18 19:10 ` Patrick McHardy 2007-06-19 6:28 ` David Miller 2 siblings, 2 replies; 24+ messages in thread From: PJ Waskiewicz @ 2007-06-18 18:42 UTC (permalink / raw) To: davem; +Cc: netdev, jeff, auke-jan.h.kok, hadi, kaber Add the multiqueue hardware device support API to the core network stack. Allow drivers to allocate multiple queues and manage them at the netdev level if they choose to do so. Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> --- include/linux/etherdevice.h | 3 +- include/linux/netdevice.h | 62 ++++++++++++++++++++++++++++++++++++++++++- include/linux/skbuff.h | 2 + net/core/dev.c | 27 +++++++++++++++---- net/core/skbuff.c | 3 ++ net/ethernet/eth.c | 9 +++--- 6 files changed, 94 insertions(+), 12 deletions(-) diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h index f48eb89..b3fbb54 100644 --- a/include/linux/etherdevice.h +++ b/include/linux/etherdevice.h @@ -39,7 +39,8 @@ extern void eth_header_cache_update(struct hh_cache *hh, struct net_device *dev extern int eth_header_cache(struct neighbour *neigh, struct hh_cache *hh); -extern struct net_device *alloc_etherdev(int sizeof_priv); +extern struct net_device *alloc_etherdev_mq(int sizeof_priv, int queue_count); +#define alloc_etherdev(sizeof_priv) alloc_etherdev_mq(sizeof_priv, 1) /** * is_zero_ether_addr - Determine if give Ethernet address is all zeros. diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index e7913ee..bf532a0 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -108,6 +108,14 @@ struct wireless_dev; #define MAX_HEADER (LL_MAX_HEADER + 48) #endif +struct net_device_subqueue +{ + /* Give a control state for each queue. This struct may contain + * per-queue locks in the future. + */ + unsigned long state; +}; + /* * Network device statistics. Akin to the 2.0 ether stats but * with byte counters. @@ -325,6 +333,7 @@ struct net_device #define NETIF_F_VLAN_CHALLENGED 1024 /* Device cannot handle VLAN packets */ #define NETIF_F_GSO 2048 /* Enable software GSO. */ #define NETIF_F_LLTX 4096 /* LockLess TX */ +#define NETIF_F_MULTI_QUEUE 16384 /* Has multiple TX/RX queues */ /* Segmentation offload features */ #define NETIF_F_GSO_SHIFT 16 @@ -543,6 +552,10 @@ struct net_device /* rtnetlink link ops */ const struct rtnl_link_ops *rtnl_link_ops; + + /* The TX queue control structures */ + struct net_device_subqueue *egress_subqueue; + int egress_subqueue_count; }; #define to_net_dev(d) container_of(d, struct net_device, dev) @@ -705,6 +718,48 @@ static inline int netif_running(const struct net_device *dev) return test_bit(__LINK_STATE_START, &dev->state); } +/* + * Routines to manage the subqueues on a device. We only need start + * stop, and a check if it's stopped. All other device management is + * done at the overall netdevice level. + * Also test the device if we're multiqueue. 
+ */ +static inline void netif_start_subqueue(struct net_device *dev, u16 queue_index) +{ + clear_bit(__LINK_STATE_XOFF, &dev->egress_subqueue[queue_index].state); +} + +static inline void netif_stop_subqueue(struct net_device *dev, u16 queue_index) +{ +#ifdef CONFIG_NETPOLL_TRAP + if (netpoll_trap()) + return; +#endif + set_bit(__LINK_STATE_XOFF, &dev->egress_subqueue[queue_index].state); +} + +static inline int netif_subqueue_stopped(const struct net_device *dev, + u16 queue_index) +{ + return test_bit(__LINK_STATE_XOFF, + &dev->egress_subqueue[queue_index].state); +} + +static inline void netif_wake_subqueue(struct net_device *dev, u16 queue_index) +{ +#ifdef CONFIG_NETPOLL_TRAP + if (netpoll_trap()) + return; +#endif + if (test_and_clear_bit(__LINK_STATE_XOFF, + &dev->egress_subqueue[queue_index].state)) + __netif_schedule(dev); +} + +static inline int netif_is_multiqueue(const struct net_device *dev) +{ + return (!!(NETIF_F_MULTI_QUEUE & dev->features)); +} /* Use this variant when it is known for sure that it * is executing from interrupt context. @@ -995,8 +1050,11 @@ static inline void netif_tx_disable(struct net_device *dev) extern void ether_setup(struct net_device *dev); /* Support for loadable net-drivers */ -extern struct net_device *alloc_netdev(int sizeof_priv, const char *name, - void (*setup)(struct net_device *)); +extern struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name, + void (*setup)(struct net_device *), + int queue_count); +#define alloc_netdev(sizeof_priv, name, setup) \ + alloc_netdev_mq(sizeof_priv, name, setup, 1) extern int register_netdev(struct net_device *dev); extern void unregister_netdev(struct net_device *dev); /* Functions used for multicast support */ diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index e7367c7..8bcd870 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -215,6 +215,7 @@ typedef unsigned char *sk_buff_data_t; * @pkt_type: Packet class * @fclone: skbuff clone status * @ip_summed: Driver fed us an IP checksum + * @queue_mapping: Queue mapping for multiqueue devices * @priority: Packet queueing priority * @users: User count - see {datagram,tcp}.c * @protocol: Packet protocol from driver @@ -269,6 +270,7 @@ struct sk_buff { __u16 csum_offset; }; }; + __u16 queue_mapping; __u32 priority; __u8 local_df:1, cloned:1, diff --git a/net/core/dev.c b/net/core/dev.c index 2609062..29d44c0 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1545,6 +1545,8 @@ gso: spin_lock(&dev->queue_lock); q = dev->qdisc; if (q->enqueue) { + /* reset queue_mapping to zero */ + skb->queue_mapping = 0; rc = q->enqueue(skb, q); qdisc_run(dev); spin_unlock(&dev->queue_lock); @@ -3343,16 +3345,18 @@ static struct net_device_stats *internal_stats(struct net_device *dev) } /** - * alloc_netdev - allocate network device + * alloc_netdev_mq - allocate network device * @sizeof_priv: size of private data to allocate space for * @name: device name format string * @setup: callback to initialize device + * @queue_count: the number of subqueues to allocate * * Allocates a struct net_device with private data area for driver use - * and performs basic initialization. + * and performs basic initialization. Also allocates subqueue structs + * for each queue on the device. 
*/ -struct net_device *alloc_netdev(int sizeof_priv, const char *name, - void (*setup)(struct net_device *)) +struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name, + void (*setup)(struct net_device *), int queue_count) { void *p; struct net_device *dev; @@ -3377,12 +3381,23 @@ struct net_device *alloc_netdev(int sizeof_priv, const char *name, if (sizeof_priv) dev->priv = netdev_priv(dev); + alloc_size = (sizeof(struct net_device_subqueue) * queue_count); + + p = kzalloc(alloc_size, GFP_KERNEL); + if (!p) { + printk(KERN_ERR "alloc_netdev: Unable to allocate queues.\n"); + return NULL; + } + + dev->egress_subqueue = p; + dev->egress_subqueue_count = queue_count; + dev->get_stats = internal_stats; setup(dev); strcpy(dev->name, name); return dev; } -EXPORT_SYMBOL(alloc_netdev); +EXPORT_SYMBOL(alloc_netdev_mq); /** * free_netdev - free network device @@ -3396,6 +3411,7 @@ void free_netdev(struct net_device *dev) { #ifdef CONFIG_SYSFS /* Compatibility with error handling in drivers */ + kfree((char *)dev->egress_subqueue); if (dev->reg_state == NETREG_UNINITIALIZED) { kfree((char *)dev - dev->padded); return; @@ -3407,6 +3423,7 @@ void free_netdev(struct net_device *dev) /* will free via device release */ put_device(&dev->dev); #else + kfree((char *)dev->egress_subqueue); kfree((char *)dev - dev->padded); #endif } diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 7c6a34e..7bbed45 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -418,6 +418,7 @@ struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t gfp_mask) n->nohdr = 0; C(pkt_type); C(ip_summed); + C(queue_mapping); C(priority); #if defined(CONFIG_IP_VS) || defined(CONFIG_IP_VS_MODULE) C(ipvs_property); @@ -459,6 +460,7 @@ static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old) #endif new->sk = NULL; new->dev = old->dev; + new->queue_mapping = old->queue_mapping; new->priority = old->priority; new->protocol = old->protocol; new->dst = dst_clone(old->dst); @@ -1925,6 +1927,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, int features) tail = nskb; nskb->dev = skb->dev; + nskb->queue_mapping = skb->queue_mapping; nskb->priority = skb->priority; nskb->protocol = skb->protocol; nskb->dst = dst_clone(skb->dst); diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c index 0ac2524..87a509c 100644 --- a/net/ethernet/eth.c +++ b/net/ethernet/eth.c @@ -316,9 +316,10 @@ void ether_setup(struct net_device *dev) EXPORT_SYMBOL(ether_setup); /** - * alloc_etherdev - Allocates and sets up an Ethernet device + * alloc_etherdev_mq - Allocates and sets up an Ethernet device * @sizeof_priv: Size of additional driver-private structure to be allocated * for this Ethernet device + * @queue_count: The number of queues this device has. * * Fill in the fields of the device structure with Ethernet-generic * values. Basically does everything except registering the device. @@ -328,8 +329,8 @@ EXPORT_SYMBOL(ether_setup); * this private data area. */ -struct net_device *alloc_etherdev(int sizeof_priv) +struct net_device *alloc_etherdev_mq(int sizeof_priv, int queue_count) { - return alloc_netdev(sizeof_priv, "eth%d", ether_setup); + return alloc_netdev_mq(sizeof_priv, "eth%d", ether_setup, queue_count); } -EXPORT_SYMBOL(alloc_etherdev); +EXPORT_SYMBOL(alloc_etherdev_mq); ^ permalink raw reply related [flat|nested] 24+ messages in thread
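As a concrete illustration of how a base driver might consume this API,
here is a minimal, hypothetical TX path.  All mqdrv_* names, the ring
bookkeeping, and the ring count are invented for illustration; only
alloc_etherdev_mq(), NETIF_F_MULTI_QUEUE, skb->queue_mapping and the
netif_*_subqueue() helpers come from the patch above.

    #include <linux/netdevice.h>
    #include <linux/etherdevice.h>
    #include <linux/skbuff.h>

    #define MQDRV_NUM_RINGS	8	/* one subqueue per hardware TX ring */

    struct mqdrv_ring {
            unsigned int	next_to_use;
            unsigned int	next_to_clean;
            unsigned int	count;
    };

    struct mqdrv_priv {
            struct mqdrv_ring	tx_ring[MQDRV_NUM_RINGS];
    };

    static int mqdrv_ring_full(struct mqdrv_ring *tx)
    {
            return ((tx->next_to_use + 1) % tx->count) == tx->next_to_clean;
    }

    static int mqdrv_xmit(struct sk_buff *skb, struct net_device *dev)
    {
            struct mqdrv_priv *priv = netdev_priv(dev);
            u16 ring = skb->queue_mapping;	/* selected by sch_rr/sch_prio */
            struct mqdrv_ring *tx = &priv->tx_ring[ring];

            if (mqdrv_ring_full(tx)) {
                    /* flow-control only this ring; the others keep sending */
                    netif_stop_subqueue(dev, ring);
                    return NETDEV_TX_BUSY;
            }
            /* ...post skb to the hardware ring, advance next_to_use... */
            return NETDEV_TX_OK;
    }

    /* called from the TX completion path once descriptors are reclaimed */
    static void mqdrv_clean_ring(struct net_device *dev, u16 ring)
    {
            if (netif_subqueue_stopped(dev, ring))
                    netif_wake_subqueue(dev, ring);
    }

    static struct net_device *mqdrv_create(void)
    {
            struct net_device *dev;

            dev = alloc_etherdev_mq(sizeof(struct mqdrv_priv), MQDRV_NUM_RINGS);
            if (!dev)
                    return NULL;
            dev->features |= NETIF_F_MULTI_QUEUE;
            dev->hard_start_xmit = mqdrv_xmit;
            return dev;	/* register_netdev() etc. omitted */
    }

The point of the per-queue calls is that exhausting one hardware ring no
longer stops the whole device; the qdiscs can keep feeding the other rings.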
* Re: [PATCH 3/3] NET: [CORE] Stack changes to add multiqueue hardware support API 2007-06-18 18:42 ` [PATCH 3/3] NET: [CORE] Stack changes to add multiqueue hardware support API PJ Waskiewicz @ 2007-06-18 19:10 ` Patrick McHardy 2007-06-18 20:26 ` Waskiewicz Jr, Peter P 2007-06-19 6:28 ` David Miller 1 sibling, 1 reply; 24+ messages in thread From: Patrick McHardy @ 2007-06-18 19:10 UTC (permalink / raw) To: PJ Waskiewicz; +Cc: davem, netdev, jeff, auke-jan.h.kok, hadi PJ Waskiewicz wrote: > Add the multiqueue hardware device support API to the core network > stack. Allow drivers to allocate multiple queues and manage them > at the netdev level if they choose to do so. > Should be 2/3 and qdisc changes should be 3/3. Well actually the qdisc sch_generic changes belong in this patch as well and the qdisc changes should be split in one change per qdisc. > /* Functions used for multicast support */ > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h > index e7367c7..8bcd870 100644 > --- a/include/linux/skbuff.h > +++ b/include/linux/skbuff.h > @@ -215,6 +215,7 @@ typedef unsigned char *sk_buff_data_t; > * @pkt_type: Packet class > * @fclone: skbuff clone status > * @ip_summed: Driver fed us an IP checksum > + * @queue_mapping: Queue mapping for multiqueue devices > * @priority: Packet queueing priority > * @users: User count - see {datagram,tcp}.c > * @protocol: Packet protocol from driver > @@ -269,6 +270,7 @@ struct sk_buff { > __u16 csum_offset; > }; > }; > + __u16 queue_mapping; > We have a 4 byte hole on 64 bit after iif where this would fit in. > @@ -3377,12 +3381,23 @@ struct net_device *alloc_netdev(int sizeof_priv, const char *name, > if (sizeof_priv) > dev->priv = netdev_priv(dev); > > + alloc_size = (sizeof(struct net_device_subqueue) * queue_count); > + > + p = kzalloc(alloc_size, GFP_KERNEL); > + if (!p) { > + printk(KERN_ERR "alloc_netdev: Unable to allocate queues.\n"); > + return NULL; > Same leak here that you already fixed a couple of posts ago. > + } > + > + dev->egress_subqueue = p; > + dev->egress_subqueue_count = queue_count; > + > dev->get_stats = internal_stats; > setup(dev); > strcpy(dev->name, name); > return dev; > } > -EXPORT_SYMBOL(alloc_netdev); > +EXPORT_SYMBOL(alloc_netdev_mq); > > /** > * free_netdev - free network device > @@ -3396,6 +3411,7 @@ void free_netdev(struct net_device *dev) > { > #ifdef CONFIG_SYSFS > /* Compatibility with error handling in drivers */ > + kfree((char *)dev->egress_subqueue); > And the pointless cast as well. > if (dev->reg_state == NETREG_UNINITIALIZED) { > kfree((char *)dev - dev->padded); > return; > @@ -3407,6 +3423,7 @@ void free_netdev(struct net_device *dev) > /* will free via device release */ > put_device(&dev->dev); > #else > + kfree((char *)dev->egress_subqueue); > And here. ^ permalink raw reply [flat|nested] 24+ messages in thread
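For reference, one possible shape of the error-path fix Patrick asks for
above, assuming dev->padded has already been set earlier in
alloc_netdev_mq() exactly as in the existing alloc_netdev(); the respun
patch may of course structure this differently:

            p = kzalloc(sizeof(struct net_device_subqueue) * queue_count,
                        GFP_KERNEL);
            if (!p) {
                    printk(KERN_ERR "alloc_netdev: Unable to allocate queues.\n");
                    /* don't leak the net_device allocated just above */
                    kfree((char *)dev - dev->padded);
                    return NULL;
            }
            dev->egress_subqueue = p;
            dev->egress_subqueue_count = queue_count;

and in free_netdev() the cast is unnecessary, since egress_subqueue is
already a pointer:

            kfree(dev->egress_subqueue);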
* RE: [PATCH 3/3] NET: [CORE] Stack changes to add multiqueue hardware support API 2007-06-18 19:10 ` Patrick McHardy @ 2007-06-18 20:26 ` Waskiewicz Jr, Peter P 2007-06-18 20:28 ` Patrick McHardy 0 siblings, 1 reply; 24+ messages in thread From: Waskiewicz Jr, Peter P @ 2007-06-18 20:26 UTC (permalink / raw) To: Patrick McHardy; +Cc: davem, netdev, jeff, Kok, Auke-jan H, hadi > PJ Waskiewicz wrote: > > Add the multiqueue hardware device support API to the core network > > stack. Allow drivers to allocate multiple queues and > manage them at > > the netdev level if they choose to do so. > > > > Should be 2/3 and qdisc changes should be 3/3. Well actually > the qdisc sch_generic changes belong in this patch as well > and the qdisc changes should be split in one change per qdisc. I'll re-arrange the patches. > > /* Functions used for multicast support */ diff --git > > a/include/linux/skbuff.h b/include/linux/skbuff.h index > > e7367c7..8bcd870 100644 > > --- a/include/linux/skbuff.h > > +++ b/include/linux/skbuff.h > > @@ -215,6 +215,7 @@ typedef unsigned char *sk_buff_data_t; > > * @pkt_type: Packet class > > * @fclone: skbuff clone status > > * @ip_summed: Driver fed us an IP checksum > > + * @queue_mapping: Queue mapping for multiqueue devices > > * @priority: Packet queueing priority > > * @users: User count - see {datagram,tcp}.c > > * @protocol: Packet protocol from driver > > @@ -269,6 +270,7 @@ struct sk_buff { > > __u16 csum_offset; > > }; > > }; > > + __u16 queue_mapping; > > > > We have a 4 byte hole on 64 bit after iif where this would fit in. I'll move the variable. Thanks for this! > > @@ -3377,12 +3381,23 @@ struct net_device *alloc_netdev(int > sizeof_priv, const char *name, > > if (sizeof_priv) > > dev->priv = netdev_priv(dev); > > > > + alloc_size = (sizeof(struct net_device_subqueue) * queue_count); > > + > > + p = kzalloc(alloc_size, GFP_KERNEL); > > + if (!p) { > > + printk(KERN_ERR "alloc_netdev: Unable to > allocate queues.\n"); > > + return NULL; > > > > Same leak here that you already fixed a couple of posts ago. Heavy sigh...I synced off an older personal git repository. I'll clean this and the casts up and repost with the other feedback. Thanks Patrick, -PJ Waskiewicz ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 3/3] NET: [CORE] Stack changes to add multiqueue hardware support API 2007-06-18 20:26 ` Waskiewicz Jr, Peter P @ 2007-06-18 20:28 ` Patrick McHardy 0 siblings, 0 replies; 24+ messages in thread From: Patrick McHardy @ 2007-06-18 20:28 UTC (permalink / raw) To: Waskiewicz Jr, Peter P; +Cc: davem, netdev, jeff, Kok, Auke-jan H, hadi Waskiewicz Jr, Peter P wrote: > >>> /* Functions used for multicast support */ diff --git >>> a/include/linux/skbuff.h b/include/linux/skbuff.h index >>> e7367c7..8bcd870 100644 >>> --- a/include/linux/skbuff.h >>> +++ b/include/linux/skbuff.h >>> @@ -215,6 +215,7 @@ typedef unsigned char *sk_buff_data_t; >>> * @pkt_type: Packet class >>> * @fclone: skbuff clone status >>> * @ip_summed: Driver fed us an IP checksum >>> + * @queue_mapping: Queue mapping for multiqueue devices >>> * @priority: Packet queueing priority >>> * @users: User count - see {datagram,tcp}.c >>> * @protocol: Packet protocol from driver >>> @@ -269,6 +270,7 @@ struct sk_buff { >>> __u16 csum_offset; >>> }; >>> }; >>> + __u16 queue_mapping; >>> >>> >> We have a 4 byte hole on 64 bit after iif where this would fit in. >> > > I'll move the variable. Thanks for this! > Or maybe move iif down to queue_mapping so both are near the queueing related stuff. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 3/3] NET: [CORE] Stack changes to add multiqueue hardware support API
  2007-06-18 18:42 ` [PATCH 3/3] NET: [CORE] Stack changes to add multiqueue hardware support API PJ Waskiewicz
  2007-06-18 19:10   ` Patrick McHardy
@ 2007-06-19  6:28   ` David Miller
  2007-06-19 17:31     ` Waskiewicz Jr, Peter P
  2007-06-19 20:01     ` Waskiewicz Jr, Peter P
  1 sibling, 2 replies; 24+ messages in thread
From: David Miller @ 2007-06-19  6:28 UTC (permalink / raw)
  To: peter.p.waskiewicz.jr; +Cc: netdev, jeff, auke-jan.h.kok, hadi, kaber

From: PJ Waskiewicz <peter.p.waskiewicz.jr@intel.com>
Date: Mon, 18 Jun 2007 11:42:29 -0700

> +
> +	/* The TX queue control structures */
> +	struct net_device_subqueue	*egress_subqueue;
> +	int				egress_subqueue_count;

Since every net device will have at least one subqueue, I would suggest
that you do this as follows:

1) In net_device change the quoted part of the patch above to:

	int egress_subqueue_count;
	struct net_device_subqueue egress_subqueue[0];

2) In alloc_netdev():

   Factor (sizeof(struct egress_subqueue) * num_subqueues) into the
   net_device allocation size, place the "priv" area after the
   subqueues.

This will save us pointer dereferences on all of these quite common
accesses.

^ permalink raw reply	[flat|nested] 24+ messages in thread
* RE: [PATCH 3/3] NET: [CORE] Stack changes to add multiqueue hardware support API 2007-06-19 6:28 ` David Miller @ 2007-06-19 17:31 ` Waskiewicz Jr, Peter P 2007-06-19 20:01 ` Waskiewicz Jr, Peter P 1 sibling, 0 replies; 24+ messages in thread From: Waskiewicz Jr, Peter P @ 2007-06-19 17:31 UTC (permalink / raw) To: David Miller; +Cc: netdev, jeff, Kok, Auke-jan H, hadi, kaber > From: PJ Waskiewicz <peter.p.waskiewicz.jr@intel.com> > Date: Mon, 18 Jun 2007 11:42:29 -0700 > > > + > > + /* The TX queue control structures */ > > + struct net_device_subqueue *egress_subqueue; > > + int egress_subqueue_count; > > Since every net device will have at least one subqueue, I > would suggest that you do this as follows: > > 1) In net_device change the quoted part of the patch above to: > > int egress_subqueue_count; > struct net_device_subqueue egress_subqueue[0]; > > 2) In alloc_netdev(): > > Factor (sizeof(struct egress_subqueue) * num_subqueues) into > the net_device allocation size, place the "priv" area after > the subqueues. > > This will save us pointer dereferences on all of these quite > common accesses. Thanks Dave! I'll be putting this in today and run a test pass on it. Thanks for the feedback. -PJ Waskiewicz ^ permalink raw reply [flat|nested] 24+ messages in thread
* RE: [PATCH 3/3] NET: [CORE] Stack changes to add multiqueue hardware support API 2007-06-19 6:28 ` David Miller 2007-06-19 17:31 ` Waskiewicz Jr, Peter P @ 2007-06-19 20:01 ` Waskiewicz Jr, Peter P 2007-06-19 22:37 ` David Miller 1 sibling, 1 reply; 24+ messages in thread From: Waskiewicz Jr, Peter P @ 2007-06-19 20:01 UTC (permalink / raw) To: David Miller; +Cc: netdev, jeff, Kok, Auke-jan H, hadi, kaber > From: PJ Waskiewicz <peter.p.waskiewicz.jr@intel.com> > Date: Mon, 18 Jun 2007 11:42:29 -0700 > > > + > > + /* The TX queue control structures */ > > + struct net_device_subqueue *egress_subqueue; > > + int egress_subqueue_count; > > Since every net device will have at least one subqueue, I > would suggest that you do this as follows: > > 1) In net_device change the quoted part of the patch above to: > > int egress_subqueue_count; > struct net_device_subqueue egress_subqueue[0]; > > 2) In alloc_netdev(): > > Factor (sizeof(struct egress_subqueue) * num_subqueues) into > the net_device allocation size, place the "priv" area after > the subqueues. > > This will save us pointer dereferences on all of these quite > common accesses. I've been thinking about this more today, so please bear with me if I'm missing something. Right now, with how qdisc_restart() is running, we'd definitely call netif_subqueue_stopped(dev, skb->queue_mapping) for all multi-ring and single-ring devices. However, with Jamal's and Krishna's qdisc_restart() rewrite patch, the checks for netif_queue_stopped() and netif_subqueue_stopped() would be pushed into the qdisc's ->dequeue() functions. If that's the case, then the only checks on egress_subqueue[x] would be for multi-ring adapters, or if someone was silly enough to load sch_{rr|prio} onto a single-ring device with multiqueue hardware support compiled in. Given all of that, I'm not sure allocating egress_subqueue[0] at compile time or runtime would make any difference either way. If I'm missing something, please let me know - I'd like to reduce any unnecessary pointer dereferences if possible, but given the proposed qdisc_restart(), I think the code as-is would be ok. Thanks, -PJ Waskiewicz ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 3/3] NET: [CORE] Stack changes to add multiqueue hardware support API
  2007-06-19 20:01     ` Waskiewicz Jr, Peter P
@ 2007-06-19 22:37       ` David Miller
  2007-06-19 23:11         ` Waskiewicz Jr, Peter P
  0 siblings, 1 reply; 24+ messages in thread
From: David Miller @ 2007-06-19 22:37 UTC (permalink / raw)
  To: peter.p.waskiewicz.jr; +Cc: netdev, jeff, auke-jan.h.kok, hadi, kaber

From: "Waskiewicz Jr, Peter P" <peter.p.waskiewicz.jr@intel.com>
Date: Tue, 19 Jun 2007 13:01:18 -0700

> I've been thinking about this more today, so please bear with me if I'm
> missing something.  Right now, with how qdisc_restart() is running, we'd
> definitely call netif_subqueue_stopped(dev, skb->queue_mapping) for all
> multi-ring and single-ring devices.  However, with Jamal's and Krishna's
> qdisc_restart() rewrite patch, the checks for netif_queue_stopped() and
> netif_subqueue_stopped() would be pushed into the qdisc's ->dequeue()
> functions.  If that's the case, then the only checks on
> egress_subqueue[x] would be for multi-ring adapters, or if someone was
> silly enough to load sch_{rr|prio} onto a single-ring device with
> multiqueue hardware support compiled in.  Given all of that, I'm not
> sure allocating egress_subqueue[0] at compile time or runtime would make
> any difference either way.  If I'm missing something, please let me know
> - I'd like to reduce any unnecessary pointer dereferences if possible,
> but given the proposed qdisc_restart(), I think the code as-is would be
> ok.

It's not being allocated at "compile time", it's being allocated linearly
into one block of ram in order to avoid pointer derefs but it's still
"dynamic" in that the size isn't known until the alloc_netdev() call.

We do this trick all over the networking, TCP sockets are 3 or 4
different structures, all allocated into a linear block of memory so
that:

1) only one memory allocation needs to be done for each object
   create, this is not relevant for the net_device case
   except in extreme examples bringing thousands of devices
   up and down which I suppose someone can give a realistic
   example of :-)

2) each part can be accessed as an offset from the other instead
   of a pointer deref which costs a cpu memory read

Please change the allocation strategy as I recommended, thanks.

^ permalink raw reply	[flat|nested] 24+ messages in thread
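A layout sketch of what is being suggested (illustrative only, not the
final patch; NETDEV_ALIGN_CONST is the alignment mask alloc_netdev()
in net/core/dev.c already uses, and the exact rounding in the respun
patch may differ):

    struct net_device {
            /* ...existing fields... */
            int                             egress_subqueue_count;
            /* zero-length tail: the subqueue state lives inside the
             * net_device allocation itself, with the driver priv area
             * placed after it */
            struct net_device_subqueue      egress_subqueue[0];
    };

    /* inside alloc_netdev_mq(), one allocation covers everything: */
            alloc_size = sizeof(struct net_device) +
                         sizeof(struct net_device_subqueue) * queue_count;
            if (sizeof_priv) {
                    /* keep the priv area aligned after the subqueues */
                    alloc_size = (alloc_size + NETDEV_ALIGN_CONST) &
                                 ~NETDEV_ALIGN_CONST;
                    alloc_size += sizeof_priv;
            }
            alloc_size += NETDEV_ALIGN_CONST;

            p = kzalloc(alloc_size, GFP_KERNEL);

Accessing dev->egress_subqueue[i] then becomes a fixed offset from dev
rather than an extra pointer dereference, which is the saving David
describes.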
* RE: [PATCH 3/3] NET: [CORE] Stack changes to add multiqueue hardware support API
  2007-06-19 22:37       ` David Miller
@ 2007-06-19 23:11         ` Waskiewicz Jr, Peter P
  0 siblings, 0 replies; 24+ messages in thread
From: Waskiewicz Jr, Peter P @ 2007-06-19 23:11 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, jeff, Kok, Auke-jan H, hadi, kaber

> It's not being allocated at "compile time", it's being
> allocated linearly into one block of ram in order to avoid
> pointer derefs but it's still "dynamic" in that the size
> isn't known until the alloc_netdev() call.
>
> We do this trick all over the networking, TCP sockets are 3
> or 4 different structures, all allocated into a linear block
> of memory so that:
>
> 1) only one memory allocation needs to be done for each object
>    create, this is not relevant for the net_device case
>    except in extreme examples bringing thousands of devices
>    up and down which I suppose someone can give a realistic
>    example of :-)
>
> 2) each part can be accessed as an offset from the other instead
>    of a pointer deref which costs a cpu memory read
>
> Please change the allocation strategy as I recommended, thanks.

Later this afternoon someone in my group came over and bonked me in the
head, and made this obvious to me.  Thanks for the follow-up; great catch
on the efficiency.  It's in my code now and is currently running some
test passes.

Thanks Dave,

-PJ Waskiewicz

^ permalink raw reply	[flat|nested] 24+ messages in thread
end of thread, other threads:[~2007-06-21 20:15 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-06-18 18:42 [PATCH] NET: Multiple queue hardware support PJ Waskiewicz
2007-06-18 18:42 ` [PATCH 1/3] NET: [DOC] Multiqueue hardware support documentation PJ Waskiewicz
2007-06-18 18:42 ` [PATCH 2/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue PJ Waskiewicz
2007-06-18 19:05   ` Patrick McHardy
2007-06-18 20:36     ` Waskiewicz Jr, Peter P
2007-06-18 20:54       ` Patrick McHardy
2007-06-18 21:04         ` Waskiewicz Jr, Peter P
2007-06-18 21:11           ` Patrick McHardy
2007-06-21 17:55             ` Waskiewicz Jr, Peter P
2007-06-21 18:04               ` Patrick McHardy
2007-06-21 18:12                 ` Waskiewicz Jr, Peter P
2007-06-21 18:17                   ` Patrick McHardy
2007-06-21 18:23                     ` Waskiewicz Jr, Peter P
2007-06-21 19:10                       ` Patrick McHardy
2007-06-21 20:15                         ` Waskiewicz Jr, Peter P
2007-06-18 18:42 ` [PATCH 3/3] NET: [CORE] Stack changes to add multiqueue hardware support API PJ Waskiewicz
2007-06-18 19:10   ` Patrick McHardy
2007-06-18 20:26     ` Waskiewicz Jr, Peter P
2007-06-18 20:28       ` Patrick McHardy
2007-06-19  6:28   ` David Miller
2007-06-19 17:31     ` Waskiewicz Jr, Peter P
2007-06-19 20:01     ` Waskiewicz Jr, Peter P
2007-06-19 22:37       ` David Miller
2007-06-19 23:11         ` Waskiewicz Jr, Peter P