netdev.vger.kernel.org archive mirror
* [PATCH] NET: Multiple queue hardware support
@ 2007-06-28 16:20 PJ Waskiewicz
  2007-06-28 16:21 ` [PATCH 1/3] NET: [DOC] Multiqueue hardware support documentation PJ Waskiewicz
                   ` (4 more replies)
  0 siblings, 5 replies; 60+ messages in thread
From: PJ Waskiewicz @ 2007-06-28 16:20 UTC (permalink / raw)
  To: davem; +Cc: netdev, jeff, auke-jan.h.kok, hadi, kaber

Please consider these patches for 2.6.23 inclusion.

Updates since the last submission:

1. Fixed alloc_netdev_mq() queue_count bug.

2. Fixed the TCA_PRIO_MQ options layout.

3. Protected sch_prio and sch_rr multiqueue code with NET_SCH_MULTIQUEUE.

4. Added RTA_{GET|PUT}_FLAG in place of RTA_DATA for passing multiqueue
   options to and from the qdisc.

5. Allow sch_prio and sch_rr to take 0 bands when in multiqueue mode.  This
   will set q->bands to dev->egress_subqueue_count; added this also to the
   kernel doc.

This patchset is an updated version of the previous multiqueue network device
support patches.  The general approach of introducing a new API for
multiqueue network devices to register with the stack remains the same.  The
changes include adding a round-robin qdisc, heavily based on sch_prio, which
allows queueing to hardware with no OS-enforced queueing policy.  sch_prio
still has the multiqueue code in it, but gains a Kconfig option to compile it
out of the qdisc.  This allows people whose hardware implements a scheduling
policy to use sch_rr (round-robin), while others without scheduling policies
in hardware can continue using sch_prio if they wish to have some notion of
scheduling priority.

The patches being sent are split into Documentation, Qdisc changes, and
core stack changes.

The patches to iproute2 for tc will be sent separately, to support sch_rr.

-- 
PJ Waskiewicz <peter.p.waskiewicz.jr@intel.com>


* [PATCH 1/3] NET: [DOC] Multiqueue hardware support documentation
  2007-06-28 16:20 [PATCH] NET: Multiple queue hardware support PJ Waskiewicz
@ 2007-06-28 16:21 ` PJ Waskiewicz
  2007-06-28 16:21 ` [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API PJ Waskiewicz
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 60+ messages in thread
From: PJ Waskiewicz @ 2007-06-28 16:21 UTC (permalink / raw)
  To: davem; +Cc: netdev, jeff, auke-jan.h.kok, hadi, kaber

Add a brief howto to Documentation/networking for multiqueue.  It
explains how to use the multiqueue API in a driver to support
multiqueue paths from the stack, as well as the qdiscs to use for
feeding a multiqueue device.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
---

 Documentation/networking/multiqueue.txt |  111 +++++++++++++++++++++++++++++++
 1 files changed, 111 insertions(+), 0 deletions(-)

diff --git a/Documentation/networking/multiqueue.txt b/Documentation/networking/multiqueue.txt
new file mode 100644
index 0000000..00b60cc
--- /dev/null
+++ b/Documentation/networking/multiqueue.txt
@@ -0,0 +1,111 @@
+
+		HOWTO for multiqueue network device support
+		===========================================
+
+Section 1: Base driver requirements for implementing multiqueue support
+Section 2: Qdisc support for multiqueue devices
+Section 3: Brief howto using PRIO and RR for multiqueue devices
+
+
+Intro: Kernel support for multiqueue devices
+--------------------------------------------
+
+Kernel support for multiqueue devices is simply an API presented to the
+netdevice layer for base drivers to implement.  This feature is part of the
+core networking stack, and all network devices run on the multiqueue-aware
+stack.  If a base driver only has one queue, these changes are transparent
+to that driver.
+
+
+Section 1: Base driver requirements for implementing multiqueue support
+-----------------------------------------------------------------------
+
+Base drivers are required to use the new alloc_etherdev_mq() or
+alloc_netdev_mq() functions to allocate the subqueues for the device.  The
+underlying kernel API will take care of the allocation and deallocation of
+the subqueue memory, as well as netdev configuration of where the queues
+exist in memory.
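+
+For example, a driver exposing four TX queues might allocate its netdevice
+during probe as follows (a minimal sketch; struct my_adapter and
+MY_TX_QUEUES are hypothetical driver names, not part of the API):
+
+	#define MY_TX_QUEUES 4
+
+	netdev = alloc_etherdev_mq(sizeof(struct my_adapter), MY_TX_QUEUES);
+	if (!netdev)
+		return -ENOMEM;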
+
+The base driver will also need to manage the individual queues, just as it
+manages the global queue (under netdev->queue_lock) today.  Base drivers
+should therefore use the netif_{start|stop|wake}_subqueue() functions to
+manage each queue while the device is operational.  netdev->queue_lock is
+still used when the device comes online or when it's completely shut down
+(unregister_netdev(), etc.).
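+
+For example, a driver's transmit path might stop an individual queue when
+its descriptor ring fills, and its TX-clean routine might wake that queue
+again once descriptors are reclaimed.  This is only a sketch:
+my_ring_full(), my_ring_has_room() and ring_index are hypothetical driver
+details, not part of the API.
+
+	/* in the driver's hard_start_xmit, after posting the skb */
+	if (my_ring_full(tx_ring))
+		netif_stop_subqueue(netdev, ring_index);
+
+	/* in the TX interrupt/clean path, after reclaiming descriptors */
+	if (netif_subqueue_stopped(netdev, ring_index) &&
+	    my_ring_has_room(tx_ring))
+		netif_wake_subqueue(netdev, ring_index);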
+
+Finally, the base driver should indicate that it is a multiqueue device.  The
+feature flag NETIF_F_MULTI_QUEUE should be added to the netdev->features
+bitmap on device initialization.  Below is an example from e1000:
+
+#ifdef CONFIG_E1000_MQ
+	if ( (adapter->hw.mac.type == e1000_82571) ||
+	     (adapter->hw.mac.type == e1000_82572) ||
+	     (adapter->hw.mac.type == e1000_80003es2lan))
+		netdev->features |= NETIF_F_MULTI_QUEUE;
+#endif
+
+
+Section 2: Qdisc support for multiqueue devices
+-----------------------------------------------
+
+Currently two qdiscs support multiqueue devices: a new round-robin qdisc,
+sch_rr, and sch_prio.  The qdisc is responsible for classifying skbs to
+bands and queues, and will store the chosen queue in skb->queue_mapping.
+The base driver uses this field to determine which queue to send the skb to.
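+
+In the driver's transmit routine this is typically a direct array index
+(again a sketch; my_adapter and its tx_ring array are hypothetical):
+
+	static int my_hard_start_xmit(struct sk_buff *skb,
+				      struct net_device *netdev)
+	{
+		struct my_adapter *adapter = netdev_priv(netdev);
+		struct my_tx_ring *tx_ring =
+				&adapter->tx_ring[skb->queue_mapping];
+		/* post skb to tx_ring ... */
+	}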
+
+sch_rr has been added for hardware that does not want a scheduling policy
+imposed by software; it is a straight round-robin qdisc.  It uses the same
+syntax and classification priomap as sch_prio, so it should be intuitive to
+configure for anyone who has used sch_prio.
+
+The PRIO qdisc naturally plugs into a multiqueue device.  If PRIO has been
+built with NET_SCH_MULTIQUEUE, then upon load it will verify that the number
+of bands requested equals the number of queues on the hardware.  If they are
+equal, it sets up a one-to-one mapping between bands and queues; if they are
+not, the qdisc will refuse to load (for example, requesting 3 bands on a
+4-queue device fails with -EINVAL).  RR behaves the same way.  Once the
+association is made, any skb that is classified will have skb->queue_mapping
+set, which allows the driver to properly queue skbs to multiple queues.
+
+
+Section 3: Brief howto using PRIO and RR for multiqueue devices
+---------------------------------------------------------------
+
+The userspace command 'tc', part of the iproute2 package, is used to configure
+qdiscs.  To add the PRIO qdisc to your network device, assuming the device is
+called eth0, run the following command:
+
+# tc qdisc add dev eth0 root handle 1: prio bands 4 multiqueue
+
+This will create 4 bands, with 0 being the highest priority, and associate
+those bands with the queues on your NIC.  Assuming eth0 has 4 TX queues, the
+band mapping would look like:
+
+band 0 => queue 0
+band 1 => queue 1
+band 2 => queue 2
+band 3 => queue 3
+
+Traffic will begin flowing through each queue if your TOS values are spreading
+traffic across the various bands.  For example, ssh traffic will always try to
+go out band 0 based on the TOS -> Linux priority conversion (realtime
+traffic), so it will be sent out queue 0.  ICMP traffic (pings) falls into the
+"normal" traffic classification, which is band 1, so pings will be sent out
+queue 1 on the NIC.
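+
+The resulting distribution can be checked with the usual tc statistics, for
+example:
+
+# tc -s class show dev eth0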
+
+Note the use of the multiqueue keyword.  It is only understood by versions of
+iproute2 that support multiqueue network devices; if it is omitted when
+loading a qdisc onto a multiqueue device, the qdisc will load and operate the
+same as if it were loaded onto a single-queue device (i.e. it sends all
+traffic to queue 0).
+
+Alternatively, band allocation can be left to the qdisc by using the
+multiqueue option and specifying 0 bands.  In this case, the qdisc will set
+the number of bands equal to the number of queues the device reports, and
+then bring itself online.
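+
+For example, assuming an iproute2 build carrying the multiqueue patches
+(posted separately from this series), either of the following would attach
+a multiqueue-aware qdisc to eth0:
+
+# tc qdisc add dev eth0 root handle 1: rr bands 4 multiqueue
+# tc qdisc add dev eth0 root handle 1: prio bands 0 multiqueue
+
+The first loads the round-robin scheduler with an explicit band count; the
+second lets PRIO size its bands from the number of queues the device
+reports.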
+
+The behavior of tc filters remains the same: they will override the
+TOS-based priority classification.
+
+
+Author: Peter P. Waskiewicz Jr. <peter.p.waskiewicz.jr@intel.com>


* [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API
  2007-06-28 16:20 [PATCH] NET: Multiple queue hardware support PJ Waskiewicz
  2007-06-28 16:21 ` [PATCH 1/3] NET: [DOC] Multiqueue hardware support documentation PJ Waskiewicz
@ 2007-06-28 16:21 ` PJ Waskiewicz
  2007-06-28 16:31   ` Patrick McHardy
                     ` (2 more replies)
  2007-06-28 16:21 ` [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue PJ Waskiewicz
                   ` (2 subsequent siblings)
  4 siblings, 3 replies; 60+ messages in thread
From: PJ Waskiewicz @ 2007-06-28 16:21 UTC (permalink / raw)
  To: davem; +Cc: netdev, jeff, auke-jan.h.kok, hadi, kaber

Updated: Fixed allocation of subqueues in alloc_netdev_mq() to
allocate all subqueues, not num - 1.

Added checks for netif_subqueue_stopped() to netpoll,
pktgen, and software device dev_queue_xmit().  This will ensure
external events to these subsystems will be handled correctly if
a subqueue is shut down.

Add the multiqueue hardware device support API to the core network
stack.  Allow drivers to allocate multiple queues and manage them
at the netdev level if they choose to do so.

Added a new field to sk_buff, namely queue_mapping, for drivers to
know which tx_ring to select based on OS classification of the flow.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
---

 include/linux/etherdevice.h |    3 +-
 include/linux/netdevice.h   |   62 ++++++++++++++++++++++++++++++++++++++++++-
 include/linux/skbuff.h      |    4 ++-
 net/core/dev.c              |   27 +++++++++++++------
 net/core/netpoll.c          |    8 +++---
 net/core/pktgen.c           |   10 +++++--
 net/core/skbuff.c           |    3 ++
 net/ethernet/eth.c          |    9 +++---
 8 files changed, 104 insertions(+), 22 deletions(-)

diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h
index f48eb89..b3fbb54 100644
--- a/include/linux/etherdevice.h
+++ b/include/linux/etherdevice.h
@@ -39,7 +39,8 @@ extern void		eth_header_cache_update(struct hh_cache *hh, struct net_device *dev
 extern int		eth_header_cache(struct neighbour *neigh,
 					 struct hh_cache *hh);
 
-extern struct net_device *alloc_etherdev(int sizeof_priv);
+extern struct net_device *alloc_etherdev_mq(int sizeof_priv, int queue_count);
+#define alloc_etherdev(sizeof_priv) alloc_etherdev_mq(sizeof_priv, 1)
 
 /**
  * is_zero_ether_addr - Determine if give Ethernet address is all zeros.
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2c0cc19..7078745 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -108,6 +108,14 @@ struct wireless_dev;
 #define MAX_HEADER (LL_MAX_HEADER + 48)
 #endif
 
+struct net_device_subqueue
+{
+	/* Give a control state for each queue.  This struct may contain
+	 * per-queue locks in the future.
+	 */
+	unsigned long   state;
+};
+
 /*
  *	Network device statistics. Akin to the 2.0 ether stats but
  *	with byte counters.
@@ -331,6 +339,7 @@ struct net_device
 #define NETIF_F_VLAN_CHALLENGED	1024	/* Device cannot handle VLAN packets */
 #define NETIF_F_GSO		2048	/* Enable software GSO. */
 #define NETIF_F_LLTX		4096	/* LockLess TX */
+#define NETIF_F_MULTI_QUEUE	16384	/* Has multiple TX/RX queues */
 
 	/* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT	16
@@ -557,6 +566,10 @@ struct net_device
 
 	/* rtnetlink link ops */
 	const struct rtnl_link_ops *rtnl_link_ops;
+
+	/* The TX queue control structures */
+	int				egress_subqueue_count;
+	struct net_device_subqueue	egress_subqueue[0];
 };
 #define to_net_dev(d) container_of(d, struct net_device, dev)
 
@@ -719,6 +732,48 @@ static inline int netif_running(const struct net_device *dev)
 	return test_bit(__LINK_STATE_START, &dev->state);
 }
 
+/*
+ * Routines to manage the subqueues on a device.  We only need start,
+ * stop, and a check if it's stopped.  All other device management is
+ * done at the overall netdevice level.
+ * Also test the device if we're multiqueue.
+ */
+static inline void netif_start_subqueue(struct net_device *dev, u16 queue_index)
+{
+	clear_bit(__LINK_STATE_XOFF, &dev->egress_subqueue[queue_index].state);
+}
+
+static inline void netif_stop_subqueue(struct net_device *dev, u16 queue_index)
+{
+#ifdef CONFIG_NETPOLL_TRAP
+	if (netpoll_trap())
+		return;
+#endif
+	set_bit(__LINK_STATE_XOFF, &dev->egress_subqueue[queue_index].state);
+}
+
+static inline int netif_subqueue_stopped(const struct net_device *dev,
+					 u16 queue_index)
+{
+	return test_bit(__LINK_STATE_XOFF,
+			&dev->egress_subqueue[queue_index].state);
+}
+
+static inline void netif_wake_subqueue(struct net_device *dev, u16 queue_index)
+{
+#ifdef CONFIG_NETPOLL_TRAP
+	if (netpoll_trap())
+		return;
+#endif
+	if (test_and_clear_bit(__LINK_STATE_XOFF,
+			       &dev->egress_subqueue[queue_index].state))
+		__netif_schedule(dev);
+}
+
+static inline int netif_is_multiqueue(const struct net_device *dev)
+{
+	return (!!(NETIF_F_MULTI_QUEUE & dev->features));
+}
 
 /* Use this variant when it is known for sure that it
  * is executing from interrupt context.
@@ -1009,8 +1064,11 @@ static inline void netif_tx_disable(struct net_device *dev)
 extern void		ether_setup(struct net_device *dev);
 
 /* Support for loadable net-drivers */
-extern struct net_device *alloc_netdev(int sizeof_priv, const char *name,
-				       void (*setup)(struct net_device *));
+extern struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
+				       void (*setup)(struct net_device *),
+				       int queue_count);
+#define alloc_netdev(sizeof_priv, name, setup) \
+	alloc_netdev_mq(sizeof_priv, name, setup, 1)
 extern int		register_netdev(struct net_device *dev);
 extern void		unregister_netdev(struct net_device *dev);
 /* Functions used for secondary unicast and multicast support */
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index b7b2628..6979f4b 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -197,6 +197,7 @@ typedef unsigned char *sk_buff_data_t;
  *	@tstamp: Time we arrived
  *	@dev: Device we arrived on/are leaving by
  *	@iif: ifindex of device we arrived on
+ *	@queue_mapping: Queue mapping for multiqueue devices
  *	@transport_header: Transport layer header
  *	@network_header: Network layer header
  *	@mac_header: Link layer header
@@ -247,7 +248,8 @@ struct sk_buff {
 	ktime_t			tstamp;
 	struct net_device	*dev;
 	int			iif;
-	/* 4 byte hole on 64 bit*/
+	__u16			queue_mapping;
+	/* 2 byte hole on 64 bit*/
 
 	struct  dst_entry	*dst;
 	struct	sec_path	*sp;
diff --git a/net/core/dev.c b/net/core/dev.c
index 778e102..d244771 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1429,7 +1429,9 @@ gso:
 			skb->next = nskb;
 			return rc;
 		}
-		if (unlikely(netif_queue_stopped(dev) && skb->next))
+		if (unlikely((netif_queue_stopped(dev) ||
+			     netif_subqueue_stopped(dev, skb->queue_mapping)) &&
+			     skb->next))
 			return NETDEV_TX_BUSY;
 	} while (skb->next);
 
@@ -1547,6 +1549,8 @@ gso:
 		spin_lock(&dev->queue_lock);
 		q = dev->qdisc;
 		if (q->enqueue) {
+			/* reset queue_mapping to zero */
+			skb->queue_mapping = 0;
 			rc = q->enqueue(skb, q);
 			qdisc_run(dev);
 			spin_unlock(&dev->queue_lock);
@@ -1576,7 +1580,8 @@ gso:
 
 			HARD_TX_LOCK(dev, cpu);
 
-			if (!netif_queue_stopped(dev)) {
+			if (!netif_queue_stopped(dev) &&
+			    !netif_subqueue_stopped(dev, skb->queue_mapping)) {
 				rc = 0;
 				if (!dev_hard_start_xmit(skb, dev)) {
 					HARD_TX_UNLOCK(dev);
@@ -3539,16 +3544,18 @@ static struct net_device_stats *internal_stats(struct net_device *dev)
 }
 
 /**
- *	alloc_netdev - allocate network device
+ *	alloc_netdev_mq - allocate network device
  *	@sizeof_priv:	size of private data to allocate space for
  *	@name:		device name format string
  *	@setup:		callback to initialize device
+ *	@queue_count:	the number of subqueues to allocate
  *
  *	Allocates a struct net_device with private data area for driver use
- *	and performs basic initialization.
+ *	and performs basic initialization.  Also allocates subqueue structs
+ *	for each queue on the device at the end of the netdevice.
  */
-struct net_device *alloc_netdev(int sizeof_priv, const char *name,
-		void (*setup)(struct net_device *))
+struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
+		void (*setup)(struct net_device *), int queue_count)
 {
 	void *p;
 	struct net_device *dev;
@@ -3557,7 +3564,9 @@ struct net_device *alloc_netdev(int sizeof_priv, const char *name,
 	BUG_ON(strlen(name) >= sizeof(dev->name));
 
 	/* ensure 32-byte alignment of both the device and private area */
-	alloc_size = (sizeof(*dev) + NETDEV_ALIGN_CONST) & ~NETDEV_ALIGN_CONST;
+	alloc_size = (sizeof(*dev) + NETDEV_ALIGN_CONST +
+		     (sizeof(struct net_device_subqueue) * queue_count)) &
+		     ~NETDEV_ALIGN_CONST;
 	alloc_size += sizeof_priv + NETDEV_ALIGN_CONST;
 
 	p = kzalloc(alloc_size, GFP_KERNEL);
@@ -3573,12 +3582,14 @@ struct net_device *alloc_netdev(int sizeof_priv, const char *name,
 	if (sizeof_priv)
 		dev->priv = netdev_priv(dev);
 
+	dev->egress_subqueue_count = queue_count;
+
 	dev->get_stats = internal_stats;
 	setup(dev);
 	strcpy(dev->name, name);
 	return dev;
 }
-EXPORT_SYMBOL(alloc_netdev);
+EXPORT_SYMBOL(alloc_netdev_mq);
 
 /**
  *	free_netdev - free network device
diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 758dafe..2ace33d 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -66,8 +66,9 @@ static void queue_process(struct work_struct *work)
 
 		local_irq_save(flags);
 		netif_tx_lock(dev);
-		if (netif_queue_stopped(dev) ||
-		    dev->hard_start_xmit(skb, dev) != NETDEV_TX_OK) {
+		if ((netif_queue_stopped(dev) ||
+		     netif_subqueue_stopped(dev, skb->queue_mapping)) ||
+		     dev->hard_start_xmit(skb, dev) != NETDEV_TX_OK) {
 			skb_queue_head(&npinfo->txq, skb);
 			netif_tx_unlock(dev);
 			local_irq_restore(flags);
@@ -254,7 +255,8 @@ static void netpoll_send_skb(struct netpoll *np, struct sk_buff *skb)
 			/* try until next clock tick */
 			for (tries = jiffies_to_usecs(1)/USEC_PER_POLL;
 					tries > 0; --tries) {
-				if (!netif_queue_stopped(dev))
+				if (!netif_queue_stopped(dev) &&
+				    !netif_subqueue_stopped(dev, skb->queue_mapping))
 					status = dev->hard_start_xmit(skb, dev);
 
 				if (status == NETDEV_TX_OK)
diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index 9cd3a1c..dffe067 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -3139,7 +3139,9 @@ static __inline__ void pktgen_xmit(struct pktgen_dev *pkt_dev)
 		}
 	}
 
-	if (netif_queue_stopped(odev) || need_resched()) {
+	if ((netif_queue_stopped(odev) ||
+	     netif_subqueue_stopped(odev, pkt_dev->skb->queue_mapping)) ||
+	     need_resched()) {
 		idle_start = getCurUs();
 
 		if (!netif_running(odev)) {
@@ -3154,7 +3156,8 @@ static __inline__ void pktgen_xmit(struct pktgen_dev *pkt_dev)
 
 		pkt_dev->idle_acc += getCurUs() - idle_start;
 
-		if (netif_queue_stopped(odev)) {
+		if (netif_queue_stopped(odev) ||
+		    netif_subqueue_stopped(odev, pkt_dev->skb->queue_mapping)) {
 			pkt_dev->next_tx_us = getCurUs();	/* TODO */
 			pkt_dev->next_tx_ns = 0;
 			goto out;	/* Try the next interface */
@@ -3181,7 +3184,8 @@ static __inline__ void pktgen_xmit(struct pktgen_dev *pkt_dev)
 	}
 
 	netif_tx_lock_bh(odev);
-	if (!netif_queue_stopped(odev)) {
+	if (!netif_queue_stopped(odev) &&
+	    !netif_subqueue_stopped(odev, pkt_dev->skb->queue_mapping)) {
 
 		atomic_inc(&(pkt_dev->skb->users));
 	      retry_now:
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 8d8e8fc..e9eea50 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -419,6 +419,7 @@ struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
 	n->nohdr = 0;
 	C(pkt_type);
 	C(ip_summed);
+	C(queue_mapping);
 	C(priority);
 #if defined(CONFIG_IP_VS) || defined(CONFIG_IP_VS_MODULE)
 	C(ipvs_property);
@@ -460,6 +461,7 @@ static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 #endif
 	new->sk		= NULL;
 	new->dev	= old->dev;
+	new->queue_mapping = old->queue_mapping;
 	new->priority	= old->priority;
 	new->protocol	= old->protocol;
 	new->dst	= dst_clone(old->dst);
@@ -1927,6 +1929,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, int features)
 		tail = nskb;
 
 		nskb->dev = skb->dev;
+		nskb->queue_mapping = skb->queue_mapping;
 		nskb->priority = skb->priority;
 		nskb->protocol = skb->protocol;
 		nskb->dst = dst_clone(skb->dst);
diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
index 0ac2524..87a509c 100644
--- a/net/ethernet/eth.c
+++ b/net/ethernet/eth.c
@@ -316,9 +316,10 @@ void ether_setup(struct net_device *dev)
 EXPORT_SYMBOL(ether_setup);
 
 /**
- * alloc_etherdev - Allocates and sets up an Ethernet device
+ * alloc_etherdev_mq - Allocates and sets up an Ethernet device
  * @sizeof_priv: Size of additional driver-private structure to be allocated
  *	for this Ethernet device
+ * @queue_count: The number of queues this device has.
  *
  * Fill in the fields of the device structure with Ethernet-generic
  * values. Basically does everything except registering the device.
@@ -328,8 +329,8 @@ EXPORT_SYMBOL(ether_setup);
  * this private data area.
  */
 
-struct net_device *alloc_etherdev(int sizeof_priv)
+struct net_device *alloc_etherdev_mq(int sizeof_priv, int queue_count)
 {
-	return alloc_netdev(sizeof_priv, "eth%d", ether_setup);
+	return alloc_netdev_mq(sizeof_priv, "eth%d", ether_setup, queue_count);
 }
-EXPORT_SYMBOL(alloc_etherdev);
+EXPORT_SYMBOL(alloc_etherdev_mq);


* [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-28 16:20 [PATCH] NET: Multiple queue hardware support PJ Waskiewicz
  2007-06-28 16:21 ` [PATCH 1/3] NET: [DOC] Multiqueue hardware support documentation PJ Waskiewicz
  2007-06-28 16:21 ` [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API PJ Waskiewicz
@ 2007-06-28 16:21 ` PJ Waskiewicz
  2007-06-28 16:35   ` Patrick McHardy
  2007-06-28 17:13   ` Patrick McHardy
  2007-06-28 17:57 ` [CORE] Stack changes to add multiqueue hardware support API Patrick McHardy
  2007-06-28 17:57 ` [SCHED] Qdisc changes and sch_rr added for multiqueue Patrick McHardy
  4 siblings, 2 replies; 60+ messages in thread
From: PJ Waskiewicz @ 2007-06-28 16:21 UTC (permalink / raw)
  To: davem; +Cc: netdev, jeff, auke-jan.h.kok, hadi, kaber

Updated: Cleaned up Kconfig options for multiqueue.  Cleaned up
sch_rr and sch_prio multiqueue handling.  Added nested compat netlink
attributes for the new options.  Allowed a 0-band option for prio and rr
when in multiqueue mode, so the band count defaults to the number of
queues on the NIC.

Add the new sch_rr qdisc for multiqueue network device support.
Allow sch_prio and sch_rr to be compiled with or without multiqueue
hardware support.

sch_rr is part of sch_prio, and is referenced from MODULE_ALIAS.  This
was done since sch_prio and sch_rr only differ in their dequeue routine.
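
Because of the alias, a request to load "sch_rr" (e.g. the modprobe
triggered when tc attaches an rr qdisc) resolves to the sch_prio module,
whose init routine registers both sets of qdisc ops.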

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
---

 include/linux/pkt_sched.h |    9 +++
 net/sched/Kconfig         |   23 +++++++
 net/sched/sch_prio.c      |  147 +++++++++++++++++++++++++++++++++++++++++----
 3 files changed, 166 insertions(+), 13 deletions(-)

diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index d10f353..268c515 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -101,6 +101,15 @@ struct tc_prio_qopt
 	__u8	priomap[TC_PRIO_MAX+1];	/* Map: logical priority -> PRIO band */
 };
 
+enum
+{
+	TCA_PRIO_UNSPEC,
+	TCA_PRIO_MQ,
+	__TCA_PRIO_MAX
+};
+
+#define TCA_PRIO_MAX    (__TCA_PRIO_MAX - 1)
+
 /* TBF section */
 
 struct tc_tbf_qopt
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 475df84..65ee9e7 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -111,6 +111,29 @@ config NET_SCH_PRIO
 	  To compile this code as a module, choose M here: the
 	  module will be called sch_prio.
 
+config NET_SCH_RR
+	tristate "Multi Band Round Robin Queuing (RR)"
+	select NET_SCH_PRIO
+	---help---
+	  Say Y here if you want to use an n-band round robin packet
+	  scheduler.
+
+	  The module uses sch_prio for its framework and is aliased as
+	  sch_rr, so it will load sch_prio, although it is referred
+	  to using sch_rr.
+
+config NET_SCH_MULTIQUEUE
+	bool "Multiple hardware queue support"
+	---help---
+	  Say Y here if you want to allow supported qdiscs to assign flows to
+	  multiple hardware queues on an ethernet device.  This will
+	  still work on devices with 1 queue.
+
+	  Current qdiscs supporting this feature are NET_SCH_PRIO and
+	  NET_SCH_RR.
+
+	  Most people will say N here.
+
 config NET_SCH_RED
 	tristate "Random Early Detection (RED)"
 	---help---
diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c
index 6d7542c..2ceba92 100644
--- a/net/sched/sch_prio.c
+++ b/net/sched/sch_prio.c
@@ -40,9 +40,13 @@
 struct prio_sched_data
 {
 	int bands;
+	int curband; /* for round-robin */
 	struct tcf_proto *filter_list;
 	u8  prio2band[TC_PRIO_MAX+1];
 	struct Qdisc *queues[TCQ_PRIO_BANDS];
+#ifdef CONFIG_NET_SCH_MULTIQUEUE
+	unsigned char mq;
+#endif
 };
 
 
@@ -70,14 +74,34 @@ prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
 #endif
 			if (TC_H_MAJ(band))
 				band = 0;
+#ifdef CONFIG_NET_SCH_MULTIQUEUE
+			if (q->mq)
+				skb->queue_mapping = 
+						q->prio2band[band&TC_PRIO_MAX];
+			else
+				skb->queue_mapping = 0;
+#endif
 			return q->queues[q->prio2band[band&TC_PRIO_MAX]];
 		}
 		band = res.classid;
 	}
 	band = TC_H_MIN(band) - 1;
-	if (band >= q->bands)
+	if (band >= q->bands) {
+#ifdef CONFIG_NET_SCH_MULTIQUEUE
+		if (q->mq)
+			skb->queue_mapping = q->prio2band[0];
+		else
+			skb->queue_mapping = 0;
+#endif
 		return q->queues[q->prio2band[0]];
+	}
 
+#ifdef CONFIG_NET_SCH_MULTIQUEUE
+	if (q->mq)
+		skb->queue_mapping = band;
+	else
+		skb->queue_mapping = 0;
+#endif
 	return q->queues[band];
 }
 
@@ -144,17 +168,65 @@ prio_dequeue(struct Qdisc* sch)
 	struct Qdisc *qdisc;
 
 	for (prio = 0; prio < q->bands; prio++) {
-		qdisc = q->queues[prio];
-		skb = qdisc->dequeue(qdisc);
-		if (skb) {
-			sch->q.qlen--;
-			return skb;
+#ifdef CONFIG_NET_SCH_MULTIQUEUE
+		/* Check if the target subqueue is available before
+		 * pulling an skb.  This way we avoid excessive requeues
+		 * for slower queues.
+		 */
+		if (!netif_subqueue_stopped(sch->dev, (q->mq ? prio : 0))) {
+#endif
+			qdisc = q->queues[prio];
+			skb = qdisc->dequeue(qdisc);
+			if (skb) {
+				sch->q.qlen--;
+				return skb;
+			}
+#ifdef CONFIG_NET_SCH_MULTIQUEUE
 		}
+#endif
 	}
 	return NULL;
 
 }
 
+static struct sk_buff *rr_dequeue(struct Qdisc* sch)
+{
+	struct sk_buff *skb;
+	struct prio_sched_data *q = qdisc_priv(sch);
+	struct Qdisc *qdisc;
+	int bandcount;
+
+	/* Only take one pass through the queues.  If nothing is available,
+	 * return nothing.
+	 */
+	for (bandcount = 0; bandcount < q->bands; bandcount++) {
+#ifdef CONFIG_NET_SCH_MULTIQUEUE
+		/* Check if the target subqueue is available before
+		 * pulling an skb.  This way we avoid excessive requeues
+		 * for slower queues.  If the queue is stopped, try the
+		 * next queue.
+		 */
+		if (!netif_subqueue_stopped(sch->dev, (q->mq ? q->curband : 0))) {
+#endif
+			qdisc = q->queues[q->curband];
+			skb = qdisc->dequeue(qdisc);
+			if (skb) {
+				sch->q.qlen--;
+				q->curband++;
+				if (q->curband >= q->bands)
+					q->curband = 0;
+				return skb;
+			}
+#ifdef CONFIG_NET_SCH_MULTIQUEUE
+		}
+#endif
+		q->curband++;
+		if (q->curband >= q->bands)
+			q->curband = 0;
+	}
+	return NULL;
+}
+
 static unsigned int prio_drop(struct Qdisc* sch)
 {
 	struct prio_sched_data *q = qdisc_priv(sch);
@@ -198,21 +270,39 @@ prio_destroy(struct Qdisc* sch)
 static int prio_tune(struct Qdisc *sch, struct rtattr *opt)
 {
 	struct prio_sched_data *q = qdisc_priv(sch);
-	struct tc_prio_qopt *qopt = RTA_DATA(opt);
+	struct tc_prio_qopt *qopt;
+	struct rtattr *tb[TCA_PRIO_MAX];
 	int i;
 
-	if (opt->rta_len < RTA_LENGTH(sizeof(*qopt)))
+	if (rtattr_parse_nested_compat(tb, TCA_PRIO_MAX, opt, qopt,
+				       sizeof(*qopt)))
 		return -EINVAL;
-	if (qopt->bands > TCQ_PRIO_BANDS || qopt->bands < 2)
+	q->bands = qopt->bands;
+#ifdef CONFIG_NET_SCH_MULTIQUEUE
+	/* If we're multiqueue, make sure the number of incoming bands
+	 * matches the number of queues on the device we're associating with.
+	 * If the number of bands requested is zero, then set q->bands to
+	 * dev->egress_subqueue_count.
+	 */
+	q->mq = RTA_GET_FLAG(tb[TCA_PRIO_MQ - 1]);
+
+	if (q->mq) {
+		if (q->bands == 0)
+			q->bands = sch->dev->egress_subqueue_count;
+		else if (q->bands != sch->dev->egress_subqueue_count)
+			return -EINVAL;
+	}
+#endif
+
+	if (q->bands > TCQ_PRIO_BANDS || q->bands < 2)
 		return -EINVAL;
 
 	for (i=0; i<=TC_PRIO_MAX; i++) {
-		if (qopt->priomap[i] >= qopt->bands)
+		if (qopt->priomap[i] >= q->bands)
 			return -EINVAL;
 	}
 
 	sch_tree_lock(sch);
-	q->bands = qopt->bands;
 	memcpy(q->prio2band, qopt->priomap, TC_PRIO_MAX+1);
 
 	for (i=q->bands; i<TCQ_PRIO_BANDS; i++) {
@@ -268,11 +358,19 @@ static int prio_dump(struct Qdisc *sch, struct sk_buff *skb)
 {
 	struct prio_sched_data *q = qdisc_priv(sch);
 	unsigned char *b = skb_tail_pointer(skb);
+	struct rtattr *nest;
 	struct tc_prio_qopt opt;
 
 	opt.bands = q->bands;
 	memcpy(&opt.priomap, q->prio2band, TC_PRIO_MAX+1);
-	RTA_PUT(skb, TCA_OPTIONS, sizeof(opt), &opt);
+
+	nest = RTA_NEST_COMPAT(skb, TCA_OPTIONS, sizeof(opt), &opt);
+#ifdef CONFIG_NET_SCH_MULTIQUEUE
+	if (q->mq)
+		RTA_PUT_FLAG(skb, TCA_PRIO_MQ);
+#endif
+	RTA_NEST_COMPAT_END(skb, nest);
+
 	return skb->len;
 
 rtattr_failure:
@@ -443,17 +541,40 @@ static struct Qdisc_ops prio_qdisc_ops = {
 	.owner		=	THIS_MODULE,
 };
 
+static struct Qdisc_ops rr_qdisc_ops = {
+	.next		=	NULL,
+	.cl_ops		=	&prio_class_ops,
+	.id		=	"rr",
+	.priv_size	=	sizeof(struct prio_sched_data),
+	.enqueue	=	prio_enqueue,
+	.dequeue	=	rr_dequeue,
+	.requeue	=	prio_requeue,
+	.drop		=	prio_drop,
+	.init		=	prio_init,
+	.reset		=	prio_reset,
+	.destroy	=	prio_destroy,
+	.change		=	prio_tune,
+	.dump		=	prio_dump,
+	.owner		=	THIS_MODULE,
+};
+
 static int __init prio_module_init(void)
 {
-	return register_qdisc(&prio_qdisc_ops);
+	int err;
+	err = register_qdisc(&prio_qdisc_ops);
+	if (!err)
+		err = register_qdisc(&rr_qdisc_ops);
+	return err;
 }
 
 static void __exit prio_module_exit(void)
 {
 	unregister_qdisc(&prio_qdisc_ops);
+	unregister_qdisc(&rr_qdisc_ops);
 }
 
 module_init(prio_module_init)
 module_exit(prio_module_exit)
 
 MODULE_LICENSE("GPL");
+MODULE_ALIAS("sch_rr");


* Re: [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API
  2007-06-28 16:21 ` [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API PJ Waskiewicz
@ 2007-06-28 16:31   ` Patrick McHardy
  2007-06-28 17:00   ` Patrick McHardy
  2007-06-29  3:39   ` David Miller
  2 siblings, 0 replies; 60+ messages in thread
From: Patrick McHardy @ 2007-06-28 16:31 UTC (permalink / raw)
  To: PJ Waskiewicz; +Cc: davem, netdev, jeff, auke-jan.h.kok, hadi

PJ Waskiewicz wrote:
> Updated: Fixed allocation of subqueues in alloc_netdev_mq() to
> allocate all subqueues, not num - 1.
>
> Added checks for netif_subqueue_stopped() to netpoll,
> pktgen, and software device dev_queue_xmit().  This will ensure
> external events to these subsystems will be handled correctly if
> a subqueue is shut down.
>
> Add the multiqueue hardware device support API to the core network
> stack.  Allow drivers to allocate multiple queues and manage them
> at the netdev level if they choose to do so.
>
> Added a new field to sk_buff, namely queue_mapping, for drivers to
> know which tx_ring to select based on OS classification of the flow.
>
> Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
>   

Acked-by: Patrick McHardy <kaber@trash.net>

skb->iif and queue_mapping should probably go somewhere near
the other shaping stuff and unsigned int seems to be a better
choice for egress_subqueue_count, but I can take care of that
when this patch is in Dave's tree.






* Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-28 16:21 ` [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue PJ Waskiewicz
@ 2007-06-28 16:35   ` Patrick McHardy
  2007-06-28 16:43     ` Waskiewicz Jr, Peter P
  2007-06-28 16:50     ` Patrick McHardy
  2007-06-28 17:13   ` Patrick McHardy
  1 sibling, 2 replies; 60+ messages in thread
From: Patrick McHardy @ 2007-06-28 16:35 UTC (permalink / raw)
  To: PJ Waskiewicz; +Cc: davem, netdev, jeff, auke-jan.h.kok, hadi

PJ Waskiewicz wrote:

> +
>  static int __init prio_module_init(void)
>  {
> -	return register_qdisc(&prio_qdisc_ops);
> +	int err;
> +	err = register_qdisc(&prio_qdisc_ops);
> +	if (!err)
> +		err = register_qdisc(&rr_qdisc_ops);
> +	return err;
>  }
>  

That's still broken. I'll fix this and some minor cleanliness issues myself
so you don't have to go through another resend.



* RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-28 16:35   ` Patrick McHardy
@ 2007-06-28 16:43     ` Waskiewicz Jr, Peter P
  2007-06-28 16:46       ` Patrick McHardy
  2007-06-28 16:50     ` Patrick McHardy
  1 sibling, 1 reply; 60+ messages in thread
From: Waskiewicz Jr, Peter P @ 2007-06-28 16:43 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: davem, netdev, jeff, Kok, Auke-jan H, hadi

> PJ Waskiewicz wrote:
> 
> > +
> >  static int __init prio_module_init(void)  {
> > -	return register_qdisc(&prio_qdisc_ops);
> > +	int err;
> > +	err = register_qdisc(&prio_qdisc_ops);
> > +	if (!err)
> > +		err = register_qdisc(&rr_qdisc_ops);
> > +	return err;
> >  }
> >  
> 
> That's still broken. I'll fix this and some minor cleanliness
> issues myself so you don't have to go through another resend.

Auke and I just looked at register_qdisc() and this code.  Maybe we
haven't had enough coffee yet, but register_qdisc() returns 0 on
success.  So if register_qdisc(&prio_qdisc_ops) succeeds, then
rr_qdisc_ops gets registered.  I'm curious what is broken with this.

Thanks Patrick
-PJ Waskiewicz


* Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-28 16:43     ` Waskiewicz Jr, Peter P
@ 2007-06-28 16:46       ` Patrick McHardy
  2007-06-28 16:50         ` Waskiewicz Jr, Peter P
  0 siblings, 1 reply; 60+ messages in thread
From: Patrick McHardy @ 2007-06-28 16:46 UTC (permalink / raw)
  To: Waskiewicz Jr, Peter P; +Cc: davem, netdev, jeff, Kok, Auke-jan H, hadi

Waskiewicz Jr, Peter P wrote:
>>PJ Waskiewicz wrote:
>>
>>
>>>+
>>> static int __init prio_module_init(void)  {
>>>-	return register_qdisc(&prio_qdisc_ops);
>>>+	int err;
>>>+	err = register_qdisc(&prio_qdisc_ops);
>>>+	if (!err)
>>>+		err = register_qdisc(&rr_qdisc_ops);
>>>+	return err;
>>> }
>>> 
>>
>>That's still broken. I'll fix this and some minor cleanliness
>>issues myself so you don't have to go through another resend.
> 
> 
> Auke and I just looked at register_qdisc() and this code.  Maybe we
> haven't had enough coffee yet, but register_qdisc() returns 0 on
> success.  So if register_qdisc(&prio_qdisc_ops) succeeds, then
> rr_qdisc_ops gets registered.  I'm curious what is broken with this.

It's not error handling. You do:

err = register qdisc 1
if (err)
	return err;
err = register qdisc 2
if (err)
	unregister qdisc 1
return err

Anyway, I already fixed that and cleaned up prio_classify the
way I suggested. Will send shortly.



* RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-28 16:46       ` Patrick McHardy
@ 2007-06-28 16:50         ` Waskiewicz Jr, Peter P
  2007-06-28 16:53           ` Patrick McHardy
  0 siblings, 1 reply; 60+ messages in thread
From: Waskiewicz Jr, Peter P @ 2007-06-28 16:50 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: davem, netdev, jeff, Kok, Auke-jan H, hadi


> Its not error handling. You do:
> 
> err = register qdisc 1
> if (err)
> 	return err;
> err = register qdisc 2
> if (err)
> 	unregister qdisc 2
> return err
> 
> anyways, I already fixed that and cleaned up prio_classify 
> the way I suggested. Will send shortly.

Thanks for fixing; however, the current sch_prio doesn't unregister the
qdisc if register_qdisc() on prio fails, or does that happen implicitly
because the module will probably unload?

Thanks again Patrick,
-PJ


* Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-28 16:35   ` Patrick McHardy
  2007-06-28 16:43     ` Waskiewicz Jr, Peter P
@ 2007-06-28 16:50     ` Patrick McHardy
  1 sibling, 0 replies; 60+ messages in thread
From: Patrick McHardy @ 2007-06-28 16:50 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: PJ Waskiewicz, davem, netdev, jeff, auke-jan.h.kok, hadi

[-- Attachment #1: Type: text/plain, Size: 806 bytes --]

Patrick McHardy wrote:
> PJ Waskiewicz wrote:
> 
>> +
>>  static int __init prio_module_init(void)
>>  {
>> -    return register_qdisc(&prio_qdisc_ops);
>> +    int err;
>> +    err = register_qdisc(&prio_qdisc_ops);
>> +    if (!err)
>> +        err = register_qdisc(&rr_qdisc_ops);
>> +    return err;
>>  }
>>  
> 
> 
> That's still broken. I'll fix this and some minor cleanliness issues myself
> so you don't have to go through another resend.


Here it is, with the error handling fixed and prio_classify cleaned up.

There are still too many ifdefs in there for my taste, and I'm
wondering whether the NET_SCH_MULTIQUEUE option should really
be NETDEVICES_MULTIQUEUE. That would allow moving the #ifdefs
to netif_subqueue_stopped and would keep the qdiscs clean.
I can send a patch for that on top of your patches.


[-- Attachment #2: x --]
[-- Type: text/plain, Size: 8182 bytes --]

[SCHED] Qdisc changes and sch_rr added for multiqueue

Add the new sch_rr qdisc for multiqueue network device support.
Allow sch_prio and sch_rr to be compiled with or without multiqueue
hardware support.

sch_rr is part of sch_prio, and is referenced from MODULE_ALIAS.  This
was done since sch_prio and sch_rr only differ in their dequeue routine.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>

---
commit ab2c102d05981e983b494e9211e1c987ae0b80ee
tree edb9457aeed6eb4ac8adef517b783b3361ba8830
parent 1937d8b868253885584dd49fe7ba804eda6b525f
author Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> Thu, 28 Jun 2007 18:47:49 +0200
committer Patrick McHardy <kaber@trash.net> Thu, 28 Jun 2007 18:47:49 +0200

 include/linux/pkt_sched.h |    9 +++
 net/sched/Kconfig         |   23 +++++++
 net/sched/sch_prio.c      |  142 ++++++++++++++++++++++++++++++++++++++++-----
 3 files changed, 159 insertions(+), 15 deletions(-)

diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index d10f353..268c515 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -101,6 +101,15 @@ struct tc_prio_qopt
 	__u8	priomap[TC_PRIO_MAX+1];	/* Map: logical priority -> PRIO band */
 };
 
+enum
+{
+	TCA_PRIO_UNSPEC,
+	TCA_PRIO_MQ,
+	__TCA_PRIO_MAX
+};
+
+#define TCA_PRIO_MAX    (__TCA_PRIO_MAX - 1)
+
 /* TBF section */
 
 struct tc_tbf_qopt
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 475df84..65ee9e7 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -111,6 +111,29 @@ config NET_SCH_PRIO
 	  To compile this code as a module, choose M here: the
 	  module will be called sch_prio.
 
+config NET_SCH_RR
+	tristate "Multi Band Round Robin Queuing (RR)"
+	select NET_SCH_PRIO
+	---help---
+	  Say Y here if you want to use an n-band round robin packet
+	  scheduler.
+
+	  The module uses sch_prio for its framework and is aliased as
+	  sch_rr, so it will load sch_prio, although it is referred
+	  to using sch_rr.
+
+config NET_SCH_MULTIQUEUE
+	bool "Multiple hardware queue support"
+	---help---
+	  Say Y here if you want to allow supported qdiscs to assign flows to
+	  multiple hardware queues on an ethernet device.  This will
+	  still work on devices with 1 queue.
+
+	  Current qdiscs supporting this feature are NET_SCH_PRIO and
+	  NET_SCH_RR.
+
+	  Most people will say N here.
+
 config NET_SCH_RED
 	tristate "Random Early Detection (RED)"
 	---help---
diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c
index 6d7542c..b35b9c5 100644
--- a/net/sched/sch_prio.c
+++ b/net/sched/sch_prio.c
@@ -40,9 +40,13 @@
 struct prio_sched_data
 {
 	int bands;
+	int curband; /* for round-robin */
 	struct tcf_proto *filter_list;
 	u8  prio2band[TC_PRIO_MAX+1];
 	struct Qdisc *queues[TCQ_PRIO_BANDS];
+#ifdef CONFIG_NET_SCH_MULTIQUEUE
+	unsigned char mq;
+#endif
 };
 
 
@@ -70,14 +74,21 @@ prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
 #endif
 			if (TC_H_MAJ(band))
 				band = 0;
-			return q->queues[q->prio2band[band&TC_PRIO_MAX]];
+			band = q->prio2band[band&TC_PRIO_MAX];
+			goto out;
 		}
 		band = res.classid;
 	}
 	band = TC_H_MIN(band) - 1;
 	if (band >= q->bands)
-		return q->queues[q->prio2band[0]];
-
+		band = q->prio2band[0];
+out:
+#ifdef CONFIG_NET_SCH_MULTIQUEUE
+	if (q->mq)
+		skb->queue_mapping = band;
+	else
+		skb->queue_mapping = 0;
+#endif
 	return q->queues[band];
 }
 
@@ -144,17 +155,65 @@ prio_dequeue(struct Qdisc* sch)
 	struct Qdisc *qdisc;
 
 	for (prio = 0; prio < q->bands; prio++) {
-		qdisc = q->queues[prio];
-		skb = qdisc->dequeue(qdisc);
-		if (skb) {
-			sch->q.qlen--;
-			return skb;
+#ifdef CONFIG_NET_SCH_MULTIQUEUE
+		/* Check if the target subqueue is available before
+		 * pulling an skb.  This way we avoid excessive requeues
+		 * for slower queues.
+		 */
+		if (!netif_subqueue_stopped(sch->dev, (q->mq ? prio : 0))) {
+#endif
+			qdisc = q->queues[prio];
+			skb = qdisc->dequeue(qdisc);
+			if (skb) {
+				sch->q.qlen--;
+				return skb;
+			}
+#ifdef CONFIG_NET_SCH_MULTIQUEUE
 		}
+#endif
 	}
 	return NULL;
 
 }
 
+static struct sk_buff *rr_dequeue(struct Qdisc* sch)
+{
+	struct sk_buff *skb;
+	struct prio_sched_data *q = qdisc_priv(sch);
+	struct Qdisc *qdisc;
+	int bandcount;
+
+	/* Only take one pass through the queues.  If nothing is available,
+	 * return nothing.
+	 */
+	for (bandcount = 0; bandcount < q->bands; bandcount++) {
+#ifdef CONFIG_NET_SCH_MULTIQUEUE
+		/* Check if the target subqueue is available before
+		 * pulling an skb.  This way we avoid excessive requeues
+		 * for slower queues.  If the queue is stopped, try the
+		 * next queue.
+		 */
+		if (!netif_subqueue_stopped(sch->dev, (q->mq ? q->curband : 0))) {
+#endif
+			qdisc = q->queues[q->curband];
+			skb = qdisc->dequeue(qdisc);
+			if (skb) {
+				sch->q.qlen--;
+				q->curband++;
+				if (q->curband >= q->bands)
+					q->curband = 0;
+				return skb;
+			}
+#ifdef CONFIG_NET_SCH_MULTIQUEUE
+		}
+#endif
+		q->curband++;
+		if (q->curband >= q->bands)
+			q->curband = 0;
+	}
+	return NULL;
+}
+
 static unsigned int prio_drop(struct Qdisc* sch)
 {
 	struct prio_sched_data *q = qdisc_priv(sch);
@@ -198,21 +257,39 @@ prio_destroy(struct Qdisc* sch)
 static int prio_tune(struct Qdisc *sch, struct rtattr *opt)
 {
 	struct prio_sched_data *q = qdisc_priv(sch);
-	struct tc_prio_qopt *qopt = RTA_DATA(opt);
+	struct tc_prio_qopt *qopt;
+	struct rtattr *tb[TCA_PRIO_MAX];
 	int i;
 
-	if (opt->rta_len < RTA_LENGTH(sizeof(*qopt)))
+	if (rtattr_parse_nested_compat(tb, TCA_PRIO_MAX, opt, qopt,
+				       sizeof(*qopt)))
 		return -EINVAL;
-	if (qopt->bands > TCQ_PRIO_BANDS || qopt->bands < 2)
+	q->bands = qopt->bands;
+#ifdef CONFIG_NET_SCH_MULTIQUEUE
+	/* If we're multiqueue, make sure the number of incoming bands
+	 * matches the number of queues on the device we're associating with.
+	 * If the number of bands requested is zero, then set q->bands to
+	 * dev->egress_subqueue_count.
+	 */
+	q->mq = RTA_GET_FLAG(tb[TCA_PRIO_MQ - 1]);
+
+	if (q->mq) {
+		if (q->bands == 0)
+			q->bands = sch->dev->egress_subqueue_count;
+		else if (q->bands != sch->dev->egress_subqueue_count)
+			return -EINVAL;
+	}
+#endif
+
+	if (q->bands > TCQ_PRIO_BANDS || q->bands < 2)
 		return -EINVAL;
 
 	for (i=0; i<=TC_PRIO_MAX; i++) {
-		if (qopt->priomap[i] >= qopt->bands)
+		if (qopt->priomap[i] >= q->bands)
 			return -EINVAL;
 	}
 
 	sch_tree_lock(sch);
-	q->bands = qopt->bands;
 	memcpy(q->prio2band, qopt->priomap, TC_PRIO_MAX+1);
 
 	for (i=q->bands; i<TCQ_PRIO_BANDS; i++) {
@@ -268,11 +345,19 @@ static int prio_dump(struct Qdisc *sch, struct sk_buff *skb)
 {
 	struct prio_sched_data *q = qdisc_priv(sch);
 	unsigned char *b = skb_tail_pointer(skb);
+	struct rtattr *nest;
 	struct tc_prio_qopt opt;
 
 	opt.bands = q->bands;
 	memcpy(&opt.priomap, q->prio2band, TC_PRIO_MAX+1);
-	RTA_PUT(skb, TCA_OPTIONS, sizeof(opt), &opt);
+
+	nest = RTA_NEST_COMPAT(skb, TCA_OPTIONS, sizeof(opt), &opt);
+#ifdef CONFIG_NET_SCH_MULTIQUEUE
+	if (q->mq)
+		RTA_PUT_FLAG(skb, TCA_PRIO_MQ);
+#endif
+	RTA_NEST_COMPAT_END(skb, nest);
+
 	return skb->len;
 
 rtattr_failure:
@@ -443,17 +528,44 @@ static struct Qdisc_ops prio_qdisc_ops = {
 	.owner		=	THIS_MODULE,
 };
 
+static struct Qdisc_ops rr_qdisc_ops = {
+	.next		=	NULL,
+	.cl_ops		=	&prio_class_ops,
+	.id		=	"rr",
+	.priv_size	=	sizeof(struct prio_sched_data),
+	.enqueue	=	prio_enqueue,
+	.dequeue	=	rr_dequeue,
+	.requeue	=	prio_requeue,
+	.drop		=	prio_drop,
+	.init		=	prio_init,
+	.reset		=	prio_reset,
+	.destroy	=	prio_destroy,
+	.change		=	prio_tune,
+	.dump		=	prio_dump,
+	.owner		=	THIS_MODULE,
+};
+
 static int __init prio_module_init(void)
 {
-	return register_qdisc(&prio_qdisc_ops);
+	int err;
+
+	err = register_qdisc(&prio_qdisc_ops);
+	if (err < 0)
+		return err;
+	err = register_qdisc(&rr_qdisc_ops);
+	if (err < 0)
+		unregister_qdisc(&prio_qdisc_ops);
+	return err;
 }
 
 static void __exit prio_module_exit(void)
 {
 	unregister_qdisc(&prio_qdisc_ops);
+	unregister_qdisc(&rr_qdisc_ops);
 }
 
 module_init(prio_module_init)
 module_exit(prio_module_exit)
 
 MODULE_LICENSE("GPL");
+MODULE_ALIAS("sch_rr");


* Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-28 16:50         ` Waskiewicz Jr, Peter P
@ 2007-06-28 16:53           ` Patrick McHardy
  0 siblings, 0 replies; 60+ messages in thread
From: Patrick McHardy @ 2007-06-28 16:53 UTC (permalink / raw)
  To: Waskiewicz Jr, Peter P; +Cc: davem, netdev, jeff, Kok, Auke-jan H, hadi

Waskiewicz Jr, Peter P wrote:
> Thanks for fixing; however, the current sch_prio doesn't unregister the
> qdisc if register_qdisc() on prio fails, or does that happen implicitly
> because the module will probably unload?


If it failed, there's nothing to unregister. But when you register two
qdiscs and the second one fails, you have to unregister the first one.

Your way works too, but it might fail to register the second one without
providing any feedback to the user.


* Re: [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API
  2007-06-28 16:21 ` [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API PJ Waskiewicz
  2007-06-28 16:31   ` Patrick McHardy
@ 2007-06-28 17:00   ` Patrick McHardy
  2007-06-28 19:00     ` Waskiewicz Jr, Peter P
  2007-06-29  3:39   ` David Miller
  2 siblings, 1 reply; 60+ messages in thread
From: Patrick McHardy @ 2007-06-28 17:00 UTC (permalink / raw)
  To: PJ Waskiewicz; +Cc: davem, netdev, jeff, auke-jan.h.kok, hadi

PJ Waskiewicz wrote:
>  include/linux/etherdevice.h |    3 +-
>  include/linux/netdevice.h   |   62 ++++++++++++++++++++++++++++++++++++++++++-
>  include/linux/skbuff.h      |    4 ++-
>  net/core/dev.c              |   27 +++++++++++++------
>  net/core/netpoll.c          |    8 +++---
>  net/core/pktgen.c           |   10 +++++--
>  net/core/skbuff.c           |    3 ++
>  net/ethernet/eth.c          |    9 +++---
>  8 files changed, 104 insertions(+), 22 deletions(-)

> include/linux/pkt_sched.h |    9 +++
> net/sched/Kconfig         |   23 +++++++
> net/sched/sch_prio.c      |  147
+++++++++++++++++++++++++++++++++++++++++----
>  3 files changed, 166 insertions(+), 13 deletions(-)


Quick question: where are the sch_generic changes? :)

If you hold for ten minutes I'll post a set of slightly changed
patches with the NETDEVICES_MULTIQUEUE option and a fix for this.


* Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-28 16:21 ` [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue PJ Waskiewicz
  2007-06-28 16:35   ` Patrick McHardy
@ 2007-06-28 17:13   ` Patrick McHardy
  2007-06-28 19:04     ` Waskiewicz Jr, Peter P
  1 sibling, 1 reply; 60+ messages in thread
From: Patrick McHardy @ 2007-06-28 17:13 UTC (permalink / raw)
  To: PJ Waskiewicz; +Cc: davem, netdev, jeff, auke-jan.h.kok, hadi

PJ Waskiewicz wrote:
> +#ifdef CONFIG_NET_SCH_MULTIQUEUE
> +			if (q->mq)
> +				skb->queue_mapping = 
> +						q->prio2band[band&TC_PRIO_MAX];
> +			else
> +				skb->queue_mapping = 0;
> +#endif


Setting it to zero here is wrong, consider:

root qdisc: prio multiqueue
child qdisc: prio non-multiqueue

The top-level qdisc will set it, the child qdisc will unset it again.
When multiqueue is inactive it should not touch it.

I'll fix that as well.


* [CORE] Stack changes to add multiqueue hardware support API
  2007-06-28 16:20 [PATCH] NET: Multiple queue hardware support PJ Waskiewicz
                   ` (2 preceding siblings ...)
  2007-06-28 16:21 ` [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue PJ Waskiewicz
@ 2007-06-28 17:57 ` Patrick McHardy
  2007-06-28 17:57 ` [SCHED] Qdisc changes and sch_rr added for multiqueue Patrick McHardy
  4 siblings, 0 replies; 60+ messages in thread
From: Patrick McHardy @ 2007-06-28 17:57 UTC (permalink / raw)
  To: PJ Waskiewicz; +Cc: davem, netdev, jeff, auke-jan.h.kok, hadi

[-- Attachment #1: Type: text/plain, Size: 486 bytes --]

This is a slightly changed version of Peter's patch.
Changes are:

- introduce NETDEVICES_MULTIQUEUE config option instead of
  NET_SCH_MULTIQUEUE. Schedulers build in multiqueue support
  automatically when that option is selected.

- make netif_*_subqueue functions NOPs for the
  !NETDEVICES_MULTIQUEUE case

- move skb->iif and skb->queue_mapping next to the other
  packet scheduling related members

- fix comment: the two byte hole in struct sk_buff is
  on both 32 and 64 bit


[-- Attachment #2: 01.diff --]
[-- Type: text/x-diff, Size: 15283 bytes --]

[CORE] Stack changes to add multiqueue hardware	support API

Add the multiqueue hardware device support API to the core network
stack.  Allow drivers to allocate multiple queues and manage them
at the netdev level if they choose to do so.

Added a new field to sk_buff, namely queue_mapping, for drivers to
know which tx_ring to select based on OS classification of the flow.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>

---
commit 48e334930c5fbb64a821733a7056e2800057c128
tree d7fabeffef8ea856d345fe44be24a916b0a698af
parent 0552c565358330c59913f2b512355f196e31bd74
author Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> Thu, 28 Jun 2007 18:41:00 +0200
committer Patrick McHardy <kaber@trash.net> Thu, 28 Jun 2007 19:43:53 +0200

 drivers/net/Kconfig         |    8 +++++
 include/linux/etherdevice.h |    3 +-
 include/linux/netdevice.h   |   76 ++++++++++++++++++++++++++++++++++++++++++-
 include/linux/skbuff.h      |   25 ++++++++++++--
 net/core/dev.c              |   27 +++++++++++----
 net/core/netpoll.c          |    8 +++--
 net/core/pktgen.c           |   10 ++++--
 net/core/skbuff.c           |    3 ++
 net/ethernet/eth.c          |    9 +++--
 9 files changed, 145 insertions(+), 24 deletions(-)

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index e5549dc..8bce4fb 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -28,6 +28,14 @@ config NETDEVICES
 # that for each of the symbols.
 if NETDEVICES
 
+config NETDEVICES_MULTIQUEUE
+	bool "Netdevice multiple hardware queue support"
+	---help---
+	  Say Y here if you want to allow the network stack to use multiple
+	  hardware TX queues on an ethernet device.
+
+	  Most people will say N here.
+
 config IFB
 	tristate "Intermediate Functional Block support"
 	depends on NET_CLS_ACT
diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h
index f48eb89..6cdb973 100644
--- a/include/linux/etherdevice.h
+++ b/include/linux/etherdevice.h
@@ -39,7 +39,8 @@ extern void		eth_header_cache_update(struct hh_cache *hh, struct net_device *dev
 extern int		eth_header_cache(struct neighbour *neigh,
 					 struct hh_cache *hh);
 
-extern struct net_device *alloc_etherdev(int sizeof_priv);
+extern struct net_device *alloc_etherdev_mq(int sizeof_priv, unsigned int queue_count);
+#define alloc_etherdev(sizeof_priv) alloc_etherdev_mq(sizeof_priv, 1)
 
 /**
  * is_zero_ether_addr - Determine if give Ethernet address is all zeros.
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2c0cc19..1b43e15 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -108,6 +108,14 @@ struct wireless_dev;
 #define MAX_HEADER (LL_MAX_HEADER + 48)
 #endif
 
+struct net_device_subqueue
+{
+	/* Give a control state for each queue.  This struct may contain
+	 * per-queue locks in the future.
+	 */
+	unsigned long   state;
+};
+
 /*
  *	Network device statistics. Akin to the 2.0 ether stats but
  *	with byte counters.
@@ -331,6 +339,7 @@ struct net_device
 #define NETIF_F_VLAN_CHALLENGED	1024	/* Device cannot handle VLAN packets */
 #define NETIF_F_GSO		2048	/* Enable software GSO. */
 #define NETIF_F_LLTX		4096	/* LockLess TX */
+#define NETIF_F_MULTI_QUEUE	16384	/* Has multiple TX/RX queues */
 
 	/* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT	16
@@ -557,6 +566,10 @@ struct net_device
 
 	/* rtnetlink link ops */
 	const struct rtnl_link_ops *rtnl_link_ops;
+
+	/* The TX queue control structures */
+	unsigned int			egress_subqueue_count;
+	struct net_device_subqueue	egress_subqueue[0];
 };
 #define to_net_dev(d) container_of(d, struct net_device, dev)
 
@@ -719,6 +732,62 @@ static inline int netif_running(const struct net_device *dev)
 	return test_bit(__LINK_STATE_START, &dev->state);
 }
 
+/*
+ * Routines to manage the subqueues on a device.  We only need start,
+ * stop, and a check if it's stopped.  All other device management is
+ * done at the overall netdevice level.
+ * Also test the device if we're multiqueue.
+ */
+static inline void netif_start_subqueue(struct net_device *dev, u16 queue_index)
+{
+#ifdef CONFIG_NETDEVICES_MULTIQUEUE
+	clear_bit(__LINK_STATE_XOFF, &dev->egress_subqueue[queue_index].state);
+#endif
+}
+
+static inline void netif_stop_subqueue(struct net_device *dev, u16 queue_index)
+{
+#ifdef CONFIG_NETDEVICES_MULTIQUEUE
+#ifdef CONFIG_NETPOLL_TRAP
+	if (netpoll_trap())
+		return;
+#endif
+	set_bit(__LINK_STATE_XOFF, &dev->egress_subqueue[queue_index].state);
+#endif
+}
+
+static inline int netif_subqueue_stopped(const struct net_device *dev,
+					 u16 queue_index)
+{
+#ifdef CONFIG_NETDEVICES_MULTIQUEUE
+	return test_bit(__LINK_STATE_XOFF,
+			&dev->egress_subqueue[queue_index].state);
+#else
+	return 0;
+#endif
+}
+
+static inline void netif_wake_subqueue(struct net_device *dev, u16 queue_index)
+{
+#ifdef CONFIG_NETDEVICES_MULTIQUEUE
+#ifdef CONFIG_NETPOLL_TRAP
+	if (netpoll_trap())
+		return;
+#endif
+	if (test_and_clear_bit(__LINK_STATE_XOFF,
+			       &dev->egress_subqueue[queue_index].state))
+		__netif_schedule(dev);
+#endif
+}
+
+static inline int netif_is_multiqueue(const struct net_device *dev)
+{
+#ifdef CONFIG_NETDEVICES_MULTIQUEUE
+	return (!!(NETIF_F_MULTI_QUEUE & dev->features));
+#else
+	return 0;
+#endif
+}
 
 /* Use this variant when it is known for sure that it
  * is executing from interrupt context.
@@ -1009,8 +1078,11 @@ static inline void netif_tx_disable(struct net_device *dev)
 extern void		ether_setup(struct net_device *dev);
 
 /* Support for loadable net-drivers */
-extern struct net_device *alloc_netdev(int sizeof_priv, const char *name,
-				       void (*setup)(struct net_device *));
+extern struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
+				       void (*setup)(struct net_device *),
+				       unsigned int queue_count);
+#define alloc_netdev(sizeof_priv, name, setup) \
+	alloc_netdev_mq(sizeof_priv, name, setup, 1)
 extern int		register_netdev(struct net_device *dev);
 extern void		unregister_netdev(struct net_device *dev);
 /* Functions used for secondary unicast and multicast support */
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index b7b2628..e3851bb 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -196,7 +196,6 @@ typedef unsigned char *sk_buff_data_t;
  *	@sk: Socket we are owned by
  *	@tstamp: Time we arrived
  *	@dev: Device we arrived on/are leaving by
- *	@iif: ifindex of device we arrived on
  *	@transport_header: Transport layer header
  *	@network_header: Network layer header
  *	@mac_header: Link layer header
@@ -231,6 +230,8 @@ typedef unsigned char *sk_buff_data_t;
  *	@nfctinfo: Relationship of this skb to the connection
  *	@nfct_reasm: netfilter conntrack re-assembly pointer
  *	@nf_bridge: Saved data about a bridged frame - see br_netfilter.c
+ *	@iif: ifindex of device we arrived on
+ *	@queue_mapping: Queue mapping for multiqueue devices
  *	@tc_index: Traffic control index
  *	@tc_verd: traffic control verdict
  *	@dma_cookie: a cookie to one of several possible DMA operations
@@ -246,8 +247,6 @@ struct sk_buff {
 	struct sock		*sk;
 	ktime_t			tstamp;
 	struct net_device	*dev;
-	int			iif;
-	/* 4 byte hole on 64 bit*/
 
 	struct  dst_entry	*dst;
 	struct	sec_path	*sp;
@@ -290,12 +289,18 @@ struct sk_buff {
 #ifdef CONFIG_BRIDGE_NETFILTER
 	struct nf_bridge_info	*nf_bridge;
 #endif
+
+	int			iif;
+	__u16			queue_mapping;
+
 #ifdef CONFIG_NET_SCHED
 	__u16			tc_index;	/* traffic control index */
 #ifdef CONFIG_NET_CLS_ACT
 	__u16			tc_verd;	/* traffic control verdict */
 #endif
 #endif
+	/* 2 byte hole */
+
 #ifdef CONFIG_NET_DMA
 	dma_cookie_t		dma_cookie;
 #endif
@@ -1721,6 +1726,20 @@ static inline void skb_init_secmark(struct sk_buff *skb)
 { }
 #endif
 
+static inline void skb_set_queue_mapping(struct sk_buff *skb, u16 queue_mapping)
+{
+#ifdef CONFIG_NETDEVICES_MULTIQUEUE
+	skb->queue_mapping = queue_mapping;
+#endif
+}
+
+static inline void skb_copy_queue_mapping(struct sk_buff *to, const struct sk_buff *from)
+{
+#ifdef CONFIG_NETDEVICES_MULTIQUEUE
+	to->queue_mapping = from->queue_mapping;
+#endif
+}
+
 static inline int skb_is_gso(const struct sk_buff *skb)
 {
 	return skb_shinfo(skb)->gso_size;
diff --git a/net/core/dev.c b/net/core/dev.c
index 778e102..e94991a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1429,7 +1429,9 @@ gso:
 			skb->next = nskb;
 			return rc;
 		}
-		if (unlikely(netif_queue_stopped(dev) && skb->next))
+		if (unlikely((netif_queue_stopped(dev) ||
+			     netif_subqueue_stopped(dev, skb->queue_mapping)) &&
+			     skb->next))
 			return NETDEV_TX_BUSY;
 	} while (skb->next);
 
@@ -1547,6 +1549,8 @@ gso:
 		spin_lock(&dev->queue_lock);
 		q = dev->qdisc;
 		if (q->enqueue) {
+			/* reset queue_mapping to zero */
+			skb->queue_mapping = 0;
 			rc = q->enqueue(skb, q);
 			qdisc_run(dev);
 			spin_unlock(&dev->queue_lock);
@@ -1576,7 +1580,8 @@ gso:
 
 			HARD_TX_LOCK(dev, cpu);
 
-			if (!netif_queue_stopped(dev)) {
+			if (!netif_queue_stopped(dev) &&
+			    !netif_subqueue_stopped(dev, skb->queue_mapping)) {
 				rc = 0;
 				if (!dev_hard_start_xmit(skb, dev)) {
 					HARD_TX_UNLOCK(dev);
@@ -3539,16 +3544,18 @@ static struct net_device_stats *internal_stats(struct net_device *dev)
 }
 
 /**
- *	alloc_netdev - allocate network device
+ *	alloc_netdev_mq - allocate network device
  *	@sizeof_priv:	size of private data to allocate space for
  *	@name:		device name format string
  *	@setup:		callback to initialize device
+ *	@queue_count:	the number of subqueues to allocate
  *
  *	Allocates a struct net_device with private data area for driver use
- *	and performs basic initialization.
+ *	and performs basic initialization.  Also allocates subqueue structs
+ *	for each queue on the device at the end of the netdevice.
  */
-struct net_device *alloc_netdev(int sizeof_priv, const char *name,
-		void (*setup)(struct net_device *))
+struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
+		void (*setup)(struct net_device *), unsigned int queue_count)
 {
 	void *p;
 	struct net_device *dev;
@@ -3557,7 +3564,9 @@ struct net_device *alloc_netdev(int sizeof_priv, const char *name,
 	BUG_ON(strlen(name) >= sizeof(dev->name));
 
 	/* ensure 32-byte alignment of both the device and private area */
-	alloc_size = (sizeof(*dev) + NETDEV_ALIGN_CONST) & ~NETDEV_ALIGN_CONST;
+	alloc_size = (sizeof(*dev) + NETDEV_ALIGN_CONST +
+		     (sizeof(struct net_device_subqueue) * queue_count)) &
+		     ~NETDEV_ALIGN_CONST;
 	alloc_size += sizeof_priv + NETDEV_ALIGN_CONST;
 
 	p = kzalloc(alloc_size, GFP_KERNEL);
@@ -3573,12 +3582,14 @@ struct net_device *alloc_netdev(int sizeof_priv, const char *name,
 	if (sizeof_priv)
 		dev->priv = netdev_priv(dev);
 
+	dev->egress_subqueue_count = queue_count;
+
 	dev->get_stats = internal_stats;
 	setup(dev);
 	strcpy(dev->name, name);
 	return dev;
 }
-EXPORT_SYMBOL(alloc_netdev);
+EXPORT_SYMBOL(alloc_netdev_mq);
 
 /**
  *	free_netdev - free network device
diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 758dafe..2ace33d 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -66,8 +66,9 @@ static void queue_process(struct work_struct *work)
 
 		local_irq_save(flags);
 		netif_tx_lock(dev);
-		if (netif_queue_stopped(dev) ||
-		    dev->hard_start_xmit(skb, dev) != NETDEV_TX_OK) {
+		if ((netif_queue_stopped(dev) ||
+		     netif_subqueue_stopped(dev, skb->queue_mapping)) ||
+		     dev->hard_start_xmit(skb, dev) != NETDEV_TX_OK) {
 			skb_queue_head(&npinfo->txq, skb);
 			netif_tx_unlock(dev);
 			local_irq_restore(flags);
@@ -254,7 +255,8 @@ static void netpoll_send_skb(struct netpoll *np, struct sk_buff *skb)
 			/* try until next clock tick */
 			for (tries = jiffies_to_usecs(1)/USEC_PER_POLL;
 					tries > 0; --tries) {
-				if (!netif_queue_stopped(dev))
+				if (!netif_queue_stopped(dev) &&
+				    !netif_subqueue_stopped(dev, skb->queue_mapping))
 					status = dev->hard_start_xmit(skb, dev);
 
 				if (status == NETDEV_TX_OK)
diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index 9cd3a1c..dffe067 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -3139,7 +3139,9 @@ static __inline__ void pktgen_xmit(struct pktgen_dev *pkt_dev)
 		}
 	}
 
-	if (netif_queue_stopped(odev) || need_resched()) {
+	if ((netif_queue_stopped(odev) ||
+	     netif_subqueue_stopped(odev, pkt_dev->skb->queue_mapping)) ||
+	     need_resched()) {
 		idle_start = getCurUs();
 
 		if (!netif_running(odev)) {
@@ -3154,7 +3156,8 @@ static __inline__ void pktgen_xmit(struct pktgen_dev *pkt_dev)
 
 		pkt_dev->idle_acc += getCurUs() - idle_start;
 
-		if (netif_queue_stopped(odev)) {
+		if (netif_queue_stopped(odev) ||
+		    netif_subqueue_stopped(odev, pkt_dev->skb->queue_mapping)) {
 			pkt_dev->next_tx_us = getCurUs();	/* TODO */
 			pkt_dev->next_tx_ns = 0;
 			goto out;	/* Try the next interface */
@@ -3181,7 +3184,8 @@ static __inline__ void pktgen_xmit(struct pktgen_dev *pkt_dev)
 	}
 
 	netif_tx_lock_bh(odev);
-	if (!netif_queue_stopped(odev)) {
+	if (!netif_queue_stopped(odev) &&
+	    !netif_subqueue_stopped(odev, pkt_dev->skb->queue_mapping)) {
 
 		atomic_inc(&(pkt_dev->skb->users));
 	      retry_now:
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 8d8e8fc..af03556 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -419,6 +419,7 @@ struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
 	n->nohdr = 0;
 	C(pkt_type);
 	C(ip_summed);
+	skb_copy_queue_mapping(n, skb);
 	C(priority);
 #if defined(CONFIG_IP_VS) || defined(CONFIG_IP_VS_MODULE)
 	C(ipvs_property);
@@ -460,6 +461,7 @@ static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 #endif
 	new->sk		= NULL;
 	new->dev	= old->dev;
+	skb_copy_queue_mapping(new, old);
 	new->priority	= old->priority;
 	new->protocol	= old->protocol;
 	new->dst	= dst_clone(old->dst);
@@ -1927,6 +1929,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, int features)
 		tail = nskb;
 
 		nskb->dev = skb->dev;
+		skb_copy_queue_mapping(nskb, skb);
 		nskb->priority = skb->priority;
 		nskb->protocol = skb->protocol;
 		nskb->dst = dst_clone(skb->dst);
diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
index 0ac2524..1387e54 100644
--- a/net/ethernet/eth.c
+++ b/net/ethernet/eth.c
@@ -316,9 +316,10 @@ void ether_setup(struct net_device *dev)
 EXPORT_SYMBOL(ether_setup);
 
 /**
- * alloc_etherdev - Allocates and sets up an Ethernet device
+ * alloc_etherdev_mq - Allocates and sets up an Ethernet device
  * @sizeof_priv: Size of additional driver-private structure to be allocated
  *	for this Ethernet device
+ * @queue_count: The number of queues this device has.
  *
  * Fill in the fields of the device structure with Ethernet-generic
  * values. Basically does everything except registering the device.
@@ -328,8 +329,8 @@ EXPORT_SYMBOL(ether_setup);
  * this private data area.
  */
 
-struct net_device *alloc_etherdev(int sizeof_priv)
+struct net_device *alloc_etherdev_mq(int sizeof_priv, unsigned int queue_count)
 {
-	return alloc_netdev(sizeof_priv, "eth%d", ether_setup);
+	return alloc_netdev_mq(sizeof_priv, "eth%d", ether_setup, queue_count);
 }
-EXPORT_SYMBOL(alloc_etherdev);
+EXPORT_SYMBOL(alloc_etherdev_mq);

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-28 16:20 [PATCH] NET: Multiple queue hardware support PJ Waskiewicz
                   ` (3 preceding siblings ...)
  2007-06-28 17:57 ` [CORE] Stack changes to add multiqueue hardware support API Patrick McHardy
@ 2007-06-28 17:57 ` Patrick McHardy
  4 siblings, 0 replies; 60+ messages in thread
From: Patrick McHardy @ 2007-06-28 17:57 UTC (permalink / raw)
  To: PJ Waskiewicz; +Cc: davem, netdev, jeff, auke-jan.h.kok, hadi

[-- Attachment #1: Type: text/plain, Size: 374 bytes --]

Updated version of Peter's patch, changes:

- remove NET_SCH_MULTIQUEUE
- remove all ifdefs, the price is 4-8 bytes additional
  memory usage per prio qdisc.
- return -EOPNOTSUPP when multiqueue is requested but not supported
  on the device or not compiled in
- clean up prio_classify, only a single assignment of skb->queue_mapping
- fix error handling during registration


[-- Attachment #2: 02.diff --]
[-- Type: text/x-diff, Size: 7517 bytes --]

[SCHED] Qdisc changes and sch_rr added for multiqueue

Add the new sch_rr qdisc for multiqueue network device support.
Allow sch_prio and sch_rr to be compiled with or without multiqueue
hardware support.

sch_rr is part of sch_prio, and is referenced from MODULE_ALIAS.  This
was done since sch_prio and sch_rr only differ in their dequeue routine.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>

---
commit 69316230fac0d52803f098565d530b8061d32aea
tree 06fa5dc8cfb274ad712c3646df38efe01aa8ea52
parent 48e334930c5fbb64a821733a7056e2800057c128
author Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> Thu, 28 Jun 2007 18:47:49 +0200
committer Patrick McHardy <kaber@trash.net> Thu, 28 Jun 2007 19:56:28 +0200

 include/linux/pkt_sched.h |    9 +++
 net/sched/Kconfig         |   11 ++++
 net/sched/sch_prio.c      |  127 ++++++++++++++++++++++++++++++++++++++++-----
 3 files changed, 132 insertions(+), 15 deletions(-)

diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index d10f353..268c515 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -101,6 +101,15 @@ struct tc_prio_qopt
 	__u8	priomap[TC_PRIO_MAX+1];	/* Map: logical priority -> PRIO band */
 };
 
+enum
+{
+	TCA_PRIO_UNSPEC,
+	TCA_PRIO_MQ,
+	__TCA_PRIO_MAX
+};
+
+#define TCA_PRIO_MAX    (__TCA_PRIO_MAX - 1)
+
 /* TBF section */
 
 struct tc_tbf_qopt
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 475df84..f321794 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -111,6 +111,17 @@ config NET_SCH_PRIO
 	  To compile this code as a module, choose M here: the
 	  module will be called sch_prio.
 
+config NET_SCH_RR
+	tristate "Multi Band Round Robin Queuing (RR)"
+	select NET_SCH_PRIO
+	---help---
+	  Say Y here if you want to use an n-band round robin packet
+	  scheduler.
+
+	  The module uses sch_prio for its framework and is aliased as
+	  sch_rr, so it will load sch_prio, although it is referred
+	  to as sch_rr.
+
 config NET_SCH_RED
 	tristate "Random Early Detection (RED)"
 	---help---
diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c
index 6d7542c..cc6b541 100644
--- a/net/sched/sch_prio.c
+++ b/net/sched/sch_prio.c
@@ -40,9 +40,11 @@
 struct prio_sched_data
 {
 	int bands;
+	int curband; /* for round-robin */
 	struct tcf_proto *filter_list;
 	u8  prio2band[TC_PRIO_MAX+1];
 	struct Qdisc *queues[TCQ_PRIO_BANDS];
+	int mq;
 };
 
 
@@ -70,14 +72,17 @@ prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
 #endif
 			if (TC_H_MAJ(band))
 				band = 0;
-			return q->queues[q->prio2band[band&TC_PRIO_MAX]];
+			band = q->prio2band[band&TC_PRIO_MAX];
+			goto out;
 		}
 		band = res.classid;
 	}
 	band = TC_H_MIN(band) - 1;
 	if (band >= q->bands)
-		return q->queues[q->prio2band[0]];
-
+		band = q->prio2band[0];
+out:
+	if (q->mq)
+		skb_set_queue_mapping(skb, band);
 	return q->queues[band];
 }
 
@@ -144,17 +149,58 @@ prio_dequeue(struct Qdisc* sch)
 	struct Qdisc *qdisc;
 
 	for (prio = 0; prio < q->bands; prio++) {
-		qdisc = q->queues[prio];
-		skb = qdisc->dequeue(qdisc);
-		if (skb) {
-			sch->q.qlen--;
-			return skb;
+		/* Check if the target subqueue is available before
+		 * pulling an skb.  This way we avoid excessive requeues
+		 * for slower queues.
+		 */
+		if (!netif_subqueue_stopped(sch->dev, (q->mq ? prio : 0))) {
+			qdisc = q->queues[prio];
+			skb = qdisc->dequeue(qdisc);
+			if (skb) {
+				sch->q.qlen--;
+				return skb;
+			}
 		}
 	}
 	return NULL;
 
 }
 
+static struct sk_buff *rr_dequeue(struct Qdisc* sch)
+{
+	struct sk_buff *skb;
+	struct prio_sched_data *q = qdisc_priv(sch);
+	struct Qdisc *qdisc;
+	int bandcount;
+
+	/* Only take one pass through the queues.  If nothing is available,
+	 * return nothing.
+	 */
+	for (bandcount = 0; bandcount < q->bands; bandcount++) {
+		/* Check if the target subqueue is available before
+		 * pulling an skb.  This way we avoid excessive requeues
+		 * for slower queues.  If the queue is stopped, try the
+		 * next queue.
+		 */
+		if (!netif_subqueue_stopped(sch->dev,
+					    (q->mq ? q->curband : 0))) {
+			qdisc = q->queues[q->curband];
+			skb = qdisc->dequeue(qdisc);
+			if (skb) {
+				sch->q.qlen--;
+				q->curband++;
+				if (q->curband >= q->bands)
+					q->curband = 0;
+				return skb;
+			}
+		}
+		q->curband++;
+		if (q->curband >= q->bands)
+			q->curband = 0;
+	}
+	return NULL;
+}
+
 static unsigned int prio_drop(struct Qdisc* sch)
 {
 	struct prio_sched_data *q = qdisc_priv(sch);
@@ -198,21 +244,39 @@ prio_destroy(struct Qdisc* sch)
 static int prio_tune(struct Qdisc *sch, struct rtattr *opt)
 {
 	struct prio_sched_data *q = qdisc_priv(sch);
-	struct tc_prio_qopt *qopt = RTA_DATA(opt);
+	struct tc_prio_qopt *qopt;
+	struct rtattr *tb[TCA_PRIO_MAX];
 	int i;
 
-	if (opt->rta_len < RTA_LENGTH(sizeof(*qopt)))
+	if (rtattr_parse_nested_compat(tb, TCA_PRIO_MAX, opt, qopt,
+				       sizeof(*qopt)))
 		return -EINVAL;
-	if (qopt->bands > TCQ_PRIO_BANDS || qopt->bands < 2)
+	q->bands = qopt->bands;
+	/* If we're multiqueue, make sure the number of incoming bands
+	 * matches the number of queues on the device we're associating with.
+	 * If the number of bands requested is zero, then set q->bands to
+	 * dev->egress_subqueue_count.
+	 */
+	q->mq = RTA_GET_FLAG(tb[TCA_PRIO_MQ - 1]);
+	if (q->mq) {
+		if (netif_is_multiqueue(sch->dev)) {
+			if (q->bands == 0)
+				q->bands = sch->dev->egress_subqueue_count;
+			else if (q->bands != sch->dev->egress_subqueue_count)
+				return -EINVAL;
+		} else
+			return -EOPNOTSUPP;
+	}
+
+	if (q->bands > TCQ_PRIO_BANDS || q->bands < 2)
 		return -EINVAL;
 
 	for (i=0; i<=TC_PRIO_MAX; i++) {
-		if (qopt->priomap[i] >= qopt->bands)
+		if (qopt->priomap[i] >= q->bands)
 			return -EINVAL;
 	}
 
 	sch_tree_lock(sch);
-	q->bands = qopt->bands;
 	memcpy(q->prio2band, qopt->priomap, TC_PRIO_MAX+1);
 
 	for (i=q->bands; i<TCQ_PRIO_BANDS; i++) {
@@ -268,11 +332,17 @@ static int prio_dump(struct Qdisc *sch, struct sk_buff *skb)
 {
 	struct prio_sched_data *q = qdisc_priv(sch);
 	unsigned char *b = skb_tail_pointer(skb);
+	struct rtattr *nest;
 	struct tc_prio_qopt opt;
 
 	opt.bands = q->bands;
 	memcpy(&opt.priomap, q->prio2band, TC_PRIO_MAX+1);
-	RTA_PUT(skb, TCA_OPTIONS, sizeof(opt), &opt);
+
+	nest = RTA_NEST_COMPAT(skb, TCA_OPTIONS, sizeof(opt), &opt);
+	if (q->mq)
+		RTA_PUT_FLAG(skb, TCA_PRIO_MQ);
+	RTA_NEST_COMPAT_END(skb, nest);
+
 	return skb->len;
 
 rtattr_failure:
@@ -443,17 +513,44 @@ static struct Qdisc_ops prio_qdisc_ops = {
 	.owner		=	THIS_MODULE,
 };
 
+static struct Qdisc_ops rr_qdisc_ops = {
+	.next		=	NULL,
+	.cl_ops		=	&prio_class_ops,
+	.id		=	"rr",
+	.priv_size	=	sizeof(struct prio_sched_data),
+	.enqueue	=	prio_enqueue,
+	.dequeue	=	rr_dequeue,
+	.requeue	=	prio_requeue,
+	.drop		=	prio_drop,
+	.init		=	prio_init,
+	.reset		=	prio_reset,
+	.destroy	=	prio_destroy,
+	.change		=	prio_tune,
+	.dump		=	prio_dump,
+	.owner		=	THIS_MODULE,
+};
+
 static int __init prio_module_init(void)
 {
-	return register_qdisc(&prio_qdisc_ops);
+	int err;
+
+	err = register_qdisc(&prio_qdisc_ops);
+	if (err < 0)
+		return err;
+	err = register_qdisc(&rr_qdisc_ops);
+	if (err < 0)
+		unregister_qdisc(&prio_qdisc_ops);
+	return err;
 }
 
 static void __exit prio_module_exit(void)
 {
 	unregister_qdisc(&prio_qdisc_ops);
+	unregister_qdisc(&rr_qdisc_ops);
 }
 
 module_init(prio_module_init)
 module_exit(prio_module_exit)
 
 MODULE_LICENSE("GPL");
+MODULE_ALIAS("sch_rr");

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* RE: [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API
  2007-06-28 17:00   ` Patrick McHardy
@ 2007-06-28 19:00     ` Waskiewicz Jr, Peter P
  2007-06-28 19:03       ` Patrick McHardy
  0 siblings, 1 reply; 60+ messages in thread
From: Waskiewicz Jr, Peter P @ 2007-06-28 19:00 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: davem, netdev, jeff, Kok, Auke-jan H, hadi

> PJ Waskiewicz wrote:
> >  include/linux/etherdevice.h |    3 +-
> >  include/linux/netdevice.h   |   62 
> ++++++++++++++++++++++++++++++++++++++++++-
> >  include/linux/skbuff.h      |    4 ++-
> >  net/core/dev.c              |   27 +++++++++++++------
> >  net/core/netpoll.c          |    8 +++---
> >  net/core/pktgen.c           |   10 +++++--
> >  net/core/skbuff.c           |    3 ++
> >  net/ethernet/eth.c          |    9 +++---
> >  8 files changed, 104 insertions(+), 22 deletions(-)
> 
> > include/linux/pkt_sched.h |    9 +++
> > net/sched/Kconfig         |   23 +++++++
> > net/sched/sch_prio.c      |  147
> +++++++++++++++++++++++++++++++++++++++++----
> >  3 files changed, 166 insertions(+), 13 deletions(-)
> 
> 
> Quick question: where are the sch_generic changes? :)
> 
> If you hold for ten minutes I'll post a set of slightly 
> changed patches with the NETDEVICES_MULTIQUEUE option and a 
> fix for this.

Jamal's and KK's qdisc_restart() rewrite took the netif_queue_stopped()
call out of sch_generic.c.  So the underlying qdisc is now solely
responsible for checking the queue status before dequeueing.
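
For illustration, a multiqueue-aware dequeue now looks roughly like this
(a minimal sketch distilled from the prio_dequeue() hunk in the patch
above; error handling omitted, and mq_dequeue_sketch is just an
illustrative name):

	static struct sk_buff *mq_dequeue_sketch(struct Qdisc *sch)
	{
		struct prio_sched_data *q = qdisc_priv(sch);
		struct Qdisc *qdisc;
		struct sk_buff *skb;
		int prio;

		for (prio = 0; prio < q->bands; prio++) {
			/* Skip bands whose hardware queue is stopped,
			 * instead of dequeueing and eating a requeue on
			 * NETDEV_TX_BUSY later.
			 */
			if (netif_subqueue_stopped(sch->dev,
						   q->mq ? prio : 0))
				continue;
			qdisc = q->queues[prio];
			skb = qdisc->dequeue(qdisc);
			if (skb) {
				sch->q.qlen--;
				return skb;
			}
		}
		return NULL;
	}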

-PJ

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API
  2007-06-28 19:00     ` Waskiewicz Jr, Peter P
@ 2007-06-28 19:03       ` Patrick McHardy
  2007-06-28 19:06         ` Waskiewicz Jr, Peter P
  0 siblings, 1 reply; 60+ messages in thread
From: Patrick McHardy @ 2007-06-28 19:03 UTC (permalink / raw)
  To: Waskiewicz Jr, Peter P; +Cc: davem, netdev, jeff, Kok, Auke-jan H, hadi

Waskiewicz Jr, Peter P wrote:
>>Quick question: where are the sch_generic changes? :)
>>
>>If you hold for ten minutes I'll post a set of slightly 
>>changed patches with the NETDEVICES_MULTIQUEUE option and a 
>>fix for this.
> 
> 
> Jamal's and KK's qdisc_restart() rewrite took the netif_queue_stopped()
> call out of sch_generic.c.  So the underlying qdisc is only responsible
> for checking the queue status now before dequeueing.


Yes, I noticed that now. It doesn't seem right, though, as long as
queueing while the queue is stopped is treated as a bug by the
drivers.

But I vaguely recall seeing a discussion about this, I'll check
the archives.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-28 17:13   ` Patrick McHardy
@ 2007-06-28 19:04     ` Waskiewicz Jr, Peter P
  2007-06-28 19:17       ` Patrick McHardy
  0 siblings, 1 reply; 60+ messages in thread
From: Waskiewicz Jr, Peter P @ 2007-06-28 19:04 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: davem, netdev, jeff, Kok, Auke-jan H, hadi

> PJ Waskiewicz wrote:
> > +#ifdef CONFIG_NET_SCH_MULTIQUEUE
> > +			if (q->mq)
> > +				skb->queue_mapping = 
> > +						
> q->prio2band[band&TC_PRIO_MAX];
> > +			else
> > +				skb->queue_mapping = 0;
> > +#endif
> 
> 
> Setting it to zero here is wrong, consider:
> 
> root qdisc: prio multiqueue
> child qdisc: prio non-multiqueue
> 
> The top-level qdisc will set it, the child qdisc will unset it again.
> When multiqueue is inactive it should not touch it.
> 
> I'll fix that as well.

But the child can't assume the device is multiqueue if the child is
non-multiqueue.  This is the same issue with IP forwarding, where if you
forward through a multiqueue device to a non-mq device, you don't know
if the destination device is multiqueue.  So the last qdisc to actually
dequeue into a device should have control over what the queue mapping is.  If
a user had a multiqueue qdisc as root, and configured a child qdisc as
non-mq, that is a configuration error if the underlying device is indeed
multiqueue IMO.

-PJ

^ permalink raw reply	[flat|nested] 60+ messages in thread

* RE: [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API
  2007-06-28 19:03       ` Patrick McHardy
@ 2007-06-28 19:06         ` Waskiewicz Jr, Peter P
  2007-06-28 19:20           ` Patrick McHardy
  0 siblings, 1 reply; 60+ messages in thread
From: Waskiewicz Jr, Peter P @ 2007-06-28 19:06 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: davem, netdev, jeff, Kok, Auke-jan H, hadi

> Waskiewicz Jr, Peter P wrote:
> >>Quick question: where are the sch_generic changes? :)
> >>
> >>If you hold for ten minutes I'll post a set of slightly changed 
> >>patches with the NETDEVICES_MULTIQUEUE option and a fix for this.
> > 
> > 
> > Jamal's and KK's qdisc_restart() rewrite took the 
> netif_queue_stopped()
> > call out of sch_generic.c.  So the underlying qdisc is only 
> responsible
> > for checking the queue status now before dequeueing.
> 
> 
> Yes, I noticed that now. Doesn't seem right though as long as
> queueing while queue is stopped is treated as a bug by the
> drivers.
> 
> But I vaguely recall seeing a discussion about this, I'll check
> the archives.

The basic gist is that before the dequeue is done, the qdisc is locked by
the QDISC_RUNNING bit, so another CPU cannot get in there.  So if the
queue isn't stopped when a dequeue is done, that same queue should not
be stopped when hard_start_xmit() is called.  The only case I could
think of is some out-of-band cleanup routine in the driver holding the
tx_ring lock and bouncing the skb back, where the driver returns
NETDEV_TX_BUSY and you requeue.  This is an extreme corner case, so the
check could be removed.
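
To make the corner case concrete, here is a hypothetical driver fragment
(example_priv, example_hard_start_xmit and tx_ring_lock are made-up
names, not from any real driver; assumes <linux/netdevice.h>):

	struct example_priv {
		spinlock_t	tx_ring_lock;
		/* per-ring driver state would live here */
	};

	static int example_hard_start_xmit(struct sk_buff *skb,
					   struct net_device *dev)
	{
		struct example_priv *priv = netdev_priv(dev);

		spin_lock(&priv->tx_ring_lock);
		/* An out-of-band cleanup path may have stopped this ring
		 * after the qdisc dequeued the skb but before we took the
		 * tx_ring lock; bounce the skb back so it gets requeued.
		 */
		if (unlikely(netif_subqueue_stopped(dev, skb->queue_mapping))) {
			spin_unlock(&priv->tx_ring_lock);
			return NETDEV_TX_BUSY;
		}
		/* ... post skb to the ring picked by skb->queue_mapping ... */
		spin_unlock(&priv->tx_ring_lock);
		return NETDEV_TX_OK;
	}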

-PJ

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-28 19:04     ` Waskiewicz Jr, Peter P
@ 2007-06-28 19:17       ` Patrick McHardy
  2007-06-28 19:21         ` Waskiewicz Jr, Peter P
  0 siblings, 1 reply; 60+ messages in thread
From: Patrick McHardy @ 2007-06-28 19:17 UTC (permalink / raw)
  To: Waskiewicz Jr, Peter P; +Cc: davem, netdev, jeff, Kok, Auke-jan H, hadi

Waskiewicz Jr, Peter P wrote:
>>PJ Waskiewicz wrote:
>>
>>>+#ifdef CONFIG_NET_SCH_MULTIQUEUE
>>>+			if (q->mq)
>>>+				skb->queue_mapping = 
>>>+						
>>
>>q->prio2band[band&TC_PRIO_MAX];
>>
>>>+			else
>>>+				skb->queue_mapping = 0;
>>>+#endif
>>
>>
>>Setting it to zero here is wrong, consider:
>>
>>root qdisc: prio multiqueue
>>child qdisc: prio non-multiqueue
>>
>>The top-level qdisc will set it, the child qdisc will unset it again.
>>When multiqueue is inactive it should not touch it.
>>
>>I'll fix that as well.
> 
> 
> But the child can't assume the device is multiqueue if the child is
> non-multiqueue.

The child doesn't have to assume anything.

> This is the same issue with IP forwarding, where if you
> forward through a multiqueue device to a non-mq device, you don't know
> if the destination device is multiqueue.

No its not. I'm talking about nested qdiscs, which are all on
a single device.

> So the last qdisc to actually
> dequeue into a device should have control what the queue mapping is.


Fully agreed. And that is always the top-level qdisc.

> If
> a user had a multiqueue qdisc as root, and configured a child qdisc as
> non-mq, that is a configuration error if the underlying device is indeed
> multiqueue IMO.


Absolutely not. First of all, it's perfectly valid to use non-multiqueue
qdiscs on multiqueue devices. Secondly, it's only the root qdisc that
has to know about multiqueue since that one controls the child qdiscs.

Think about it: it makes absolutely no sense to have the child
qdisc even know about multiqueue. Changing my example to have
a multiqueue qdisc as child:

root qdisc: 2 band prio multiqueue
child qdisc of band 0: 2 band prio multiqueue

When the root qdisc decides to dequeue band0, it checks whether subqueue
0 is active and dequeues the child qdisc. If the child qdisc is indeed
another multiqueue qdisc as you suggest, if might decide to dequeue its
own band 1 and checks that subqueue state. So where should the packet
finally end up? And what if one of both subqueues is inactive?

The only reasonable thing it can do is not care about multiqueue and
just dequeue as usual. In fact I think it should be an error to
configure multiqueue on a non-root qdisc.
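
Concretely, that amounts to a root-handle check in prio_tune() right
where the flag is parsed (this is the check the updated patch later in
this thread adds):

	q->mq = RTA_GET_FLAG(tb[TCA_PRIO_MQ - 1]);
	if (q->mq && sch->handle != TC_H_ROOT)
		return -EINVAL;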

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API
  2007-06-28 19:06         ` Waskiewicz Jr, Peter P
@ 2007-06-28 19:20           ` Patrick McHardy
  2007-06-28 19:32             ` Jeff Garzik
  0 siblings, 1 reply; 60+ messages in thread
From: Patrick McHardy @ 2007-06-28 19:20 UTC (permalink / raw)
  To: Waskiewicz Jr, Peter P; +Cc: davem, netdev, jeff, Kok, Auke-jan H, hadi

Waskiewicz Jr, Peter P wrote:
>>Waskiewicz Jr, Peter P wrote:
>>
>>Yes, I noticed that now. Doesn't seem right though as long as
>>queueing while queue is stopped is treated as a bug by the
>>drivers.
>>
>>But I vaguely recall seeing a discussion about this, I'll check
>>the archives.
> 
> 
> The basic gist is before the dequeue is done, the qdisc is locked by the
> qdisc is running bit, so another CPU cannot get in there.  So if the
> queue isn't stopped when a dequeue is done, that same queue should not
> be stopped when hard_start_xmit() is called.  The only thing I could
> think of that could happen is some out-of-band cleanup routine in the
> driver where the tx_ring lock is held, and the skb is bounced back,
> where the driver returns NETIF_TX_BUSY, and you requeue.  This is an
> extreme corner case, so the check could be removed.


Yes, but there are users that don't go through qdiscs, like netpoll.
Having them check the QDISC_RUNNING bit seems ugly.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-28 19:17       ` Patrick McHardy
@ 2007-06-28 19:21         ` Waskiewicz Jr, Peter P
  2007-06-28 19:24           ` Patrick McHardy
  0 siblings, 1 reply; 60+ messages in thread
From: Waskiewicz Jr, Peter P @ 2007-06-28 19:21 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: davem, netdev, jeff, Kok, Auke-jan H, hadi

> Absolutely not. First of all, its perfectly valid to use 
> non-multiqueue qdiscs on multiqueue devices. Secondly, its 
> only the root qdisc that has to know about multiqueue since 
> that one controls the child qdiscs.
> 
> Think about it, it makes absolutely no sense to have the 
> child qdisc even know about multiqueue. Changing my example 
> to have a multiqueue qdisc as child:
> 
> root qdisc: 2 band prio multiqueue
> child qdisc of band 0: 2 band prio multiqueue
> 
> When the root qdisc decides to dequeue band0, it checks 
> whether subqueue 0 is active and dequeues the child qdisc. If 
> the child qdisc is indeed another multiqueue qdisc as you 
> suggest, if might decide to dequeue its own band 1 and checks 
> that subqueue state. So where should the packet finally end 
> up? And what if one of both subqueues is inactive?
> 
> The only reasonable thing it can do is not care about 
> multiqueue and just dequeue as usual. In fact I think it 
> should be an error to configure multiqueue on a non-root qdisc.

Ack.  This is a thought process that trips me up from time to time...I
see child qdisc, and think that's the last qdisc to dequeue and send to
the device, not the first one to dequeue.  So please disregard my
comments before; I totally agree with you.  Great catch here; I really
like the prio_classify() cleanup.

As always, many thanks for your feedback and help Patrick.

-PJ

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-28 19:21         ` Waskiewicz Jr, Peter P
@ 2007-06-28 19:24           ` Patrick McHardy
  2007-06-28 19:27             ` Waskiewicz Jr, Peter P
  2007-06-29  4:20             ` David Miller
  0 siblings, 2 replies; 60+ messages in thread
From: Patrick McHardy @ 2007-06-28 19:24 UTC (permalink / raw)
  To: Waskiewicz Jr, Peter P; +Cc: davem, netdev, jeff, Kok, Auke-jan H, hadi

[-- Attachment #1: Type: text/plain, Size: 641 bytes --]

Waskiewicz Jr, Peter P wrote:
>>[...]
>>The only reasonable thing it can do is not care about 
>>multiqueue and just dequeue as usual. In fact I think it 
>>should be an error to configure multiqueue on a non-root qdisc.
> 
> 
> Ack.  This is a thought process that trips me up from time to time...I
> see child qdisc, and think that's the last qdisc to dequeue and send to
> the device, not the first one to dequeue.  So please disregard my
> comments before; I totally agree with you.  Great catch here; I really
> like the prio_classify() cleanup.


Thanks. This updated patch makes configuring a non-root qdisc for
multiqueue an error.


[-- Attachment #2: 02.diff --]
[-- Type: text/x-diff, Size: 7570 bytes --]

[SCHED] Qdisc changes and sch_rr added for multiqueue

Add the new sch_rr qdisc for multiqueue network device support.
Allow sch_prio and sch_rr to be compiled with or without multiqueue
hardware support.

sch_rr is part of sch_prio, and is referenced from MODULE_ALIAS.  This
was done since sch_prio and sch_rr only differ in their dequeue routine.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>

---
commit 8798d6bf4f3ed4f5b162184282adc714ef5b69b1
tree b3ba373a0534b905b34abc520eba6adecdcc3ca5
parent 48e334930c5fbb64a821733a7056e2800057c128
author Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> Thu, 28 Jun 2007 18:47:49 +0200
committer Patrick McHardy <kaber@trash.net> Thu, 28 Jun 2007 21:23:44 +0200

 include/linux/pkt_sched.h |    9 +++
 net/sched/Kconfig         |   11 ++++
 net/sched/sch_prio.c      |  129 ++++++++++++++++++++++++++++++++++++++++-----
 3 files changed, 134 insertions(+), 15 deletions(-)

diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index d10f353..268c515 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -101,6 +101,15 @@ struct tc_prio_qopt
 	__u8	priomap[TC_PRIO_MAX+1];	/* Map: logical priority -> PRIO band */
 };
 
+enum
+{
+	TCA_PRIO_UNSPEC,
+	TCA_PRIO_MQ,
+	__TCA_PRIO_MAX
+};
+
+#define TCA_PRIO_MAX    (__TCA_PRIO_MAX - 1)
+
 /* TBF section */
 
 struct tc_tbf_qopt
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 475df84..f321794 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -111,6 +111,17 @@ config NET_SCH_PRIO
 	  To compile this code as a module, choose M here: the
 	  module will be called sch_prio.
 
+config NET_SCH_RR
+	tristate "Multi Band Round Robin Queuing (RR)"
+	select NET_SCH_PRIO
+	---help---
+	  Say Y here if you want to use an n-band round robin packet
+	  scheduler.
+
+	  The module uses sch_prio for its framework and is aliased as
+	  sch_rr, so it will load sch_prio, although it is referred
+	  to as sch_rr.
+
 config NET_SCH_RED
 	tristate "Random Early Detection (RED)"
 	---help---
diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c
index 6d7542c..4045220 100644
--- a/net/sched/sch_prio.c
+++ b/net/sched/sch_prio.c
@@ -40,9 +40,11 @@
 struct prio_sched_data
 {
 	int bands;
+	int curband; /* for round-robin */
 	struct tcf_proto *filter_list;
 	u8  prio2band[TC_PRIO_MAX+1];
 	struct Qdisc *queues[TCQ_PRIO_BANDS];
+	int mq;
 };
 
 
@@ -70,14 +72,17 @@ prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
 #endif
 			if (TC_H_MAJ(band))
 				band = 0;
-			return q->queues[q->prio2band[band&TC_PRIO_MAX]];
+			band = q->prio2band[band&TC_PRIO_MAX];
+			goto out;
 		}
 		band = res.classid;
 	}
 	band = TC_H_MIN(band) - 1;
 	if (band >= q->bands)
-		return q->queues[q->prio2band[0]];
-
+		band = q->prio2band[0];
+out:
+	if (q->mq)
+		skb_set_queue_mapping(skb, band);
 	return q->queues[band];
 }
 
@@ -144,17 +149,58 @@ prio_dequeue(struct Qdisc* sch)
 	struct Qdisc *qdisc;
 
 	for (prio = 0; prio < q->bands; prio++) {
-		qdisc = q->queues[prio];
-		skb = qdisc->dequeue(qdisc);
-		if (skb) {
-			sch->q.qlen--;
-			return skb;
+		/* Check if the target subqueue is available before
+		 * pulling an skb.  This way we avoid excessive requeues
+		 * for slower queues.
+		 */
+		if (!netif_subqueue_stopped(sch->dev, (q->mq ? prio : 0))) {
+			qdisc = q->queues[prio];
+			skb = qdisc->dequeue(qdisc);
+			if (skb) {
+				sch->q.qlen--;
+				return skb;
+			}
 		}
 	}
 	return NULL;
 
 }
 
+static struct sk_buff *rr_dequeue(struct Qdisc* sch)
+{
+	struct sk_buff *skb;
+	struct prio_sched_data *q = qdisc_priv(sch);
+	struct Qdisc *qdisc;
+	int bandcount;
+
+	/* Only take one pass through the queues.  If nothing is available,
+	 * return nothing.
+	 */
+	for (bandcount = 0; bandcount < q->bands; bandcount++) {
+		/* Check if the target subqueue is available before
+		 * pulling an skb.  This way we avoid excessive requeues
+		 * for slower queues.  If the queue is stopped, try the
+		 * next queue.
+		 */
+		if (!netif_subqueue_stopped(sch->dev,
+					    (q->mq ? q->curband : 0))) {
+			qdisc = q->queues[q->curband];
+			skb = qdisc->dequeue(qdisc);
+			if (skb) {
+				sch->q.qlen--;
+				q->curband++;
+				if (q->curband >= q->bands)
+					q->curband = 0;
+				return skb;
+			}
+		}
+		q->curband++;
+		if (q->curband >= q->bands)
+			q->curband = 0;
+	}
+	return NULL;
+}
+
 static unsigned int prio_drop(struct Qdisc* sch)
 {
 	struct prio_sched_data *q = qdisc_priv(sch);
@@ -198,21 +244,41 @@ prio_destroy(struct Qdisc* sch)
 static int prio_tune(struct Qdisc *sch, struct rtattr *opt)
 {
 	struct prio_sched_data *q = qdisc_priv(sch);
-	struct tc_prio_qopt *qopt = RTA_DATA(opt);
+	struct tc_prio_qopt *qopt;
+	struct rtattr *tb[TCA_PRIO_MAX];
 	int i;
 
-	if (opt->rta_len < RTA_LENGTH(sizeof(*qopt)))
+	if (rtattr_parse_nested_compat(tb, TCA_PRIO_MAX, opt, qopt,
+				       sizeof(*qopt)))
 		return -EINVAL;
-	if (qopt->bands > TCQ_PRIO_BANDS || qopt->bands < 2)
+	q->bands = qopt->bands;
+	/* If we're multiqueue, make sure the number of incoming bands
+	 * matches the number of queues on the device we're associating with.
+	 * If the number of bands requested is zero, then set q->bands to
+	 * dev->egress_subqueue_count.
+	 */
+	q->mq = RTA_GET_FLAG(tb[TCA_PRIO_MQ - 1]);
+	if (q->mq) {
+		if (sch->handle != TC_H_ROOT)
+			return -EINVAL;
+		if (netif_is_multiqueue(sch->dev)) {
+			if (q->bands == 0)
+				q->bands = sch->dev->egress_subqueue_count;
+			else if (q->bands != sch->dev->egress_subqueue_count)
+				return -EINVAL;
+		} else
+			return -EOPNOTSUPP;
+	}
+
+	if (q->bands > TCQ_PRIO_BANDS || q->bands < 2)
 		return -EINVAL;
 
 	for (i=0; i<=TC_PRIO_MAX; i++) {
-		if (qopt->priomap[i] >= qopt->bands)
+		if (qopt->priomap[i] >= q->bands)
 			return -EINVAL;
 	}
 
 	sch_tree_lock(sch);
-	q->bands = qopt->bands;
 	memcpy(q->prio2band, qopt->priomap, TC_PRIO_MAX+1);
 
 	for (i=q->bands; i<TCQ_PRIO_BANDS; i++) {
@@ -268,11 +334,17 @@ static int prio_dump(struct Qdisc *sch, struct sk_buff *skb)
 {
 	struct prio_sched_data *q = qdisc_priv(sch);
 	unsigned char *b = skb_tail_pointer(skb);
+	struct rtattr *nest;
 	struct tc_prio_qopt opt;
 
 	opt.bands = q->bands;
 	memcpy(&opt.priomap, q->prio2band, TC_PRIO_MAX+1);
-	RTA_PUT(skb, TCA_OPTIONS, sizeof(opt), &opt);
+
+	nest = RTA_NEST_COMPAT(skb, TCA_OPTIONS, sizeof(opt), &opt);
+	if (q->mq)
+		RTA_PUT_FLAG(skb, TCA_PRIO_MQ);
+	RTA_NEST_COMPAT_END(skb, nest);
+
 	return skb->len;
 
 rtattr_failure:
@@ -443,17 +515,44 @@ static struct Qdisc_ops prio_qdisc_ops = {
 	.owner		=	THIS_MODULE,
 };
 
+static struct Qdisc_ops rr_qdisc_ops = {
+	.next		=	NULL,
+	.cl_ops		=	&prio_class_ops,
+	.id		=	"rr",
+	.priv_size	=	sizeof(struct prio_sched_data),
+	.enqueue	=	prio_enqueue,
+	.dequeue	=	rr_dequeue,
+	.requeue	=	prio_requeue,
+	.drop		=	prio_drop,
+	.init		=	prio_init,
+	.reset		=	prio_reset,
+	.destroy	=	prio_destroy,
+	.change		=	prio_tune,
+	.dump		=	prio_dump,
+	.owner		=	THIS_MODULE,
+};
+
 static int __init prio_module_init(void)
 {
-	return register_qdisc(&prio_qdisc_ops);
+	int err;
+
+	err = register_qdisc(&prio_qdisc_ops);
+	if (err < 0)
+		return err;
+	err = register_qdisc(&rr_qdisc_ops);
+	if (err < 0)
+		unregister_qdisc(&prio_qdisc_ops);
+	return err;
 }
 
 static void __exit prio_module_exit(void)
 {
 	unregister_qdisc(&prio_qdisc_ops);
+	unregister_qdisc(&rr_qdisc_ops);
 }
 
 module_init(prio_module_init)
 module_exit(prio_module_exit)
 
 MODULE_LICENSE("GPL");
+MODULE_ALIAS("sch_rr");

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-28 19:24           ` Patrick McHardy
@ 2007-06-28 19:27             ` Waskiewicz Jr, Peter P
  2007-06-29  4:20             ` David Miller
  1 sibling, 0 replies; 60+ messages in thread
From: Waskiewicz Jr, Peter P @ 2007-06-28 19:27 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: davem, netdev, jeff, Kok, Auke-jan H, hadi

> Waskiewicz Jr, Peter P wrote:
> >>[...]
> >>The only reasonable thing it can do is not care about 
> multiqueue and 
> >>just dequeue as usual. In fact I think it should be an error to 
> >>configure multiqueue on a non-root qdisc.
> > 
> > 
> > Ack.  This is a thought process that trips me up from time 
> to time...I 
> > see child qdisc, and think that's the last qdisc to dequeue 
> and send 
> > to the device, not the first one to dequeue.  So please 
> disregard my 
> > comments before; I totally agree with you.  Great catch 
> here; I really 
> > like the prio_classify() cleanup.
> 
> 
> Thanks. This updated patch makes configuring a non-root qdisc 
> for multiqueue an error.
> 

The patch looks fine to me.  Thanks, Patrick.

-PJ

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API
  2007-06-28 19:20           ` Patrick McHardy
@ 2007-06-28 19:32             ` Jeff Garzik
  2007-06-28 19:37               ` Patrick McHardy
  2007-06-28 20:39               ` David Miller
  0 siblings, 2 replies; 60+ messages in thread
From: Jeff Garzik @ 2007-06-28 19:32 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Waskiewicz Jr, Peter P, davem, netdev, Kok, Auke-jan H, hadi

Patrick McHardy wrote:
> Yes, but there are users that don't go through qdiscs, like netpoll,
> Having them check the QDISC_RUNNING bit seems ugly.

Is netpoll the only such user?

netpoll tends to be a special case in every sense of the word, and I 
wish it was less so :/

	Jeff



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API
  2007-06-28 19:32             ` Jeff Garzik
@ 2007-06-28 19:37               ` Patrick McHardy
  2007-06-28 21:11                 ` Waskiewicz Jr, Peter P
  2007-06-28 20:39               ` David Miller
  1 sibling, 1 reply; 60+ messages in thread
From: Patrick McHardy @ 2007-06-28 19:37 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Waskiewicz Jr, Peter P, davem, netdev, Kok, Auke-jan H, hadi

Jeff Garzik wrote:
> Patrick McHardy wrote:
> 
>> Yes, but there are users that don't go through qdiscs, like netpoll,
>> Having them check the QDISC_RUNNING bit seems ugly.
> 
> 
> Is netpoll the only such user?

I'm not sure; I just remembered that one :)

Looking at Peter's multiqueue patch, which should include all
hard_start_xmit users (I'm not seeing sch_teql though, Peter?),
the only other one is pktgen.

> netpoll tends to be a special case in every sense of the word, and I
> wish it was less so :/

Indeed.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API
  2007-06-28 19:32             ` Jeff Garzik
  2007-06-28 19:37               ` Patrick McHardy
@ 2007-06-28 20:39               ` David Miller
  1 sibling, 0 replies; 60+ messages in thread
From: David Miller @ 2007-06-28 20:39 UTC (permalink / raw)
  To: jeff; +Cc: kaber, peter.p.waskiewicz.jr, netdev, auke-jan.h.kok, hadi

From: Jeff Garzik <jeff@garzik.org>
Date: Thu, 28 Jun 2007 15:32:40 -0400

> Patrick McHardy wrote:
> > Yes, but there are users that don't go through qdiscs, like netpoll,
> > Having them check the QDISC_RUNNING bit seems ugly.
> 
> Is netpoll the only such user?
> 
> netpoll tends to be a special case in every sense of the word, and I 
> wish it was less so :/

Seconded.

I'm perfectly happy to consider rearchitecting netpoll into something
that works better.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* RE: [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API
  2007-06-28 19:37               ` Patrick McHardy
@ 2007-06-28 21:11                 ` Waskiewicz Jr, Peter P
  2007-06-28 21:18                   ` Patrick McHardy
  0 siblings, 1 reply; 60+ messages in thread
From: Waskiewicz Jr, Peter P @ 2007-06-28 21:11 UTC (permalink / raw)
  To: Patrick McHardy, Jeff Garzik; +Cc: davem, netdev, Kok, Auke-jan H, hadi

> -----Original Message-----
> From: Patrick McHardy [mailto:kaber@trash.net] 
> Sent: Thursday, June 28, 2007 12:37 PM
> To: Jeff Garzik
> Cc: Waskiewicz Jr, Peter P; davem@davemloft.net; 
> netdev@vger.kernel.org; Kok, Auke-jan H; hadi@cyberus.ca
> Subject: Re: [PATCH 2/3] NET: [CORE] Stack changes to add 
> multiqueue hardware support API
> 
> Jeff Garzik wrote:
> > Patrick McHardy wrote:
> > 
> >> Yes, but there are users that don't go through qdiscs, 
> like netpoll, 
> >> Having them check the QDISC_RUNNING bit seems ugly.
> > 
> > 
> > Is netpoll the only such user?
> 
> I'm not sure, I just remembered that one :)
> 
> Looking at Peter's multiqueue patch, which should include all 
> hard_start_xmit users (I'm not seeing sch_teql though, 
> Peter?) the only other one is pktgen.

Ugh.  That is another netif_queue_stopped() that needs
netif_subqueue_stopped().  I can send an updated patch for the core to
fix this based on your patches, Patrick.

> 
> > netpoll tends to be a special case in every sense of the 
> word, and I 
> > wish it was less so :/
> 
> Indeed.

So what do we do about netpoll then wrt netif_(sub)queue_stopped() being
removed from qdisc_restart()?  The fallout of having netpoll() cause a
queue to stop (queue 0 only) is that the skb sent will be requeued, since
the driver will return NETDEV_TX_BUSY if this actually happens.  But this is
a corner case, and we won't lose packets; we'll just have increased
latency on that queue.  Should I worry about this or just move forward
with the sch_teql.c change and repost the core patch?

Thanks,
-PJ Waskiewicz

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API
  2007-06-28 21:11                 ` Waskiewicz Jr, Peter P
@ 2007-06-28 21:18                   ` Patrick McHardy
  2007-06-28 23:08                     ` Waskiewicz Jr, Peter P
  0 siblings, 1 reply; 60+ messages in thread
From: Patrick McHardy @ 2007-06-28 21:18 UTC (permalink / raw)
  To: Waskiewicz Jr, Peter P; +Cc: Jeff Garzik, davem, netdev, Kok, Auke-jan H, hadi

[-- Attachment #1: Type: text/plain, Size: 1048 bytes --]

Waskiewicz Jr, Peter P wrote:
>>
>> Looking at Peter's multiqueue patch, which should include all 
>> hard_start_xmit users (I'm not seeing sch_teql though, 
>> Peter?) the only other one is pktgen.
>>     
>
> Ugh.  That is another netif_queue_stopped() that needs
> netif_subqueue_stopped().  I can send an updated patch for the core to
> fix this based from your patches Patrick.
>   

I still have the tree around; here's an updated version.

>
> So what do we do about netpoll then wrt netif_(sub)queue_stopped() being
> removed from qdisc_restart()?  The fallout of having netpoll() cause a
> queue to stop (queue 0 only) is the skb sent will be requeued, since the
> driver will return NETIF_TX_BUSY if this actually happens.  But this is
> a corner case, and we won't lose packets; we'll just have increased
> latency on that queue.  Should I worry about this or just move forward
> with the sch_teql.c change and repost the core patch?
>   


I don't think you need to worry about that; the subqueue
patch just follows the existing code.
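
For reference, a minimal sketch of how a driver adopts the attached API
(EXAMPLE_TX_QUEUES and the example_* names are hypothetical, with no
error handling; this is not part of the patch):

	#define EXAMPLE_TX_QUEUES 4

	struct example_priv {
		/* per-ring driver state would live here */
		unsigned int tx_rings;
	};

	static struct net_device *example_alloc(void)
	{
		struct net_device *dev;

		/* One net_device_subqueue is allocated per TX ring. */
		dev = alloc_etherdev_mq(sizeof(struct example_priv),
					EXAMPLE_TX_QUEUES);
		if (!dev)
			return NULL;
		/* Advertise the feature so netif_is_multiqueue() is true. */
		dev->features |= NETIF_F_MULTI_QUEUE;
		return dev;
	}

The driver then calls netif_stop_subqueue(dev, ring) when an individual
ring fills up and netif_wake_subqueue(dev, ring) from its TX-completion
path; the stack checks netif_subqueue_stopped() before transmitting, as
the hunks below show.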


[-- Attachment #2: 01.diff --]
[-- Type: text/x-diff, Size: 16275 bytes --]

[CORE] Stack changes to add multiqueue hardware support API

Add the multiqueue hardware device support API to the core network
stack.  Allow drivers to allocate multiple queues and manage them
at the netdev level if they choose to do so.

Added a new field to sk_buff, namely queue_mapping, for drivers to
know which tx_ring to select based on OS classification of the flow.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>

---
commit 8658c76aeb209c91598d275b99d602bf5aeccb63
tree 915d315cdcff46d1f9aadb785a0173d318452122
parent 0552c565358330c59913f2b512355f196e31bd74
author Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> Thu, 28 Jun 2007 18:41:00 +0200
committer Patrick McHardy <kaber@trash.net> Thu, 28 Jun 2007 23:17:13 +0200

 drivers/net/Kconfig         |    8 +++++
 include/linux/etherdevice.h |    3 +-
 include/linux/netdevice.h   |   76 ++++++++++++++++++++++++++++++++++++++++++-
 include/linux/skbuff.h      |   25 ++++++++++++--
 net/core/dev.c              |   27 +++++++++++----
 net/core/netpoll.c          |    8 +++--
 net/core/pktgen.c           |   10 ++++--
 net/core/skbuff.c           |    3 ++
 net/ethernet/eth.c          |    9 +++--
 net/sched/sch_teql.c        |    6 +++
 10 files changed, 150 insertions(+), 25 deletions(-)

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index e5549dc..8bce4fb 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -28,6 +28,14 @@ config NETDEVICES
 # that for each of the symbols.
 if NETDEVICES
 
+config NETDEVICES_MULTIQUEUE
+	bool "Netdevice multiple hardware queue support"
+	---help---
+	  Say Y here if you want to allow the network stack to use multiple
+	  hardware TX queues on an ethernet device.
+
+	  Most people will say N here.
+
 config IFB
 	tristate "Intermediate Functional Block support"
 	depends on NET_CLS_ACT
diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h
index f48eb89..6cdb973 100644
--- a/include/linux/etherdevice.h
+++ b/include/linux/etherdevice.h
@@ -39,7 +39,8 @@ extern void		eth_header_cache_update(struct hh_cache *hh, struct net_device *dev
 extern int		eth_header_cache(struct neighbour *neigh,
 					 struct hh_cache *hh);
 
-extern struct net_device *alloc_etherdev(int sizeof_priv);
+extern struct net_device *alloc_etherdev_mq(int sizeof_priv, unsigned int queue_count);
+#define alloc_etherdev(sizeof_priv) alloc_etherdev_mq(sizeof_priv, 1)
 
 /**
  * is_zero_ether_addr - Determine if give Ethernet address is all zeros.
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2c0cc19..1b43e15 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -108,6 +108,14 @@ struct wireless_dev;
 #define MAX_HEADER (LL_MAX_HEADER + 48)
 #endif
 
+struct net_device_subqueue
+{
+	/* Give a control state for each queue.  This struct may contain
+	 * per-queue locks in the future.
+	 */
+	unsigned long   state;
+};
+
 /*
  *	Network device statistics. Akin to the 2.0 ether stats but
  *	with byte counters.
@@ -331,6 +339,7 @@ struct net_device
 #define NETIF_F_VLAN_CHALLENGED	1024	/* Device cannot handle VLAN packets */
 #define NETIF_F_GSO		2048	/* Enable software GSO. */
 #define NETIF_F_LLTX		4096	/* LockLess TX */
+#define NETIF_F_MULTI_QUEUE	16384	/* Has multiple TX/RX queues */
 
 	/* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT	16
@@ -557,6 +566,10 @@ struct net_device
 
 	/* rtnetlink link ops */
 	const struct rtnl_link_ops *rtnl_link_ops;
+
+	/* The TX queue control structures */
+	unsigned int			egress_subqueue_count;
+	struct net_device_subqueue	egress_subqueue[0];
 };
 #define to_net_dev(d) container_of(d, struct net_device, dev)
 
@@ -719,6 +732,62 @@ static inline int netif_running(const struct net_device *dev)
 	return test_bit(__LINK_STATE_START, &dev->state);
 }
 
+/*
+ * Routines to manage the subqueues on a device.  We only need start,
+ * stop, and a check if it's stopped.  All other device management is
+ * done at the overall netdevice level.
+ * Also test whether the device is multiqueue.
+ */
+static inline void netif_start_subqueue(struct net_device *dev, u16 queue_index)
+{
+#ifdef CONFIG_NETDEVICES_MULTIQUEUE
+	clear_bit(__LINK_STATE_XOFF, &dev->egress_subqueue[queue_index].state);
+#endif
+}
+
+static inline void netif_stop_subqueue(struct net_device *dev, u16 queue_index)
+{
+#ifdef CONFIG_NETDEVICES_MULTIQUEUE
+#ifdef CONFIG_NETPOLL_TRAP
+	if (netpoll_trap())
+		return;
+#endif
+	set_bit(__LINK_STATE_XOFF, &dev->egress_subqueue[queue_index].state);
+#endif
+}
+
+static inline int netif_subqueue_stopped(const struct net_device *dev,
+					 u16 queue_index)
+{
+#ifdef CONFIG_NETDEVICES_MULTIQUEUE
+	return test_bit(__LINK_STATE_XOFF,
+			&dev->egress_subqueue[queue_index].state);
+#else
+	return 0;
+#endif
+}
+
+static inline void netif_wake_subqueue(struct net_device *dev, u16 queue_index)
+{
+#ifdef CONFIG_NETDEVICES_MULTIQUEUE
+#ifdef CONFIG_NETPOLL_TRAP
+	if (netpoll_trap())
+		return;
+#endif
+	if (test_and_clear_bit(__LINK_STATE_XOFF,
+			       &dev->egress_subqueue[queue_index].state))
+		__netif_schedule(dev);
+#endif
+}
+
+static inline int netif_is_multiqueue(const struct net_device *dev)
+{
+#ifdef CONFIG_NETDEVICES_MULTIQUEUE
+	return (!!(NETIF_F_MULTI_QUEUE & dev->features));
+#else
+	return 0;
+#endif
+}
 
 /* Use this variant when it is known for sure that it
  * is executing from interrupt context.
@@ -1009,8 +1078,11 @@ static inline void netif_tx_disable(struct net_device *dev)
 extern void		ether_setup(struct net_device *dev);
 
 /* Support for loadable net-drivers */
-extern struct net_device *alloc_netdev(int sizeof_priv, const char *name,
-				       void (*setup)(struct net_device *));
+extern struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
+				       void (*setup)(struct net_device *),
+				       unsigned int queue_count);
+#define alloc_netdev(sizeof_priv, name, setup) \
+	alloc_netdev_mq(sizeof_priv, name, setup, 1)
 extern int		register_netdev(struct net_device *dev);
 extern void		unregister_netdev(struct net_device *dev);
 /* Functions used for secondary unicast and multicast support */
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index b7b2628..e3851bb 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -196,7 +196,6 @@ typedef unsigned char *sk_buff_data_t;
  *	@sk: Socket we are owned by
  *	@tstamp: Time we arrived
  *	@dev: Device we arrived on/are leaving by
- *	@iif: ifindex of device we arrived on
  *	@transport_header: Transport layer header
  *	@network_header: Network layer header
  *	@mac_header: Link layer header
@@ -231,6 +230,8 @@ typedef unsigned char *sk_buff_data_t;
  *	@nfctinfo: Relationship of this skb to the connection
  *	@nfct_reasm: netfilter conntrack re-assembly pointer
  *	@nf_bridge: Saved data about a bridged frame - see br_netfilter.c
+ *	@iif: ifindex of device we arrived on
+ *	@queue_mapping: Queue mapping for multiqueue devices
  *	@tc_index: Traffic control index
  *	@tc_verd: traffic control verdict
  *	@dma_cookie: a cookie to one of several possible DMA operations
@@ -246,8 +247,6 @@ struct sk_buff {
 	struct sock		*sk;
 	ktime_t			tstamp;
 	struct net_device	*dev;
-	int			iif;
-	/* 4 byte hole on 64 bit*/
 
 	struct  dst_entry	*dst;
 	struct	sec_path	*sp;
@@ -290,12 +289,18 @@ struct sk_buff {
 #ifdef CONFIG_BRIDGE_NETFILTER
 	struct nf_bridge_info	*nf_bridge;
 #endif
+
+	int			iif;
+	__u16			queue_mapping;
+
 #ifdef CONFIG_NET_SCHED
 	__u16			tc_index;	/* traffic control index */
 #ifdef CONFIG_NET_CLS_ACT
 	__u16			tc_verd;	/* traffic control verdict */
 #endif
 #endif
+	/* 2 byte hole */
+
 #ifdef CONFIG_NET_DMA
 	dma_cookie_t		dma_cookie;
 #endif
@@ -1721,6 +1726,20 @@ static inline void skb_init_secmark(struct sk_buff *skb)
 { }
 #endif
 
+static inline void skb_set_queue_mapping(struct sk_buff *skb, u16 queue_mapping)
+{
+#ifdef CONFIG_NETDEVICES_MULTIQUEUE
+	skb->queue_mapping = queue_mapping;
+#endif
+}
+
+static inline void skb_copy_queue_mapping(struct sk_buff *to, const struct sk_buff *from)
+{
+#ifdef CONFIG_NETDEVICES_MULTIQUEUE
+	to->queue_mapping = from->queue_mapping;
+#endif
+}
+
 static inline int skb_is_gso(const struct sk_buff *skb)
 {
 	return skb_shinfo(skb)->gso_size;
diff --git a/net/core/dev.c b/net/core/dev.c
index 778e102..e94991a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1429,7 +1429,9 @@ gso:
 			skb->next = nskb;
 			return rc;
 		}
-		if (unlikely(netif_queue_stopped(dev) && skb->next))
+		if (unlikely((netif_queue_stopped(dev) ||
+			     netif_subqueue_stopped(dev, skb->queue_mapping)) &&
+			     skb->next))
 			return NETDEV_TX_BUSY;
 	} while (skb->next);
 
@@ -1547,6 +1549,8 @@ gso:
 		spin_lock(&dev->queue_lock);
 		q = dev->qdisc;
 		if (q->enqueue) {
+			/* reset queue_mapping to zero */
+			skb->queue_mapping = 0;
 			rc = q->enqueue(skb, q);
 			qdisc_run(dev);
 			spin_unlock(&dev->queue_lock);
@@ -1576,7 +1580,8 @@ gso:
 
 			HARD_TX_LOCK(dev, cpu);
 
-			if (!netif_queue_stopped(dev)) {
+			if (!netif_queue_stopped(dev) &&
+			    !netif_subqueue_stopped(dev, skb->queue_mapping)) {
 				rc = 0;
 				if (!dev_hard_start_xmit(skb, dev)) {
 					HARD_TX_UNLOCK(dev);
@@ -3539,16 +3544,18 @@ static struct net_device_stats *internal_stats(struct net_device *dev)
 }
 
 /**
- *	alloc_netdev - allocate network device
+ *	alloc_netdev_mq - allocate network device
  *	@sizeof_priv:	size of private data to allocate space for
  *	@name:		device name format string
  *	@setup:		callback to initialize device
+ *	@queue_count:	the number of subqueues to allocate
  *
  *	Allocates a struct net_device with private data area for driver use
- *	and performs basic initialization.
+ *	and performs basic initialization.  Also allocates subqueue structs
+ *	for each queue on the device at the end of the netdevice.
  */
-struct net_device *alloc_netdev(int sizeof_priv, const char *name,
-		void (*setup)(struct net_device *))
+struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
+		void (*setup)(struct net_device *), unsigned int queue_count)
 {
 	void *p;
 	struct net_device *dev;
@@ -3557,7 +3564,9 @@ struct net_device *alloc_netdev(int sizeof_priv, const char *name,
 	BUG_ON(strlen(name) >= sizeof(dev->name));
 
 	/* ensure 32-byte alignment of both the device and private area */
-	alloc_size = (sizeof(*dev) + NETDEV_ALIGN_CONST) & ~NETDEV_ALIGN_CONST;
+	alloc_size = (sizeof(*dev) + NETDEV_ALIGN_CONST +
+		     (sizeof(struct net_device_subqueue) * queue_count)) &
+		     ~NETDEV_ALIGN_CONST;
 	alloc_size += sizeof_priv + NETDEV_ALIGN_CONST;
 
 	p = kzalloc(alloc_size, GFP_KERNEL);
@@ -3573,12 +3582,14 @@ struct net_device *alloc_netdev(int sizeof_priv, const char *name,
 	if (sizeof_priv)
 		dev->priv = netdev_priv(dev);
 
+	dev->egress_subqueue_count = queue_count;
+
 	dev->get_stats = internal_stats;
 	setup(dev);
 	strcpy(dev->name, name);
 	return dev;
 }
-EXPORT_SYMBOL(alloc_netdev);
+EXPORT_SYMBOL(alloc_netdev_mq);
 
 /**
  *	free_netdev - free network device
diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 758dafe..2ace33d 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -66,8 +66,9 @@ static void queue_process(struct work_struct *work)
 
 		local_irq_save(flags);
 		netif_tx_lock(dev);
-		if (netif_queue_stopped(dev) ||
-		    dev->hard_start_xmit(skb, dev) != NETDEV_TX_OK) {
+		if ((netif_queue_stopped(dev) ||
+		     netif_subqueue_stopped(dev, skb->queue_mapping)) ||
+		     dev->hard_start_xmit(skb, dev) != NETDEV_TX_OK) {
 			skb_queue_head(&npinfo->txq, skb);
 			netif_tx_unlock(dev);
 			local_irq_restore(flags);
@@ -254,7 +255,8 @@ static void netpoll_send_skb(struct netpoll *np, struct sk_buff *skb)
 			/* try until next clock tick */
 			for (tries = jiffies_to_usecs(1)/USEC_PER_POLL;
 					tries > 0; --tries) {
-				if (!netif_queue_stopped(dev))
+				if (!netif_queue_stopped(dev) &&
+				    !netif_subqueue_stopped(dev, skb->queue_mapping))
 					status = dev->hard_start_xmit(skb, dev);
 
 				if (status == NETDEV_TX_OK)
diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index 9cd3a1c..dffe067 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -3139,7 +3139,9 @@ static __inline__ void pktgen_xmit(struct pktgen_dev *pkt_dev)
 		}
 	}
 
-	if (netif_queue_stopped(odev) || need_resched()) {
+	if ((netif_queue_stopped(odev) ||
+	     netif_subqueue_stopped(odev, pkt_dev->skb->queue_mapping)) ||
+	     need_resched()) {
 		idle_start = getCurUs();
 
 		if (!netif_running(odev)) {
@@ -3154,7 +3156,8 @@ static __inline__ void pktgen_xmit(struct pktgen_dev *pkt_dev)
 
 		pkt_dev->idle_acc += getCurUs() - idle_start;
 
-		if (netif_queue_stopped(odev)) {
+		if (netif_queue_stopped(odev) ||
+		    netif_subqueue_stopped(odev, pkt_dev->skb->queue_mapping)) {
 			pkt_dev->next_tx_us = getCurUs();	/* TODO */
 			pkt_dev->next_tx_ns = 0;
 			goto out;	/* Try the next interface */
@@ -3181,7 +3184,8 @@ static __inline__ void pktgen_xmit(struct pktgen_dev *pkt_dev)
 	}
 
 	netif_tx_lock_bh(odev);
-	if (!netif_queue_stopped(odev)) {
+	if (!netif_queue_stopped(odev) &&
+	    !netif_subqueue_stopped(odev, pkt_dev->skb->queue_mapping)) {
 
 		atomic_inc(&(pkt_dev->skb->users));
 	      retry_now:
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 8d8e8fc..af03556 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -419,6 +419,7 @@ struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
 	n->nohdr = 0;
 	C(pkt_type);
 	C(ip_summed);
+	skb_copy_queue_mapping(n, skb);
 	C(priority);
 #if defined(CONFIG_IP_VS) || defined(CONFIG_IP_VS_MODULE)
 	C(ipvs_property);
@@ -460,6 +461,7 @@ static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 #endif
 	new->sk		= NULL;
 	new->dev	= old->dev;
+	skb_copy_queue_mapping(new, old);
 	new->priority	= old->priority;
 	new->protocol	= old->protocol;
 	new->dst	= dst_clone(old->dst);
@@ -1927,6 +1929,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, int features)
 		tail = nskb;
 
 		nskb->dev = skb->dev;
+		skb_copy_queue_mapping(nskb, skb);
 		nskb->priority = skb->priority;
 		nskb->protocol = skb->protocol;
 		nskb->dst = dst_clone(skb->dst);
diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
index 0ac2524..1387e54 100644
--- a/net/ethernet/eth.c
+++ b/net/ethernet/eth.c
@@ -316,9 +316,10 @@ void ether_setup(struct net_device *dev)
 EXPORT_SYMBOL(ether_setup);
 
 /**
- * alloc_etherdev - Allocates and sets up an Ethernet device
+ * alloc_etherdev_mq - Allocates and sets up an Ethernet device
  * @sizeof_priv: Size of additional driver-private structure to be allocated
  *	for this Ethernet device
+ * @queue_count: The number of queues this device has.
  *
  * Fill in the fields of the device structure with Ethernet-generic
  * values. Basically does everything except registering the device.
@@ -328,8 +329,8 @@ EXPORT_SYMBOL(ether_setup);
  * this private data area.
  */
 
-struct net_device *alloc_etherdev(int sizeof_priv)
+struct net_device *alloc_etherdev_mq(int sizeof_priv, unsigned int queue_count)
 {
-	return alloc_netdev(sizeof_priv, "eth%d", ether_setup);
+	return alloc_netdev_mq(sizeof_priv, "eth%d", ether_setup, queue_count);
 }
-EXPORT_SYMBOL(alloc_etherdev);
+EXPORT_SYMBOL(alloc_etherdev_mq);
diff --git a/net/sched/sch_teql.c b/net/sched/sch_teql.c
index f05ad9a..dfe7e45 100644
--- a/net/sched/sch_teql.c
+++ b/net/sched/sch_teql.c
@@ -277,6 +277,7 @@ static int teql_master_xmit(struct sk_buff *skb, struct net_device *dev)
 	int busy;
 	int nores;
 	int len = skb->len;
+	int subq = skb->queue_mapping;
 	struct sk_buff *skb_res = NULL;
 
 	start = master->slaves;
@@ -293,7 +294,9 @@ restart:
 
 		if (slave->qdisc_sleeping != q)
 			continue;
-		if (netif_queue_stopped(slave) || ! netif_running(slave)) {
+		if (netif_queue_stopped(slave) ||
+		    netif_subqueue_stopped(slave, subq) ||
+		    !netif_running(slave)) {
 			busy = 1;
 			continue;
 		}
@@ -302,6 +305,7 @@ restart:
 		case 0:
 			if (netif_tx_trylock(slave)) {
 				if (!netif_queue_stopped(slave) &&
+				    !netif_subqueue_stopped(slave, subq) &&
 				    slave->hard_start_xmit(skb, slave) == 0) {
 					netif_tx_unlock(slave);
 					master->slaves = NEXT_SLAVE(q);
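
As a usage illustration, here is a hypothetical driver fragment (not part
of this patch; all my_* names are invented, and it assumes the
netif_stop_subqueue()/netif_wake_subqueue() helpers added elsewhere in
this patchset):

#include <linux/netdevice.h>
#include <linux/etherdevice.h>

#define MY_TX_QUEUES	4

struct my_priv {
	/* driver state would live here */
	int dummy;
};

/* Illustrative helpers only -- no such functions exist. */
static void my_post_to_ring(struct net_device *dev, struct sk_buff *skb, u16 q);
static int my_tx_ring_full(struct net_device *dev, u16 q);

/* Sketch of an xmit routine that honours skb->queue_mapping. */
static int my_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	u16 q = skb->queue_mapping;	/* chosen by sch_prio/sch_rr */

	my_post_to_ring(dev, skb, q);
	if (my_tx_ring_full(dev, q))
		netif_stop_subqueue(dev, q);

	return NETDEV_TX_OK;
}

static struct net_device *my_alloc(void)
{
	/* One netdev carrying MY_TX_QUEUES egress subqueues. */
	return alloc_etherdev_mq(sizeof(struct my_priv), MY_TX_QUEUES);
}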

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* RE: [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API
  2007-06-28 21:18                   ` Patrick McHardy
@ 2007-06-28 23:08                     ` Waskiewicz Jr, Peter P
  2007-06-28 23:31                       ` David Miller
  0 siblings, 1 reply; 60+ messages in thread
From: Waskiewicz Jr, Peter P @ 2007-06-28 23:08 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Jeff Garzik, davem, netdev, Kok, Auke-jan H, hadi

> Waskiewicz Jr, Peter P wrote:
> >>
> >> Looking at Peter's multiqueue patch, which should include all 
> >> hard_start_xmit users (I'm not seeing sch_teql though,
> >> Peter?) the only other one is pktgen.
> >>     
> >
> > Ugh.  That is another netif_queue_stopped() that needs
> > netif_subqueue_stopped().  I can send an updated patch for the core
> > to fix this based on your patches, Patrick.
> >   
> 
> I still have the tree around, here's an updated version.
> 
> >
> > So what do we do about netpoll then wrt netif_(sub)queue_stopped()
> > being removed from qdisc_restart()?  The fallout of having netpoll()
> > cause a queue to stop (queue 0 only) is the skb sent will be requeued,
> > since the driver will return NETIF_TX_BUSY if this actually happens.
> > But this is a corner case, and we won't lose packets; we'll just have
> > increased latency on that queue.  Should I worry about this or just
> > move forward with the sch_teql.c change and repost the core patch?
> >   
> 
> 
> I don't think you need to worry about that, the subqueue 
> patch just follows the existing code.

Thanks Patrick for taking care of this.  I am totally fine with this
patch; if anyone else has feedback, please send it.  If not, I'm excited
to see if these can be considered for 2.6.23 now.  :)  Thanks everyone
for the help.

Cheers,
-PJ Waskiewicz

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API
  2007-06-28 23:08                     ` Waskiewicz Jr, Peter P
@ 2007-06-28 23:31                       ` David Miller
  0 siblings, 0 replies; 60+ messages in thread
From: David Miller @ 2007-06-28 23:31 UTC (permalink / raw)
  To: peter.p.waskiewicz.jr; +Cc: kaber, jeff, netdev, auke-jan.h.kok, hadi

From: "Waskiewicz Jr, Peter P" <peter.p.waskiewicz.jr@intel.com>
Date: Thu, 28 Jun 2007 16:08:43 -0700

> Thanks Patrick for taking care of this.  I am totally fine with this
> patch; if anyone else has feedback, please send it.  If not, I'm excited
> to see if these can be considered for 2.6.23 now.  :)  Thanks everyone
> for the help.

I'll look over the current patches later this evening, I was
initially waiting for the GSO BUG() akpm reported to get fixed
and Herbert took care of that an hour or so ago.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API
  2007-06-28 16:21 ` [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API PJ Waskiewicz
  2007-06-28 16:31   ` Patrick McHardy
  2007-06-28 17:00   ` Patrick McHardy
@ 2007-06-29  3:39   ` David Miller
  2007-06-29 10:54     ` Jeff Garzik
  2 siblings, 1 reply; 60+ messages in thread
From: David Miller @ 2007-06-29  3:39 UTC (permalink / raw)
  To: peter.p.waskiewicz.jr; +Cc: netdev, jeff, auke-jan.h.kok, hadi, kaber

From: PJ Waskiewicz <peter.p.waskiewicz.jr@intel.com>
Date: Thu, 28 Jun 2007 09:21:13 -0700

> -struct net_device *alloc_netdev(int sizeof_priv, const char *name,
> -		void (*setup)(struct net_device *))
> +struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
> +		void (*setup)(struct net_device *), int queue_count)
>  {
>  	void *p;
>  	struct net_device *dev;
> @@ -3557,7 +3564,9 @@ struct net_device *alloc_netdev(int sizeof_priv, const char *name,
>  	BUG_ON(strlen(name) >= sizeof(dev->name));
>  
>  	/* ensure 32-byte alignment of both the device and private area */
> -	alloc_size = (sizeof(*dev) + NETDEV_ALIGN_CONST) & ~NETDEV_ALIGN_CONST;
> +	alloc_size = (sizeof(*dev) + NETDEV_ALIGN_CONST +
> +		     (sizeof(struct net_device_subqueue) * queue_count)) &
> +		     ~NETDEV_ALIGN_CONST;
>  	alloc_size += sizeof_priv + NETDEV_ALIGN_CONST;
>  
>  	p = kzalloc(alloc_size, GFP_KERNEL);
> @@ -3573,12 +3582,14 @@ struct net_device *alloc_netdev(int sizeof_priv, const char *name,
>  	if (sizeof_priv)
>  		dev->priv = netdev_priv(dev);
>  
> +	dev->egress_subqueue_count = queue_count;
> +
>  	dev->get_stats = internal_stats;
>  	setup(dev);
>  	strcpy(dev->name, name);
>  	return dev;
>  }

This isn't going to work.

The pointer returned from netdev_priv() doesn't take into account the
variable sized queues at the end of struct netdev, so we can stomp
over the queues with the private area.

This probably works by luck because of NETDEV_ALIGN.

The simplest fix is to just make netdev_priv() use dev->priv,
except when it's being initialized during allocation, and
that's what I'm going to do when I apply your patch.
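
To make the layout issue concrete, a rough sketch (simplified, not the
exact mainline definitions):

/* Layout of the single allocation made by alloc_netdev_mq():
 *
 *   [ struct net_device | subqueue[0..queue_count-1] | pad | priv ]
 *
 * The offset-based netdev_priv() knows nothing about the subqueue
 * array, so it points into it: */
static inline void *netdev_priv_offset_based(struct net_device *dev)
{
	return (char *)dev + ((sizeof(struct net_device) +
			       NETDEV_ALIGN_CONST) & ~NETDEV_ALIGN_CONST);
}

/* The fix: return the pointer stored at allocation time, which was
 * computed past the subqueue array. */
static inline void *netdev_priv_fixed(struct net_device *dev)
{
	return dev->priv;
}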

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-28 19:24           ` Patrick McHardy
  2007-06-28 19:27             ` Waskiewicz Jr, Peter P
@ 2007-06-29  4:20             ` David Miller
  2007-06-29  8:45               ` Waskiewicz Jr, Peter P
                                 ` (2 more replies)
  1 sibling, 3 replies; 60+ messages in thread
From: David Miller @ 2007-06-29  4:20 UTC (permalink / raw)
  To: kaber; +Cc: peter.p.waskiewicz.jr, netdev, jeff, auke-jan.h.kok, hadi

From: Patrick McHardy <kaber@trash.net>
Date: Thu, 28 Jun 2007 21:24:37 +0200

> Waskiewicz Jr, Peter P wrote:
> >>[...]
> >>The only reasonable thing it can do is not care about 
> >>multiqueue and just dequeue as usual. In fact I think it 
> >>should be an error to configure multiqueue on a non-root qdisc.
> > 
> > 
> > Ack.  This is a thought process that trips me up from time to time...I
> > see child qdisc, and think that's the last qdisc to dequeue and send to
> > the device, not the first one to dequeue.  So please disregard my
> > comments before; I totally agree with you.  Great catch here; I really
> > like the prio_classify() cleanup.
> 
> 
> Thanks. This updated patch makes configuring a non-root qdisc for
> multiqueue an error.

Ok everything is checked into net-2.6.23, thanks everyone.

Now I get to pose a problem for everyone: prove to me how useful
this new code is by showing me how it can be used to solve a
recurring problem in virtualized network drivers of which I've
had to code one up recently, see my most recent blog entry at:

	http://vger.kernel.org/~davem/cgi-bin/blog.cgi/index.html

Anyways the gist of the issue is (and this happens for Sun LDOMS
networking, lguest, IBM iSeries, etc.) that we have a single
virtualized network device.  There is a "port" to the control
node (which switches packets to the real network for the guest)
and one "port" to each of the other guests.

Each guest gets a unique MAC address.  There is a queue per-port
that can fill up.

What all the drivers like this do right now is stop the queue if
any of the per-port queues fill up, and that's what my sunvnet
driver does right now as well.  We can only thus wake up the
queue when all of the ports have some space.

The ports (and thus the queues) are selected by destination
MAC address.  Each port has a remote MAC address, if there
is an exact match with a port's remote MAC we'd use that port
and thus that port's queue.  If there is no exact match
(some other node on the real network, broadcast, multicast,
etc.) we want to use the control node's port and port queue.

So the problem to solve is to make a way for drivers to do the queue
selection before the generic queueing layer starts to try and push
things to the driver.  Perhaps a classifier in the driver or similar.
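
As a rough sketch of that idea, built on the tx_port_find() helper in
the driver below (which already falls back to the control node's port
when no remote MAC matches exactly) -- the hook itself and the per-port
tx_queue field are hypothetical; that is precisely the missing piece:

static u16 vnet_pick_tx_queue(struct net_device *dev, struct sk_buff *skb)
{
	struct vnet *vp = netdev_priv(dev);
	struct vnet_port *port = tx_port_find(vp, skb);
	u16 q = port ? port->tx_queue : 0;	/* tx_queue is invented */

	/* skb_set_queue_mapping() is from the multiqueue patches */
	skb_set_queue_mapping(skb, q);
	return q;
}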

The solution to this problem generalizes to the other facility
we want now, hashing the transmit queue by smp_processor_id()
or similar.  With that in place we can look at doing the TX locking
per-queue too as is hinted at by the comments above the per-queue
structure in the current net-2.6.23 tree.

My current work-in-progress sunvnet.c driver is included below so
we can discuss things concretely with code.

I'm listening. :-)

-------------------- sunvnet.h --------------------
#ifndef _SUNVNET_H
#define _SUNVNET_H

#define DESC_NCOOKIES(entry_size)	\
	((entry_size) - sizeof(struct vio_net_desc))

/* length of time before we decide the hardware is borked,
 * and dev->tx_timeout() should be called to fix the problem
 */
#define VNET_TX_TIMEOUT			(5 * HZ)

#define VNET_TX_RING_SIZE		512
#define VNET_TX_WAKEUP_THRESH(dr)	((dr)->pending / 4)

/* VNET packets are sent in buffers with the first 6 bytes skipped
 * so that after the ethernet header the IPv4/IPv6 headers are aligned
 * properly.
 */
#define VNET_PACKET_SKIP		6

struct vnet_tx_entry {
	void			*buf;
	unsigned int		ncookies;
	struct ldc_trans_cookie	cookies[2];
};

struct vnet;
struct vnet_port {
	struct vio_driver_state	vio;

	struct hlist_node	hash;
	u8			raddr[ETH_ALEN];

	struct vnet		*vp;

	struct vnet_tx_entry	tx_bufs[VNET_TX_RING_SIZE];

	struct list_head	list;
};

static inline struct vnet_port *to_vnet_port(struct vio_driver_state *vio)
{
	return container_of(vio, struct vnet_port, vio);
}

#define VNET_PORT_HASH_SIZE	16
#define VNET_PORT_HASH_MASK	(VNET_PORT_HASH_SIZE - 1)

static inline unsigned int vnet_hashfn(u8 *mac)
{
	unsigned int val = mac[4] ^ mac[5];

	return val & (VNET_PORT_HASH_MASK);
}

struct vnet {
	/* Protects port_list and port_hash.  */
	spinlock_t		lock;

	struct net_device	*dev;

	u32			msg_enable;
	struct vio_dev		*vdev;

	struct list_head	port_list;

	struct hlist_head	port_hash[VNET_PORT_HASH_SIZE];
};

#endif /* _SUNVNET_H */
-------------------- sunvnet.c --------------------
/* sunvnet.c: Sun LDOM Virtual Network Driver.
 *
 * Copyright (C) 2007 David S. Miller <davem@davemloft.net>
 */

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/types.h>
#include <linux/slab.h>
#include <linux/delay.h>
#include <linux/init.h>
#include <linux/netdevice.h>
#include <linux/ethtool.h>
#include <linux/etherdevice.h>

#include <asm/vio.h>
#include <asm/ldc.h>

#include "sunvnet.h"

#define DRV_MODULE_NAME		"sunvnet"
#define PFX DRV_MODULE_NAME	": "
#define DRV_MODULE_VERSION	"1.0"
#define DRV_MODULE_RELDATE	"June 25, 2007"

static char version[] __devinitdata =
	DRV_MODULE_NAME ".c:v" DRV_MODULE_VERSION " (" DRV_MODULE_RELDATE ")\n";
MODULE_AUTHOR("David S. Miller (davem@davemloft.net)");
MODULE_DESCRIPTION("Sun LDOM virtual network driver");
MODULE_LICENSE("GPL");
MODULE_VERSION(DRV_MODULE_VERSION);

/* Ordered from largest major to lowest */
static struct vio_version vnet_versions[] = {
	{ .major = 1, .minor = 0 },
};

static inline u32 vnet_tx_dring_avail(struct vio_dring_state *dr)
{
	return vio_dring_avail(dr, VNET_TX_RING_SIZE);
}

static int vnet_handle_unknown(struct vnet_port *port, void *arg)
{
	struct vio_msg_tag *pkt = arg;

	printk(KERN_ERR PFX "Received unknown msg [%02x:%02x:%04x:%08x]\n",
	       pkt->type, pkt->stype, pkt->stype_env, pkt->sid);
	printk(KERN_ERR PFX "Resetting connection.\n");

	ldc_disconnect(port->vio.lp);

	return -ECONNRESET;
}

static int vnet_send_attr(struct vio_driver_state *vio)
{
	struct vnet_port *port = to_vnet_port(vio);
	struct net_device *dev = port->vp->dev;
	struct vio_net_attr_info pkt;
	int i;

	memset(&pkt, 0, sizeof(pkt));
	pkt.tag.type = VIO_TYPE_CTRL;
	pkt.tag.stype = VIO_SUBTYPE_INFO;
	pkt.tag.stype_env = VIO_ATTR_INFO;
	pkt.tag.sid = vio_send_sid(vio);
	pkt.xfer_mode = VIO_DRING_MODE;
	pkt.addr_type = VNET_ADDR_ETHERMAC;
	pkt.ack_freq = 0;
	for (i = 0; i < 6; i++)
		pkt.addr |= (u64)dev->dev_addr[i] << ((5 - i) * 8);
	pkt.mtu = ETH_FRAME_LEN;

	viodbg(HS, "SEND NET ATTR xmode[0x%x] atype[0x%x] addr[%llx] "
	       "ackfreq[%u] mtu[%llu]\n",
	       pkt.xfer_mode, pkt.addr_type,
	       (unsigned long long) pkt.addr,
	       pkt.ack_freq,
	       (unsigned long long) pkt.mtu);

	return vio_ldc_send(vio, &pkt, sizeof(pkt));
}

static int handle_attr_info(struct vio_driver_state *vio,
			    struct vio_net_attr_info *pkt)
{
	viodbg(HS, "GOT NET ATTR INFO xmode[0x%x] atype[0x%x] addr[%llx] "
	       "ackfreq[%u] mtu[%llu]\n",
	       pkt->xfer_mode, pkt->addr_type,
	       (unsigned long long) pkt->addr,
	       pkt->ack_freq,
	       (unsigned long long) pkt->mtu);

	pkt->tag.sid = vio_send_sid(vio);

	if (pkt->xfer_mode != VIO_DRING_MODE ||
	    pkt->addr_type != VNET_ADDR_ETHERMAC ||
	    pkt->mtu != ETH_FRAME_LEN) {
		viodbg(HS, "SEND NET ATTR NACK\n");

		pkt->tag.stype = VIO_SUBTYPE_NACK;

		(void) vio_ldc_send(vio, pkt, sizeof(*pkt));

		return -ECONNRESET;
	} else {
		viodbg(HS, "SEND NET ATTR ACK\n");

		pkt->tag.stype = VIO_SUBTYPE_ACK;

		return vio_ldc_send(vio, pkt, sizeof(*pkt));
	}

}

static int handle_attr_ack(struct vio_driver_state *vio,
			   struct vio_net_attr_info *pkt)
{
	viodbg(HS, "GOT NET ATTR ACK\n");

	return 0;
}

static int handle_attr_nack(struct vio_driver_state *vio,
			    struct vio_net_attr_info *pkt)
{
	viodbg(HS, "GOT NET ATTR NACK\n");

	return -ECONNRESET;
}

static int vnet_handle_attr(struct vio_driver_state *vio, void *arg)
{
	struct vio_net_attr_info *pkt = arg;

	switch (pkt->tag.stype) {
	case VIO_SUBTYPE_INFO:
		return handle_attr_info(vio, pkt);

	case VIO_SUBTYPE_ACK:
		return handle_attr_ack(vio, pkt);

	case VIO_SUBTYPE_NACK:
		return handle_attr_nack(vio, pkt);

	default:
		return -ECONNRESET;
	}
}

static void vnet_handshake_complete(struct vio_driver_state *vio)
{
	struct vio_dring_state *dr;

	dr = &vio->drings[VIO_DRIVER_RX_RING];
	dr->snd_nxt = dr->rcv_nxt = 1;

	dr = &vio->drings[VIO_DRIVER_TX_RING];
	dr->snd_nxt = dr->rcv_nxt = 1;
}

/* The hypervisor interface that implements copying to/from imported
 * memory from another domain requires that copies are done to 8-byte
 * aligned buffers, and that the lengths of such copies are also 8-byte
 * multiples.
 *
 * So we align skb->data to an 8-byte multiple and pad-out the data
 * area so we can round the copy length up to the next multiple of
 * 8 for the copy.
 *
 * The transmitter puts the actual start of the packet 6 bytes into
 * the buffer it sends over, so that the IP headers after the ethernet
 * header are aligned properly.  These 6 bytes are not in the descriptor
 * length, they are simply implied.  This offset is represented using
 * the VNET_PACKET_SKIP macro.
 */
static struct sk_buff *alloc_and_align_skb(struct net_device *dev,
					   unsigned int len)
{
	struct sk_buff *skb = netdev_alloc_skb(dev, len+VNET_PACKET_SKIP+8+8);
	unsigned long addr, off;

	if (unlikely(!skb))
		return NULL;

	addr = (unsigned long) skb->data;
	off = ((addr + 7UL) & ~7UL) - addr;
	if (off)
		skb_reserve(skb, off);

	return skb;
}

static int vnet_rx_one(struct vnet_port *port, unsigned int len,
		       struct ldc_trans_cookie *cookies, int ncookies)
{
	struct net_device *dev = port->vp->dev;
	unsigned int copy_len;
	struct sk_buff *skb;
	int err;

	skb = alloc_and_align_skb(dev, len);
	err = -ENOMEM;
	if (!skb) {
		dev->stats.rx_missed_errors++;
		goto out_dropped;
	}

	copy_len = (len + VNET_PACKET_SKIP + 7U) & ~7U;
	skb_put(skb, copy_len);
	err = ldc_copy(port->vio.lp, LDC_COPY_IN,
		       skb->data, copy_len, 0,
		       cookies, ncookies);
	if (err < 0) {
		dev->stats.rx_frame_errors++;
		goto out_free_skb;
	}

	skb_pull(skb, VNET_PACKET_SKIP);
	skb_trim(skb, len);
	skb->protocol = eth_type_trans(skb, dev);

	dev->stats.rx_packets++;
	dev->stats.rx_bytes += len;

	netif_rx(skb);

	return 0;

out_free_skb:
	kfree_skb(skb);

out_dropped:
	dev->stats.rx_dropped++;
	return err;
}

static int vnet_send_ack(struct vnet_port *port, struct vio_dring_state *dr,
			 u32 start, u32 end, u8 vio_dring_state)
{
	struct vio_dring_data hdr = {
		.tag = {
			.type		= VIO_TYPE_DATA,
			.stype		= VIO_SUBTYPE_ACK,
			.stype_env	= VIO_DRING_DATA,
			.sid		= vio_send_sid(&port->vio),
		},
		.dring_ident		= dr->ident,
		.start_idx		= start,
		.end_idx		= end,
		.state			= vio_dring_state,
	};
	int err, delay;

	hdr.seq = dr->snd_nxt;
	delay = 1;
	do {
		err = vio_ldc_send(&port->vio, &hdr, sizeof(hdr));
		if (err > 0) {
			dr->snd_nxt++;
			break;
		}
		udelay(delay);
		if ((delay <<= 1) > 128)
			delay = 128;
	} while (err == -EAGAIN);

	return err;
}

static u32 next_idx(u32 idx, struct vio_dring_state *dr)
{
	if (++idx == dr->num_entries)
		idx = 0;
	return idx;
}

static u32 prev_idx(u32 idx, struct vio_dring_state *dr)
{
	if (idx == 0)
		idx = dr->num_entries - 1;
	else
		idx--;

	return idx;
}

static struct vio_net_desc *get_rx_desc(struct vnet_port *port,
					struct vio_dring_state *dr,
					u32 index)
{
	struct vio_net_desc *desc = port->vio.desc_buf;
	int err;

	err = ldc_get_dring_entry(port->vio.lp, desc, dr->entry_size,
				  (index * dr->entry_size),
				  dr->cookies, dr->ncookies);
	if (err < 0)
		return ERR_PTR(err);

	return desc;
}

static int put_rx_desc(struct vnet_port *port,
		       struct vio_dring_state *dr,
		       struct vio_net_desc *desc,
		       u32 index)
{
	int err;

	err = ldc_put_dring_entry(port->vio.lp, desc, dr->entry_size,
				  (index * dr->entry_size),
				  dr->cookies, dr->ncookies);
	if (err < 0)
		return err;

	return 0;
}

static int vnet_walk_rx_one(struct vnet_port *port,
			    struct vio_dring_state *dr,
			    u32 index, int *needs_ack)
{
	struct vio_net_desc *desc = get_rx_desc(port, dr, index);
	struct vio_driver_state *vio = &port->vio;
	int err;

	if (IS_ERR(desc))
		return PTR_ERR(desc);

	viodbg(DATA, "vio_walk_rx_one desc[%02x:%02x:%08x:%08x:%lx:%lx]\n",
	       desc->hdr.state, desc->hdr.ack,
	       desc->size, desc->ncookies,
	       desc->cookies[0].cookie_addr,
	       desc->cookies[0].cookie_size);

	if (desc->hdr.state != VIO_DESC_READY)
		return 1;
	err = vnet_rx_one(port, desc->size, desc->cookies, desc->ncookies);
	if (err == -ECONNRESET)
		return err;
	desc->hdr.state = VIO_DESC_DONE;
	err = put_rx_desc(port, dr, desc, index);
	if (err < 0)
		return err;
	*needs_ack = desc->hdr.ack;
	return 0;
}

static int vnet_walk_rx(struct vnet_port *port, struct vio_dring_state *dr,
			u32 start, u32 end)
{
	struct vio_driver_state *vio = &port->vio;
	int ack_start = -1, ack_end = -1;

	end = (end == (u32) -1) ? prev_idx(start, dr) : next_idx(end, dr);

	viodbg(DATA, "vnet_walk_rx start[%08x] end[%08x]\n", start, end);

	while (start != end) {
		int ack = 0, err = vnet_walk_rx_one(port, dr, start, &ack);
		if (err == -ECONNRESET)
			return err;
		if (err != 0)
			break;
		if (ack_start == -1)
			ack_start = start;
		ack_end = start;
		start = next_idx(start, dr);
		if (ack && start != end) {
			err = vnet_send_ack(port, dr, ack_start, ack_end,
					    VIO_DRING_ACTIVE);
			if (err == -ECONNRESET)
				return err;
			ack_start = -1;
		}
	}
	if (unlikely(ack_start == -1))
		ack_start = ack_end = prev_idx(start, dr);
	return vnet_send_ack(port, dr, ack_start, ack_end, VIO_DRING_STOPPED);
}

static int vnet_rx(struct vnet_port *port, void *msgbuf)
{
	struct vio_dring_data *pkt = msgbuf;
	struct vio_dring_state *dr = &port->vio.drings[VIO_DRIVER_RX_RING];
	struct vio_driver_state *vio = &port->vio;

	viodbg(DATA, "vnet_rx stype_env[%04x] seq[%016lx] rcv_nxt[%016lx]\n",
	       pkt->tag.stype_env, pkt->seq, dr->rcv_nxt);

	if (unlikely(pkt->tag.stype_env != VIO_DRING_DATA))
		return 0;
	if (unlikely(pkt->seq != dr->rcv_nxt)) {
		printk(KERN_ERR PFX "RX out of sequence seq[0x%lx] "
		       "rcv_nxt[0x%lx]\n", pkt->seq, dr->rcv_nxt);
		return 0;
	}

	dr->rcv_nxt++;

	/* XXX Validate pkt->start_idx and pkt->end_idx XXX */

	return vnet_walk_rx(port, dr, pkt->start_idx, pkt->end_idx);
}

static int idx_is_pending(struct vio_dring_state *dr, u32 end)
{
	u32 idx = dr->cons;
	int found = 0;

	while (idx != dr->prod) {
		if (idx == end) {
			found = 1;
			break;
		}
		idx = next_idx(idx, dr);
	}
	return found;
}

static int vnet_ack(struct vnet_port *port, void *msgbuf)
{
	struct vio_dring_state *dr = &port->vio.drings[VIO_DRIVER_TX_RING];
	struct vio_dring_data *pkt = msgbuf;
	struct net_device *dev;
	struct vnet *vp;
	u32 end;

	if (unlikely(pkt->tag.stype_env != VIO_DRING_DATA))
		return 0;

	end = pkt->end_idx;
	if (unlikely(!idx_is_pending(dr, end)))
		return 0;

	dr->cons = next_idx(end, dr);

	vp = port->vp;
	dev = vp->dev;
	if (unlikely(netif_queue_stopped(dev) &&
		     vnet_tx_dring_avail(dr) >= VNET_TX_WAKEUP_THRESH(dr)))
		return 1;

	return 0;
}

static int vnet_nack(struct vnet_port *port, void *msgbuf)
{
	/* XXX just reset or similar XXX */
	return 0;
}

static void maybe_tx_wakeup(struct vnet *vp)
{
	struct net_device *dev = vp->dev;

	netif_tx_lock(dev);
	if (likely(netif_queue_stopped(dev))) {
		struct vnet_port *port;
		int wake = 1;

		list_for_each_entry(port, &vp->port_list, list) {
			struct vio_dring_state *dr;

			dr = &port->vio.drings[VIO_DRIVER_TX_RING];
			if (vnet_tx_dring_avail(dr) <
			    VNET_TX_WAKEUP_THRESH(dr)) {
				wake = 0;
				break;
			}
		}
		if (wake)
			netif_wake_queue(dev);
	}
	netif_tx_unlock(dev);
}

static void vnet_event(void *arg, int event)
{
	struct vnet_port *port = arg;
	struct vio_driver_state *vio = &port->vio;
	unsigned long flags;
	int tx_wakeup, err;

	spin_lock_irqsave(&vio->lock, flags);

	if (unlikely(event == LDC_EVENT_RESET ||
		     event == LDC_EVENT_UP)) {
		vio_link_state_change(vio, event);
		spin_unlock_irqrestore(&vio->lock, flags);

		return;
	}

	if (unlikely(event != LDC_EVENT_DATA_READY)) {
		printk(KERN_WARNING PFX "Unexpected LDC event %d\n", event);
		spin_unlock_irqrestore(&vio->lock, flags);
		return;
	}

	tx_wakeup = err = 0;
	while (1) {
		union {
			struct vio_msg_tag tag;
			u64 raw[8];
		} msgbuf;

		err = ldc_read(vio->lp, &msgbuf, sizeof(msgbuf));
		if (unlikely(err < 0)) {
			if (err == -ECONNRESET)
				vio_conn_reset(vio);
			break;
		}
		if (err == 0)
			break;
		viodbg(DATA, "TAG [%02x:%02x:%04x:%08x]\n",
		       msgbuf.tag.type,
		       msgbuf.tag.stype,
		       msgbuf.tag.stype_env,
		       msgbuf.tag.sid);
		err = vio_validate_sid(vio, &msgbuf.tag);
		if (err < 0)
			break;

		if (likely(msgbuf.tag.type == VIO_TYPE_DATA)) {
			if (msgbuf.tag.stype == VIO_SUBTYPE_INFO) {
				err = vnet_rx(port, &msgbuf);
			} else if (msgbuf.tag.stype == VIO_SUBTYPE_ACK) {
				err = vnet_ack(port, &msgbuf);
				if (err > 0)
					tx_wakeup |= err;
			} else if (msgbuf.tag.stype == VIO_SUBTYPE_NACK) {
				err = vnet_nack(port, &msgbuf);
			}
		} else if (msgbuf.tag.type == VIO_TYPE_CTRL) {
			err = vio_control_pkt_engine(vio, &msgbuf);
			if (err)
				break;
		} else {
			err = vnet_handle_unknown(port, &msgbuf);
		}
		if (err == -ECONNRESET)
			break;
	}
	spin_unlock(&vio->lock);
	if (unlikely(tx_wakeup && err != -ECONNRESET))
		maybe_tx_wakeup(port->vp);
	local_irq_restore(flags);
}

static int __vnet_tx_trigger(struct vnet_port *port)
{
	struct vio_dring_state *dr = &port->vio.drings[VIO_DRIVER_TX_RING];
	struct vio_dring_data hdr = {
		.tag = {
			.type		= VIO_TYPE_DATA,
			.stype		= VIO_SUBTYPE_INFO,
			.stype_env	= VIO_DRING_DATA,
			.sid		= vio_send_sid(&port->vio),
		},
		.dring_ident		= dr->ident,
		.start_idx		= dr->prod,
		.end_idx		= (u32) -1,
	};
	int err, delay;

	hdr.seq = dr->snd_nxt;
	delay = 1;
	do {
		err = vio_ldc_send(&port->vio, &hdr, sizeof(hdr));
		if (err > 0) {
			dr->snd_nxt++;
			break;
		}
		udelay(delay);
		if ((delay <<= 1) > 128)
			delay = 128;
	} while (err == -EAGAIN);

	return err;
}

struct vnet_port *__tx_port_find(struct vnet *vp, struct sk_buff *skb)
{
	unsigned int hash = vnet_hashfn(skb->data);
	struct hlist_head *hp = &vp->port_hash[hash];
	struct hlist_node *n;
	struct vnet_port *port;

	hlist_for_each_entry(port, n, hp, hash) {
		if (!compare_ether_addr(port->raddr, skb->data))
			return port;
	}
	port = NULL;
	if (!list_empty(&vp->port_list))
		port = list_entry(vp->port_list.next, struct vnet_port, list);

	return port;
}

struct vnet_port *tx_port_find(struct vnet *vp, struct sk_buff *skb)
{
	struct vnet_port *ret;
	unsigned long flags;

	spin_lock_irqsave(&vp->lock, flags);
	ret = __tx_port_find(vp, skb);
	spin_unlock_irqrestore(&vp->lock, flags);

	return ret;
}

static int vnet_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct vnet *vp = netdev_priv(dev);
	struct vnet_port *port = tx_port_find(vp, skb);
	struct vio_dring_state *dr;
	struct vio_net_desc *d;
	unsigned long flags;
	unsigned int len;
	void *tx_buf;
	int i, err;

	if (unlikely(!port))
		goto out_dropped;

	spin_lock_irqsave(&port->vio.lock, flags);

	dr = &port->vio.drings[VIO_DRIVER_TX_RING];
	if (unlikely(vnet_tx_dring_avail(dr) < 2)) {
		if (!netif_queue_stopped(dev)) {
			netif_stop_queue(dev);

			/* This is a hard error, log it. */
			printk(KERN_ERR PFX "%s: BUG! Tx Ring full when "
			       "queue awake!\n", dev->name);
			dev->stats.tx_errors++;
		}
		spin_unlock_irqrestore(&port->vio.lock, flags);
		return NETDEV_TX_BUSY;
	}

	d = vio_dring_cur(dr);

	tx_buf = port->tx_bufs[dr->prod].buf;
	skb_copy_from_linear_data(skb, tx_buf + VNET_PACKET_SKIP, skb->len);

	len = skb->len;
	if (len < ETH_ZLEN) {
		len = ETH_ZLEN;
		memset(tx_buf+VNET_PACKET_SKIP+skb->len, 0, len - skb->len);
	}

	d->hdr.ack = VIO_ACK_ENABLE;
	d->size = len;
	d->ncookies = port->tx_bufs[dr->prod].ncookies;
	for (i = 0; i < d->ncookies; i++)
		d->cookies[i] = port->tx_bufs[dr->prod].cookies[i];

	/* This has to be a non-SMP write barrier because we are writing
	 * to memory which is shared with the peer LDOM.
	 */
	wmb();

	d->hdr.state = VIO_DESC_READY;

	err = __vnet_tx_trigger(port);
	if (unlikely(err < 0)) {
		printk(KERN_INFO PFX "%s: TX trigger error %d\n",
		       dev->name, err);
		d->hdr.state = VIO_DESC_FREE;
		dev->stats.tx_carrier_errors++;
		goto out_dropped_unlock;
	}

	dev->stats.tx_packets++;
	dev->stats.tx_bytes += skb->len;

	dr->prod = (dr->prod + 1) & (VNET_TX_RING_SIZE - 1);
	if (unlikely(vnet_tx_dring_avail(dr) < 2)) {
		netif_stop_queue(dev);
		if (vnet_tx_dring_avail(dr) > VNET_TX_WAKEUP_THRESH(dr))
			netif_wake_queue(dev);
	}

	spin_unlock_irqrestore(&port->vio.lock, flags);

	dev_kfree_skb(skb);

	dev->trans_start = jiffies;
	return NETDEV_TX_OK;

out_dropped_unlock:
	spin_unlock_irqrestore(&port->vio.lock, flags);

out_dropped:
	dev_kfree_skb(skb);
	dev->stats.tx_dropped++;
	return NETDEV_TX_OK;
}

static void vnet_tx_timeout(struct net_device *dev)
{
	/* XXX Implement me XXX */
}

static int vnet_open(struct net_device *dev)
{
	netif_carrier_on(dev);
	netif_start_queue(dev);

	return 0;
}

static int vnet_close(struct net_device *dev)
{
	netif_stop_queue(dev);
	netif_carrier_off(dev);

	return 0;
}

static void vnet_set_rx_mode(struct net_device *dev)
{
	/* XXX Implement multicast support XXX */
}

static int vnet_change_mtu(struct net_device *dev, int new_mtu)
{
	if (new_mtu != ETH_DATA_LEN)
		return -EINVAL;

	dev->mtu = new_mtu;
	return 0;
}

static int vnet_set_mac_addr(struct net_device *dev, void *p)
{
	return -EINVAL;
}

static void vnet_get_drvinfo(struct net_device *dev,
			     struct ethtool_drvinfo *info)
{
	strcpy(info->driver, DRV_MODULE_NAME);
	strcpy(info->version, DRV_MODULE_VERSION);
}

static u32 vnet_get_msglevel(struct net_device *dev)
{
	struct vnet *vp = netdev_priv(dev);
	return vp->msg_enable;
}

static void vnet_set_msglevel(struct net_device *dev, u32 value)
{
	struct vnet *vp = netdev_priv(dev);
	vp->msg_enable = value;
}

static const struct ethtool_ops vnet_ethtool_ops = {
	.get_drvinfo		= vnet_get_drvinfo,
	.get_msglevel		= vnet_get_msglevel,
	.set_msglevel		= vnet_set_msglevel,
	.get_link		= ethtool_op_get_link,
	.get_perm_addr		= ethtool_op_get_perm_addr,
};

static void vnet_port_free_tx_bufs(struct vnet_port *port)
{
	struct vio_dring_state *dr;
	int i;

	dr = &port->vio.drings[VIO_DRIVER_TX_RING];
	if (dr->base) {
		ldc_free_exp_dring(port->vio.lp, dr->base,
				   (dr->entry_size * dr->num_entries),
				   dr->cookies, dr->ncookies);
		dr->base = NULL;
		dr->entry_size = 0;
		dr->num_entries = 0;
		dr->pending = 0;
		dr->ncookies = 0;
	}

	for (i = 0; i < VNET_TX_RING_SIZE; i++) {
		void *buf = port->tx_bufs[i].buf;

		if (!buf)
			continue;

		ldc_unmap(port->vio.lp,
			  port->tx_bufs[i].cookies,
			  port->tx_bufs[i].ncookies);

		kfree(buf);
		port->tx_bufs[i].buf = NULL;
	}
}

static int __devinit vnet_port_alloc_tx_bufs(struct vnet_port *port)
{
	struct vio_dring_state *dr;
	unsigned long len;
	int i, err, ncookies;
	void *dring;

	for (i = 0; i < VNET_TX_RING_SIZE; i++) {
		void *buf = kzalloc(ETH_FRAME_LEN + 8, GFP_KERNEL);
		int map_len = (ETH_FRAME_LEN + 7) & ~7;

		err = -ENOMEM;
		if (!buf) {
			printk(KERN_ERR "TX buffer allocation failure\n");
			goto err_out;
		}
		err = -EFAULT;
		if ((unsigned long)buf & (8UL - 1)) {
			printk(KERN_ERR "TX buffer misaligned\n");
			kfree(buf);
			goto err_out;
		}

		err = ldc_map_single(port->vio.lp, buf, map_len,
				     port->tx_bufs[i].cookies, 2,
				     (LDC_MAP_SHADOW |
				      LDC_MAP_DIRECT |
				      LDC_MAP_RW));
		if (err < 0) {
			kfree(buf);
			goto err_out;
		}
		port->tx_bufs[i].buf = buf;
		port->tx_bufs[i].ncookies = err;
	}

	dr = &port->vio.drings[VIO_DRIVER_TX_RING];

	len = (VNET_TX_RING_SIZE *
	       (sizeof(struct vio_net_desc) +
		(sizeof(struct ldc_trans_cookie) * 2)));

	ncookies = VIO_MAX_RING_COOKIES;
	dring = ldc_alloc_exp_dring(port->vio.lp, len,
				    dr->cookies, &ncookies,
				    (LDC_MAP_SHADOW |
				     LDC_MAP_DIRECT |
				     LDC_MAP_RW));
	if (IS_ERR(dring)) {
		err = PTR_ERR(dring);
		goto err_out;
	}

	dr->base = dring;
	dr->entry_size = (sizeof(struct vio_net_desc) +
			  (sizeof(struct ldc_trans_cookie) * 2));
	dr->num_entries = VNET_TX_RING_SIZE;
	dr->prod = dr->cons = 0;
	dr->pending = VNET_TX_RING_SIZE;
	dr->ncookies = ncookies;

	return 0;

err_out:
	vnet_port_free_tx_bufs(port);

	return err;
}

static struct ldc_channel_config vnet_ldc_cfg = {
	.event		= vnet_event,
	.mtu		= 64,
	.mode		= LDC_MODE_UNRELIABLE,
};

static struct vio_driver_ops vnet_vio_ops = {
	.send_attr		= vnet_send_attr,
	.handle_attr		= vnet_handle_attr,
	.handshake_complete	= vnet_handshake_complete,
};

const char *remote_macaddr_prop = "remote-mac-address";

static int __devinit vnet_port_probe(struct vio_dev *vdev,
				     const struct vio_device_id *id)
{
	struct mdesc_node *endp;
	struct vnet_port *port;
	unsigned long flags;
	struct vnet *vp;
	const u64 *rmac;
	int len, i, err, switch_port;

	vp = dev_get_drvdata(vdev->dev.parent);
	if (!vp) {
		printk(KERN_ERR PFX "Cannot find port parent vnet.\n");
		return -ENODEV;
	}

	rmac = md_get_property(vdev->mp, remote_macaddr_prop, &len);
	if (!rmac) {
		printk(KERN_ERR PFX "Port lacks %s property.\n",
		       remote_macaddr_prop);
		return -ENODEV;
	}

	endp = vio_find_endpoint(vdev);
	if (!endp) {
		printk(KERN_ERR PFX "Port lacks channel-endpoint.\n");
		return -ENODEV;
	}

	port = kzalloc(sizeof(*port), GFP_KERNEL);
	if (!port) {
		printk(KERN_ERR PFX "Cannot allocate vnet_port.\n");
		return -ENOMEM;
	}

	for (i = 0; i < ETH_ALEN; i++)
		port->raddr[i] = (*rmac >> (5 - i) * 8) & 0xff;

	port->vp = vp;

	err = vio_driver_init(&port->vio, vdev, VDEV_NETWORK, endp,
			      vnet_versions, ARRAY_SIZE(vnet_versions),
			      &vnet_vio_ops, vp->dev->name);
	if (err)
		goto err_out_free_port;

	err = vio_ldc_alloc(&port->vio, &vnet_ldc_cfg, port);
	if (err)
		goto err_out_free_port;

	err = vnet_port_alloc_tx_bufs(port);
	if (err)
		goto err_out_free_ldc;

	INIT_HLIST_NODE(&port->hash);
	INIT_LIST_HEAD(&port->list);

	switch_port = 0;
	if (md_get_property(vdev->mp, "switch-port", NULL) != NULL)
		switch_port = 1;

	spin_lock_irqsave(&vp->lock, flags);
	if (switch_port)
		list_add(&port->list, &vp->port_list);
	else
		list_add_tail(&port->list, &vp->port_list);
	hlist_add_head(&port->hash, &vp->port_hash[vnet_hashfn(port->raddr)]);
	spin_unlock_irqrestore(&vp->lock, flags);

	dev_set_drvdata(&vdev->dev, port);

	printk(KERN_INFO "%s: PORT ( remote-mac ", vp->dev->name);
	for (i = 0; i < 6; i++)
		printk("%2.2x%c", port->raddr[i], i == 5 ? ' ' : ':');
	if (switch_port)
		printk("switch-port ");
	printk(")\n");

	vio_port_up(&port->vio);

	return 0;

err_out_free_ldc:
	vio_ldc_free(&port->vio);

err_out_free_port:
	kfree(port);

	return err;
}

static int vnet_port_remove(struct vio_dev *vdev)
{
	struct vnet_port *port = dev_get_drvdata(&vdev->dev);

	if (port) {
		struct vnet *vp = port->vp;
		unsigned long flags;

		del_timer_sync(&port->vio.timer);

		spin_lock_irqsave(&vp->lock, flags);
		list_del(&port->list);
		hlist_del(&port->hash);
		spin_unlock_irqrestore(&vp->lock, flags);

		vnet_port_free_tx_bufs(port);
		vio_ldc_free(&port->vio);

		dev_set_drvdata(&vdev->dev, NULL);

		kfree(port);
	}
	return 0;
}

static struct vio_device_id vnet_port_match[] = {
	{
		.type = "vnet-port",
	},
	{},
};
MODULE_DEVICE_TABLE(vio, vnet_port_match);

static struct vio_driver vnet_port_driver = {
	.id_table	= vnet_port_match,
	.probe		= vnet_port_probe,
	.remove		= vnet_port_remove,
	.driver		= {
		.name	= "vnet_port",
		.owner	= THIS_MODULE,
	}
};

const char *local_mac_prop = "local-mac-address";

static int __devinit vnet_probe(struct vio_dev *vdev,
				const struct vio_device_id *id)
{
	static int vnet_version_printed;
	struct net_device *dev;
	struct vnet *vp;
	const u64 *mac;
	int err, i, len;

	if (vnet_version_printed++ == 0)
		printk(KERN_INFO "%s", version);

	mac = md_get_property(vdev->mp, local_mac_prop, &len);
	if (!mac) {
		printk(KERN_ERR PFX "vnet lacks %s property.\n",
		       local_mac_prop);
		err = -ENODEV;
		goto err_out;
	}

	dev = alloc_etherdev(sizeof(*vp));
	if (!dev) {
		printk(KERN_ERR PFX "Etherdev alloc failed, aborting.\n");
		err = -ENOMEM;
		goto err_out;
	}

	for (i = 0; i < ETH_ALEN; i++)
		dev->dev_addr[i] = (*mac >> (5 - i) * 8) & 0xff;

	memcpy(dev->perm_addr, dev->dev_addr, dev->addr_len);

	SET_NETDEV_DEV(dev, &vdev->dev);

	vp = netdev_priv(dev);

	spin_lock_init(&vp->lock);
	vp->dev = dev;
	vp->vdev = vdev;

	INIT_LIST_HEAD(&vp->port_list);
	for (i = 0; i < VNET_PORT_HASH_SIZE; i++)
		INIT_HLIST_HEAD(&vp->port_hash[i]);

	dev->open = vnet_open;
	dev->stop = vnet_close;
	dev->set_multicast_list = vnet_set_rx_mode;
	dev->set_mac_address = vnet_set_mac_addr;
	dev->tx_timeout = vnet_tx_timeout;
	dev->ethtool_ops = &vnet_ethtool_ops;
	dev->watchdog_timeo = VNET_TX_TIMEOUT;
	dev->change_mtu = vnet_change_mtu;
	dev->hard_start_xmit = vnet_start_xmit;

	err = register_netdev(dev);
	if (err) {
		printk(KERN_ERR PFX "Cannot register net device, "
		       "aborting.\n");
		goto err_out_free_dev;
	}

	printk(KERN_INFO "%s: Sun LDOM vnet ", dev->name);

	for (i = 0; i < 6; i++)
		printk("%2.2x%c", dev->dev_addr[i], i == 5 ? '\n' : ':');

	dev_set_drvdata(&vdev->dev, vp);

	return 0;

err_out_free_dev:
	free_netdev(dev);

err_out:
	return err;
}

static int vnet_remove(struct vio_dev *vdev)
{

	struct vnet *vp = dev_get_drvdata(&vdev->dev);

	if (vp) {
		/* XXX unregister port, or at least check XXX */
		unregister_netdev(vp->dev);
		dev_set_drvdata(&vdev->dev, NULL);
	}
	return 0;
}

static struct vio_device_id vnet_match[] = {
	{
		.type = "network",
	},
	{},
};
MODULE_DEVICE_TABLE(vio, vnet_match);

static struct vio_driver vnet_driver = {
	.id_table	= vnet_match,
	.probe		= vnet_probe,
	.remove		= vnet_remove,
	.driver		= {
		.name	= "vnet",
		.owner	= THIS_MODULE,
	}
};

static int __init vnet_init(void)
{
	int err = vio_register_driver(&vnet_driver);

	if (!err) {
		err = vio_register_driver(&vnet_port_driver);
		if (err)
			vio_unregister_driver(&vnet_driver);
	}

	return err;
}

static void __exit vnet_exit(void)
{
	vio_unregister_driver(&vnet_port_driver);
	vio_unregister_driver(&vnet_driver);
}

module_init(vnet_init);
module_exit(vnet_exit);

^ permalink raw reply	[flat|nested] 60+ messages in thread

* RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-29  4:20             ` David Miller
@ 2007-06-29  8:45               ` Waskiewicz Jr, Peter P
  2007-06-29 11:43               ` Multiqueue and virtualization WAS(Re: " jamal
  2007-06-30 14:33               ` Patrick McHardy
  2 siblings, 0 replies; 60+ messages in thread
From: Waskiewicz Jr, Peter P @ 2007-06-29  8:45 UTC (permalink / raw)
  To: David Miller, kaber; +Cc: netdev, jeff, Kok, Auke-jan H, hadi

> Ok everything is checked into net-2.6.23, thanks everyone.

Dave, thank you for your patience and feedback on this whole process.
Patrick and everyone else, thank you for your feedback and assistance.

I am looking at your posed virtualization question, but I need sleep
since I just remembered I'm on east coast time here at OLS, and it's
4:30am.

Thanks again,

-PJ Waskiewicz

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API
  2007-06-29  3:39   ` David Miller
@ 2007-06-29 10:54     ` Jeff Garzik
  0 siblings, 0 replies; 60+ messages in thread
From: Jeff Garzik @ 2007-06-29 10:54 UTC (permalink / raw)
  To: David Miller; +Cc: peter.p.waskiewicz.jr, netdev, auke-jan.h.kok, hadi, kaber

David Miller wrote:
> From: PJ Waskiewicz <peter.p.waskiewicz.jr@intel.com>
> Date: Thu, 28 Jun 2007 09:21:13 -0700
> 
>> -struct net_device *alloc_netdev(int sizeof_priv, const char *name,
>> -		void (*setup)(struct net_device *))
>> +struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
>> +		void (*setup)(struct net_device *), int queue_count)
>>  {
>>  	void *p;
>>  	struct net_device *dev;
>> @@ -3557,7 +3564,9 @@ struct net_device *alloc_netdev(int sizeof_priv, const char *name,
>>  	BUG_ON(strlen(name) >= sizeof(dev->name));
>>  
>>  	/* ensure 32-byte alignment of both the device and private area */
>> -	alloc_size = (sizeof(*dev) + NETDEV_ALIGN_CONST) & ~NETDEV_ALIGN_CONST;
>> +	alloc_size = (sizeof(*dev) + NETDEV_ALIGN_CONST +
>> +		     (sizeof(struct net_device_subqueue) * queue_count)) &
>> +		     ~NETDEV_ALIGN_CONST;
>>  	alloc_size += sizeof_priv + NETDEV_ALIGN_CONST;
>>  
>>  	p = kzalloc(alloc_size, GFP_KERNEL);
>> @@ -3573,12 +3582,14 @@ struct net_device *alloc_netdev(int sizeof_priv, const char *name,
>>  	if (sizeof_priv)
>>  		dev->priv = netdev_priv(dev);
>>  
>> +	dev->egress_subqueue_count = queue_count;
>> +
>>  	dev->get_stats = internal_stats;
>>  	setup(dev);
>>  	strcpy(dev->name, name);
>>  	return dev;
>>  }
> 
> This isn't going to work.
> 
> The pointer returned from netdev_priv() doesn't take into account the
> variable sized queues at the end of struct netdev, so we can stomp
> over the queues with the private area.
> 
> This probably works by luck because of NETDEV_ALIGN.
> 
> The simplest fix is to just make netdev_priv() use dev->priv,
> except when it's being initialized during allocation, and
> that's what I'm going to do when I apply your patch.

Ugh.  That will reverse the gains we had with the current setup, won't it?

Also, what happens when we want to add ingress_queue[0] ?

	Jeff






^ permalink raw reply	[flat|nested] 60+ messages in thread

* Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-29  4:20             ` David Miller
  2007-06-29  8:45               ` Waskiewicz Jr, Peter P
@ 2007-06-29 11:43               ` jamal
  2007-06-29 11:59                 ` Patrick McHardy
  2007-06-30 14:33               ` Patrick McHardy
  2 siblings, 1 reply; 60+ messages in thread
From: jamal @ 2007-06-29 11:43 UTC (permalink / raw)
  To: David Miller; +Cc: kaber, peter.p.waskiewicz.jr, netdev, jeff, auke-jan.h.kok


I've changed the topic for you, friend - otherwise most people won't follow
(as you've said a few times yourself ;->).

On Thu, 2007-28-06 at 21:20 -0700, David Miller wrote:

> Now I get to pose a problem for everyone: prove to me how useful
> this new code is by showing me how it can be used to solve a
> recurring problem in virtualized network drivers of which I've
> had to code one up recently, see my most recent blog entry at:
> 
> 	http://vger.kernel.org/~davem/cgi-bin/blog.cgi/index.html
> 

nice.

> Anyways the gist of the issue is (and this happens for Sun LDOMS
> networking, lguest, IBM iSeries, etc.) that we have a single
> virtualized network device.  There is a "port" to the control
> node (which switches packets to the real network for the guest)
> and one "port" to each of the other guests.
> 
> Each guest gets a unique MAC address.  There is a queue per-port
> that can fill up.
> 
> What all the drivers like this do right now is stop the queue if
> any of the per-port queues fill up, and that's what my sunvnet
> driver does right now as well.  We can only thus wake up the
> queue when all of the ports have some space.

Is a netdevice really the correct construct for the host side?
Sounds to me like a layer above the netdevice is the way to go. A bridge for
example or L3 routing or even simple tc classify/redirection etc.
I haven't used what has become openvz these days in many years (or played
with Eric's approach), but if I recall correctly - it used to have a
single netdevice per guest on the host. That's close to what a basic
qemu/UML has today. In such a case it is something above netdevices
which does the guest selection.
 
> The ports (and thus the queues) are selected by destination
> MAC address.  Each port has a remote MAC address, if there
> is an exact match with a port's remote MAC we'd use that port
> and thus that port's queue.  If there is no exact match
> (some other node on the real network, broadcast, multicast,
> etc.) we want to use the control node's port and port queue.
> 

Ok, Dave, isn't that what a bridge does? ;-> You'd need filtering to go
with it (for example to restrict guest0 from getting certain broadcasts
etc) - but we already have that.
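
Something along these lines, say (device names and MAC purely
illustrative; the negative u32 offsets reach back into the ethernet
header relative to the network header, so treat this as a sketch):

tc qdisc add dev eth0 ingress
tc filter add dev eth0 parent ffff: protocol all u32 \
	match u16 0x0011 0xffff at -14 \
	match u32 0x22334455 0xffffffff at -12 \
	action mirred egress redirect dev guest0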

> So the problem to solve is to make a way for drivers to do the queue
> selection before the generic queueing layer starts to try and push
> things to the driver.  Perhaps a classifier in the driver or similar.
>
> The solution to this problem generalizes to the other facility
> we want now, hashing the transmit queue by smp_processor_id()
> or similar.  With that in place we can look at doing the TX locking
> per-queue too as is hinted at by the comments above the per-queue
> structure in the current net-2.6.23 tree.

Major surgery will be needed on the tx path if you want to hash the tx
queue to processor id. Our unit construct (today, net-2.6.23) that can
be tied to a cpu is a netdevice. OTOH, if you used a netdevice it should
work as is. But I am possibly missing something in your comments.
What do you have in mind?
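
(For the record, a minimal sketch of the hashing in question -- nothing
in net-2.6.23 calls a hook like this yet; the surgery is getting it
called before enqueue:)

/* Sketch: map the transmitting CPU onto one of the egress subqueues
 * allocated by alloc_netdev_mq().  Purely illustrative. */
static u16 tx_hash_by_cpu(struct net_device *dev, struct sk_buff *skb)
{
	u16 q = smp_processor_id() % dev->egress_subqueue_count;

	skb_set_queue_mapping(skb, q);
	return q;
}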

> My current work-in-progress sunvnet.c driver is included below so
> we can discuss things concretely with code.
> 
> I'm listening. :-)

And you got words above.

cheers,
jamal



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-29 11:43               ` Multiqueue and virtualization WAS(Re: " jamal
@ 2007-06-29 11:59                 ` Patrick McHardy
  2007-06-29 12:54                   ` jamal
  0 siblings, 1 reply; 60+ messages in thread
From: Patrick McHardy @ 2007-06-29 11:59 UTC (permalink / raw)
  To: hadi; +Cc: David Miller, peter.p.waskiewicz.jr, netdev, jeff, auke-jan.h.kok

jamal wrote:
> On Thu, 2007-28-06 at 21:20 -0700, David Miller wrote:
> 
>>Each guest gets a unique MAC address.  There is a queue per-port
>>that can fill up.
>>
>>What all the drivers like this do right now is stop the queue if
>>any of the per-port queues fill up, and that's what my sunvnet
>>driver does right now as well.  We can only thus wake up the
>>queue when all of the ports have some space.
> 
> 
> Is a netdevice really the correct construct for the host side?
> Sounds to me like a layer above the netdevice is the way to go. A bridge for
> example or L3 routing or even simple tc classify/redirection etc.
> I haven't used what has become openvz these days in many years (or played
> with Eric's approach), but if I recall correctly - it used to have a
> single netdevice per guest on the host. That's close to what a basic
> qemu/UML has today. In such a case it is something above netdevices
> which does the guest selection.


I'm guessing that wouldn't allow doing unicast filtering for
the guests on the real device without hacking the bridge code for
this special case. The difference to a real bridge is that all
the addresses are completely known in advance, so it doesn't need
promiscuous mode for learning.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-29 11:59                 ` Patrick McHardy
@ 2007-06-29 12:54                   ` jamal
  2007-06-29 13:08                     ` Patrick McHardy
  2007-06-29 21:31                     ` David Miller
  0 siblings, 2 replies; 60+ messages in thread
From: jamal @ 2007-06-29 12:54 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: David Miller, peter.p.waskiewicz.jr, netdev, jeff, auke-jan.h.kok

On Fri, 2007-29-06 at 13:59 +0200, Patrick McHardy wrote:

> I'm guessing that wouldn't allow doing unicast filtering for
> the guests on the real device without hacking the bridge code for
> this special case. 

For ingress (I guess you could say for egress as well): we can do it
today as well, with tc filtering on the host - it is involved but is part
of provisioning for a guest IMO.
A substantial number of Ethernet switches (ok, not the $5 ones) do
filtering at the same level.

> The difference to a real bridge is that all
> the addresses are completely known in advance, so it doesn't need
> promiscuous mode for learning.

You mean the per-virtual MAC addresses are known in advance, right?
This is fine. The bridging or otherwise (like L3 etc) is for
interconnecting once you have the provisioning done. And you could build
different "broadcast domains" by having multiple bridges.

To go off on a slight tangent:
I think you have to look at the two types of NICs separately 
1) dumb ones where you may have to use the mcast filters in hardware to
pretend you have a unicast address per virtual device - those will be
really hard to simulate using a separate netdevice per MAC address.
Actually your bigger problem on those is tx MAC address selection
because that is not built into the hardware. I still think even for
these types something above netdevice (bridge, L3 routing, tc action
redirect etc) will do.
2) The new NICs being built for virtualization; those allow you to
explicitly have clean separation of IO where the only thing that is
shared between virtual devices is the wire and the bus (otherwise
each has its own registers etc) i.e the hardware is designed with this
in mind. In such a case, i think a separate netdevice per single MAC
address - possibly tied to a separate CPU should work.

cheers,
jamal


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-29 12:54                   ` jamal
@ 2007-06-29 13:08                     ` Patrick McHardy
  2007-06-29 13:19                       ` jamal
  2007-06-29 15:33                       ` Ben Greear
  2007-06-29 21:31                     ` David Miller
  1 sibling, 2 replies; 60+ messages in thread
From: Patrick McHardy @ 2007-06-29 13:08 UTC (permalink / raw)
  To: hadi; +Cc: David Miller, peter.p.waskiewicz.jr, netdev, jeff, auke-jan.h.kok

jamal wrote:
> On Fri, 2007-29-06 at 13:59 +0200, Patrick McHardy wrote:
> 
> 
>>The difference to a real bridge is that all
>>the addresses are completely known in advance, so it doesn't need
>>promiscuous mode for learning.
> 
> 
> You mean the per-virtual MAC addresses are known in advance, right?

Yes.

> This is fine. The bridging or otherwise (like L3 etc) is for
> interconnecting once you have the provisioning done. And you could build
> different "broadcast domains" by having multiple bridges.


Right, but the current bridging code always uses promiscuous mode
and it's nice to avoid that if possible. Looking at the code, it
should be easy to avoid though by disabling learning (and thus
promiscuous mode) and adding unicast filters for all static fdb entries.

> To go off on a slight tangent:
> I think you have to look at the two types of NICs separately 
> 1) dumb ones where you may have to use the mcast filters in hardware to
> pretend you have a unicast address per virtual device - those will be
> really hard to simulate using a separate netdevice per MAC address.
> Actually your bigger problem on those is tx MAC address selection
> because that is not built into the hardware. I still think even for
> these types something above netdevice (bridge, L3 routing, tc action
> redirect etc) will do.


Have a look at my secondary unicast address patches in case you didn't
notice them before (there's also a driver example for e1000 on netdev):

http://git.kernel.org/?p=linux/kernel/git/davem/net-2.6.23.git;a=commit;h=306890b54dcbd168cdeea64f1630d2024febb5c7

You still need to do filtering in software, but you can have the NIC
pre-filter in case it supports it, otherwise it goes to promiscuous mode.
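
For instance (a sketch; dev_unicast_add() is from those patches, error
handling and locking kept minimal):

/* Have the host pre-filter for a guest's MAC by programming it as a
 * secondary unicast address on the real NIC. */
static int host_filter_guest_mac(struct net_device *real_dev, u8 *mac)
{
	int err;

	rtnl_lock();
	err = dev_unicast_add(real_dev, mac, ETH_ALEN);
	rtnl_unlock();

	return err;
}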

> 2) The new NICs being built for virtualization; those allow you to
> explicitly have clean separation of IO where the only thing that is
> shared between virtual devices is the wire and the bus (otherwise
> each has its own registers etc), i.e. the hardware is designed with this
> in mind. In such a case, I think a separate netdevice per single MAC
> address - possibly tied to a separate CPU should work.


Agreed, that could also be useful for non-virtualized use.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-29 13:08                     ` Patrick McHardy
@ 2007-06-29 13:19                       ` jamal
  2007-06-29 15:33                       ` Ben Greear
  1 sibling, 0 replies; 60+ messages in thread
From: jamal @ 2007-06-29 13:19 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: David Miller, peter.p.waskiewicz.jr, netdev, jeff, auke-jan.h.kok

On Fri, 2007-29-06 at 15:08 +0200, Patrick McHardy wrote:
> jamal wrote:
> > On Fri, 2007-29-06 at 13:59 +0200, Patrick McHardy wrote:

> Right, but the current bridging code always uses promiscuous mode,
> and it's nice to avoid that if possible. 
> Looking at the code, it
> should be easy to avoid, though, by disabling learning (and thus
> promiscuous mode) and adding unicast filters for all static fdb entries.
> 

Yes, that would do it for static provisioning (I suspect that would work
today, provided bridging has a knob to keep it from going promiscuous). But
you could even allow learning and just add extra tc filters in front of
the bridge to disallow things.

> Have a look at my secondary unicast address patches in case you didn't
> notice them before (there's also a driver example for e1000 on netdev):
> 
> http://git.kernel.org/?p=linux/kernel/git/davem/net-2.6.23.git;a=commit;h=306890b54dcbd168cdeea64f1630d2024febb5c7
> 
> You still need to do filtering in software, but you can have the NIC
> pre-filter if it supports it; otherwise it goes to promiscuous mode.
> 

OK, I will look at them when I get back. Sorry - haven't caught up on
netdev.

cheers,
jamal


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-29 13:08                     ` Patrick McHardy
  2007-06-29 13:19                       ` jamal
@ 2007-06-29 15:33                       ` Ben Greear
  2007-06-29 15:58                         ` Patrick McHardy
  2007-06-29 21:36                         ` David Miller
  1 sibling, 2 replies; 60+ messages in thread
From: Ben Greear @ 2007-06-29 15:33 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: hadi, David Miller, peter.p.waskiewicz.jr, netdev, jeff,
	auke-jan.h.kok

Patrick McHardy wrote:
> Right, but the current bridging code always uses promiscuous mode,
> and it's nice to avoid that if possible. Looking at the code, it
> should be easy to avoid, though, by disabling learning (and thus
> promiscuous mode) and adding unicast filters for all static fdb entries.
>   
I am curious about why people are so hot to do away with promisc mode.
It seems to me that in a modern switched environment, there should only
very rarely be unicast packets received on an interface that does not
want to receive them.

Could someone give a quick example of when I am wrong and promisc mode
would allow a NIC to receive a significant number of packets not really
destined for it?

Thanks,
Ben


-- 
Ben Greear <greearb@candelatech.com> 
Candela Technologies Inc  http://www.candelatech.com



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-29 15:33                       ` Ben Greear
@ 2007-06-29 15:58                         ` Patrick McHardy
  2007-06-29 16:16                           ` Ben Greear
  2007-06-29 21:36                         ` David Miller
  1 sibling, 1 reply; 60+ messages in thread
From: Patrick McHardy @ 2007-06-29 15:58 UTC (permalink / raw)
  To: Ben Greear
  Cc: hadi, David Miller, peter.p.waskiewicz.jr, netdev, jeff,
	auke-jan.h.kok

Ben Greear wrote:
> Patrick McHardy wrote:
> 
>> Right, but the current bridging code always uses promiscuous mode,
>> and it's nice to avoid that if possible. Looking at the code, it
>> should be easy to avoid, though, by disabling learning (and thus
>> promiscuous mode) and adding unicast filters for all static fdb entries.
>>   
> 
> I am curious about why people are so hot to do away with promisc mode.
> It seems to me that in a modern switched environment, there should only
> very rarely be unicast packets received on an interface that does not
> want to receive them.


I don't know if that really was Dave's reason for handling it in a driver.

> Could someone give a quick example of when I am wrong and promisc mode
> would allow a NIC to receive a significant number of packets not really
> destined for it?


In a switched environment it won't have a big effect, I agree.
It might help avoid receiving unwanted multicast traffic, which
could be more significant than unicast.

Anyways, why be wasteful when it can be avoided .. :)


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-29 15:58                         ` Patrick McHardy
@ 2007-06-29 16:16                           ` Ben Greear
  0 siblings, 0 replies; 60+ messages in thread
From: Ben Greear @ 2007-06-29 16:16 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: hadi, David Miller, peter.p.waskiewicz.jr, netdev, jeff,
	auke-jan.h.kok

Patrick McHardy wrote:
> Ben Greear wrote:
>   
>> Could someone give a quick example of when I am wrong and promisc mode
>> would allow a NIC to receive a significant number of packets not really
>> destined for it?
>>     
>
>
> In a switched environment it won't have a big effect, I agree.
> It might help avoid receiving unwanted multicast traffic, which
> could be more significant than unicast.
>
> Anyways, why be wasteful when it can be avoided .. :)
>   
Ok, I had forgotten about multicast, thanks for the reminder!

-- 
Ben Greear <greearb@candelatech.com> 
Candela Technologies Inc  http://www.candelatech.com



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-29 12:54                   ` jamal
  2007-06-29 13:08                     ` Patrick McHardy
@ 2007-06-29 21:31                     ` David Miller
  2007-06-30  1:30                       ` jamal
  1 sibling, 1 reply; 60+ messages in thread
From: David Miller @ 2007-06-29 21:31 UTC (permalink / raw)
  To: hadi; +Cc: kaber, peter.p.waskiewicz.jr, netdev, jeff, auke-jan.h.kok


This conversation is already heading in a pointless direction, as
I feared it would.

Nobody is going to configure bridges, classification, tc, and all of
this other crap just for a simple virtualized guest networking device.

It's a confined and well defined case that doesn't need any of that.
You've got to be fucking kidding me if you think I'm going to go
through the bridging code and all of that layering instead of my
hash demux on transmit which is 4 or 5 lines of C code at best.

Such a suggestion is beyond stupid.

Maybe for the control node switch, yes, but not for the guest network
devices.
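
For concreteness, the kind of transmit demux meant here can be sketched
as follows (the softc/port layout and names are hypothetical, not the
actual sunvnet code):

/* Sketch: pick the output port by destination MAC.  An exact match
 * with a port's remote MAC selects that port; everything else
 * (off-box unicast, broadcast, multicast) goes to the control port. */
static struct example_port *example_tx_demux(struct example_softc *sc,
					     struct sk_buff *skb)
{
	struct ethhdr *eth = (struct ethhdr *)skb->data;
	int i;

	for (i = 0; i < sc->num_ports; i++)
		if (!compare_ether_addr(eth->h_dest, sc->ports[i].remote_mac))
			return &sc->ports[i];
	return sc->control_port;
}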

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-29 15:33                       ` Ben Greear
  2007-06-29 15:58                         ` Patrick McHardy
@ 2007-06-29 21:36                         ` David Miller
  2007-06-30  7:51                           ` Benny Amorsen
  1 sibling, 1 reply; 60+ messages in thread
From: David Miller @ 2007-06-29 21:36 UTC (permalink / raw)
  To: greearb; +Cc: kaber, hadi, peter.p.waskiewicz.jr, netdev, jeff, auke-jan.h.kok

From: Ben Greear <greearb@candelatech.com>
Date: Fri, 29 Jun 2007 08:33:06 -0700

> Patrick McHardy wrote:
> > Right, but the current bridging code always uses promiscuous mode,
> > and it's nice to avoid that if possible. Looking at the code, it
> > should be easy to avoid, though, by disabling learning (and thus
> > promiscuous mode) and adding unicast filters for all static fdb entries.
> >   
> I am curious about why people are so hot to do away with promisc mode.
> It seems to me that in a modern switched environment, there should only
> very rarely be unicast packets received on an interface that does not
> want to receive them.
> 
> Could someone give a quick example of when I am wrong and promisc mode
> would allow a NIC to receive a significant number of packets not really
> destined for it?

Your neighbour on the switch is being pummeled with multicast traffic,
and now you get to see it all too.

Switches don't obviate the cost of promiscuous mode; you keep wanting
to discuss this and think it doesn't matter, but it does.

And some people still use hubs, believe it or not.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-29 21:31                     ` David Miller
@ 2007-06-30  1:30                       ` jamal
  2007-06-30  4:35                         ` David Miller
  0 siblings, 1 reply; 60+ messages in thread
From: jamal @ 2007-06-30  1:30 UTC (permalink / raw)
  To: David Miller; +Cc: kaber, peter.p.waskiewicz.jr, netdev, jeff, auke-jan.h.kok

On Fri, 2007-29-06 at 14:31 -0700, David Miller wrote:
> This conversation is already heading in a pointless direction, as
> I feared it would.
> 
> Nobody is going to configure bridges, classification, tc, and all of
> this other crap just for a simple virtualized guest networking device.
> 
> It's a confined and well defined case that doesn't need any of that.
> You've got to be fucking kidding me if you think I'm going to go
> through the bridging code and all of that layering instead of my
> hash demux on transmit which is 4 or 5 lines of C code at best.
> 
> Such a suggestion is beyond stupid.
> 

OK, calm down - will you please?
If you are soliciting opinions, then you should expect all
sorts of answers; otherwise, why bother posting? If you think you are
misunderstood, just clarify. Otherwise you are being totally unreasonable.

> Maybe for the control node switch, yes, but not for the guest network
> devices.

And that is precisely what I was talking about - and I am sure that's how
the discussion with Patrick went.

cheers,
jamal


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-30  1:30                       ` jamal
@ 2007-06-30  4:35                         ` David Miller
  2007-06-30 14:52                           ` jamal
  0 siblings, 1 reply; 60+ messages in thread
From: David Miller @ 2007-06-30  4:35 UTC (permalink / raw)
  To: hadi; +Cc: kaber, peter.p.waskiewicz.jr, netdev, jeff, auke-jan.h.kok

From: jamal <hadi@cyberus.ca>
Date: Fri, 29 Jun 2007 21:30:53 -0400

> On Fri, 2007-29-06 at 14:31 -0700, David Miller wrote:
> > Maybe for the control node switch, yes, but not for the guest network
> > devices.
> 
> And that is precisely what I was talking about - and I am sure that's how
> the discussion with Patrick went.

Awesome, but let's concentrate on the client since I can actually
implement and test anything we come up with :-)

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-29 21:36                         ` David Miller
@ 2007-06-30  7:51                           ` Benny Amorsen
  0 siblings, 0 replies; 60+ messages in thread
From: Benny Amorsen @ 2007-06-30  7:51 UTC (permalink / raw)
  To: netdev

>>>>> "DM" == David Miller <davem@davemloft.net> writes:

DM> And some people still use hubs, believe it or not.

Hubs are 100Mbps at most. You could of course make a flooding Gbps
switch, but it would be rather silly. If you care about multicast
performance, you get a switch with IGMP snooping.


/Benny



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-29  4:20             ` David Miller
  2007-06-29  8:45               ` Waskiewicz Jr, Peter P
  2007-06-29 11:43               ` Multiqueue and virtualization WAS(Re: " jamal
@ 2007-06-30 14:33               ` Patrick McHardy
  2007-06-30 14:37                 ` Waskiewicz Jr, Peter P
  2 siblings, 1 reply; 60+ messages in thread
From: Patrick McHardy @ 2007-06-30 14:33 UTC (permalink / raw)
  To: David Miller; +Cc: peter.p.waskiewicz.jr, netdev, jeff, auke-jan.h.kok, hadi

David Miller wrote:
> Now I get to pose a problem for everyone: prove to me how useful
> this new code is by showing me how it can be used to solve a
> recurring problem in virtualized network drivers, one of which I've
> had to code up recently; see my most recent blog entry at:
> 
> 	http://vger.kernel.org/~davem/cgi-bin/blog.cgi/index.html
> 
> Anyways the gist of the issue is (and this happens for Sun LDOMS
> networking, lguest, IBM iSeries, etc.) that we have a single
> virtualized network device.  There is a "port" to the control
> node (which switches packets to the real network for the guest)
> and one "port" to each of the other guests.
> 
> Each guest gets a unique MAC address.  There is a queue per-port
> that can fill up.
> 
> What all the drivers like this do right now is stop the queue if
> any of the per-port queues fill up, and that's what my sunvnet
> driver does right now as well.  We can thus only wake up the
> queue when all of the ports have some space.
> 
> The ports (and thus the queues) are selected by destination
> MAC address.  Each port has a remote MAC address, if there
> is an exact match with a port's remote MAC we'd use that port
> and thus that port's queue.  If there is no exact match
> (some other node on the real network, broadcast, multicast,
> etc.) we want to use the control node's port and port queue.
> 
> So the problem to solve is to make a way for drivers to do the queue
> selection before the generic queueing layer starts to try and push
> things to the driver.  Perhaps a classifier in the driver or similar.


That sounds like the only reasonable possibility if you really
do want to use queues. Another possibility would be to not use
a queue, make the whole thing unreliable, and treat full rx
rings of the guests as "loss on the wire". Not sure if that makes
any sense.

I was thinking about adding a way for (multiqueue) drivers to use
other default qdiscs than pfifo_fast so they can default to a
multiband prio or something else that makes sense for them ..
maybe a dev->qdisc_setup hook that is invoked from dev_activate.
They would need to be able to add a default classifier for this
to have any effect (the grand plan is to get rid of the horrible
wme scheduler). Specialized classifiers like your dst MAC classifier
and maybe even WME should then probably be built into the driver and
not register with the API, so they don't become globally visible.
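
A minimal sketch of the hook being proposed (entirely hypothetical;
neither the field nor the call site exists today):

/* Sketch: a driver-supplied default-qdisc hook.  dev_activate() would
 * call it instead of unconditionally attaching pfifo_fast.  Both the
 * qdisc_setup field and the helper below are proposals, not real API. */
static void example_dev_activate_default(struct net_device *dev)
{
	if (dev->qdisc_setup)			/* proposed field */
		dev->qdisc_setup(dev);		/* e.g. attach multiband prio */
	else
		example_attach_pfifo_fast(dev);	/* hypothetical helper */
}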

> The solution to this problem generalizes to the other facility
> we want now, hashing the transmit queue by smp_processor_id()
> or similar.  With that in place we can look at doing the TX locking
> per-queue too as is hinted at by the comments above the per-queue
> structure in the current net-2.6.23 tree.
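
(For reference, the CPU-based selection quoted above is essentially a
one-liner using the egress_subqueue_count field these patches add; a
sketch:)

/* Sketch: pin each CPU to a tx subqueue. */
static u16 example_cpu_select_queue(struct net_device *dev)
{
	return smp_processor_id() % dev->egress_subqueue_count;
}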


It would be great if we could finally get a working e1000 multiqueue
patch so work in this area can actually be tested.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-30 14:33               ` Patrick McHardy
@ 2007-06-30 14:37                 ` Waskiewicz Jr, Peter P
  0 siblings, 0 replies; 60+ messages in thread
From: Waskiewicz Jr, Peter P @ 2007-06-30 14:37 UTC (permalink / raw)
  To: Patrick McHardy, David Miller; +Cc: netdev, jeff, Kok, Auke-jan H, hadi

> It would be great if we could finally get a working e1000 
> multiqueue patch so work in this area can actually be tested.

I'm actively working on this right now.  I'm on vacation next week, but
hopefully I can get something working before I leave OLS and post it.

-PJ

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-30  4:35                         ` David Miller
@ 2007-06-30 14:52                           ` jamal
  2007-06-30 20:33                             ` David Miller
  0 siblings, 1 reply; 60+ messages in thread
From: jamal @ 2007-06-30 14:52 UTC (permalink / raw)
  To: David Miller; +Cc: kaber, peter.p.waskiewicz.jr, netdev, jeff, auke-jan.h.kok

On Fri, 2007-29-06 at 21:35 -0700, David Miller wrote:

> Awesome, but let's concentrate on the client since I can actually
> implement and test anything we come up with :-)

Ok, you need to clear one premise for me then ;->
You said the model is for the guest/client to have a port to the
host and one to each guest; I think this is the confusing part for me
(and may have led to the switch discussion) because I have not seen this
model used before. What I have seen before is that the host side
connects the different guests. In such a scenario, the guest has a
single port that connects to the host - the host worries (let's forget
the switch/bridge for a sec) about how to get packets from guestX to
guestY, pending consultation of access control details.
What is the advantage of a direct domain-domain connection? Is it
scalable?

cheers,
jamal


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-30 14:52                           ` jamal
@ 2007-06-30 20:33                             ` David Miller
  2007-07-03 12:42                               ` jamal
  0 siblings, 1 reply; 60+ messages in thread
From: David Miller @ 2007-06-30 20:33 UTC (permalink / raw)
  To: hadi; +Cc: kaber, peter.p.waskiewicz.jr, netdev, jeff, auke-jan.h.kok

From: jamal <hadi@cyberus.ca>
Date: Sat, 30 Jun 2007 10:52:44 -0400

> On Fri, 2007-29-06 at 21:35 -0700, David Miller wrote:
> 
> > Awesome, but let's concentrate on the client since I can actually
> > implement and test anything we come up with :-)
> 
> Ok, you need to clear one premise for me then ;->
> You said the model is for the guest/client to have a port to the
> host and one to each guest; I think this is the confusing part for me
> (and may have led to the switch discussion) because I have not seen this
> model used before. What I have seen before is that the host side
> connects the different guests. In such a scenario, the guest has a
> single port that connects to the host - the host worries (let's forget
> the switch/bridge for a sec) about how to get packets from guestX to
> guestY, pending consultation of access control details.
> What is the advantage of a direct domain-domain connection? Is it
> scalable?

It's like twice as fast, since the switch doesn't have to copy
the packet in, switch it, and then have the destination guest copy it
into its address space.

There is approximately one copy for each hop you take through these
virtual devices.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-30 20:33                             ` David Miller
@ 2007-07-03 12:42                               ` jamal
  2007-07-03 21:24                                 ` David Miller
  0 siblings, 1 reply; 60+ messages in thread
From: jamal @ 2007-07-03 12:42 UTC (permalink / raw)
  To: David Miller; +Cc: kaber, peter.p.waskiewicz.jr, netdev, jeff, auke-jan.h.kok

On Sat, 2007-30-06 at 13:33 -0700, David Miller wrote:

> It's like twice as fast, since the switch doesn't have to copy
> the packet in, switch it, and then have the destination guest copy it
> into its address space.
> 
> There is approximately one copy for each hop you take through these
> virtual devices.

Ok - I see what you are getting at, and while it makes more sense to me
now, let me continue to be _the_ devil's advocate (sip some espresso
before responding or reading):
for some reason I always thought that packets going across these things
(likely not in the case of hypervisor-based virtualization like Xen)
just have their skbs cloned when crossing domains; is that not the
case?[1]
Assuming they copy, the balance that needs to be struck now is
between:

a) copy is expensive
vs
b1) For N guests, N^2 queues in the system vs N queues and 1-vs-N
replicated global info.
b2) The architectural challenge of now having to deal
with a mesh (1-1 mapping) instead of a star topology between the guests.

I don't think #b1 is such a big deal; in the old days when I played
with what is now openvz, I was happy to get 1024 virtual routers/guests
(each running Zebra/OSPF). I could live with a little more wasted memory
if the copying is reduced.
I think subconsciously I am questioning #b2. Do you really need that
sacrifice just so that you can avoid one extra copy between two guests?
If I were running virtual routers or servers, I think the majority of
traffic (by far) would be between a domain and the outside of the box, not
between any two domains within the same box.

cheers,
jamal


[1] But then if this is true, I can think of a simple way to attack the
other domains: insert a kernel module into a domain that reduces
the refcount of each received skb to 0. I would be surprised if the
openvz-type approach hasn't thought this through.



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-07-03 12:42                               ` jamal
@ 2007-07-03 21:24                                 ` David Miller
  2007-07-04  2:20                                   ` jamal
  0 siblings, 1 reply; 60+ messages in thread
From: David Miller @ 2007-07-03 21:24 UTC (permalink / raw)
  To: hadi; +Cc: kaber, peter.p.waskiewicz.jr, netdev, jeff, auke-jan.h.kok

From: jamal <hadi@cyberus.ca>
Date: Tue, 03 Jul 2007 08:42:33 -0400

> (likely not in the case of hypervisor-based virtualization like Xen)
> just have their skbs cloned when crossing domains; is that not the
> case?[1]
> Assuming they copy, the balance that needs to be struck now is
> between:

Sigh, I kind of hoped I wouldn't have to give a lesson in
hypervisors and virtualized I/O and all the issues contained
within, but if you keep pushing the "avoid the copy" idea I
guess I am forced to educate. :-)

First, keep in mind that my Linux guest drivers are talking
to Solaris control node servers and switches; I cannot control
the API for any of this stuff.  And I think that's a good thing,
in fact.

Exporting memory between nodes is _THE_ problem with virtualized I/O
in hypervisor-based systems.

These things should even be able to work between two guests that
simply DO NOT trust each other at all.

With that in mind, the hypervisor provides a very small shim
interface layer for exporting memory between two nodes.  There is a
pseudo-pagetable where you export pages, and a set of interfaces,
one of which copies to/from imported memory to/from local memory.

If a guest hangs, reboots, or crashes, you have
to be able to revoke the memory the remote node has imported.
When this happens, if the importing node comes back to life and
tries to touch those pages, it takes a fault.

Taking a fault is easy if the nodes go through the hypervisor copy
interface; they just get a return value back.  If, instead, you try to
map in those pages or program them into the IOMMU of the PCI
controller, you get faults, and extremely difficult-to-handle faults
at that.  If the IOMMU takes the exception on a revoked page, your
E1000 card resets when it gets the master abort from the PCI
controller.  On the CPU side you have to annotate every single kernel
access to this memory mapping of imported pages, just like we have to
annotate all userspace accesses with exception tables mapping load and
store instructions to fixup code, in order to handle the fault
correctly.

Next, you don't trust the other end, as we already stated, so you
can't export an object in a page that other objects also live in.  For
example, if an SKB's data sits in the same page as the plain-text
password the user just typed in, you can't export that page.

That's why you have to copy into a purpose-built set of memory
that is composed of pages that _ONLY_ contain TX packet buffers
and nothing else.

The cost of going through the switch is too high, and the copies are
necessary, so concentrate on allowing me to map the guest ports to the
egress queues.  Anything else is a waste of discussion time; I've been
poring over these issues endlessly for weeks, so if I'm saying doing
copies and avoiding the switch is necessary, I do in fact mean it. :-)

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-07-03 21:24                                 ` David Miller
@ 2007-07-04  2:20                                   ` jamal
  2007-07-06  7:32                                     ` Rusty Russell
  0 siblings, 1 reply; 60+ messages in thread
From: jamal @ 2007-07-04  2:20 UTC (permalink / raw)
  To: David Miller; +Cc: kaber, peter.p.waskiewicz.jr, netdev, jeff, auke-jan.h.kok

On Tue, 2007-03-07 at 14:24 -0700, David Miller wrote:
[.. some useful stuff here deleted ..]

> That's why you have to copy into a purpose-built set of memory
> that is composed of pages that _ONLY_ contain TX packet buffers
> and nothing else.
> 
> The cost of going through the switch is too high, and the copies are
> necessary, so concentrate on allowing me to map the guest ports to the
> egress queues.  Anything else is a waste of discussion time; I've been
> poring over these issues endlessly for weeks, so if I'm saying doing
> copies and avoiding the switch is necessary, I do in fact mean it. :-)

OK, I get it, Dave ;-> Thanks for your patience; that was useful.
Now that this is clear to me, I will go back and look at your original email
and try to get back on track to what you really asked ;->

cheers,
jamal


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-07-04  2:20                                   ` jamal
@ 2007-07-06  7:32                                     ` Rusty Russell
  2007-07-06 14:39                                       ` jamal
  0 siblings, 1 reply; 60+ messages in thread
From: Rusty Russell @ 2007-07-06  7:32 UTC (permalink / raw)
  To: hadi
  Cc: David Miller, kaber, peter.p.waskiewicz.jr, netdev, jeff,
	auke-jan.h.kok

On Tue, 2007-07-03 at 22:20 -0400, jamal wrote:
> On Tue, 2007-03-07 at 14:24 -0700, David Miller wrote:
> [.. some useful stuff here deleted ..]
> 
> > That's why you have to copy into a purpose-built set of memory
> > that is composed of pages that _ONLY_ contain TX packet buffers
> > and nothing else.
> > 
> > The cost of going through the switch is too high, and the copies are
> > necessary, so concentrate on allowing me to map the guest ports to the
> > egress queues.  Anything else is a waste of discussion time; I've been
> > poring over these issues endlessly for weeks, so if I'm saying doing
> > copies and avoiding the switch is necessary, I do in fact mean it. :-)
> 
> OK, I get it, Dave ;-> Thanks for your patience; that was useful.
> Now that this is clear to me, I will go back and look at your original email
> and try to get back on track to what you really asked ;->

To expand on this, there are already "virtual" NIC drivers in tree which
do the demux based on dst MAC and send to the appropriate other guest
(iseries_veth.c, and Carsten Otte said the S/390 drivers do too).  lguest
and DaveM's LDOM make two more.

There is currently no good way to write such a driver.  If one recipient
is full, you have to drop the packet: if you netif_stop_queue, it means
a slow/buggy recipient blocks packets going to other recipients.  But
dropping packets makes networking suck.
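
With the subqueue API from this patchset, the per-recipient alternative
can be sketched like this (the port layout and the ring-full test are
hypothetical driver details):

/* Sketch: stall only the tx subqueue feeding the full recipient,
 * instead of netif_stop_queue() stalling everyone.
 * example_port_ring_full() and ->queue_index are made up. */
static void example_port_flow_control(struct net_device *dev,
				      struct example_port *port)
{
	if (example_port_ring_full(port))
		netif_stop_subqueue(dev, port->queue_index);
	else if (netif_subqueue_stopped(dev, port->queue_index))
		netif_wake_subqueue(dev, port->queue_index);
}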

Some hypervisors (e.g. Xen) only have a virtual NIC which is
point-to-point: this sidesteps the issue, with the risk that you might
need a huge number of virtual NICs if you wanted arbitrary guests to
talk to each other (Xen doesn't support that; they route/bridge through
dom0).

Most hypervisors have a sensible maximum on the number of guests they
can talk to, so I'm not too unhappy with a static number of queues.
But the dstmac -> queue mapping changes in hypervisor-specific ways, so
it really needs to be managed by the driver...

Hope that adds something,
Rusty.



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-07-06  7:32                                     ` Rusty Russell
@ 2007-07-06 14:39                                       ` jamal
  2007-07-06 15:59                                         ` James Chapman
                                                           ` (2 more replies)
  0 siblings, 3 replies; 60+ messages in thread
From: jamal @ 2007-07-06 14:39 UTC (permalink / raw)
  To: Rusty Russell
  Cc: David Miller, kaber, peter.p.waskiewicz.jr, netdev, jeff,
	auke-jan.h.kok

On Fri, 2007-06-07 at 17:32 +1000, Rusty Russell wrote:

[..some good stuff deleted here ..]

> Hope that adds something,

It does - thanks. 
 
I think I was letting my experience pollute my thinking earlier when
Dave posted. The copy-avoidance requirement is clear to me[1].

I had another issue which wasn't clear, but you touched on it, so this
breaks the ice for me - be gentle please; my asbestos suit is full of
dust these days ;->:

The first thing that crossed my mind was "if you want to select a
destination port based on a destination MAC you are talking about a
switch/bridge". You bring up the issue of "a huge number of virtual NICs
if you wanted arbitrary guests", which is a real one[2].
Let's take the case of a small number of guests; a bridge of course would
solve the copy-avoidance problem, with the caveats being:
- you now have N bridges and their respective tables for N domains, i.e.
one on each domain
- N netdevices on each domain as well (of course you could say that is
not very different resource-wise from N queues instead).
If I got this right, still not answering the netif_stop question posed:
the problem you are also trying to resolve now is to get rid of N
netdevices on each guest for a usability reason; i.e. have one netdevice,
move the bridging/switching functionality/tables into the driver, and
replace the ports with queues instead of netdevices. Did I get that
right?
If the issue is usability of listing 1024 netdevices, I can think of
many ways to resolve it.
One way to resolve the listing is with a simple tag in the netdev
struct: I could then say "list netdevices for guest 0-10", etc.
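
(A sketch of the tag idea, with a hypothetical dev->group field that
does not exist today:)

/* Sketch: filter a device dump by a proposed per-guest tag. */
static void example_list_group(int wanted_group)
{
	struct net_device *dev;

	for_each_netdev(dev) {
		if (dev->group != wanted_group)	/* proposed field */
			continue;
		printk(KERN_INFO "%s\n", dev->name);
	}
}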

I am having a little problem conceptually differentiating the case of a
guest from that of the host/dom0 if you want to migrate the
switching/bridging functions into each guest. So even if this doesn't
apply to all domains, it does apply to the dom0.
I like netdevices today (as opposed to queues within netdevices):
- the stack knows them well (I can add IP addresses, I can point routes
at them, I can change MAC addresses, I can bring them administratively
down/up, I can add QoS rules, etc.).
I can also tie netdevices to a CPU and therefore scale that way. I see
this as viable, at least from the host/dom0 perspective, if a netdevice
represents a guest.

Sorry for the long email - drained some of my morning coffee.
Ok, kill me.

cheers,
jamal

[1] My experience is around qemu/uml/old-openvz - their model is to let
the host do the routing/switching between guests or to the outside of the
box. From your description I would add Xen to that behavior.
From Dave's posting, I understand that, for many good reasons, any time
you move from one domain to another you are copying. So if you
use Xen and you want to go from domainX to domainY, you go to dom0, which
implies copying domainX->dom0 then dom0->domainY.
BTW, one curve that threw me off a little is that it seems most of the
hardware that provides virtualization also provides point-to-point
connections between different domains; I always thought that they all
provided a point-to-point link to the dom0 equivalent and let the dom0
worry about how things get from domainX to domainY.

[2] Unfortunately that means if I wanted 1024 virtual routers/guest
domains, I would have at least 1024 netdevices on each guest connected to
the bridge on the guest. I have a freaking problem listing 72 netdevices
today on some device I have.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-07-06 14:39                                       ` jamal
@ 2007-07-06 15:59                                         ` James Chapman
  2007-07-08  2:30                                         ` Rusty Russell
  2007-07-08  6:03                                         ` David Miller
  2 siblings, 0 replies; 60+ messages in thread
From: James Chapman @ 2007-07-06 15:59 UTC (permalink / raw)
  To: hadi
  Cc: Rusty Russell, David Miller, kaber, peter.p.waskiewicz.jr, netdev,
	jeff, auke-jan.h.kok

jamal wrote:
> If the issue is usability of listing 1024 netdevices, I can think of
> many ways to resolve it.
> One way to resolve the listing is with a simple tag in the netdev
> struct: I could then say "list netdevices for guest 0-10", etc.

This would be a useful feature, not only for virtualization. I've seen 
some boxes with thousands of net devices (mostly ppp, but also some 
ATM). It would be nice to be able to assign a tag to an arbitrary set of 
devices.

Does the network namespace stuff help with any of this?

-- 
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-07-06 14:39                                       ` jamal
  2007-07-06 15:59                                         ` James Chapman
@ 2007-07-08  2:30                                         ` Rusty Russell
  2007-07-08  6:03                                         ` David Miller
  2 siblings, 0 replies; 60+ messages in thread
From: Rusty Russell @ 2007-07-08  2:30 UTC (permalink / raw)
  To: hadi
  Cc: David Miller, kaber, peter.p.waskiewicz.jr, netdev, jeff,
	auke-jan.h.kok

On Fri, 2007-07-06 at 10:39 -0400, jamal wrote:
> The first thing that crossed my mind was "if you want to select a
> destination port based on a destination MAC you are talking about a
> switch/bridge". You bring up the issue of "a huge number of virtual NICs
> if you wanted arbitrary guests", which is a real one[2].

Hi Jamal,

	I'm deeply tempted to agree with you that the answer is multiple
virtual NICs (and I've been tempted to abandon lguest's N-way transport
scheme), except that it looks like we're going to have multi-queue NICs
for other reasons.

	Otherwise I'd be tempted to say "create/destroy virtual NICs as other
guests appear/vanish from the network".  No one does this today, but that
doesn't make it wrong.

> If I got this right, still not answering the netif_stop question posed:
> the problem you are also trying to resolve now is to get rid of N
> netdevices on each guest for a usability reason; i.e. have one netdevice,
> move the bridging/switching functionality/tables into the driver, and
> replace the ports with queues instead of netdevices. Did I get that
> right?

	Yep, well summarized.  I guess the question is: should the Intel guys
be representing their multi-queue NICs as multiple NICs rather than
adding the subqueue concept?

> BTW, one curve that threw me off a little is that it seems most of the
> hardware that provides virtualization also provides point-to-point
> connections between different domains; I always thought that they all
> provided a point-to-point link to the dom0 equivalent and let the dom0
> worry about how things get from domainX to domainY.

Yeah, but that has obvious limitations as people care more about
inter-guest I/O: we want direct inter-guest networking...

Cheers,
Rusty.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-07-06 14:39                                       ` jamal
  2007-07-06 15:59                                         ` James Chapman
  2007-07-08  2:30                                         ` Rusty Russell
@ 2007-07-08  6:03                                         ` David Miller
  2 siblings, 0 replies; 60+ messages in thread
From: David Miller @ 2007-07-08  6:03 UTC (permalink / raw)
  To: hadi; +Cc: rusty, kaber, peter.p.waskiewicz.jr, netdev, jeff, auke-jan.h.kok

From: jamal <hadi@cyberus.ca>
Date: Fri, 06 Jul 2007 10:39:15 -0400

> If the issue is usability of listing 1024 netdevices, I can think of
> many ways to resolve it.

I would agree with this if there were a reason for it; it's totally
unnecessary complication as far as I can see.

These virtual devices are an ethernet with the subnet details exposed
to the driver, nothing more.

I see zero benefit to having a netdev for each guest or node we can
speak to whatsoever.  It's a very heavy abstraction to use for
something that is so bloody simple.

My demux on ->hard_start_xmit() is _5 DAMN LINES OF CODE_; you want to
replace that with a full netdev because of some minor difficulties in
figuring out how to record the queuing state.  It's beyond unreasonable.

Netdevs are like salt: if you put too much in your food, it tastes
awful.

^ permalink raw reply	[flat|nested] 60+ messages in thread

end of thread, other threads:[~2007-07-08  6:02 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-06-28 16:20 [PATCH] NET: Multiple queue hardware support PJ Waskiewicz
2007-06-28 16:21 ` [PATCH 1/3] NET: [DOC] Multiqueue hardware support documentation PJ Waskiewicz
2007-06-28 16:21 ` [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API PJ Waskiewicz
2007-06-28 16:31   ` Patrick McHardy
2007-06-28 17:00   ` Patrick McHardy
2007-06-28 19:00     ` Waskiewicz Jr, Peter P
2007-06-28 19:03       ` Patrick McHardy
2007-06-28 19:06         ` Waskiewicz Jr, Peter P
2007-06-28 19:20           ` Patrick McHardy
2007-06-28 19:32             ` Jeff Garzik
2007-06-28 19:37               ` Patrick McHardy
2007-06-28 21:11                 ` Waskiewicz Jr, Peter P
2007-06-28 21:18                   ` Patrick McHardy
2007-06-28 23:08                     ` Waskiewicz Jr, Peter P
2007-06-28 23:31                       ` David Miller
2007-06-28 20:39               ` David Miller
2007-06-29  3:39   ` David Miller
2007-06-29 10:54     ` Jeff Garzik
2007-06-28 16:21 ` [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue PJ Waskiewicz
2007-06-28 16:35   ` Patrick McHardy
2007-06-28 16:43     ` Waskiewicz Jr, Peter P
2007-06-28 16:46       ` Patrick McHardy
2007-06-28 16:50         ` Waskiewicz Jr, Peter P
2007-06-28 16:53           ` Patrick McHardy
2007-06-28 16:50     ` Patrick McHardy
2007-06-28 17:13   ` Patrick McHardy
2007-06-28 19:04     ` Waskiewicz Jr, Peter P
2007-06-28 19:17       ` Patrick McHardy
2007-06-28 19:21         ` Waskiewicz Jr, Peter P
2007-06-28 19:24           ` Patrick McHardy
2007-06-28 19:27             ` Waskiewicz Jr, Peter P
2007-06-29  4:20             ` David Miller
2007-06-29  8:45               ` Waskiewicz Jr, Peter P
2007-06-29 11:43               ` Multiqueue and virtualization WAS(Re: " jamal
2007-06-29 11:59                 ` Patrick McHardy
2007-06-29 12:54                   ` jamal
2007-06-29 13:08                     ` Patrick McHardy
2007-06-29 13:19                       ` jamal
2007-06-29 15:33                       ` Ben Greear
2007-06-29 15:58                         ` Patrick McHardy
2007-06-29 16:16                           ` Ben Greear
2007-06-29 21:36                         ` David Miller
2007-06-30  7:51                           ` Benny Amorsen
2007-06-29 21:31                     ` David Miller
2007-06-30  1:30                       ` jamal
2007-06-30  4:35                         ` David Miller
2007-06-30 14:52                           ` jamal
2007-06-30 20:33                             ` David Miller
2007-07-03 12:42                               ` jamal
2007-07-03 21:24                                 ` David Miller
2007-07-04  2:20                                   ` jamal
2007-07-06  7:32                                     ` Rusty Russell
2007-07-06 14:39                                       ` jamal
2007-07-06 15:59                                         ` James Chapman
2007-07-08  2:30                                         ` Rusty Russell
2007-07-08  6:03                                         ` David Miller
2007-06-30 14:33               ` Patrick McHardy
2007-06-30 14:37                 ` Waskiewicz Jr, Peter P
2007-06-28 17:57 ` [CORE] Stack changes to add multiqueue hardware support API Patrick McHardy
2007-06-28 17:57 ` [SCHED] Qdisc changes and sch_rr added for multiqueue Patrick McHardy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).