[PATCH] NET: Multiple queue hardware support

* [PATCH] NET: Multiple queue hardware support
@ 2007-06-23 21:36 PJ Waskiewicz
  2007-06-23 21:36 ` [PATCH 1/3] NET: [DOC] Multiqueue hardware support documentation PJ Waskiewicz
                   ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: PJ Waskiewicz @ 2007-06-23 21:36 UTC
  To: davem; +Cc: netdev, jeff, auke-jan.h.kok, hadi, kaber

Please consider these patches for 2.6.23 inclusion.

These patches are built against Patrick McHardy's recently submitted
RTNETLINK nested compat attribute patches.  They're needed to preserve
ABI between sch_{rr|prio} and iproute2.

Updates since the last submission:

1. Added checks for netif_subqueue_stopped() to net/core/netpoll.c,
   net/core/pktgen.c, and to software device hard_start_xmit in
   dev_queue_xmit().

2. Removed TCA_PRIO_TEST and added TCA_PRIO_MQ for sch_prio and sch_rr.

3. Fixed dependency issues in net/sched/Kconfig with NET_SCH_RR.

4. Implemented the new nested compat attribute API for MQ in NET_SCH_PRIO
   and NET_SCH_RR.

5. Allow sch_rr and sch_prio to turn multiqueue hardware support on and off
   at load time.

This patchset is an updated version of previous multiqueue network device
support patches.  The general approach of introducing a new API for multiqueue
network devices to register with the stack has remained.  The changes include
adding a round-robin qdisc, heavily based on sch_prio, which will allow
queueing to hardware with no OS-enforced queuing policy.  sch_prio still has
the multiqueue code in it, but has a Kconfig option to compile it out of the
qdisc.  This allows people with hardware containing scheduling policies to
use sch_rr (round-robin), and others without scheduling policies in hardware
to continue using sch_prio if they wish to have some notion of scheduling
priority.

The patches being sent are split into Documentation, Qdisc changes, and
core stack changes.  The requested e1000 changes are still being resolved,
and will be sent at a later date.

The patches to iproute2 for tc will be sent separately, to support sch_rr.

-- 
PJ Waskiewicz <peter.p.waskiewicz.jr@intel.com>

* [PATCH 1/3] NET: [DOC] Multiqueue hardware support documentation
  2007-06-23 21:36 [PATCH] NET: Multiple queue hardware support PJ Waskiewicz
@ 2007-06-23 21:36 ` PJ Waskiewicz
  2007-06-23 21:36 ` [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API PJ Waskiewicz
  2007-06-23 21:36 ` [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue PJ Waskiewicz
  2 siblings, 0 replies; 31+ messages in thread
From: PJ Waskiewicz @ 2007-06-23 21:36 UTC
  To: davem; +Cc: netdev, jeff, auke-jan.h.kok, hadi, kaber

Add a brief howto to Documentation/networking for multiqueue.  It
explains how to use the multiqueue API in a driver to support
multiqueue paths from the stack, as well as the qdiscs to use for
feeding a multiqueue device.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
---

 Documentation/networking/multiqueue.txt |  106 +++++++++++++++++++++++++++++++
 1 files changed, 106 insertions(+), 0 deletions(-)

diff --git a/Documentation/networking/multiqueue.txt b/Documentation/networking/multiqueue.txt
new file mode 100644
index 0000000..b7ede56
--- /dev/null
+++ b/Documentation/networking/multiqueue.txt
@@ -0,0 +1,106 @@
+
+		HOWTO for multiqueue network device support
+		===========================================
+
+Section 1: Base driver requirements for implementing multiqueue support
+Section 2: Qdisc support for multiqueue devices
+Section 3: Brief howto using PRIO and RR for multiqueue devices
+
+
+Intro: Kernel support for multiqueue devices
+--------------------------------------------
+
+Kernel support for multiqueue devices is only an API that is presented to the
+netdevice layer for base drivers to implement.  This feature is part of the
+core networking stack, and all network devices will be running on the
+multiqueue-aware stack.  If a base driver only has one queue, then these
+changes are transparent to that driver.
+
+
+Section 1: Base driver requirements for implementing multiqueue support
+-----------------------------------------------------------------------
+
+Base drivers are required to use the new alloc_etherdev_mq() or
+alloc_netdev_mq() functions to allocate the subqueues for the device.  The
+underlying kernel API will take care of the allocation and deallocation of
+the subqueue memory, as well as netdev configuration of where the queues
+exist in memory.
+
+The base driver will also need to manage the queues as it does the global
+netdev->queue_lock today.  Therefore base drivers should use the
+netif_{start|stop|wake}_subqueue() functions to manage each queue while the
+device is still operational.  netdev->queue_lock is still used when the device
+comes online or when it's completely shut down (unregister_netdev(), etc.).
+
+Finally, the base driver should indicate that it is a multiqueue device.  The
+feature flag NETIF_F_MULTI_QUEUE should be added to the netdev->features
+bitmap on device initialization.  Below is an example from e1000:
+
+#ifdef CONFIG_E1000_MQ
+	if ( (adapter->hw.mac.type == e1000_82571) ||
+	     (adapter->hw.mac.type == e1000_82572) ||
+	     (adapter->hw.mac.type == e1000_80003es2lan))
+		netdev->features |= NETIF_F_MULTI_QUEUE;
+#endif
+
+
+Section 2: Qdisc support for multiqueue devices
+-----------------------------------------------
+
+Currently two qdiscs support multiqueue devices: a new round-robin qdisc,
+sch_rr, and sch_prio.  The qdisc is responsible for classifying the skb's to
+bands and queues, and will store the queue mapping into skb->queue_mapping.
+Use this field in the base driver to determine which queue to send the skb
+to.
+
+sch_rr has been added for hardware that doesn't want scheduling policies from
+software, so it's a straight round-robin qdisc.  It uses the same syntax and
+classification priomap that sch_prio uses, so it should be intuitive to
+configure for people who've used sch_prio.
+
+The PRIO qdisc naturally plugs into a multiqueue device.  If PRIO has been
+built with NET_SCH_PRIO_MQ, then upon load, it will make sure the number of
+bands requested is equal to the number of queues on the hardware.  If they
+are equal, it sets a one-to-one mapping up between the queues and bands.  If
+they're not equal, it will not load the qdisc.  This is the same behavior
+for RR.  Once the association is made, any skb that is classified will have
+skb->queue_mapping set, which will allow the driver to properly queue skb's
+to multiple queues.
+
+
+Section 3: Brief howto using PRIO and RR for multiqueue devices
+---------------------------------------------------------------
+
+The userspace command 'tc,' part of the iproute2 package, is used to configure
+qdiscs.  To add the PRIO qdisc to your network device, assuming the device is
+called eth0, run the following command:
+
+# tc qdisc add dev eth0 root handle 1: prio bands 4 multiqueue
+
+This will create 4 bands, 0 being highest priority, and associate those bands
+to the queues on your NIC.  Assuming eth0 has 4 Tx queues, the band mapping
+would look like:
+
+band 0 => queue 0
+band 1 => queue 1
+band 2 => queue 2
+band 3 => queue 3
+
+Traffic will begin flowing through each queue if your TOS values are assigning
+traffic across the various bands.  For example, ssh traffic will always try to
+go out band 0 based on TOS -> Linux priority conversion (realtime traffic),
+so it will be sent out queue 0.  ICMP traffic (pings) falls into the "normal"
+traffic classification, which is band 1.  Therefore pings will be sent out
+queue 1 on the NIC.
+
+Note the use of the multiqueue keyword.  This is only in versions of iproute2
+that support multiqueue networking devices; if this is omitted when loading
+a qdisc onto a multiqueue device, the qdisc will load and operate the same
+as if it were loaded onto a single-queue device (i.e., it sends all traffic to
+queue 0).
+
+The behavior of tc filters remains the same, where it will override TOS priority
+classification.
+
+
+Author: Peter P. Waskiewicz Jr. <peter.p.waskiewicz.jr@intel.com>
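
The band-to-queue walk-through in Section 3 of the howto can be sketched in
userspace C.  This is only a model, not kernel code: it assumes the default
priomap that tc uses for the prio qdisc (1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1) and
that no tc filter overrides the classification.  The helper names
band_for_priority() and queue_for_priority() are hypothetical.

```c
#include <assert.h>

#define TC_PRIO_MAX         15	/* same value as linux/pkt_sched.h */
#define TC_PRIO_BESTEFFORT  0	/* e.g. pings with TOS 0 */
#define TC_PRIO_INTERACTIVE 6	/* e.g. interactive ssh (low-delay TOS) */

/* Assumed default priomap: Linux priority -> band. */
static const unsigned char default_prio2band[TC_PRIO_MAX + 1] = {
	1, 2, 2, 2, 1, 2, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1
};

/* Hypothetical helper: band chosen when no filter matches. */
static unsigned int band_for_priority(unsigned int priority)
{
	return default_prio2band[priority & TC_PRIO_MAX];
}

/*
 * Hypothetical helper: value the qdisc would store in skb->queue_mapping.
 * With multiqueue enabled, bands map one-to-one onto queues; otherwise
 * everything goes to queue 0.
 */
static unsigned int queue_for_priority(unsigned int priority, int mq,
				       unsigned int num_queues)
{
	unsigned int band = band_for_priority(priority);

	/* prio_tune() rejects priomap entries >= bands, and bands must
	 * equal the queue count when multiqueue is on. */
	assert(band < num_queues);
	return mq ? band : 0;
}
```

With four queues and multiqueue on, queue_for_priority(TC_PRIO_INTERACTIVE, 1, 4)
gives queue 0 and queue_for_priority(TC_PRIO_BESTEFFORT, 1, 4) gives queue 1,
matching the ssh and ping examples in the howto.  Under this assumed priomap no
priority maps to band 3, so that band would only see traffic via a custom
priomap or a tc filter.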

* [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API
  2007-06-23 21:36 [PATCH] NET: Multiple queue hardware support PJ Waskiewicz
  2007-06-23 21:36 ` [PATCH 1/3] NET: [DOC] Multiqueue hardware support documentation PJ Waskiewicz
@ 2007-06-23 21:36 ` PJ Waskiewicz
  2007-06-24 12:00   ` Patrick McHardy
  2007-06-23 21:36 ` [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue PJ Waskiewicz
  2 siblings, 1 reply; 31+ messages in thread
From: PJ Waskiewicz @ 2007-06-23 21:36 UTC
  To: davem; +Cc: netdev, jeff, auke-jan.h.kok, hadi, kaber

Updated: Added checks for netif_subqueue_stopped() to netpoll,
pktgen, and software device dev_queue_xmit().  This will ensure
external events to these subsystems will be handled correctly if
a subqueue is shut down.

Add the multiqueue hardware device support API to the core network
stack.  Allow drivers to allocate multiple queues and manage them
at the netdev level if they choose to do so.

Added a new field to sk_buff, namely queue_mapping, for drivers to
know which tx_ring to select based on OS classification of the flow.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
---

 include/linux/etherdevice.h |    3 +-
 include/linux/netdevice.h   |   62 ++++++++++++++++++++++++++++++++++++++++++-
 include/linux/skbuff.h      |    4 ++-
 net/core/dev.c              |   27 +++++++++++++------
 net/core/netpoll.c          |    8 +++---
 net/core/pktgen.c           |   10 +++++--
 net/core/skbuff.c           |    3 ++
 net/ethernet/eth.c          |    9 +++---
 8 files changed, 104 insertions(+), 22 deletions(-)

diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h
index f48eb89..b3fbb54 100644
--- a/include/linux/etherdevice.h
+++ b/include/linux/etherdevice.h
@@ -39,7 +39,8 @@ extern void		eth_header_cache_update(struct hh_cache *hh, struct net_device *dev
 extern int		eth_header_cache(struct neighbour *neigh,
 					 struct hh_cache *hh);
 
-extern struct net_device *alloc_etherdev(int sizeof_priv);
+extern struct net_device *alloc_etherdev_mq(int sizeof_priv, int queue_count);
+#define alloc_etherdev(sizeof_priv) alloc_etherdev_mq(sizeof_priv, 1)
 
 /**
  * is_zero_ether_addr - Determine if give Ethernet address is all zeros.
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e7913ee..6509eb4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -108,6 +108,14 @@ struct wireless_dev;
 #define MAX_HEADER (LL_MAX_HEADER + 48)
 #endif
 
+struct net_device_subqueue
+{
+	/* Give a control state for each queue.  This struct may contain
+	 * per-queue locks in the future.
+	 */
+	unsigned long	state;
+};
+
 /*
  *	Network device statistics. Akin to the 2.0 ether stats but
  *	with byte counters.
@@ -325,6 +333,7 @@ struct net_device
 #define NETIF_F_VLAN_CHALLENGED	1024	/* Device cannot handle VLAN packets */
 #define NETIF_F_GSO		2048	/* Enable software GSO. */
 #define NETIF_F_LLTX		4096	/* LockLess TX */
+#define NETIF_F_MULTI_QUEUE	16384	/* Has multiple TX/RX queues */
 
 	/* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT	16
@@ -543,6 +552,10 @@ struct net_device
 
 	/* rtnetlink link ops */
 	const struct rtnl_link_ops *rtnl_link_ops;
+
+ 	/* The TX queue control structures */
+ 	int				egress_subqueue_count;
+ 	struct net_device_subqueue	egress_subqueue[0];
 };
 #define to_net_dev(d) container_of(d, struct net_device, dev)
 
@@ -705,6 +718,48 @@ static inline int netif_running(const struct net_device *dev)
 	return test_bit(__LINK_STATE_START, &dev->state);
 }
 
+/*
+ * Routines to manage the subqueues on a device.  We only need start,
+ * stop, wake, and a check if it's stopped.  All other device management is
+ * done at the overall netdevice level.
+ * Also test the device if we're multiqueue.
+ */
+static inline void netif_start_subqueue(struct net_device *dev, u16 queue_index)
+{
+	clear_bit(__LINK_STATE_XOFF, &dev->egress_subqueue[queue_index].state);
+}
+
+static inline void netif_stop_subqueue(struct net_device *dev, u16 queue_index)
+{
+#ifdef CONFIG_NETPOLL_TRAP
+	if (netpoll_trap())
+		return;
+#endif
+	set_bit(__LINK_STATE_XOFF, &dev->egress_subqueue[queue_index].state);
+}
+
+static inline int netif_subqueue_stopped(const struct net_device *dev,
+                                         u16 queue_index)
+{
+	return test_bit(__LINK_STATE_XOFF,
+	                &dev->egress_subqueue[queue_index].state);
+}
+
+static inline void netif_wake_subqueue(struct net_device *dev, u16 queue_index)
+{
+#ifdef CONFIG_NETPOLL_TRAP
+	if (netpoll_trap())
+		return;
+#endif
+	if (test_and_clear_bit(__LINK_STATE_XOFF,
+	                       &dev->egress_subqueue[queue_index].state))
+		__netif_schedule(dev);
+}
+
+static inline int netif_is_multiqueue(const struct net_device *dev)
+{
+	return (!!(NETIF_F_MULTI_QUEUE & dev->features));
+}
 
 /* Use this variant when it is known for sure that it
  * is executing from interrupt context.
@@ -995,8 +1050,11 @@ static inline void netif_tx_disable(struct net_device *dev)
 extern void		ether_setup(struct net_device *dev);
 
 /* Support for loadable net-drivers */
-extern struct net_device *alloc_netdev(int sizeof_priv, const char *name,
-				       void (*setup)(struct net_device *));
+extern struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
+					  void (*setup)(struct net_device *),
+					  int queue_count);
+#define alloc_netdev(sizeof_priv, name, setup) \
+	alloc_netdev_mq(sizeof_priv, name, setup, 1)
 extern int		register_netdev(struct net_device *dev);
 extern void		unregister_netdev(struct net_device *dev);
 /* Functions used for multicast support */
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index e7367c7..01b5e25 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -197,6 +197,7 @@ typedef unsigned char *sk_buff_data_t;
  *	@tstamp: Time we arrived
  *	@dev: Device we arrived on/are leaving by
  *	@iif: ifindex of device we arrived on
+ *	@queue_mapping: Queue mapping for multiqueue devices
  *	@transport_header: Transport layer header
  *	@network_header: Network layer header
  *	@mac_header: Link layer header
@@ -246,7 +247,8 @@ struct sk_buff {
 	ktime_t			tstamp;
 	struct net_device	*dev;
 	int			iif;
-	/* 4 byte hole on 64 bit*/
+	__u16			queue_mapping;
+	/* 2 byte hole on 64 bit*/
 
 	struct  dst_entry	*dst;
 	struct	sec_path	*sp;
diff --git a/net/core/dev.c b/net/core/dev.c
index 2609062..9ea8a47 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1429,7 +1429,9 @@ gso:
 			skb->next = nskb;
 			return rc;
 		}
-		if (unlikely(netif_queue_stopped(dev) && skb->next))
+		if (unlikely((netif_queue_stopped(dev) ||
+			     netif_subqueue_stopped(dev, skb->queue_mapping)) &&
+			     skb->next))
 			return NETDEV_TX_BUSY;
 	} while (skb->next);
 
@@ -1545,6 +1547,8 @@ gso:
 		spin_lock(&dev->queue_lock);
 		q = dev->qdisc;
 		if (q->enqueue) {
+			/* reset queue_mapping to zero */
+			skb->queue_mapping = 0;
 			rc = q->enqueue(skb, q);
 			qdisc_run(dev);
 			spin_unlock(&dev->queue_lock);
@@ -1574,7 +1578,8 @@ gso:
 
 			HARD_TX_LOCK(dev, cpu);
 
-			if (!netif_queue_stopped(dev)) {
+			if (!netif_queue_stopped(dev) &&
+			    !netif_subqueue_stopped(dev, skb->queue_mapping)) {
 				rc = 0;
 				if (!dev_hard_start_xmit(skb, dev)) {
 					HARD_TX_UNLOCK(dev);
@@ -3343,16 +3348,18 @@ static struct net_device_stats *internal_stats(struct net_device *dev)
 }
 
 /**
- *	alloc_netdev - allocate network device
+ *	alloc_netdev_mq - allocate network device
  *	@sizeof_priv:	size of private data to allocate space for
  *	@name:		device name format string
  *	@setup:		callback to initialize device
+ *	@queue_count:	the number of subqueues to allocate
  *
  *	Allocates a struct net_device with private data area for driver use
- *	and performs basic initialization.
+ *	and performs basic initialization.  Also allocates subqueue structs
+ *	for each queue on the device at the end of the netdevice.
  */
-struct net_device *alloc_netdev(int sizeof_priv, const char *name,
-		void (*setup)(struct net_device *))
+struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
+		void (*setup)(struct net_device *), int queue_count)
 {
 	void *p;
 	struct net_device *dev;
@@ -3361,7 +3368,9 @@ struct net_device *alloc_netdev(int sizeof_priv, const char *name,
 	BUG_ON(strlen(name) >= sizeof(dev->name));
 
 	/* ensure 32-byte alignment of both the device and private area */
-	alloc_size = (sizeof(*dev) + NETDEV_ALIGN_CONST) & ~NETDEV_ALIGN_CONST;
+	alloc_size = (sizeof(*dev) + NETDEV_ALIGN_CONST +
+		     (sizeof(struct net_device_subqueue) * (queue_count - 1))) &
+		     ~NETDEV_ALIGN_CONST;
 	alloc_size += sizeof_priv + NETDEV_ALIGN_CONST;
 
 	p = kzalloc(alloc_size, GFP_KERNEL);
@@ -3377,12 +3386,14 @@ struct net_device *alloc_netdev(int sizeof_priv, const char *name,
 	if (sizeof_priv)
 		dev->priv = netdev_priv(dev);
 
+  	dev->egress_subqueue_count = queue_count;
+
 	dev->get_stats = internal_stats;
 	setup(dev);
 	strcpy(dev->name, name);
 	return dev;
 }
-EXPORT_SYMBOL(alloc_netdev);
+EXPORT_SYMBOL(alloc_netdev_mq);
 
 /**
  *	free_netdev - free network device
diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 758dafe..aac8acf 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -66,8 +66,9 @@ static void queue_process(struct work_struct *work)
 
 		local_irq_save(flags);
 		netif_tx_lock(dev);
-		if (netif_queue_stopped(dev) ||
-		    dev->hard_start_xmit(skb, dev) != NETDEV_TX_OK) {
+		if ((netif_queue_stopped(dev) || 
+		     netif_subqueue_stopped(dev, skb->queue_mapping)) ||
+		     dev->hard_start_xmit(skb, dev) != NETDEV_TX_OK) {
 			skb_queue_head(&npinfo->txq, skb);
 			netif_tx_unlock(dev);
 			local_irq_restore(flags);
@@ -254,7 +255,8 @@ static void netpoll_send_skb(struct netpoll *np, struct sk_buff *skb)
 			/* try until next clock tick */
 			for (tries = jiffies_to_usecs(1)/USEC_PER_POLL;
 					tries > 0; --tries) {
-				if (!netif_queue_stopped(dev))
+				if (!netif_queue_stopped(dev) &&
+				    !netif_subqueue_stopped(dev, skb->queue_mapping))
 					status = dev->hard_start_xmit(skb, dev);
 
 				if (status == NETDEV_TX_OK)
diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index 9cd3a1c..dffe067 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -3139,7 +3139,9 @@ static __inline__ void pktgen_xmit(struct pktgen_dev *pkt_dev)
 		}
 	}
 
-	if (netif_queue_stopped(odev) || need_resched()) {
+	if ((netif_queue_stopped(odev) ||
+	     netif_subqueue_stopped(odev, pkt_dev->skb->queue_mapping)) ||
+	     need_resched()) {
 		idle_start = getCurUs();
 
 		if (!netif_running(odev)) {
@@ -3154,7 +3156,8 @@ static __inline__ void pktgen_xmit(struct pktgen_dev *pkt_dev)
 
 		pkt_dev->idle_acc += getCurUs() - idle_start;
 
-		if (netif_queue_stopped(odev)) {
+		if (netif_queue_stopped(odev) ||
+		    netif_subqueue_stopped(odev, pkt_dev->skb->queue_mapping)) {
 			pkt_dev->next_tx_us = getCurUs();	/* TODO */
 			pkt_dev->next_tx_ns = 0;
 			goto out;	/* Try the next interface */
@@ -3181,7 +3184,8 @@ static __inline__ void pktgen_xmit(struct pktgen_dev *pkt_dev)
 	}
 
 	netif_tx_lock_bh(odev);
-	if (!netif_queue_stopped(odev)) {
+	if (!netif_queue_stopped(odev) &&
+	    !netif_subqueue_stopped(odev, pkt_dev->skb->queue_mapping)) {
 
 		atomic_inc(&(pkt_dev->skb->users));
 	      retry_now:
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 7c6a34e..7bbed45 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -418,6 +418,7 @@ struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
 	n->nohdr = 0;
 	C(pkt_type);
 	C(ip_summed);
+	C(queue_mapping);
 	C(priority);
 #if defined(CONFIG_IP_VS) || defined(CONFIG_IP_VS_MODULE)
 	C(ipvs_property);
@@ -459,6 +460,7 @@ static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 #endif
 	new->sk		= NULL;
 	new->dev	= old->dev;
+	new->queue_mapping = old->queue_mapping;
 	new->priority	= old->priority;
 	new->protocol	= old->protocol;
 	new->dst	= dst_clone(old->dst);
@@ -1925,6 +1927,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, int features)
 		tail = nskb;
 
 		nskb->dev = skb->dev;
+		nskb->queue_mapping = skb->queue_mapping;
 		nskb->priority = skb->priority;
 		nskb->protocol = skb->protocol;
 		nskb->dst = dst_clone(skb->dst);
diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
index 0ac2524..87a509c 100644
--- a/net/ethernet/eth.c
+++ b/net/ethernet/eth.c
@@ -316,9 +316,10 @@ void ether_setup(struct net_device *dev)
 EXPORT_SYMBOL(ether_setup);
 
 /**
- * alloc_etherdev - Allocates and sets up an Ethernet device
+ * alloc_etherdev_mq - Allocates and sets up an Ethernet device
  * @sizeof_priv: Size of additional driver-private structure to be allocated
  *	for this Ethernet device
+ * @queue_count: The number of queues this device has.
  *
  * Fill in the fields of the device structure with Ethernet-generic
  * values. Basically does everything except registering the device.
@@ -328,8 +329,8 @@ EXPORT_SYMBOL(ether_setup);
  * this private data area.
  */
 
-struct net_device *alloc_etherdev(int sizeof_priv)
+struct net_device *alloc_etherdev_mq(int sizeof_priv, int queue_count)
 {
-	return alloc_netdev(sizeof_priv, "eth%d", ether_setup);
+	return alloc_netdev_mq(sizeof_priv, "eth%d", ether_setup, queue_count);
 }
-EXPORT_SYMBOL(alloc_etherdev);
+EXPORT_SYMBOL(alloc_etherdev_mq);
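
The transmit-side checks this patch adds (the whole-device stopped test plus
the per-subqueue test keyed on skb->queue_mapping) can be illustrated with a
small userspace model.  It is a sketch only: the model_* names, the fixed
MODEL_MAX_QUEUES limit, and the single MODEL_XOFF bit standing in for
__LINK_STATE_XOFF are all assumptions made for the example, and there is no
locking or atomic bit handling as in the real helpers.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_MAX_QUEUES 8	/* hypothetical limit, model only */
#define MODEL_XOFF       1UL	/* stands in for __LINK_STATE_XOFF */

struct model_netdev {
	int           egress_subqueue_count;
	unsigned long state;				/* whole device */
	unsigned long subqueue_state[MODEL_MAX_QUEUES];	/* per Tx queue */
};

static void model_stop_subqueue(struct model_netdev *dev, uint16_t q)
{
	assert(q < dev->egress_subqueue_count);
	dev->subqueue_state[q] |= MODEL_XOFF;
}

static void model_start_subqueue(struct model_netdev *dev, uint16_t q)
{
	assert(q < dev->egress_subqueue_count);
	dev->subqueue_state[q] &= ~MODEL_XOFF;
}

static int model_subqueue_stopped(const struct model_netdev *dev, uint16_t q)
{
	assert(q < dev->egress_subqueue_count);
	return (dev->subqueue_state[q] & MODEL_XOFF) != 0;
}

/* Mirrors the combined test added to dev_queue_xmit(), netpoll and pktgen:
 * transmit only if neither the device nor the target subqueue is stopped. */
static int model_can_xmit(const struct model_netdev *dev,
			  uint16_t queue_mapping)
{
	return !(dev->state & MODEL_XOFF) &&
	       !model_subqueue_stopped(dev, queue_mapping);
}
```

In this model, stopping subqueue 2 blocks only skbs mapped to queue 2, while
setting the device-wide bit blocks every queue, which is the layering the
patch relies on.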

* Re: [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API
  2007-06-23 21:36 ` [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API PJ Waskiewicz
@ 2007-06-24 12:00   ` Patrick McHardy
  2007-06-25 16:25     ` Waskiewicz Jr, Peter P
  0 siblings, 1 reply; 31+ messages in thread
From: Patrick McHardy @ 2007-06-24 12:00 UTC
  To: PJ Waskiewicz; +Cc: davem, netdev, jeff, auke-jan.h.kok, hadi

PJ Waskiewicz wrote:
> +struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
> +		void (*setup)(struct net_device *), int queue_count)
>  {
>  	void *p;
>  	struct net_device *dev;
> @@ -3361,7 +3368,9 @@ struct net_device *alloc_netdev(int sizeof_priv, const char *name,
>  	BUG_ON(strlen(name) >= sizeof(dev->name));
>  
>  	/* ensure 32-byte alignment of both the device and private area */
> -	alloc_size = (sizeof(*dev) + NETDEV_ALIGN_CONST) & ~NETDEV_ALIGN_CONST;
> +	alloc_size = (sizeof(*dev) + NETDEV_ALIGN_CONST +
> +		     (sizeof(struct net_device_subqueue) * (queue_count - 1))) &


Why queue_count - 1 ? It should be queue_count I think.


Otherwise ACK for this patch except that it should also contain the
sch_generic changes.
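
The sizing question above can be illustrated with a userspace sketch.  It
assumes the trailing egress_subqueue[0] array contributes no storage to
sizeof(*dev), so all queue_count elements have to be added explicitly.  The
model_* names are hypothetical, the struct is a stand-in rather than the real
net_device, and MODEL_ALIGN_CONST = 31 is assumed from the "32-byte alignment"
comment in the quoted code.

```c
#include <assert.h>
#include <stddef.h>

struct model_subqueue {
	unsigned long state;
};

/* Stand-in for struct net_device; uses a C99 flexible array member where
 * the patch uses the GNU zero-length array egress_subqueue[0]. */
struct model_netdev {
	char name[16];
	int  egress_subqueue_count;
	struct model_subqueue egress_subqueue[];
};

#define MODEL_ALIGN_CONST 31UL	/* assumed: 32-byte alignment minus one */

/* Bytes that must be valid to touch egress_subqueue[queue_count - 1]. */
static size_t model_bytes_needed(int queue_count)
{
	return offsetof(struct model_netdev, egress_subqueue) +
	       sizeof(struct model_subqueue) * (size_t)queue_count;
}

/* Allocation size using queue_count (not queue_count - 1) elements. */
static size_t model_alloc_size(size_t sizeof_priv, int queue_count)
{
	size_t size;

	assert(queue_count >= 1);
	size = sizeof(struct model_netdev) +
	       sizeof(struct model_subqueue) * (size_t)queue_count;
	size = (size + MODEL_ALIGN_CONST) & ~MODEL_ALIGN_CONST;
	return size + sizeof_priv + MODEL_ALIGN_CONST;
}
```

With queue_count - 1 in the multiplication, the last subqueue element would
fall outside the explicitly sized region and would only be covered if the
alignment round-up happened to leave enough slack.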

* RE: [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API
  2007-06-24 12:00   ` Patrick McHardy
@ 2007-06-25 16:25     ` Waskiewicz Jr, Peter P
  0 siblings, 0 replies; 31+ messages in thread
From: Waskiewicz Jr, Peter P @ 2007-06-25 16:25 UTC
  To: Patrick McHardy; +Cc: davem, netdev, jeff, Kok, Auke-jan H, hadi

> >  	/* ensure 32-byte alignment of both the device and private area */
> > -	alloc_size = (sizeof(*dev) + NETDEV_ALIGN_CONST) & ~NETDEV_ALIGN_CONST;
> > +	alloc_size = (sizeof(*dev) + NETDEV_ALIGN_CONST +
> > +		     (sizeof(struct net_device_subqueue) * (queue_count - 1))) &
> 
> 
> Why queue_count - 1 ? It should be queue_count I think.

I'm not sure what went through my head, but I'll fix this.

> Otherwise ACK for this patch except that it should also contain the
> sch_generic changes.

I misread your previous mail; I'll get the sch_generic.c changes into
this patch.

Thanks Patrick,
-PJ

* [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-23 21:36 [PATCH] NET: Multiple queue hardware support PJ Waskiewicz
  2007-06-23 21:36 ` [PATCH 1/3] NET: [DOC] Multiqueue hardware support documentation PJ Waskiewicz
  2007-06-23 21:36 ` [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API PJ Waskiewicz
@ 2007-06-23 21:36 ` PJ Waskiewicz
  2007-06-24 12:16   ` Patrick McHardy
  2007-06-24 22:22   ` Patrick McHardy
  2 siblings, 2 replies; 31+ messages in thread
From: PJ Waskiewicz @ 2007-06-23 21:36 UTC
  To: davem; +Cc: netdev, jeff, auke-jan.h.kok, hadi, kaber

Updated: This patch applies on top of Patrick McHardy's RTNETLINK
nested compat attribute patches.  These are required to preserve
ABI for iproute2 when working with the multiqueue qdiscs.

Add the new sch_rr qdisc for multiqueue network device support.
Allow sch_prio and sch_rr to be compiled with or without multiqueue hardware
support.

sch_rr is part of sch_prio, and is referenced from MODULE_ALIAS.  This
was done since sch_prio and sch_rr only differ in their dequeue routine.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
---

 include/linux/pkt_sched.h |    4 +-
 net/sched/Kconfig         |   30 +++++++++++++
 net/sched/sch_generic.c   |    3 +
 net/sched/sch_prio.c      |  106 ++++++++++++++++++++++++++++++++++++++++-----
 4 files changed, 129 insertions(+), 14 deletions(-)

diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index 09808b7..ec3a9a5 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -103,8 +103,8 @@ struct tc_prio_qopt
 
 enum
 {
-	TCA_PRIO_UNPSEC,
-	TCA_PRIO_TEST,
+	TCA_PRIO_UNSPEC,
+	TCA_PRIO_MQ,
 	__TCA_PRIO_MAX
 };
 
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 475df84..7f14fa6 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -102,8 +102,16 @@ config NET_SCH_ATM
 	  To compile this code as a module, choose M here: the
 	  module will be called sch_atm.
 
+config NET_SCH_BANDS
+        bool "Multi Band Queueing (PRIO and RR)"
+        ---help---
+          Say Y here if you want to use n-band multiqueue packet
+          schedulers.  These include a priority-based scheduler and
+	   a round-robin scheduler.
+
 config NET_SCH_PRIO
 	tristate "Multi Band Priority Queueing (PRIO)"
+	depends on NET_SCH_BANDS
 	---help---
 	  Say Y here if you want to use an n-band priority queue packet
 	  scheduler.
@@ -111,6 +119,28 @@ config NET_SCH_PRIO
 	  To compile this code as a module, choose M here: the
 	  module will be called sch_prio.
 
+config NET_SCH_RR
+	tristate "Multi Band Round Robin Queuing (RR)"
+	depends on NET_SCH_BANDS
+	select NET_SCH_PRIO
+	---help---
+	  Say Y here if you want to use an n-band round robin packet
+	  scheduler.
+
+	  The module uses sch_prio for its framework and is aliased as
+	  sch_rr, so it will load sch_prio, although it is referred
+	  to using sch_rr.
+
+config NET_SCH_BANDS_MQ
+	bool "Multiple hardware queue support"
+	depends on NET_SCH_BANDS
+	---help---
+	  Say Y here if you want to allow the PRIO and RR qdiscs to assign
+	  flows to multiple hardware queues on an ethernet device.  This
+	  will still work on devices with 1 queue.
+
+	  Most people will say N here.
+
 config NET_SCH_RED
 	tristate "Random Early Detection (RED)"
 	---help---
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 9461e8a..203d5c4 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -168,7 +168,8 @@ static inline int qdisc_restart(struct net_device *dev)
 	spin_unlock(&dev->queue_lock);
 
 	ret = NETDEV_TX_BUSY;
-	if (!netif_queue_stopped(dev))
+	if (!netif_queue_stopped(dev) &&
+	    !netif_subqueue_stopped(dev, skb->queue_mapping))
 		/* churn baby churn .. */
 		ret = dev_hard_start_xmit(skb, dev);
 
diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c
index 40a13e8..8a716f0 100644
--- a/net/sched/sch_prio.c
+++ b/net/sched/sch_prio.c
@@ -40,9 +40,11 @@
 struct prio_sched_data
 {
 	int bands;
+	int curband; /* for round-robin */
 	struct tcf_proto *filter_list;
 	u8  prio2band[TC_PRIO_MAX+1];
 	struct Qdisc *queues[TCQ_PRIO_BANDS];
+	unsigned char mq;
 };
 
 
@@ -70,14 +72,28 @@ prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
 #endif
 			if (TC_H_MAJ(band))
 				band = 0;
+			if (q->mq)
+				skb->queue_mapping = 
+						q->prio2band[band&TC_PRIO_MAX];
+			else
+				skb->queue_mapping = 0;
 			return q->queues[q->prio2band[band&TC_PRIO_MAX]];
 		}
 		band = res.classid;
 	}
 	band = TC_H_MIN(band) - 1;
-	if (band >= q->bands)
+	if (band >= q->bands) {
+		if (q->mq)
+			skb->queue_mapping = q->prio2band[0];
+		else
+			skb->queue_mapping = 0;
 		return q->queues[q->prio2band[0]];
+	}
 
+	if (q->mq)
+		skb->queue_mapping = band;
+	else
+		skb->queue_mapping = 0;
 	return q->queues[band];
 }
 
@@ -144,17 +160,57 @@ prio_dequeue(struct Qdisc* sch)
 	struct Qdisc *qdisc;
 
 	for (prio = 0; prio < q->bands; prio++) {
-		qdisc = q->queues[prio];
-		skb = qdisc->dequeue(qdisc);
-		if (skb) {
-			sch->q.qlen--;
-			return skb;
+		/* Check if the target subqueue is available before
+		 * pulling an skb.  This way we avoid excessive requeues
+		 * for slower queues.
+		 */
+		if (!netif_subqueue_stopped(sch->dev, (q->mq ? prio : 0))) {
+			qdisc = q->queues[prio];
+			skb = qdisc->dequeue(qdisc);
+			if (skb) {
+				sch->q.qlen--;
+				return skb;
+			}
 		}
 	}
 	return NULL;
 
 }
 
+static struct sk_buff *rr_dequeue(struct Qdisc* sch)
+{
+	struct sk_buff *skb;
+	struct prio_sched_data *q = qdisc_priv(sch);
+	struct Qdisc *qdisc;
+	int bandcount;
+
+	/* Only take one pass through the queues.  If nothing is available,
+	 * return nothing.
+	 */
+	for (bandcount = 0; bandcount < q->bands; bandcount++) {
+		/* Check if the target subqueue is available before
+		 * pulling an skb.  This way we avoid excessive requeues
+		 * for slower queues.  If the queue is stopped, try the
+		 * next queue.
+		 */
+		if (!netif_subqueue_stopped(sch->dev, (q->mq ? q->curband : 0))) {
+			qdisc = q->queues[q->curband];
+			skb = qdisc->dequeue(qdisc);
+			if (skb) {
+				sch->q.qlen--;
+				q->curband++;
+				if (q->curband >= q->bands)
+					q->curband = 0;
+				return skb;
+			}
+		}
+		q->curband++;
+		if (q->curband >= q->bands)
+			q->curband = 0;
+	}
+	return NULL;
+}
+
 static unsigned int prio_drop(struct Qdisc* sch)
 {
 	struct prio_sched_data *q = qdisc_priv(sch);
@@ -202,7 +258,7 @@ static int prio_tune(struct Qdisc *sch, struct rtattr *opt)
 	struct rtattr *tb[TCA_PRIO_MAX];
 	int i;
 
-	if (rtattr_parse_nested_compat(tb, TCA_PRIO_MAX, opt, (void *)&qopt,
+	if (rtattr_parse_nested_compat(tb, TCA_PRIO_MAX, opt, qopt,
 				       sizeof(*qopt)))
 		return -EINVAL;
 	if (qopt->bands > TCQ_PRIO_BANDS || qopt->bands < 2)
@@ -213,8 +269,14 @@ static int prio_tune(struct Qdisc *sch, struct rtattr *opt)
 			return -EINVAL;
 	}
 
-	if (tb[TCA_PRIO_TEST-1])
-		printk("TCA_PRIO_TEST: %u\n", *(u32 *)RTA_DATA(tb[TCA_PRIO_TEST-1]));
+	/* If we're multiqueue, make sure the number of incoming bands
+	 * matches the number of queues on the device we're associating with.
+	 */
+	if (tb[TCA_PRIO_MQ - 1])
+		q->mq = *(unsigned char *)RTA_DATA(tb[TCA_PRIO_MQ - 1]);
+
+	if (q->mq && (qopt->bands != sch->dev->egress_subqueue_count))
+		return -EINVAL;
 
 	sch_tree_lock(sch);
 	q->bands = qopt->bands;
@@ -280,7 +342,7 @@ static int prio_dump(struct Qdisc *sch, struct sk_buff *skb)
 	memcpy(&opt.priomap, q->prio2band, TC_PRIO_MAX+1);
 
 	nest = RTA_NEST_COMPAT(skb, TCA_OPTIONS, sizeof(opt), &opt);
-	RTA_PUT_U32(skb, TCA_PRIO_TEST, 321);
+	RTA_PUT_U8(skb, TCA_PRIO_MQ, q->mq);
 	RTA_NEST_COMPAT_END(skb, nest);
 	return skb->len;
 
@@ -452,17 +514,39 @@ static struct Qdisc_ops prio_qdisc_ops = {
 	.owner		=	THIS_MODULE,
 };
 
+static struct Qdisc_ops rr_qdisc_ops = {
+	.next		=	NULL,
+	.cl_ops		=	&prio_class_ops,
+	.id		=	"rr",
+	.priv_size	=	sizeof(struct prio_sched_data),
+	.enqueue	=	prio_enqueue,
+	.dequeue	=	rr_dequeue,
+	.requeue	=	prio_requeue,
+	.drop		=	prio_drop,
+	.init		=	prio_init,
+	.reset		=	prio_reset,
+	.destroy	=	prio_destroy,
+	.change		=	prio_tune,
+	.dump		=	prio_dump,
+	.owner		=	THIS_MODULE,
+};
+
 static int __init prio_module_init(void)
 {
-	return register_qdisc(&prio_qdisc_ops);
+	register_qdisc(&prio_qdisc_ops);
+	register_qdisc(&rr_qdisc_ops);
+
+	return 0;
 }
 
 static void __exit prio_module_exit(void)
 {
 	unregister_qdisc(&prio_qdisc_ops);
+	unregister_qdisc(&rr_qdisc_ops);
 }
 
 module_init(prio_module_init)
 module_exit(prio_module_exit)
 
 MODULE_LICENSE("GPL");
+MODULE_ALIAS("sch_rr");
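
The round-robin selection in rr_dequeue() above can be modelled in userspace as a rough sketch. Everything here is a hypothetical stand-in (struct rr_model, rr_advance(), rr_dequeue_model(), the stopped[] and backlog[] arrays); none of it is kernel API. It only mirrors the control flow: skip a band whose subqueue is stopped or whose backlog is empty, and advance the cursor on every attempt.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_BANDS 16

/* Hypothetical userspace stand-in for the qdisc private data. */
struct rr_model {
	int bands;			/* number of bands in use */
	int curband;			/* next band to try */
	bool mq;			/* multiqueue mode enabled? */
	bool stopped[MAX_BANDS];	/* models netif_subqueue_stopped() */
	int backlog[MAX_BANDS];		/* packets waiting per band */
};

/* Advance the round-robin cursor, wrapping at q->bands. */
static void rr_advance(struct rr_model *q)
{
	if (++q->curband >= q->bands)
		q->curband = 0;
}

/*
 * Return the band a packet was taken from, or -1 if every band is
 * either empty or mapped to a stopped subqueue.
 */
static int rr_dequeue_model(struct rr_model *q)
{
	int tries;

	assert(q->bands > 0 && q->bands <= MAX_BANDS);

	for (tries = 0; tries < q->bands; tries++) {
		int band = q->curband;
		/* Without multiqueue, everything maps to subqueue 0. */
		int subqueue = q->mq ? band : 0;

		rr_advance(q);
		if (q->stopped[subqueue] || q->backlog[band] == 0)
			continue;
		q->backlog[band]--;
		return band;
	}
	return -1;
}
```

With three bands, one packet each, and subqueue 1 stopped, successive calls in this model yield band 0, then band 2, then -1 until subqueue 1 is restarted.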

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-23 21:36 ` [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue PJ Waskiewicz
@ 2007-06-24 12:16   ` Patrick McHardy
  2007-06-25 17:27     ` Waskiewicz Jr, Peter P
  2007-06-25 21:53     ` Waskiewicz Jr, Peter P
  2007-06-24 22:22   ` Patrick McHardy
  1 sibling, 2 replies; 31+ messages in thread
From: Patrick McHardy @ 2007-06-24 12:16 UTC (permalink / raw)
  To: PJ Waskiewicz; +Cc: davem, netdev, jeff, auke-jan.h.kok, hadi

PJ Waskiewicz wrote:
> diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
> index 09808b7..ec3a9a5 100644
> --- a/include/linux/pkt_sched.h
> +++ b/include/linux/pkt_sched.h
> @@ -103,8 +103,8 @@ struct tc_prio_qopt
>  
>  enum
>  {
> -	TCA_PRIO_UNPSEC,
> -	TCA_PRIO_TEST,


You misunderstood me. You can work on top of my compat attribute
patches, but the example code should not have to go in to apply
your patch.


> diff --git a/net/sched/Kconfig b/net/sched/Kconfig
> index 475df84..7f14fa6 100644
> --- a/net/sched/Kconfig
> +++ b/net/sched/Kconfig
> @@ -102,8 +102,16 @@ config NET_SCH_ATM
>  	  To compile this code as a module, choose M here: the
>  	  module will be called sch_atm.
>  
> +config NET_SCH_BANDS
> +        bool "Multi Band Queueing (PRIO and RR)"

This option seems useless. It's not used *anywhere* except for
dependencies.

> +        ---help---
> +          Say Y here if you want to use n-band multiqueue packet
> +          schedulers.  These include a priority-based scheduler and
> +	   a round-robin scheduler.
> +
>  config NET_SCH_PRIO
>  	tristate "Multi Band Priority Queueing (PRIO)"
> +	depends on NET_SCH_BANDS

And this dependency as well.

>  	---help---
>  	  Say Y here if you want to use an n-band priority queue packet
>  	  scheduler.
> @@ -111,6 +119,28 @@ config NET_SCH_PRIO
>  	  To compile this code as a module, choose M here: the
>  	  module will be called sch_prio.
>  
> +config NET_SCH_RR
> +	tristate "Multi Band Round Robin Queuing (RR)"
> +	depends on NET_SCH_BANDS

Same here. RR

> +	select NET_SCH_PRIO
> +	---help---
> +	  Say Y here if you want to use an n-band round robin packet
> +	  scheduler.
> +
> +	  The module uses sch_prio for its framework and is aliased as
> +	  sch_rr, so it will load sch_prio, although it is referred
> +	  to using sch_rr.
> +
> +config NET_SCH_BANDS_MQ
> +	bool "Multiple hardware queue support"
> +	depends on NET_SCH_BANDS


OK, again:

Introduce NET_SCH_RR. NET_SCH_RR selects NET_SCH_PRIO. Nothing at
all changes for NET_SCH_PRIO itself. Additionally introduce a
boolean NET_SCH_MULTIQUEUE. No dependencies at all. Use
NET_SCH_MULTIQUEUE to guard the multiqueue code in sch_prio.c.
Your current code doesn't even have any ifdefs anymore though,
so this might not be needed at all.

Additionally you could later introduce E1000_MULTIQUEUE and
have that select NET_SCH_MULTIQUEUE.

> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 9461e8a..203d5c4 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -168,7 +168,8 @@ static inline int qdisc_restart(struct net_device *dev)
>  	spin_unlock(&dev->queue_lock);
>  
>  	ret = NETDEV_TX_BUSY;
> -	if (!netif_queue_stopped(dev))
> +	if (!netif_queue_stopped(dev) &&
> +	    !netif_subqueue_stopped(dev, skb->queue_mapping))
>  		/* churn baby churn .. */
>  		ret = dev_hard_start_xmit(skb, dev);

I'll try again - please move this to patch 2/3.



> diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c
> index 40a13e8..8a716f0 100644
> --- a/net/sched/sch_prio.c
> +++ b/net/sched/sch_prio.c
> @@ -40,9 +40,11 @@
>  struct prio_sched_data
>  {
>  	int bands;
> +	int curband; /* for round-robin */
>  	struct tcf_proto *filter_list;
>  	u8  prio2band[TC_PRIO_MAX+1];
>  	struct Qdisc *queues[TCQ_PRIO_BANDS];
> +	unsigned char mq;
>  };
>  
>  
> @@ -70,14 +72,28 @@ prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
>  #endif
>  			if (TC_H_MAJ(band))
>  				band = 0;
> +			if (q->mq)
> +				skb->queue_mapping = 
> +						q->prio2band[band&TC_PRIO_MAX];
> +			else
> +				skb->queue_mapping = 0;


Might look cleaner if you have one central point where queue_mapping is
set and the band is returned.

> +	/* If we're multiqueue, make sure the number of incoming bands
> +	 * matches the number of queues on the device we're associating with.
> +	 */
> +	if (tb[TCA_PRIO_MQ - 1])
> +		q->mq = *(unsigned char *)RTA_DATA(tb[TCA_PRIO_MQ - 1]);


If you're using it as a flag, please use RTA_GET_FLAG(),
otherwise RTA_GET_U8.

> +	if (q->mq && (qopt->bands != sch->dev->egress_subqueue_count))
> +		return -EINVAL;
>  
>  	sch_tree_lock(sch);
>  	q->bands = qopt->bands;
> @@ -280,7 +342,7 @@ static int prio_dump(struct Qdisc *sch, struct sk_buff *skb)
>  	memcpy(&opt.priomap, q->prio2band, TC_PRIO_MAX+1);
>  
>  	nest = RTA_NEST_COMPAT(skb, TCA_OPTIONS, sizeof(opt), &opt);
> -	RTA_PUT_U32(skb, TCA_PRIO_TEST, 321);
> +	RTA_PUT_U8(skb, TCA_PRIO_MQ, q->mq);


And RTA_PUT_FLAG. Now that I think of it, does it even make sense
to have a prio private flag for this instead of a qdisc global one?

>  static int __init prio_module_init(void)
>  {
> -	return register_qdisc(&prio_qdisc_ops);
> +	register_qdisc(&prio_qdisc_ops);
> +	register_qdisc(&rr_qdisc_ops);

Proper error handling please.
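
One possible shape for that error handling, as a userspace sketch only: if the second registration fails, the first is rolled back so the module does not stay half-registered. The names struct ops_model, register_model(), unregister_model() and module_init_model() are hypothetical stand-ins for the Qdisc_ops registration calls, and the -EEXIST failure is just an assumed example error.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for a struct Qdisc_ops instance. */
struct ops_model {
	const char *id;
	int registered;
	int fail;	/* force register_model() to fail, for illustration */
};

/* Models register_qdisc(): 0 on success, negative errno on failure. */
static int register_model(struct ops_model *ops)
{
	if (ops->fail)
		return -EEXIST;
	ops->registered = 1;
	return 0;
}

/* Models unregister_qdisc(). */
static void unregister_model(struct ops_model *ops)
{
	ops->registered = 0;
}

/* Register both schedulers, undoing the first if the second fails. */
static int module_init_model(struct ops_model *prio, struct ops_model *rr)
{
	int err;

	assert(prio && rr);

	err = register_model(prio);
	if (err < 0)
		return err;

	err = register_model(rr);
	if (err < 0)
		unregister_model(prio);

	return err;
}
```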

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-24 12:16   ` Patrick McHardy
@ 2007-06-25 17:27     ` Waskiewicz Jr, Peter P
  2007-06-25 17:29       ` Patrick McHardy
  2007-06-25 21:53     ` Waskiewicz Jr, Peter P
  1 sibling, 1 reply; 31+ messages in thread
From: Waskiewicz Jr, Peter P @ 2007-06-25 17:27 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: davem, netdev, jeff, Kok, Auke-jan H, hadi

> >  enum
> >  {
> > -	TCA_PRIO_UNPSEC,
> > -	TCA_PRIO_TEST,
> 
> 
> You misunderstood me. You can work on top of my compat 
> attribute patches, but the example code should not have to go 
> in to apply your patch.

Ok.  I'll fix my patches.

> > diff --git a/net/sched/Kconfig b/net/sched/Kconfig
> > index 475df84..7f14fa6 100644
> > --- a/net/sched/Kconfig
> > +++ b/net/sched/Kconfig
> > @@ -102,8 +102,16 @@ config NET_SCH_ATM
> >  	  To compile this code as a module, choose M here: the
> >  	  module will be called sch_atm.
> >  
> > +config NET_SCH_BANDS
> > +        bool "Multi Band Queueing (PRIO and RR)"
> 
> This option seems useless. It's not used *anywhere* except 
> for dependencies.

I was trying to group the multiqueue qdiscs together with this.  But I
can see just having the multiqueue option for scheduling will cover
this.  I'll remove this.

> > +config NET_SCH_BANDS_MQ
> > +	bool "Multiple hardware queue support"
> > +	depends on NET_SCH_BANDS
> 
> 
> OK, again:
> 
> Introduce NET_SCH_RR. NET_SCH_RR selects NET_SCH_PRIO. 
> Nothing at all changes for NET_SCH_PRIO itself. Additionally 
> introduce a boolean NET_SCH_MULTIQUEUE. No dependencies at 
> all. Use NET_SCH_MULTIQUEUE to guard the multiqueue code in 
> sch_prio.c.
> Your current code doesn't even have any ifdefs anymore 
> though, so this might not be needed at all.
> 
> Additionally you could later introduce E1000_MULTIQUEUE and 
> have that select NET_SCH_MULTIQUEUE.

I'll clean this up.  Thanks for the persistence.  :)

> > diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> > index 9461e8a..203d5c4 100644
> > --- a/net/sched/sch_generic.c
> > +++ b/net/sched/sch_generic.c
> > @@ -168,7 +168,8 @@ static inline int qdisc_restart(struct net_device *dev)
> >  	spin_unlock(&dev->queue_lock);
> >  
> >  	ret = NETDEV_TX_BUSY;
> > -	if (!netif_queue_stopped(dev))
> > +	if (!netif_queue_stopped(dev) &&
> > +	    !netif_subqueue_stopped(dev, skb->queue_mapping))
> >  		/* churn baby churn .. */
> >  		ret = dev_hard_start_xmit(skb, dev);
> 
> I'll try again - please move this to patch 2/3.

I'm sorry; I misread your original comment about this.  I'll move the
change (although this disappears with Jamal's and KK's qdisc_restart()
cleanup).

> > diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c
> > index 40a13e8..8a716f0 100644
> > --- a/net/sched/sch_prio.c
> > +++ b/net/sched/sch_prio.c
> > @@ -40,9 +40,11 @@
> >  struct prio_sched_data
> >  {
> >  	int bands;
> > +	int curband; /* for round-robin */
> >  	struct tcf_proto *filter_list;
> >  	u8  prio2band[TC_PRIO_MAX+1];
> >  	struct Qdisc *queues[TCQ_PRIO_BANDS];
> > +	unsigned char mq;
> >  };
> >  
> >  
> > @@ -70,14 +72,28 @@ prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
> >  #endif
> >  			if (TC_H_MAJ(band))
> >  				band = 0;
> > +			if (q->mq)
> > +				skb->queue_mapping = 
> > +						q->prio2band[band&TC_PRIO_MAX];
> > +			else
> > +				skb->queue_mapping = 0;
> 
> 
> Might look cleaner if you have one central point where 
> queue_mapping is set and the band is returned.

I'll see how easy it'll be to condense this; because the queue being
selected in the qdisc can be different based on a few different things,
I'm not sure how easy it'll be to assign this in one spot.  I'll play
around with it and see what I can come up with.

> > +	/* If we're multiqueue, make sure the number of incoming bands
> > +	 * matches the number of queues on the device we're associating with.
> > +	 */
> > +	if (tb[TCA_PRIO_MQ - 1])
> > +		q->mq = *(unsigned char *)RTA_DATA(tb[TCA_PRIO_MQ - 1]);
> 
> 
> If you're using it as a flag, please use RTA_GET_FLAG(), 
> otherwise RTA_GET_U8.

Will do.  Thanks.

> > +	if (q->mq && (qopt->bands != sch->dev->egress_subqueue_count))
> > +		return -EINVAL;
> >  
> >  	sch_tree_lock(sch);
> >  	q->bands = qopt->bands;
> > @@ -280,7 +342,7 @@ static int prio_dump(struct Qdisc *sch, struct sk_buff *skb)
> >  	memcpy(&opt.priomap, q->prio2band, TC_PRIO_MAX+1);
> >  
> >  	nest = RTA_NEST_COMPAT(skb, TCA_OPTIONS, sizeof(opt), &opt);
> > -	RTA_PUT_U32(skb, TCA_PRIO_TEST, 321);
> > +	RTA_PUT_U8(skb, TCA_PRIO_MQ, q->mq);
> 
> 
> And RTA_PUT_FLAG. Now that I think of it, does it even make 
> sense to have a prio private flag for this instead of a qdisc 
> global one?

There currently aren't any other qdiscs that are natural fits for
multiqueue that I can see.  I can see the benefit though of having this
as a global flag in the qdisc API; let me check it out, and if it makes
sense, I can move it.

> >  static int __init prio_module_init(void)
> >  {
> > -	return register_qdisc(&prio_qdisc_ops);
> > +	register_qdisc(&prio_qdisc_ops);
> > +	register_qdisc(&rr_qdisc_ops);
> 
> Proper error handling please.

Will do.

Thanks,
-PJ Waskiewicz

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-25 17:27     ` Waskiewicz Jr, Peter P
@ 2007-06-25 17:29       ` Patrick McHardy
  0 siblings, 0 replies; 31+ messages in thread
From: Patrick McHardy @ 2007-06-25 17:29 UTC (permalink / raw)
  To: Waskiewicz Jr, Peter P; +Cc: davem, netdev, jeff, Kok, Auke-jan H, hadi

Waskiewicz Jr, Peter P wrote:
>
>> And RTA_PUT_FLAG. Now that I think of it, does it even make 
>> sense to have a prio private flag for this instead of a qdisc 
>> global one?
>>     
>
> There currently aren't any other qdiscs that are natural fits for
> multiqueue that I can see.  I can see the benefit though of having this
> as a global flag in the qdisc API; let me check it out, and if it makes
> sense, I can move it.
>   

Yes, that thought occurred to me as well. Keeping it private
seems better.



^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-24 12:16   ` Patrick McHardy
  2007-06-25 17:27     ` Waskiewicz Jr, Peter P
@ 2007-06-25 21:53     ` Waskiewicz Jr, Peter P
  2007-06-25 21:58       ` Patrick McHardy
  1 sibling, 1 reply; 31+ messages in thread
From: Waskiewicz Jr, Peter P @ 2007-06-25 21:53 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: davem, netdev, jeff, Kok, Auke-jan H, hadi

> > @@ -70,14 +72,28 @@ prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
> >  #endif
> >  			if (TC_H_MAJ(band))
> >  				band = 0;
> > +			if (q->mq)
> > +				skb->queue_mapping = 
> > +						q->prio2band[band&TC_PRIO_MAX];
> > +			else
> > +				skb->queue_mapping = 0;
> 
> 
> Might look cleaner if you have one central point where 
> queue_mapping is set and the band is returned.

I've taken a stab at this.  I can have one return point, but I'll still
have multiple assignments of skb->queue_mapping due to the different
branches for which queue to select in the qdisc.  I suppose we can do a
rewrite of prio_classify(), but to me that seems beyond the scope of the
multiqueue patches themselves.  What do you think?

Thanks,
-PJ

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-25 21:53     ` Waskiewicz Jr, Peter P
@ 2007-06-25 21:58       ` Patrick McHardy
  2007-06-25 22:07         ` Waskiewicz Jr, Peter P
  0 siblings, 1 reply; 31+ messages in thread
From: Patrick McHardy @ 2007-06-25 21:58 UTC (permalink / raw)
  To: Waskiewicz Jr, Peter P; +Cc: davem, netdev, jeff, Kok, Auke-jan H, hadi

Waskiewicz Jr, Peter P wrote:
>>> @@ -70,14 +72,28 @@ prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
>>>  #endif
>>>  			if (TC_H_MAJ(band))
>>>  				band = 0;
>>> +			if (q->mq)
>>> +				skb->queue_mapping = 
>>> +						q->prio2band[band&TC_PRIO_MAX];
>>> +			else
>>> +				skb->queue_mapping = 0;
>>
>> Might look cleaner if you have one central point where 
>> queue_mapping is set and the band is returned.
>>
>
> I've taken a stab at this.  I can have one return point, but I'll still
> have multiple assignments of skb->queue_mapping due to the different
> branches for which queue to select in the qdisc.  I suppose we can do a
> rewrite of prio_classify(), but to me that seems beyond the scope of the
> multiqueue patches themselves.  What do you think?
>   

That's not necessary. I just thought you could add one exit point:


...
out:
    skb->queue_mapping = q->mq ? band : 0;
    return q->queues[band];
}

But if that doesn't work don't bother ..


^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-25 21:58       ` Patrick McHardy
@ 2007-06-25 22:07         ` Waskiewicz Jr, Peter P
  0 siblings, 0 replies; 31+ messages in thread
From: Waskiewicz Jr, Peter P @ 2007-06-25 22:07 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: davem, netdev, jeff, Kok, Auke-jan H, hadi

> That's not necessary. I just thought you could add one exit point:
> 
> 
> ...
> out:
>     skb->queue_mapping = q->mq ? band : 0;
>     return q->queues[band];
> }
> 
> But if that doesn't work don't bother ..

Unfortunately it won't, given how band might be used like this to select
the queue:

return q->queues[q->prio2band[band&TC_PRIO_MAX]];

I'll keep this in mind though, and if it can be done cleanly, I'll
submit a patch.

Thanks Patrick,
-PJ
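
For illustration only, one way the single exit point could coexist with the prio2band lookup is to resolve every branch to a final band index first and set queue_mapping once at the label. This is a simplified userspace sketch, not the real prio_classify(): struct classify_model, struct classify_result, classify_band() and the filter_band parameter (a negative value meaning "no classifier result") are all hypothetical, and TC_PRIO_MAX is assumed to be 15 as in the kernel headers.

```c
#include <assert.h>
#include <stdbool.h>

#define TC_PRIO_MAX 15	/* assumed to match the kernel definition */

/* Hypothetical, reduced view of struct prio_sched_data. */
struct classify_model {
	int bands;
	bool mq;
	unsigned char prio2band[TC_PRIO_MAX + 1];
};

/* What the caller gets back: the chosen band and the skb mapping. */
struct classify_result {
	int band;
	unsigned short queue_mapping;
};

/*
 * filter_band < 0 models "no classifier result": fall back to the
 * priority map.  Otherwise the classifier picked a band directly.
 */
static struct classify_result classify_band(const struct classify_model *q,
					    unsigned int priority,
					    int filter_band)
{
	struct classify_result res;
	int band;

	assert(q->bands > 0);

	if (filter_band < 0) {
		band = q->prio2band[priority & TC_PRIO_MAX];
		goto out;
	}

	band = filter_band;
	if (band >= q->bands)
		band = q->prio2band[0];
out:
	/* Single place where the queue mapping is derived from the band. */
	res.band = band;
	res.queue_mapping = q->mq ? (unsigned short)band : 0;
	return res;
}
```

Whether this restructuring is clean enough in the real function depends on the classifier branches omitted from the sketch.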

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-23 21:36 ` [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue PJ Waskiewicz
  2007-06-24 12:16   ` Patrick McHardy
@ 2007-06-24 22:22   ` Patrick McHardy
  2007-06-25 17:29     ` Waskiewicz Jr, Peter P
  1 sibling, 1 reply; 31+ messages in thread
From: Patrick McHardy @ 2007-06-24 22:22 UTC (permalink / raw)
  To: PJ Waskiewicz; +Cc: davem, netdev, jeff, auke-jan.h.kok, hadi

PJ Waskiewicz wrote:
> +	/* If we're multiqueue, make sure the number of incoming bands
> +	 * matches the number of queues on the device we're associating with.
> +	 */
> +	if (tb[TCA_PRIO_MQ - 1])
> +		q->mq = *(unsigned char *)RTA_DATA(tb[TCA_PRIO_MQ - 1]);
> +
> +	if (q->mq && (qopt->bands != sch->dev->egress_subqueue_count))
> +		return -EINVAL;


A nice thing you could do for the user here is use
egress_subqueue_count as default when qopt->bands == 0
(and change tc prio to accept 0 in case it doesn't).
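
A rough userspace sketch of that defaulting rule, under the assumption that the multiqueue check runs before the usual range validation. resolve_bands() and its parameters are hypothetical; TCQ_PRIO_BANDS is assumed to be 16 as in pkt_sched.h, and the minimum of 2 bands follows the existing check in prio_tune().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define TCQ_PRIO_BANDS 16	/* assumed to match pkt_sched.h */

/*
 * Work out the effective band count.  In multiqueue mode a request
 * for 0 bands defaults to the device's subqueue count; any other
 * value must match it exactly.  Returns 0 or a negative errno.
 */
static int resolve_bands(int requested, bool mq, int egress_subqueue_count,
			 int *bands_out)
{
	int bands = requested;

	assert(bands_out);

	if (mq) {
		if (bands == 0)
			bands = egress_subqueue_count;
		else if (bands != egress_subqueue_count)
			return -EINVAL;
	}

	/* Same range check the qdisc already applies. */
	if (bands > TCQ_PRIO_BANDS || bands < 2)
		return -EINVAL;

	*bands_out = bands;
	return 0;
}
```

Note that in this sketch a single-queue device in multiqueue mode is still rejected, because the defaulted value of 1 fails the minimum-of-2 check.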

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
  2007-06-24 22:22   ` Patrick McHardy
@ 2007-06-25 17:29     ` Waskiewicz Jr, Peter P
  0 siblings, 0 replies; 31+ messages in thread
From: Waskiewicz Jr, Peter P @ 2007-06-25 17:29 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: davem, netdev, jeff, Kok, Auke-jan H, hadi

> PJ Waskiewicz wrote:
> > +	/* If we're multiqueue, make sure the number of incoming bands
> > +	 * matches the number of queues on the device we're associating with.
> > +	 */
> > +	if (tb[TCA_PRIO_MQ - 1])
> > +		q->mq = *(unsigned char *)RTA_DATA(tb[TCA_PRIO_MQ - 1]);
> > +
> > +	if (q->mq && (qopt->bands != sch->dev->egress_subqueue_count))
> > +		return -EINVAL;
> 
> 
> A nice thing you could do for the user here is use 
> egress_subqueue_count as default when qopt->bands == 0 (and 
> change tc prio to accept 0 in case it doesn't).

prio only allows a minimum of 2 bands right now.  I see what you're
suggesting though; let me think about this.  I do like this suggestion.

Thanks,
-PJ Waskiewicz

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH] NET: Multiple queue hardware support
@ 2007-06-28 16:20 PJ Waskiewicz
  2007-06-28 16:21 ` [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API PJ Waskiewicz
  0 siblings, 1 reply; 31+ messages in thread
From: PJ Waskiewicz @ 2007-06-28 16:20 UTC (permalink / raw)
  To: davem; +Cc: netdev, jeff, auke-jan.h.kok, hadi, kaber

Please consider these patches for 2.6.23 inclusion.

Updates since the last submission:

1. Fixed alloc_netdev_mq() queue_count bug.

2. Fixed the TCA_PRIO_MQ options layout.

3. Protected sch_prio and sch_rr multiqueue code with NET_SCH_MULTIQUEUE.

4. Added RTA_{GET|PUT}_FLAG in place of RTA_DATA for passing multiqueue
   options to and from the qdisc.

5. Allow sch_prio and sch_rr to take 0 bands when in multiqueue mode.  This
   will set q->bands to dev->egress_subqueue_count; added this also to the
   kernel doc.

This patchset is an updated version of previous multiqueue network device
support patches.  The general approach of introducing a new API for multiqueue
network devices to register with the stack has remained.  The changes include
adding a round-robin qdisc, heavily based on sch_prio, which will allow
queueing to hardware with no OS-enforced queuing policy.  sch_prio still has
the multiqueue code in it, but has a Kconfig option to compile it out of the
qdisc.  This allows people with hardware containing scheduling policies to
use sch_rr (round-robin), and others without scheduling policies in hardware
to continue using sch_prio if they wish to have some notion of scheduling
priority.

The patches being sent are split into Documentation, Qdisc changes, and
core stack changes.

The patches to iproute2 for tc will be sent separately, to support sch_rr.

-- 
PJ Waskiewicz <peter.p.waskiewicz.jr@intel.com>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API
  2007-06-28 16:20 [PATCH] NET: Multiple queue hardware support PJ Waskiewicz
@ 2007-06-28 16:21 ` PJ Waskiewicz
  2007-06-28 16:31   ` Patrick McHardy
                     ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: PJ Waskiewicz @ 2007-06-28 16:21 UTC (permalink / raw)
  To: davem; +Cc: netdev, jeff, auke-jan.h.kok, hadi, kaber

Updated: Fixed allocation of subqueues in alloc_netdev_mq() to
allocate all subqueues, not num - 1.

Added checks for netif_subqueue_stopped() to netpoll,
pktgen, and software device dev_queue_xmit().  This will ensure
external events to these subsystems will be handled correctly if
a subqueue is shut down.

Add the multiqueue hardware device support API to the core network
stack.  Allow drivers to allocate multiple queues and manage them
at the netdev level if they choose to do so.

Added a new field to sk_buff, namely queue_mapping, for drivers to
know which tx_ring to select based on OS classification of the flow.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
---

 include/linux/etherdevice.h |    3 +-
 include/linux/netdevice.h   |   62 ++++++++++++++++++++++++++++++++++++++++++-
 include/linux/skbuff.h      |    4 ++-
 net/core/dev.c              |   27 +++++++++++++------
 net/core/netpoll.c          |    8 +++---
 net/core/pktgen.c           |   10 +++++--
 net/core/skbuff.c           |    3 ++
 net/ethernet/eth.c          |    9 +++---
 8 files changed, 104 insertions(+), 22 deletions(-)

diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h
index f48eb89..b3fbb54 100644
--- a/include/linux/etherdevice.h
+++ b/include/linux/etherdevice.h
@@ -39,7 +39,8 @@ extern void		eth_header_cache_update(struct hh_cache *hh, struct net_device *dev
 extern int		eth_header_cache(struct neighbour *neigh,
 					 struct hh_cache *hh);
 
-extern struct net_device *alloc_etherdev(int sizeof_priv);
+extern struct net_device *alloc_etherdev_mq(int sizeof_priv, int queue_count);
+#define alloc_etherdev(sizeof_priv) alloc_etherdev_mq(sizeof_priv, 1)
 
 /**
  * is_zero_ether_addr - Determine if give Ethernet address is all zeros.
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2c0cc19..7078745 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -108,6 +108,14 @@ struct wireless_dev;
 #define MAX_HEADER (LL_MAX_HEADER + 48)
 #endif
 
+struct net_device_subqueue
+{
+	/* Give a control state for each queue.  This struct may contain
+	 * per-queue locks in the future.
+	 */
+	unsigned long   state;
+};
+
 /*
  *	Network device statistics. Akin to the 2.0 ether stats but
  *	with byte counters.
@@ -331,6 +339,7 @@ struct net_device
 #define NETIF_F_VLAN_CHALLENGED	1024	/* Device cannot handle VLAN packets */
 #define NETIF_F_GSO		2048	/* Enable software GSO. */
 #define NETIF_F_LLTX		4096	/* LockLess TX */
+#define NETIF_F_MULTI_QUEUE	16384	/* Has multiple TX/RX queues */
 
 	/* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT	16
@@ -557,6 +566,10 @@ struct net_device
 
 	/* rtnetlink link ops */
 	const struct rtnl_link_ops *rtnl_link_ops;
+
+	/* The TX queue control structures */
+	int				egress_subqueue_count;
+	struct net_device_subqueue	egress_subqueue[0];
 };
 #define to_net_dev(d) container_of(d, struct net_device, dev)
 
@@ -719,6 +732,48 @@ static inline int netif_running(const struct net_device *dev)
 	return test_bit(__LINK_STATE_START, &dev->state);
 }
 
+/*
+ * Routines to manage the subqueues on a device.  We only need start,
+ * stop, and a check if it's stopped.  All other device management is
+ * done at the overall netdevice level.
+ * Also test the device if we're multiqueue.
+ */
+static inline void netif_start_subqueue(struct net_device *dev, u16 queue_index)
+{
+	clear_bit(__LINK_STATE_XOFF, &dev->egress_subqueue[queue_index].state);
+}
+
+static inline void netif_stop_subqueue(struct net_device *dev, u16 queue_index)
+{
+#ifdef CONFIG_NETPOLL_TRAP
+	if (netpoll_trap())
+		return;
+#endif
+	set_bit(__LINK_STATE_XOFF, &dev->egress_subqueue[queue_index].state);
+}
+
+static inline int netif_subqueue_stopped(const struct net_device *dev,
+					 u16 queue_index)
+{
+	return test_bit(__LINK_STATE_XOFF,
+			&dev->egress_subqueue[queue_index].state);
+}
+
+static inline void netif_wake_subqueue(struct net_device *dev, u16 queue_index)
+{
+#ifdef CONFIG_NETPOLL_TRAP
+	if (netpoll_trap())
+		return;
+#endif
+	if (test_and_clear_bit(__LINK_STATE_XOFF,
+			       &dev->egress_subqueue[queue_index].state))
+		__netif_schedule(dev);
+}
+
+static inline int netif_is_multiqueue(const struct net_device *dev)
+{
+	return (!!(NETIF_F_MULTI_QUEUE & dev->features));
+}
 
 /* Use this variant when it is known for sure that it
  * is executing from interrupt context.
@@ -1009,8 +1064,11 @@ static inline void netif_tx_disable(struct net_device *dev)
 extern void		ether_setup(struct net_device *dev);
 
 /* Support for loadable net-drivers */
-extern struct net_device *alloc_netdev(int sizeof_priv, const char *name,
-				       void (*setup)(struct net_device *));
+extern struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
+				       void (*setup)(struct net_device *),
+				       int queue_count);
+#define alloc_netdev(sizeof_priv, name, setup) \
+	alloc_netdev_mq(sizeof_priv, name, setup, 1)
 extern int		register_netdev(struct net_device *dev);
 extern void		unregister_netdev(struct net_device *dev);
 /* Functions used for secondary unicast and multicast support */
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index b7b2628..6979f4b 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -197,6 +197,7 @@ typedef unsigned char *sk_buff_data_t;
  *	@tstamp: Time we arrived
  *	@dev: Device we arrived on/are leaving by
  *	@iif: ifindex of device we arrived on
+ *	@queue_mapping: Queue mapping for multiqueue devices
  *	@transport_header: Transport layer header
  *	@network_header: Network layer header
  *	@mac_header: Link layer header
@@ -247,7 +248,8 @@ struct sk_buff {
 	ktime_t			tstamp;
 	struct net_device	*dev;
 	int			iif;
-	/* 4 byte hole on 64 bit*/
+	__u16			queue_mapping;
+	/* 2 byte hole on 64 bit*/
 
 	struct  dst_entry	*dst;
 	struct	sec_path	*sp;
diff --git a/net/core/dev.c b/net/core/dev.c
index 778e102..d244771 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1429,7 +1429,9 @@ gso:
 			skb->next = nskb;
 			return rc;
 		}
-		if (unlikely(netif_queue_stopped(dev) && skb->next))
+		if (unlikely((netif_queue_stopped(dev) ||
+			     netif_subqueue_stopped(dev, skb->queue_mapping)) &&
+			     skb->next))
 			return NETDEV_TX_BUSY;
 	} while (skb->next);
 
@@ -1547,6 +1549,8 @@ gso:
 		spin_lock(&dev->queue_lock);
 		q = dev->qdisc;
 		if (q->enqueue) {
+			/* reset queue_mapping to zero */
+			skb->queue_mapping = 0;
 			rc = q->enqueue(skb, q);
 			qdisc_run(dev);
 			spin_unlock(&dev->queue_lock);
@@ -1576,7 +1580,8 @@ gso:
 
 			HARD_TX_LOCK(dev, cpu);
 
-			if (!netif_queue_stopped(dev)) {
+			if (!netif_queue_stopped(dev) &&
+			    !netif_subqueue_stopped(dev, skb->queue_mapping)) {
 				rc = 0;
 				if (!dev_hard_start_xmit(skb, dev)) {
 					HARD_TX_UNLOCK(dev);
@@ -3539,16 +3544,18 @@ static struct net_device_stats *internal_stats(struct net_device *dev)
 }
 
 /**
- *	alloc_netdev - allocate network device
+ *	alloc_netdev_mq - allocate network device
  *	@sizeof_priv:	size of private data to allocate space for
  *	@name:		device name format string
  *	@setup:		callback to initialize device
+ *	@queue_count:	the number of subqueues to allocate
  *
  *	Allocates a struct net_device with private data area for driver use
- *	and performs basic initialization.
+ *	and performs basic initialization.  Also allocates subqueue structs
+ *	for each queue on the device at the end of the netdevice.
  */
-struct net_device *alloc_netdev(int sizeof_priv, const char *name,
-		void (*setup)(struct net_device *))
+struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
+		void (*setup)(struct net_device *), int queue_count)
 {
 	void *p;
 	struct net_device *dev;
@@ -3557,7 +3564,9 @@ struct net_device *alloc_netdev(int sizeof_priv, const char *name,
 	BUG_ON(strlen(name) >= sizeof(dev->name));
 
 	/* ensure 32-byte alignment of both the device and private area */
-	alloc_size = (sizeof(*dev) + NETDEV_ALIGN_CONST) & ~NETDEV_ALIGN_CONST;
+	alloc_size = (sizeof(*dev) + NETDEV_ALIGN_CONST +
+		     (sizeof(struct net_device_subqueue) * queue_count)) &
+		     ~NETDEV_ALIGN_CONST;
 	alloc_size += sizeof_priv + NETDEV_ALIGN_CONST;
 
 	p = kzalloc(alloc_size, GFP_KERNEL);
@@ -3573,12 +3582,14 @@ struct net_device *alloc_netdev(int sizeof_priv, const char *name,
 	if (sizeof_priv)
 		dev->priv = netdev_priv(dev);
 
+	dev->egress_subqueue_count = queue_count;
+
 	dev->get_stats = internal_stats;
 	setup(dev);
 	strcpy(dev->name, name);
 	return dev;
 }
-EXPORT_SYMBOL(alloc_netdev);
+EXPORT_SYMBOL(alloc_netdev_mq);
 
 /**
  *	free_netdev - free network device
diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 758dafe..2ace33d 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -66,8 +66,9 @@ static void queue_process(struct work_struct *work)
 
 		local_irq_save(flags);
 		netif_tx_lock(dev);
-		if (netif_queue_stopped(dev) ||
-		    dev->hard_start_xmit(skb, dev) != NETDEV_TX_OK) {
+		if ((netif_queue_stopped(dev) ||
+		     netif_subqueue_stopped(dev, skb->queue_mapping)) ||
+		     dev->hard_start_xmit(skb, dev) != NETDEV_TX_OK) {
 			skb_queue_head(&npinfo->txq, skb);
 			netif_tx_unlock(dev);
 			local_irq_restore(flags);
@@ -254,7 +255,8 @@ static void netpoll_send_skb(struct netpoll *np, struct sk_buff *skb)
 			/* try until next clock tick */
 			for (tries = jiffies_to_usecs(1)/USEC_PER_POLL;
 					tries > 0; --tries) {
-				if (!netif_queue_stopped(dev))
+				if (!netif_queue_stopped(dev) &&
+				    !netif_subqueue_stopped(dev, skb->queue_mapping))
 					status = dev->hard_start_xmit(skb, dev);
 
 				if (status == NETDEV_TX_OK)
diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index 9cd3a1c..dffe067 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -3139,7 +3139,9 @@ static __inline__ void pktgen_xmit(struct pktgen_dev *pkt_dev)
 		}
 	}
 
-	if (netif_queue_stopped(odev) || need_resched()) {
+	if ((netif_queue_stopped(odev) ||
+	     netif_subqueue_stopped(odev, pkt_dev->skb->queue_mapping)) ||
+	     need_resched()) {
 		idle_start = getCurUs();
 
 		if (!netif_running(odev)) {
@@ -3154,7 +3156,8 @@ static __inline__ void pktgen_xmit(struct pktgen_dev *pkt_dev)
 
 		pkt_dev->idle_acc += getCurUs() - idle_start;
 
-		if (netif_queue_stopped(odev)) {
+		if (netif_queue_stopped(odev) ||
+		    netif_subqueue_stopped(odev, pkt_dev->skb->queue_mapping)) {
 			pkt_dev->next_tx_us = getCurUs();	/* TODO */
 			pkt_dev->next_tx_ns = 0;
 			goto out;	/* Try the next interface */
@@ -3181,7 +3184,8 @@ static __inline__ void pktgen_xmit(struct pktgen_dev *pkt_dev)
 	}
 
 	netif_tx_lock_bh(odev);
-	if (!netif_queue_stopped(odev)) {
+	if (!netif_queue_stopped(odev) &&
+	    !netif_subqueue_stopped(odev, pkt_dev->skb->queue_mapping)) {
 
 		atomic_inc(&(pkt_dev->skb->users));
 	      retry_now:
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 8d8e8fc..e9eea50 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -419,6 +419,7 @@ struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
 	n->nohdr = 0;
 	C(pkt_type);
 	C(ip_summed);
+	C(queue_mapping);
 	C(priority);
 #if defined(CONFIG_IP_VS) || defined(CONFIG_IP_VS_MODULE)
 	C(ipvs_property);
@@ -460,6 +461,7 @@ static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 #endif
 	new->sk		= NULL;
 	new->dev	= old->dev;
+	new->queue_mapping = old->queue_mapping;
 	new->priority	= old->priority;
 	new->protocol	= old->protocol;
 	new->dst	= dst_clone(old->dst);
@@ -1927,6 +1929,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, int features)
 		tail = nskb;
 
 		nskb->dev = skb->dev;
+		nskb->queue_mapping = skb->queue_mapping;
 		nskb->priority = skb->priority;
 		nskb->protocol = skb->protocol;
 		nskb->dst = dst_clone(skb->dst);
diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
index 0ac2524..87a509c 100644
--- a/net/ethernet/eth.c
+++ b/net/ethernet/eth.c
@@ -316,9 +316,10 @@ void ether_setup(struct net_device *dev)
 EXPORT_SYMBOL(ether_setup);
 
 /**
- * alloc_etherdev - Allocates and sets up an Ethernet device
+ * alloc_etherdev_mq - Allocates and sets up an Ethernet device
  * @sizeof_priv: Size of additional driver-private structure to be allocated
  *	for this Ethernet device
+ * @queue_count: The number of queues this device has.
  *
  * Fill in the fields of the device structure with Ethernet-generic
  * values. Basically does everything except registering the device.
@@ -328,8 +329,8 @@ EXPORT_SYMBOL(ether_setup);
  * this private data area.
  */
 
-struct net_device *alloc_etherdev(int sizeof_priv)
+struct net_device *alloc_etherdev_mq(int sizeof_priv, int queue_count)
 {
-	return alloc_netdev(sizeof_priv, "eth%d", ether_setup);
+	return alloc_netdev_mq(sizeof_priv, "eth%d", ether_setup, queue_count);
 }
-EXPORT_SYMBOL(alloc_etherdev);
+EXPORT_SYMBOL(alloc_etherdev_mq);

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API
  2007-06-28 16:21 ` [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API PJ Waskiewicz
@ 2007-06-28 16:31   ` Patrick McHardy
  2007-06-28 17:00   ` Patrick McHardy
  2007-06-29  3:39   ` David Miller
  2 siblings, 0 replies; 31+ messages in thread
From: Patrick McHardy @ 2007-06-28 16:31 UTC (permalink / raw)
  To: PJ Waskiewicz; +Cc: davem, netdev, jeff, auke-jan.h.kok, hadi

PJ Waskiewicz wrote:
> Updated: Fixed allocation of subqueues in alloc_netdev_mq() to
> allocate all subqueues, not num - 1.
>
> Added checks for netif_subqueue_stopped() to netpoll,
> pktgen, and software device dev_queue_xmit().  This will ensure
> external events to these subsystems will be handled correctly if
> a subqueue is shut down.
>
> Add the multiqueue hardware device support API to the core network
> stack.  Allow drivers to allocate multiple queues and manage them
> at the netdev level if they choose to do so.
>
> Added a new field to sk_buff, namely queue_mapping, for drivers to
> know which tx_ring to select based on OS classification of the flow.
>
> Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
>   

Acked-by: Patrick McHardy <kaber@trash.net>

skb->iif and queue_mapping should probably go somewhere near
the other shaping stuff and unsigned int seems to be a better
choice for egress_subqueue_count, but I can take care of that
when this patch is in Dave's tree.





^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API
  2007-06-28 16:21 ` [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API PJ Waskiewicz
  2007-06-28 16:31   ` Patrick McHardy
@ 2007-06-28 17:00   ` Patrick McHardy
  2007-06-28 19:00     ` Waskiewicz Jr, Peter P
  2007-06-29  3:39   ` David Miller
  2 siblings, 1 reply; 31+ messages in thread
From: Patrick McHardy @ 2007-06-28 17:00 UTC (permalink / raw)
  To: PJ Waskiewicz; +Cc: davem, netdev, jeff, auke-jan.h.kok, hadi

PJ Waskiewicz wrote:
>  include/linux/etherdevice.h |    3 +-
>  include/linux/netdevice.h   |   62 ++++++++++++++++++++++++++++++++++++++++++-
>  include/linux/skbuff.h      |    4 ++-
>  net/core/dev.c              |   27 +++++++++++++------
>  net/core/netpoll.c          |    8 +++---
>  net/core/pktgen.c           |   10 +++++--
>  net/core/skbuff.c           |    3 ++
>  net/ethernet/eth.c          |    9 +++---
>  8 files changed, 104 insertions(+), 22 deletions(-)

> include/linux/pkt_sched.h |    9 +++
> net/sched/Kconfig         |   23 +++++++
> net/sched/sch_prio.c      |  147 +++++++++++++++++++++++++++++++++++++++++----
>  3 files changed, 166 insertions(+), 13 deletions(-)


Quick question: where are the sch_generic changes? :)

If you hold for ten minutes I'll post a set of slightly changed
patches with the NETDEVICES_MULTIQUEUE option and a fix for this.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API
  2007-06-28 17:00   ` Patrick McHardy
@ 2007-06-28 19:00     ` Waskiewicz Jr, Peter P
  2007-06-28 19:03       ` Patrick McHardy
  0 siblings, 1 reply; 31+ messages in thread
From: Waskiewicz Jr, Peter P @ 2007-06-28 19:00 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: davem, netdev, jeff, Kok, Auke-jan H, hadi

> PJ Waskiewicz wrote:
> >  include/linux/etherdevice.h |    3 +-
> >  include/linux/netdevice.h   |   62 ++++++++++++++++++++++++++++++++++++++++++-
> >  include/linux/skbuff.h      |    4 ++-
> >  net/core/dev.c              |   27 +++++++++++++------
> >  net/core/netpoll.c          |    8 +++---
> >  net/core/pktgen.c           |   10 +++++--
> >  net/core/skbuff.c           |    3 ++
> >  net/ethernet/eth.c          |    9 +++---
> >  8 files changed, 104 insertions(+), 22 deletions(-)
> 
> > include/linux/pkt_sched.h |    9 +++
> > net/sched/Kconfig         |   23 +++++++
> > net/sched/sch_prio.c      |  147 +++++++++++++++++++++++++++++++++++++++++----
> >  3 files changed, 166 insertions(+), 13 deletions(-)
> 
> 
> Quick question: where are the sch_generic changes? :)
> 
> If you hold for ten minutes I'll post a set of slightly 
> changed patches with the NETDEVICES_MULTIQUEUE option and a 
> fix for this.

Jamal's and KK's qdisc_restart() rewrite took the netif_queue_stopped()
call out of sch_generic.c.  So the underlying qdisc is only responsible
for checking the queue status now before dequeueing.
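
To illustrate what "the qdisc checks the queue status before dequeueing"
could look like, here is a rough user-space sketch, not kernel code.  The
names toy_dev, toy_band, toy_enqueue and toy_dequeue are made up for this
example, and it assumes one band per hardware subqueue, which the real
qdisc need not do:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_BANDS 4
#define TOY_DEPTH 8

/* Hypothetical stand-in for one qdisc band: a small FIFO of packet ids. */
struct toy_band {
	int pkts[TOY_DEPTH];
	size_t head, len;
};

/* Hypothetical stand-in for the device: one stopped flag per subqueue. */
struct toy_dev {
	bool dev_stopped;
	bool subq_stopped[TOY_BANDS];
};

/* Append a packet id to a band; returns false if the band is full. */
static bool toy_enqueue(struct toy_band *b, int pkt)
{
	if (b->len == TOY_DEPTH)
		return false;
	b->pkts[(b->head + b->len) % TOY_DEPTH] = pkt;
	b->len++;
	return true;
}

/*
 * Return the next packet id from the first (highest priority) band whose
 * subqueue is running, or -1 if nothing can be sent right now.  A band
 * whose subqueue is stopped is skipped and keeps its packets.
 */
static int toy_dequeue(const struct toy_dev *dev,
		       struct toy_band bands[TOY_BANDS])
{
	assert(dev && bands);
	if (dev->dev_stopped)
		return -1;
	for (size_t i = 0; i < TOY_BANDS; i++) {
		if (dev->subq_stopped[i] || bands[i].len == 0)
			continue;
		int pkt = bands[i].pkts[bands[i].head];
		bands[i].head = (bands[i].head + 1) % TOY_DEPTH;
		bands[i].len--;
		return pkt;
	}
	return -1;
}
```

With band 0 stopped, a dequeue hands out band 1's packet and leaves band
0's packet queued until that subqueue is started again.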

-PJ

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API
  2007-06-28 19:00     ` Waskiewicz Jr, Peter P
@ 2007-06-28 19:03       ` Patrick McHardy
  2007-06-28 19:06         ` Waskiewicz Jr, Peter P
  0 siblings, 1 reply; 31+ messages in thread
From: Patrick McHardy @ 2007-06-28 19:03 UTC (permalink / raw)
  To: Waskiewicz Jr, Peter P; +Cc: davem, netdev, jeff, Kok, Auke-jan H, hadi

Waskiewicz Jr, Peter P wrote:
>>Quick question: where are the sch_generic changes? :)
>>
>>If you hold for ten minutes I'll post a set of slightly 
>>changed patches with the NETDEVICES_MULTIQUEUE option and a 
>>fix for this.
> 
> 
> Jamal's and KK's qdisc_restart() rewrite took the netif_queue_stopped()
> call out of sch_generic.c.  So the underlying qdisc is only responsible
> for checking the queue status now before dequeueing.


Yes, I noticed that now. Doesn't seem right though as long as
queueing while queue is stopped is treated as a bug by the
drivers.

But I vaguely recall seeing a discussion about this, I'll check
the archives.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API
  2007-06-28 19:03       ` Patrick McHardy
@ 2007-06-28 19:06         ` Waskiewicz Jr, Peter P
  2007-06-28 19:20           ` Patrick McHardy
  0 siblings, 1 reply; 31+ messages in thread
From: Waskiewicz Jr, Peter P @ 2007-06-28 19:06 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: davem, netdev, jeff, Kok, Auke-jan H, hadi

> Waskiewicz Jr, Peter P wrote:
> >>Quick question: where are the sch_generic changes? :)
> >>
> >>If you hold for ten minutes I'll post a set of slightly changed 
> >>patches with the NETDEVICES_MULTIQUEUE option and a fix for this.
> > 
> > 
> > Jamal's and KK's qdisc_restart() rewrite took the netif_queue_stopped()
> > call out of sch_generic.c.  So the underlying qdisc is only responsible
> > for checking the queue status now before dequeueing.
> 
> 
> Yes, I noticed that now. Doesn't seem right though as long as
> queueing while queue is stopped is treated as a bug by the
> drivers.
> 
> But I vaguely recall seeing a discussion about this, I'll check
> the archives.

The basic gist is that before the dequeue is done, the qdisc is locked by
the QDISC_RUNNING bit, so another CPU cannot get in there.  So if the
queue isn't stopped when a dequeue is done, that same queue should not
be stopped when hard_start_xmit() is called.  The only thing I could
think of that could happen is some out-of-band cleanup routine in the
driver where the tx_ring lock is held, and the skb is bounced back,
where the driver returns NETDEV_TX_BUSY, and you requeue.  This is an
extreme corner case, so the check could be removed.
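
The serialization argument above can be sketched in user space roughly as
follows.  This is only a model under stated assumptions: toy_qdisc,
toy_qdisc_run and the toy_xmit_* helpers are hypothetical names, a C11
atomic_flag stands in for the running bit, and "packets" are just a
counter:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum toy_tx { TOY_TX_OK, TOY_TX_BUSY };

/* Hypothetical model of a qdisc guarded by a "running" bit. */
struct toy_qdisc {
	atomic_flag running;	/* stands in for the running bit */
	int pending;		/* packets still queued */
	int requeues;		/* times a packet was bounced back */
};

typedef enum toy_tx (*toy_xmit_fn)(int pkt);

static void toy_qdisc_init(struct toy_qdisc *q, int pending)
{
	atomic_flag_clear(&q->running);
	q->pending = pending;
	q->requeues = 0;
}

/*
 * Drain the queue unless another context already owns it.  Returns false
 * without touching the queue when the running bit was already set.  If
 * the (hypothetical) driver reports busy, the packet is put back and the
 * run ends; nothing is dropped.
 */
static bool toy_qdisc_run(struct toy_qdisc *q, toy_xmit_fn xmit)
{
	if (atomic_flag_test_and_set(&q->running))
		return false;
	while (q->pending > 0) {
		int pkt = q->pending;	/* dequeue */
		q->pending--;
		if (xmit(pkt) == TOY_TX_BUSY) {
			q->pending++;	/* requeue */
			q->requeues++;
			break;
		}
	}
	atomic_flag_clear(&q->running);
	return true;
}

/* Two toy drivers: one always accepts, one always bounces the packet. */
static enum toy_tx toy_xmit_ok(int pkt) { (void)pkt; return TOY_TX_OK; }
static enum toy_tx toy_xmit_busy(int pkt) { (void)pkt; return TOY_TX_BUSY; }
```

The busy path only costs latency in this model: the packet count is
unchanged after a bounced transmit, and a second caller that finds the
bit set simply backs off.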

-PJ

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API
  2007-06-28 19:06         ` Waskiewicz Jr, Peter P
@ 2007-06-28 19:20           ` Patrick McHardy
  2007-06-28 19:32             ` Jeff Garzik
  0 siblings, 1 reply; 31+ messages in thread
From: Patrick McHardy @ 2007-06-28 19:20 UTC (permalink / raw)
  To: Waskiewicz Jr, Peter P; +Cc: davem, netdev, jeff, Kok, Auke-jan H, hadi

Waskiewicz Jr, Peter P wrote:
>>Waskiewicz Jr, Peter P wrote:
>>
>>Yes, I noticed that now. Doesn't seem right though as long as
>>queueing while queue is stopped is treated as a bug by the
>>drivers.
>>
>>But I vaguely recall seeing a discussion about this, I'll check
>>the archives.
> 
> 
> The basic gist is before the dequeue is done, the qdisc is locked by the
> qdisc is running bit, so another CPU cannot get in there.  So if the
> queue isn't stopped when a dequeue is done, that same queue should not
> be stopped when hard_start_xmit() is called.  The only thing I could
> think of that could happen is some out-of-band cleanup routine in the
> driver where the tx_ring lock is held, and the skb is bounced back,
> where the driver returns NETDEV_TX_BUSY, and you requeue.  This is an
> extreme corner case, so the check could be removed.


Yes, but there are users that don't go through qdiscs, like netpoll.
Having them check the QDISC_RUNNING bit seems ugly.
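
For a caller that hands packets straight to the driver, the alternative
is to test the stopped state itself before transmitting.  A minimal
user-space sketch of that check, assuming one XOFF-style bit per
subqueue; toy_netdev, TOY_XOFF and the toy_* helpers are hypothetical
names, not the kernel API:

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_SUBQUEUES 8
#define TOY_XOFF 0x1UL		/* "queue is stopped" bit */

/* Hypothetical device model: a device-wide state word plus one per subqueue. */
struct toy_netdev {
	unsigned long state;
	unsigned int subqueue_count;
	unsigned long subqueue_state[TOY_MAX_SUBQUEUES];
};

static void toy_stop_subqueue(struct toy_netdev *dev, unsigned int idx)
{
	assert(idx < dev->subqueue_count);
	dev->subqueue_state[idx] |= TOY_XOFF;
}

static void toy_start_subqueue(struct toy_netdev *dev, unsigned int idx)
{
	assert(idx < dev->subqueue_count);
	dev->subqueue_state[idx] &= ~TOY_XOFF;
}

static bool toy_subqueue_stopped(const struct toy_netdev *dev, unsigned int idx)
{
	assert(idx < dev->subqueue_count);
	return (dev->subqueue_state[idx] & TOY_XOFF) != 0;
}

/*
 * What a direct sender would test before calling the driver: the whole
 * device must be running, and so must the subqueue the packet maps to.
 */
static bool toy_can_xmit(const struct toy_netdev *dev, unsigned int queue_mapping)
{
	return !(dev->state & TOY_XOFF) &&
	       !toy_subqueue_stopped(dev, queue_mapping);
}
```

Stopping one subqueue blocks only packets mapped to it, while setting the
device-wide bit blocks every subqueue.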

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API
  2007-06-28 19:20           ` Patrick McHardy
@ 2007-06-28 19:32             ` Jeff Garzik
  2007-06-28 19:37               ` Patrick McHardy
  2007-06-28 20:39               ` David Miller
  0 siblings, 2 replies; 31+ messages in thread
From: Jeff Garzik @ 2007-06-28 19:32 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Waskiewicz Jr, Peter P, davem, netdev, Kok, Auke-jan H, hadi

Patrick McHardy wrote:
> Yes, but there are users that don't go through qdiscs, like netpoll,
> Having them check the QDISC_RUNNING bit seems ugly.

Is netpoll the only such user?

netpoll tends to be a special case in every sense of the word, and I 
wish it was less so :/

	Jeff



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API
  2007-06-28 19:32             ` Jeff Garzik
@ 2007-06-28 19:37               ` Patrick McHardy
  2007-06-28 21:11                 ` Waskiewicz Jr, Peter P
  2007-06-28 20:39               ` David Miller
  1 sibling, 1 reply; 31+ messages in thread
From: Patrick McHardy @ 2007-06-28 19:37 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Waskiewicz Jr, Peter P, davem, netdev, Kok, Auke-jan H, hadi

Jeff Garzik wrote:
> Patrick McHardy wrote:
> 
>> Yes, but there are users that don't go through qdiscs, like netpoll,
>> Having them check the QDISC_RUNNING bit seems ugly.
> 
> 
> Is netpoll the only such user?

I'm not sure, I just remembered that one :)

Looking at Peter's multiqueue patch, which should include all
hard_start_xmit users (I'm not seeing sch_teql though, Peter?)
the only other one is pktgen.

> netpoll tends to be a special case in every sense of the word, and I
> wish it was less so :/

Indeed.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API
  2007-06-28 19:37               ` Patrick McHardy
@ 2007-06-28 21:11                 ` Waskiewicz Jr, Peter P
  2007-06-28 21:18                   ` Patrick McHardy
  0 siblings, 1 reply; 31+ messages in thread
From: Waskiewicz Jr, Peter P @ 2007-06-28 21:11 UTC (permalink / raw)
  To: Patrick McHardy, Jeff Garzik; +Cc: davem, netdev, Kok, Auke-jan H, hadi

> -----Original Message-----
> From: Patrick McHardy [mailto:kaber@trash.net] 
> Sent: Thursday, June 28, 2007 12:37 PM
> To: Jeff Garzik
> Cc: Waskiewicz Jr, Peter P; davem@davemloft.net; 
> netdev@vger.kernel.org; Kok, Auke-jan H; hadi@cyberus.ca
> Subject: Re: [PATCH 2/3] NET: [CORE] Stack changes to add 
> multiqueue hardware support API
> 
> Jeff Garzik wrote:
> > Patrick McHardy wrote:
> > 
> >> Yes, but there are users that don't go through qdiscs, like netpoll,
> >> Having them check the QDISC_RUNNING bit seems ugly.
> > 
> > 
> > Is netpoll the only such user?
> 
> I'm not sure, I just remembered that one :)
> 
> Looking at Peter's multiqueue patch, which should include all 
> hard_start_xmit users (I'm not seeing sch_teql though, 
> Peter?) the only other one is pktgen.

Ugh.  That is another netif_queue_stopped() that needs
netif_subqueue_stopped().  I can send an updated patch for the core to
fix this, based on your patches, Patrick.

> 
> > netpoll tends to be a special case in every sense of the word, and I
> > wish it was less so :/
> 
> Indeed.

So what do we do about netpoll then wrt netif_(sub)queue_stopped() being
removed from qdisc_restart()?  The fallout of having netpoll cause a
queue to stop (queue 0 only) is that the skb sent will be requeued, since
the driver will return NETDEV_TX_BUSY if this actually happens.  But this
is a corner case, and we won't lose packets; we'll just have increased
latency on that queue.  Should I worry about this or just move forward
with the sch_teql.c change and repost the core patch?

Thanks,
-PJ Waskiewicz

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API
  2007-06-28 21:11                 ` Waskiewicz Jr, Peter P
@ 2007-06-28 21:18                   ` Patrick McHardy
  2007-06-28 23:08                     ` Waskiewicz Jr, Peter P
  0 siblings, 1 reply; 31+ messages in thread
From: Patrick McHardy @ 2007-06-28 21:18 UTC (permalink / raw)
  To: Waskiewicz Jr, Peter P; +Cc: Jeff Garzik, davem, netdev, Kok, Auke-jan H, hadi

[-- Attachment #1: Type: text/plain, Size: 1048 bytes --]

Waskiewicz Jr, Peter P wrote:
>>
>> Looking at Peter's multiqueue patch, which should include all 
>> hard_start_xmit users (I'm not seeing sch_teql though, 
>> Peter?) the only other one is pktgen.
>>     
>
> Ugh.  That is another netif_queue_stopped() that needs
> netif_subqueue_stopped().  I can send an updated patch for the core to
> fix this based from your patches Patrick.
>   

I still have the tree around, here's an updated version.

>
> So what do we do about netpoll then wrt netif_(sub)queue_stopped() being
> removed from qdisc_restart()?  The fallout of having netpoll() cause a
> queue to stop (queue 0 only) is the skb sent will be requeued, since the
> driver will return NETDEV_TX_BUSY if this actually happens.  But this is
> a corner case, and we won't lose packets; we'll just have increased
> latency on that queue.  Should I worry about this or just move forward
> with the sch_teql.c change and repost the core patch?
>   


I don't think you need to worry about that, the subqueue
patch just follows the existing code.


[-- Attachment #2: 01.diff --]
[-- Type: text/x-diff, Size: 16275 bytes --]

[CORE] Stack changes to add multiqueue hardware support API

Add the multiqueue hardware device support API to the core network
stack.  Allow drivers to allocate multiple queues and manage them
at the netdev level if they choose to do so.

Added a new field to sk_buff, namely queue_mapping, for drivers to
know which tx_ring to select based on OS classification of the flow.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>

---
commit 8658c76aeb209c91598d275b99d602bf5aeccb63
tree 915d315cdcff46d1f9aadb785a0173d318452122
parent 0552c565358330c59913f2b512355f196e31bd74
author Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> Thu, 28 Jun 2007 18:41:00 +0200
committer Patrick McHardy <kaber@trash.net> Thu, 28 Jun 2007 23:17:13 +0200

 drivers/net/Kconfig         |    8 +++++
 include/linux/etherdevice.h |    3 +-
 include/linux/netdevice.h   |   76 ++++++++++++++++++++++++++++++++++++++++++-
 include/linux/skbuff.h      |   25 ++++++++++++--
 net/core/dev.c              |   27 +++++++++++----
 net/core/netpoll.c          |    8 +++--
 net/core/pktgen.c           |   10 ++++--
 net/core/skbuff.c           |    3 ++
 net/ethernet/eth.c          |    9 +++--
 net/sched/sch_teql.c        |    6 +++
 10 files changed, 150 insertions(+), 25 deletions(-)

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index e5549dc..8bce4fb 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -28,6 +28,14 @@ config NETDEVICES
 # that for each of the symbols.
 if NETDEVICES
 
+config NETDEVICES_MULTIQUEUE
+	bool "Netdevice multiple hardware queue support"
+	---help---
+	  Say Y here if you want to allow the network stack to use multiple
+	  hardware TX queues on an ethernet device.
+
+	  Most people will say N here.
+
 config IFB
 	tristate "Intermediate Functional Block support"
 	depends on NET_CLS_ACT
diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h
index f48eb89..6cdb973 100644
--- a/include/linux/etherdevice.h
+++ b/include/linux/etherdevice.h
@@ -39,7 +39,8 @@ extern void		eth_header_cache_update(struct hh_cache *hh, struct net_device *dev
 extern int		eth_header_cache(struct neighbour *neigh,
 					 struct hh_cache *hh);
 
-extern struct net_device *alloc_etherdev(int sizeof_priv);
+extern struct net_device *alloc_etherdev_mq(int sizeof_priv, unsigned int queue_count);
+#define alloc_etherdev(sizeof_priv) alloc_etherdev_mq(sizeof_priv, 1)
 
 /**
  * is_zero_ether_addr - Determine if give Ethernet address is all zeros.
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2c0cc19..1b43e15 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -108,6 +108,14 @@ struct wireless_dev;
 #define MAX_HEADER (LL_MAX_HEADER + 48)
 #endif
 
+struct net_device_subqueue
+{
+	/* Give a control state for each queue.  This struct may contain
+	 * per-queue locks in the future.
+	 */
+	unsigned long   state;
+};
+
 /*
  *	Network device statistics. Akin to the 2.0 ether stats but
  *	with byte counters.
@@ -331,6 +339,7 @@ struct net_device
 #define NETIF_F_VLAN_CHALLENGED	1024	/* Device cannot handle VLAN packets */
 #define NETIF_F_GSO		2048	/* Enable software GSO. */
 #define NETIF_F_LLTX		4096	/* LockLess TX */
+#define NETIF_F_MULTI_QUEUE	16384	/* Has multiple TX/RX queues */
 
 	/* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT	16
@@ -557,6 +566,10 @@ struct net_device
 
 	/* rtnetlink link ops */
 	const struct rtnl_link_ops *rtnl_link_ops;
+
+	/* The TX queue control structures */
+	unsigned int			egress_subqueue_count;
+	struct net_device_subqueue	egress_subqueue[0];
 };
 #define to_net_dev(d) container_of(d, struct net_device, dev)
 
@@ -719,6 +732,62 @@ static inline int netif_running(const struct net_device *dev)
 	return test_bit(__LINK_STATE_START, &dev->state);
 }
 
+/*
+ * Routines to manage the subqueues on a device.  We only need start,
+ * stop, and a check if it's stopped.  All other device management is
+ * done at the overall netdevice level.
+ * Also test the device if we're multiqueue.
+ */
+static inline void netif_start_subqueue(struct net_device *dev, u16 queue_index)
+{
+#ifdef CONFIG_NETDEVICES_MULTIQUEUE
+	clear_bit(__LINK_STATE_XOFF, &dev->egress_subqueue[queue_index].state);
+#endif
+}
+
+static inline void netif_stop_subqueue(struct net_device *dev, u16 queue_index)
+{
+#ifdef CONFIG_NETDEVICES_MULTIQUEUE
+#ifdef CONFIG_NETPOLL_TRAP
+	if (netpoll_trap())
+		return;
+#endif
+	set_bit(__LINK_STATE_XOFF, &dev->egress_subqueue[queue_index].state);
+#endif
+}
+
+static inline int netif_subqueue_stopped(const struct net_device *dev,
+					 u16 queue_index)
+{
+#ifdef CONFIG_NETDEVICES_MULTIQUEUE
+	return test_bit(__LINK_STATE_XOFF,
+			&dev->egress_subqueue[queue_index].state);
+#else
+	return 0;
+#endif
+}
+
+static inline void netif_wake_subqueue(struct net_device *dev, u16 queue_index)
+{
+#ifdef CONFIG_NETDEVICES_MULTIQUEUE
+#ifdef CONFIG_NETPOLL_TRAP
+	if (netpoll_trap())
+		return;
+#endif
+	if (test_and_clear_bit(__LINK_STATE_XOFF,
+			       &dev->egress_subqueue[queue_index].state))
+		__netif_schedule(dev);
+#endif
+}
+
+static inline int netif_is_multiqueue(const struct net_device *dev)
+{
+#ifdef CONFIG_NETDEVICES_MULTIQUEUE
+	return (!!(NETIF_F_MULTI_QUEUE & dev->features));
+#else
+	return 0;
+#endif
+}
 
 /* Use this variant when it is known for sure that it
  * is executing from interrupt context.
@@ -1009,8 +1078,11 @@ static inline void netif_tx_disable(struct net_device *dev)
 extern void		ether_setup(struct net_device *dev);
 
 /* Support for loadable net-drivers */
-extern struct net_device *alloc_netdev(int sizeof_priv, const char *name,
-				       void (*setup)(struct net_device *));
+extern struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
+				       void (*setup)(struct net_device *),
+				       unsigned int queue_count);
+#define alloc_netdev(sizeof_priv, name, setup) \
+	alloc_netdev_mq(sizeof_priv, name, setup, 1)
 extern int		register_netdev(struct net_device *dev);
 extern void		unregister_netdev(struct net_device *dev);
 /* Functions used for secondary unicast and multicast support */
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index b7b2628..e3851bb 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -196,7 +196,6 @@ typedef unsigned char *sk_buff_data_t;
  *	@sk: Socket we are owned by
  *	@tstamp: Time we arrived
  *	@dev: Device we arrived on/are leaving by
- *	@iif: ifindex of device we arrived on
  *	@transport_header: Transport layer header
  *	@network_header: Network layer header
  *	@mac_header: Link layer header
@@ -231,6 +230,8 @@ typedef unsigned char *sk_buff_data_t;
  *	@nfctinfo: Relationship of this skb to the connection
  *	@nfct_reasm: netfilter conntrack re-assembly pointer
  *	@nf_bridge: Saved data about a bridged frame - see br_netfilter.c
+ *	@iif: ifindex of device we arrived on
+ *	@queue_mapping: Queue mapping for multiqueue devices
  *	@tc_index: Traffic control index
  *	@tc_verd: traffic control verdict
  *	@dma_cookie: a cookie to one of several possible DMA operations
@@ -246,8 +247,6 @@ struct sk_buff {
 	struct sock		*sk;
 	ktime_t			tstamp;
 	struct net_device	*dev;
-	int			iif;
-	/* 4 byte hole on 64 bit*/
 
 	struct  dst_entry	*dst;
 	struct	sec_path	*sp;
@@ -290,12 +289,18 @@ struct sk_buff {
 #ifdef CONFIG_BRIDGE_NETFILTER
 	struct nf_bridge_info	*nf_bridge;
 #endif
+
+	int			iif;
+	__u16			queue_mapping;
+
 #ifdef CONFIG_NET_SCHED
 	__u16			tc_index;	/* traffic control index */
 #ifdef CONFIG_NET_CLS_ACT
 	__u16			tc_verd;	/* traffic control verdict */
 #endif
 #endif
+	/* 2 byte hole */
+
 #ifdef CONFIG_NET_DMA
 	dma_cookie_t		dma_cookie;
 #endif
@@ -1721,6 +1726,20 @@ static inline void skb_init_secmark(struct sk_buff *skb)
 { }
 #endif
 
+static inline void skb_set_queue_mapping(struct sk_buff *skb, u16 queue_mapping)
+{
+#ifdef CONFIG_NETDEVICES_MULTIQUEUE
+	skb->queue_mapping = queue_mapping;
+#endif
+}
+
+static inline void skb_copy_queue_mapping(struct sk_buff *to, const struct sk_buff *from)
+{
+#ifdef CONFIG_NETDEVICES_MULTIQUEUE
+	to->queue_mapping = from->queue_mapping;
+#endif
+}
+
 static inline int skb_is_gso(const struct sk_buff *skb)
 {
 	return skb_shinfo(skb)->gso_size;
diff --git a/net/core/dev.c b/net/core/dev.c
index 778e102..e94991a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1429,7 +1429,9 @@ gso:
 			skb->next = nskb;
 			return rc;
 		}
-		if (unlikely(netif_queue_stopped(dev) && skb->next))
+		if (unlikely((netif_queue_stopped(dev) ||
+			     netif_subqueue_stopped(dev, skb->queue_mapping)) &&
+			     skb->next))
 			return NETDEV_TX_BUSY;
 	} while (skb->next);
 
@@ -1547,6 +1549,8 @@ gso:
 		spin_lock(&dev->queue_lock);
 		q = dev->qdisc;
 		if (q->enqueue) {
+			/* reset queue_mapping to zero */
+			skb->queue_mapping = 0;
 			rc = q->enqueue(skb, q);
 			qdisc_run(dev);
 			spin_unlock(&dev->queue_lock);
@@ -1576,7 +1580,8 @@ gso:
 
 			HARD_TX_LOCK(dev, cpu);
 
-			if (!netif_queue_stopped(dev)) {
+			if (!netif_queue_stopped(dev) &&
+			    !netif_subqueue_stopped(dev, skb->queue_mapping)) {
 				rc = 0;
 				if (!dev_hard_start_xmit(skb, dev)) {
 					HARD_TX_UNLOCK(dev);
@@ -3539,16 +3544,18 @@ static struct net_device_stats *internal_stats(struct net_device *dev)
 }
 
 /**
- *	alloc_netdev - allocate network device
+ *	alloc_netdev_mq - allocate network device
  *	@sizeof_priv:	size of private data to allocate space for
  *	@name:		device name format string
  *	@setup:		callback to initialize device
+ *	@queue_count:	the number of subqueues to allocate
  *
  *	Allocates a struct net_device with private data area for driver use
- *	and performs basic initialization.
+ *	and performs basic initialization.  Also allocates subqueue structs
+ *	for each queue on the device at the end of the netdevice.
  */
-struct net_device *alloc_netdev(int sizeof_priv, const char *name,
-		void (*setup)(struct net_device *))
+struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
+		void (*setup)(struct net_device *), unsigned int queue_count)
 {
 	void *p;
 	struct net_device *dev;
@@ -3557,7 +3564,9 @@ struct net_device *alloc_netdev(int sizeof_priv, const char *name,
 	BUG_ON(strlen(name) >= sizeof(dev->name));
 
 	/* ensure 32-byte alignment of both the device and private area */
-	alloc_size = (sizeof(*dev) + NETDEV_ALIGN_CONST) & ~NETDEV_ALIGN_CONST;
+	alloc_size = (sizeof(*dev) + NETDEV_ALIGN_CONST +
+		     (sizeof(struct net_device_subqueue) * queue_count)) &
+		     ~NETDEV_ALIGN_CONST;
 	alloc_size += sizeof_priv + NETDEV_ALIGN_CONST;
 
 	p = kzalloc(alloc_size, GFP_KERNEL);
@@ -3573,12 +3582,14 @@ struct net_device *alloc_netdev(int sizeof_priv, const char *name,
 	if (sizeof_priv)
 		dev->priv = netdev_priv(dev);
 
+	dev->egress_subqueue_count = queue_count;
+
 	dev->get_stats = internal_stats;
 	setup(dev);
 	strcpy(dev->name, name);
 	return dev;
 }
-EXPORT_SYMBOL(alloc_netdev);
+EXPORT_SYMBOL(alloc_netdev_mq);
 
 /**
  *	free_netdev - free network device
diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 758dafe..2ace33d 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -66,8 +66,9 @@ static void queue_process(struct work_struct *work)
 
 		local_irq_save(flags);
 		netif_tx_lock(dev);
-		if (netif_queue_stopped(dev) ||
-		    dev->hard_start_xmit(skb, dev) != NETDEV_TX_OK) {
+		if ((netif_queue_stopped(dev) ||
+		     netif_subqueue_stopped(dev, skb->queue_mapping)) ||
+		     dev->hard_start_xmit(skb, dev) != NETDEV_TX_OK) {
 			skb_queue_head(&npinfo->txq, skb);
 			netif_tx_unlock(dev);
 			local_irq_restore(flags);
@@ -254,7 +255,8 @@ static void netpoll_send_skb(struct netpoll *np, struct sk_buff *skb)
 			/* try until next clock tick */
 			for (tries = jiffies_to_usecs(1)/USEC_PER_POLL;
 					tries > 0; --tries) {
-				if (!netif_queue_stopped(dev))
+				if (!netif_queue_stopped(dev) &&
+				    !netif_subqueue_stopped(dev, skb->queue_mapping))
 					status = dev->hard_start_xmit(skb, dev);
 
 				if (status == NETDEV_TX_OK)
diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index 9cd3a1c..dffe067 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -3139,7 +3139,9 @@ static __inline__ void pktgen_xmit(struct pktgen_dev *pkt_dev)
 		}
 	}
 
-	if (netif_queue_stopped(odev) || need_resched()) {
+	if ((netif_queue_stopped(odev) ||
+	     netif_subqueue_stopped(odev, pkt_dev->skb->queue_mapping)) ||
+	     need_resched()) {
 		idle_start = getCurUs();
 
 		if (!netif_running(odev)) {
@@ -3154,7 +3156,8 @@ static __inline__ void pktgen_xmit(struct pktgen_dev *pkt_dev)
 
 		pkt_dev->idle_acc += getCurUs() - idle_start;
 
-		if (netif_queue_stopped(odev)) {
+		if (netif_queue_stopped(odev) ||
+		    netif_subqueue_stopped(odev, pkt_dev->skb->queue_mapping)) {
 			pkt_dev->next_tx_us = getCurUs();	/* TODO */
 			pkt_dev->next_tx_ns = 0;
 			goto out;	/* Try the next interface */
@@ -3181,7 +3184,8 @@ static __inline__ void pktgen_xmit(struct pktgen_dev *pkt_dev)
 	}
 
 	netif_tx_lock_bh(odev);
-	if (!netif_queue_stopped(odev)) {
+	if (!netif_queue_stopped(odev) &&
+	    !netif_subqueue_stopped(odev, pkt_dev->skb->queue_mapping)) {
 
 		atomic_inc(&(pkt_dev->skb->users));
 	      retry_now:
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 8d8e8fc..af03556 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -419,6 +419,7 @@ struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
 	n->nohdr = 0;
 	C(pkt_type);
 	C(ip_summed);
+	skb_copy_queue_mapping(n, skb);
 	C(priority);
 #if defined(CONFIG_IP_VS) || defined(CONFIG_IP_VS_MODULE)
 	C(ipvs_property);
@@ -460,6 +461,7 @@ static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 #endif
 	new->sk		= NULL;
 	new->dev	= old->dev;
+	skb_copy_queue_mapping(new, old);
 	new->priority	= old->priority;
 	new->protocol	= old->protocol;
 	new->dst	= dst_clone(old->dst);
@@ -1927,6 +1929,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, int features)
 		tail = nskb;
 
 		nskb->dev = skb->dev;
+		skb_copy_queue_mapping(nskb, skb);
 		nskb->priority = skb->priority;
 		nskb->protocol = skb->protocol;
 		nskb->dst = dst_clone(skb->dst);
diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
index 0ac2524..1387e54 100644
--- a/net/ethernet/eth.c
+++ b/net/ethernet/eth.c
@@ -316,9 +316,10 @@ void ether_setup(struct net_device *dev)
 EXPORT_SYMBOL(ether_setup);
 
 /**
- * alloc_etherdev - Allocates and sets up an Ethernet device
+ * alloc_etherdev_mq - Allocates and sets up an Ethernet device
  * @sizeof_priv: Size of additional driver-private structure to be allocated
  *	for this Ethernet device
+ * @queue_count: The number of queues this device has.
  *
  * Fill in the fields of the device structure with Ethernet-generic
  * values. Basically does everything except registering the device.
@@ -328,8 +329,8 @@ EXPORT_SYMBOL(ether_setup);
  * this private data area.
  */
 
-struct net_device *alloc_etherdev(int sizeof_priv)
+struct net_device *alloc_etherdev_mq(int sizeof_priv, unsigned int queue_count)
 {
-	return alloc_netdev(sizeof_priv, "eth%d", ether_setup);
+	return alloc_netdev_mq(sizeof_priv, "eth%d", ether_setup, queue_count);
 }
-EXPORT_SYMBOL(alloc_etherdev);
+EXPORT_SYMBOL(alloc_etherdev_mq);
diff --git a/net/sched/sch_teql.c b/net/sched/sch_teql.c
index f05ad9a..dfe7e45 100644
--- a/net/sched/sch_teql.c
+++ b/net/sched/sch_teql.c
@@ -277,6 +277,7 @@ static int teql_master_xmit(struct sk_buff *skb, struct net_device *dev)
 	int busy;
 	int nores;
 	int len = skb->len;
+	int subq = skb->queue_mapping;
 	struct sk_buff *skb_res = NULL;
 
 	start = master->slaves;
@@ -293,7 +294,9 @@ restart:
 
 		if (slave->qdisc_sleeping != q)
 			continue;
-		if (netif_queue_stopped(slave) || ! netif_running(slave)) {
+		if (netif_queue_stopped(slave) ||
+		    netif_subqueue_stopped(slave, subq) ||
+		    !netif_running(slave)) {
 			busy = 1;
 			continue;
 		}
@@ -302,6 +305,7 @@ restart:
 		case 0:
 			if (netif_tx_trylock(slave)) {
 				if (!netif_queue_stopped(slave) &&
+				    !netif_subqueue_stopped(slave, subq) &&
 				    slave->hard_start_xmit(skb, slave) == 0) {
 					netif_tx_unlock(slave);
 					master->slaves = NEXT_SLAVE(q);

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* RE: [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API
  2007-06-28 21:18                   ` Patrick McHardy
@ 2007-06-28 23:08                     ` Waskiewicz Jr, Peter P
  2007-06-28 23:31                       ` David Miller
  0 siblings, 1 reply; 31+ messages in thread
From: Waskiewicz Jr, Peter P @ 2007-06-28 23:08 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Jeff Garzik, davem, netdev, Kok, Auke-jan H, hadi

> Waskiewicz Jr, Peter P wrote:
> >>
> >> Looking at Peter's multiqueue patch, which should include all
> >> hard_start_xmit users (I'm not seeing sch_teql though,
> >> Peter?) the only other one is pktgen.
> >>
> >
> > Ugh.  That is another netif_queue_stopped() that needs
> > netif_subqueue_stopped().  I can send an updated patch for the core
> > to fix this based from your patches Patrick.
> >
>
> I still have the tree around, here's an updated version.
>
> >
> > So what do we do about netpoll then wrt netif_(sub)queue_stopped()
> > being removed from qdisc_restart()?  The fallout of having netpoll()
> > cause a queue to stop (queue 0 only) is the skb sent will be
> > requeued, since the driver will return NETIF_TX_BUSY if this
> > actually happens.  But this is a corner case, and we won't lose
> > packets; we'll just have increased latency on that queue.  Should I
> > worry about this or just move forward with the sch_teql.c change and
> > repost the core patch?
> >
>
> I don't think you need to worry about that, the subqueue
> patch just follows the existing code.

Thanks Patrick for taking care of this.  I am totally fine with this
patch; if anyone else has feedback, please send it.  If not, I'm excited
to see if these can be considered for 2.6.23 now.  :)  Thanks everyone
for the help.

Cheers,
-PJ Waskiewicz

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API
  2007-06-28 23:08                     ` Waskiewicz Jr, Peter P
@ 2007-06-28 23:31                       ` David Miller
  0 siblings, 0 replies; 31+ messages in thread
From: David Miller @ 2007-06-28 23:31 UTC (permalink / raw)
  To: peter.p.waskiewicz.jr; +Cc: kaber, jeff, netdev, auke-jan.h.kok, hadi

From: "Waskiewicz Jr, Peter P" <peter.p.waskiewicz.jr@intel.com>
Date: Thu, 28 Jun 2007 16:08:43 -0700

> Thanks Patrick for taking care of this.  I am totally fine with this
> patch; if anyone else has feedback, please send it.  If not, I'm excited
> to see if these can be considered for 2.6.23 now.  :)  Thanks everyone
> for the help.

I'll look over the current patches later this evening, I was
initially waiting for the GSO BUG() akpm reported to get fixed
and Herbert took care of that an hour or so ago.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API
  2007-06-28 19:32             ` Jeff Garzik
  2007-06-28 19:37               ` Patrick McHardy
@ 2007-06-28 20:39               ` David Miller
  1 sibling, 0 replies; 31+ messages in thread
From: David Miller @ 2007-06-28 20:39 UTC (permalink / raw)
  To: jeff; +Cc: kaber, peter.p.waskiewicz.jr, netdev, auke-jan.h.kok, hadi

From: Jeff Garzik <jeff@garzik.org>
Date: Thu, 28 Jun 2007 15:32:40 -0400

> Patrick McHardy wrote:
> > Yes, but there are users that don't go through qdiscs, like netpoll,
> > Having them check the QDISC_RUNNING bit seems ugly.
> 
> Is netpoll the only such user?
> 
> netpoll tends to be a special case in every sense of the word, and I 
> wish it was less so :/

Seconded.

I'm perfectly happy to consider rearchitecting of netpoll to something
that works better.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API
  2007-06-28 16:21 ` [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API PJ Waskiewicz
  2007-06-28 16:31   ` Patrick McHardy
  2007-06-28 17:00   ` Patrick McHardy
@ 2007-06-29  3:39   ` David Miller
  2007-06-29 10:54     ` Jeff Garzik
  2 siblings, 1 reply; 31+ messages in thread
From: David Miller @ 2007-06-29  3:39 UTC (permalink / raw)
  To: peter.p.waskiewicz.jr; +Cc: netdev, jeff, auke-jan.h.kok, hadi, kaber

From: PJ Waskiewicz <peter.p.waskiewicz.jr@intel.com>
Date: Thu, 28 Jun 2007 09:21:13 -0700

> -struct net_device *alloc_netdev(int sizeof_priv, const char *name,
> -		void (*setup)(struct net_device *))
> +struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
> +		void (*setup)(struct net_device *), int queue_count)
>  {
>  	void *p;
>  	struct net_device *dev;
> @@ -3557,7 +3564,9 @@ struct net_device *alloc_netdev(int sizeof_priv, const char *name,
>  	BUG_ON(strlen(name) >= sizeof(dev->name));
>  
>  	/* ensure 32-byte alignment of both the device and private area */
> -	alloc_size = (sizeof(*dev) + NETDEV_ALIGN_CONST) & ~NETDEV_ALIGN_CONST;
> +	alloc_size = (sizeof(*dev) + NETDEV_ALIGN_CONST +
> +		     (sizeof(struct net_device_subqueue) * queue_count)) &
> +		     ~NETDEV_ALIGN_CONST;
>  	alloc_size += sizeof_priv + NETDEV_ALIGN_CONST;
>  
>  	p = kzalloc(alloc_size, GFP_KERNEL);
> @@ -3573,12 +3582,14 @@ struct net_device *alloc_netdev(int sizeof_priv, const char *name,
>  	if (sizeof_priv)
>  		dev->priv = netdev_priv(dev);
>  
> +	dev->egress_subqueue_count = queue_count;
> +
>  	dev->get_stats = internal_stats;
>  	setup(dev);
>  	strcpy(dev->name, name);
>  	return dev;
>  }

This isn't going to work.

The pointer returned from netdev_priv() doesn't take into account the
variable sized queues at the end of struct netdev, so we can stomp
over the queues with the private area.

This probably works by luck because of NETDEV_ALIGN.

The simplest fix is to just make netdev_priv() use dev->priv,
except when it's being initialized during allocation, and
that's what I'm going to do when I apply your patch.

^ permalink raw reply	[flat|nested] 31+ messages in thread
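
[Editor's illustration, not part of the original message.]  The layout
problem raised in the review above can be sketched in user space.  This
is a minimal model, assuming netdev_priv() derives the private-area
pointer from sizeof(struct net_device) alone, as the review implies.
struct fake_netdev, struct subqueue, and every helper name below are
hypothetical stand-ins; only NETDEV_ALIGN_CONST and the field names
egress_subqueue_count and egress_subqueue come from the patch.

```c
#include <stddef.h>

#define NETDEV_ALIGN		32
#define NETDEV_ALIGN_CONST	(NETDEV_ALIGN - 1)

/* Hypothetical stand-in for struct net_device_subqueue. */
struct subqueue {
	unsigned long state;
};

/* Hypothetical, heavily reduced stand-in for struct net_device. */
struct fake_netdev {
	char		name[16];
	void		*priv;
	int		egress_subqueue_count;
	struct subqueue	egress_subqueue[];	/* variable-sized tail */
};

/* Where a sizeof()-only netdev_priv() would point (assumed behaviour). */
static size_t priv_offset_fixed(void)
{
	return (sizeof(struct fake_netdev) + NETDEV_ALIGN_CONST) &
	       ~(size_t)NETDEV_ALIGN_CONST;
}

/* Where the patched alloc_size arithmetic reserves room for priv. */
static size_t priv_offset_with_queues(unsigned int queue_count)
{
	return (sizeof(struct fake_netdev) +
		sizeof(struct subqueue) * queue_count +
		NETDEV_ALIGN_CONST) & ~(size_t)NETDEV_ALIGN_CONST;
}

/* First byte past the last subqueue entry. */
static size_t queues_end(unsigned int queue_count)
{
	return offsetof(struct fake_netdev, egress_subqueue) +
	       sizeof(struct subqueue) * queue_count;
}

/* Non-zero when the sizeof()-only pointer lands inside the queue array. */
static int priv_overlaps_queues(unsigned int queue_count)
{
	return priv_offset_fixed() < queues_end(queue_count);
}
```

In this model the alignment padding after the fixed part can absorb a
few subqueue entries, depending on the ABI, which would be consistent
with the "works by luck" remark.  Once the array outgrows that padding,
the sizeof()-only pointer falls inside it, while the offset computed
with the queue count stays clear of it.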

* Re: [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API
  2007-06-29  3:39   ` David Miller
@ 2007-06-29 10:54     ` Jeff Garzik
  0 siblings, 0 replies; 31+ messages in thread
From: Jeff Garzik @ 2007-06-29 10:54 UTC (permalink / raw)
  To: David Miller; +Cc: peter.p.waskiewicz.jr, netdev, auke-jan.h.kok, hadi, kaber

David Miller wrote:
> From: PJ Waskiewicz <peter.p.waskiewicz.jr@intel.com>
> Date: Thu, 28 Jun 2007 09:21:13 -0700
> 
>> -struct net_device *alloc_netdev(int sizeof_priv, const char *name,
>> -		void (*setup)(struct net_device *))
>> +struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
>> +		void (*setup)(struct net_device *), int queue_count)
>>  {
>>  	void *p;
>>  	struct net_device *dev;
>> @@ -3557,7 +3564,9 @@ struct net_device *alloc_netdev(int sizeof_priv, const char *name,
>>  	BUG_ON(strlen(name) >= sizeof(dev->name));
>>  
>>  	/* ensure 32-byte alignment of both the device and private area */
>> -	alloc_size = (sizeof(*dev) + NETDEV_ALIGN_CONST) & ~NETDEV_ALIGN_CONST;
>> +	alloc_size = (sizeof(*dev) + NETDEV_ALIGN_CONST +
>> +		     (sizeof(struct net_device_subqueue) * queue_count)) &
>> +		     ~NETDEV_ALIGN_CONST;
>>  	alloc_size += sizeof_priv + NETDEV_ALIGN_CONST;
>>  
>>  	p = kzalloc(alloc_size, GFP_KERNEL);
>> @@ -3573,12 +3582,14 @@ struct net_device *alloc_netdev(int sizeof_priv, const char *name,
>>  	if (sizeof_priv)
>>  		dev->priv = netdev_priv(dev);
>>  
>> +	dev->egress_subqueue_count = queue_count;
>> +
>>  	dev->get_stats = internal_stats;
>>  	setup(dev);
>>  	strcpy(dev->name, name);
>>  	return dev;
>>  }
> 
> This isn't going to work.
> 
> The pointer returned from netdev_priv() doesn't take into account the
> variable sized queues at the end of struct netdev, so we can stomp
> over the queues with the private area.
> 
> This probably works by luck because of NETDEV_ALIGN.
> 
> The simplest fix is to just make netdev_priv() use dev->priv,
> except when it's being initialized during allocation, and
> that's what I'm going to do when I apply your patch.

Ugh.  That will reverse the gains we had with the current setup, won't it?

Also, what happens when we want to add ingress_queue[0] ?

	Jeff






^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH] NET: Multiple queue hardware support
@ 2007-06-21 21:26 PJ Waskiewicz
  2007-06-21 21:26 ` [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API PJ Waskiewicz
  0 siblings, 1 reply; 31+ messages in thread
From: PJ Waskiewicz @ 2007-06-21 21:26 UTC (permalink / raw)
  To: davem; +Cc: netdev, jeff, auke-jan.h.kok, kaber, hadi

Please consider these patches for 2.6.23 inclusion.

Updates since the last submission:

1. skb->queue_mapping moved into the iif cacheline.  I looked at moving
   iif and queue_mapping, but there wasn't enough room anywhere else to
   logically group these in a different cacheline that I could see.  Thanks
   Patrick McHardy.

2. netdev->egress_subqueue is now indexed thanks to Dave Miller.

3. sch_rr is now a MODULE_ALIAS of sch_prio.  Thanks Patrick McHardy.

4. Both sch_rr and multiqueue sch_prio expect the number of bands to
   equal the number of queues on the netdev.

This patchset is an updated version of previous multiqueue network device
support patches.  The general approach of introducing a new API for multiqueue
network devices to register with the stack has remained.  The changes include
adding a round-robin qdisc, heavily based on sch_prio, which will allow
queueing to hardware with no OS-enforced queuing policy.  sch_prio still has
the multiqueue code in it, but has a Kconfig option to compile it out of the
qdisc.  This allows people with hardware containing scheduling policies to
use sch_rr (round-robin), and others without scheduling policies in hardware
to continue using sch_prio if they wish to have some notion of scheduling
priority.

The patches being sent are split into Documentation, Qdisc changes, and
core stack changes.  The requested e1000 changes are still being resolved,
and will be sent at a later date.

I did not modify other users of netif_queue_stopped() in net/core/netpoll.c,
net/core/dev.c, or net/core/pktgen.c, since no classification occurs for
the skb being sent to the device.  Therefore, packets should always be
ending up in queue 0, so there's no need to check the subqueue status either.

The patches to iproute2 for tc will be sent separately, to support sch_rr.

-- 
PJ Waskiewicz <peter.p.waskiewicz.jr@intel.com>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API
  2007-06-21 21:26 [PATCH] NET: Multiple queue hardware support PJ Waskiewicz
@ 2007-06-21 21:26 ` PJ Waskiewicz
  0 siblings, 0 replies; 31+ messages in thread
From: PJ Waskiewicz @ 2007-06-21 21:26 UTC (permalink / raw)
  To: davem; +Cc: netdev, jeff, auke-jan.h.kok, kaber, hadi

Add the multiqueue hardware device support API to the core network
stack.  Allow drivers to allocate multiple queues and manage them
at the netdev level if they choose to do so.

Added a new field to sk_buff, namely queue_mapping, for drivers to
know which tx_ring to select based on OS classification of the flow.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
---

 include/linux/etherdevice.h |    3 +-
 include/linux/netdevice.h   |   62 ++++++++++++++++++++++++++++++++++++++++++-
 include/linux/skbuff.h      |    4 ++-
 net/core/dev.c              |   20 ++++++++++----
 net/core/skbuff.c           |    3 ++
 net/ethernet/eth.c          |    9 +++---
 6 files changed, 87 insertions(+), 14 deletions(-)

diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h
index f48eb89..b3fbb54 100644
--- a/include/linux/etherdevice.h
+++ b/include/linux/etherdevice.h
@@ -39,7 +39,8 @@ extern void		eth_header_cache_update(struct hh_cache *hh, struct net_device *dev
 extern int		eth_header_cache(struct neighbour *neigh,
 					 struct hh_cache *hh);
 
-extern struct net_device *alloc_etherdev(int sizeof_priv);
+extern struct net_device *alloc_etherdev_mq(int sizeof_priv, int queue_count);
+#define alloc_etherdev(sizeof_priv) alloc_etherdev_mq(sizeof_priv, 1)
 
 /**
  * is_zero_ether_addr - Determine if give Ethernet address is all zeros.
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e7913ee..6509eb4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -108,6 +108,14 @@ struct wireless_dev;
 #define MAX_HEADER (LL_MAX_HEADER + 48)
 #endif
 
+struct net_device_subqueue
+{
+	/* Give a control state for each queue.  This struct may contain
+	 * per-queue locks in the future.
+	 */
+	unsigned long	state;
+};
+
 /*
  *	Network device statistics. Akin to the 2.0 ether stats but
  *	with byte counters.
@@ -325,6 +333,7 @@ struct net_device
 #define NETIF_F_VLAN_CHALLENGED	1024	/* Device cannot handle VLAN packets */
 #define NETIF_F_GSO		2048	/* Enable software GSO. */
 #define NETIF_F_LLTX		4096	/* LockLess TX */
+#define NETIF_F_MULTI_QUEUE	16384	/* Has multiple TX/RX queues */
 
 	/* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT	16
@@ -543,6 +552,10 @@ struct net_device
 
 	/* rtnetlink link ops */
 	const struct rtnl_link_ops *rtnl_link_ops;
+
+ 	/* The TX queue control structures */
+ 	int				egress_subqueue_count;
+ 	struct net_device_subqueue	egress_subqueue[0];
 };
 #define to_net_dev(d) container_of(d, struct net_device, dev)
 
@@ -705,6 +718,48 @@ static inline int netif_running(const struct net_device *dev)
 	return test_bit(__LINK_STATE_START, &dev->state);
 }
 
+/*
+ * Routines to manage the subqueues on a device.  We only need start
+ * stop, and a check if it's stopped.  All other device management is
+ * done at the overall netdevice level.
+ * Also test the device if we're multiqueue.
+ */
+static inline void netif_start_subqueue(struct net_device *dev, u16 queue_index)
+{
+	clear_bit(__LINK_STATE_XOFF, &dev->egress_subqueue[queue_index].state);
+}
+
+static inline void netif_stop_subqueue(struct net_device *dev, u16 queue_index)
+{
+#ifdef CONFIG_NETPOLL_TRAP
+	if (netpoll_trap())
+		return;
+#endif
+	set_bit(__LINK_STATE_XOFF, &dev->egress_subqueue[queue_index].state);
+}
+
+static inline int netif_subqueue_stopped(const struct net_device *dev,
+                                         u16 queue_index)
+{
+	return test_bit(__LINK_STATE_XOFF,
+	                &dev->egress_subqueue[queue_index].state);
+}
+
+static inline void netif_wake_subqueue(struct net_device *dev, u16 queue_index)
+{
+#ifdef CONFIG_NETPOLL_TRAP
+	if (netpoll_trap())
+		return;
+#endif
+	if (test_and_clear_bit(__LINK_STATE_XOFF,
+	                       &dev->egress_subqueue[queue_index].state))
+		__netif_schedule(dev);
+}
+
+static inline int netif_is_multiqueue(const struct net_device *dev)
+{
+	return (!!(NETIF_F_MULTI_QUEUE & dev->features));
+}
 
 /* Use this variant when it is known for sure that it
  * is executing from interrupt context.
@@ -995,8 +1050,11 @@ static inline void netif_tx_disable(struct net_device *dev)
 extern void		ether_setup(struct net_device *dev);
 
 /* Support for loadable net-drivers */
-extern struct net_device *alloc_netdev(int sizeof_priv, const char *name,
-				       void (*setup)(struct net_device *));
+extern struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
+					  void (*setup)(struct net_device *),
+					  int queue_count);
+#define alloc_netdev(sizeof_priv, name, setup) \
+	alloc_netdev_mq(sizeof_priv, name, setup, 1)
 extern int		register_netdev(struct net_device *dev);
 extern void		unregister_netdev(struct net_device *dev);
 /* Functions used for multicast support */
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index e7367c7..01b5e25 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -197,6 +197,7 @@ typedef unsigned char *sk_buff_data_t;
  *	@tstamp: Time we arrived
  *	@dev: Device we arrived on/are leaving by
  *	@iif: ifindex of device we arrived on
+ *	@queue_mapping: Queue mapping for multiqueue devices
  *	@transport_header: Transport layer header
  *	@network_header: Network layer header
  *	@mac_header: Link layer header
@@ -246,7 +247,8 @@ struct sk_buff {
 	ktime_t			tstamp;
 	struct net_device	*dev;
 	int			iif;
-	/* 4 byte hole on 64 bit*/
+	__u16			queue_mapping;
+	/* 2 byte hole on 64 bit*/
 
 	struct  dst_entry	*dst;
 	struct	sec_path	*sp;
diff --git a/net/core/dev.c b/net/core/dev.c
index 2609062..66909aa 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1545,6 +1545,8 @@ gso:
 		spin_lock(&dev->queue_lock);
 		q = dev->qdisc;
 		if (q->enqueue) {
+			/* reset queue_mapping to zero */
+			skb->queue_mapping = 0;
 			rc = q->enqueue(skb, q);
 			qdisc_run(dev);
 			spin_unlock(&dev->queue_lock);
@@ -3343,16 +3345,18 @@ static struct net_device_stats *internal_stats(struct net_device *dev)
 }
 
 /**
- *	alloc_netdev - allocate network device
+ *	alloc_netdev_mq - allocate network device
  *	@sizeof_priv:	size of private data to allocate space for
  *	@name:		device name format string
  *	@setup:		callback to initialize device
+ *	@queue_count:	the number of subqueues to allocate
  *
  *	Allocates a struct net_device with private data area for driver use
- *	and performs basic initialization.
+ *	and performs basic initialization.  Also allocates subqueue structs
+ *	for each queue on the device at the end of the netdevice.
  */
-struct net_device *alloc_netdev(int sizeof_priv, const char *name,
-		void (*setup)(struct net_device *))
+struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
+		void (*setup)(struct net_device *), int queue_count)
 {
 	void *p;
 	struct net_device *dev;
@@ -3361,7 +3365,9 @@ struct net_device *alloc_netdev(int sizeof_priv, const char *name,
 	BUG_ON(strlen(name) >= sizeof(dev->name));
 
 	/* ensure 32-byte alignment of both the device and private area */
-	alloc_size = (sizeof(*dev) + NETDEV_ALIGN_CONST) & ~NETDEV_ALIGN_CONST;
+	alloc_size = (sizeof(*dev) + NETDEV_ALIGN_CONST +
+		     (sizeof(struct net_device_subqueue) * (queue_count - 1))) &
+		     ~NETDEV_ALIGN_CONST;
 	alloc_size += sizeof_priv + NETDEV_ALIGN_CONST;
 
 	p = kzalloc(alloc_size, GFP_KERNEL);
@@ -3377,12 +3383,14 @@ struct net_device *alloc_netdev(int sizeof_priv, const char *name,
 	if (sizeof_priv)
 		dev->priv = netdev_priv(dev);
 
+  	dev->egress_subqueue_count = queue_count;
+
 	dev->get_stats = internal_stats;
 	setup(dev);
 	strcpy(dev->name, name);
 	return dev;
 }
-EXPORT_SYMBOL(alloc_netdev);
+EXPORT_SYMBOL(alloc_netdev_mq);
 
 /**
  *	free_netdev - free network device
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 7c6a34e..7bbed45 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -418,6 +418,7 @@ struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
 	n->nohdr = 0;
 	C(pkt_type);
 	C(ip_summed);
+	C(queue_mapping);
 	C(priority);
 #if defined(CONFIG_IP_VS) || defined(CONFIG_IP_VS_MODULE)
 	C(ipvs_property);
@@ -459,6 +460,7 @@ static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 #endif
 	new->sk		= NULL;
 	new->dev	= old->dev;
+	new->queue_mapping = old->queue_mapping;
 	new->priority	= old->priority;
 	new->protocol	= old->protocol;
 	new->dst	= dst_clone(old->dst);
@@ -1925,6 +1927,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, int features)
 		tail = nskb;
 
 		nskb->dev = skb->dev;
+		nskb->queue_mapping = skb->queue_mapping;
 		nskb->priority = skb->priority;
 		nskb->protocol = skb->protocol;
 		nskb->dst = dst_clone(skb->dst);
diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
index 0ac2524..87a509c 100644
--- a/net/ethernet/eth.c
+++ b/net/ethernet/eth.c
@@ -316,9 +316,10 @@ void ether_setup(struct net_device *dev)
 EXPORT_SYMBOL(ether_setup);
 
 /**
- * alloc_etherdev - Allocates and sets up an Ethernet device
+ * alloc_etherdev_mq - Allocates and sets up an Ethernet device
  * @sizeof_priv: Size of additional driver-private structure to be allocated
  *	for this Ethernet device
+ * @queue_count: The number of queues this device has.
  *
  * Fill in the fields of the device structure with Ethernet-generic
  * values. Basically does everything except registering the device.
@@ -328,8 +329,8 @@ EXPORT_SYMBOL(ether_setup);
  * this private data area.
  */
 
-struct net_device *alloc_etherdev(int sizeof_priv)
+struct net_device *alloc_etherdev_mq(int sizeof_priv, int queue_count)
 {
-	return alloc_netdev(sizeof_priv, "eth%d", ether_setup);
+	return alloc_netdev_mq(sizeof_priv, "eth%d", ether_setup, queue_count);
 }
-EXPORT_SYMBOL(alloc_etherdev);
+EXPORT_SYMBOL(alloc_etherdev_mq);

^ permalink raw reply related	[flat|nested] 31+ messages in thread
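
[Editor's illustration, not part of the original message.]  The
per-queue flow control that the patch above adds can be modelled in
user space.  This is only a sketch under stated assumptions: the
model_* types, functions, and constants are hypothetical and are not
kernel API.  They mirror the intent of netif_stop_subqueue(),
netif_wake_subqueue(), and netif_subqueue_stopped() together with
skb->queue_mapping, but leave out locking, the netpoll trap check, and
the __netif_schedule() call.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NUM_QUEUES	4	/* hypothetical queue count */
#define MODEL_RING_SIZE		2	/* hypothetical per-queue ring depth */

enum { MODEL_TX_OK = 0, MODEL_TX_BUSY = 1 };

struct model_subqueue {
	bool		stopped;	/* stands in for the XOFF state bit */
	unsigned int	in_flight;	/* descriptors currently in use */
};

struct model_dev {
	bool			queue_stopped;	/* device-wide stop flag */
	struct model_subqueue	sub[MODEL_NUM_QUEUES];
};

struct model_skb {
	uint16_t queue_mapping;	/* queue chosen by classification */
};

/* Both the device-wide flag and the subqueue flag must be clear. */
static bool model_can_xmit(const struct model_dev *dev,
			   const struct model_skb *skb)
{
	return !dev->queue_stopped &&
	       !dev->sub[skb->queue_mapping].stopped;
}

/* Driver-style transmit: pick the ring from queue_mapping, and stop
 * only that subqueue once its ring fills up. */
static int model_xmit(struct model_dev *dev, const struct model_skb *skb)
{
	struct model_subqueue *q;

	if (skb->queue_mapping >= MODEL_NUM_QUEUES ||
	    !model_can_xmit(dev, skb))
		return MODEL_TX_BUSY;

	q = &dev->sub[skb->queue_mapping];
	q->in_flight++;
	if (q->in_flight == MODEL_RING_SIZE)
		q->stopped = true;
	return MODEL_TX_OK;
}

/* Completion path: free one descriptor and reopen that subqueue. */
static void model_tx_complete(struct model_dev *dev, uint16_t queue)
{
	struct model_subqueue *q = &dev->sub[queue];

	if (q->in_flight > 0)
		q->in_flight--;
	q->stopped = false;
}
```

The point of the model is that a full ring on one queue leaves the
other queues usable, which a single device-wide stop flag cannot
express.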

end of thread, other threads:[~2007-06-29 10:54 UTC | newest]

Thread overview: 31+ messages
2007-06-23 21:36 [PATCH] NET: Multiple queue hardware support PJ Waskiewicz
2007-06-23 21:36 ` [PATCH 1/3] NET: [DOC] Multiqueue hardware support documentation PJ Waskiewicz
2007-06-23 21:36 ` [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API PJ Waskiewicz
2007-06-24 12:00   ` Patrick McHardy
2007-06-25 16:25     ` Waskiewicz Jr, Peter P
2007-06-23 21:36 ` [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue PJ Waskiewicz
2007-06-24 12:16   ` Patrick McHardy
2007-06-25 17:27     ` Waskiewicz Jr, Peter P
2007-06-25 17:29       ` Patrick McHardy
2007-06-25 21:53     ` Waskiewicz Jr, Peter P
2007-06-25 21:58       ` Patrick McHardy
2007-06-25 22:07         ` Waskiewicz Jr, Peter P
2007-06-24 22:22   ` Patrick McHardy
2007-06-25 17:29     ` Waskiewicz Jr, Peter P
  -- strict thread matches above, loose matches on Subject: below --
2007-06-28 16:20 [PATCH] NET: Multiple queue hardware support PJ Waskiewicz
2007-06-28 16:21 ` [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API PJ Waskiewicz
2007-06-28 16:31   ` Patrick McHardy
2007-06-28 17:00   ` Patrick McHardy
2007-06-28 19:00     ` Waskiewicz Jr, Peter P
2007-06-28 19:03       ` Patrick McHardy
2007-06-28 19:06         ` Waskiewicz Jr, Peter P
2007-06-28 19:20           ` Patrick McHardy
2007-06-28 19:32             ` Jeff Garzik
2007-06-28 19:37               ` Patrick McHardy
2007-06-28 21:11                 ` Waskiewicz Jr, Peter P
2007-06-28 21:18                   ` Patrick McHardy
2007-06-28 23:08                     ` Waskiewicz Jr, Peter P
2007-06-28 23:31                       ` David Miller
2007-06-28 20:39               ` David Miller
2007-06-29  3:39   ` David Miller
2007-06-29 10:54     ` Jeff Garzik
2007-06-21 21:26 [PATCH] NET: Multiple queue hardware support PJ Waskiewicz
2007-06-21 21:26 ` [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API PJ Waskiewicz
