* Re: [PATCH] NET: [UPDATED] Multiqueue network device support implementation.
2007-04-10 0:32 [PATCH] NET: [UPDATED] Multiqueue network device support implementation Peter P Waskiewicz Jr
@ 2007-04-10 0:28 ` Patrick McHardy
2007-04-10 1:40 ` Waskiewicz Jr, Peter P
2007-04-10 9:48 ` Patrick McHardy
1 sibling, 1 reply; 11+ messages in thread
From: Patrick McHardy @ 2007-04-10 0:28 UTC (permalink / raw)
To: Peter P Waskiewicz Jr
Cc: davem, netdev, linux-kernel, jgarzik, cramerj, auke-jan.h.kok,
christopher.leech
Peter P Waskiewicz Jr wrote:
> From: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
>
> Update: Fixed a typecast in free_netdev() for the egress_subqueue list.
>
> Added an API and associated supporting routines for multiqueue network devices.
> This allows network devices supporting multiple TX queues to configure each
> queue within the netdevice and manage each queue independently. Changes to the
> PRIO Qdisc also allow a user to map multiple flows to individual TX queues,
> taking advantage of each queue on the device.
This indeed looks a lot better than the first patch. I'm too tired to
fully review this now, but could you please post the corresponding e1000
patch? From a quick look I'm guessing that this patch changes the
behaviour of the prio qdisc from strict priority to whatever scheduling
mechanism e1000 uses for its queues when the multiqueue config option
is enabled, which might surprise people.
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH] NET: [UPDATED] Multiqueue network device support implementation.
@ 2007-04-10 0:32 Peter P Waskiewicz Jr
2007-04-10 0:28 ` Patrick McHardy
2007-04-10 9:48 ` Patrick McHardy
0 siblings, 2 replies; 11+ messages in thread
From: Peter P Waskiewicz Jr @ 2007-04-10 0:32 UTC (permalink / raw)
To: davem
Cc: netdev, linux-kernel, jgarzik, cramerj, auke-jan.h.kok,
christopher.leech
From: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Update: Fixed a typecast in free_netdev() for the egress_subqueue list.
Added an API and associated supporting routines for multiqueue network devices.
This allows network devices supporting multiple TX queues to configure each
queue within the netdevice and manage each queue independently. Changes to the
PRIO Qdisc also allow a user to map multiple flows to individual TX queues,
taking advantage of each queue on the device.
Signed-off-by: Peter P. Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com>
---
include/linux/etherdevice.h | 3 +-
include/linux/netdevice.h | 66 ++++++++++++++++++++++++++++++++++++++++++-
include/linux/skbuff.h | 2 +
net/core/dev.c | 25 +++++++++++++---
net/core/skbuff.c | 3 ++
net/ethernet/eth.c | 9 +++---
net/sched/sch_generic.c | 4 +--
net/sched/sch_prio.c | 55 +++++++++++++++++++++++++++++++-----
8 files changed, 146 insertions(+), 21 deletions(-)
diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h
index 745c988..446de39 100644
--- a/include/linux/etherdevice.h
+++ b/include/linux/etherdevice.h
@@ -39,7 +39,8 @@ extern void eth_header_cache_update(struct hh_cache *hh, struct net_device *dev
extern int eth_header_cache(struct neighbour *neigh,
struct hh_cache *hh);
-extern struct net_device *alloc_etherdev(int sizeof_priv);
+extern struct net_device *alloc_etherdev_mq(int sizeof_priv, int queue_count);
+#define alloc_etherdev(sizeof_priv) alloc_etherdev_mq(sizeof_priv, 1)
static inline void eth_copy_and_sum (struct sk_buff *dest,
const unsigned char *src,
int len, int base)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 71fc8ff..db06169 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -106,6 +106,14 @@ struct netpoll_info;
#define MAX_HEADER (LL_MAX_HEADER + 48)
#endif
+struct net_device_subqueue
+{
+ /* Give a control state for each queue. This struct may contain
+ * per-queue locks in the future.
+ */
+ unsigned long state;
+};
+
/*
* Network device statistics. Akin to the 2.0 ether stats but
* with byte counters.
@@ -324,6 +332,7 @@ struct net_device
#define NETIF_F_GSO 2048 /* Enable software GSO. */
#define NETIF_F_LLTX 4096 /* LockLess TX */
#define NETIF_F_INTERNAL_STATS 8192 /* Use stats structure in net_device */
+#define NETIF_F_MULTI_QUEUE 16384 /* Has multiple TX/RX queues */
/* Segmentation offload features */
#define NETIF_F_GSO_SHIFT 16
@@ -534,6 +543,14 @@ struct net_device
struct device dev;
/* space for optional statistics and wireless sysfs groups */
struct attribute_group *sysfs_groups[3];
+
+ /* To retrieve statistics per subqueue - FOR FUTURE USE */
+ struct net_device_stats* (*get_subqueue_stats)(struct net_device *dev,
+ int queue_index);
+
+ /* The TX queue control structures */
+ struct net_device_subqueue *egress_subqueue;
+ int egress_subqueue_count;
};
#define to_net_dev(d) container_of(d, struct net_device, dev)
@@ -675,6 +692,48 @@ static inline int netif_running(const struct net_device *dev)
return test_bit(__LINK_STATE_START, &dev->state);
}
+/*
+ * Routines to manage the subqueues on a device. We only need start,
+ * stop, and a check if it's stopped. All other device management is
+ * done at the overall netdevice level.
+ * Also test the device if we're multiqueue.
+ */
+static inline void netif_start_subqueue(struct net_device *dev, u16 queue_index)
+{
+ clear_bit(__LINK_STATE_XOFF, &dev->egress_subqueue[queue_index].state);
+}
+
+static inline void netif_stop_subqueue(struct net_device *dev, u16 queue_index)
+{
+#ifdef CONFIG_NETPOLL_TRAP
+ if (netpoll_trap())
+ return;
+#endif
+ set_bit(__LINK_STATE_XOFF, &dev->egress_subqueue[queue_index].state);
+}
+
+static inline int netif_subqueue_stopped(const struct net_device *dev,
+ u16 queue_index)
+{
+ return test_bit(__LINK_STATE_XOFF,
+ &dev->egress_subqueue[queue_index].state);
+}
+
+static inline void netif_wake_subqueue(struct net_device *dev, u16 queue_index)
+{
+#ifdef CONFIG_NETPOLL_TRAP
+ if (netpoll_trap())
+ return;
+#endif
+ if (test_and_clear_bit(__LINK_STATE_XOFF,
+ &dev->egress_subqueue[queue_index].state))
+ __netif_schedule(dev);
+}
+
+static inline int netif_is_multiqueue(const struct net_device *dev)
+{
+ return (!!(NETIF_F_MULTI_QUEUE & dev->features));
+}
/* Use this variant when it is known for sure that it
* is executing from interrupt context.
@@ -968,8 +1027,11 @@ static inline void netif_tx_disable(struct net_device *dev)
extern void ether_setup(struct net_device *dev);
/* Support for loadable net-drivers */
-extern struct net_device *alloc_netdev(int sizeof_priv, const char *name,
- void (*setup)(struct net_device *));
+extern struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
+ void (*setup)(struct net_device *),
+ int queue_count);
+#define alloc_netdev(sizeof_priv, name, setup) \
+ alloc_netdev_mq(sizeof_priv, name, setup, 1)
extern int register_netdev(struct net_device *dev);
extern void unregister_netdev(struct net_device *dev);
/* Functions used for multicast support */
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 08fa5c8..96fd263 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -212,6 +212,7 @@ typedef unsigned char *sk_buff_data_t;
* @pkt_type: Packet class
* @fclone: skbuff clone status
* @ip_summed: Driver fed us an IP checksum
+ * @queue_mapping: Queue mapping for multiqueue devices
* @priority: Packet queueing priority
* @users: User count - see {datagram,tcp}.c
* @protocol: Packet protocol from driver
@@ -264,6 +265,7 @@ struct sk_buff {
__wsum csum;
__u32 csum_offset;
};
+ __u16 queue_mapping;
__u32 priority;
__u8 local_df:1,
cloned:1,
diff --git a/net/core/dev.c b/net/core/dev.c
index 219a57f..c11c8fa 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3292,16 +3292,18 @@ static struct net_device_stats *maybe_internal_stats(struct net_device *dev)
}
/**
- * alloc_netdev - allocate network device
+ * alloc_netdev_mq - allocate network device
* @sizeof_priv: size of private data to allocate space for
* @name: device name format string
* @setup: callback to initialize device
+ * @queue_count: the number of subqueues to allocate
*
* Allocates a struct net_device with private data area for driver use
- * and performs basic initialization.
+ * and performs basic initialization. Also allocates subqueue structs
+ * for each queue on the device.
*/
-struct net_device *alloc_netdev(int sizeof_priv, const char *name,
- void (*setup)(struct net_device *))
+struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
+ void (*setup)(struct net_device *), int queue_count)
{
void *p;
struct net_device *dev;
@@ -3326,12 +3328,23 @@ struct net_device *alloc_netdev(int sizeof_priv, const char *name,
if (sizeof_priv)
dev->priv = netdev_priv(dev);
+ alloc_size = (sizeof(struct net_device_subqueue) * queue_count);
+
+ p = kzalloc(alloc_size, GFP_KERNEL);
+ if (!p) {
+ printk(KERN_ERR "alloc_netdev: Unable to allocate queues.\n");
+ return NULL;
+ }
+
+ dev->egress_subqueue = p;
+ dev->egress_subqueue_count = queue_count;
+
dev->get_stats = maybe_internal_stats;
setup(dev);
strcpy(dev->name, name);
return dev;
}
-EXPORT_SYMBOL(alloc_netdev);
+EXPORT_SYMBOL(alloc_netdev_mq);
/**
* free_netdev - free network device
@@ -3345,6 +3358,7 @@ void free_netdev(struct net_device *dev)
{
#ifdef CONFIG_SYSFS
/* Compatibility with error handling in drivers */
+ kfree(dev->egress_subqueue);
if (dev->reg_state == NETREG_UNINITIALIZED) {
kfree((char *)dev - dev->padded);
return;
@@ -3356,6 +3370,7 @@ void free_netdev(struct net_device *dev)
/* will free via device release */
put_device(&dev->dev);
#else
+ kfree(dev->egress_subqueue);
kfree((char *)dev - dev->padded);
#endif
}
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index e60864e..642aab9 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -474,6 +474,7 @@ struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
n->nohdr = 0;
C(pkt_type);
C(ip_summed);
+ C(queue_mapping);
C(priority);
#if defined(CONFIG_IP_VS) || defined(CONFIG_IP_VS_MODULE)
C(ipvs_property);
@@ -516,6 +517,7 @@ static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
#endif
new->sk = NULL;
new->dev = old->dev;
+ new->queue_mapping = old->queue_mapping;
new->priority = old->priority;
new->protocol = old->protocol;
new->dst = dst_clone(old->dst);
@@ -1974,6 +1976,7 @@ struct sk_buff *skb_segment(struct sk_buff *skb, int features)
tail = nskb;
nskb->dev = skb->dev;
+ nskb->queue_mapping = skb->queue_mapping;
nskb->priority = skb->priority;
nskb->protocol = skb->protocol;
nskb->dst = dst_clone(skb->dst);
diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
index 0ac2524..87a509c 100644
--- a/net/ethernet/eth.c
+++ b/net/ethernet/eth.c
@@ -316,9 +316,10 @@ void ether_setup(struct net_device *dev)
EXPORT_SYMBOL(ether_setup);
/**
- * alloc_etherdev - Allocates and sets up an Ethernet device
+ * alloc_etherdev_mq - Allocates and sets up an Ethernet device
* @sizeof_priv: Size of additional driver-private structure to be allocated
* for this Ethernet device
+ * @queue_count: The number of queues this device has.
*
* Fill in the fields of the device structure with Ethernet-generic
* values. Basically does everything except registering the device.
@@ -328,8 +329,8 @@ EXPORT_SYMBOL(ether_setup);
* this private data area.
*/
-struct net_device *alloc_etherdev(int sizeof_priv)
+struct net_device *alloc_etherdev_mq(int sizeof_priv, int queue_count)
{
- return alloc_netdev(sizeof_priv, "eth%d", ether_setup);
+ return alloc_netdev_mq(sizeof_priv, "eth%d", ether_setup, queue_count);
}
-EXPORT_SYMBOL(alloc_etherdev);
+EXPORT_SYMBOL(alloc_etherdev_mq);
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 52eb343..8f0063a 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -133,7 +133,8 @@ static inline int qdisc_restart(struct net_device *dev)
/* And release queue */
spin_unlock(&dev->queue_lock);
- if (!netif_queue_stopped(dev)) {
+ if (!netif_queue_stopped(dev) &&
+ !netif_subqueue_stopped(dev, skb->queue_mapping)) {
int ret;
ret = dev_hard_start_xmit(skb, dev);
@@ -149,7 +150,6 @@ static inline int qdisc_restart(struct net_device *dev)
goto collision;
}
}
-
/* NETDEV_TX_BUSY - we need to requeue */
/* Release the driver */
if (!nolock) {
diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c
index 5cfe60b..7365621 100644
--- a/net/sched/sch_prio.c
+++ b/net/sched/sch_prio.c
@@ -43,6 +43,7 @@ struct prio_sched_data
struct tcf_proto *filter_list;
u8 prio2band[TC_PRIO_MAX+1];
struct Qdisc *queues[TCQ_PRIO_BANDS];
+ u16 band2queue[TC_PRIO_MAX + 1];
};
@@ -63,20 +64,26 @@ prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
case TC_ACT_SHOT:
return NULL;
};
-
if (!q->filter_list ) {
#else
if (!q->filter_list || tc_classify(skb, q->filter_list, &res)) {
#endif
if (TC_H_MAJ(band))
band = 0;
+ skb->queue_mapping =
+ q->prio2band[q->band2queue[band&TC_PRIO_MAX]];
+
return q->queues[q->prio2band[band&TC_PRIO_MAX]];
}
band = res.classid;
}
band = TC_H_MIN(band) - 1;
- if (band > q->bands)
+ if (band > q->bands) {
+ skb->queue_mapping = q->prio2band[q->band2queue[0]];
return q->queues[q->prio2band[0]];
+ }
+
+ skb->queue_mapping = q->band2queue[band];
return q->queues[band];
}
@@ -144,11 +151,17 @@ prio_dequeue(struct Qdisc* sch)
struct Qdisc *qdisc;
for (prio = 0; prio < q->bands; prio++) {
- qdisc = q->queues[prio];
- skb = qdisc->dequeue(qdisc);
- if (skb) {
- sch->q.qlen--;
- return skb;
+ /* Check if the target subqueue is available before
+ * pulling an skb. This way we avoid excessive requeues
+ * for slower queues.
+ */
+ if (!netif_subqueue_stopped(sch->dev, q->band2queue[prio])) {
+ qdisc = q->queues[prio];
+ skb = qdisc->dequeue(qdisc);
+ if (skb) {
+ sch->q.qlen--;
+ return skb;
+ }
}
}
return NULL;
@@ -200,6 +213,10 @@ static int prio_tune(struct Qdisc *sch, struct rtattr *opt)
struct prio_sched_data *q = qdisc_priv(sch);
struct tc_prio_qopt *qopt = RTA_DATA(opt);
int i;
+ int queue;
+ int qmapoffset;
+ int offset;
+ int mod;
if (opt->rta_len < RTA_LENGTH(sizeof(*qopt)))
return -EINVAL;
@@ -242,6 +259,30 @@ static int prio_tune(struct Qdisc *sch, struct rtattr *opt)
}
}
}
+ /* setup queue to band mapping */
+ if (q->bands < sch->dev->egress_subqueue_count) {
+ qmapoffset = 1;
+ mod = 0;
+ } else {
+ mod = q->bands % sch->dev->egress_subqueue_count;
+ qmapoffset = q->bands / sch->dev->egress_subqueue_count +
+ ((mod) ? 1 : 0);
+ }
+
+ queue = 0;
+ offset = 0;
+ for (i = 0; i < q->bands; i++) {
+ q->band2queue[i] = queue;
+ if ( ((i + 1) - offset) == qmapoffset) {
+ queue++;
+ offset += qmapoffset;
+ if (mod)
+ mod--;
+ qmapoffset = q->bands /
+ sch->dev->egress_subqueue_count +
+ ((mod) ? 1 : 0);
+ }
+ }
return 0;
}
* RE: [PATCH] NET: [UPDATED] Multiqueue network device support implementation.
2007-04-10 0:28 ` Patrick McHardy
@ 2007-04-10 1:40 ` Waskiewicz Jr, Peter P
2007-04-10 9:04 ` Patrick McHardy
0 siblings, 1 reply; 11+ messages in thread
From: Waskiewicz Jr, Peter P @ 2007-04-10 1:40 UTC (permalink / raw)
To: Patrick McHardy
Cc: davem, netdev, linux-kernel, jgarzik, cramerj, Kok, Auke-jan H,
Leech, Christopher
> This indeed looks a lot better than the first patch. I'm too
> tired to fully review this now, but could you please post the
> corresponding e1000 patch? From a quick look I'm guessing
> that this patch changes the behaviour of the prio qdisc from
> strict priority to whatever scheduling mechanism e1000 uses
> for its queues when the multiqueue config option is enabled,
> which might surprise people.
>
Thanks Pat for the initial feedback. I can post a set of patches to
e1000 using the new API; I'll try to get them out asap (need to apply to
this kernel tree). However, the PRIO qdisc still uses the priority in
the bands for dequeueing priority, and will feed the queues on the NIC.
The e1000, and any other multiqueue NIC, will schedule Tx based on how
the PRIO qdisc feeds the queues. So the only priority here is the
dequeuing priority from the kernel. The e1000 will use the new API for
starting/stopping the individual queues based on the descriptors
available, much like it does today for the global queue.
-PJ Waskiewicz
* Re: [PATCH] NET: [UPDATED] Multiqueue network device support implementation.
2007-04-10 1:40 ` Waskiewicz Jr, Peter P
@ 2007-04-10 9:04 ` Patrick McHardy
2007-04-11 16:52 ` Waskiewicz Jr, Peter P
0 siblings, 1 reply; 11+ messages in thread
From: Patrick McHardy @ 2007-04-10 9:04 UTC (permalink / raw)
To: Waskiewicz Jr, Peter P
Cc: davem, netdev, linux-kernel, jgarzik, cramerj, Kok, Auke-jan H,
Leech, Christopher
Waskiewicz Jr, Peter P wrote:
> Thanks Pat for the initial feedback. I can post a set of patches to
> e1000 using the new API; I'll try to get them out asap (need to apply to
> this kernel tree).
Thanks.
> However, the PRIO qdisc still uses the priority in
> the bands for dequeueing priority, and will feed the queues on the NIC.
> The e1000, and any other multiqueue NIC, will schedule Tx based on how
> the PRIO qdisc feeds the queues. So the only priority here is the
> dequeuing priority from the kernel. The e1000 will use the new API for
> starting/stopping the individual queues based on the descriptors
> available, much like it does today for the global queue.
Packets will only be dequeued from a band if the associated subqueue
is active, which moves the decision from prio to the driver, no?
What policy does e1000 use for scheduling its internal queues?
* Re: [PATCH] NET: [UPDATED] Multiqueue network device support implementation.
2007-04-10 0:32 [PATCH] NET: [UPDATED] Multiqueue network device support implementation Peter P Waskiewicz Jr
2007-04-10 0:28 ` Patrick McHardy
@ 2007-04-10 9:48 ` Patrick McHardy
2007-04-10 16:27 ` Waskiewicz Jr, Peter P
1 sibling, 1 reply; 11+ messages in thread
From: Patrick McHardy @ 2007-04-10 9:48 UTC (permalink / raw)
To: Peter P Waskiewicz Jr
Cc: davem, netdev, linux-kernel, jgarzik, cramerj, auke-jan.h.kok,
christopher.leech
Peter P Waskiewicz Jr wrote:
> + /* To retrieve statistics per subqueue - FOR FUTURE USE */
> + struct net_device_stats* (*get_subqueue_stats)(struct net_device *dev,
> + int queue_index);
Please no future use stuff, just add it when you need it.
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 219a57f..c11c8fa 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3326,12 +3328,23 @@ struct net_device *alloc_netdev(int sizeof_priv, const char *name,
> if (sizeof_priv)
> dev->priv = netdev_priv(dev);
>
> + alloc_size = (sizeof(struct net_device_subqueue) * queue_count);
> +
> + p = kzalloc(alloc_size, GFP_KERNEL);
> + if (!p) {
> + printk(KERN_ERR "alloc_netdev: Unable to allocate queues.\n");
> + return NULL;
This leaks the device. You treat every single-queue device as having
a single subqueue. If it doesn't get too ugly it would be nice to avoid
this and only allocate the subqueue states for real multiqueue devices.
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -133,7 +133,8 @@ static inline int qdisc_restart(struct net_device *dev)
> /* And release queue */
> spin_unlock(&dev->queue_lock);
>
> - if (!netif_queue_stopped(dev)) {
> + if (!netif_queue_stopped(dev) &&
> + !netif_subqueue_stopped(dev, skb->queue_mapping)) {
> int ret;
>
> ret = dev_hard_start_xmit(skb, dev);
> @@ -149,7 +150,6 @@ static inline int qdisc_restart(struct net_device *dev)
> goto collision;
> }
> }
> -
Unrelated whitespace change.
> /* NETDEV_TX_BUSY - we need to requeue */
> /* Release the driver */
> if (!nolock) {
> diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c
> index 5cfe60b..7365621 100644
> --- a/net/sched/sch_prio.c
> +++ b/net/sched/sch_prio.c
> @@ -43,6 +43,7 @@ struct prio_sched_data
> struct tcf_proto *filter_list;
> u8 prio2band[TC_PRIO_MAX+1];
> struct Qdisc *queues[TCQ_PRIO_BANDS];
> + u16 band2queue[TC_PRIO_MAX + 1];
> };
>
>
> @@ -63,20 +64,26 @@ prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
> case TC_ACT_SHOT:
> return NULL;
> };
> -
Same here
> if (!q->filter_list ) {
> #else
> if (!q->filter_list || tc_classify(skb, q->filter_list, &res)) {
> #endif
> if (TC_H_MAJ(band))
> band = 0;
> + skb->queue_mapping =
> + q->prio2band[q->band2queue[band&TC_PRIO_MAX]];
> +
Does this needs to be cleared at some point again? TC actions might
redirect or mirror packets to other (multiqueue) devices.
> @@ -242,6 +259,30 @@ static int prio_tune(struct Qdisc *sch, struct rtattr *opt)
> }
> }
> }
> + /* setup queue to band mapping */
> + if (q->bands < sch->dev->egress_subqueue_count) {
> + qmapoffset = 1;
> + mod = 0;
> + } else {
> + mod = q->bands % sch->dev->egress_subqueue_count;
> + qmapoffset = q->bands / sch->dev->egress_subqueue_count +
> + ((mod) ? 1 : 0);
> + }
> +
> + queue = 0;
> + offset = 0;
> + for (i = 0; i < q->bands; i++) {
> + q->band2queue[i] = queue;
> + if ( ((i + 1) - offset) == qmapoffset) {
> + queue++;
> + offset += qmapoffset;
> + if (mod)
> + mod--;
> + qmapoffset = q->bands /
> + sch->dev->egress_subqueue_count +
> + ((mod) ? 1 : 0);
> + }
> + }
Besides being quite ugly, I don't think this does what you want.
For bands < queues we get band2queue[0] = 0, all others map to 1.
* RE: [PATCH] NET: [UPDATED] Multiqueue network device support implementation.
2007-04-10 9:48 ` Patrick McHardy
@ 2007-04-10 16:27 ` Waskiewicz Jr, Peter P
2007-04-11 5:47 ` Patrick McHardy
0 siblings, 1 reply; 11+ messages in thread
From: Waskiewicz Jr, Peter P @ 2007-04-10 16:27 UTC (permalink / raw)
To: Patrick McHardy
Cc: davem, netdev, linux-kernel, jgarzik, cramerj, Kok, Auke-jan H,
Leech, Christopher
> Peter P Waskiewicz Jr wrote:
> > + /* To retrieve statistics per subqueue - FOR FUTURE USE */
> > + struct net_device_stats* (*get_subqueue_stats)(struct
> net_device *dev,
> > + int
> queue_index);
>
>
> Please no future use stuff, just add it when you need it.
Gotcha. I'll remove this.
>
> > diff --git a/net/core/dev.c b/net/core/dev.c index 219a57f..c11c8fa
> > 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -3326,12 +3328,23 @@ struct net_device *alloc_netdev(int
> sizeof_priv, const char *name,
> > if (sizeof_priv)
> > dev->priv = netdev_priv(dev);
> >
> > + alloc_size = (sizeof(struct net_device_subqueue) * queue_count);
> > +
> > + p = kzalloc(alloc_size, GFP_KERNEL);
> > + if (!p) {
> > + printk(KERN_ERR "alloc_netdev: Unable to
> allocate queues.\n");
> > + return NULL;
>
>
> This leaks the device. You treat every single-queue device as
> having a single subqueue. If it doesn't get too ugly it would
> be nice to avoid this and only allocate the subqueue states
> for real multiqueue devices.
We went back and forth on this. The reason we allocate a queue in
every case, even on single-queue devices, was so the stack doesn't have
to branch between multiqueue and non-multiqueue devices. If we don't
have at least one queue on a device, then we can't call
netif_subqueue_stopped() in the hotpath without first checking whether
the device is multiqueue. The original patches I released had this
branching, and I was asked to not do that. I'd also like to see all
queue-related stuff be pulled from net_device and put into
net_device_subqueue at some point, even for single-queue devices.
Thoughts?
>
> > --- a/net/sched/sch_generic.c
> > +++ b/net/sched/sch_generic.c
> > @@ -133,7 +133,8 @@ static inline int qdisc_restart(struct
> net_device *dev)
> > /* And release queue */
> > spin_unlock(&dev->queue_lock);
> >
> > - if (!netif_queue_stopped(dev)) {
> > + if (!netif_queue_stopped(dev) &&
> > + !netif_subqueue_stopped(dev,
> skb->queue_mapping)) {
> > int ret;
> >
> > ret = dev_hard_start_xmit(skb,
> dev); @@ -149,7 +150,6 @@ static
> > inline int qdisc_restart(struct net_device *dev)
> > goto collision;
> > }
> > }
> > -
>
>
> Unrelated whitespace change.
I'll fix that.
>
> > /* NETDEV_TX_BUSY - we need to requeue */
> > /* Release the driver */
> > if (!nolock) {
> > diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c index
> > 5cfe60b..7365621 100644
> > --- a/net/sched/sch_prio.c
> > +++ b/net/sched/sch_prio.c
> > @@ -43,6 +43,7 @@ struct prio_sched_data
> > struct tcf_proto *filter_list;
> > u8 prio2band[TC_PRIO_MAX+1];
> > struct Qdisc *queues[TCQ_PRIO_BANDS];
> > + u16 band2queue[TC_PRIO_MAX + 1];
> > };
> >
> >
> > @@ -63,20 +64,26 @@ prio_classify(struct sk_buff *skb,
> struct Qdisc *sch, int *qerr)
> > case TC_ACT_SHOT:
> > return NULL;
> > };
> > -
>
> Same here
I'll fix that too.
>
> > if (!q->filter_list ) {
> > #else
> > if (!q->filter_list || tc_classify(skb,
> q->filter_list, &res)) {
> > #endif
> > if (TC_H_MAJ(band))
> > band = 0;
> > + skb->queue_mapping =
> > +
> q->prio2band[q->band2queue[band&TC_PRIO_MAX]];
> > +
>
>
> Does this needs to be cleared at some point again? TC actions
> might redirect or mirror packets to other (multiqueue) devices.
If an skb is redirected to another device, the skb should be filtered
through that device's qdisc, yes?
>
> > @@ -242,6 +259,30 @@ static int prio_tune(struct Qdisc
> *sch, struct rtattr *opt)
> > }
> > }
> > }
> > + /* setup queue to band mapping */
> > + if (q->bands < sch->dev->egress_subqueue_count) {
> > + qmapoffset = 1;
> > + mod = 0;
> > + } else {
> > + mod = q->bands % sch->dev->egress_subqueue_count;
> > + qmapoffset = q->bands /
> sch->dev->egress_subqueue_count +
> > + ((mod) ? 1 : 0);
> > + }
> > +
> > + queue = 0;
> > + offset = 0;
> > + for (i = 0; i < q->bands; i++) {
> > + q->band2queue[i] = queue;
> > + if ( ((i + 1) - offset) == qmapoffset) {
> > + queue++;
> > + offset += qmapoffset;
> > + if (mod)
> > + mod--;
> > + qmapoffset = q->bands /
> > + sch->dev->egress_subqueue_count +
> > + ((mod) ? 1 : 0);
> > + }
> > + }
>
>
> Besides being quite ugly, I don't think this does what you want.
> For bands < queues we get band2queue[0] = 0, all others map to 1.
>
I see what's wrong with this. If I remove the "mod = 0;", the algorithm
works as intended. We did go through a number of iterations of the best
way to map queues to bands, using a simple hash, a complex assignment,
etc., and this algorithm gives the best distribution for any combination
of queues / bands (minus that small bug). I'll fix this
and resubmit.
Thanks for the feedback,
-PJ Waskiewicz
* Re: [PATCH] NET: [UPDATED] Multiqueue network device support implementation.
2007-04-10 16:27 ` Waskiewicz Jr, Peter P
@ 2007-04-11 5:47 ` Patrick McHardy
2007-04-11 15:40 ` Waskiewicz Jr, Peter P
0 siblings, 1 reply; 11+ messages in thread
From: Patrick McHardy @ 2007-04-11 5:47 UTC (permalink / raw)
To: Waskiewicz Jr, Peter P
Cc: davem, netdev, linux-kernel, jgarzik, cramerj, Kok, Auke-jan H,
Leech, Christopher
Waskiewicz Jr, Peter P wrote:
>>This leaks the device. You treat every single-queue device as
>>having a single subqueue. If it doesn't get too ugly it would
>>be nice to avoid this and only allocate the subqueue states
>>for real multiqueue devices.
>
>
> We went back and forth on this. The reason we allocate a queue in every
> case, even on single-queue devices, was to make the stack not have
> branching for multiqueue and non-multiqueue devices. If we don't have
> at least one queue on a device, then we can't have
> netif_subqueue_stopped() in the hotpath unless we check if a device is
> multiqueue before. The original patches I released had this branching,
> and I was asked to not do that.
OK, thanks for the explanation.
>>>+ skb->queue_mapping =
>>>+ q->prio2band[q->band2queue[band&TC_PRIO_MAX]];
>>
>>
>>Does this needs to be cleared at some point again? TC actions
>>might redirect or mirror packets to other (multiqueue) devices.
>
>
> If an skb is redirected to another device, the skb should be filtered
> through that device's qdisc, yes?
Yes, but the device might not have a queue or use something different
than prio, so the value would stay the same. I think you need to clear
it before enqueueing a packet or alternatively when redirecting in the
mirred action.
* RE: [PATCH] NET: [UPDATED] Multiqueue network device support implementation.
2007-04-11 5:47 ` Patrick McHardy
@ 2007-04-11 15:40 ` Waskiewicz Jr, Peter P
0 siblings, 0 replies; 11+ messages in thread
From: Waskiewicz Jr, Peter P @ 2007-04-11 15:40 UTC (permalink / raw)
To: Patrick McHardy
Cc: davem, netdev, linux-kernel, jgarzik, cramerj, Kok, Auke-jan H,
Leech, Christopher
> >>>+ skb->queue_mapping =
> >>>+
> q->prio2band[q->band2queue[band&TC_PRIO_MAX]];
> >>
> >>
> >>Does this needs to be cleared at some point again? TC actions might
> >>redirect or mirror packets to other (multiqueue) devices.
> >
> >
> > If an skb is redirected to another device, the skb should
> be filtered
> > through that device's qdisc, yes?
>
>
> Yes, but the device might not have a queue or use something
> different than prio, so the value would stay the same. I
> think you need to clear it before enqueueing a packet or
> alternatively when redirecting in the mirred action.
>
I see what you're saying now - I'll set the queue_mapping to zero before
the enqueue occurs. Thanks for the feedback.
-PJ Waskiewicz
* RE: [PATCH] NET: [UPDATED] Multiqueue network device support implementation.
2007-04-10 9:04 ` Patrick McHardy
@ 2007-04-11 16:52 ` Waskiewicz Jr, Peter P
2007-04-11 17:03 ` Patrick McHardy
0 siblings, 1 reply; 11+ messages in thread
From: Waskiewicz Jr, Peter P @ 2007-04-11 16:52 UTC (permalink / raw)
To: Patrick McHardy
Cc: davem, netdev, linux-kernel, jgarzik, cramerj, Kok, Auke-jan H,
Leech, Christopher
> Thanks.
>
> > However, the PRIO qdisc still uses the priority in the bands for
> > dequeueing priority, and will feed the queues on the NIC.
> > The e1000, and any other multiqueue NIC, will schedule Tx
> based on how
> > the PRIO qdisc feeds the queues. So the only priority here is the
> > dequeuing priority from the kernel. The e1000 will use the new API
> > for starting/stopping the individual queues based on the
> descriptors
> > available, much like it does today for the global queue.
>
>
> Packets will only be dequeued from a band if the associated
> subqueue is active, which moves the decision from prio to the
> driver, no?
> What policy does e1000 use for scheduling its internal queues?
>
The e1000 is handed skbs from PRIO on whichever queue the PRIO flows
dictate (based on the band2queue mapping, tc filters, or TOS to
skb->priority filtering). Once an skb hits the e1000, the internal
policy is round-robin across the hardware queues to dequeue and transmit
on the wire. I agree the NIC does influence the decision of which band
an skb will be dequeued from, based on available descriptors, etc., but
that's one of the goals of multiqueue: don't allow a traffic flow in one
queue to stop or impact another, separate flow.
Once NICs start using hardware-based priority scheduling (wireless?), we
can use a round-robin type qdisc in the kernel to dequeue, and let the
hardware directly decide how important one flow is over another.
Hope this answers your question.
Cheers,
-PJ Waskiewicz
* Re: [PATCH] NET: [UPDATED] Multiqueue network device support implementation.
2007-04-11 16:52 ` Waskiewicz Jr, Peter P
@ 2007-04-11 17:03 ` Patrick McHardy
2007-04-12 5:24 ` Zhu Yi
0 siblings, 1 reply; 11+ messages in thread
From: Patrick McHardy @ 2007-04-11 17:03 UTC (permalink / raw)
To: Waskiewicz Jr, Peter P
Cc: davem, netdev, linux-kernel, jgarzik, cramerj, Kok, Auke-jan H,
Leech, Christopher
Waskiewicz Jr, Peter P wrote:
>>Packets will only be dequeued from a band if the associated
>>subqueue is active, which moves the decision from prio to the
>>driver, no?
>>What policy does e1000 use for scheduling its internal queues?
>>
>
>
> E1000 is handed the skb's from PRIO to whichever queue the PRIO flows
> dictate (based on band2queue mapping, tc filters, or TOS to
> skb->priority filtering). Once the skb hits the e1000, the internal
> policy is round-robin on the hardware queues to dequeue and transmit on
> the wire. I agree the NIC does influence the decision of which band an
> skb will be dequeued from based on available descriptors, etc., but
> that's one of the goals of multiqueue: don't allow another traffic flow
> in one queue stop or impact another separate flow.
Yes, I'm not saying it's wrong, but it will surprise users who put a
prio qdisc on a multiqueue device without knowing anything about the
multiqueue capabilities and suddenly don't have strict priority
anymore. If it changes the behaviour, it should be explicitly enabled
by the user in the prio qdisc configuration. This would also allow the
band2queue mapping policy to move to userspace.
> Once NICs start using hardware-based priority scheduling (wireless?), we
> can use a round-robin type qdisc in the kernel to dequeue, and let the
> hardware directly decide how important one flow is over another.
You bring up a good point, it would be good to hear the opinion from
one of the wireless people on this since they have their own multiqueue
scheduler in the wireless-dev tree.
* Re: [PATCH] NET: [UPDATED] Multiqueue network device support implementation.
2007-04-11 17:03 ` Patrick McHardy
@ 2007-04-12 5:24 ` Zhu Yi
0 siblings, 0 replies; 11+ messages in thread
From: Zhu Yi @ 2007-04-12 5:24 UTC (permalink / raw)
To: Patrick McHardy
Cc: Waskiewicz Jr, Peter P, davem, netdev, linux-kernel, jgarzik,
cramerj, Kok, Auke-jan H, Leech, Christopher
On Wed, 2007-04-11 at 19:03 +0200, Patrick McHardy wrote:
>
> You bring up a good point, it would be good to hear the opinion from
> one of the wireless people on this since they have their own
> multiqueue scheduler in the wireless-dev tree.
The one in wireless-dev is pretty much like this one. It existed only
because no such multiqueue-aware qdisc was available at the time.
The requirement for wireless is the same as strict PRIO, with the
addition that the dequeued skb's corresponding NIC hardware queue must
be active (this is also true for other devices, I think; otherwise the
skb has to be requeued, which eventually leads to a busy loop or an
endless loop). In other words, the dequeue method should select the skb
with the highest priority from all the ACTIVE hardware queues (not all
queues). The wireless hardware then schedules the packets from its 4
hardware TX queues based on priority and the network environment.
Thanks,
-yi
Thread overview: 11+ messages
2007-04-10 0:32 [PATCH] NET: [UPDATED] Multiqueue network device support implementation Peter P Waskiewicz Jr
2007-04-10 0:28 ` Patrick McHardy
2007-04-10 1:40 ` Waskiewicz Jr, Peter P
2007-04-10 9:04 ` Patrick McHardy
2007-04-11 16:52 ` Waskiewicz Jr, Peter P
2007-04-11 17:03 ` Patrick McHardy
2007-04-12 5:24 ` Zhu Yi
2007-04-10 9:48 ` Patrick McHardy
2007-04-10 16:27 ` Waskiewicz Jr, Peter P
2007-04-11 5:47 ` Patrick McHardy
2007-04-11 15:40 ` Waskiewicz Jr, Peter P