* [RFC PATCH 1/4] net: implement mechanism for HW based QOS
@ 2010-12-09 19:59 John Fastabend
2010-12-09 20:00 ` [RFC PATCH 2/4] net/sched: Allow multiple mq qdisc to be used as non-root John Fastabend
` (3 more replies)
0 siblings, 4 replies; 6+ messages in thread
From: John Fastabend @ 2010-12-09 19:59 UTC (permalink / raw)
To: davem; +Cc: netdev, hadi, shemminger, tgraf, eric.dumazet, john.r.fastabend
This patch provides a mechanism for lower layer devices to
steer traffic using skb->priority to tx queues. This allows
hardware based QOS schemes to use the default qdisc without
incurring the penalties related to global state and the qdisc
lock, while reliably receiving skbs on the correct tx ring
to avoid head-of-line blocking resulting from shuffling in
the LLD. Finally, all the goodness from txq caching and xps/rps
can still be leveraged.
Many drivers and hardware exist with the ability to implement
QOS schemes in the hardware, but currently these drivers tend
to rely on firmware to reroute specific traffic, a
driver-specific select_queue, or the queue_mapping action in
the qdisc.
By using select_queue for this, drivers need to be updated for
each and every traffic type and we lose the goodness of much
of the upstream work. Firmware solutions are inherently
inflexible. And finally, if admins are expected to build a
qdisc and filter rules to steer traffic, this requires knowledge
of how the hardware is currently configured. The number of tx
queues and the queue offsets may change depending on resources.
Also, this approach incurs all the overhead of a qdisc with filters.
With the mechanism in this patch users can set skb priority using
expected methods, i.e. setsockopt(), or the stack can set the
priority directly. The skb will then be steered to the tx queues
aligned with the hardware QOS traffic classes. In the normal case,
with a single traffic class and all queues in this class, everything
works as is until the LLD enables multiple tcs.
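For example, an application can already tag its traffic with the
standard SO_PRIORITY socket option; a minimal userspace sketch
(illustrative only, not part of this patch) could look like:

#include <sys/socket.h>

/* Mark all traffic from this socket as priority 4; with this patch
 * the stack maps that priority to a traffic class and hashes only
 * across the tx queues assigned to that class.
 */
static int set_prio(int sock)
{
	int prio = 4;

	return setsockopt(sock, SOL_SOCKET, SO_PRIORITY,
			  &prio, sizeof(prio));
}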
To steer the skb we mask the priority down to its lower 4 bits
and allow the hardware to configure up to 15 distinct classes
of traffic. This is expected to be sufficient for most applications;
at any rate it is more than the 802.1Q spec designates and is
equal to the number of prio bands currently implemented in
the default qdisc.
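Concretely, the queue selection in skb_tx_hash() becomes roughly the
following (an illustrative sketch written against the helpers this
patch adds; the real change is in the diff below):

/* Illustrative only: resolve skb->priority to a tx queue using the
 * per-tc offset/count, as skb_tx_hash() does with this patch applied.
 * Assumes <linux/netdevice.h> with the new helpers.
 */
static u16 example_pick_txq(const struct net_device *dev,
			    const struct sk_buff *skb, u32 hash)
{
	u16 qoffset = 0;
	u16 qcount = dev->real_num_tx_queues;

	if (dev->num_tc) {
		u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
		struct netdev_tc_txq *tcp = netdev_get_tc_queue(dev, tc);

		qoffset = tcp->offset;	/* first queue of the class */
		qcount = tcp->count;	/* number of queues in the class */
	}

	/* scale the 32-bit hash into [qoffset, qoffset + qcount) */
	return (u16) (((u64) hash * qcount) >> 32) + qoffset;
}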
This, in conjunction with a userspace application such as
lldpad, can be used to implement the 802.1Q transmission selection
algorithms, one of these algorithms being the extended transmission
selection algorithm currently being used for DCB.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
include/linux/netdevice.h | 65 +++++++++++++++++++++++++++++++++++++++++++++
net/core/dev.c | 39 ++++++++++++++++++++++++++-
2 files changed, 103 insertions(+), 1 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index a9ac5dc..c0d4fb1 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -646,6 +646,12 @@ struct xps_dev_maps {
(nr_cpu_ids * sizeof(struct xps_map *)))
#endif /* CONFIG_XPS */
+/* HW offloaded queuing disciplines txq count and offset maps */
+struct netdev_tc_txq {
+ u16 count;
+ u16 offset;
+};
+
/*
* This structure defines the management hooks for network devices.
* The following hooks can be defined; unless noted otherwise, they are
@@ -1146,6 +1152,10 @@ struct net_device {
/* Data Center Bridging netlink ops */
const struct dcbnl_rtnl_ops *dcbnl_ops;
#endif
+ u8 max_tc;
+ u8 num_tc;
+ struct netdev_tc_txq *_tc_to_txq;
+ u8 prio_tc_map[16];
#if defined(CONFIG_FCOE) || defined(CONFIG_FCOE_MODULE)
/* max exchange id for FCoE LRO by ddp */
@@ -1162,6 +1172,58 @@ struct net_device {
#define NETDEV_ALIGN 32
static inline
+int netdev_get_prio_tc_map(const struct net_device *dev, u32 prio)
+{
+ return dev->prio_tc_map[prio & 15];
+}
+
+static inline
+int netdev_set_prio_tc_map(struct net_device *dev, u8 prio, u8 tc)
+{
+ if (tc >= dev->num_tc)
+ return -EINVAL;
+
+ dev->prio_tc_map[prio & 15] = tc & 15;
+ return 0;
+}
+
+static inline
+int netdev_set_tc_queue(struct net_device *dev, u8 tc, u16 count, u16 offset)
+{
+ struct netdev_tc_txq *tcp;
+
+ if (tc >= dev->num_tc)
+ return -EINVAL;
+
+ tcp = &dev->_tc_to_txq[tc];
+ tcp->count = count;
+ tcp->offset = offset;
+ return 0;
+}
+
+static inline
+struct netdev_tc_txq *netdev_get_tc_queue(const struct net_device *dev, u8 tc)
+{
+ return &dev->_tc_to_txq[tc];
+}
+
+static inline
+int netdev_set_num_tc(struct net_device *dev, u8 num_tc)
+{
+ if (num_tc > dev->max_tc)
+ return -EINVAL;
+
+ dev->num_tc = num_tc;
+ return 0;
+}
+
+static inline
+u8 netdev_get_num_tc(const struct net_device *dev)
+{
+ return dev->num_tc;
+}
+
+static inline
struct netdev_queue *netdev_get_tx_queue(const struct net_device *dev,
unsigned int index)
{
@@ -1386,6 +1448,9 @@ static inline void unregister_netdevice(struct net_device *dev)
unregister_netdevice_queue(dev, NULL);
}
+extern int netdev_alloc_max_tc(struct net_device *dev, u8 tc);
+extern void netdev_free_tc(struct net_device *dev);
+
extern int netdev_refcnt_read(const struct net_device *dev);
extern void free_netdev(struct net_device *dev);
extern void synchronize_net(void);
diff --git a/net/core/dev.c b/net/core/dev.c
index 55ff66f..cc00e66 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2118,6 +2118,8 @@ static u32 hashrnd __read_mostly;
u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb)
{
u32 hash;
+ u16 qoffset = 0;
+ u16 qcount = dev->real_num_tx_queues;
if (skb_rx_queue_recorded(skb)) {
hash = skb_get_rx_queue(skb);
@@ -2126,13 +2128,20 @@ u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb)
return hash;
}
+ if (dev->num_tc) {
+ u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
+ struct netdev_tc_txq *tcp = netdev_get_tc_queue(dev, tc);
+ qoffset = tcp->offset;
+ qcount = tcp->count;
+ }
+
if (skb->sk && skb->sk->sk_hash)
hash = skb->sk->sk_hash;
else
hash = (__force u16) skb->protocol ^ skb->rxhash;
hash = jhash_1word(hash, hashrnd);
- return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32);
+ return (u16) ((((u64) hash * qcount)) >> 32) + qoffset;
}
EXPORT_SYMBOL(skb_tx_hash);
@@ -5091,6 +5100,33 @@ void netif_stacked_transfer_operstate(const struct net_device *rootdev,
}
EXPORT_SYMBOL(netif_stacked_transfer_operstate);
+int netdev_alloc_max_tc(struct net_device *dev, u8 tcs)
+{
+ struct netdev_tc_txq *tcp;
+
+ if (tcs > 16)
+ return -EINVAL;
+
+ tcp = kcalloc(tcs, sizeof(*tcp), GFP_KERNEL);
+ if (!tcp)
+ return -ENOMEM;
+
+ dev->_tc_to_txq = tcp;
+ dev->max_tc = tcs;
+ return 0;
+}
+EXPORT_SYMBOL(netdev_alloc_max_tc);
+
+void netdev_free_tc(struct net_device *dev)
+{
+ dev->max_tc = 0;
+ dev->num_tc = 0;
+ memset(dev->prio_tc_map, 0, sizeof(dev->prio_tc_map));
+ kfree(dev->_tc_to_txq);
+ dev->_tc_to_txq = NULL;
+}
+EXPORT_SYMBOL(netdev_free_tc);
+
#ifdef CONFIG_RPS
static int netif_alloc_rx_queues(struct net_device *dev)
{
@@ -5699,6 +5735,7 @@ void free_netdev(struct net_device *dev)
#ifdef CONFIG_RPS
kfree(dev->_rx);
#endif
+ netdev_free_tc(dev);
kfree(rcu_dereference_raw(dev->ingress_queue));
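For reference, a lower layer driver enabling, say, two hardware
traffic classes over eight tx queues would use the new helpers
roughly as follows (a hypothetical sketch, not part of the patch;
the queue counts and priority split are made up):

/* Hypothetical LLD setup with this patch applied: 2 traffic classes
 * over 8 tx queues, priorities 0-3 -> tc 0 (queues 0-3), priorities
 * 4-15 -> tc 1 (queues 4-7).
 */
static int example_enable_two_tcs(struct net_device *dev)
{
	int i, err;

	err = netdev_alloc_max_tc(dev, 2);
	if (err)
		return err;

	err = netdev_set_num_tc(dev, 2);
	if (err)
		goto out_free;

	netdev_set_tc_queue(dev, 0, 4, 0);	/* tc 0: queues 0..3 */
	netdev_set_tc_queue(dev, 1, 4, 4);	/* tc 1: queues 4..7 */

	for (i = 0; i < 16; i++)
		netdev_set_prio_tc_map(dev, i, i < 4 ? 0 : 1);

	return 0;

out_free:
	netdev_free_tc(dev);
	return err;
}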
* [RFC PATCH 2/4] net/sched: Allow multiple mq qdisc to be used as non-root
2010-12-09 19:59 [RFC PATCH 1/4] net: implement mechanism for HW based QOS John Fastabend
@ 2010-12-09 20:00 ` John Fastabend
2010-12-09 20:00 ` [RFC PATCH 3/4] net/sched: implement a root container qdisc sch_mclass John Fastabend
` (2 subsequent siblings)
3 siblings, 0 replies; 6+ messages in thread
From: John Fastabend @ 2010-12-09 20:00 UTC (permalink / raw)
To: davem; +Cc: netdev, hadi, shemminger, tgraf, eric.dumazet, john.r.fastabend
This patch modifies the mq qdisc to allow multiple mq qdiscs
to be used, allowing TX queues to be grouped for management.
This lets a root container qdisc create multiple traffic
classes and use the mq qdisc as the default queueing discipline
for each. It is expected that other queueing disciplines can then
be grafted onto the container as needed.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
net/sched/sch_mq.c | 73 +++++++++++++++++++++++++++++++++++++++++-----------
1 files changed, 57 insertions(+), 16 deletions(-)
diff --git a/net/sched/sch_mq.c b/net/sched/sch_mq.c
index ecc302f..deac04c 100644
--- a/net/sched/sch_mq.c
+++ b/net/sched/sch_mq.c
@@ -19,17 +19,42 @@
struct mq_sched {
struct Qdisc **qdiscs;
+ u8 num_tc;
};
+static void mq_queues(struct net_device *dev, struct Qdisc *sch,
+ unsigned int *count, unsigned int *offset)
+{
+ struct mq_sched *priv = qdisc_priv(sch);
+ if (priv->num_tc) {
+ struct netdev_tc_txq *tc;
+ int queue = TC_H_MIN(sch->parent) - 1;
+
+ tc = netdev_get_tc_queue(dev, queue);
+ if (count)
+ *count = tc->count;
+ if (offset)
+ *offset = tc->offset;
+ } else {
+ if (count)
+ *count = dev->num_tx_queues;
+ if (offset)
+ *offset = 0;
+ }
+}
+
static void mq_destroy(struct Qdisc *sch)
{
struct net_device *dev = qdisc_dev(sch);
struct mq_sched *priv = qdisc_priv(sch);
- unsigned int ntx;
+ unsigned int ntx, count;
if (!priv->qdiscs)
return;
- for (ntx = 0; ntx < dev->num_tx_queues && priv->qdiscs[ntx]; ntx++)
+
+ mq_queues(dev, sch, &count, NULL);
+
+ for (ntx = 0; ntx < count && priv->qdiscs[ntx]; ntx++)
qdisc_destroy(priv->qdiscs[ntx]);
kfree(priv->qdiscs);
}
@@ -41,21 +66,26 @@ static int mq_init(struct Qdisc *sch, struct nlattr *opt)
struct netdev_queue *dev_queue;
struct Qdisc *qdisc;
unsigned int ntx;
+ unsigned int count, offset;
- if (sch->parent != TC_H_ROOT)
+ if (sch->parent != TC_H_ROOT && !dev->num_tc)
return -EOPNOTSUPP;
if (!netif_is_multiqueue(dev))
return -EOPNOTSUPP;
+ /* Record num tc's in priv so we can tear down cleanly */
+ priv->num_tc = dev->num_tc;
+ mq_queues(dev, sch, &count, &offset);
+
/* pre-allocate qdiscs, attachment can't fail */
- priv->qdiscs = kcalloc(dev->num_tx_queues, sizeof(priv->qdiscs[0]),
+ priv->qdiscs = kcalloc(count, sizeof(priv->qdiscs[0]),
GFP_KERNEL);
if (priv->qdiscs == NULL)
return -ENOMEM;
- for (ntx = 0; ntx < dev->num_tx_queues; ntx++) {
- dev_queue = netdev_get_tx_queue(dev, ntx);
+ for (ntx = 0; ntx < count; ntx++) {
+ dev_queue = netdev_get_tx_queue(dev, ntx + offset);
qdisc = qdisc_create_dflt(dev_queue, &pfifo_fast_ops,
TC_H_MAKE(TC_H_MAJ(sch->handle),
TC_H_MIN(ntx + 1)));
@@ -65,7 +95,8 @@ static int mq_init(struct Qdisc *sch, struct nlattr *opt)
priv->qdiscs[ntx] = qdisc;
}
- sch->flags |= TCQ_F_MQROOT;
+ if (!priv->num_tc)
+ sch->flags |= TCQ_F_MQROOT;
return 0;
err:
@@ -78,9 +109,11 @@ static void mq_attach(struct Qdisc *sch)
struct net_device *dev = qdisc_dev(sch);
struct mq_sched *priv = qdisc_priv(sch);
struct Qdisc *qdisc;
- unsigned int ntx;
+ unsigned int ntx, count;
- for (ntx = 0; ntx < dev->num_tx_queues; ntx++) {
+ mq_queues(dev, sch, &count, NULL);
+
+ for (ntx = 0; ntx < count; ntx++) {
qdisc = priv->qdiscs[ntx];
qdisc = dev_graft_qdisc(qdisc->dev_queue, qdisc);
if (qdisc)
@@ -94,14 +127,17 @@ static int mq_dump(struct Qdisc *sch, struct sk_buff *skb)
{
struct net_device *dev = qdisc_dev(sch);
struct Qdisc *qdisc;
- unsigned int ntx;
+ unsigned int ntx, count, offset;
+
+ mq_queues(dev, sch, &count, &offset);
sch->q.qlen = 0;
memset(&sch->bstats, 0, sizeof(sch->bstats));
memset(&sch->qstats, 0, sizeof(sch->qstats));
- for (ntx = 0; ntx < dev->num_tx_queues; ntx++) {
- qdisc = netdev_get_tx_queue(dev, ntx)->qdisc_sleeping;
+ for (ntx = 0; ntx < count; ntx++) {
+ int txq = ntx + offset;
+ qdisc = netdev_get_tx_queue(dev, txq)->qdisc_sleeping;
spin_lock_bh(qdisc_lock(qdisc));
sch->q.qlen += qdisc->q.qlen;
sch->bstats.bytes += qdisc->bstats.bytes;
@@ -120,10 +156,13 @@ static struct netdev_queue *mq_queue_get(struct Qdisc *sch, unsigned long cl)
{
struct net_device *dev = qdisc_dev(sch);
unsigned long ntx = cl - 1;
+ unsigned int count, offset;
+
+ mq_queues(dev, sch, &count, &offset);
- if (ntx >= dev->num_tx_queues)
+ if (ntx >= count)
return NULL;
- return netdev_get_tx_queue(dev, ntx);
+ return netdev_get_tx_queue(dev, offset + ntx);
}
static struct netdev_queue *mq_select_queue(struct Qdisc *sch,
@@ -203,13 +242,15 @@ static int mq_dump_class_stats(struct Qdisc *sch, unsigned long cl,
static void mq_walk(struct Qdisc *sch, struct qdisc_walker *arg)
{
struct net_device *dev = qdisc_dev(sch);
- unsigned int ntx;
+ unsigned int ntx, count;
+
+ mq_queues(dev, sch, &count, NULL);
if (arg->stop)
return;
arg->count = arg->skip;
- for (ntx = arg->skip; ntx < dev->num_tx_queues; ntx++) {
+ for (ntx = arg->skip; ntx < count; ntx++) {
if (arg->fn(sch, ntx + 1, arg) < 0) {
arg->stop = 1;
break;
* [RFC PATCH 3/4] net/sched: implement a root container qdisc sch_mclass
2010-12-09 19:59 [RFC PATCH 1/4] net: implement mechanism for HW based QOS John Fastabend
2010-12-09 20:00 ` [RFC PATCH 2/4] net/sched: Allow multiple mq qdisc to be used as non-root John Fastabend
@ 2010-12-09 20:00 ` John Fastabend
2010-12-09 20:00 ` [RFC PATCH 4/4] ixgbe: add multiple txqs per tc John Fastabend
2010-12-09 20:46 ` [RFC PATCH 1/4] net: implement mechanism for HW based QOS Eric Dumazet
3 siblings, 0 replies; 6+ messages in thread
From: John Fastabend @ 2010-12-09 20:00 UTC (permalink / raw)
To: davem; +Cc: netdev, hadi, shemminger, tgraf, eric.dumazet, john.r.fastabend
This implements an mclass 'multi-class' queueing discipline that by
default creates multiple mq qdiscs, one for each traffic class. Each
mq qdisc then owns a range of queues per the netdev_tc_txq mappings.
Using the mclass qdisc, the number of tcs currently in use along
with the range of queues allotted to each class can be configured. By
default skbs are mapped to traffic classes using the skb priority.
This mapping is configurable.
To support HW QOS schemes on inflexible HW that requires fixed
mappings between queues and classes, a net device op, ndo_setup_tc,
is used. The HW setup may be overridden.
Finally, qdiscs grafted onto the mclass qdisc must be mq-like, in that
they must map queueing disciplines onto the netdev_queue structures.
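As an illustration, a userspace tool configuring two traffic classes
over eight queues would fill the new tc_mclass_qopt (defined in the
diff below) roughly like this before passing it as the TCA_OPTIONS
attribute of the usual RTM_NEWQDISC request (hypothetical values):

/* Hypothetical mclass configuration: 2 tcs over 8 queues,
 * software-only layout (hw == 0). Requires the pkt_sched.h change
 * from this patch.
 */
#include <string.h>
#include <linux/pkt_sched.h>

static void example_fill_qopt(struct tc_mclass_qopt *qopt)
{
	int i;

	memset(qopt, 0, sizeof(*qopt));
	qopt->num_tc = 2;
	qopt->hw = 0;		/* 1 would defer the layout to ndo_setup_tc */

	qopt->count[0] = 4;	/* tc 0: queues 0..3 */
	qopt->offset[0] = 0;
	qopt->count[1] = 4;	/* tc 1: queues 4..7 */
	qopt->offset[1] = 4;

	for (i = 0; i < 16; i++)	/* prio 0-3 -> tc 0, rest -> tc 1 */
		qopt->prio_tc_map[i] = i < 4 ? 0 : 1;
}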
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
include/linux/netdevice.h | 3
include/linux/pkt_sched.h | 9 +
include/net/sch_generic.h | 1
net/sched/Makefile | 2
net/sched/sch_api.c | 1
net/sched/sch_generic.c | 8 +
net/sched/sch_mclass.c | 332 +++++++++++++++++++++++++++++++++++++++++++++
7 files changed, 354 insertions(+), 2 deletions(-)
create mode 100644 net/sched/sch_mclass.c
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index c0d4fb1..ac265bb 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -762,6 +762,8 @@ struct netdev_tc_txq {
* int (*ndo_set_vf_port)(struct net_device *dev, int vf,
* struct nlattr *port[]);
* int (*ndo_get_vf_port)(struct net_device *dev, int vf, struct sk_buff *skb);
+ *
+ * int (*ndo_setup_tc)(struct net_device *dev, int tc);
*/
#define HAVE_NET_DEVICE_OPS
struct net_device_ops {
@@ -820,6 +822,7 @@ struct net_device_ops {
struct nlattr *port[]);
int (*ndo_get_vf_port)(struct net_device *dev,
int vf, struct sk_buff *skb);
+ int (*ndo_setup_tc)(struct net_device *dev, u8 tc);
#if defined(CONFIG_FCOE) || defined(CONFIG_FCOE_MODULE)
int (*ndo_fcoe_enable)(struct net_device *dev);
int (*ndo_fcoe_disable)(struct net_device *dev);
diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index 2cfa4bc..0134ed4 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -481,4 +481,13 @@ struct tc_drr_stats {
__u32 deficit;
};
+/* MCLASS */
+struct tc_mclass_qopt {
+ __u8 num_tc;
+ __u8 prio_tc_map[16];
+ __u8 hw;
+ __u16 count[16];
+ __u16 offset[16];
+};
+
#endif
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index ea1f8a8..2bbcd09 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -276,6 +276,7 @@ extern struct Qdisc noop_qdisc;
extern struct Qdisc_ops noop_qdisc_ops;
extern struct Qdisc_ops pfifo_fast_ops;
extern struct Qdisc_ops mq_qdisc_ops;
+extern struct Qdisc_ops mclass_qdisc_ops;
struct Qdisc_class_common {
u32 classid;
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 960f5db..76dcf5b 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -2,7 +2,7 @@
# Makefile for the Linux Traffic Control Unit.
#
-obj-y := sch_generic.o sch_mq.o
+obj-y := sch_generic.o sch_mq.o sch_mclass.o
obj-$(CONFIG_NET_SCHED) += sch_api.o sch_blackhole.o
obj-$(CONFIG_NET_CLS) += cls_api.o
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index b22ca2d..24f40e0 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1770,6 +1770,7 @@ static int __init pktsched_init(void)
register_qdisc(&bfifo_qdisc_ops);
register_qdisc(&pfifo_head_drop_qdisc_ops);
register_qdisc(&mq_qdisc_ops);
+ register_qdisc(&mclass_qdisc_ops);
rtnl_register(PF_UNSPEC, RTM_NEWQDISC, tc_modify_qdisc, NULL);
rtnl_register(PF_UNSPEC, RTM_DELQDISC, tc_get_qdisc, NULL);
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 0918834..73ed9b7 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -709,7 +709,13 @@ static void attach_default_qdiscs(struct net_device *dev)
dev->qdisc = txq->qdisc_sleeping;
atomic_inc(&dev->qdisc->refcnt);
} else {
- qdisc = qdisc_create_dflt(txq, &mq_qdisc_ops, TC_H_ROOT);
+ if (dev->num_tc)
+ qdisc = qdisc_create_dflt(txq, &mclass_qdisc_ops,
+ TC_H_ROOT);
+ else
+ qdisc = qdisc_create_dflt(txq, &mq_qdisc_ops,
+ TC_H_ROOT);
+
if (qdisc) {
qdisc->ops->attach(qdisc);
dev->qdisc = qdisc;
diff --git a/net/sched/sch_mclass.c b/net/sched/sch_mclass.c
new file mode 100644
index 0000000..1ff156c
--- /dev/null
+++ b/net/sched/sch_mclass.c
@@ -0,0 +1,332 @@
+/*
+ * net/sched/sch_mclass.c
+ *
+ * Copyright (c) 2010 John Fastabend <john.r.fastabend@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * version 2 as published by the Free Software Foundation.
+ */
+
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/skbuff.h>
+#include <net/netlink.h>
+#include <net/pkt_sched.h>
+#include <net/sch_generic.h>
+
+struct mclass_sched {
+ struct Qdisc **qdiscs;
+ int hw_owned;
+};
+
+static void mclass_destroy(struct Qdisc *sch)
+{
+ struct net_device *dev = qdisc_dev(sch);
+ struct mclass_sched *priv = qdisc_priv(sch);
+ unsigned int ntc;
+
+ if (!priv->qdiscs)
+ return;
+
+ for (ntc = 0; ntc < dev->num_tc && priv->qdiscs[ntc]; ntc++)
+ qdisc_destroy(priv->qdiscs[ntc]);
+
+ if (priv->hw_owned && dev->netdev_ops->ndo_setup_tc)
+ dev->netdev_ops->ndo_setup_tc(dev, 0);
+ else
+ netdev_set_num_tc(dev, 0);
+
+ kfree(priv->qdiscs);
+}
+
+static int mclass_parse_opt(struct net_device *dev, struct tc_mclass_qopt *qopt)
+{
+ int i, j;
+
+ /* Verify TC offset and count are sane */
+ for (i = 0; i < qopt->num_tc; i++) {
+ int last = qopt->offset[i] + qopt->count[i];
+ if (last > dev->num_tx_queues)
+ return -EINVAL;
+ for (j = i + 1; j < qopt->num_tc; j++) {
+ if (last > qopt->offset[j])
+ return -EINVAL;
+ }
+ }
+
+ return 0;
+}
+
+static int mclass_init(struct Qdisc *sch, struct nlattr *opt)
+{
+ struct net_device *dev = qdisc_dev(sch);
+ struct mclass_sched *priv = qdisc_priv(sch);
+ struct netdev_queue *dev_queue;
+ struct Qdisc *qdisc;
+ int i, err = -EOPNOTSUPP;
+ struct tc_mclass_qopt *qopt = NULL;
+
+ if (sch->parent != TC_H_ROOT)
+ return -EOPNOTSUPP;
+
+ if (!netif_is_multiqueue(dev))
+ return -EOPNOTSUPP;
+
+ if (nla_len(opt) < sizeof(*qopt))
+ return -EINVAL;
+ qopt = nla_data(opt);
+
+ /* Workaround inflexible hardware where drivers may want to align
+ * TX queues and traffic class support to provide HW offloaded
+ * QOS.
+ */
+ if (qopt->hw && dev->netdev_ops->ndo_setup_tc) {
+ priv->hw_owned = 1;
+ if (dev->netdev_ops->ndo_setup_tc(dev, qopt->num_tc))
+ return -EINVAL;
+ } else {
+ if (mclass_parse_opt(dev, qopt))
+ return -EINVAL;
+
+ if (!dev->max_tc && netdev_alloc_max_tc(dev, 16))
+ return -ENOMEM;
+
+ if (netdev_set_num_tc(dev, qopt->num_tc))
+ return -ENOMEM;
+
+ for (i = 0; i < qopt->num_tc; i++)
+ netdev_set_tc_queue(dev, i,
+ qopt->count[i], qopt->offset[i]);
+ }
+
+ for (i = 0; i < 16; i++) {
+ if (netdev_set_prio_tc_map(dev, i, qopt->prio_tc_map[i])) {
+ err = -EINVAL;
+ goto tc_err;
+ }
+ }
+
+ /* pre-allocate qdiscs, attachment can't fail */
+ priv->qdiscs = kcalloc(qopt->num_tc,
+ sizeof(priv->qdiscs[0]), GFP_KERNEL);
+ if (priv->qdiscs == NULL) {
+ err = -ENOMEM;
+ goto tc_err;
+ }
+
+ for (i = 0; i < dev->num_tc; i++) {
+ dev_queue = netdev_get_tx_queue(dev, 0);
+ qdisc = qdisc_create_dflt(dev_queue, &mq_qdisc_ops,
+ TC_H_MAKE(TC_H_MAJ(sch->handle),
+ TC_H_MIN(i + 1)));
+ if (qdisc == NULL) {
+ err = -ENOMEM;
+ goto err;
+ }
+ qdisc->flags |= TCQ_F_CAN_BYPASS;
+ priv->qdiscs[i] = qdisc;
+ }
+
+ sch->flags |= TCQ_F_MQROOT;
+ return 0;
+
+err:
+ mclass_destroy(sch);
+tc_err:
+ if (priv->hw_owned)
+ dev->netdev_ops->ndo_setup_tc(dev, 0);
+ else
+ netdev_set_num_tc(dev, 0);
+ return err;
+}
+
+static void mclass_attach(struct Qdisc *sch)
+{
+ struct net_device *dev = qdisc_dev(sch);
+ struct mclass_sched *priv = qdisc_priv(sch);
+ struct Qdisc *qdisc;
+ unsigned int ntc;
+
+ /* Attach underlying qdisc */
+ for (ntc = 0; ntc < dev->num_tc; ntc++) {
+ qdisc = priv->qdiscs[ntc];
+ if (qdisc->ops && qdisc->ops->attach)
+ qdisc->ops->attach(qdisc);
+ }
+}
+
+static int mclass_graft(struct Qdisc *sch, unsigned long cl, struct Qdisc *new,
+ struct Qdisc **old)
+{
+ struct net_device *dev = qdisc_dev(sch);
+ struct mclass_sched *priv = qdisc_priv(sch);
+ unsigned long ntc = cl - 1;
+
+ if (ntc >= dev->num_tc)
+ return -EINVAL;
+
+ if (dev->flags & IFF_UP)
+ dev_deactivate(dev);
+
+ *old = priv->qdiscs[ntc];
+ if (new == NULL)
+ new = &noop_qdisc;
+ priv->qdiscs[ntc] = new;
+ qdisc_reset(*old);
+
+ if (dev->flags & IFF_UP)
+ dev_activate(dev);
+
+ return 0;
+}
+
+static int mclass_dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+ struct net_device *dev = qdisc_dev(sch);
+ struct mclass_sched *priv = qdisc_priv(sch);
+ unsigned char *b = skb_tail_pointer(skb);
+ struct tc_mclass_qopt opt;
+ struct Qdisc *qdisc;
+ unsigned int ntc;
+
+ sch->q.qlen = 0;
+ memset(&sch->bstats, 0, sizeof(sch->bstats));
+ memset(&sch->qstats, 0, sizeof(sch->qstats));
+
+ for (ntc = 0; ntc < dev->num_tc; ntc++) {
+ qdisc = priv->qdiscs[ntc];
+ spin_lock_bh(qdisc_lock(qdisc));
+ sch->q.qlen += qdisc->q.qlen;
+ sch->bstats.bytes += qdisc->bstats.bytes;
+ sch->bstats.packets += qdisc->bstats.packets;
+ sch->qstats.qlen += qdisc->qstats.qlen;
+ sch->qstats.backlog += qdisc->qstats.backlog;
+ sch->qstats.drops += qdisc->qstats.drops;
+ sch->qstats.requeues += qdisc->qstats.requeues;
+ sch->qstats.overlimits += qdisc->qstats.overlimits;
+ spin_unlock_bh(qdisc_lock(qdisc));
+ }
+
+ opt.num_tc = dev->num_tc;
+ memcpy(opt.prio_tc_map, dev->prio_tc_map, 16);
+ opt.hw = priv->hw_owned;
+
+ for (ntc = 0; ntc < dev->num_tc; ntc++) {
+ struct netdev_tc_txq *tcp = &dev->_tc_to_txq[ntc];
+ opt.count[ntc] = tcp->count;
+ opt.offset[ntc] = tcp->offset;
+ }
+
+ NLA_PUT(skb, TCA_OPTIONS, sizeof(opt), &opt);
+
+ return skb->len;
+nla_put_failure:
+ nlmsg_trim(skb, b);
+ return -1;
+}
+
+static struct Qdisc *mclass_leaf(struct Qdisc *sch, unsigned long cl)
+{
+ struct net_device *dev = qdisc_dev(sch);
+ struct mclass_sched *priv = qdisc_priv(sch);
+ unsigned long ntc = cl - 1;
+
+ if (ntc >= dev->num_tc)
+ return NULL;
+ return priv->qdiscs[ntc];
+}
+
+static unsigned long mclass_get(struct Qdisc *sch, u32 classid)
+{
+ struct net_device *dev = qdisc_dev(sch);
+ unsigned int ntc = TC_H_MIN(classid);
+
+ if (ntc >= dev->num_tc)
+ return 0;
+ return ntc;
+}
+
+static void mclass_put(struct Qdisc *sch, unsigned long cl)
+{
+}
+
+static int mclass_dump_class(struct Qdisc *sch, unsigned long cl,
+ struct sk_buff *skb, struct tcmsg *tcm)
+{
+ struct Qdisc *class;
+ struct net_device *dev = qdisc_dev(sch);
+ struct mclass_sched *priv = qdisc_priv(sch);
+ unsigned long ntc = cl - 1;
+
+ if (ntc >= dev->num_tc)
+ return -EINVAL;
+
+ class = priv->qdiscs[ntc];
+
+ tcm->tcm_parent = TC_H_ROOT;
+ tcm->tcm_handle |= TC_H_MIN(cl);
+ tcm->tcm_info = class->handle;
+ return 0;
+}
+
+static int mclass_dump_class_stats(struct Qdisc *sch, unsigned long cl,
+ struct gnet_dump *d)
+{
+ struct Qdisc *class;
+ struct net_device *dev = qdisc_dev(sch);
+ struct mclass_sched *priv = qdisc_priv(sch);
+ unsigned long ntc = cl - 1;
+
+ if (ntc >= dev->num_tc)
+ return -EINVAL;
+
+ class = priv->qdiscs[ntc];
+ class->qstats.qlen = class->q.qlen;
+ if (gnet_stats_copy_basic(d, &class->bstats) < 0 ||
+ gnet_stats_copy_queue(d, &class->qstats) < 0)
+ return -1;
+ return 0;
+}
+
+static void mclass_walk(struct Qdisc *sch, struct qdisc_walker *arg)
+{
+ struct net_device *dev = qdisc_dev(sch);
+ unsigned long ntc;
+
+ if (arg->stop)
+ return;
+
+ arg->count = arg->skip;
+ for (ntc = arg->skip; ntc < dev->num_tc; ntc++) {
+ if (arg->fn(sch, ntc + 1, arg) < 0) {
+ arg->stop = 1;
+ break;
+ }
+ arg->count++;
+ }
+}
+
+static const struct Qdisc_class_ops mclass_class_ops = {
+ .graft = mclass_graft,
+ .leaf = mclass_leaf,
+ .get = mclass_get,
+ .put = mclass_put,
+ .walk = mclass_walk,
+ .dump = mclass_dump_class,
+ .dump_stats = mclass_dump_class_stats,
+};
+
+struct Qdisc_ops mclass_qdisc_ops __read_mostly = {
+ .cl_ops = &mclass_class_ops,
+ .id = "mclass",
+ .priv_size = sizeof(struct mclass_sched),
+ .init = mclass_init,
+ .destroy = mclass_destroy,
+ .attach = mclass_attach,
+ .dump = mclass_dump,
+ .owner = THIS_MODULE,
+};
* [RFC PATCH 4/4] ixgbe: add multiple txqs per tc
2010-12-09 19:59 [RFC PATCH 1/4] net: implement mechanism for HW based QOS John Fastabend
2010-12-09 20:00 ` [RFC PATCH 2/4] net/sched: Allow multiple mq qdisc to be used as non-root John Fastabend
2010-12-09 20:00 ` [RFC PATCH 3/4] net/sched: implement a root container qdisc sch_mclass John Fastabend
@ 2010-12-09 20:00 ` John Fastabend
2010-12-09 20:46 ` [RFC PATCH 1/4] net: implement mechanism for HW based QOS Eric Dumazet
3 siblings, 0 replies; 6+ messages in thread
From: John Fastabend @ 2010-12-09 20:00 UTC (permalink / raw)
To: davem; +Cc: netdev, hadi, shemminger, tgraf, eric.dumazet, john.r.fastabend
This illustrates the usage model for hardware QOS offloading.
Currently, DCB only enables a single queue per tc. Due to
complications with how to map tc filter rules to traffic classes
when multiple queues are enabled. And previously there was no
mechanism to map flows to multiple queues by priority.
Using the QOS offloading API we allocate multiple queues per
tc and configure the stack to hash across these queues. The
hardware then offloads the DCB extended transmission selection
algorithm. Sockets can set the priority using the SO_PRIORITY
socket option and expect ETS to work.
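For example, on a 16-core system with 8 traffic classes, each tc ends
up with min(num_online_cpus(), IXGBE_MAX_Q_PER_TC) = 8 queues at
offsets 0, 8, ..., 56, with priorities mapped 1:1 to traffic classes.
A sketch of that partitioning (this mirrors the loop in
ixgbe_setup_tc() in the diff below; numbers above are only an example):

/* Sketch of the even partitioning done in ixgbe_setup_tc(): each tc
 * gets the same number of queues, laid out back to back.
 */
static void example_partition(struct net_device *dev, u8 tcs)
{
	unsigned int i, q, offset = 0;

	for (i = 0; i < tcs; i++) {
		q = min((int)num_online_cpus(), IXGBE_MAX_Q_PER_TC);
		netdev_set_prio_tc_map(dev, i, i);	/* prio i -> tc i */
		netdev_set_tc_queue(dev, i, q, offset);	/* q queues at offset */
		offset += q;
	}
}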
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
drivers/net/ixgbe/ixgbe.h | 2
drivers/net/ixgbe/ixgbe_dcb_nl.c | 4 -
drivers/net/ixgbe/ixgbe_main.c | 256 ++++++++++++++++++++++----------------
3 files changed, 149 insertions(+), 113 deletions(-)
diff --git a/drivers/net/ixgbe/ixgbe.h b/drivers/net/ixgbe/ixgbe.h
index 3ae30b8..860b1fa 100644
--- a/drivers/net/ixgbe/ixgbe.h
+++ b/drivers/net/ixgbe/ixgbe.h
@@ -243,7 +243,7 @@ enum ixgbe_ring_f_enum {
RING_F_ARRAY_SIZE /* must be last in enum set */
};
-#define IXGBE_MAX_DCB_INDICES 8
+#define IXGBE_MAX_DCB_INDICES 64
#define IXGBE_MAX_RSS_INDICES 16
#define IXGBE_MAX_VMDQ_INDICES 64
#define IXGBE_MAX_FDIR_INDICES 64
diff --git a/drivers/net/ixgbe/ixgbe_dcb_nl.c b/drivers/net/ixgbe/ixgbe_dcb_nl.c
index bf566e8..d49b8ce 100644
--- a/drivers/net/ixgbe/ixgbe_dcb_nl.c
+++ b/drivers/net/ixgbe/ixgbe_dcb_nl.c
@@ -354,12 +354,12 @@ static u8 ixgbe_dcbnl_set_all(struct net_device *netdev)
{
struct ixgbe_adapter *adapter = netdev_priv(netdev);
int ret;
+ int tc = max(netdev->num_tc, MAX_TRAFFIC_CLASS);
if (!adapter->dcb_set_bitmap)
return DCB_NO_HW_CHG;
- ret = ixgbe_copy_dcb_cfg(&adapter->temp_dcb_cfg, &adapter->dcb_cfg,
- adapter->ring_feature[RING_F_DCB].indices);
+ ret = ixgbe_copy_dcb_cfg(&adapter->temp_dcb_cfg, &adapter->dcb_cfg, tc);
if (ret)
return DCB_NO_HW_CHG;
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index a12e86f..46b700d 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -638,6 +638,43 @@ void ixgbe_unmap_and_free_tx_resource(struct ixgbe_ring *tx_ring,
/* tx_buffer_info must be completely set up in the transmit path */
}
+#define IXGBE_MAX_Q_PER_TC (IXGBE_MAX_DCB_INDICES / MAX_TRAFFIC_CLASS)
+
+/* ixgbe setup routine for many traffic classes hardware only supports
+ * 4 or 8 traffic classes.
+ *
+ * JF: Todo, software should be able to map arbitrary TCs to 4 or 8 HW
+ * tcs. For illustration purposes require 4 or 8 tcs for now.
+ */
+int ixgbe_setup_tc(struct net_device *dev, u8 tcs)
+{
+ struct ixgbe_adapter *adapter = netdev_priv(dev);
+ int i, err = 0;
+ unsigned int q, offset = 0;
+
+ if (!tcs) {
+ err = netdev_set_num_tc(dev, tcs);
+ } else if (tcs != 4 || tcs != 8) {
+ if (!dev->max_tc && netdev_alloc_max_tc(dev, tcs))
+ return -ENOMEM;
+
+ if (netdev_set_num_tc(dev, tcs))
+ return -EINVAL;
+
+ /* Partition TX queues evenly amongst traffic classes */
+ for (i = 0; i < tcs; i++) {
+ q = min((int)num_online_cpus(), IXGBE_MAX_Q_PER_TC);
+ netdev_set_prio_tc_map(adapter->netdev, i, i);
+ netdev_set_tc_queue(adapter->netdev, i, q, offset);
+ offset += q;
+ }
+ } else {
+ err = -EINVAL;
+ }
+
+ return err;
+}
+
/**
* ixgbe_dcb_txq_to_tc - convert a reg index to a traffic class
* @adapter: driver private struct
@@ -651,7 +688,7 @@ void ixgbe_unmap_and_free_tx_resource(struct ixgbe_ring *tx_ring,
u8 ixgbe_dcb_txq_to_tc(struct ixgbe_adapter *adapter, u8 reg_idx)
{
int tc = -1;
- int dcb_i = adapter->ring_feature[RING_F_DCB].indices;
+ u8 num_tcs = netdev_get_num_tc(adapter->netdev);
/* if DCB is not enabled the queues have no TC */
if (!(adapter->flags & IXGBE_FLAG_DCB_ENABLED))
@@ -666,13 +703,13 @@ u8 ixgbe_dcb_txq_to_tc(struct ixgbe_adapter *adapter, u8 reg_idx)
tc = reg_idx >> 2;
break;
default:
- if (dcb_i != 4 && dcb_i != 8)
+ if (num_tcs != 4 && num_tcs != 8)
break;
/* if VMDq is enabled the lowest order bits determine TC */
if (adapter->flags & (IXGBE_FLAG_SRIOV_ENABLED |
IXGBE_FLAG_VMDQ_ENABLED)) {
- tc = reg_idx & (dcb_i - 1);
+ tc = reg_idx & (num_tcs - 1);
break;
}
@@ -685,9 +722,9 @@ u8 ixgbe_dcb_txq_to_tc(struct ixgbe_adapter *adapter, u8 reg_idx)
* will only ever be 8 or 4 and that reg_idx will never
* be greater then 128. The code without the power of 2
* optimizations would be:
- * (((reg_idx % 32) + 32) * dcb_i) >> (9 - reg_idx / 32)
+ * (((reg_idx % 32) + 32) * num_tcs) >> (9 - reg_idx / 32)
*/
- tc = ((reg_idx & 0X1F) + 0x20) * dcb_i;
+ tc = ((reg_idx & 0X1F) + 0x20) * num_tcs;
tc >>= 9 - (reg_idx >> 5);
}
@@ -4205,10 +4242,17 @@ static inline bool ixgbe_set_dcb_queues(struct ixgbe_adapter *adapter)
{
bool ret = false;
struct ixgbe_ring_feature *f = &adapter->ring_feature[RING_F_DCB];
+ int i, q;
if (!(adapter->flags & IXGBE_FLAG_DCB_ENABLED))
return ret;
+ f->indices = 0;
+ for (i = 0; i < MAX_TRAFFIC_CLASS; i++) {
+ q = min((int)num_online_cpus(), MAX_TRAFFIC_CLASS);
+ f->indices += q;
+ }
+
f->mask = 0x7 << 3;
adapter->num_rx_queues = f->indices;
adapter->num_tx_queues = f->indices;
@@ -4295,12 +4339,7 @@ static inline bool ixgbe_set_fcoe_queues(struct ixgbe_adapter *adapter)
if (adapter->flags & IXGBE_FLAG_FCOE_ENABLED) {
adapter->num_rx_queues = 1;
adapter->num_tx_queues = 1;
-#ifdef CONFIG_IXGBE_DCB
- if (adapter->flags & IXGBE_FLAG_DCB_ENABLED) {
- e_info(probe, "FCoE enabled with DCB\n");
- ixgbe_set_dcb_queues(adapter);
- }
-#endif
+
if (adapter->flags & IXGBE_FLAG_RSS_ENABLED) {
e_info(probe, "FCoE enabled with RSS\n");
if ((adapter->flags & IXGBE_FLAG_FDIR_HASH_CAPABLE) ||
@@ -4356,16 +4395,15 @@ static int ixgbe_set_num_queues(struct ixgbe_adapter *adapter)
if (ixgbe_set_sriov_queues(adapter))
goto done;
-#ifdef IXGBE_FCOE
- if (ixgbe_set_fcoe_queues(adapter))
- goto done;
-
-#endif /* IXGBE_FCOE */
#ifdef CONFIG_IXGBE_DCB
if (ixgbe_set_dcb_queues(adapter))
goto done;
-
#endif
+
+#ifdef IXGBE_FCOE
+ if (ixgbe_set_fcoe_queues(adapter))
+ goto done;
+#endif /* IXGBE_FCOE */
if (ixgbe_set_fdir_queues(adapter))
goto done;
@@ -4457,6 +4495,63 @@ static inline bool ixgbe_cache_ring_rss(struct ixgbe_adapter *adapter)
}
#ifdef CONFIG_IXGBE_DCB
+
+ /* ixgbe_get_first_reg_idx - Return first register index associated
+ * with this traffic class
+ */
+void ixgbe_get_first_reg_idx(struct ixgbe_adapter *adapter, u8 tc,
+ unsigned int *tx, unsigned int *rx)
+{
+ struct net_device *dev = adapter->netdev;
+ struct ixgbe_hw *hw = &adapter->hw;
+ u8 num_tcs = netdev_get_num_tc(dev);
+
+ *tx = 0;
+ *rx = 0;
+
+ switch (hw->mac.type) {
+ case ixgbe_mac_82598EB:
+ *tx = tc << 3;
+ *rx = tc << 2;
+ break;
+ case ixgbe_mac_82599EB:
+ case ixgbe_mac_X540:
+ if (num_tcs == 8) {
+ if (tc < 3) {
+ *tx = tc << 5;
+ *rx = tc << 4;
+ } else if (tc < 5) {
+ *tx = ((tc + 2) << 4);
+ *rx = tc << 4;
+ } else if (tc < num_tcs) {
+ *tx = ((tc + 8) << 3);
+ *rx = tc << 4;
+ }
+ } else if (num_tcs == 4) {
+ *rx = tc << 5;
+ switch (tc) {
+ case 0:
+ *tx = 0;
+ break;
+ case 1:
+ *tx = 64;
+ break;
+ case 2:
+ *tx = 96;
+ break;
+ case 3:
+ *tx = 112;
+ break;
+ default:
+ break;
+ }
+ }
+ break;
+ default:
+ break;
+ }
+}
+
/**
* ixgbe_cache_ring_dcb - Descriptor ring to register mapping for DCB
* @adapter: board private structure to initialize
@@ -4466,72 +4561,26 @@ static inline bool ixgbe_cache_ring_rss(struct ixgbe_adapter *adapter)
**/
static inline bool ixgbe_cache_ring_dcb(struct ixgbe_adapter *adapter)
{
- int i;
- bool ret = false;
- int dcb_i = adapter->ring_feature[RING_F_DCB].indices;
+ struct net_device *dev = adapter->netdev;
+ int i, j, k;
+ u8 num_tcs = netdev_get_num_tc(dev);
+ unsigned int tx_s, rx_s;
if (!(adapter->flags & IXGBE_FLAG_DCB_ENABLED))
return false;
/* the number of queues is assumed to be symmetric */
- switch (adapter->hw.mac.type) {
- case ixgbe_mac_82598EB:
- for (i = 0; i < dcb_i; i++) {
- adapter->rx_ring[i]->reg_idx = i << 3;
- adapter->tx_ring[i]->reg_idx = i << 2;
+ for (i = 0, k = 0; i < num_tcs; i++) {
+ struct netdev_tc_txq *tcp = netdev_get_tc_queue(dev, i);
+ u16 qcount = tcp->count;
+ ixgbe_get_first_reg_idx(adapter, i, &tx_s, &rx_s);
+ for (j = 0; j < qcount; j++, k++) {
+ adapter->tx_ring[k]->reg_idx = tx_s + j;
+ adapter->rx_ring[k]->reg_idx = rx_s + j;
}
- ret = true;
- break;
- case ixgbe_mac_82599EB:
- case ixgbe_mac_X540:
- if (dcb_i == 8) {
- /*
- * Tx TC0 starts at: descriptor queue 0
- * Tx TC1 starts at: descriptor queue 32
- * Tx TC2 starts at: descriptor queue 64
- * Tx TC3 starts at: descriptor queue 80
- * Tx TC4 starts at: descriptor queue 96
- * Tx TC5 starts at: descriptor queue 104
- * Tx TC6 starts at: descriptor queue 112
- * Tx TC7 starts at: descriptor queue 120
- *
- * Rx TC0-TC7 are offset by 16 queues each
- */
- for (i = 0; i < 3; i++) {
- adapter->tx_ring[i]->reg_idx = i << 5;
- adapter->rx_ring[i]->reg_idx = i << 4;
- }
- for ( ; i < 5; i++) {
- adapter->tx_ring[i]->reg_idx = ((i + 2) << 4);
- adapter->rx_ring[i]->reg_idx = i << 4;
- }
- for ( ; i < dcb_i; i++) {
- adapter->tx_ring[i]->reg_idx = ((i + 8) << 3);
- adapter->rx_ring[i]->reg_idx = i << 4;
- }
- ret = true;
- } else if (dcb_i == 4) {
- /*
- * Tx TC0 starts at: descriptor queue 0
- * Tx TC1 starts at: descriptor queue 64
- * Tx TC2 starts at: descriptor queue 96
- * Tx TC3 starts at: descriptor queue 112
- *
- * Rx TC0-TC3 are offset by 32 queues each
- */
- adapter->tx_ring[0]->reg_idx = 0;
- adapter->tx_ring[1]->reg_idx = 64;
- adapter->tx_ring[2]->reg_idx = 96;
- adapter->tx_ring[3]->reg_idx = 112;
- for (i = 0 ; i < dcb_i; i++)
- adapter->rx_ring[i]->reg_idx = i << 5;
- ret = true;
- }
- break;
- default:
- break;
}
- return ret;
+
+ return true;
}
#endif
@@ -4659,17 +4708,15 @@ static void ixgbe_cache_ring_register(struct ixgbe_adapter *adapter)
if (ixgbe_cache_ring_sriov(adapter))
return;
-
+#ifdef CONFIG_IXGBE_DCB
+ if (ixgbe_cache_ring_dcb(adapter))
+ return;
+#endif /* IXGBE_DCB */
#ifdef IXGBE_FCOE
if (ixgbe_cache_ring_fcoe(adapter))
return;
#endif /* IXGBE_FCOE */
-#ifdef CONFIG_IXGBE_DCB
- if (ixgbe_cache_ring_dcb(adapter))
- return;
-
-#endif
if (ixgbe_cache_ring_fdir(adapter))
return;
@@ -5133,7 +5180,7 @@ static int __devinit ixgbe_sw_init(struct ixgbe_adapter *adapter)
adapter->dcb_cfg.round_robin_enable = false;
adapter->dcb_set_bitmap = 0x00;
ixgbe_copy_dcb_cfg(&adapter->dcb_cfg, &adapter->temp_dcb_cfg,
- adapter->ring_feature[RING_F_DCB].indices);
+ MAX_TRAFFIC_CLASS);
#endif
@@ -5986,7 +6033,7 @@ static void ixgbe_watchdog_task(struct work_struct *work)
if (link_up) {
#ifdef CONFIG_DCB
if (adapter->flags & IXGBE_FLAG_DCB_ENABLED) {
- for (i = 0; i < MAX_TRAFFIC_CLASS; i++)
+ for (i = 0; i < netdev->max_tc; i++)
hw->mac.ops.fc_enable(hw, i);
} else {
hw->mac.ops.fc_enable(hw, 0);
@@ -6511,25 +6558,6 @@ static u16 ixgbe_select_queue(struct net_device *dev, struct sk_buff *skb)
{
struct ixgbe_adapter *adapter = netdev_priv(dev);
int txq = smp_processor_id();
-#ifdef IXGBE_FCOE
- __be16 protocol;
-
- protocol = vlan_get_protocol(skb);
-
- if ((protocol == htons(ETH_P_FCOE)) ||
- (protocol == htons(ETH_P_FIP))) {
- if (adapter->flags & IXGBE_FLAG_FCOE_ENABLED) {
- txq &= (adapter->ring_feature[RING_F_FCOE].indices - 1);
- txq += adapter->ring_feature[RING_F_FCOE].mask;
- return txq;
-#ifdef CONFIG_IXGBE_DCB
- } else if (adapter->flags & IXGBE_FLAG_DCB_ENABLED) {
- txq = adapter->fcoe.up;
- return txq;
-#endif
- }
- }
-#endif
if (adapter->flags & IXGBE_FLAG_FDIR_HASH_CAPABLE) {
while (unlikely(txq >= dev->real_num_tx_queues))
@@ -6537,14 +6565,20 @@ static u16 ixgbe_select_queue(struct net_device *dev, struct sk_buff *skb)
return txq;
}
- if (adapter->flags & IXGBE_FLAG_DCB_ENABLED) {
- if (skb->priority == TC_PRIO_CONTROL)
- txq = adapter->ring_feature[RING_F_DCB].indices-1;
- else
- txq = (skb->vlan_tci & IXGBE_TX_FLAGS_VLAN_PRIO_MASK)
- >> 13;
+#ifdef IXGBE_FCOE
+ /*
+ * If DCB is not enabled to assign FCoE a priority mapping
+ * we need to steer the skb to FCoE enabled tx rings.
+ */
+ if ((adapter->flags & IXGBE_FLAG_FCOE_ENABLED) &&
+ !(adapter->flags & IXGBE_FLAG_DCB_ENABLED) &&
+ ((skb->protocol == htons(ETH_P_FCOE)) ||
+ (skb->protocol == htons(ETH_P_FIP)))) {
+ txq &= (adapter->ring_feature[RING_F_FCOE].indices - 1);
+ txq += adapter->ring_feature[RING_F_FCOE].mask;
return txq;
}
+#endif
return skb_tx_hash(dev, skb);
}
@@ -6867,6 +6901,7 @@ static const struct net_device_ops ixgbe_netdev_ops = {
.ndo_set_vf_tx_rate = ixgbe_ndo_set_vf_bw,
.ndo_get_vf_config = ixgbe_ndo_get_vf_config,
.ndo_get_stats64 = ixgbe_get_stats64,
+ .ndo_setup_tc = ixgbe_setup_tc,
#ifdef CONFIG_NET_POLL_CONTROLLER
.ndo_poll_controller = ixgbe_netpoll,
#endif
@@ -7007,8 +7042,9 @@ static int __devinit ixgbe_probe(struct pci_dev *pdev,
else
indices = min_t(unsigned int, indices, IXGBE_MAX_FDIR_INDICES);
+#if defined(CONFIG_IXGBE_DCB)
indices = max_t(unsigned int, indices, IXGBE_MAX_DCB_INDICES);
-#ifdef IXGBE_FCOE
+#elif defined(IXGBE_FCOE)
indices += min_t(unsigned int, num_possible_cpus(),
IXGBE_MAX_FCOE_INDICES);
#endif
* Re: [RFC PATCH 1/4] net: implement mechanism for HW based QOS
2010-12-09 19:59 [RFC PATCH 1/4] net: implement mechanism for HW based QOS John Fastabend
` (2 preceding siblings ...)
2010-12-09 20:00 ` [RFC PATCH 4/4] ixgbe: add multiple txqs per tc John Fastabend
@ 2010-12-09 20:46 ` Eric Dumazet
2010-12-10 0:24 ` John Fastabend
3 siblings, 1 reply; 6+ messages in thread
From: Eric Dumazet @ 2010-12-09 20:46 UTC (permalink / raw)
To: John Fastabend; +Cc: davem, netdev, hadi, shemminger, tgraf
On Thursday, 09 December 2010 at 11:59 -0800, John Fastabend wrote:
> This patch provides a mechanism for lower layer devices to
> steer traffic using skb->priority to tx queues. This allows
> for hardware based QOS schemes to use the default qdisc without
> incurring the penalties related to global state and the qdisc
> lock. While reliably receiving skbs on the correct tx ring
> to avoid head of line blocking resulting from shuffling in
> the LLD. Finally, all the goodness from txq caching and xps/rps
> can still be leveraged.
>
> Many drivers and hardware exist with the ability to implement
> QOS schemes in the hardware but currently these drivers tend
> to rely on firmware to reroute specific traffic, a driver
> specific select_queue or the queue_mapping action in the
> qdisc.
>
> By using select_queue for this drivers need to be updated for
> each and every traffic type and we lose the goodness of much
> of the upstream work. Firmware solutions are inherently
> inflexible. And finally if admins are expected to build a
> qdisc and filter rules to steer traffic this requires knowledge
> of how the hardware is currently configured. The number of tx
> queues and the queue offsets may change depending on resources.
> Also this approach incurs all the overhead of a qdisc with filters.
>
> With the mechanism in this patch users can set skb priority using
> expected methods ie setsockopt() or the stack can set the priority
> directly. Then the skb will be steered to the correct tx queues
> aligned with hardware QOS traffic classes. In the normal case with
> a single traffic class and all queues in this class everything
> works as is until the LLD enables multiple tcs.
>
> To steer the skb we mask out the lower 4 bits of the priority
> and allow the hardware to configure upto 15 distinct classes
> of traffic. This is expected to be sufficient for most applications
> at any rate it is more then the 8021Q spec designates and is
> equal to the number of prio bands currently implemented in
> the default qdisc.
>
> This in conjunction with a userspace application such as
> lldpad can be used to implement 8021Q transmission selection
> algorithms one of these algorithms being the extended transmission
> selection algorithm currently being used for DCB.
>
Very nice Changelog!
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> ---
>
> include/linux/netdevice.h | 65 +++++++++++++++++++++++++++++++++++++++++++++
> net/core/dev.c | 39 ++++++++++++++++++++++++++-
> 2 files changed, 103 insertions(+), 1 deletions(-)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index a9ac5dc..c0d4fb1 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -646,6 +646,12 @@ struct xps_dev_maps {
> (nr_cpu_ids * sizeof(struct xps_map *)))
> #endif /* CONFIG_XPS */
>
> +/* HW offloaded queuing disciplines txq count and offset maps */
> +struct netdev_tc_txq {
> + u16 count;
> + u16 offset;
> +};
> +
> /*
> * This structure defines the management hooks for network devices.
> * The following hooks can be defined; unless noted otherwise, they are
> @@ -1146,6 +1152,10 @@ struct net_device {
> /* Data Center Bridging netlink ops */
> const struct dcbnl_rtnl_ops *dcbnl_ops;
> #endif
> + u8 max_tc;
> + u8 num_tc;
> + struct netdev_tc_txq *_tc_to_txq;
Given that this is up to 16*4 bytes (64), shouldn't we embed this in
the net_device struct to avoid one dereference?
> + u8 prio_tc_map[16];
>
> #if defined(CONFIG_FCOE) || defined(CONFIG_FCOE_MODULE)
> /* max exchange id for FCoE LRO by ddp */
> @@ -1162,6 +1172,58 @@ struct net_device {
> #define NETDEV_ALIGN 32
>
> static inline
> +int netdev_get_prio_tc_map(const struct net_device *dev, u32 prio)
> +{
> + return dev->prio_tc_map[prio & 15];
> +}
> +
> +static inline
> +int netdev_set_prio_tc_map(struct net_device *dev, u8 prio, u8 tc)
> +{
> + if (tc >= dev->num_tc)
> + return -EINVAL;
> +
> + dev->prio_tc_map[prio & 15] = tc & 15;
> + return 0;
> +}
> +
> +static inline
> +int netdev_set_tc_queue(struct net_device *dev, u8 tc, u16 count, u16 offset)
> +{
> + struct netdev_tc_txq *tcp;
> +
> + if (tc >= dev->num_tc)
> + return -EINVAL;
> +
> + tcp = &dev->_tc_to_txq[tc];
> + tcp->count = count;
> + tcp->offset = offset;
> + return 0;
> +}
> +
> +static inline
> +struct netdev_tc_txq *netdev_get_tc_queue(const struct net_device *dev, u8 tc)
> +{
> + return &dev->_tc_to_txq[tc];
> +}
> +
> +static inline
> +int netdev_set_num_tc(struct net_device *dev, u8 num_tc)
> +{
> + if (num_tc > dev->max_tc)
> + return -EINVAL;
> +
> + dev->num_tc = num_tc;
> + return 0;
> +}
> +
> +static inline
> +u8 netdev_get_num_tc(const struct net_device *dev)
> +{
> + return dev->num_tc;
> +}
> +
> +static inline
> struct netdev_queue *netdev_get_tx_queue(const struct net_device *dev,
> unsigned int index)
> {
> @@ -1386,6 +1448,9 @@ static inline void unregister_netdevice(struct net_device *dev)
> unregister_netdevice_queue(dev, NULL);
> }
>
> +extern int netdev_alloc_max_tc(struct net_device *dev, u8 tc);
> +extern void netdev_free_tc(struct net_device *dev);
> +
> extern int netdev_refcnt_read(const struct net_device *dev);
> extern void free_netdev(struct net_device *dev);
> extern void synchronize_net(void);
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 55ff66f..cc00e66 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2118,6 +2118,8 @@ static u32 hashrnd __read_mostly;
> u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb)
> {
> u32 hash;
> + u16 qoffset = 0;
> + u16 qcount = dev->real_num_tx_queues;
>
> if (skb_rx_queue_recorded(skb)) {
> hash = skb_get_rx_queue(skb);
> @@ -2126,13 +2128,20 @@ u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb)
> return hash;
> }
>
> + if (dev->num_tc) {
> + u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
> + struct netdev_tc_txq *tcp = netdev_get_tc_queue(dev, tc);
> + qoffset = tcp->offset;
> + qcount = tcp->count;
> + }
> +
> if (skb->sk && skb->sk->sk_hash)
> hash = skb->sk->sk_hash;
> else
> hash = (__force u16) skb->protocol ^ skb->rxhash;
> hash = jhash_1word(hash, hashrnd);
>
> - return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32);
> + return (u16) ((((u64) hash * qcount)) >> 32) + qoffset;
> }
> EXPORT_SYMBOL(skb_tx_hash);
>
> @@ -5091,6 +5100,33 @@ void netif_stacked_transfer_operstate(const struct net_device *rootdev,
> }
> EXPORT_SYMBOL(netif_stacked_transfer_operstate);
>
> +int netdev_alloc_max_tc(struct net_device *dev, u8 tcs)
> +{
> + struct netdev_tc_txq *tcp;
> +
> + if (tcs > 16)
> + return -EINVAL;
> +
> + tcp = kcalloc(tcs, sizeof(*tcp), GFP_KERNEL);
Common risk: allocating less than one cache line, and this can
possibly lead to false sharing.
I would just embed the thing.
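In other words, the suggestion is roughly the following layout (a
sketch of the alternative only, not an actual patch; names here are
illustrative):

/* Sketch of the embedded alternative: the whole table is 16 * 4 = 64
 * bytes, so it can sit directly in struct net_device with no extra
 * dereference and no separate sub-cacheline allocation that might be
 * falsely shared.
 */
#include <linux/types.h>

struct example_netdev_tc_txq {
	u16 count;
	u16 offset;
};

/* fields as they would appear inside struct net_device */
struct example_embedded_tc_fields {
	u8 max_tc;
	u8 num_tc;
	struct example_netdev_tc_txq tc_to_txq[16];	/* embedded array */
	u8 prio_tc_map[16];
};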
* Re: [RFC PATCH 1/4] net: implement mechanism for HW based QOS
2010-12-09 20:46 ` [RFC PATCH 1/4] net: implement mechanism for HW based QOS Eric Dumazet
@ 2010-12-10 0:24 ` John Fastabend
0 siblings, 0 replies; 6+ messages in thread
From: John Fastabend @ 2010-12-10 0:24 UTC (permalink / raw)
To: Eric Dumazet
Cc: davem@davemloft.net, netdev@vger.kernel.org, hadi@cyberus.ca,
shemminger@vyatta.com, tgraf@infradead.org
On 12/9/2010 12:46 PM, Eric Dumazet wrote:
> Le jeudi 09 décembre 2010 à 11:59 -0800, John Fastabend a écrit :
>> This patch provides a mechanism for lower layer devices to
>> steer traffic using skb->priority to tx queues. This allows
>> for hardware based QOS schemes to use the default qdisc without
>> incurring the penalties related to global state and the qdisc
>> lock. While reliably receiving skbs on the correct tx ring
>> to avoid head of line blocking resulting from shuffling in
>> the LLD. Finally, all the goodness from txq caching and xps/rps
>> can still be leveraged.
>>
>> Many drivers and hardware exist with the ability to implement
>> QOS schemes in the hardware but currently these drivers tend
>> to rely on firmware to reroute specific traffic, a driver
>> specific select_queue or the queue_mapping action in the
>> qdisc.
>>
>> By using select_queue for this drivers need to be updated for
>> each and every traffic type and we lose the goodness of much
>> of the upstream work. Firmware solutions are inherently
>> inflexible. And finally if admins are expected to build a
>> qdisc and filter rules to steer traffic this requires knowledge
>> of how the hardware is currently configured. The number of tx
>> queues and the queue offsets may change depending on resources.
>> Also this approach incurs all the overhead of a qdisc with filters.
>>
>> With the mechanism in this patch users can set skb priority using
>> expected methods ie setsockopt() or the stack can set the priority
>> directly. Then the skb will be steered to the correct tx queues
>> aligned with hardware QOS traffic classes. In the normal case with
>> a single traffic class and all queues in this class everything
>> works as is until the LLD enables multiple tcs.
>>
>> To steer the skb we mask out the lower 4 bits of the priority
>> and allow the hardware to configure upto 15 distinct classes
>> of traffic. This is expected to be sufficient for most applications
>> at any rate it is more then the 8021Q spec designates and is
>> equal to the number of prio bands currently implemented in
>> the default qdisc.
>>
>> This in conjunction with a userspace application such as
>> lldpad can be used to implement 8021Q transmission selection
>> algorithms one of these algorithms being the extended transmission
>> selection algorithm currently being used for DCB.
>>
>
> Very nice Changelog !
>
>> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
>> ---
>>
>> include/linux/netdevice.h | 65 +++++++++++++++++++++++++++++++++++++++++++++
>> net/core/dev.c | 39 ++++++++++++++++++++++++++-
>> 2 files changed, 103 insertions(+), 1 deletions(-)
>>
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index a9ac5dc..c0d4fb1 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -646,6 +646,12 @@ struct xps_dev_maps {
>> (nr_cpu_ids * sizeof(struct xps_map *)))
>> #endif /* CONFIG_XPS */
>>
>> +/* HW offloaded queuing disciplines txq count and offset maps */
>> +struct netdev_tc_txq {
>> + u16 count;
>> + u16 offset;
>> +};
>> +
>> /*
>> * This structure defines the management hooks for network devices.
>> * The following hooks can be defined; unless noted otherwise, they are
>> @@ -1146,6 +1152,10 @@ struct net_device {
>> /* Data Center Bridging netlink ops */
>> const struct dcbnl_rtnl_ops *dcbnl_ops;
>> #endif
>> + u8 max_tc;
>> + u8 num_tc;
>> + struct netdev_tc_txq *_tc_to_txq;
>
> Given that this is up to 16*4 bytes (64), shouldnt we embed this in
> net_device struct to avoid one dereference ?
>
>
>> + u8 prio_tc_map[16];
>>
>> #if defined(CONFIG_FCOE) || defined(CONFIG_FCOE_MODULE)
>> /* max exchange id for FCoE LRO by ddp */
>> @@ -1162,6 +1172,58 @@ struct net_device {
>> #define NETDEV_ALIGN 32
>>
>> static inline
>> +int netdev_get_prio_tc_map(const struct net_device *dev, u32 prio)
>> +{
>> + return dev->prio_tc_map[prio & 15];
>> +}
>> +
>> +static inline
>> +int netdev_set_prio_tc_map(struct net_device *dev, u8 prio, u8 tc)
>> +{
>> + if (tc >= dev->num_tc)
>> + return -EINVAL;
>> +
>> + dev->prio_tc_map[prio & 15] = tc & 15;
>> + return 0;
>> +}
>> +
>> +static inline
>> +int netdev_set_tc_queue(struct net_device *dev, u8 tc, u16 count, u16 offset)
>> +{
>> + struct netdev_tc_txq *tcp;
>> +
>> + if (tc >= dev->num_tc)
>> + return -EINVAL;
>> +
>> + tcp = &dev->_tc_to_txq[tc];
>> + tcp->count = count;
>> + tcp->offset = offset;
>> + return 0;
>> +}
>> +
>> +static inline
>> +struct netdev_tc_txq *netdev_get_tc_queue(const struct net_device *dev, u8 tc)
>> +{
>> + return &dev->_tc_to_txq[tc];
>> +}
>> +
>> +static inline
>> +int netdev_set_num_tc(struct net_device *dev, u8 num_tc)
>> +{
>> + if (num_tc > dev->max_tc)
>> + return -EINVAL;
>> +
>> + dev->num_tc = num_tc;
>> + return 0;
>> +}
>> +
>> +static inline
>> +u8 netdev_get_num_tc(const struct net_device *dev)
>> +{
>> + return dev->num_tc;
>> +}
>> +
>> +static inline
>> struct netdev_queue *netdev_get_tx_queue(const struct net_device *dev,
>> unsigned int index)
>> {
>> @@ -1386,6 +1448,9 @@ static inline void unregister_netdevice(struct net_device *dev)
>> unregister_netdevice_queue(dev, NULL);
>> }
>>
>> +extern int netdev_alloc_max_tc(struct net_device *dev, u8 tc);
>> +extern void netdev_free_tc(struct net_device *dev);
>> +
>> extern int netdev_refcnt_read(const struct net_device *dev);
>> extern void free_netdev(struct net_device *dev);
>> extern void synchronize_net(void);
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 55ff66f..cc00e66 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -2118,6 +2118,8 @@ static u32 hashrnd __read_mostly;
>> u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb)
>> {
>> u32 hash;
>> + u16 qoffset = 0;
>> + u16 qcount = dev->real_num_tx_queues;
>>
>> if (skb_rx_queue_recorded(skb)) {
>> hash = skb_get_rx_queue(skb);
>> @@ -2126,13 +2128,20 @@ u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb)
>> return hash;
>> }
>>
>> + if (dev->num_tc) {
>> + u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
>> + struct netdev_tc_txq *tcp = netdev_get_tc_queue(dev, tc);
>> + qoffset = tcp->offset;
>> + qcount = tcp->count;
>> + }
>> +
>> if (skb->sk && skb->sk->sk_hash)
>> hash = skb->sk->sk_hash;
>> else
>> hash = (__force u16) skb->protocol ^ skb->rxhash;
>> hash = jhash_1word(hash, hashrnd);
>>
>> - return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32);
>> + return (u16) ((((u64) hash * qcount)) >> 32) + qoffset;
>> }
>> EXPORT_SYMBOL(skb_tx_hash);
>>
>> @@ -5091,6 +5100,33 @@ void netif_stacked_transfer_operstate(const struct net_device *rootdev,
>> }
>> EXPORT_SYMBOL(netif_stacked_transfer_operstate);
>>
>> +int netdev_alloc_max_tc(struct net_device *dev, u8 tcs)
>> +{
>> + struct netdev_tc_txq *tcp;
>> +
>> + if (tcs > 16)
>> + return -EINVAL;
>> +
>> + tcp = kcalloc(tcs, sizeof(*tcp), GFP_KERNEL);
>
> common risk : allocating less than one cache line, and this possibly can
> have false sharing.
>
> I would just embed the thing.
>
Yes, I think you are right, plus this simplifies the code a bit. I'll go ahead and do this. Thanks!