[PATCH] NET: DCB generic netlink interface

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] NET: DCB generic netlink interface
@ 2008-05-27 14:13 PJ Waskiewicz
  2008-05-27 14:13 ` [PATCH 1/3] [NET-NEXT]: Add DCB netlink interface definition PJ Waskiewicz
                   ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: PJ Waskiewicz @ 2008-05-27 14:13 UTC (permalink / raw)
  To: jeff, davem; +Cc: netdev

This patchset adds the initial DCB generic netlink interface to the kernel.
It adds the layer as a generic interface for any DCB-capable device through
the netdevice.

This patchset also includes an implementation using this interface in the
ixgbe driver.  It adds the hardware-specific code to turn the interface on,
and includes the netlink callbacks in the driver to perform the requested
operations.

These patches are targeted at the net-next-2.6 tree, for 2.6.27.  The patch
series is as follows:

patch 1: DCB netlink interface in-kernel
patch 2: ixgbe DCB hardware-specific patches
patch 3: enable DCB in ixgbe

Thanks,
-- 
PJ Waskiewicz <peter.p.waskiewicz.jr@intel.com>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/3] [NET-NEXT]: Add DCB netlink interface definition
  2008-05-27 14:13 [PATCH] NET: DCB generic netlink interface PJ Waskiewicz
@ 2008-05-27 14:13 ` PJ Waskiewicz
  2008-05-28  9:41   ` Thomas Graf
  2008-06-05 13:17   ` Patrick McHardy
  2008-05-27 14:13 ` [PATCH 2/3] ixgbe: Add Data Center Bridging hardware initialization code PJ Waskiewicz
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 24+ messages in thread
From: PJ Waskiewicz @ 2008-05-27 14:13 UTC (permalink / raw)
  To: jeff, davem; +Cc: netdev

This patch adds the netlink interface definition for Data Center Bridging.
This technology uses 802.1Qaz and 801.1Qbb for extending ethernet to
converge different traffic types on a single link.  E.g. Fibre Channel
over Ethernet and regular LAN traffic.  The goal is to use priority flow
control to pause individual flows at the MAC/network level, without
impacting other network flows.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
---

 include/linux/dcbnl.h     |  241 +++++++++++++++
 include/linux/netdevice.h |    8 
 net/Kconfig               |    1 
 net/Makefile              |    3 
 net/dcb/Kconfig           |   12 +
 net/dcb/Makefile          |    1 
 net/dcb/dcbnl.c           |  722 +++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 988 insertions(+), 0 deletions(-)

diff --git a/include/linux/dcbnl.h b/include/linux/dcbnl.h
new file mode 100644
index 0000000..db50f6c
--- /dev/null
+++ b/include/linux/dcbnl.h
@@ -0,0 +1,241 @@
+#ifndef __LINUX_DCBNL_H__
+#define __LINUX_DCBNL_H__
+/*
+ * Data Center Bridging (DCB) netlink header
+ *
+ * Copyright 2008, Peter P. Waskiewicz Jr. <peter.p.waskiewicz.jr@intel.com>
+ */
+
+#define DCB_PROTO_VERSION 1
+
+/**
+ * enum dcbnl_commands - supported DCB commands
+ *
+ * @DCB_CMD_UNDEFINED: unspecified command to catch errors
+ * @DCB_CMD_GSTATE: request the state of DCB in the device
+ * @DCB_CMD_SSTATE: set the state of DCB in the device
+ * @DCB_CMD_PGTX_GCFG: request the priority group configuration for Tx
+ * @DCB_CMD_PGTX_SCFG: set the priority group configuration for Tx
+ * @DCB_CMD_PGRX_GCFG: request the priority group configuration for Rx
+ * @DCB_CMD_PGRX_SCFG: set the priority group configuration for Rx
+ * @DCB_CMD_PFC_GCFG: request the priority flow control configuration
+ * @DCB_CMD_PFC_SCFG: set the priority flow control configuration
+ * @DCB_CMD_SET_ALL: apply all changes to the underlying device
+ * @DCB_CMD_GPERM_HWADDR: get the permanent MAC address of the underlying
+ *                        device.  Only useful when using bonding.
+ */
+enum dcbnl_commands {
+	DCB_CMD_UNDEFINED,
+
+	DCB_CMD_GSTATE,
+	DCB_CMD_SSTATE,
+
+	DCB_CMD_PGTX_GCFG,
+	DCB_CMD_PGTX_SCFG,
+	DCB_CMD_PGRX_GCFG,
+	DCB_CMD_PGRX_SCFG,
+
+	DCB_CMD_PFC_GCFG,
+	DCB_CMD_PFC_SCFG,
+
+	DCB_CMD_SET_ALL,
+	DCB_CMD_GPERM_HWADDR,
+
+	__DCB_CMD_ENUM_MAX,
+	DCB_CMD_MAX = __DCB_CMD_ENUM_MAX - 1,
+};
+
+
+/**
+ * enum dcbnl_attrs - DCB top-level netlink attributes
+ *
+ * @DCB_ATTR_UNDEFINED: unspecified attribute to catch errors
+ * @DCB_ATTR_IFNAME: interface name of the underlying device (NLA_STRING)
+ * @DCB_ATTR_STATE: state of the DCB state machine in the device (NLA_U8)
+ * @DCB_ATTR_PFC_CFG: priority flow control configuration (NLA_NESTED)
+ * @DCB_ATTR_PG_CFG: priority group configuration (NLA_NESTED)
+ * @DCB_ATTR_SET_ALL: bool to commit changes to hardware or not (NLA_U8)
+ * @DCB_ATTR_PERM_HWADDR: MAC address of the physical device (NLA_NESTED)
+ */
+enum dcbnl_attrs {
+	DCB_ATTR_UNDEFINED,
+
+	DCB_ATTR_IFNAME,
+	DCB_ATTR_STATE,
+	DCB_ATTR_PFC_CFG,
+	DCB_ATTR_PG_CFG,
+	DCB_ATTR_SET_ALL,
+	DCB_ATTR_PERM_HWADDR,
+
+	__DCB_ATTR_ENUM_MAX,
+	DCB_ATTR_MAX = __DCB_ATTR_ENUM_MAX - 1,
+};
+
+
+/**
+ * enum dcbnl_perm_hwaddr_attrs - DCB Permanent HW Address nested attributes
+ *
+ * @DCB_PERM_HW_ATTR_UNDEFINED: unspecified attribute to catch errors
+ * @DCB_PERM_HW_ATTR_0: MAC address from receive address 0 (NLA_U8)
+ * @DCB_PERM_HW_ATTR_1: MAC address from receive address 1 (NLA_U8)
+ * @DCB_PERM_HW_ATTR_2: MAC address from receive address 2 (NLA_U8)
+ * @DCB_PERM_HW_ATTR_3: MAC address from receive address 3 (NLA_U8)
+ * @DCB_PERM_HW_ATTR_4: MAC address from receive address 4 (NLA_U8)
+ * @DCB_PERM_HW_ATTR_5: MAC address from receive address 5 (NLA_U8)
+ * @DCB_PERM_HW_ATTR_ALL: apply to all MAC addresses (NLA_FLAG)
+ *
+ * These attributes are used when bonding DCB interfaces together.
+ *
+ */
+enum dcbnl_perm_hwaddr_attrs {
+	DCB_PERM_HW_ATTR_UNDEFINED,
+
+	DCB_PERM_HW_ATTR_0,
+	DCB_PERM_HW_ATTR_1,
+	DCB_PERM_HW_ATTR_2,
+	DCB_PERM_HW_ATTR_3,
+	DCB_PERM_HW_ATTR_4,
+	DCB_PERM_HW_ATTR_5,
+	DCB_PERM_HW_ATTR_ALL,
+
+	__DCB_PERM_HW_ATTR_ENUM_MAX,
+	DCB_PERM_HW_ATTR_MAX = __DCB_PERM_HW_ATTR_ENUM_MAX - 1,
+};
+
+/**
+ * enum dcbnl_pfc_attrs - DCB Priority Flow Control user-priority nested attrs
+ *
+ * @DCB_PFC_UP_ATTR_UNDEFINED: unspecified attribute to catch errors
+ * @DCB_PFC_UP_ATTR_0: Priority Flow Control value for User Priority 0 (NLA_U8)
+ * @DCB_PFC_UP_ATTR_1: Priority Flow Control value for User Priority 1 (NLA_U8)
+ * @DCB_PFC_UP_ATTR_2: Priority Flow Control value for User Priority 2 (NLA_U8)
+ * @DCB_PFC_UP_ATTR_3: Priority Flow Control value for User Priority 3 (NLA_U8)
+ * @DCB_PFC_UP_ATTR_4: Priority Flow Control value for User Priority 4 (NLA_U8)
+ * @DCB_PFC_UP_ATTR_5: Priority Flow Control value for User Priority 5 (NLA_U8)
+ * @DCB_PFC_UP_ATTR_6: Priority Flow Control value for User Priority 6 (NLA_U8)
+ * @DCB_PFC_UP_ATTR_7: Priority Flow Control value for User Priority 7 (NLA_U8)
+ * @DCB_PFC_UP_ATTR_MAX: highest attribute number currently defined
+ * @DCB_PFC_UP_ATTR_ALL: apply to all priority flow control attrs (NLA_FLAG) 
+ *
+ */
+enum dcbnl_pfc_up_attrs {
+	DCB_PFC_UP_ATTR_UNDEFINED,
+
+	DCB_PFC_UP_ATTR_0,
+	DCB_PFC_UP_ATTR_1,
+	DCB_PFC_UP_ATTR_2,
+	DCB_PFC_UP_ATTR_3,
+	DCB_PFC_UP_ATTR_4,
+	DCB_PFC_UP_ATTR_5,
+	DCB_PFC_UP_ATTR_6,
+	DCB_PFC_UP_ATTR_7,
+	DCB_PFC_UP_ATTR_ALL,
+
+	__DCB_PFC_UP_ATTR_ENUM_MAX,
+	DCB_PFC_UP_ATTR_MAX = __DCB_PFC_UP_ATTR_ENUM_MAX - 1,
+};
+
+/**
+ * enum dcbnl_pg_attrs - DCB Priority Group attributes
+ *
+ * @DCB_PG_ATTR_UNDEFINED: unspecified attribute to catch errors
+ * @DCB_PG_ATTR_TC_0: Priority Group Traffic Class 0 configuration (NLA_NESTED)
+ * @DCB_PG_ATTR_TC_1: Priority Group Traffic Class 1 configuration (NLA_NESTED)
+ * @DCB_PG_ATTR_TC_2: Priority Group Traffic Class 2 configuration (NLA_NESTED)
+ * @DCB_PG_ATTR_TC_3: Priority Group Traffic Class 3 configuration (NLA_NESTED)
+ * @DCB_PG_ATTR_TC_4: Priority Group Traffic Class 4 configuration (NLA_NESTED)
+ * @DCB_PG_ATTR_TC_5: Priority Group Traffic Class 5 configuration (NLA_NESTED)
+ * @DCB_PG_ATTR_TC_6: Priority Group Traffic Class 6 configuration (NLA_NESTED)
+ * @DCB_PG_ATTR_TC_7: Priority Group Traffic Class 7 configuration (NLA_NESTED)
+ * @DCB_PG_ATTR_TC_MAX: highest attribute number currently defined
+ * @DCB_PG_ATTR_TC_ALL: apply to all traffic classes (NLA_NESTED)
+ * @DCB_PG_ATTR_BWG_0: Bandwidth group 0 configuration (NLA_U8)
+ * @DCB_PG_ATTR_BWG_1: Bandwidth group 1 configuration (NLA_U8)
+ * @DCB_PG_ATTR_BWG_2: Bandwidth group 2 configuration (NLA_U8)
+ * @DCB_PG_ATTR_BWG_3: Bandwidth group 3 configuration (NLA_U8)
+ * @DCB_PG_ATTR_BWG_4: Bandwidth group 4 configuration (NLA_U8)
+ * @DCB_PG_ATTR_BWG_5: Bandwidth group 5 configuration (NLA_U8)
+ * @DCB_PG_ATTR_BWG_6: Bandwidth group 6 configuration (NLA_U8)
+ * @DCB_PG_ATTR_BWG_7: Bandwidth group 7 configuration (NLA_U8)
+ * @DCB_PG_ATTR_BWG_MAX: highest attribute number currently defined
+ * @DCB_PG_ATTR_BWG_ALL: apply to all bandwidth groups (NLA_FLAG)
+ *
+ */
+enum dcbnl_pg_attrs {
+	DCB_PG_ATTR_UNDEFINED,
+
+	DCB_PG_ATTR_TC_0,
+	DCB_PG_ATTR_TC_1,
+	DCB_PG_ATTR_TC_2,
+	DCB_PG_ATTR_TC_3,
+	DCB_PG_ATTR_TC_4,
+	DCB_PG_ATTR_TC_5,
+	DCB_PG_ATTR_TC_6,
+	DCB_PG_ATTR_TC_7,
+	DCB_PG_ATTR_TC_MAX,
+	DCB_PG_ATTR_TC_ALL,
+
+	DCB_PG_ATTR_BWG_0,
+	DCB_PG_ATTR_BWG_1,
+	DCB_PG_ATTR_BWG_2,
+	DCB_PG_ATTR_BWG_3,
+	DCB_PG_ATTR_BWG_4,
+	DCB_PG_ATTR_BWG_5,
+	DCB_PG_ATTR_BWG_6,
+	DCB_PG_ATTR_BWG_7,
+	DCB_PG_ATTR_BWG_MAX,
+	DCB_PG_ATTR_BWG_ALL,
+
+	__DCB_PG_ATTR_ENUM_MAX,
+	DCB_PG_ATTR_MAX = __DCB_PG_ATTR_ENUM_MAX - 1,
+};
+
+/**
+ * enum dcbnl_tc_attrs - DCB Traffic Class attributes
+ *
+ * @DCB_TC_ATTR_PARAM_UNDEFINED: unspecified attribute to catch errors
+ * @DCB_TC_ATTR_PARAM_STRICT_PRIO: Type of strict bandwidth aggregration (link
+ *                                 strict or group strict) (NLA_U8)
+ * @DCB_TC_ATTR_PARAM_BW_GROUP_ID: Bandwidth group this traffic class belongs to
+ *                                 (NLA_U8)
+ * @DCB_TC_ATTR_PARAM_BW_PCT: Percentage of bandwidth in the bandwidth group
+ *                            this traffic class has (NLA_U8)
+ * @DCB_TC_ATTR_PARAM_UP_MAPPING: Traffic class to user priority map (NLA_U8)
+ * @DCB_TC_ATTR_PARAM_ALL: apply to all traffic class parameters (NLA_FLAG)
+ *
+ */
+enum dcbnl_tc_attrs {
+	DCB_TC_ATTR_PARAM_UNDEFINED,
+
+	DCB_TC_ATTR_PARAM_STRICT_PRIO,
+	DCB_TC_ATTR_PARAM_BW_GROUP_ID,
+	DCB_TC_ATTR_PARAM_BW_PCT,
+	DCB_TC_ATTR_PARAM_UP_MAPPING,
+	DCB_TC_ATTR_PARAM_ALL,
+
+	__DCB_TC_ATTR_PARAM_ENUM_MAX,
+	DCB_TC_ATTR_PARAM_MAX = __DCB_TC_ATTR_PARAM_ENUM_MAX - 1,
+};
+
+/* 
+ * Ops struct for the netlink callbacks.  Used by DCB-enabled drivers through
+ * the netdevice struct.
+ */
+struct dcbnl_genl_ops {
+	u8   (*getstate)(struct net_device *);
+	void (*setstate)(struct net_device *, u8);
+	void (*getpermhwaddr)(struct net_device *, u8 *);
+	void (*setpgtccfgtx)(struct net_device *, int, u8, u8, u8, u8);
+	void (*setpgbwgcfgtx)(struct net_device *, int, u8);
+	void (*setpgtccfgrx)(struct net_device *, int, u8, u8, u8, u8);
+	void (*setpgbwgcfgrx)(struct net_device *, int, u8);
+	void (*getpgtccfgtx)(struct net_device *, int, u8 *, u8 *, u8 *, u8 *);
+	void (*getpgbwgcfgtx)(struct net_device *, int, u8 *);
+	void (*getpgtccfgrx)(struct net_device *, int, u8 *, u8 *, u8 *, u8 *);
+	void (*getpgbwgcfgrx)(struct net_device *, int, u8 *);
+	void (*setpfccfg)(struct net_device *, int, u8);
+	void (*getpfccfg)(struct net_device *, int, u8 *);
+	u8   (*setall)(struct net_device *);
+};
+
+#endif /* __LINUX_DCBNL_H__ */
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f27fd20..f28a1fa 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -42,6 +42,9 @@
 #include <linux/workqueue.h>
 
 #include <net/net_namespace.h>
+#ifdef CONFIG_DCBNL
+#include <linux/dcbnl.h>
+#endif
 
 struct vlan_group;
 struct ethtool_ops;
@@ -752,6 +755,11 @@ struct net_device
 #define GSO_MAX_SIZE		65536
 	unsigned int		gso_max_size;
 
+#ifdef CONFIG_DCBNL
+	/* Data Center Bridging netlink ops */
+	struct dcbnl_genl_ops *dcbnl_ops;
+#endif
+
 	/* The TX queue control structures */
 	unsigned int			egress_subqueue_count;
 	struct net_device_subqueue	egress_subqueue[1];
diff --git a/net/Kconfig b/net/Kconfig
index acbf7c6..fc6b832 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -192,6 +192,7 @@ source "net/lapb/Kconfig"
 source "net/econet/Kconfig"
 source "net/wanrouter/Kconfig"
 source "net/sched/Kconfig"
+source "net/dcb/Kconfig"
 
 menu "Network testing"
 
diff --git a/net/Makefile b/net/Makefile
index b7a1364..bc43e77 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -53,6 +53,9 @@ obj-$(CONFIG_NETLABEL)		+= netlabel/
 obj-$(CONFIG_IUCV)		+= iucv/
 obj-$(CONFIG_RFKILL)		+= rfkill/
 obj-$(CONFIG_NET_9P)		+= 9p/
+ifeq ($(CONFIG_DCBNL),y)
+obj-$(CONFIG_DCB)		+= dcb/
+endif
 
 ifeq ($(CONFIG_NET),y)
 obj-$(CONFIG_SYSCTL)		+= sysctl_net.o
diff --git a/net/dcb/Kconfig b/net/dcb/Kconfig
new file mode 100644
index 0000000..bdf3880
--- /dev/null
+++ b/net/dcb/Kconfig
@@ -0,0 +1,12 @@
+config DCB
+        tristate "Data Center Bridging support"
+
+config DCBNL
+	bool "Data Center Bridging netlink interface support"
+	depends on DCB
+	default n
+	---help---
+	  This option turns on the netlink interface
+	  (dcbnl) for Data Center Bridging capable devices.
+
+	  If unsure, say N.
diff --git a/net/dcb/Makefile b/net/dcb/Makefile
new file mode 100644
index 0000000..9930f4c
--- /dev/null
+++ b/net/dcb/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_DCB) += dcbnl.o
diff --git a/net/dcb/dcbnl.c b/net/dcb/dcbnl.c
new file mode 100644
index 0000000..f5f4c31
--- /dev/null
+++ b/net/dcb/dcbnl.c
@@ -0,0 +1,722 @@
+/*
+ * This is the Data Center Bridging configuration interface.
+ *
+ * Copyright 2008, Peter P. Waskiewicz Jr. <peter.p.waskiewicz.jr@intel.com>
+ *
+ */
+
+#include <linux/netdevice.h>
+#include <linux/netlink.h>
+#include <linux/genetlink.h>
+#include <net/genetlink.h>
+#include <linux/dcbnl.h>
+
+MODULE_AUTHOR("Peter P Waskiewicz Jr, <peter.p.waskiewicz.jr@intel.com>");
+MODULE_DESCRIPTION("Data Center Bridging generic netlink interface");
+MODULE_LICENSE("GPL");
+
+/* The family */
+static struct genl_family dcbnl_family = {
+	.id = GENL_ID_GENERATE,
+	.hdrsize = 0,
+	.name = "dcbnl",
+	.version = DCB_PROTO_VERSION,
+	.maxattr = DCB_ATTR_MAX,
+};
+
+/* DCB netlink attributes policy */
+static struct nla_policy dcbnl_genl_policy[DCB_ATTR_MAX + 1] = {
+	[DCB_ATTR_IFNAME]    = {.type = NLA_STRING, .len = IFNAMSIZ - 1},
+	[DCB_ATTR_STATE]     = {.type = NLA_U8},
+	[DCB_ATTR_PFC_CFG]   = {.type = NLA_NESTED},
+	[DCB_ATTR_PG_CFG]    = {.type = NLA_NESTED},
+	[DCB_ATTR_SET_ALL]   = {.type = NLA_U8},
+	[DCB_ATTR_PERM_HWADDR] = {.type = NLA_NESTED},
+};
+
+/* DCB permanent hardware address nested attributes */
+static struct nla_policy dcbnl_perm_hwaddr_nest[DCB_PERM_HW_ATTR_MAX + 1] = {
+	[DCB_PERM_HW_ATTR_0] = {.type = NLA_U8},
+	[DCB_PERM_HW_ATTR_1] = {.type = NLA_U8},
+	[DCB_PERM_HW_ATTR_2] = {.type = NLA_U8},
+	[DCB_PERM_HW_ATTR_3] = {.type = NLA_U8},
+	[DCB_PERM_HW_ATTR_4] = {.type = NLA_U8},
+	[DCB_PERM_HW_ATTR_5] = {.type = NLA_U8},
+	[DCB_PERM_HW_ATTR_ALL] = {.type = NLA_FLAG},
+};
+
+/* DCB priority flow control to User Priority nested attributes */
+static struct nla_policy dcbnl_pfc_up_nest[DCB_PFC_UP_ATTR_MAX + 1] = {
+	[DCB_PFC_UP_ATTR_0]   = {.type = NLA_U8},
+	[DCB_PFC_UP_ATTR_1]   = {.type = NLA_U8},
+	[DCB_PFC_UP_ATTR_2]   = {.type = NLA_U8},
+	[DCB_PFC_UP_ATTR_3]   = {.type = NLA_U8},
+	[DCB_PFC_UP_ATTR_4]   = {.type = NLA_U8},
+	[DCB_PFC_UP_ATTR_5]   = {.type = NLA_U8},
+	[DCB_PFC_UP_ATTR_6]   = {.type = NLA_U8},
+	[DCB_PFC_UP_ATTR_7]   = {.type = NLA_U8},
+	[DCB_PFC_UP_ATTR_ALL] = {.type = NLA_FLAG},
+};
+
+/* DCB priority grouping nested attributes */
+static struct nla_policy dcbnl_pg_nest[DCB_PG_ATTR_MAX + 1] = {
+	[DCB_PG_ATTR_TC_0]   = {.type = NLA_NESTED},
+	[DCB_PG_ATTR_TC_1]   = {.type = NLA_NESTED},
+	[DCB_PG_ATTR_TC_2]   = {.type = NLA_NESTED},
+	[DCB_PG_ATTR_TC_3]   = {.type = NLA_NESTED},
+	[DCB_PG_ATTR_TC_4]   = {.type = NLA_NESTED},
+	[DCB_PG_ATTR_TC_5]   = {.type = NLA_NESTED},
+	[DCB_PG_ATTR_TC_6]   = {.type = NLA_NESTED},
+	[DCB_PG_ATTR_TC_7]   = {.type = NLA_NESTED},
+	[DCB_PG_ATTR_TC_ALL] = {.type = NLA_NESTED},
+	[DCB_PG_ATTR_BWG_0]  = {.type = NLA_U8},
+	[DCB_PG_ATTR_BWG_1]  = {.type = NLA_U8},
+	[DCB_PG_ATTR_BWG_2]  = {.type = NLA_U8},
+	[DCB_PG_ATTR_BWG_3]  = {.type = NLA_U8},
+	[DCB_PG_ATTR_BWG_4]  = {.type = NLA_U8},
+	[DCB_PG_ATTR_BWG_5]  = {.type = NLA_U8},
+	[DCB_PG_ATTR_BWG_6]  = {.type = NLA_U8},
+	[DCB_PG_ATTR_BWG_7]  = {.type = NLA_U8},
+	[DCB_PG_ATTR_BWG_ALL]= {.type = NLA_FLAG},
+};
+
+/* DCB traffic class nested attributes. */
+static struct nla_policy dcbnl_tc_param_nest[DCB_TC_ATTR_PARAM_MAX + 1] = {
+	[DCB_TC_ATTR_PARAM_STRICT_PRIO]     = {.type = NLA_U8},
+	[DCB_TC_ATTR_PARAM_BW_GROUP_ID]     = {.type = NLA_U8},
+	[DCB_TC_ATTR_PARAM_BW_PCT]          = {.type = NLA_U8},
+	[DCB_TC_ATTR_PARAM_UP_MAPPING]      = {.type = NLA_U8},
+	[DCB_TC_ATTR_PARAM_ALL]             = {.type = NLA_FLAG},
+};
+
+/* standard netlink reply call */
+static int dcbnl_reply(u8 value, u8 cmd, u8 attr, struct genl_info *info)
+{
+	struct sk_buff *dcbnl_skb;
+	void *data;
+	int ret = -EINVAL;
+
+	dcbnl_skb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!dcbnl_skb)
+		return ret;
+
+	data = genlmsg_put_reply(dcbnl_skb, info, &dcbnl_family, 0, cmd);
+	if (!data)
+		goto err;
+
+	ret = nla_put_u8(dcbnl_skb, attr, value);
+	if (ret)
+        	goto err;
+
+	/* end the message, assign the nlmsg_len. */
+	genlmsg_end(dcbnl_skb, data);
+	ret = genlmsg_reply(dcbnl_skb, info);
+	if (ret)
+        	goto err;
+
+	return 0;
+err:
+	kfree(dcbnl_skb);
+	return ret;
+}
+
+static int dcbnl_getstate(struct sk_buff *skb, struct genl_info *info)
+{
+	struct net_device *netdev;
+	int ret = -EINVAL;
+
+	if (!info->attrs[DCB_ATTR_IFNAME])
+		return ret;
+
+	netdev = dev_get_by_name(&init_net,
+				 nla_data(info->attrs[DCB_ATTR_IFNAME]));
+	if (!netdev)
+		return ret;
+
+	if (!netdev->dcbnl_ops || !netdev->dcbnl_ops->getstate)
+		goto err;
+
+	ret = dcbnl_reply(netdev->dcbnl_ops->getstate(netdev),
+                          DCB_CMD_GSTATE, DCB_ATTR_STATE, info);
+err:
+	dev_put(netdev);
+	return ret;
+}
+
+static int dcbnl_setstate(struct sk_buff *skb, struct genl_info *info)
+{
+	struct net_device *netdev;
+	int ret = -EINVAL;
+	u8 value;
+
+	if (!info->attrs[DCB_ATTR_IFNAME] || !info->attrs[DCB_ATTR_STATE])
+		return ret;
+
+	netdev = dev_get_by_name(&init_net,
+				 nla_data(info->attrs[DCB_ATTR_IFNAME]));
+	if (!netdev)
+		return ret;
+
+	if (!netdev->dcbnl_ops || !netdev->dcbnl_ops->setstate)
+		goto err;
+
+	value = nla_get_u8(info->attrs[DCB_ATTR_STATE]);
+
+	netdev->dcbnl_ops->setstate(netdev, value);
+
+	ret = dcbnl_reply(0, DCB_CMD_SSTATE, DCB_ATTR_STATE, info);
+err:
+	dev_put(netdev);
+	return ret;
+}
+
+static int dcbnl_getperm_hwaddr(struct sk_buff *skb, struct genl_info *info)
+{
+	void *data;
+	struct sk_buff *dcbnl_skb;
+	struct nlattr *tb[DCB_PERM_HW_ATTR_MAX + 1], *nest;
+	struct net_device *netdev;
+	u8 perm_addr[MAX_ADDR_LEN];
+	int ret = -EINVAL;
+	int i;
+
+	if (!info->attrs[DCB_ATTR_IFNAME] || !info->attrs[DCB_ATTR_PERM_HWADDR])
+		return ret;
+
+	netdev = dev_get_by_name(&init_net,
+				 nla_data(info->attrs[DCB_ATTR_IFNAME]));
+	if (!netdev)
+		return ret;
+
+	if (!netdev->dcbnl_ops || !netdev->dcbnl_ops->getpermhwaddr)
+		goto err_out;
+
+	ret = nla_parse_nested(tb, DCB_PERM_HW_ATTR_MAX,
+			       info->attrs[DCB_ATTR_PERM_HWADDR],
+			       dcbnl_perm_hwaddr_nest);
+	if (ret)
+		goto err_out;
+
+	dcbnl_skb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!dcbnl_skb)
+		goto err_out;
+
+	data = genlmsg_put_reply(dcbnl_skb, info, &dcbnl_family, 0,
+				 DCB_CMD_GPERM_HWADDR);
+	if (!data)
+		goto err;
+
+	nest = nla_nest_start(dcbnl_skb, DCB_ATTR_PERM_HWADDR);
+	if (!nest)
+		goto err;
+
+	netdev->dcbnl_ops->getpermhwaddr(netdev, perm_addr);
+	for (i = 0; i < netdev->addr_len; i++) {
+		ret = nla_put_u8(dcbnl_skb, DCB_ATTR_PERM_HWADDR,
+				 perm_addr[i]);
+
+		if (ret) {
+			nla_nest_cancel(dcbnl_skb, nest);
+			goto err;
+		}
+	}
+
+	nla_nest_end(dcbnl_skb, nest);
+
+	genlmsg_end(dcbnl_skb, data);
+
+	ret = genlmsg_reply(dcbnl_skb, info);
+	if (ret)
+		goto err_out;
+
+	dev_put(netdev);
+	return 0;
+err:
+	kfree(dcbnl_skb);
+err_out:
+	dev_put(netdev);
+	return ret;
+}
+
+static int __dcbnl_pg_setcfg(struct genl_info *info, int dir)
+{
+	struct net_device *netdev = NULL;
+	struct nlattr *pg_tb[DCB_PG_ATTR_MAX + 1];
+	struct nlattr *param_tb[DCB_TC_ATTR_PARAM_MAX + 1];
+	int ret = -EINVAL;
+	int i;
+	u8 prio = 0, bwg_id = 0, bw_pct = 0, up_map = 0;
+
+	if (!info->attrs[DCB_ATTR_IFNAME] || !info->attrs[DCB_ATTR_PG_CFG])
+		return ret;
+
+	netdev = dev_get_by_name(&init_net,
+				 nla_data(info->attrs[DCB_ATTR_IFNAME]));
+	if (!netdev)
+		return ret;
+
+	if (!netdev->dcbnl_ops ||
+	    !netdev->dcbnl_ops->setpgtccfgtx ||
+	    !netdev->dcbnl_ops->setpgtccfgrx ||
+	    !netdev->dcbnl_ops->setpgbwgcfgtx ||
+	    !netdev->dcbnl_ops->setpgbwgcfgrx)
+		goto err;
+
+	ret = nla_parse_nested(pg_tb, DCB_PG_ATTR_MAX,
+			       info->attrs[DCB_ATTR_PG_CFG], dcbnl_pg_nest);
+	if (ret)
+		goto err;
+
+	for (i = DCB_PG_ATTR_TC_0; i < DCB_PG_ATTR_TC_MAX; i++) {
+		if (!pg_tb[i])
+			continue;
+
+		ret = nla_parse_nested(param_tb, DCB_TC_ATTR_PARAM_MAX,
+				       pg_tb[i], dcbnl_tc_param_nest);
+		if (ret)
+			goto err;
+
+		if (param_tb[DCB_TC_ATTR_PARAM_STRICT_PRIO])
+			prio =
+			    nla_get_u8(param_tb[DCB_TC_ATTR_PARAM_STRICT_PRIO]);
+
+		if (param_tb[DCB_TC_ATTR_PARAM_BW_GROUP_ID])
+			bwg_id =
+			    nla_get_u8(param_tb[DCB_TC_ATTR_PARAM_BW_GROUP_ID]);
+
+		if (param_tb[DCB_TC_ATTR_PARAM_BW_PCT])
+			bw_pct = nla_get_u8(param_tb[DCB_TC_ATTR_PARAM_BW_PCT]);
+
+		if (param_tb[DCB_TC_ATTR_PARAM_UP_MAPPING])
+			up_map =
+			     nla_get_u8(param_tb[DCB_TC_ATTR_PARAM_UP_MAPPING]);
+
+		/* dir: Tx = 0, Rx = 1 */
+		if (dir) {
+			/* Rx */
+			netdev->dcbnl_ops->setpgtccfgrx(netdev,
+						  i - DCB_PG_ATTR_TC_0,
+						  prio, bwg_id, bw_pct, up_map);
+		} else {
+			/* Tx */
+			netdev->dcbnl_ops->setpgtccfgtx(netdev, 
+						  i - DCB_PG_ATTR_TC_0,
+						  prio, bwg_id, bw_pct, up_map);
+		}
+	}
+
+	for (i = DCB_PG_ATTR_BWG_0; i < DCB_PG_ATTR_BWG_MAX; i++) {
+		if (!pg_tb[i])
+			continue;
+
+		bw_pct = nla_get_u8(pg_tb[i]);
+
+		/* dir: Tx = 0, Rx = 1 */
+		if (dir) {
+			/* Rx */
+			netdev->dcbnl_ops->setpgbwgcfgrx(netdev,
+						 i - DCB_PG_ATTR_BWG_0, bw_pct);
+		} else {
+			/* Tx */
+			netdev->dcbnl_ops->setpgbwgcfgtx(netdev,
+						 i - DCB_PG_ATTR_BWG_0, bw_pct);
+		}
+	}
+
+	ret = dcbnl_reply(0, (dir ? DCB_CMD_PGRX_SCFG : DCB_CMD_PGTX_SCFG),
+			  DCB_ATTR_PG_CFG, info);
+
+err:
+	dev_put(netdev);
+	return ret;
+}
+
+static int dcbnl_pgtx_setcfg(struct sk_buff *skb, struct genl_info *info)
+{
+	return __dcbnl_pg_setcfg(info, 0);
+}
+
+static int dcbnl_pgrx_setcfg(struct sk_buff *skb, struct genl_info *info)
+{
+	return __dcbnl_pg_setcfg(info, 1);
+}
+
+static int __dcbnl_pg_getcfg(struct genl_info *info, int dir)
+{
+	void *data;
+	struct sk_buff *dcbnl_skb;
+	struct nlattr *pg_nest, *param_nest, *tb;
+	struct nlattr *pg_tb[DCB_PG_ATTR_MAX + 1];
+	struct nlattr *param_tb[DCB_TC_ATTR_PARAM_MAX + 1];
+	struct net_device *netdev;
+	u8 prio, bwg_id, bw_pct, up_map;
+	int ret  = -EINVAL;
+	int i;
+
+	if (!info->attrs[DCB_ATTR_IFNAME] || !info->attrs[DCB_ATTR_PG_CFG])
+		return ret;
+
+	netdev = dev_get_by_name(&init_net,
+				 nla_data(info->attrs[DCB_ATTR_IFNAME]));
+	if (!netdev)
+		return ret;
+
+	if (!netdev->dcbnl_ops ||
+	    !netdev->dcbnl_ops->getpgtccfgtx ||
+	    !netdev->dcbnl_ops->getpgtccfgrx ||
+	    !netdev->dcbnl_ops->getpgbwgcfgtx ||
+	    !netdev->dcbnl_ops->getpgbwgcfgrx)
+		goto err_out;
+
+	ret = nla_parse_nested(pg_tb, DCB_PG_ATTR_MAX,
+			       info->attrs[DCB_ATTR_PG_CFG], dcbnl_pg_nest);
+	if (ret)
+		goto err_out;
+
+	dcbnl_skb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!dcbnl_skb)
+		goto err_out;
+
+	data =  genlmsg_put_reply(dcbnl_skb, info, &dcbnl_family, 0,
+				 (dir) ? DCB_CMD_PGRX_GCFG : DCB_CMD_PGTX_GCFG);
+
+	if (!data)
+		goto err;
+
+	pg_nest = nla_nest_start(dcbnl_skb, DCB_ATTR_PG_CFG);
+	if (!pg_nest)
+		goto err;
+
+	for (i = DCB_PG_ATTR_TC_0; i < DCB_PG_ATTR_TC_MAX; i++) {
+		if (pg_tb[DCB_PG_ATTR_TC_ALL])
+			tb = pg_tb[DCB_PG_ATTR_TC_ALL];
+		else
+			tb = pg_tb[i];
+		ret = nla_parse_nested(param_tb, DCB_TC_ATTR_PARAM_MAX,
+				       tb, dcbnl_tc_param_nest);
+		if (ret)
+			goto err_pg;
+
+		param_nest = nla_nest_start(dcbnl_skb, i);
+		if (!param_nest)
+			goto err_pg;
+
+		if (dir) {
+			/* Rx */
+			netdev->dcbnl_ops->getpgtccfgrx(netdev,
+						i - DCB_PG_ATTR_TC_0, &prio,
+						&bwg_id, &bw_pct, &up_map);
+		} else {
+			/* Tx */
+			netdev->dcbnl_ops->getpgtccfgtx(netdev,
+						i - DCB_PG_ATTR_TC_0, &prio,
+						&bwg_id, &bw_pct, &up_map);
+		}
+
+		if (param_tb[DCB_TC_ATTR_PARAM_STRICT_PRIO] ||
+		    param_tb[DCB_TC_ATTR_PARAM_ALL]) {
+			ret = nla_put_u8(dcbnl_skb,
+					 DCB_TC_ATTR_PARAM_STRICT_PRIO, prio);
+			if (ret)
+				goto err_param;
+		}
+		if (param_tb[DCB_TC_ATTR_PARAM_BW_GROUP_ID] ||
+		    param_tb[DCB_TC_ATTR_PARAM_ALL]) {
+			ret = nla_put_u8(dcbnl_skb,
+					 DCB_TC_ATTR_PARAM_BW_GROUP_ID, bwg_id);
+			if (ret)
+				goto err_param;
+		}
+		if (param_tb[DCB_TC_ATTR_PARAM_BW_PCT] ||
+		    param_tb[DCB_TC_ATTR_PARAM_ALL]) {
+			ret = nla_put_u8(dcbnl_skb, DCB_TC_ATTR_PARAM_BW_PCT,
+					 bw_pct);
+			if (ret)
+				goto err_param;
+		}
+		if (param_tb[DCB_TC_ATTR_PARAM_UP_MAPPING] ||
+		    param_tb[DCB_TC_ATTR_PARAM_ALL]) {
+			ret = nla_put_u8(dcbnl_skb,
+					 DCB_TC_ATTR_PARAM_UP_MAPPING, up_map);
+			if (ret)
+				goto err_param;
+		}
+		nla_nest_end(dcbnl_skb, param_nest);
+	}
+
+	for (i = DCB_PG_ATTR_BWG_0; i < DCB_PG_ATTR_BWG_MAX; i++) {
+		if (dir) {
+			/* Rx */
+			netdev->dcbnl_ops->getpgbwgcfgrx(netdev,
+						i - DCB_PG_ATTR_BWG_0, &bw_pct);
+		} else {
+			/* Tx */
+			netdev->dcbnl_ops->getpgbwgcfgtx(netdev,
+						i - DCB_PG_ATTR_BWG_0, &bw_pct);
+		}
+		ret = nla_put_u8(dcbnl_skb, i, bw_pct);
+
+		if (ret)
+			goto err_pg;
+	}
+
+	nla_nest_end(dcbnl_skb, pg_nest);
+
+	genlmsg_end(dcbnl_skb, data);
+	ret = genlmsg_reply(dcbnl_skb, info);
+	if (ret)
+		goto err;
+
+	dev_put(netdev);
+	return 0;
+
+err_param:
+	nla_nest_cancel(dcbnl_skb, param_nest);
+err_pg:
+	nla_nest_cancel(dcbnl_skb, pg_nest);
+err:
+	kfree(dcbnl_skb);
+err_out:
+	dev_put(netdev);
+	return ret;
+}
+
+static int dcbnl_pgtx_getcfg(struct sk_buff *skb, struct genl_info *info)
+{
+	return __dcbnl_pg_getcfg(info, 0);
+}
+
+static int dcbnl_pgrx_getcfg(struct sk_buff *skb, struct genl_info *info)
+{
+	return __dcbnl_pg_getcfg(info, 1);
+}
+
+static int dcbnl_setpfccfg(struct sk_buff *skb, struct genl_info *info)
+{
+	struct nlattr *tb[DCB_PFC_UP_ATTR_MAX + 1];
+	struct net_device *netdev;
+	int i;
+	int ret = -EINVAL;
+	u8 value;
+
+	if (!info->attrs[DCB_ATTR_IFNAME] || !info->attrs[DCB_ATTR_PFC_CFG])
+		return ret;
+
+	netdev = dev_get_by_name(&init_net,
+				 nla_data(info->attrs[DCB_ATTR_IFNAME]));
+	if (!netdev)
+		return ret;
+
+	if (!netdev->dcbnl_ops || !netdev->dcbnl_ops->setpfccfg)
+		goto err;
+
+	ret = nla_parse_nested(tb, DCB_PFC_UP_ATTR_MAX,
+		               info->attrs[DCB_ATTR_PFC_CFG],
+		               dcbnl_pfc_up_nest);
+	if (ret)
+		goto err;
+
+	for (i = DCB_PFC_UP_ATTR_0; i < DCB_PFC_UP_ATTR_MAX; i++) {
+		value = nla_get_u8(tb[i]);
+		netdev->dcbnl_ops->setpfccfg(netdev, i - DCB_PFC_UP_ATTR_0,
+					     value);
+	}
+
+	ret = dcbnl_reply(0, DCB_CMD_PFC_SCFG, DCB_ATTR_PFC_CFG, info);
+err:
+	dev_put(netdev);
+	return ret;
+}
+
+static int dcbnl_getpfccfg(struct sk_buff *skb, struct genl_info *info)
+{
+	void *data;
+	struct sk_buff *dcbnl_skb;
+	struct nlattr *tb[DCB_PFC_UP_ATTR_MAX + 1], *nest;
+	struct net_device *netdev;
+	u8 value;
+	int ret = -EINVAL;
+	int i;
+
+	if (!info->attrs[DCB_ATTR_IFNAME] || !info->attrs[DCB_ATTR_PFC_CFG])
+		return ret;
+
+	netdev = dev_get_by_name(&init_net,
+				 nla_data(info->attrs[DCB_ATTR_IFNAME]));
+	if (!netdev)
+		return ret;
+
+	if (!netdev->dcbnl_ops || !netdev->dcbnl_ops->getpfccfg)
+		goto err_out;
+
+	ret = nla_parse_nested(tb, DCB_PFC_UP_ATTR_MAX,
+			       info->attrs[DCB_ATTR_PFC_CFG],
+			       dcbnl_pfc_up_nest);
+	if (ret)
+		goto err_out;
+
+	dcbnl_skb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!dcbnl_skb)
+		goto err_out;
+
+	data = genlmsg_put_reply(dcbnl_skb, info, &dcbnl_family, 0,
+				 DCB_CMD_PFC_GCFG);
+	if (!data)
+		goto err;
+
+	nest = nla_nest_start(dcbnl_skb, DCB_ATTR_PFC_CFG);
+	if (!nest)
+		goto err;
+
+	for (i = DCB_PFC_UP_ATTR_0; i < DCB_PFC_UP_ATTR_MAX; i++) {
+		netdev->dcbnl_ops->getpfccfg(netdev, i - DCB_PFC_UP_ATTR_0,
+					     &value);
+		ret = nla_put_u8(dcbnl_skb, i, value);
+
+		if (ret) {
+			nla_nest_cancel(dcbnl_skb, nest);
+			goto err;
+		}
+	}
+	nla_nest_end(dcbnl_skb, nest);
+
+	genlmsg_end(dcbnl_skb, data);
+
+	ret = genlmsg_reply(dcbnl_skb, info);
+	if (ret)
+		goto err;
+
+	dev_put(netdev);
+	return 0;
+
+err:
+	kfree(dcbnl_skb);
+err_out:
+	dev_put(netdev);
+	return ret;
+}
+
+static int dcbnl_setall(struct sk_buff *skb, struct genl_info *info)
+{
+	struct net_device *netdev;
+	int ret = -EINVAL;
+
+	if (!info->attrs[DCB_ATTR_IFNAME] || !info->attrs[DCB_ATTR_SET_ALL])
+		return ret;
+
+	netdev = dev_get_by_name(&init_net,
+				 nla_data(info->attrs[DCB_ATTR_IFNAME]));
+	if (!netdev)
+		return ret;
+
+	if (!netdev->dcbnl_ops || !netdev->dcbnl_ops->setall)
+		return ret;
+
+	ret = dcbnl_reply(netdev->dcbnl_ops->setall(netdev), DCB_CMD_SET_ALL,
+			  DCB_ATTR_SET_ALL, info);
+
+	dev_put(netdev);
+	return ret;
+}
+
+static struct genl_ops dcbnl_ops[] = {
+	{
+		.cmd = DCB_CMD_GSTATE,
+		.flags = GENL_ADMIN_PERM,
+		.policy = dcbnl_genl_policy,
+		.doit = dcbnl_getstate,
+		.dumpit =  NULL,
+	},
+	{
+		.cmd = DCB_CMD_SSTATE,
+		.flags = GENL_ADMIN_PERM,
+		.policy = dcbnl_genl_policy,
+		.doit = dcbnl_setstate,
+		.dumpit =  NULL,
+	},
+	{
+		.cmd = DCB_CMD_PGTX_SCFG,
+		.flags = GENL_ADMIN_PERM,
+		.policy = dcbnl_genl_policy,
+		.doit = dcbnl_pgtx_setcfg,
+		.dumpit =  NULL,
+	},
+	{
+		.cmd = DCB_CMD_PGRX_SCFG,
+		.flags = GENL_ADMIN_PERM,
+		.policy = dcbnl_genl_policy,
+		.doit = dcbnl_pgrx_setcfg,
+		.dumpit =  NULL,
+	},
+	{
+		.cmd = DCB_CMD_PFC_SCFG,
+		.flags = GENL_ADMIN_PERM,
+		.policy = dcbnl_genl_policy,
+		.doit = dcbnl_setpfccfg,
+		.dumpit =  NULL,
+	},
+	{
+		.cmd = DCB_CMD_PGTX_GCFG,
+		.flags = GENL_ADMIN_PERM,
+		.policy = dcbnl_genl_policy,
+		.doit = dcbnl_pgtx_getcfg,
+		.dumpit =  NULL,
+	},
+	{
+		.cmd = DCB_CMD_PGRX_GCFG,
+		.flags = GENL_ADMIN_PERM,
+		.policy = dcbnl_genl_policy,
+		.doit = dcbnl_pgrx_getcfg,
+		.dumpit =  NULL,
+	},
+	{
+		.cmd = DCB_CMD_PFC_GCFG,
+		.flags = GENL_ADMIN_PERM,
+		.policy = dcbnl_genl_policy,
+		.doit = dcbnl_getpfccfg,
+		.dumpit =  NULL,
+	},
+	{
+		.cmd = DCB_CMD_SET_ALL,
+		.flags = GENL_ADMIN_PERM,
+		.policy = dcbnl_genl_policy,
+		.doit = dcbnl_setall,
+		.dumpit =  NULL,
+	},
+	{
+		.cmd = DCB_CMD_GPERM_HWADDR,
+		.flags = GENL_ADMIN_PERM,
+		.policy = dcbnl_genl_policy,
+		.doit = dcbnl_getperm_hwaddr,
+		.dumpit =  NULL,
+	},
+};
+
+/* init and exit */
+static int __init dcbnl_init(void)
+{
+	int err, i;
+
+	err = genl_register_family(&dcbnl_family);
+	if (err)
+		return err;
+
+	for (i = 0; i < ARRAY_SIZE(dcbnl_ops); i++) {
+		err = genl_register_ops(&dcbnl_family, &dcbnl_ops[i]);
+		if (err)
+			goto err_out;
+	}
+
+	return 0;
+
+err_out:
+	genl_unregister_family(&dcbnl_family);
+	return err;
+}
+module_init(dcbnl_init);
+
+static void __exit dcbnl_exit(void)
+{
+	genl_unregister_family(&dcbnl_family);
+}
+module_exit(dcbnl_exit);


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] [NET-NEXT]: Add DCB netlink interface definition
  2008-05-27 14:13 ` [PATCH 1/3] [NET-NEXT]: Add DCB netlink interface definition PJ Waskiewicz
@ 2008-05-28  9:41   ` Thomas Graf
  2008-05-28 16:03     ` Waskiewicz Jr, Peter P
  2008-06-05 13:17   ` Patrick McHardy
  1 sibling, 1 reply; 24+ messages in thread
From: Thomas Graf @ 2008-05-28  9:41 UTC (permalink / raw)
  To: PJ Waskiewicz; +Cc: jeff, davem, netdev

* PJ Waskiewicz <peter.p.waskiewicz.jr@intel.com> 2008-05-27 07:13
> This patch adds the netlink interface definition for Data Center Bridging.
> This technology uses 802.1Qaz and 801.1Qbb for extending ethernet to
> converge different traffic types on a single link.  E.g. Fibre Channel
> over Ethernet and regular LAN traffic.  The goal is to use priority flow
> control to pause individual flows at the MAC/network level, without
> impacting other network flows.

Is there a specific reason why you used a separate generic netlink
interface instead of embedding this into the regular link message
via either IFLA_DCB or by using the info API?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCH 1/3] [NET-NEXT]: Add DCB netlink interface definition
  2008-05-28  9:41   ` Thomas Graf
@ 2008-05-28 16:03     ` Waskiewicz Jr, Peter P
  2008-05-28 22:37       ` Thomas Graf
  0 siblings, 1 reply; 24+ messages in thread
From: Waskiewicz Jr, Peter P @ 2008-05-28 16:03 UTC (permalink / raw)
  To: Thomas Graf; +Cc: jeff, davem, netdev

> Is there a specific reason why you used a separate generic 
> netlink interface instead of embedding this into the regular 
> link message via either IFLA_DCB or by using the info API?

There are four reasons we decided to use a separate interface:

1. The netlink messages are generated via userspace when the connection
is setup, plus they're generated from LLDP frames coming in off the
wire.  Those LLDP frames implement the DCBX protocol (Data Center
Bridging Exchange), which is the negotiation protocol between a DCB
device and its link partner.  In most cases, it's a DCB-compliant
switch, like a Cisco Nexus 5000.  So the messages can come out of band
depending on how the network gets configured, and if any events occur
causing the bandwidth credits or priority mappings to change (think
automated backups at night, wanting more bandwidth than during the day).

2. The DCBX protocol is being extended to contain more information, and
second generation DCB devices have more configuration data for the
network.  So we wanted an interface that could be extended on its own to
support the new DCB protocols as they're ratified and implemented in new
equipment, without impacting existing infrastructure.

3. We wanted to use generic netlink, since that seems to be a more
preferred method of netlink communication vs. rtnetlink.  And I don't
know anything about the info API, so I can't comment on why we didn't
look at that for implementation.  Can you suggest something for me to
look at for the info API so I can see what that's all about?

4. We also developed the userspace utilities for the Linux OSVs, which
should be having a pre-release "release" on Sourceforge in the next week
or so, to support the DCBX protocol.  They're implemented using the
generic netlink interface, so obviously if we can keep it that way, it'd
be preferred.  :-)

Thanks Thomas.  Other than that, is there anything in the netlink
interface that you would suggest to change?

Cheers,
-PJ Waskiewicz

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] [NET-NEXT]: Add DCB netlink interface definition
  2008-05-28 16:03     ` Waskiewicz Jr, Peter P
@ 2008-05-28 22:37       ` Thomas Graf
  2008-06-01 12:16         ` Waskiewicz Jr, Peter P
  0 siblings, 1 reply; 24+ messages in thread
From: Thomas Graf @ 2008-05-28 22:37 UTC (permalink / raw)
  To: Waskiewicz Jr, Peter P; +Cc: jeff, davem, netdev

* Waskiewicz Jr, Peter P <peter.p.waskiewicz.jr@intel.com> 2008-05-28 09:03
> 1. The netlink messages are generated via userspace when the connection
> is setup, plus they're generated from LLDP frames coming in off the
> wire.  Those LLDP frames implement the DCBX protocol (Data Center
> Bridging Exchange), which is the negotiation protocol between a DCB
> device and its link partner.  In most cases, it's a DCB-compliant
> switch, like a Cisco Nexus 5000.  So the messages can come out of band
> depending on how the network gets configured, and if any events occur
> causing the bandwidth credits or priority mappings to change (think
> automated backups at night, wanting more bandwidth than during the day).

There isn't much difference really, instead of using the separate
interface you could simply add a new link attribute IFLA_DCB and issue
a RTM_SETLINK/RTM_GETLINK and send the same information in the same
format. However, I agree with you that a separate interface is better
in this case as dcb requests are not directly connected to other link
changes at all and the dcb message structure is pretty complex.

> 3. We wanted to use generic netlink, since that seems to be a more
> preferred method of netlink communication vs. rtnetlink.  And I don't
> know anything about the info API, so I can't comment on why we didn't
> look at that for implementation.  Can you suggest something for me to
> look at for the info API so I can see what that's all about?

A prominent user is the VLAN code in net/8021q/vlan_netlink.c

> Thanks Thomas.  Other than that, is there anything in the netlink
> interface that you would suggest to change?

Looks good from here, I didn't read it all line by line though.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCH 1/3] [NET-NEXT]: Add DCB netlink interface definition
  2008-05-28 22:37       ` Thomas Graf
@ 2008-06-01 12:16         ` Waskiewicz Jr, Peter P
  0 siblings, 0 replies; 24+ messages in thread
From: Waskiewicz Jr, Peter P @ 2008-06-01 12:16 UTC (permalink / raw)
  To: Thomas Graf; +Cc: jeff, davem, netdev

> There isn't much difference really, instead of using the 
> separate interface you could simply add a new link attribute 
> IFLA_DCB and issue a RTM_SETLINK/RTM_GETLINK and send the 
> same information in the same format. However, I agree with 
> you that a separate interface is better in this case as dcb 
> requests are not directly connected to other link changes at 
> all and the dcb message structure is pretty complex.
> 
> > 3. We wanted to use generic netlink, since that seems to be a more 
> > preferred method of netlink communication vs. rtnetlink.  
> And I don't 
> > know anything about the info API, so I can't comment on why 
> we didn't 
> > look at that for implementation.  Can you suggest something 
> for me to 
> > look at for the info API so I can see what that's all about?
> 
> A prominent user is the VLAN code in net/8021q/vlan_netlink.c
> 
> > Thanks Thomas.  Other than that, is there anything in the netlink 
> > interface that you would suggest to change?
> 
> Looks good from here, I didn't read it all line by line though.

Thanks Thomas for the review and comments.

Dave and Jeff, have you two taken a peek at this by chance?

Thanks,
-PJ Waskiewicz

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] [NET-NEXT]: Add DCB netlink interface definition
  2008-05-27 14:13 ` [PATCH 1/3] [NET-NEXT]: Add DCB netlink interface definition PJ Waskiewicz
  2008-05-28  9:41   ` Thomas Graf
@ 2008-06-05 13:17   ` Patrick McHardy
  2008-06-09 22:11     ` Waskiewicz Jr, Peter P
  1 sibling, 1 reply; 24+ messages in thread
From: Patrick McHardy @ 2008-06-05 13:17 UTC (permalink / raw)
  To: PJ Waskiewicz; +Cc: jeff, davem, netdev, Thomas Graf

PJ Waskiewicz wrote:
> +/**
> + * enum dcbnl_perm_hwaddr_attrs - DCB Permanent HW Address nested attributes
> + *
> + * @DCB_PERM_HW_ATTR_UNDEFINED: unspecified attribute to catch errors
> + * @DCB_PERM_HW_ATTR_0: MAC address from receive address 0 (NLA_U8)
> + * @DCB_PERM_HW_ATTR_1: MAC address from receive address 1 (NLA_U8)
> + * @DCB_PERM_HW_ATTR_2: MAC address from receive address 2 (NLA_U8)
> + * @DCB_PERM_HW_ATTR_3: MAC address from receive address 3 (NLA_U8)
> + * @DCB_PERM_HW_ATTR_4: MAC address from receive address 4 (NLA_U8)
> + * @DCB_PERM_HW_ATTR_5: MAC address from receive address 5 (NLA_U8)
> + * @DCB_PERM_HW_ATTR_ALL: apply to all MAC addresses (NLA_FLAG)
> + *
> + * These attributes are used when bonding DCB interfaces together.
> + *
> + */

For these and the other numbered attributes: is the maximum number
fixed and/or defined somewhere? If not, I'd suggest to use lists
of attributes.

> +/**
> + * enum dcbnl_pg_attrs - DCB Priority Group attributes
> + *
> + * @DCB_PG_ATTR_UNDEFINED: unspecified attribute to catch errors
> + * @DCB_PG_ATTR_TC_0: Priority Group Traffic Class 0 configuration (NLA_NESTED)
> + * @DCB_PG_ATTR_TC_1: Priority Group Traffic Class 1 configuration (NLA_NESTED)
> + * @DCB_PG_ATTR_TC_2: Priority Group Traffic Class 2 configuration (NLA_NESTED)
> + * @DCB_PG_ATTR_TC_3: Priority Group Traffic Class 3 configuration (NLA_NESTED)
> + * @DCB_PG_ATTR_TC_4: Priority Group Traffic Class 4 configuration (NLA_NESTED)
> + * @DCB_PG_ATTR_TC_5: Priority Group Traffic Class 5 configuration (NLA_NESTED)
> + * @DCB_PG_ATTR_TC_6: Priority Group Traffic Class 6 configuration (NLA_NESTED)
> + * @DCB_PG_ATTR_TC_7: Priority Group Traffic Class 7 configuration (NLA_NESTED)
> + * @DCB_PG_ATTR_TC_MAX: highest attribute number currently defined
> + * @DCB_PG_ATTR_TC_ALL: apply to all traffic classes (NLA_NESTED)
> + * @DCB_PG_ATTR_BWG_0: Bandwidth group 0 configuration (NLA_U8)
> + * @DCB_PG_ATTR_BWG_1: Bandwidth group 1 configuration (NLA_U8)
> + * @DCB_PG_ATTR_BWG_2: Bandwidth group 2 configuration (NLA_U8)
> + * @DCB_PG_ATTR_BWG_3: Bandwidth group 3 configuration (NLA_U8)
> + * @DCB_PG_ATTR_BWG_4: Bandwidth group 4 configuration (NLA_U8)
> + * @DCB_PG_ATTR_BWG_5: Bandwidth group 5 configuration (NLA_U8)
> + * @DCB_PG_ATTR_BWG_6: Bandwidth group 6 configuration (NLA_U8)
> + * @DCB_PG_ATTR_BWG_7: Bandwidth group 7 configuration (NLA_U8)
> + * @DCB_PG_ATTR_BWG_MAX: highest attribute number currently defined
> + * @DCB_PG_ATTR_BWG_ALL: apply to all bandwidth groups (NLA_FLAG)

And in this case lists of nested attributes consisting of
Priority and Bandwidth, since they seem to belong together.

> +struct dcbnl_genl_ops {
> +	u8   (*getstate)(struct net_device *);
> +	void (*setstate)(struct net_device *, u8);
> +	void (*getpermhwaddr)(struct net_device *, u8 *);

"getpermhwaddr" doesn't seem to belong in this interface but
in rtnetlink and/or ethtool instead.

> +static int dcbnl_getperm_hwaddr(struct sk_buff *skb, struct genl_info *info)
> +{
> ...
> +	dcbnl_skb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
> +	if (!dcbnl_skb)
> +		goto err_out;
> ...
> +err:
> +	kfree(dcbnl_skb);

^^^ kfree_skb

The same error is present multiple times

> +static int __dcbnl_pg_setcfg(struct genl_info *info, int dir)
> +{
> +	struct net_device *netdev = NULL;
> +	struct nlattr *pg_tb[DCB_PG_ATTR_MAX + 1];
> +	struct nlattr *param_tb[DCB_TC_ATTR_PARAM_MAX + 1];
> +	int ret = -EINVAL;
> +	int i;
> +	u8 prio = 0, bwg_id = 0, bw_pct = 0, up_map = 0;
> +
> +	if (!info->attrs[DCB_ATTR_IFNAME] || !info->attrs[DCB_ATTR_PG_CFG])
> +		return ret;
> +
> +	netdev = dev_get_by_name(&init_net,
> +				 nla_data(info->attrs[DCB_ATTR_IFNAME]));


The fact that you do this in every handler makes me wonder whether
rtnetlink wouldn't be the better choice, if only because it uses
the rtnl_mutex and configuration changes are thus serialized with
other networking configuration changes.

For example I don't see anything preventing concurrent changes
to the DCB configuration while it is copied between the temporary
configuration and the real one. In one cases its done in a
path holding the rtnl_mutex, in another case its done with
holding the genl_mutex in a genetlink callback.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCH 1/3] [NET-NEXT]: Add DCB netlink interface definition
  2008-06-05 13:17   ` Patrick McHardy
@ 2008-06-09 22:11     ` Waskiewicz Jr, Peter P
  2008-06-10  7:14       ` Patrick McHardy
  0 siblings, 1 reply; 24+ messages in thread
From: Waskiewicz Jr, Peter P @ 2008-06-09 22:11 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: jeff, davem, netdev, Thomas Graf

> For these and the other numbered attributes: is the maximum 
> number fixed and/or defined somewhere? If not, I'd suggest to 
> use lists of attributes.

I think we want to define a maximum number.  I can fix that.

> 
> And in this case lists of nested attributes consisting of 
> Priority and Bandwidth, since they seem to belong together.

I'll see what I can come up with when I move this implementation to
rtnetlink.

> > +struct dcbnl_genl_ops {
> > +	u8   (*getstate)(struct net_device *);
> > +	void (*setstate)(struct net_device *, u8);
> > +	void (*getpermhwaddr)(struct net_device *, u8 *);
> 
> "getpermhwaddr" doesn't seem to belong in this interface but 
> in rtnetlink and/or ethtool instead.

This was a feature in ethtool, but it was removed at some point.  I can
add it to rtnetlink though if people think it'll be useful to have.

> > +static int dcbnl_getperm_hwaddr(struct sk_buff *skb, 
> struct genl_info 
> > +*info) {
> > ...
> > +	dcbnl_skb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
> > +	if (!dcbnl_skb)
> > +		goto err_out;
> > ...
> > +err:
> > +	kfree(dcbnl_skb);
> 
> ^^^ kfree_skb
> 
> The same error is present multiple times

Good catch.  I'll fix that.

> > +static int __dcbnl_pg_setcfg(struct genl_info *info, int dir) {
> > +	struct net_device *netdev = NULL;
> > +	struct nlattr *pg_tb[DCB_PG_ATTR_MAX + 1];
> > +	struct nlattr *param_tb[DCB_TC_ATTR_PARAM_MAX + 1];
> > +	int ret = -EINVAL;
> > +	int i;
> > +	u8 prio = 0, bwg_id = 0, bw_pct = 0, up_map = 0;
> > +
> > +	if (!info->attrs[DCB_ATTR_IFNAME] || 
> !info->attrs[DCB_ATTR_PG_CFG])
> > +		return ret;
> > +
> > +	netdev = dev_get_by_name(&init_net,
> > +				 
> nla_data(info->attrs[DCB_ATTR_IFNAME]));
> 
> 
> The fact that you do this in every handler makes me wonder 
> whether rtnetlink wouldn't be the better choice, if only 
> because it uses the rtnl_mutex and configuration changes are 
> thus serialized with other networking configuration changes.
> 
> For example I don't see anything preventing concurrent 
> changes to the DCB configuration while it is copied between 
> the temporary configuration and the real one. In one cases 
> its done in a path holding the rtnl_mutex, in another case 
> its done with holding the genl_mutex in a genetlink callback.

When we wrote this, we didn't know enough about rtnetlink to know what
to choose.  The general trend seemed to be moving towards genetlink for
subsystem changes in the kernel, so we chose that.  But this
serialization issue is a nice catch, and I think we'll move this
implementation to use rtnetlink.

Cheers,
-PJ Waskiewicz

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] [NET-NEXT]: Add DCB netlink interface definition
  2008-06-09 22:11     ` Waskiewicz Jr, Peter P
@ 2008-06-10  7:14       ` Patrick McHardy
  0 siblings, 0 replies; 24+ messages in thread
From: Patrick McHardy @ 2008-06-10  7:14 UTC (permalink / raw)
  To: Waskiewicz Jr, Peter P; +Cc: jeff, davem, netdev, Thomas Graf

Waskiewicz Jr, Peter P wrote:
>>> +struct dcbnl_genl_ops {
>>> +	u8   (*getstate)(struct net_device *);
>>> +	void (*setstate)(struct net_device *, u8);
>>> +	void (*getpermhwaddr)(struct net_device *, u8 *);
>> "getpermhwaddr" doesn't seem to belong in this interface but 
>> in rtnetlink and/or ethtool instead.
> 
> This was a feature in ethtool, but it was removed at some point.  I can
> add it to rtnetlink though if people think it'll be useful to have.

I think thats a better idea than to put it in a private
driver interface.



^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 2/3] ixgbe: Add Data Center Bridging hardware initialization code
  2008-05-27 14:13 [PATCH] NET: DCB generic netlink interface PJ Waskiewicz
  2008-05-27 14:13 ` [PATCH 1/3] [NET-NEXT]: Add DCB netlink interface definition PJ Waskiewicz
@ 2008-05-27 14:13 ` PJ Waskiewicz
  2008-05-27 14:13 ` [PATCH 3/3] ixgbe: Enable Data Center Bridging (DCB) support PJ Waskiewicz
  2008-06-04 18:44 ` [PATCH] NET: DCB generic netlink interface David Miller
  3 siblings, 0 replies; 24+ messages in thread
From: PJ Waskiewicz @ 2008-05-27 14:13 UTC (permalink / raw)
  To: jeff, davem; +Cc: netdev

This patch adds the necessary hardware initialization code for 82598 to
support Data Center Bridging.  The code takes care of bandwidth credit
calculations for the hardware arbiters, priority grouping methods, and
all the hardware accesses to enable the features in 82598.

This is based from the net-next-2.6 tree.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
---

 drivers/net/ixgbe/ixgbe_dcb.c       |  330 +++++++++++++++++++++++++++++
 drivers/net/ixgbe/ixgbe_dcb.h       |  168 +++++++++++++++
 drivers/net/ixgbe/ixgbe_dcb_82598.c |  400 +++++++++++++++++++++++++++++++++++
 drivers/net/ixgbe/ixgbe_dcb_82598.h |   98 +++++++++
 4 files changed, 996 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_dcb.c b/drivers/net/ixgbe/ixgbe_dcb.c
new file mode 100644
index 0000000..11be2b8
--- /dev/null
+++ b/drivers/net/ixgbe/ixgbe_dcb.c
@@ -0,0 +1,330 @@
+/*******************************************************************************
+
+  Intel 10 Gigabit PCI Express Linux driver
+  Copyright(c) 1999 - 2008 Intel Corporation.
+
+  This program is free software; you can redistribute it and/or modify it
+  under the terms and conditions of the GNU General Public License,
+  version 2, as published by the Free Software Foundation.
+
+  This program is distributed in the hope it will be useful, but WITHOUT
+  ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+  FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+  more details.
+
+  You should have received a copy of the GNU General Public License along with
+  this program; if not, write to the Free Software Foundation, Inc.,
+  51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+
+  The full GNU General Public License is included in this distribution in
+  the file called "COPYING".
+
+  Contact Information:
+  Linux NICS <linux.nics@intel.com>
+  e1000-devel Mailing List <e1000-devel@lists.sourceforge.net>
+  Intel Corporation, 5200 N.E. Elam Young Parkway, Hillsboro, OR 97124-6497
+
+*******************************************************************************/
+
+
+#include "ixgbe.h"
+#include "ixgbe_type.h"
+#include "ixgbe_dcb.h"
+#include "ixgbe_dcb_82598.h"
+
+/**
+ * ixgbe_dcb_config - Struct containing DCB settings.
+ * @dcb_config: Pointer to DCB config structure
+ *
+ * This function checks DCB rules for DCB settings.
+ * The following rules are checked:
+ * 1. The sum of bandwidth percentages of all Bandwidth Groups must total 100%.
+ * 2. The sum of bandwidth percentages of all Traffic Classes within a Bandwidth
+ *    Group must total 100.
+ * 3. A Traffic Class should not be set to both Link Strict Priority
+ *    and Group Strict Priority.
+ * 4. Link strict Bandwidth Groups can only have link strict traffic classes
+ *    with zero bandwidth.
+ */
+s32 ixgbe_dcb_check_config(struct ixgbe_dcb_config *dcb_config)
+{
+	struct tc_bw_alloc *p;
+	s32 ret_val = 0;
+	u8 i, j, bw = 0, bw_id;
+	u8 bw_sum[2][MAX_BW_GROUP];
+	bool link_strict[2][MAX_BW_GROUP];
+
+	memset(bw_sum, 0, sizeof(bw_sum));
+	memset(link_strict, 0, sizeof(link_strict));
+
+	/* First Tx, then Rx */
+	for (i = 0; i < 2; i++) {
+		/* Check each traffic class for rule violation */
+		for (j = 0; j < MAX_TRAFFIC_CLASS; j++) {
+			p = &dcb_config->tc_config[j].path[i];
+
+			bw = p->bwg_percent;
+			bw_id = p->bwg_id;
+
+			if (bw_id >= MAX_BW_GROUP) {
+				ret_val = DCB_ERR_CONFIG;
+				goto err_config;
+			}
+			if (p->prio_type == prio_link) {
+				link_strict[i][bw_id] = true;
+				/* Link strict should have zero bandwidth */
+				if (bw) {
+					ret_val = DCB_ERR_LS_BW_NONZERO;
+					goto err_config;
+				}
+			} else if (!bw) {
+				/*
+				 * Traffic classes without link strict
+				 * should have non-zero bandwidth.
+				 */
+				ret_val = DCB_ERR_TC_BW_ZERO;
+				goto err_config;
+			}
+			bw_sum[i][bw_id] += bw;
+		}
+
+		bw = 0;
+
+		/* Check each bandwidth group for rule violation */
+		for (j = 0; j < MAX_BW_GROUP; j++) {
+			bw += dcb_config->bw_percentage[i][j];
+			/*
+			 * Sum of bandwidth percentages of all traffic classes
+			 * within a Bandwidth Group must total 100 except for
+			 * link strict group (zero bandwidth).
+			 */
+			if (link_strict[i][j]) {
+				if (bw_sum[i][j]) {
+					/*
+					 * Link strict group should have zero
+					 * bandwidth.
+					 */
+					ret_val = DCB_ERR_LS_BWG_NONZERO;
+					goto err_config;
+				}
+			} else if (bw_sum[i][j] != BW_PERCENT &&
+			           bw_sum[i][j] != 0) {
+				ret_val = DCB_ERR_TC_BW;
+				goto err_config;
+			}
+		}
+
+		if (bw != BW_PERCENT) {
+			ret_val = DCB_ERR_BW_GROUP;
+			goto err_config;
+		}
+	}
+
+err_config:
+	return ret_val;
+}
+
+/**
+ * ixgbe_dcb_calculate_tc_credits - Calculates traffic class credits
+ * @ixgbe_dcb_config: Struct containing DCB settings.
+ * @direction: Configuring either Tx or Rx.
+ *
+ * This function calculates the credits allocated to each traffic class.
+ * It should be called only after the rules are checked by
+ * ixgbe_dcb_check_config().
+ */
+s32 ixgbe_dcb_calculate_tc_credits(struct ixgbe_dcb_config *dcb_config,
+                                   u8 direction)
+{
+	struct tc_bw_alloc *p;
+	s32 ret_val = 0;
+	/* Initialization values default for Tx settings */
+	u32 credit_refill       = 0;
+	u32 credit_max          = 0;
+	u16 link_percentage     = 0;
+	u8  bw_percent          = 0;
+	u8  i;
+
+	if (dcb_config == NULL) {
+		ret_val = DCB_ERR_CONFIG;
+		goto out;
+	}
+
+	/* Find out the link percentage for each TC first */
+	for (i = 0; i < MAX_TRAFFIC_CLASS; i++) {
+		p = &dcb_config->tc_config[i].path[direction];
+		bw_percent = dcb_config->bw_percentage[direction][p->bwg_id];
+
+		link_percentage = p->bwg_percent;
+		/* Must be careful of integer division for very small nums */
+		link_percentage = (link_percentage * bw_percent) / 100;
+		if (p->bwg_percent > 0 && link_percentage == 0)
+			link_percentage = 1;
+
+		/* Save link_percentage for reference */
+		p->link_percent = (u8)link_percentage;
+
+		/* Calculate credit refill and save it */
+		credit_refill = link_percentage * MINIMUM_CREDIT_REFILL;
+		p->data_credits_refill = (u16)credit_refill;
+
+		/* Calculate maximum credit for the TC */
+		credit_max = (link_percentage * MAX_CREDIT) / 100;
+
+		/*
+		 * Adjustment based on rule checking, if the percentage
+		 * of a TC is too small, the maximum credit may not be
+		 * enough to send out a jumbo frame in data plane arbitration.
+		 */
+		if (credit_max && credit_max < MINIMUM_CREDIT_FOR_JUMBO)
+			credit_max = MINIMUM_CREDIT_FOR_JUMBO;
+
+		if (direction == DCB_TX_CONFIG) {
+			/*
+			 * Adjustment based on rule checking, if the
+			 * percentage of a TC is too small, the maximum
+			 * credit may not be enough to send out a TSO
+			 * packet in descriptor plane arbitration.
+			 */
+			if (credit_max &&
+			    (credit_max < MINIMUM_CREDIT_FOR_TSO))
+				credit_max = MINIMUM_CREDIT_FOR_TSO;
+
+			dcb_config->tc_config[i].desc_credits_max = (u16)credit_max;
+		}
+
+		p->data_credits_max = (u16)credit_max;
+	}
+
+out:
+	return ret_val;
+}
+
+/**
+ * ixgbe_dcb_get_tc_stats - Returns status of each traffic class
+ * @hw: pointer to hardware structure
+ * @stats: pointer to statistics structure
+ * @tc_count:  Number of elements in bwg_array.
+ *
+ * This function returns the status data for each of the Traffic Classes in use.
+ */
+s32 ixgbe_dcb_get_tc_stats(struct ixgbe_hw *hw, struct ixgbe_hw_stats *stats,
+                           u8 tc_count)
+{
+	s32 ret = 0;
+	if (hw->mac.type == ixgbe_mac_82598EB)
+		ret = ixgbe_dcb_get_tc_stats_82598(hw, stats, tc_count);
+	return ret;
+}
+
+/**
+ * ixgbe_dcb_get_pfc_stats - Returns CBFC status of each traffic class
+ * hw - pointer to hardware structure
+ * stats - pointer to statistics structure
+ * tc_count -  Number of elements in bwg_array.
+ *
+ * This function returns the CBFC status data for each of the Traffic Classes.
+ */
+s32 ixgbe_dcb_get_pfc_stats(struct ixgbe_hw *hw, struct ixgbe_hw_stats *stats,
+                            u8 tc_count)
+{
+	s32 ret = 0;
+	if (hw->mac.type == ixgbe_mac_82598EB)
+		ret = ixgbe_dcb_get_pfc_stats_82598(hw, stats, tc_count);
+	return ret;
+}
+
+/**
+ * ixgbe_dcb_config_rx_arbiter - Config Rx arbiter
+ * @hw: pointer to hardware structure
+ * @dcb_config: pointer to ixgbe_dcb_config structure
+ *
+ * Configure Rx Data Arbiter and credits for each traffic class.
+ */
+s32 ixgbe_dcb_config_rx_arbiter(struct ixgbe_hw *hw,
+                                struct ixgbe_dcb_config *dcb_config)
+{
+	s32 ret = 0;
+	if (hw->mac.type == ixgbe_mac_82598EB)
+		ret = ixgbe_dcb_config_rx_arbiter_82598(hw, dcb_config);
+	return ret;
+}
+
+/**
+ * ixgbe_dcb_config_tx_desc_arbiter - Config Tx Desc arbiter
+ * @hw: pointer to hardware structure
+ * @dcb_config: pointer to ixgbe_dcb_config structure
+ *
+ * Configure Tx Descriptor Arbiter and credits for each traffic class.
+ */
+s32 ixgbe_dcb_config_tx_desc_arbiter(struct ixgbe_hw *hw,
+                                     struct ixgbe_dcb_config *dcb_config)
+{
+	s32 ret = 0;
+	if (hw->mac.type == ixgbe_mac_82598EB)
+		ret = ixgbe_dcb_config_tx_desc_arbiter_82598(hw, dcb_config);
+	return ret;
+}
+
+/**
+ * ixgbe_dcb_config_tx_data_arbiter - Config Tx data arbiter
+ * @hw: pointer to hardware structure
+ * @dcb_config: pointer to ixgbe_dcb_config structure
+ *
+ * Configure Tx Data Arbiter and credits for each traffic class.
+ */
+s32 ixgbe_dcb_config_tx_data_arbiter(struct ixgbe_hw *hw,
+                                     struct ixgbe_dcb_config *dcb_config)
+{
+	s32 ret = 0;
+	if (hw->mac.type == ixgbe_mac_82598EB)
+		ret = ixgbe_dcb_config_tx_data_arbiter_82598(hw, dcb_config);
+	return ret;
+}
+
+/**
+ * ixgbe_dcb_config_pfc - Config priority flow control
+ * @hw: pointer to hardware structure
+ * @dcb_config: pointer to ixgbe_dcb_config structure
+ *
+ * Configure Priority Flow Control for each traffic class.
+ */
+s32 ixgbe_dcb_config_pfc(struct ixgbe_hw *hw,
+                         struct ixgbe_dcb_config *dcb_config)
+{
+	s32 ret = 0;
+	if (hw->mac.type == ixgbe_mac_82598EB)
+		ret = ixgbe_dcb_config_pfc_82598(hw, dcb_config);
+	return ret;
+}
+
+/**
+ * ixgbe_dcb_config_tc_stats - Config traffic class statistics
+ * @hw: pointer to hardware structure
+ *
+ * Configure queue statistics registers, all queues belonging to same traffic
+ * class uses a single set of queue statistics counters.
+ */
+s32 ixgbe_dcb_config_tc_stats(struct ixgbe_hw *hw)
+{
+	s32 ret = 0;
+	if (hw->mac.type == ixgbe_mac_82598EB)
+		ret = ixgbe_dcb_config_tc_stats_82598(hw);
+	return ret;
+}
+
+/**
+ * ixgbe_dcb_hw_config - Config and enable DCB
+ * @hw: pointer to hardware structure
+ * @dcb_config: pointer to ixgbe_dcb_config structure
+ *
+ * Configure dcb settings and enable dcb mode.
+ */
+s32 ixgbe_dcb_hw_config(struct ixgbe_hw *hw,
+                        struct ixgbe_dcb_config *dcb_config)
+{
+	s32 ret = 0;
+	if (hw->mac.type == ixgbe_mac_82598EB)
+		ret = ixgbe_dcb_hw_config_82598(hw, dcb_config);
+	return ret;
+}
diff --git a/drivers/net/ixgbe/ixgbe_dcb.h b/drivers/net/ixgbe/ixgbe_dcb.h
new file mode 100644
index 0000000..0f16539
--- /dev/null
+++ b/drivers/net/ixgbe/ixgbe_dcb.h
@@ -0,0 +1,168 @@
+/*******************************************************************************
+
+  Intel 10 Gigabit PCI Express Linux driver
+  Copyright(c) 1999 - 2008 Intel Corporation.
+
+  This program is free software; you can redistribute it and/or modify it
+  under the terms and conditions of the GNU General Public License,
+  version 2, as published by the Free Software Foundation.
+
+  This program is distributed in the hope it will be useful, but WITHOUT
+  ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+  FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+  more details.
+
+  You should have received a copy of the GNU General Public License along with
+  this program; if not, write to the Free Software Foundation, Inc.,
+  51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+
+  The full GNU General Public License is included in this distribution in
+  the file called "COPYING".
+
+  Contact Information:
+  Linux NICS <linux.nics@intel.com>
+  e1000-devel Mailing List <e1000-devel@lists.sourceforge.net>
+  Intel Corporation, 5200 N.E. Elam Young Parkway, Hillsboro, OR 97124-6497
+
+*******************************************************************************/
+
+#ifndef _DCB_CONFIG_H_
+#define _DCB_CONFIG_H_
+
+#include "ixgbe_type.h"
+
+/* DCB data structures */
+
+#define IXGBE_MAX_PACKET_BUFFERS 8
+#define MAX_USER_PRIORITY        8
+#define MAX_TRAFFIC_CLASS        8
+#define MAX_BW_GROUP             8
+#define BW_PERCENT               100
+
+#define DCB_TX_CONFIG            0
+#define DCB_RX_CONFIG            1
+
+/* DCB error Codes */
+#define DCB_SUCCESS              0
+#define DCB_ERR_CONFIG           -1
+#define DCB_ERR_PARAM            -2
+
+/* Transmit and receive Errors */
+/* Error in bandwidth group allocation */
+#define DCB_ERR_BW_GROUP        -3
+/* Error in traffic class bandwidth allocation */
+#define DCB_ERR_TC_BW           -4
+/* Traffic class has both link strict and group strict enabled */
+#define DCB_ERR_LS_GS           -5
+/* Link strict traffic class has non zero bandwidth */
+#define DCB_ERR_LS_BW_NONZERO   -6
+/* Link strict bandwidth group has non zero bandwidth */
+#define DCB_ERR_LS_BWG_NONZERO  -7
+/*  Traffic class has zero bandwidth */
+#define DCB_ERR_TC_BW_ZERO      -8
+
+#define DCB_NOT_IMPLEMENTED      0x7FFFFFFF
+
+struct dcb_pfc_tc_debug
+{
+	u8  tc;
+	u8  pause_status;
+	u64 pause_quanta;
+};
+
+enum strict_prio_type {
+	prio_none = 0,
+	prio_group,
+	prio_link
+};
+
+/* Traffic class bandwidth allocation per direction */
+struct tc_bw_alloc
+{
+	u8 bwg_id;                /* Bandwidth Group (BWG) ID */
+	u8 bwg_percent;           /* % of BWG's bandwidth */
+	u8 link_percent;          /* % of link bandwidth */
+	u8 up_to_tc_bitmap;       /* User Priority to Traffic Class mapping */
+	u16 data_credits_refill;  /* Credit refill amount in 64B granularity */
+	u16 data_credits_max;     /* Max credits for a configured packet buffer
+	                           * in 64B granularity.*/
+	enum strict_prio_type prio_type; /* Link or Group Strict Priority */
+};
+
+enum dcb_pfc_type {
+	pfc_disabled = 0,
+	pfc_enabled_full,
+	pfc_enabled_tx,
+	pfc_enabled_rx
+};
+
+/* Traffic class configuration */
+struct tc_configuration
+{
+	struct tc_bw_alloc path[2]; /* One each for Tx/Rx */
+	enum dcb_pfc_type  dcb_pfc; /* Class based flow control setting */
+
+	u16 desc_credits_max; /* For Tx Descriptor arbitration */
+	u8 tc; /* Traffic class (TC) */
+};
+
+enum dcb_rx_pba_cfg {
+	pba_equal,     /* PBA[0-7] each use 64KB FIFO */
+	pba_80_48      /* PBA[0-3] each use 80KB, PBA[4-7] each use 48KB */
+};
+
+struct ixgbe_dcb_config
+{
+	struct tc_configuration tc_config[MAX_TRAFFIC_CLASS];
+	u8     bw_percentage[2][MAX_BW_GROUP]; /* One each for Tx/Rx */
+
+	bool  round_robin_enable;
+
+	enum dcb_rx_pba_cfg rx_pba_cfg;
+
+	u32  dcb_cfg_version; /* Not used...OS-specific? */
+	u32  link_speed; /* For bandwidth allocation validation purpose */
+};
+
+
+/* DCB driver APIs */
+
+/* DCB rule checking function.*/
+s32 ixgbe_dcb_check_config(struct ixgbe_dcb_config *config);
+
+/* DCB credits calculation */
+s32 ixgbe_dcb_calculate_tc_credits(struct ixgbe_dcb_config *config,
+                                   u8 direction);
+
+/* DCB PFC functions */
+s32 ixgbe_dcb_config_pfc(struct ixgbe_hw *hw,
+                         struct ixgbe_dcb_config *dcb_config);
+s32 ixgbe_dcb_get_pfc_stats(struct ixgbe_hw *hw, struct ixgbe_hw_stats *stats,
+                            u8 tc_count);
+
+/* DCB traffic class stats */
+s32 ixgbe_dcb_config_tc_stats(struct ixgbe_hw *);
+s32 ixgbe_dcb_get_tc_stats(struct ixgbe_hw *hw, struct ixgbe_hw_stats *stats,
+                           u8 tc_count);
+
+/* DCB config arbiters */
+s32 ixgbe_dcb_config_tx_desc_arbiter(struct ixgbe_hw *hw,
+                                     struct ixgbe_dcb_config *dcb_config);
+s32 ixgbe_dcb_config_tx_data_arbiter(struct ixgbe_hw *hw,
+                                     struct ixgbe_dcb_config *dcb_config);
+s32 ixgbe_dcb_config_rx_arbiter(struct ixgbe_hw *hw,
+                                struct ixgbe_dcb_config *dcb_config);
+
+/* DCB hw initialization */
+s32 ixgbe_dcb_hw_config(struct ixgbe_hw *hw, struct ixgbe_dcb_config *config);
+
+
+/* DCB definitions for credit calculation */
+#define MAX_CREDIT_REFILL       511  /* 0x1FF * 64B = 32704B */
+#define MINIMUM_CREDIT_REFILL   5    /* 5*64B = 320B */
+#define MINIMUM_CREDIT_FOR_JUMBO 145  /* 145= UpperBound((9*1024+54)/64B) for 9KB jumbo frame */
+#define DCB_MAX_TSO_SIZE        32*1024 /* MAX TSO packet size supported in DCB mode */
+#define MINIMUM_CREDIT_FOR_TSO  (DCB_MAX_TSO_SIZE/64 + 1) /* 513 for 32KB TSO packet */
+#define MAX_CREDIT              4095 /* Maximum credit supported: 256KB * 1204 / 64B */
+
+#endif /* _DCB_CONFIG_H */
diff --git a/drivers/net/ixgbe/ixgbe_dcb_82598.c b/drivers/net/ixgbe/ixgbe_dcb_82598.c
new file mode 100644
index 0000000..647aea5
--- /dev/null
+++ b/drivers/net/ixgbe/ixgbe_dcb_82598.c
@@ -0,0 +1,400 @@
+/*******************************************************************************
+
+  Intel 10 Gigabit PCI Express Linux driver
+  Copyright(c) 1999 - 2008 Intel Corporation.
+
+  This program is free software; you can redistribute it and/or modify it
+  under the terms and conditions of the GNU General Public License,
+  version 2, as published by the Free Software Foundation.
+
+  This program is distributed in the hope it will be useful, but WITHOUT
+  ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+  FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+  more details.
+
+  You should have received a copy of the GNU General Public License along with
+  this program; if not, write to the Free Software Foundation, Inc.,
+  51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+
+  The full GNU General Public License is included in this distribution in
+  the file called "COPYING".
+
+  Contact Information:
+  Linux NICS <linux.nics@intel.com>
+  e1000-devel Mailing List <e1000-devel@lists.sourceforge.net>
+  Intel Corporation, 5200 N.E. Elam Young Parkway, Hillsboro, OR 97124-6497
+
+*******************************************************************************/
+
+#include "ixgbe.h"
+#include "ixgbe_type.h"
+#include "ixgbe_dcb.h"
+#include "ixgbe_dcb_82598.h"
+
+/**
+ * ixgbe_dcb_get_tc_stats_82598 - Return status data for each traffic class
+ * @hw: pointer to hardware structure
+ * @stats: pointer to statistics structure
+ * @tc_count:  Number of elements in bwg_array.
+ *
+ * This function returns the status data for each of the Traffic Classes in use.
+ */
+s32 ixgbe_dcb_get_tc_stats_82598(struct ixgbe_hw *hw,
+                                 struct ixgbe_hw_stats *stats,
+                                 u8 tc_count)
+{
+	int tc;
+
+	if (tc_count > MAX_TRAFFIC_CLASS)
+		return DCB_ERR_PARAM;
+
+	/* Statistics pertaining to each traffic class */
+	for (tc = 0; tc < tc_count; tc++) {
+		/* Transmitted Packets */
+		stats->qptc[tc] += IXGBE_READ_REG(hw, IXGBE_QPTC(tc));
+		/* Transmitted Bytes */
+		stats->qbtc[tc] += IXGBE_READ_REG(hw, IXGBE_QBTC(tc));
+		/* Received Packets */
+		stats->qprc[tc] += IXGBE_READ_REG(hw, IXGBE_QPRC(tc));
+		/* Received Bytes */
+		stats->qbrc[tc] += IXGBE_READ_REG(hw, IXGBE_QBRC(tc));
+	}
+
+	return 0;
+}
+
+/**
+ * ixgbe_dcb_get_pfc_stats_82598 - Returns CBFC status data
+ * @hw: pointer to hardware structure
+ * @stats: pointer to statistics structure
+ * @tc_count:  Number of elements in bwg_array.
+ *
+ * This function returns the CBFC status data for each of the Traffic Classes.
+ */
+s32 ixgbe_dcb_get_pfc_stats_82598(struct ixgbe_hw *hw,
+                                  struct ixgbe_hw_stats *stats,
+                                  u8 tc_count)
+{
+	int tc;
+
+	if (tc_count > MAX_TRAFFIC_CLASS)
+		return DCB_ERR_PARAM;
+
+	for (tc = 0; tc < tc_count; tc++) {
+		/* Priority XOFF Transmitted */
+		stats->pxofftxc[tc] += IXGBE_READ_REG(hw, IXGBE_PXOFFTXC(tc));
+		/* Priority XOFF Received */
+		stats->pxoffrxc[tc] += IXGBE_READ_REG(hw, IXGBE_PXOFFRXC(tc));
+	}
+
+	return 0;
+}
+
+/**
+ * ixgbe_dcb_config_packet_buffers_82598 - Configure packet buffers
+ * @hw: pointer to hardware structure
+ * @dcb_config: pointer to ixgbe_dcb_config structure
+ *
+ * Configure packet buffers for DCB mode.
+ */ 
+s32 ixgbe_dcb_config_packet_buffers_82598(struct ixgbe_hw *hw,
+                                          struct ixgbe_dcb_config *dcb_config)
+{
+	s32 ret_val = 0;
+	u32 value = IXGBE_RXPBSIZE_64KB;
+	u8  i = 0;
+
+	/* Setup Rx packet buffer sizes */
+	switch (dcb_config->rx_pba_cfg) {
+	case pba_80_48:
+		/* Setup the first four at 80KB */
+		value = IXGBE_RXPBSIZE_80KB;
+		for (; i < 4; i++) {
+			IXGBE_WRITE_REG(hw, IXGBE_RXPBSIZE(i), value);
+		}
+		/* Setup the last four at 48KB...don't re-init i */
+		value = IXGBE_RXPBSIZE_48KB;
+		/* Fall Through */
+	case pba_equal:
+	default:
+		for (; i < IXGBE_MAX_PACKET_BUFFERS; i++) {
+			IXGBE_WRITE_REG(hw, IXGBE_RXPBSIZE(i), value);
+		}
+
+		/* Setup Tx packet buffer sizes */
+		for (i = 0; i < IXGBE_MAX_PACKET_BUFFERS; i++) {
+			IXGBE_WRITE_REG(hw, IXGBE_TXPBSIZE(i),
+			                IXGBE_TXPBSIZE_40KB);
+		}
+		break;
+	}
+
+	return ret_val;
+}
+
+/**
+ * ixgbe_dcb_config_rx_arbiter_82598 - Config Rx data arbiter
+ * @hw: pointer to hardware structure
+ * @dcb_config: pointer to ixgbe_dcb_config structure
+ *
+ * Configure Rx Data Arbiter and credits for each traffic class.
+ */
+s32 ixgbe_dcb_config_rx_arbiter_82598(struct ixgbe_hw *hw,
+                                      struct ixgbe_dcb_config *dcb_config)
+{
+	struct tc_bw_alloc    *p;
+	u32    reg           = 0;
+	u32    credit_refill = 0;
+	u32    credit_max    = 0;
+	u8     i             = 0;
+
+	reg = IXGBE_READ_REG(hw, IXGBE_RUPPBMR) | IXGBE_RUPPBMR_MQA;
+	IXGBE_WRITE_REG(hw, IXGBE_RUPPBMR, reg);
+
+	reg = IXGBE_READ_REG(hw, IXGBE_RMCS);
+	/* Enable Arbiter */
+	reg &= ~IXGBE_RMCS_ARBDIS;
+	/* Enable Receive Recycle within the BWG */
+	reg |= IXGBE_RMCS_RRM;
+	/* Enable Deficit Fixed Priority arbitration*/
+	reg |= IXGBE_RMCS_DFP;
+
+	IXGBE_WRITE_REG(hw, IXGBE_RMCS, reg);
+
+	/* Configure traffic class credits and priority */
+	for (i = 0; i < MAX_TRAFFIC_CLASS; i++) {
+		p = &dcb_config->tc_config[i].path[DCB_RX_CONFIG];
+		credit_refill = p->data_credits_refill;
+		credit_max    = p->data_credits_max;
+
+		reg = credit_refill | (credit_max << IXGBE_RT2CR_MCL_SHIFT);
+
+		if (p->prio_type == prio_link)
+			reg |= IXGBE_RT2CR_LSP;
+
+		IXGBE_WRITE_REG(hw, IXGBE_RT2CR(i), reg);
+	}
+
+	reg = IXGBE_READ_REG(hw, IXGBE_RDRXCTL);
+	reg |= IXGBE_RDRXCTL_RDMTS_1_2;
+	reg |= IXGBE_RDRXCTL_MPBEN;
+	reg |= IXGBE_RDRXCTL_MCEN;
+	IXGBE_WRITE_REG(hw, IXGBE_RDRXCTL, reg);
+
+	reg = IXGBE_READ_REG(hw, IXGBE_RXCTRL);
+	/* Make sure there is enough descriptors before arbitration */
+	reg &= ~IXGBE_RXCTRL_DMBYPS;
+	IXGBE_WRITE_REG(hw, IXGBE_RXCTRL, reg);
+
+	return 0;
+}
+
+/**
+ * ixgbe_dcb_config_tx_desc_arbiter_82598 - Config Tx Desc. arbiter
+ * @hw: pointer to hardware structure
+ * @dcb_config: pointer to ixgbe_dcb_config structure
+ *
+ * Configure Tx Descriptor Arbiter and credits for each traffic class.
+ */
+s32 ixgbe_dcb_config_tx_desc_arbiter_82598(struct ixgbe_hw *hw,
+                                           struct ixgbe_dcb_config *dcb_config)
+{
+	struct tc_bw_alloc *p;
+	u32    reg, max_credits;
+	u8     i;
+
+	reg = IXGBE_READ_REG(hw, IXGBE_DPMCS);
+
+	/* Enable arbiter */
+	reg &= ~IXGBE_DPMCS_ARBDIS;
+	if (!(dcb_config->round_robin_enable)) {
+		/* Enable DFP and Recycle mode */
+		reg |= (IXGBE_DPMCS_TDPAC | IXGBE_DPMCS_TRM);
+	}
+	reg |= IXGBE_DPMCS_TSOEF;
+	/* Configure Max TSO packet size 34KB including payload and headers */
+	reg |= (0x4 << IXGBE_DPMCS_MTSOS_SHIFT);
+
+	IXGBE_WRITE_REG(hw, IXGBE_DPMCS, reg);
+
+	/* Configure traffic class credits and priority */
+	for (i = 0; i < MAX_TRAFFIC_CLASS; i++) {
+		p = &dcb_config->tc_config[i].path[DCB_TX_CONFIG];
+		max_credits = dcb_config->tc_config[i].desc_credits_max;
+		reg = max_credits << IXGBE_TDTQ2TCCR_MCL_SHIFT;
+		reg |= p->data_credits_refill;
+		reg |= (u32)(p->bwg_id) << IXGBE_TDTQ2TCCR_BWG_SHIFT;
+
+		if (p->prio_type == prio_group)
+			reg |= IXGBE_TDTQ2TCCR_GSP;
+
+		if (p->prio_type == prio_link)
+			reg |= IXGBE_TDTQ2TCCR_LSP;
+
+		IXGBE_WRITE_REG(hw, IXGBE_TDTQ2TCCR(i), reg);
+	}
+
+	return 0;
+}
+
+/**
+ * ixgbe_dcb_config_tx_data_arbiter_82598 - Config Tx data arbiter
+ * @hw: pointer to hardware structure
+ * @dcb_config: pointer to ixgbe_dcb_config structure
+ *
+ * Configure Tx Data Arbiter and credits for each traffic class.
+ */
+s32 ixgbe_dcb_config_tx_data_arbiter_82598(struct ixgbe_hw *hw,
+                                           struct ixgbe_dcb_config *dcb_config)
+{
+	struct tc_bw_alloc *p;
+	u32 reg;
+	u8 i;
+
+	reg = IXGBE_READ_REG(hw, IXGBE_PDPMCS);
+	/* Enable Data Plane Arbiter */
+	reg &= ~IXGBE_PDPMCS_ARBDIS;
+	/* Enable DFP and Transmit Recycle Mode */
+	reg |= (IXGBE_PDPMCS_TPPAC | IXGBE_PDPMCS_TRM);
+
+	IXGBE_WRITE_REG(hw, IXGBE_PDPMCS, reg);
+
+	/* Configure traffic class credits and priority */
+	for (i = 0; i < MAX_TRAFFIC_CLASS; i++) {
+		p = &dcb_config->tc_config[i].path[DCB_TX_CONFIG];
+		reg = p->data_credits_refill;
+		reg |= (u32)(p->data_credits_max) << IXGBE_TDPT2TCCR_MCL_SHIFT;
+		reg |= (u32)(p->bwg_id) << IXGBE_TDPT2TCCR_BWG_SHIFT;
+
+		if (p->prio_type == prio_group)
+			reg |= IXGBE_TDPT2TCCR_GSP;
+
+		if (p->prio_type == prio_link)
+			reg |= IXGBE_TDPT2TCCR_LSP;
+
+		IXGBE_WRITE_REG(hw, IXGBE_TDPT2TCCR(i), reg);
+	}
+
+	/* Enable Tx packet buffer division */
+	reg = IXGBE_READ_REG(hw, IXGBE_DTXCTL);
+	reg |= IXGBE_DTXCTL_ENDBUBD;
+	IXGBE_WRITE_REG(hw, IXGBE_DTXCTL, reg);
+
+	return 0;
+}
+
+/**
+ * ixgbe_dcb_config_pfc_82598 - Config priority flow control
+ * @hw: pointer to hardware structure
+ * @dcb_config: pointer to ixgbe_dcb_config structure
+ *
+ * Configure Priority Flow Control for each traffic class.
+ */
+s32 ixgbe_dcb_config_pfc_82598(struct ixgbe_hw *hw,
+                               struct ixgbe_dcb_config *dcb_config)
+{
+	u32 reg, rx_pba_size;
+	u8  i;
+
+	/* Enable Transmit Priority Flow Control */
+	reg = IXGBE_READ_REG(hw, IXGBE_RMCS);
+	reg &= ~IXGBE_RMCS_TFCE_802_3X;
+	/* correct the reporting of our flow control status */
+	hw->fc.type = ixgbe_fc_none;
+	reg |= IXGBE_RMCS_TFCE_PRIORITY;
+	IXGBE_WRITE_REG(hw, IXGBE_RMCS, reg);
+
+	/* Enable Receive Priority Flow Control */
+	reg = IXGBE_READ_REG(hw, IXGBE_FCTRL);
+	reg &= ~IXGBE_FCTRL_RFCE;
+	reg |= IXGBE_FCTRL_RPFCE;
+	IXGBE_WRITE_REG(hw, IXGBE_FCTRL, reg);
+
+	/*
+	 * Configure flow control thresholds and enable priority flow control
+	 * for each traffic class.
+	 */
+	for (i = 0; i < MAX_TRAFFIC_CLASS; i++) {
+		if (dcb_config->rx_pba_cfg == pba_equal) {
+			rx_pba_size = IXGBE_RXPBSIZE_64KB;
+		} else {
+			rx_pba_size = (i < 4) ? IXGBE_RXPBSIZE_80KB
+			                      : IXGBE_RXPBSIZE_48KB;
+		}
+
+		reg = ((rx_pba_size >> 5) &  0xFFF0);
+		if (dcb_config->tc_config[i].dcb_pfc == pfc_enabled_tx ||
+		    dcb_config->tc_config[i].dcb_pfc == pfc_enabled_full)
+			reg |= IXGBE_FCRTL_XONE;
+
+		IXGBE_WRITE_REG(hw, IXGBE_FCRTL(i), reg);
+
+		reg = ((rx_pba_size >> 2) & 0xFFF0);
+		if (dcb_config->tc_config[i].dcb_pfc == pfc_enabled_tx ||
+		    dcb_config->tc_config[i].dcb_pfc == pfc_enabled_full)
+			reg |= IXGBE_FCRTH_FCEN;
+
+		IXGBE_WRITE_REG(hw, IXGBE_FCRTH(i), reg);
+	}
+
+	/* Configure pause time */
+	for (i = 0; i < (MAX_TRAFFIC_CLASS >> 1); i++)
+		IXGBE_WRITE_REG(hw, IXGBE_FCTTV(i), 0x68006800);
+
+	/* Configure flow control refresh threshold value */
+	IXGBE_WRITE_REG(hw, IXGBE_FCRTV, 0x3400);
+
+	return 0;
+}
+
+/**
+ * ixgbe_dcb_config_tc_stats_82598 - Configure traffic class statistics
+ * @hw: pointer to hardware structure
+ *
+ * Configure queue statistics registers, all queues belonging to same traffic
+ * class uses a single set of queue statistics counters.
+ */
+s32 ixgbe_dcb_config_tc_stats_82598(struct ixgbe_hw *hw)
+{
+	u32 reg = 0;
+	u8  i   = 0;
+	u8  j   = 0;
+
+	/* Receive Queues stats setting -  8 queues per statistics reg */
+	for (i = 0, j = 0; i < 15 && j < 8; i = i + 2, j++) {
+		reg = IXGBE_READ_REG(hw, IXGBE_RQSMR(i));
+		reg |= ((0x1010101) * j);
+		IXGBE_WRITE_REG(hw, IXGBE_RQSMR(i), reg);
+		reg = IXGBE_READ_REG(hw, IXGBE_RQSMR(i + 1));
+		reg |= ((0x1010101) * j);
+		IXGBE_WRITE_REG(hw, IXGBE_RQSMR(i + 1), reg);
+	}
+	/* Transmit Queues stats setting -  4 queues per statistics reg */
+	for (i = 0; i < 8; i++) {
+		reg = IXGBE_READ_REG(hw, IXGBE_TQSMR(i));
+		reg |= ((0x1010101) * i);
+		IXGBE_WRITE_REG(hw, IXGBE_TQSMR(i), reg);
+	}
+
+	return 0;
+}
+
+/**
+ * ixgbe_dcb_hw_config_82598 - Config and enable DCB
+ * @hw: pointer to hardware structure
+ * @dcb_config: pointer to ixgbe_dcb_config structure
+ *
+ * Configure dcb settings and enable dcb mode.
+ */
+s32 ixgbe_dcb_hw_config_82598(struct ixgbe_hw *hw,
+                              struct ixgbe_dcb_config *dcb_config)
+{
+	ixgbe_dcb_config_packet_buffers_82598(hw, dcb_config);
+	ixgbe_dcb_config_rx_arbiter_82598(hw, dcb_config);
+	ixgbe_dcb_config_tx_desc_arbiter_82598(hw, dcb_config);
+	ixgbe_dcb_config_tx_data_arbiter_82598(hw, dcb_config);
+	ixgbe_dcb_config_pfc_82598(hw, dcb_config);
+	ixgbe_dcb_config_tc_stats_82598(hw);
+
+	return 0;
+}
diff --git a/drivers/net/ixgbe/ixgbe_dcb_82598.h b/drivers/net/ixgbe/ixgbe_dcb_82598.h
new file mode 100644
index 0000000..4d3b95a
--- /dev/null
+++ b/drivers/net/ixgbe/ixgbe_dcb_82598.h
@@ -0,0 +1,98 @@
+/*******************************************************************************
+
+  Intel 10 Gigabit PCI Express Linux driver
+  Copyright(c) 1999 - 2008 Intel Corporation.
+
+  This program is free software; you can redistribute it and/or modify it
+  under the terms and conditions of the GNU General Public License,
+  version 2, as published by the Free Software Foundation.
+
+  This program is distributed in the hope it will be useful, but WITHOUT
+  ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+  FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+  more details.
+
+  You should have received a copy of the GNU General Public License along with
+  this program; if not, write to the Free Software Foundation, Inc.,
+  51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+
+  The full GNU General Public License is included in this distribution in
+  the file called "COPYING".
+
+  Contact Information:
+  Linux NICS <linux.nics@intel.com>
+  e1000-devel Mailing List <e1000-devel@lists.sourceforge.net>
+  Intel Corporation, 5200 N.E. Elam Young Parkway, Hillsboro, OR 97124-6497
+
+*******************************************************************************/
+
+#ifndef _DCB_82598_CONFIG_H_
+#define _DCB_82598_CONFIG_H_
+
+/* DCB register definitions */
+
+#define IXGBE_DPMCS_MTSOS_SHIFT 16
+#define IXGBE_DPMCS_TDPAC       0x00000001 /* 0 Round Robin, 1 DFP - Deficit Fixed Priority */
+#define IXGBE_DPMCS_TRM         0x00000010 /* Transmit Recycle Mode */
+#define IXGBE_DPMCS_ARBDIS      0x00000040 /* DCB arbiter disable */
+#define IXGBE_DPMCS_TSOEF       0x00080000 /* TSO Expand Factor: 0=x4, 1=x2 */
+
+#define IXGBE_RUPPBMR_MQA       0x80000000 /* Enable UP to queue mapping */
+
+#define IXGBE_RT2CR_MCL_SHIFT   12 /* Offset to Max Credit Limit setting */
+#define IXGBE_RT2CR_LSP         0x80000000 /* LSP enable bit */
+
+#define IXGBE_RDRXCTL_MPBEN     0x00000010 /* DMA config for multiple packet buffers enable */
+#define IXGBE_RDRXCTL_MCEN      0x00000040 /* DMA config for multiple cores (RSS) enable */
+
+#define IXGBE_TDTQ2TCCR_MCL_SHIFT   12
+#define IXGBE_TDTQ2TCCR_BWG_SHIFT   9
+#define IXGBE_TDTQ2TCCR_GSP     0x40000000
+#define IXGBE_TDTQ2TCCR_LSP     0x80000000
+
+#define IXGBE_TDPT2TCCR_MCL_SHIFT   12
+#define IXGBE_TDPT2TCCR_BWG_SHIFT   9
+#define IXGBE_TDPT2TCCR_GSP     0x40000000
+#define IXGBE_TDPT2TCCR_LSP     0x80000000
+
+#define IXGBE_PDPMCS_TPPAC      0x00000020 /* 0 Round Robin, 1 for DFP - Deficit Fixed Priority */
+#define IXGBE_PDPMCS_ARBDIS     0x00000040 /* Arbiter disable */
+#define IXGBE_PDPMCS_TRM        0x00000100 /* Transmit Recycle Mode enable */
+
+#define IXGBE_DTXCTL_ENDBUBD    0x00000004 /* Enable DBU buffer division */
+
+#define IXGBE_TXPBSIZE_40KB     0x0000A000 /* 40KB Packet Buffer */
+#define IXGBE_RXPBSIZE_48KB     0x0000C000 /* 48KB Packet Buffer */
+#define IXGBE_RXPBSIZE_64KB     0x00010000 /* 64KB Packet Buffer */
+#define IXGBE_RXPBSIZE_80KB     0x00014000 /* 80KB Packet Buffer */
+
+#define IXGBE_RDRXCTL_RDMTS_1_2 0x00000000
+
+/* DCB hardware-specific driver APIs */
+
+/* DCB PFC functions */
+s32 ixgbe_dcb_config_pfc_82598(struct ixgbe_hw *hw,
+                               struct ixgbe_dcb_config *dcb_config);
+s32 ixgbe_dcb_get_pfc_stats_82598(struct ixgbe_hw *hw,
+                                  struct ixgbe_hw_stats *stats,
+                                  u8 tc_count);
+
+/* DCB traffic class stats */
+s32 ixgbe_dcb_config_tc_stats_82598(struct ixgbe_hw *hw);
+s32 ixgbe_dcb_get_tc_stats_82598(struct ixgbe_hw *hw,
+                                 struct ixgbe_hw_stats *stats,
+                                 u8 tc_count);
+
+/* DCB config arbiters */
+s32 ixgbe_dcb_config_tx_desc_arbiter_82598(struct ixgbe_hw *hw,
+                                           struct ixgbe_dcb_config *dcb_config);
+s32 ixgbe_dcb_config_tx_data_arbiter_82598(struct ixgbe_hw *hw,
+                                           struct ixgbe_dcb_config *dcb_config);
+s32 ixgbe_dcb_config_rx_arbiter_82598(struct ixgbe_hw *hw,
+                                      struct ixgbe_dcb_config *dcb_config);
+
+/* DCB hw initialization */
+s32 ixgbe_dcb_hw_config_82598(struct ixgbe_hw *hw,
+                              struct ixgbe_dcb_config *config);
+
+#endif /* _DCB_82598_CONFIG_H */


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 3/3] ixgbe: Enable Data Center Bridging (DCB) support
  2008-05-27 14:13 [PATCH] NET: DCB generic netlink interface PJ Waskiewicz
  2008-05-27 14:13 ` [PATCH 1/3] [NET-NEXT]: Add DCB netlink interface definition PJ Waskiewicz
  2008-05-27 14:13 ` [PATCH 2/3] ixgbe: Add Data Center Bridging hardware initialization code PJ Waskiewicz
@ 2008-05-27 14:13 ` PJ Waskiewicz
  2008-06-04 18:44 ` [PATCH] NET: DCB generic netlink interface David Miller
  3 siblings, 0 replies; 24+ messages in thread
From: PJ Waskiewicz @ 2008-05-27 14:13 UTC (permalink / raw)
  To: jeff, davem; +Cc: netdev

This patch enables DCB support for 82598.  DCB is a technology using the
802.1Qaz and 802.1Qbb IEEE standards for priority grouping and priority
flow control.  This is useful when trying to allow different types of
traffic on separate flows to be paused and handled with different
priorities across the network without impacting other flows on the same
links.  A target traffic type for this technology is to provide flow
control for Fibre Channel over Ethernet, while not impacting other LAN
traffic flows on the link.

This is a respin of previous patches posted for this.  These new patches
now use the DCBNL netlink interface in the kernel to communicate with
userspace.  The userspace utilities to control this are in the process of
being posted to SourceForge, and should be available in the very near
future.

This is based from the net-next-2.6 tree.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
---

 drivers/net/ixgbe/Makefile        |    2 
 drivers/net/ixgbe/ixgbe.h         |   22 +
 drivers/net/ixgbe/ixgbe_ethtool.c |   36 ++
 drivers/net/ixgbe/ixgbe_main.c    |  577 +++++++++++++++++++++++++++++++++----
 4 files changed, 576 insertions(+), 61 deletions(-)

diff --git a/drivers/net/ixgbe/Makefile b/drivers/net/ixgbe/Makefile
index ccd83d9..20b37cc 100644
--- a/drivers/net/ixgbe/Makefile
+++ b/drivers/net/ixgbe/Makefile
@@ -33,4 +33,4 @@
 obj-$(CONFIG_IXGBE) += ixgbe.o
 
 ixgbe-objs := ixgbe_main.o ixgbe_common.o ixgbe_ethtool.o \
-              ixgbe_82598.o ixgbe_phy.o
+              ixgbe_82598.o ixgbe_phy.o ixgbe_dcb.o ixgbe_dcb_82598.o
diff --git a/drivers/net/ixgbe/ixgbe.h b/drivers/net/ixgbe/ixgbe.h
index d981134..145421f 100644
--- a/drivers/net/ixgbe/ixgbe.h
+++ b/drivers/net/ixgbe/ixgbe.h
@@ -35,6 +35,7 @@
 
 #include "ixgbe_type.h"
 #include "ixgbe_common.h"
+#include "ixgbe_dcb.h"
 
 #ifdef CONFIG_DCA
 #include <linux/dca.h>
@@ -98,6 +99,7 @@
 #define IXGBE_TX_FLAGS_TSO		(u32)(1 << 2)
 #define IXGBE_TX_FLAGS_IPV4		(u32)(1 << 3)
 #define IXGBE_TX_FLAGS_VLAN_MASK	0xffff0000
+#define IXGBE_TX_FLAGS_VLAN_PRIO_MASK	0x0000e000
 #define IXGBE_TX_FLAGS_VLAN_SHIFT	16
 
 /* wrapper around a pointer to a socket buffer,
@@ -144,7 +146,7 @@ struct ixgbe_ring {
 
 	u16 reg_idx; /* holds the special value that gets the hardware register
 		      * offset associated with this ring, which is different
-		      * for DCE and RSS modes */
+		      * for DCB and RSS modes */
 
 #ifdef CONFIG_DCA
 	/* cpu for tx queue */
@@ -162,8 +164,10 @@ struct ixgbe_ring {
 	u16 work_limit;                /* max work per interrupt */
 };
 
+#define RING_F_DCB  0
 #define RING_F_VMDQ 1
 #define RING_F_RSS  2
+#define IXGBE_MAX_DCB_INDICES   8
 #define IXGBE_MAX_RSS_INDICES  16
 #define IXGBE_MAX_VMDQ_INDICES 16
 struct ixgbe_ring_feature {
@@ -174,6 +178,10 @@ struct ixgbe_ring_feature {
 #define MAX_RX_QUEUES 64
 #define MAX_TX_QUEUES 32
 
+#define MAX_RX_PACKET_BUFFERS ((adapter->flags & IXGBE_FLAG_DCB_ENABLED) \
+                               ? 8 : 1)
+#define MAX_TX_PACKET_BUFFERS MAX_RX_PACKET_BUFFERS
+
 /* MAX_MSIX_Q_VECTORS of these are allocated,
  * but we only use one per queue-specific vector.
  */
@@ -226,6 +234,9 @@ struct ixgbe_adapter {
 	struct work_struct reset_task;
 	struct ixgbe_q_vector q_vector[MAX_MSIX_Q_VECTORS];
 	char name[MAX_MSIX_COUNT][IFNAMSIZ + 5];
+	struct ixgbe_dcb_config dcb_cfg;
+	struct ixgbe_dcb_config temp_dcb_cfg;
+	u8 dcb_set_bitmap;
 
 	/* Interrupt Throttle Rate */
 	u32 itr_setting;
@@ -234,6 +245,7 @@ struct ixgbe_adapter {
 
 	/* TX */
 	struct ixgbe_ring *tx_ring;	/* One per active queue */
+	int num_tx_queues;
 	u64 restart_queue;
 	u64 lsc_int;
 	u64 hw_tso_ctxt;
@@ -243,12 +255,11 @@ struct ixgbe_adapter {
 
 	/* RX */
 	struct ixgbe_ring *rx_ring;	/* One per active queue */
+	int num_rx_queues;
 	u64 hw_csum_tx_good;
 	u64 hw_csum_rx_error;
 	u64 hw_csum_rx_good;
 	u64 non_eop_descs;
-	int num_tx_queues;
-	int num_rx_queues;
 	int num_msix_vectors;
 	struct ixgbe_ring_feature ring_feature[3];
 	struct msix_entry *msix_entries;
@@ -270,6 +281,7 @@ struct ixgbe_adapter {
 #define IXGBE_FLAG_RSS_ENABLED                  (u32)(1 << 6)
 #define IXGBE_FLAG_VMDQ_ENABLED                 (u32)(1 << 7)
 #define IXGBE_FLAG_DCA_ENABLED                  (u32)(1 << 8)
+#define IXGBE_FLAG_DCB_ENABLED                  (u32)(1 << 9)
 
 	/* OS defined structs */
 	struct net_device *netdev;
@@ -314,5 +326,9 @@ extern int ixgbe_setup_rx_resources(struct ixgbe_adapter *adapter,
 				    struct ixgbe_ring *rxdr);
 extern int ixgbe_setup_tx_resources(struct ixgbe_adapter *adapter,
 				    struct ixgbe_ring *txdr);
+extern int ixgbe_close(struct net_device *netdev);
+extern void ixgbe_reset_interrupt_capability(struct ixgbe_adapter *adapter);
+extern int ixgbe_open(struct net_device *netdev);
+extern int ixgbe_init_interrupt_scheme(struct ixgbe_adapter *adapter);
 
 #endif /* _IXGBE_H_ */
diff --git a/drivers/net/ixgbe/ixgbe_ethtool.c b/drivers/net/ixgbe/ixgbe_ethtool.c
index 4e46377..944f669 100644
--- a/drivers/net/ixgbe/ixgbe_ethtool.c
+++ b/drivers/net/ixgbe/ixgbe_ethtool.c
@@ -97,7 +97,17 @@ static struct ixgbe_stats ixgbe_gstrings_stats[] = {
 		 ((struct ixgbe_adapter *)netdev->priv)->num_rx_queues) * \
 		 (sizeof(struct ixgbe_queue_stats) / sizeof(u64)))
 #define IXGBE_GLOBAL_STATS_LEN	ARRAY_SIZE(ixgbe_gstrings_stats)
-#define IXGBE_STATS_LEN (IXGBE_GLOBAL_STATS_LEN + IXGBE_QUEUE_STATS_LEN)
+#define IXGBE_PB_STATS_LEN ( \
+		(((struct ixgbe_adapter *)netdev->priv)->flags & \
+		 IXGBE_FLAG_DCB_ENABLED) ? \
+		 (sizeof(((struct ixgbe_adapter *)0)->stats.pxonrxc) + \
+		  sizeof(((struct ixgbe_adapter *)0)->stats.pxontxc) + \
+		  sizeof(((struct ixgbe_adapter *)0)->stats.pxoffrxc) + \
+		  sizeof(((struct ixgbe_adapter *)0)->stats.pxofftxc)) \
+		 / sizeof(u64) : 0)
+#define IXGBE_STATS_LEN (IXGBE_GLOBAL_STATS_LEN + \
+		IXGBE_PB_STATS_LEN + \
+		IXGBE_QUEUE_STATS_LEN)
 
 static int ixgbe_get_settings(struct net_device *netdev,
 			      struct ethtool_cmd *ecmd)
@@ -806,6 +816,16 @@ static void ixgbe_get_ethtool_stats(struct net_device *netdev,
 			data[i + k] = queue_stat[k];
 		i += k;
 	}
+	if (adapter->flags & IXGBE_FLAG_DCB_ENABLED) {
+		for (j = 0; j < MAX_TX_PACKET_BUFFERS; j++) {
+			data[i++] = adapter->stats.pxontxc[j];
+			data[i++] = adapter->stats.pxofftxc[j];
+		}
+		for (j = 0; j < MAX_RX_PACKET_BUFFERS; j++) {
+			data[i++] = adapter->stats.pxonrxc[j];
+			data[i++] = adapter->stats.pxoffrxc[j];
+		}
+	}
 }
 
 static void ixgbe_get_strings(struct net_device *netdev, u32 stringset,
@@ -834,6 +854,20 @@ static void ixgbe_get_strings(struct net_device *netdev, u32 stringset,
 			sprintf(p, "rx_queue_%u_bytes", i);
 			p += ETH_GSTRING_LEN;
 		}
+		if (adapter->flags & IXGBE_FLAG_DCB_ENABLED) {
+			for (i = 0; i < MAX_TX_PACKET_BUFFERS; i++) {
+				sprintf(p, "tx_pb_%u_pxon", i);
+				p += ETH_GSTRING_LEN;
+				sprintf(p, "tx_pb_%u_pxoff", i);
+				p += ETH_GSTRING_LEN;
+			}
+			for (i = 0; i < MAX_RX_PACKET_BUFFERS; i++) {
+				sprintf(p, "rx_pb_%u_pxon", i);
+				p += ETH_GSTRING_LEN;
+				sprintf(p, "rx_pb_%u_pxoff", i);
+				p += ETH_GSTRING_LEN;
+			}
+		}
 /*		BUG_ON(p - data != IXGBE_STATS_LEN * ETH_GSTRING_LEN); */
 		break;
 	}
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index 7b85922..4177ea5 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -1,7 +1,7 @@
 /*******************************************************************************
 
   Intel 10 Gigabit PCI Express Linux driver
-  Copyright(c) 1999 - 2007 Intel Corporation.
+  Copyright(c) 1999 - 2008 Intel Corporation.
 
   This program is free software; you can redistribute it and/or modify it
   under the terms and conditions of the GNU General Public License,
@@ -48,10 +48,10 @@ char ixgbe_driver_name[] = "ixgbe";
 static const char ixgbe_driver_string[] =
 	"Intel(R) 10 Gigabit PCI Express Network Driver";
 
-#define DRV_VERSION "1.3.18-k2"
+#define DRV_VERSION "1.3.29-k2"
 const char ixgbe_driver_version[] = DRV_VERSION;
 static const char ixgbe_copyright[] =
-	 "Copyright (c) 1999-2007 Intel Corporation.";
+	 "Copyright (c) 1999-2008 Intel Corporation.";
 
 static const struct ixgbe_info *ixgbe_info_tbl[] = {
 	[board_82598]			= &ixgbe_82598_info,
@@ -90,6 +90,41 @@ static struct notifier_block dca_notifier = {
 };
 #endif
 
+#ifdef CONFIG_DCBNL
+static u8   ixgbe_dcbnl_getstate(struct net_device *);
+static void ixgbe_dcbnl_setstate(struct net_device *, u8);
+static void ixgbe_dcbnl_getpermhwaddr(struct net_device *, u8 *);
+static void ixgbe_dcbnl_setpgtccfgtx(struct net_device *, int, u8, u8, u8, u8);
+static void ixgbe_dcbnl_setpgbwgcfgtx(struct net_device *, int, u8);
+static void ixgbe_dcbnl_setpgtccfgrx(struct net_device *, int, u8, u8, u8, u8);
+static void ixgbe_dcbnl_setpgbwgcfgrx(struct net_device *, int, u8);
+static void ixgbe_dcbnl_getpgtccfgtx(struct net_device *, int, u8 *, u8 *,
+				     u8 *, u8 *);
+static void ixgbe_dcbnl_getpgbwgcfgtx(struct net_device *, int, u8 *);
+static void ixgbe_dcbnl_getpgtccfgrx(struct net_device *, int, u8 *, u8 *,
+				     u8 *, u8 *);
+static void ixgbe_dcbnl_getpgbwgcfgrx(struct net_device *, int, u8 *);
+static void ixgbe_dcbnl_setpfccfg(struct net_device *, int, u8);
+static void ixgbe_dcbnl_getpfccfg(struct net_device *, int, u8 *);
+static u8   ixgbe_dcbnl_setall(struct net_device *);
+static struct dcbnl_genl_ops dcbnl_ops = {
+	.getstate	= ixgbe_dcbnl_getstate,
+	.setstate	= ixgbe_dcbnl_setstate,
+	.getpermhwaddr	= ixgbe_dcbnl_getpermhwaddr,
+	.setpgtccfgtx	= ixgbe_dcbnl_setpgtccfgtx,
+	.setpgbwgcfgtx	= ixgbe_dcbnl_setpgbwgcfgtx,
+	.setpgtccfgrx	= ixgbe_dcbnl_setpgtccfgrx,
+	.setpgbwgcfgrx	= ixgbe_dcbnl_setpgbwgcfgrx,
+	.getpgtccfgtx	= ixgbe_dcbnl_getpgtccfgtx,
+	.getpgbwgcfgtx	= ixgbe_dcbnl_getpgbwgcfgtx,
+	.getpgtccfgrx	= ixgbe_dcbnl_getpgtccfgrx,
+	.getpgbwgcfgrx	= ixgbe_dcbnl_getpgbwgcfgrx,
+	.setpfccfg	= ixgbe_dcbnl_setpfccfg,
+	.getpfccfg	= ixgbe_dcbnl_getpfccfg,
+	.setall		= ixgbe_dcbnl_setall
+};
+#endif
+
 MODULE_AUTHOR("Intel Corporation, <linux.nics@intel.com>");
 MODULE_DESCRIPTION("Intel(R) 10 Gigabit PCI Express Network Driver");
 MODULE_LICENSE("GPL");
@@ -397,13 +432,13 @@ static void ixgbe_receive_skb(struct ixgbe_adapter *adapter,
 			      u16 tag)
 {
 	if (!(adapter->flags & IXGBE_FLAG_IN_NETPOLL)) {
-		if (adapter->vlgrp && is_vlan)
+		if (adapter->vlgrp && is_vlan && (tag != 0))
 			vlan_hwaccel_receive_skb(skb, adapter->vlgrp, tag);
 		else
 			netif_receive_skb(skb);
 	} else {
 
-		if (adapter->vlgrp && is_vlan)
+		if (adapter->vlgrp && is_vlan && (tag != 0))
 			vlan_hwaccel_rx(skb, adapter->vlgrp, tag);
 		else
 			netif_rx(skb);
@@ -545,14 +580,13 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_adapter *adapter,
 	struct ixgbe_rx_buffer *rx_buffer_info, *next_buffer;
 	struct sk_buff *skb;
 	unsigned int i;
-	u32 upper_len, len, staterr;
+	u32 len, staterr;
 	u16 hdr_info, vlan_tag;
 	bool is_vlan, cleaned = false;
 	int cleaned_count = 0;
 	unsigned int total_rx_bytes = 0, total_rx_packets = 0;
 
 	i = rx_ring->next_to_clean;
-	upper_len = 0;
 	rx_desc = IXGBE_RX_DESC_ADV(*rx_ring, i);
 	staterr = le32_to_cpu(rx_desc->wb.upper.status_error);
 	rx_buffer_info = &rx_ring->rx_buffer_info[i];
@@ -560,6 +594,7 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_adapter *adapter,
 	vlan_tag = le16_to_cpu(rx_desc->wb.upper.vlan);
 
 	while (staterr & IXGBE_RXD_STAT_DD) {
+		u32 upper_len = 0;
 		if (*work_done >= work_to_do)
 			break;
 		(*work_done)++;
@@ -1343,7 +1378,7 @@ static void ixgbe_configure_msi_and_legacy(struct ixgbe_adapter *adapter)
 }
 
 /**
- * ixgbe_configure_tx - Configure 8254x Transmit Unit after Reset
+ * ixgbe_configure_tx - Configure 8259x Transmit Unit after Reset
  * @adapter: board private structure
  *
  * Configure the Tx unit of the MAC after a reset.
@@ -1371,9 +1406,9 @@ static void ixgbe_configure_tx(struct ixgbe_adapter *adapter)
 		/* Disable Tx Head Writeback RO bit, since this hoses
 		 * bookkeeping if things aren't delivered in order.
 		 */
-		txctrl = IXGBE_READ_REG(hw, IXGBE_DCA_TXCTRL(i));
+		txctrl = IXGBE_READ_REG(hw, IXGBE_DCA_TXCTRL(j));
 		txctrl &= ~IXGBE_DCA_TXCTRL_TX_WB_RO_EN;
-		IXGBE_WRITE_REG(hw, IXGBE_DCA_TXCTRL(i), txctrl);
+		IXGBE_WRITE_REG(hw, IXGBE_DCA_TXCTRL(j), txctrl);
 	}
 }
 
@@ -1382,7 +1417,7 @@ static void ixgbe_configure_tx(struct ixgbe_adapter *adapter)
 
 #define IXGBE_SRRCTL_BSIZEHDRSIZE_SHIFT			2
 /**
- * ixgbe_configure_rx - Configure 8254x Receive Unit after Reset
+ * ixgbe_configure_rx - Configure 8259x Receive Unit after Reset
  * @adapter: board private structure
  *
  * Configure the Rx unit of the MAC after a reset.
@@ -1529,6 +1564,16 @@ static void ixgbe_vlan_rx_register(struct net_device *netdev,
 		ixgbe_irq_disable(adapter);
 	adapter->vlgrp = grp;
 
+	/*
+	 * For a DCB driver, always enable VLAN tag stripping so we can
+	 * still receive traffic from a DCB-enabled host even if we're
+	 * not in DCB mode.
+	 */
+	ctrl = IXGBE_READ_REG(&adapter->hw, IXGBE_VLNCTRL);
+	ctrl |= IXGBE_VLNCTRL_VME;
+	ctrl &= ~IXGBE_VLNCTRL_CFIEN;
+	IXGBE_WRITE_REG(&adapter->hw, IXGBE_VLNCTRL, ctrl);
+
 	if (grp) {
 		/* enable VLAN tag insert/strip */
 		ctrl = IXGBE_READ_REG(&adapter->hw, IXGBE_VLNCTRL);
@@ -1672,6 +1717,316 @@ static void ixgbe_napi_disable_all(struct ixgbe_adapter *adapter)
 	}
 }
 
+/*
+ * ixgbe_configure_dcb - Configure DCB hardware
+ * @adapter: ixgbe adapter struct
+ *
+ * This is called by the driver on open to configure the DCB hardware.
+ * This is also called by the gennetlink interface when reconfiguring
+ * the DCB state.
+ */
+void ixgbe_configure_dcb(struct ixgbe_adapter *adapter)
+{
+	struct ixgbe_hw *hw = &adapter->hw;
+	u32 txdctl, vlnctrl;
+	int i, j;
+
+	ixgbe_dcb_check_config(&adapter->dcb_cfg);
+	ixgbe_dcb_calculate_tc_credits(&adapter->dcb_cfg, DCB_TX_CONFIG);
+	ixgbe_dcb_calculate_tc_credits(&adapter->dcb_cfg, DCB_RX_CONFIG);
+
+	/* reconfigure the hardware */
+	ixgbe_dcb_hw_config(&adapter->hw, &adapter->dcb_cfg);
+
+	for (i = 0; i < adapter->num_tx_queues; i++) {
+		j = adapter->tx_ring[i].reg_idx;
+		txdctl = IXGBE_READ_REG(hw, IXGBE_TXDCTL(j));
+		/* PThresh workaround for Tx hang with DFP enabled. */
+		txdctl |= 32;
+		IXGBE_WRITE_REG(hw, IXGBE_TXDCTL(j), txdctl);
+	}
+	/* Enable VLAN tag insert/strip */
+	vlnctrl = IXGBE_READ_REG(hw, IXGBE_VLNCTRL);
+	vlnctrl |= IXGBE_VLNCTRL_VME | IXGBE_VLNCTRL_VFE;
+	vlnctrl &= ~IXGBE_VLNCTRL_CFIEN;
+	IXGBE_WRITE_REG(hw, IXGBE_VLNCTRL, vlnctrl);
+	ixgbe_set_vfta(hw, 0, 0, true);
+}
+
+#ifdef CONFIG_DCBNL
+static int ixgbe_copy_dcb_cfg(struct ixgbe_dcb_config *src_dcb_cfg,
+			      struct ixgbe_dcb_config *dst_dcb_cfg, int tc_max)
+{
+	struct tc_configuration *src_tc_cfg = NULL;
+	struct tc_configuration *dst_tc_cfg = NULL;
+	int i;
+
+	if (!src_dcb_cfg || !dst_dcb_cfg)
+		return -EINVAL;
+
+	for (i = DCB_PG_ATTR_TC_0; i < tc_max + DCB_PG_ATTR_TC_0; i++) {
+		src_tc_cfg = &src_dcb_cfg->tc_config[i - DCB_PG_ATTR_TC_0];
+		dst_tc_cfg = &dst_dcb_cfg->tc_config[i - DCB_PG_ATTR_TC_0];
+
+		dst_tc_cfg->path[DCB_TX_CONFIG].prio_type =
+				src_tc_cfg->path[DCB_TX_CONFIG].prio_type;
+
+		dst_tc_cfg->path[DCB_TX_CONFIG].bwg_id =
+				src_tc_cfg->path[DCB_TX_CONFIG].bwg_id;
+
+		dst_tc_cfg->path[DCB_TX_CONFIG].bwg_percent =
+				src_tc_cfg->path[DCB_TX_CONFIG].bwg_percent;
+
+		dst_tc_cfg->path[DCB_TX_CONFIG].up_to_tc_bitmap =
+				src_tc_cfg->path[DCB_TX_CONFIG].up_to_tc_bitmap;
+
+		dst_tc_cfg->path[DCB_RX_CONFIG].prio_type =
+				src_tc_cfg->path[DCB_RX_CONFIG].prio_type;
+
+		dst_tc_cfg->path[DCB_RX_CONFIG].bwg_id =
+				src_tc_cfg->path[DCB_RX_CONFIG].bwg_id;
+
+		dst_tc_cfg->path[DCB_RX_CONFIG].bwg_percent =
+				src_tc_cfg->path[DCB_RX_CONFIG].bwg_percent;
+
+		dst_tc_cfg->path[DCB_RX_CONFIG].up_to_tc_bitmap =
+				src_tc_cfg->path[DCB_RX_CONFIG].up_to_tc_bitmap;
+	}
+
+	for (i = DCB_PG_ATTR_BWG_0; i < DCB_PG_ATTR_BWG_MAX; i++) {
+		dst_dcb_cfg->bw_percentage[DCB_TX_CONFIG][i-DCB_PG_ATTR_BWG_0] =
+		 src_dcb_cfg->bw_percentage[DCB_TX_CONFIG][i-DCB_PG_ATTR_BWG_0];
+		dst_dcb_cfg->bw_percentage[DCB_RX_CONFIG][i-DCB_PG_ATTR_BWG_0] =
+	         src_dcb_cfg->bw_percentage[DCB_RX_CONFIG][i-DCB_PG_ATTR_BWG_0];
+	}
+
+	for (i = DCB_PFC_UP_ATTR_0; i < DCB_PFC_UP_ATTR_MAX; i++) {
+		dst_dcb_cfg->tc_config[i - DCB_PFC_UP_ATTR_0].dcb_pfc =
+			src_dcb_cfg->tc_config[i - DCB_PFC_UP_ATTR_0].dcb_pfc;
+	}
+
+	return 0;
+}
+
+/* Callbacks for DCB netlink in the kernel */
+#define BIT_DCB_MODE	0x01
+#define BIT_PFC		0x02
+#define BIT_PG_RX	0x04
+#define BIT_PG_TX	0x08
+static u8 ixgbe_dcbnl_getstate(struct net_device *netdev)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(netdev);
+
+	DPRINTK(DRV, INFO, "Get DCB Admin Mode.\n");
+
+	return (!!(adapter->flags & IXGBE_FLAG_DCB_ENABLED));
+}
+
+static void ixgbe_dcbnl_setstate(struct net_device *netdev, u8 state)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(netdev);
+
+	DPRINTK(DRV, INFO, "Set DCB Admin Mode.\n");
+
+	if (state > 0) {
+		/* Turn on DCB */
+		if (adapter->flags & IXGBE_FLAG_DCB_ENABLED) {
+			return;
+		} else {
+			set_bit(__IXGBE_DOWN, &adapter->state);
+			if (netdev->flags & IFF_UP)
+				ixgbe_close(netdev);
+			ixgbe_reset_interrupt_capability(adapter);
+			kfree(adapter->tx_ring);
+			kfree(adapter->rx_ring);
+
+			adapter->flags &= ~IXGBE_FLAG_RSS_ENABLED;
+			adapter->flags |= IXGBE_FLAG_DCB_ENABLED;
+			ixgbe_init_interrupt_scheme(adapter);
+			if (netdev->flags & IFF_UP)
+				ixgbe_open(netdev);
+			clear_bit(__IXGBE_DOWN, &adapter->state);
+		}
+	} else {
+		/* Turn off DCB */
+		if (adapter->flags & IXGBE_FLAG_DCB_ENABLED) {
+			set_bit(__IXGBE_DOWN, &adapter->state);
+			if (netdev->flags & IFF_UP)
+				ixgbe_close(netdev);
+			ixgbe_reset_interrupt_capability(adapter);
+			kfree(adapter->tx_ring);
+			kfree(adapter->rx_ring);
+
+			adapter->flags &= ~IXGBE_FLAG_DCB_ENABLED;
+			adapter->flags |= IXGBE_FLAG_RSS_ENABLED;
+			ixgbe_init_interrupt_scheme(adapter);
+			if (netdev->flags & IFF_UP)
+				ixgbe_open(netdev);
+			clear_bit(__IXGBE_DOWN, &adapter->state);
+		} else {
+			return;
+		}
+	}
+}
+
+static void ixgbe_dcbnl_getpermhwaddr(struct net_device *netdev, u8 *perm_addr)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(netdev);
+	int i;
+
+	for (i = 0; i < netdev->addr_len; i++)
+		perm_addr[i] = adapter->hw.mac.perm_addr[i];
+}
+
+static void ixgbe_dcbnl_setpgtccfgtx(struct net_device *netdev, int tc,
+				     u8 prio, u8 bwg_id, u8 bw_pct, u8 up_map)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(netdev);
+
+	adapter->temp_dcb_cfg.tc_config[tc].path[0].prio_type = prio;
+	adapter->temp_dcb_cfg.tc_config[tc].path[0].bwg_id = bwg_id;
+	adapter->temp_dcb_cfg.tc_config[tc].path[0].bwg_percent = bw_pct;
+	adapter->temp_dcb_cfg.tc_config[tc].path[0].up_to_tc_bitmap = up_map;
+
+	if ( (adapter->temp_dcb_cfg.tc_config[tc].path[0].prio_type !=
+	      adapter->dcb_cfg.tc_config[tc].path[0].prio_type) ||
+	     (adapter->temp_dcb_cfg.tc_config[tc].path[0].bwg_id !=
+	      adapter->dcb_cfg.tc_config[tc].path[0].bwg_id) ||
+	     (adapter->temp_dcb_cfg.tc_config[tc].path[0].bwg_percent !=
+	      adapter->dcb_cfg.tc_config[tc].path[0].bwg_percent) ||
+	     (adapter->temp_dcb_cfg.tc_config[tc].path[0].up_to_tc_bitmap !=
+	      adapter->dcb_cfg.tc_config[tc].path[0].up_to_tc_bitmap))
+		adapter->dcb_set_bitmap |= BIT_PG_TX;
+}
+
+static void ixgbe_dcbnl_setpgbwgcfgtx(struct net_device *netdev, int bwg_id,
+				      u8 bw_pct)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(netdev);
+
+	adapter->temp_dcb_cfg.bw_percentage[0][bwg_id] = bw_pct;
+
+	if (adapter->temp_dcb_cfg.bw_percentage[0][bwg_id] !=
+	    adapter->dcb_cfg.bw_percentage[0][bwg_id])
+		adapter->dcb_set_bitmap |= BIT_PG_RX;
+}
+
+static void ixgbe_dcbnl_setpgtccfgrx(struct net_device *netdev, int tc,
+				     u8 prio, u8 bwg_id, u8 bw_pct, u8 up_map)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(netdev);
+
+	adapter->temp_dcb_cfg.tc_config[tc].path[1].prio_type = prio;
+	adapter->temp_dcb_cfg.tc_config[tc].path[1].bwg_id = bwg_id;
+	adapter->temp_dcb_cfg.tc_config[tc].path[1].bwg_percent = bw_pct;
+	adapter->temp_dcb_cfg.tc_config[tc].path[1].up_to_tc_bitmap = up_map;
+
+	if ( (adapter->temp_dcb_cfg.tc_config[tc].path[1].prio_type !=
+	      adapter->dcb_cfg.tc_config[tc].path[1].prio_type) ||
+	     (adapter->temp_dcb_cfg.tc_config[tc].path[1].bwg_id !=
+	      adapter->dcb_cfg.tc_config[tc].path[1].bwg_id) ||
+	     (adapter->temp_dcb_cfg.tc_config[tc].path[1].bwg_percent !=
+	      adapter->dcb_cfg.tc_config[tc].path[1].bwg_percent) ||
+	     (adapter->temp_dcb_cfg.tc_config[tc].path[1].up_to_tc_bitmap !=
+	      adapter->dcb_cfg.tc_config[tc].path[1].up_to_tc_bitmap))
+		adapter->dcb_set_bitmap |= BIT_PG_RX;
+}
+
+static void ixgbe_dcbnl_setpgbwgcfgrx(struct net_device *netdev, int bwg_id,
+				      u8 bw_pct)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(netdev);
+
+	adapter->temp_dcb_cfg.bw_percentage[1][bwg_id] = bw_pct;
+
+	if (adapter->temp_dcb_cfg.bw_percentage[1][bwg_id] !=
+	    adapter->dcb_cfg.bw_percentage[1][bwg_id])
+		adapter->dcb_set_bitmap |= BIT_PG_RX;
+}
+
+static void ixgbe_dcbnl_getpgtccfgtx(struct net_device *netdev, int tc,
+				   u8 *prio, u8 *bwg_id, u8 *bw_pct, u8 *up_map)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(netdev);
+
+	*prio = adapter->dcb_cfg.tc_config[tc].path[0].prio_type;
+	*bwg_id = adapter->dcb_cfg.tc_config[tc].path[0].bwg_id;
+	*bw_pct = adapter->dcb_cfg.tc_config[tc].path[0].bwg_percent;
+	*up_map = adapter->dcb_cfg.tc_config[tc].path[0].up_to_tc_bitmap;
+}
+
+static void ixgbe_dcbnl_getpgbwgcfgtx(struct net_device *netdev, int bwg_id,
+				      u8 *bw_pct)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(netdev);
+
+	*bw_pct = adapter->dcb_cfg.bw_percentage[0][bwg_id];
+}
+
+static void ixgbe_dcbnl_getpgtccfgrx(struct net_device *netdev, int tc,
+				   u8 *prio, u8 *bwg_id, u8 *bw_pct, u8 *up_map)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(netdev);
+
+	*prio = adapter->dcb_cfg.tc_config[tc].path[1].prio_type;
+	*bwg_id = adapter->dcb_cfg.tc_config[tc].path[1].bwg_id;
+	*bw_pct = adapter->dcb_cfg.tc_config[tc].path[1].bwg_percent;
+	*up_map = adapter->dcb_cfg.tc_config[tc].path[1].up_to_tc_bitmap;
+}
+
+static void ixgbe_dcbnl_getpgbwgcfgrx(struct net_device *netdev, int bwg_id,
+				      u8 *bw_pct)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(netdev);
+
+	*bw_pct = adapter->dcb_cfg.bw_percentage[1][bwg_id];
+}
+
+static void ixgbe_dcbnl_setpfccfg(struct net_device *netdev, int priority,
+				  u8 setting)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(netdev);
+
+	adapter->temp_dcb_cfg.tc_config[priority].dcb_pfc = setting;
+	if (adapter->temp_dcb_cfg.tc_config[priority].dcb_pfc !=
+	    adapter->dcb_cfg.tc_config[priority].dcb_pfc)
+		adapter->dcb_set_bitmap |= BIT_PFC;
+}
+
+static void ixgbe_dcbnl_getpfccfg(struct net_device *netdev, int priority,
+				  u8 *setting)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(netdev);
+
+	*setting = adapter->dcb_cfg.tc_config[priority].dcb_pfc;
+}
+
+static u8 ixgbe_dcbnl_setall(struct net_device *netdev)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(netdev);
+	int ret;
+
+	if (!adapter->dcb_set_bitmap)
+		return 1;
+
+	while (test_and_set_bit(__IXGBE_RESETTING, &adapter->state))
+		msleep(1);
+
+	ret = ixgbe_copy_dcb_cfg(&adapter->temp_dcb_cfg, &adapter->dcb_cfg,
+				 adapter->ring_feature[RING_F_DCB].indices);
+	if (ret) {
+		clear_bit(__IXGBE_RESETTING, &adapter->state);
+		return ret;
+	}
+
+	ixgbe_down(adapter);
+	ixgbe_up(adapter);
+	adapter->dcb_set_bitmap = 0x00;
+	clear_bit(__IXGBE_RESETTING, &adapter->state);
+	return ret;
+}
+#endif /* CONFIG_DCBNL */
+
 static void ixgbe_configure(struct ixgbe_adapter *adapter)
 {
 	struct net_device *netdev = adapter->netdev;
@@ -1680,6 +2035,12 @@ static void ixgbe_configure(struct ixgbe_adapter *adapter)
 	ixgbe_set_multi(netdev);
 
 	ixgbe_restore_vlan(adapter);
+	if (adapter->flags & IXGBE_FLAG_DCB_ENABLED) {
+		netif_set_gso_max_size(netdev, 32768);
+		ixgbe_configure_dcb(adapter);
+	} else {
+		netif_set_gso_max_size(netdev, 65536);
+	}
 
 	ixgbe_configure_tx(adapter);
 	ixgbe_configure_rx(adapter);
@@ -1699,6 +2060,11 @@ static int ixgbe_up_complete(struct ixgbe_adapter *adapter)
 
 	ixgbe_get_hw_control(adapter);
 
+#ifdef CONFIG_NETDEVICES_MULTIQUEUE
+	if (adapter->num_tx_queues > 1)
+		netdev->features |= NETIF_F_MULTI_QUEUE;
+#endif
+
 	if ((adapter->flags & IXGBE_FLAG_MSIX_ENABLED) ||
 	    (adapter->flags & IXGBE_FLAG_MSI_ENABLED)) {
 		if (adapter->flags & IXGBE_FLAG_MSIX_ENABLED) {
@@ -1943,36 +2309,44 @@ static void ixgbe_clean_all_tx_rings(struct ixgbe_adapter *adapter)
 void ixgbe_down(struct ixgbe_adapter *adapter)
 {
 	struct net_device *netdev = adapter->netdev;
+	struct ixgbe_hw *hw = &adapter->hw;
 	u32 rxctrl;
+	u32 txdctl;
+	int i, j;
 
 	/* signal that we are down to the interrupt handler */
 	set_bit(__IXGBE_DOWN, &adapter->state);
 
 	/* disable receives */
-	rxctrl = IXGBE_READ_REG(&adapter->hw, IXGBE_RXCTRL);
-	IXGBE_WRITE_REG(&adapter->hw, IXGBE_RXCTRL,
-			rxctrl & ~IXGBE_RXCTRL_RXEN);
-
-	netif_tx_disable(netdev);
-
-	/* disable transmits in the hardware */
+	rxctrl = IXGBE_READ_REG(hw, IXGBE_RXCTRL);
+	IXGBE_WRITE_REG(hw, IXGBE_RXCTRL, rxctrl & ~IXGBE_RXCTRL_RXEN);
 
-	/* flush both disables */
-	IXGBE_WRITE_FLUSH(&adapter->hw);
+	IXGBE_WRITE_FLUSH(hw);
 	msleep(10);
 
+	netif_stop_queue(netdev);
+	if (netif_is_multiqueue(netdev))
+		for (i = 0; i < adapter->num_tx_queues; i++)
+			netif_stop_subqueue(netdev, i);
+
 	ixgbe_irq_disable(adapter);
 
 	ixgbe_napi_disable_all(adapter);
 	del_timer_sync(&adapter->watchdog_timer);
 
+	/* disable transmits in the hardware now that interrupts are off */
+	for (i = 0; i < adapter->num_tx_queues; i++) {
+		j = adapter->tx_ring[i].reg_idx;
+		txdctl = IXGBE_READ_REG(hw, IXGBE_TXDCTL(j));
+		IXGBE_WRITE_REG(hw, IXGBE_TXDCTL(j),
+		                (txdctl & ~IXGBE_TXDCTL_ENABLE));
+	}
+
 	netif_carrier_off(netdev);
-	netif_stop_queue(netdev);
 
 	ixgbe_reset(adapter);
 	ixgbe_clean_all_tx_rings(adapter);
 	ixgbe_clean_all_rx_rings(adapter);
-
 }
 
 static int ixgbe_suspend(struct pci_dev *pdev, pm_message_t state)
@@ -2069,6 +2443,11 @@ static void ixgbe_reset_task(struct work_struct *work)
 	struct ixgbe_adapter *adapter;
 	adapter = container_of(work, struct ixgbe_adapter, reset_task);
 
+	/* If we're already down or resetting, just bail */
+	if (test_bit(__IXGBE_DOWN, &adapter->state) ||
+	    test_bit(__IXGBE_RESETTING, &adapter->state))
+		return;
+
 	adapter->tx_timeout_count++;
 
 	ixgbe_reinit_locked(adapter);
@@ -2112,6 +2491,7 @@ static void ixgbe_acquire_msix_vectors(struct ixgbe_adapter *adapter,
 		adapter->flags &= ~IXGBE_FLAG_MSIX_ENABLED;
 		kfree(adapter->msix_entries);
 		adapter->msix_entries = NULL;
+		adapter->flags &= ~IXGBE_FLAG_DCB_ENABLED;
 		adapter->flags &= ~IXGBE_FLAG_RSS_ENABLED;
 		adapter->num_tx_queues = 1;
 		adapter->num_rx_queues = 1;
@@ -2121,19 +2501,39 @@ static void ixgbe_acquire_msix_vectors(struct ixgbe_adapter *adapter,
 	}
 }
 
-static void __devinit ixgbe_set_num_queues(struct ixgbe_adapter *adapter)
+static void ixgbe_set_num_queues(struct ixgbe_adapter *adapter)
 {
-	int nrq, ntq;
+	int nrq = 1, ntq = 1;
 	int feature_mask = 0, rss_i, rss_m;
+	int dcb_i, dcb_m;
 
 	/* Number of supported queues */
 	switch (adapter->hw.mac.type) {
 	case ixgbe_mac_82598EB:
+		dcb_i = adapter->ring_feature[RING_F_DCB].indices;
+		dcb_m = 0;
 		rss_i = adapter->ring_feature[RING_F_RSS].indices;
 		rss_m = 0;
+		feature_mask |= IXGBE_FLAG_DCB_ENABLED;
 		feature_mask |= IXGBE_FLAG_RSS_ENABLED;
 
 		switch (adapter->flags & feature_mask) {
+		case (IXGBE_FLAG_DCB_ENABLED):
+#ifdef CONFIG_NETDEVICES_MULTIQUEUE
+			dcb_m = 0x7 << 3;
+			nrq = dcb_i;
+			ntq = dcb_i;
+#else
+			printk(KERN_INFO, "Kernel has no multiqueue "
+				"support, disabling DCB.\n");
+			/* Fall back onto RSS */
+			rss_m = 0xF;
+			nrq = rss_i;
+			ntq = 1;
+			dcb_m = 0;
+			dcb_i = 0;
+#endif
+			break;
 		case (IXGBE_FLAG_RSS_ENABLED):
 			rss_m = 0xF;
 			nrq = rss_i;
@@ -2145,6 +2545,8 @@ static void __devinit ixgbe_set_num_queues(struct ixgbe_adapter *adapter)
 			break;
 		case 0:
 		default:
+			dcb_i = 0;
+			dcb_m = 0;
 			rss_i = 0;
 			rss_m = 0;
 			nrq = 1;
@@ -2152,6 +2554,8 @@ static void __devinit ixgbe_set_num_queues(struct ixgbe_adapter *adapter)
 			break;
 		}
 
+		adapter->ring_feature[RING_F_DCB].indices = dcb_i;
+		adapter->ring_feature[RING_F_DCB].mask = dcb_m;
 		adapter->ring_feature[RING_F_RSS].indices = rss_i;
 		adapter->ring_feature[RING_F_RSS].mask = rss_m;
 		break;
@@ -2179,15 +2583,25 @@ static void __devinit ixgbe_cache_ring_register(struct ixgbe_adapter *adapter)
 	 */
 	int feature_mask = 0, rss_i;
 	int i, txr_idx, rxr_idx;
+	int dcb_i;
 
 	/* Number of supported queues */
 	switch (adapter->hw.mac.type) {
 	case ixgbe_mac_82598EB:
+		dcb_i = adapter->ring_feature[RING_F_DCB].indices;
 		rss_i = adapter->ring_feature[RING_F_RSS].indices;
 		txr_idx = 0;
 		rxr_idx = 0;
+		feature_mask |= IXGBE_FLAG_DCB_ENABLED;
 		feature_mask |= IXGBE_FLAG_RSS_ENABLED;
 		switch (adapter->flags & feature_mask) {
+		case (IXGBE_FLAG_DCB_ENABLED):
+			/* the number of queues is assumed to be symmetric */
+			for (i = 0; i < dcb_i; i++) {
+				adapter->rx_ring[i].reg_idx = i << 3;
+				adapter->tx_ring[i].reg_idx = i << 2;
+			}
+			break;
 		case (IXGBE_FLAG_RSS_ENABLED):
 			for (i = 0; i < adapter->num_rx_queues; i++)
 				adapter->rx_ring[i].reg_idx = i;
@@ -2212,7 +2626,7 @@ static void __devinit ixgbe_cache_ring_register(struct ixgbe_adapter *adapter)
  * number of queues at compile-time.  The polling_netdev array is
  * intended for Multiqueue, but should work fine with a single queue.
  **/
-static int __devinit ixgbe_alloc_queues(struct ixgbe_adapter *adapter)
+static int ixgbe_alloc_queues(struct ixgbe_adapter *adapter)
 {
 	int i;
 
@@ -2252,7 +2666,7 @@ err_tx_ring_allocation:
  * Attempt to configure the interrupts using the best available
  * capabilities of the hardware and the kernel.
  **/
-static int __devinit ixgbe_set_interrupt_capability(struct ixgbe_adapter
+static int ixgbe_set_interrupt_capability(struct ixgbe_adapter
 						    *adapter)
 {
 	int err = 0;
@@ -2281,6 +2695,7 @@ static int __devinit ixgbe_set_interrupt_capability(struct ixgbe_adapter
 	adapter->msix_entries = kcalloc(v_budget,
 					sizeof(struct msix_entry), GFP_KERNEL);
 	if (!adapter->msix_entries) {
+		adapter->flags &= ~IXGBE_FLAG_DCB_ENABLED;
 		adapter->flags &= ~IXGBE_FLAG_RSS_ENABLED;
 		ixgbe_set_num_queues(adapter);
 		kfree(adapter->tx_ring);
@@ -2323,7 +2738,7 @@ out:
 	return err;
 }
 
-static void ixgbe_reset_interrupt_capability(struct ixgbe_adapter *adapter)
+void ixgbe_reset_interrupt_capability(struct ixgbe_adapter *adapter)
 {
 	if (adapter->flags & IXGBE_FLAG_MSIX_ENABLED) {
 		adapter->flags &= ~IXGBE_FLAG_MSIX_ENABLED;
@@ -2347,7 +2762,7 @@ static void ixgbe_reset_interrupt_capability(struct ixgbe_adapter *adapter)
  * - Hardware queue count (num_*_queues)
  *   - defined by miscellaneous hardware support/features (RSS, etc.)
  **/
-static int __devinit ixgbe_init_interrupt_scheme(struct ixgbe_adapter *adapter)
+int ixgbe_init_interrupt_scheme(struct ixgbe_adapter *adapter)
 {
 	int err;
 
@@ -2395,11 +2810,29 @@ static int __devinit ixgbe_sw_init(struct ixgbe_adapter *adapter)
 	struct ixgbe_hw *hw = &adapter->hw;
 	struct pci_dev *pdev = adapter->pdev;
 	unsigned int rss;
+	int j;
+	struct tc_configuration *tc;
 
 	/* Set capability flags */
 	rss = min(IXGBE_MAX_RSS_INDICES, (int)num_online_cpus());
 	adapter->ring_feature[RING_F_RSS].indices = rss;
 	adapter->flags |= IXGBE_FLAG_RSS_ENABLED;
+	adapter->ring_feature[RING_F_DCB].indices = IXGBE_MAX_DCB_INDICES;
+	for (j = 0; j < MAX_TRAFFIC_CLASS; j++) {
+		tc = &adapter->dcb_cfg.tc_config[j];
+		tc->path[DCB_TX_CONFIG].bwg_id = 0;
+		tc->path[DCB_TX_CONFIG].bwg_percent = 12 + (j & 1);
+		tc->path[DCB_RX_CONFIG].bwg_id = 0;
+		tc->path[DCB_RX_CONFIG].bwg_percent = 12 + (j & 1);
+		tc->dcb_pfc = pfc_disabled;
+	}
+	adapter->dcb_cfg.bw_percentage[DCB_TX_CONFIG][0] = 100;
+	adapter->dcb_cfg.bw_percentage[DCB_RX_CONFIG][0] = 100;
+	adapter->dcb_cfg.rx_pba_cfg = pba_equal;
+	adapter->dcb_cfg.round_robin_enable = false;
+	adapter->dcb_set_bitmap = 0x00;
+	ixgbe_copy_dcb_cfg(&adapter->dcb_cfg, &adapter->temp_dcb_cfg,
+			   adapter->ring_feature[RING_F_DCB].indices);
 
 	/* Enable Dynamic interrupt throttling by default */
 	adapter->rx_eitr = 1;
@@ -2681,7 +3114,7 @@ static int ixgbe_change_mtu(struct net_device *netdev, int new_mtu)
  * handler is registered with the OS, the watchdog timer is started,
  * and the stack is notified that the interface is ready.
  **/
-static int ixgbe_open(struct net_device *netdev)
+int ixgbe_open(struct net_device *netdev)
 {
 	struct ixgbe_adapter *adapter = netdev_priv(netdev);
 	int err;
@@ -2736,7 +3169,7 @@ err_setup_tx:
  * needs to be disabled.  A global MAC reset is issued to stop the
  * hardware, and all transmit and receive resources are freed.
  **/
-static int ixgbe_close(struct net_device *netdev)
+int ixgbe_close(struct net_device *netdev)
 {
 	struct ixgbe_adapter *adapter = netdev_priv(netdev);
 
@@ -2769,6 +3202,18 @@ void ixgbe_update_stats(struct ixgbe_adapter *adapter)
 		adapter->stats.mpc[i] += mpc;
 		total_mpc += adapter->stats.mpc[i];
 		adapter->stats.rnbc[i] += IXGBE_READ_REG(hw, IXGBE_RNBC(i));
+		adapter->stats.qptc[i] += IXGBE_READ_REG(hw, IXGBE_QPTC(i));
+		adapter->stats.qbtc[i] += IXGBE_READ_REG(hw, IXGBE_QBTC(i));
+		adapter->stats.qprc[i] += IXGBE_READ_REG(hw, IXGBE_QPRC(i));
+		adapter->stats.qbrc[i] += IXGBE_READ_REG(hw, IXGBE_QBRC(i));
+		adapter->stats.pxonrxc[i] += IXGBE_READ_REG(hw,
+		                                            IXGBE_PXONRXC(i));
+		adapter->stats.pxontxc[i] += IXGBE_READ_REG(hw,
+		                                            IXGBE_PXONTXC(i));
+		adapter->stats.pxoffrxc[i] += IXGBE_READ_REG(hw,
+		                                             IXGBE_PXOFFRXC(i));
+		adapter->stats.pxofftxc[i] += IXGBE_READ_REG(hw,
+		                                             IXGBE_PXOFFTXC(i));
 	}
 	adapter->stats.gprc += IXGBE_READ_REG(hw, IXGBE_GPRC);
 	/* work around hardware counting issue */
@@ -2865,10 +3310,9 @@ static void ixgbe_watchdog(unsigned long data)
 
 			netif_carrier_on(netdev);
 			netif_wake_queue(netdev);
-#ifdef CONFIG_NETDEVICES_MULTIQUEUE
-			for (i = 0; i < adapter->num_tx_queues; i++)
-				netif_wake_subqueue(netdev, i);
-#endif
+			if (netif_is_multiqueue(netdev))
+				for (i = 0; i < adapter->num_tx_queues; i++)
+					netif_wake_subqueue(netdev, i);
 		} else {
 			/* Force detection of hung controller */
 			adapter->detect_tx_hung = true;
@@ -2878,6 +3322,9 @@ static void ixgbe_watchdog(unsigned long data)
 			DPRINTK(LINK, INFO, "NIC Link is Down\n");
 			netif_carrier_off(netdev);
 			netif_stop_queue(netdev);
+			if (netif_is_multiqueue(netdev))
+				for (i = 0; i < adapter->num_tx_queues; i++)
+					netif_stop_subqueue(netdev, i);
 		}
 	}
 
@@ -2972,6 +3419,8 @@ static int ixgbe_tso(struct ixgbe_adapter *adapter,
 		mss_l4len_idx |=
 		    (skb_shinfo(skb)->gso_size << IXGBE_ADVTXD_MSS_SHIFT);
 		mss_l4len_idx |= (l4len << IXGBE_ADVTXD_L4LEN_SHIFT);
+		/* use index 1 for TSO */
+		mss_l4len_idx |= (1 << IXGBE_ADVTXD_IDX_SHIFT);
 		context_desc->mss_l4len_idx = cpu_to_le32(mss_l4len_idx);
 
 		tx_buffer_info->time_stamp = jiffies;
@@ -3044,6 +3493,7 @@ static bool ixgbe_tx_csum(struct ixgbe_adapter *adapter,
 		}
 
 		context_desc->type_tucmd_mlhl = cpu_to_le32(type_tucmd_mlhl);
+		/* use index zero for tx checksum offload */
 		context_desc->mss_l4len_idx = 0;
 
 		tx_buffer_info->time_stamp = jiffies;
@@ -3152,6 +3602,8 @@ static void ixgbe_tx_queue(struct ixgbe_adapter *adapter,
 		olinfo_status |= IXGBE_TXD_POPTS_TXSM <<
 						IXGBE_ADVTXD_POPTS_SHIFT;
 
+		/* use index 1 context for tso */
+		olinfo_status |= (1 << IXGBE_ADVTXD_IDX_SHIFT);
 		if (tx_flags & IXGBE_TX_FLAGS_IPV4)
 			olinfo_status |= IXGBE_TXD_POPTS_IXSM <<
 						IXGBE_ADVTXD_POPTS_SHIFT;
@@ -3195,11 +3647,11 @@ static int __ixgbe_maybe_stop_tx(struct net_device *netdev,
 {
 	struct ixgbe_adapter *adapter = netdev_priv(netdev);
 
-#ifdef CONFIG_NETDEVICES_MULTIQUEUE
-	netif_stop_subqueue(netdev, tx_ring->queue_index);
-#else
-	netif_stop_queue(netdev);
-#endif
+	if (netif_is_multiqueue(netdev))
+		netif_stop_subqueue(netdev, tx_ring->queue_index);
+	else
+		netif_stop_queue(netdev);
+
 	/* Herbert's original patch had:
 	 *  smp_mb__after_netif_stop_queue();
 	 * but since that doesn't exist yet, just open code it. */
@@ -3211,11 +3663,10 @@ static int __ixgbe_maybe_stop_tx(struct net_device *netdev,
 		return -EBUSY;
 
 	/* A reprieve! - use start_queue because it doesn't call schedule */
-#ifdef CONFIG_NETDEVICES_MULTIQUEUE
-	netif_wake_subqueue(netdev, tx_ring->queue_index);
-#else
-	netif_wake_queue(netdev);
-#endif
+	if (netif_is_multiqueue(netdev))
+		netif_start_subqueue(netdev, tx_ring->queue_index);
+	else
+		netif_start_queue(netdev);
 	++adapter->restart_queue;
 	return 0;
 }
@@ -3253,6 +3704,7 @@ static int ixgbe_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
 		dev_kfree_skb(skb);
 		return NETDEV_TX_OK;
 	}
+
 	mss = skb_shinfo(skb)->gso_size;
 
 	if (mss)
@@ -3269,8 +3721,21 @@ static int ixgbe_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
 		return NETDEV_TX_BUSY;
 	}
 	if (adapter->vlgrp && vlan_tx_tag_present(skb)) {
+		tx_flags |= vlan_tx_tag_get(skb);
+#ifdef CONFIG_NETDEVICES_MULTIQUEUE
+		if (adapter->flags & IXGBE_FLAG_DCB_ENABLED) {
+			tx_flags &= ~IXGBE_TX_FLAGS_VLAN_PRIO_MASK;
+			tx_flags |= (skb->queue_mapping << 13);
+		}
+#endif
+		tx_flags <<= IXGBE_TX_FLAGS_VLAN_SHIFT;
 		tx_flags |= IXGBE_TX_FLAGS_VLAN;
-		tx_flags |= (vlan_tx_tag_get(skb) << IXGBE_TX_FLAGS_VLAN_SHIFT);
+#ifdef CONFIG_NETDEVICES_MULTIQUEUE
+	} else if (adapter->flags & IXGBE_FLAG_DCB_ENABLED) {
+		tx_flags |= (skb->queue_mapping << 13);
+		tx_flags <<= IXGBE_TX_FLAGS_VLAN_SHIFT;
+		tx_flags |= IXGBE_TX_FLAGS_VLAN;
+#endif
 	}
 
 	if (skb->protocol == htons(ETH_P_IP))
@@ -3520,13 +3985,16 @@ static int __devinit ixgbe_probe(struct pci_dev *pdev,
 	netdev->features |= NETIF_F_TSO;
 
 	netdev->features |= NETIF_F_TSO6;
-	if (pci_using_dac)
-		netdev->features |= NETIF_F_HIGHDMA;
+	if (adapter->flags & IXGBE_FLAG_DCB_ENABLED)
+		adapter->flags &= ~IXGBE_FLAG_RSS_ENABLED;
 
-#ifdef CONFIG_NETDEVICES_MULTIQUEUE
-	netdev->features |= NETIF_F_MULTI_QUEUE;
+#ifdef CONFIG_DCBNL
+	netdev->dcbnl_ops = &dcbnl_ops;
 #endif
 
+	if (pci_using_dac)
+		netdev->features |= NETIF_F_HIGHDMA;
+
 	/* make sure the EEPROM is good */
 	if (ixgbe_validate_eeprom_checksum(hw, NULL) < 0) {
 		dev_err(&pdev->dev, "The EEPROM Checksum Is Not Valid\n");
@@ -3593,10 +4061,9 @@ static int __devinit ixgbe_probe(struct pci_dev *pdev,
 
 	netif_carrier_off(netdev);
 	netif_stop_queue(netdev);
-#ifdef CONFIG_NETDEVICES_MULTIQUEUE
-	for (i = 0; i < adapter->num_tx_queues; i++)
-		netif_stop_subqueue(netdev, i);
-#endif
+	if (netif_is_multiqueue(netdev))
+		for (i = 0; i < adapter->num_tx_queues; i++)
+			netif_stop_subqueue(netdev, i);
 
 	ixgbe_napi_add_all(adapter);
 
@@ -3782,7 +4249,6 @@ static struct pci_driver ixgbe_driver = {
  **/
 static int __init ixgbe_init_module(void)
 {
-	int ret;
 	printk(KERN_INFO "%s: %s - version %s\n", ixgbe_driver_name,
 	       ixgbe_driver_string, ixgbe_driver_version);
 
@@ -3792,8 +4258,7 @@ static int __init ixgbe_init_module(void)
 	dca_register_notify(&dca_notifier);
 
 #endif
-	ret = pci_register_driver(&ixgbe_driver);
-	return ret;
+	return pci_register_driver(&ixgbe_driver);
 }
 module_init(ixgbe_init_module);
 


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH] NET: DCB generic netlink interface
  2008-05-27 14:13 [PATCH] NET: DCB generic netlink interface PJ Waskiewicz
                   ` (2 preceding siblings ...)
  2008-05-27 14:13 ` [PATCH 3/3] ixgbe: Enable Data Center Bridging (DCB) support PJ Waskiewicz
@ 2008-06-04 18:44 ` David Miller
  2008-06-05  6:23   ` Waskiewicz Jr, Peter P
  3 siblings, 1 reply; 24+ messages in thread
From: David Miller @ 2008-06-04 18:44 UTC (permalink / raw)
  To: peter.p.waskiewicz.jr; +Cc: jeff, netdev

From: PJ Waskiewicz <peter.p.waskiewicz.jr@intel.com>
Date: Tue, 27 May 2008 07:13:39 -0700

> This patchset adds the initial DCB generic netlink interface to the kernel.
> It adds the layer as a generic interface for any DCB-capable device through
> the netdevice.
> 
> This patchset also includes an implementation using this interface in the
> ixgbe driver.  It adds the hardware-specific code to turn the interface on,
> and includes the netlink callbacks in the driver to perform the requested
> operations.
> 
> These patches are targeted at the net-next-2.6 tree, for 2.6.27.  The patch
> series is as follows:
> 
> patch 1: DCB netlink interface in-kernel
> patch 2: ixgbe DCB hardware-specific patches
> patch 3: enable DCB in ixgbe

Overall the changes look OK.  In particular the netlink implementation
looks clean.

However we need to think about how this stuff overlaps with existing
'tc' facilities.  For example, what we really need to do here is
define this generic DCB interface such that it normally just sits on
top of a software scheduler layer implementation and therefore there
are always non-NULL DCB ops to invoke.

If there is a device that can implement this in hardware, that's
fine and we define some interface for invoking that.

Because of that, the netdevice is likely not the correct place for the
ops (the only actual ugly part of the patches in my opinion).

I'm still very active travelling which is why I haven't responded to
this earlier.  I ask that you express some understanding about this as
there is really nothing I can do to review these kinds of important
changes properly when I am changing 10 timezones every other day.

Besides we're still in bug fix phase, so nothing I say will get this
upstream into Linus's tree any faster, and we really need to get
something like this right because it will be hard to undo this
afterwards if we get it wrong.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCH] NET: DCB generic netlink interface
  2008-06-04 18:44 ` [PATCH] NET: DCB generic netlink interface David Miller
@ 2008-06-05  6:23   ` Waskiewicz Jr, Peter P
  2008-06-05 14:43     ` David Miller
  0 siblings, 1 reply; 24+ messages in thread
From: Waskiewicz Jr, Peter P @ 2008-06-05  6:23 UTC (permalink / raw)
  To: David Miller; +Cc: jeff, netdev

> From: David Miller [mailto:davem@davemloft.net] 
> Sent: Wednesday, June 04, 2008 11:45 AM
> To: Waskiewicz Jr, Peter P
> Cc: jeff@garzik.org; netdev@vger.kernel.org
> Subject: Re: [PATCH] NET: DCB generic netlink interface

> Overall the changes look OK.  In particular the netlink 
> implementation looks clean.
> 
> However we need to think about how this stuff overlaps with 
> existing 'tc' facilities.  For example, what we really need 
> to do here is define this generic DCB interface such that it 
> normally just sits on top of a software scheduler layer 
> implementation and therefore there are always non-NULL DCB 
> ops to invoke.

I'm not sure I follow this.  DCB is a scheduling policy, but that
scheduling policy is in the hardware.  The configuration interface,
which is what this is, happens out of band of any scheduling policies in
the kernel.  It's very analogous to the wireless configuration layer for
mac80211 that uses generic netlink.

> If there is a device that can implement this in hardware, 
> that's fine and we define some interface for invoking that.
> 
> Because of that, the netdevice is likely not the correct 
> place for the ops (the only actual ugly part of the patches 
> in my opinion).

I agree that having this in the netdevice isn't great.  But given the
nature of how the hardware can be configured (via a user on the host, or
via messages from a switch to the userspace utilities), I think it needs
to be in there.  It's also the only common place the userspace can
enumerate an ethernet device from userspace; tc is another way, but the
ethernet device is used, along with the qdisc, which is part of the
netdevice.  But I am certainly all ears for any suggestions to not have
it in the netdevice if it can be done.

> I'm still very active travelling which is why I haven't 
> responded to this earlier.  I ask that you express some 
> understanding about this as there is really nothing I can do 
> to review these kinds of important changes properly when I am 
> changing 10 timezones every other day.

I can certainly sympathize; I've been travelling in Israel this past
week, and will be getting back to Portland on Friday.  Having a 10-hour
difference from what your normal timezone is is certainly challenging.

> Besides we're still in bug fix phase, so nothing I say will 
> get this upstream into Linus's tree any faster, and we really 
> need to get something like this right because it will be hard 
> to undo this afterwards if we get it wrong.

Totally understood.  My goal though is to make sure any
feedback/suggestions can be digested prior to the merge window, since
the chances of getting any good review during the merge window is next
to impossible, given the amount of patches flying in that have already
been queued.  I just want to make sure I'm prepared for a good
submission by the time the merge window opens.

Thanks Dave.  Any guidance from you, or anyone else, as to how I can get
this into better shape for acceptance, I'm all ears.

Cheers,
-PJ Waskiewicz

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] NET: DCB generic netlink interface
  2008-06-05  6:23   ` Waskiewicz Jr, Peter P
@ 2008-06-05 14:43     ` David Miller
  2008-06-05 20:29       ` Thomas Graf
  2008-06-10 19:55       ` Waskiewicz Jr, Peter P
  0 siblings, 2 replies; 24+ messages in thread
From: David Miller @ 2008-06-05 14:43 UTC (permalink / raw)
  To: peter.p.waskiewicz.jr; +Cc: jeff, netdev

From: "Waskiewicz Jr, Peter P" <peter.p.waskiewicz.jr@intel.com>
Date: Wed, 4 Jun 2008 23:23:00 -0700

> I'm not sure I follow this.  DCB is a scheduling policy, but that
> scheduling policy is in the hardware.  The configuration interface,
> which is what this is, happens out of band of any scheduling policies in
> the kernel.  It's very analogous to the wireless configuration layer for
> mac80211 that uses generic netlink.

And I'm saying we should have a equivalent software scheduler in
the kernel that can implement this if the hardware offloaded version
isn't present.

It overlaps existing functionality to a certain extent, and there is
no real reason for that overlap to exist.  The question is which
(the existing facilities or the new one) subsumes which.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] NET: DCB generic netlink interface
  2008-06-05 14:43     ` David Miller
@ 2008-06-05 20:29       ` Thomas Graf
  2008-06-10 19:55       ` Waskiewicz Jr, Peter P
  1 sibling, 0 replies; 24+ messages in thread
From: Thomas Graf @ 2008-06-05 20:29 UTC (permalink / raw)
  To: David Miller; +Cc: peter.p.waskiewicz.jr, jeff, netdev

* David Miller <davem@davemloft.net> 2008-06-05 07:43
> From: "Waskiewicz Jr, Peter P" <peter.p.waskiewicz.jr@intel.com>
> Date: Wed, 4 Jun 2008 23:23:00 -0700
> 
> > I'm not sure I follow this.  DCB is a scheduling policy, but that
> > scheduling policy is in the hardware.  The configuration interface,
> > which is what this is, happens out of band of any scheduling policies in
> > the kernel.  It's very analogous to the wireless configuration layer for
> > mac80211 that uses generic netlink.
> 
> And I'm saying we should have a equivalent software scheduler in
> the kernel that can implement this if the hardware offloaded version
> isn't present.

I agree, I think we should make it possible to use DCB without the strict
need for hardware support. Our current classful qdiscs htb and cbq already
fullfil the requirements of priority grouping as specified in 802.1.qaz
except that the configuration interface is different as it does not use
percentage values.

There is also demand for a networking cgroup subsystem which requires
exactly the same kind of priority grouping. It would be only logical
to be able to use hardware support where possible.

Therefore I believe the way to go is to build a new or extend the
current tc system. Like with the current default qdisc, it should be
possible to automatically create qdiscs for DCB or cgroup while still
allowing to overwrite the tc configuration with a custom tree for

As Patrick pointed out, this probably belongs in the rtnetlink family as
it will be strictly attached to the net_device.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCH] NET: DCB generic netlink interface
  2008-06-05 14:43     ` David Miller
  2008-06-05 20:29       ` Thomas Graf
@ 2008-06-10 19:55       ` Waskiewicz Jr, Peter P
  2008-06-10 20:07         ` David Miller
  2008-06-11 17:51         ` Thomas Graf
  1 sibling, 2 replies; 24+ messages in thread
From: Waskiewicz Jr, Peter P @ 2008-06-10 19:55 UTC (permalink / raw)
  To: David Miller; +Cc: jeff, netdev

> From: "Waskiewicz Jr, Peter P" <peter.p.waskiewicz.jr@intel.com>
> Date: Wed, 4 Jun 2008 23:23:00 -0700
> 
> > I'm not sure I follow this.  DCB is a scheduling policy, but that 
> > scheduling policy is in the hardware.  The configuration interface, 
> > which is what this is, happens out of band of any 
> scheduling policies 
> > in the kernel.  It's very analogous to the wireless configuration 
> > layer for
> > mac80211 that uses generic netlink.
> 
> And I'm saying we should have a equivalent software scheduler 
> in the kernel that can implement this if the hardware 
> offloaded version isn't present.

I really don't think this is something that would work in software.  I
agree that having a bandwidth grouping like 802.1Qaz would be somewhat
useful, but that's the only piece of DCB that would work in software.
And you can achieve the same behavior using sch_prio with cbq or htb on
the nodes, minus the full link aggregation.

The 802.1Qbb, per-priority pause (flow control), cannot work in a
software implementation.  This is a new flow control frame processed by
the MAC for each priority on the link.  Also, the Rx filtering can't be
emulated in software either.  The MAC filters on VLAN priority.  I know
that can be configured with vconfig and set_ingress_map, but the whole
point of the technology is to have the Rx processing done in the
hardware's packet buffers, much like RSS filtering.

This technology really is a hardware-based technology.  The piece that
we could effectively implement in software is the Tx priority grouping.
But that's a small piece of the whole technology.  All we're trying to
do is provide the method of configuring the hardware for this
technology, which is closely coupled with the FCoE work going on in the
SCSI world.  I don't think anyone would benefit trying to emulate it in
software, since everything we can implement in the software can already
be achieved using existing facilities.

> It overlaps existing functionality to a certain extent, and 
> there is no real reason for that overlap to exist.  The 
> question is which (the existing facilities or the new one) 
> subsumes which.

The existing facilities for traffic shaping and bandwidth aggregation do
overlap if you loaded those qdiscs with a DCB device.  But I don't think
the existing qdiscs should be removed or modified; the two technologies
are too different, in my opinion, to be combined in software.

Thanks Dave,
-PJ Waskiewicz

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] NET: DCB generic netlink interface
  2008-06-10 19:55       ` Waskiewicz Jr, Peter P
@ 2008-06-10 20:07         ` David Miller
  2008-06-11 17:51         ` Thomas Graf
  1 sibling, 0 replies; 24+ messages in thread
From: David Miller @ 2008-06-10 20:07 UTC (permalink / raw)
  To: peter.p.waskiewicz.jr; +Cc: jeff, netdev

From: "Waskiewicz Jr, Peter P" <peter.p.waskiewicz.jr@intel.com>
Date: Tue, 10 Jun 2008 12:55:16 -0700

> The 802.1Qbb, per-priority pause (flow control), cannot work in a
> software implementation.

Of course, I know that.

> Also, the Rx filtering can't be emulated in software either.  The
> MAC filters on VLAN priority.  I know that can be configured with
> vconfig and set_ingress_map, but the whole point of the technology
> is to have the Rx processing done in the hardware's packet buffers,
> much like RSS filtering.

This is a scare crow, please don't use arguments like that.

Saying that it can't be done at all in software, but then saying
"well, it sort of can be done, but the point is to do it in hardware"
side-steps the very reason I want you to implement a software variant
of the parts that can be done in software.

> This technology really is a hardware-based technology.

This sounds like another way of saying "having a software
implementation of even some of this facility would compromise
the value of our hardware implementation."

That's not the kind of decision making process we use when
deciding how to implement things in the kernel.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] NET: DCB generic netlink interface
  2008-06-10 19:55       ` Waskiewicz Jr, Peter P
  2008-06-10 20:07         ` David Miller
@ 2008-06-11 17:51         ` Thomas Graf
  2008-06-11 17:50           ` Patrick McHardy
  2008-06-11 18:28           ` Waskiewicz Jr, Peter P
  1 sibling, 2 replies; 24+ messages in thread
From: Thomas Graf @ 2008-06-11 17:51 UTC (permalink / raw)
  To: Waskiewicz Jr, Peter P; +Cc: David Miller, jeff, netdev

* Waskiewicz Jr, Peter P <peter.p.waskiewicz.jr@intel.com> 2008-06-10 12:55
> I really don't think this is something that would work in software.  I
> agree that having a bandwidth grouping like 802.1Qaz would be somewhat
> useful, but that's the only piece of DCB that would work in software.
> And you can achieve the same behavior using sch_prio with cbq or htb on
> the nodes, minus the full link aggregation.
> 
> The 802.1Qbb, per-priority pause (flow control), cannot work in a
> software implementation.  This is a new flow control frame processed by
> the MAC for each priority on the link.  Also, the Rx filtering can't be
> emulated in software either.  The MAC filters on VLAN priority.  I know
> that can be configured with vconfig and set_ingress_map, but the whole
> point of the technology is to have the Rx processing done in the
> hardware's packet buffers, much like RSS filtering.

Everything is possible in software as long as the hardware doesn't hide
the congestion information. It would be very useful to pass congestion
information received by 802.1Qau frames to the kernel for use when
selecting the nexthop or for the routing daemon to make decisions on.
So far we could only react to link states, now we could actually react to
link congestion on the routing layer.

There is no doubt that doing the prioritization in hardware is much
preferred but we should try and integrate it with other tc techniques.
F.e. it would be great if we could control DCB via skb->tc_index if 
that is possible. It would allow to define DCB traffic classes with the
rich features of existing classifiers. I've seen there is a mapping
functionality although I haven't found any documentation on how to use
it exactly.

Another area of interest is sending congestion frames on our own. We
could finally implement real ingress software shaping and turn every
linux system into a DCB capable node.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] NET: DCB generic netlink interface
  2008-06-11 17:51         ` Thomas Graf
@ 2008-06-11 17:50           ` Patrick McHardy
  2008-06-11 21:28             ` Thomas Graf
  2008-06-11 18:28           ` Waskiewicz Jr, Peter P
  1 sibling, 1 reply; 24+ messages in thread
From: Patrick McHardy @ 2008-06-11 17:50 UTC (permalink / raw)
  To: Thomas Graf; +Cc: Waskiewicz Jr, Peter P, David Miller, jeff, netdev

Thomas Graf wrote:
> Another area of interest is sending congestion frames on our own. We
> could finally implement real ingress software shaping and turn every
> linux system into a DCB capable node.

There was a qdisc submission for a scheduler called "pace" (IIRC)
that did this. It needed some cleanups before merging, but nothing
grave.



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] NET: DCB generic netlink interface
  2008-06-11 17:50           ` Patrick McHardy
@ 2008-06-11 21:28             ` Thomas Graf
  2008-06-12 10:17               ` Patrick McHardy
  0 siblings, 1 reply; 24+ messages in thread
From: Thomas Graf @ 2008-06-11 21:28 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Waskiewicz Jr, Peter P, David Miller, jeff, netdev

* Patrick McHardy <kaber@trash.net> 2008-06-11 19:50
> Thomas Graf wrote:
> >Another area of interest is sending congestion frames on our own. We
> >could finally implement real ingress software shaping and turn every
> >linux system into a DCB capable node.
> 
> There was a qdisc submission for a scheduler called "pace" (IIRC)
> that did this. It needed some cleanups before merging, but nothing
> grave.

Do you know if it used congestion notifcation on link level? I can't
seem to find the posting.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] NET: DCB generic netlink interface
  2008-06-11 21:28             ` Thomas Graf
@ 2008-06-12 10:17               ` Patrick McHardy
  0 siblings, 0 replies; 24+ messages in thread
From: Patrick McHardy @ 2008-06-12 10:17 UTC (permalink / raw)
  To: Thomas Graf; +Cc: Waskiewicz Jr, Peter P, David Miller, jeff, netdev

Thomas Graf wrote:
> * Patrick McHardy <kaber@trash.net> 2008-06-11 19:50
>> Thomas Graf wrote:
>>> Another area of interest is sending congestion frames on our own. We
>>> could finally implement real ingress software shaping and turn every
>>> linux system into a DCB capable node.
>> There was a qdisc submission for a scheduler called "pace" (IIRC)
>> that did this. It needed some cleanups before merging, but nothing
>> grave.
> 
> Do you know if it used congestion notifcation on link level? I can't
> seem to find the posting.

They're using PAUSE frames. The latest submission I could find is:

http://marc.info/?t=119625135300006&r=1&w=2

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCH] NET: DCB generic netlink interface
  2008-06-11 17:51         ` Thomas Graf
  2008-06-11 17:50           ` Patrick McHardy
@ 2008-06-11 18:28           ` Waskiewicz Jr, Peter P
  2008-06-11 21:26             ` Thomas Graf
  1 sibling, 1 reply; 24+ messages in thread
From: Waskiewicz Jr, Peter P @ 2008-06-11 18:28 UTC (permalink / raw)
  To: Thomas Graf; +Cc: David Miller, jeff, netdev

> Everything is possible in software as long as the hardware doesn't
hide
> the congestion information. It would be very useful to pass congestion
> Information received by 802.1Qau frames to the kernel for use when
> selecting the nexthop or for the routing daemon to make decisions on.
> So far we could only react to link states, now we could actually react
to
> link congestion on the routing layer.

Congestion notification in 802.1Qau is certainly something we need to
support somewhere in the stack.  I was actually talking with one of our
hardware architects while I was in Israel last week about that exact
gap, since the BCN/QCN rate limiting will eventually drop packets if we
don't have a way of telling the upper layers to "slow down."  The
notification mechanism is also needed for 802.1Qbb, since the whole
point of the priority flow control is to provide a no-drop mechanism for
things like FCoE.  But if the upper layers (e.g. FCoE stack) don't know
to pause when the network is too congested, frames will be dropped,
which is bad.

802.1Qau is still being defined in IEEE unfortunately, and we and others
have no hardware that supports it to test the congestion notification
tag processing.  But it is something on our radar that needs to be
addressed.

> There is no doubt that doing the prioritization in hardware is much
> preferred but we should try and integrate it with other tc techniques.
> F.e. it would be great if we could control DCB via skb->tc_index if 
> that is possible. It would allow to define DCB traffic classes with
the
> rich features of existing classifiers. I've seen there is a mapping
> functionality although I haven't found any documentation on how to use
> it exactly.

The prioritization is only one piece.  The bandwidth aggregation,
different modes of defining group strict vs. link strict priorities
within a bandwidth group, etc., are all hardware modes.  These modes
need to be in sync with the link partner (switch, back to back NIC), and
are kept in sync with the DCBX protocol via LLDP.

> Another area of interest is sending congestion frames on our own. We
> could finally implement real ingress software shaping and turn every
> linux system into a DCB capable node.

Once 802.1Qau is defined, and IEEE decides to use BCN or QCN, I think
this is a great direction to go in.  Right now the congestion
notification stuff is too up in the air to latch onto unfortunately.

Thanks for the comments Thomas,

-PJ Waskiewicz

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] NET: DCB generic netlink interface
  2008-06-11 18:28           ` Waskiewicz Jr, Peter P
@ 2008-06-11 21:26             ` Thomas Graf
  0 siblings, 0 replies; 24+ messages in thread
From: Thomas Graf @ 2008-06-11 21:26 UTC (permalink / raw)
  To: Waskiewicz Jr, Peter P; +Cc: David Miller, jeff, netdev

* Waskiewicz Jr, Peter P <peter.p.waskiewicz.jr@intel.com> 2008-06-11 11:28
> Congestion notification in 802.1Qau is certainly something we need to
> support somewhere in the stack.  I was actually talking with one of our
> hardware architects while I was in Israel last week about that exact
> gap, since the BCN/QCN rate limiting will eventually drop packets if we
> don't have a way of telling the upper layers to "slow down."

That's already possible by calling netif_stop_queue() respectively
netif_stop_subqueue(). Much more important for the upper layers to
know is when a congestion is about to happen so it can be avoided
with routing decisions.

> The prioritization is only one piece.  The bandwidth aggregation,
> different modes of defining group strict vs. link strict priorities
> within a bandwidth group, etc., are all hardware modes.  These modes
> need to be in sync with the link partner (switch, back to back NIC), and
> are kept in sync with the DCBX protocol via LLDP.

Again, this piece of hardware basically implements a classful qdisc
like htb or cbq except that it's limited to a flat tree. Yet, the
capabilities of the hardware may not be sufficient, therefore it must
be possible to combine hardware shaping with software qdiscs. It is
therefore crucial to find common grounds to exchange traffic class
information. Is your piece of hardware strictly limited to map VLANs
to traffic classes or would it be possible to attach traffic class
information to the packet in some way? (skb->tc_index) Could you
elaborate on how classification works, especially the configurable
mapping?

The most difficult part for me, an probably others as well, is that there
are no public documents available yet which would describe the direction
of where this is going, how complete the current implementation actually
is. We're pretty much looking at a black box which doesn't work very well
with the existing architecture and requires a completely separate
configuration interface. If we merge a configuration interface for DCB
now it will be pretty much written in stone, yet we have no idea what
other vendors may need.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [ANNOUNCE] ixgbe: Data Center Bridging (DCB) support for ixgbe
@ 2008-05-02  0:42 PJ Waskiewicz
  2008-05-02  0:43 ` [PATCH 3/3] ixgbe: Enable Data Center Bridging (DCB) support PJ Waskiewicz
  0 siblings, 1 reply; 24+ messages in thread
From: PJ Waskiewicz @ 2008-05-02  0:42 UTC (permalink / raw)
  To: jgarzik; +Cc: netdev

Jeff,

This patchset introduces Data Center Bridging, a scheduling technology
supported by Intel's 82598 silicon.  The technology uses 802.1p VLAN priority
tags to schedule and control traffic rates across an entire network.  It
also uses IEEE 802.1Qaz (priority grouping) and IEEE 802.1Qbb (priority
flow control) technologies, in order to physically separate traffic flows
that coexist on the same physical link.

The technology is initially targeting storage traffic and regular LAN traffic
on the same physical connection.  Using priority flow control, one flow can
be paused at the MAC level (same as 802.3 flow control) while not affecting
other flows running at different priorities.

The first patchset introduces a netlink interface for ixgbe, which is used
to configure all the DCB parameters that come either from userspace, or from
DCB-capable switch negotiation.

The second patchset introduces the hardware initialization code to turn this
whole technology on in the device.

The third patchset implements the netlink interface and hardware init code,
and enables DCB support in the driver.

Thanks,
-- 
PJ Waskiewicz <peter.p.waskiewicz.jr@intel.com>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 3/3] ixgbe: Enable Data Center Bridging (DCB) support
  2008-05-02  0:42 [ANNOUNCE] ixgbe: Data Center Bridging (DCB) support for ixgbe PJ Waskiewicz
@ 2008-05-02  0:43 ` PJ Waskiewicz
  0 siblings, 0 replies; 24+ messages in thread
From: PJ Waskiewicz @ 2008-05-02  0:43 UTC (permalink / raw)
  To: jgarzik; +Cc: netdev

This patch enables DCB support for 82598.  DCB is a technology implementing
the 802.1Qaz and 802.1Qbb IEEE standards for priority grouping and priority
pause.  The technology uses the 802.1p VLAN priority tags to identify
traffic on the network, and establish prioritization of the traffic
throughout the environment.  The 802.1Qbb priority flow control allows
MAC-level flow control on each of these priorities, creating 8 virtual
links in the network, allowing certain types of traffic to be paused while
not affecting other traffic types.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com>
---

 drivers/net/ixgbe/Makefile          |    3 
 drivers/net/ixgbe/ixgbe.h           |   28 +++-
 drivers/net/ixgbe/ixgbe_dcb_82598.c |    1 
 drivers/net/ixgbe/ixgbe_ethtool.c   |   36 +++++
 drivers/net/ixgbe/ixgbe_main.c      |  267 ++++++++++++++++++++++++++++-------
 5 files changed, 275 insertions(+), 60 deletions(-)

diff --git a/drivers/net/ixgbe/Makefile b/drivers/net/ixgbe/Makefile
index ccd83d9..2a45fa0 100644
--- a/drivers/net/ixgbe/Makefile
+++ b/drivers/net/ixgbe/Makefile
@@ -33,4 +33,5 @@
 obj-$(CONFIG_IXGBE) += ixgbe.o
 
 ixgbe-objs := ixgbe_main.o ixgbe_common.o ixgbe_ethtool.o \
-              ixgbe_82598.o ixgbe_phy.o
+              ixgbe_82598.o ixgbe_phy.o ixgbe_dcb.o ixgbe_dcb_82598.o \
+              ixgbe_dcb_nl.o
diff --git a/drivers/net/ixgbe/ixgbe.h b/drivers/net/ixgbe/ixgbe.h
index d981134..5098b9d 100644
--- a/drivers/net/ixgbe/ixgbe.h
+++ b/drivers/net/ixgbe/ixgbe.h
@@ -35,6 +35,7 @@
 
 #include "ixgbe_type.h"
 #include "ixgbe_common.h"
+#include "ixgbe_dcb.h"
 
 #ifdef CONFIG_DCA
 #include <linux/dca.h>
@@ -98,6 +99,7 @@
 #define IXGBE_TX_FLAGS_TSO		(u32)(1 << 2)
 #define IXGBE_TX_FLAGS_IPV4		(u32)(1 << 3)
 #define IXGBE_TX_FLAGS_VLAN_MASK	0xffff0000
+#define IXGBE_TX_FLAGS_VLAN_PRIO_MASK	0x0000e000
 #define IXGBE_TX_FLAGS_VLAN_SHIFT	16
 
 /* wrapper around a pointer to a socket buffer,
@@ -144,7 +146,7 @@ struct ixgbe_ring {
 
 	u16 reg_idx; /* holds the special value that gets the hardware register
 		      * offset associated with this ring, which is different
-		      * for DCE and RSS modes */
+		      * for DCB and RSS modes */
 
 #ifdef CONFIG_DCA
 	/* cpu for tx queue */
@@ -162,8 +164,10 @@ struct ixgbe_ring {
 	u16 work_limit;                /* max work per interrupt */
 };
 
+#define RING_F_DCB  0
 #define RING_F_VMDQ 1
 #define RING_F_RSS  2
+#define IXGBE_MAX_DCB_INDICES   8
 #define IXGBE_MAX_RSS_INDICES  16
 #define IXGBE_MAX_VMDQ_INDICES 16
 struct ixgbe_ring_feature {
@@ -174,6 +178,10 @@ struct ixgbe_ring_feature {
 #define MAX_RX_QUEUES 64
 #define MAX_TX_QUEUES 32
 
+#define MAX_RX_PACKET_BUFFERS ((adapter->flags & IXGBE_FLAG_DCB_ENABLED) \
+                               ? 8 : 1)
+#define MAX_TX_PACKET_BUFFERS MAX_RX_PACKET_BUFFERS
+
 /* MAX_MSIX_Q_VECTORS of these are allocated,
  * but we only use one per queue-specific vector.
  */
@@ -226,6 +234,9 @@ struct ixgbe_adapter {
 	struct work_struct reset_task;
 	struct ixgbe_q_vector q_vector[MAX_MSIX_Q_VECTORS];
 	char name[MAX_MSIX_COUNT][IFNAMSIZ + 5];
+	struct ixgbe_dcb_config dcb_cfg;
+	struct ixgbe_dcb_config temp_dcb_cfg;
+	u8 dcb_set_bitmap;
 
 	/* Interrupt Throttle Rate */
 	u32 itr_setting;
@@ -234,6 +245,7 @@ struct ixgbe_adapter {
 
 	/* TX */
 	struct ixgbe_ring *tx_ring;	/* One per active queue */
+	int num_tx_queues;
 	u64 restart_queue;
 	u64 lsc_int;
 	u64 hw_tso_ctxt;
@@ -243,12 +255,11 @@ struct ixgbe_adapter {
 
 	/* RX */
 	struct ixgbe_ring *rx_ring;	/* One per active queue */
+	int num_rx_queues;
 	u64 hw_csum_tx_good;
 	u64 hw_csum_rx_error;
 	u64 hw_csum_rx_good;
 	u64 non_eop_descs;
-	int num_tx_queues;
-	int num_rx_queues;
 	int num_msix_vectors;
 	struct ixgbe_ring_feature ring_feature[3];
 	struct msix_entry *msix_entries;
@@ -270,6 +281,7 @@ struct ixgbe_adapter {
 #define IXGBE_FLAG_RSS_ENABLED                  (u32)(1 << 6)
 #define IXGBE_FLAG_VMDQ_ENABLED                 (u32)(1 << 7)
 #define IXGBE_FLAG_DCA_ENABLED                  (u32)(1 << 8)
+#define IXGBE_FLAG_DCB_ENABLED                  (u32)(1 << 9)
 
 	/* OS defined structs */
 	struct net_device *netdev;
@@ -314,5 +326,15 @@ extern int ixgbe_setup_rx_resources(struct ixgbe_adapter *adapter,
 				    struct ixgbe_ring *rxdr);
 extern int ixgbe_setup_tx_resources(struct ixgbe_adapter *adapter,
 				    struct ixgbe_ring *txdr);
+/* needed by ixgbe_dcb_nl.c */
+extern void ixgbe_configure_dcb(struct ixgbe_adapter *adapter);
+extern int ixgbe_close(struct net_device *netdev);
+extern void ixgbe_reset_interrupt_capability(struct ixgbe_adapter *adapter);
+extern int ixgbe_open(struct net_device *netdev);
+extern int ixgbe_init_interrupt_scheme(struct ixgbe_adapter *adapter);
+extern bool ixgbe_is_ixgbe(struct pci_dev *pcidev);
+
+extern int ixgbe_dcb_netlink_register(void);
+extern int ixgbe_dcb_netlink_unregister(void);
 
 #endif /* _IXGBE_H_ */
diff --git a/drivers/net/ixgbe/ixgbe_dcb_82598.c b/drivers/net/ixgbe/ixgbe_dcb_82598.c
index 39b63ee..3c7f187 100644
--- a/drivers/net/ixgbe/ixgbe_dcb_82598.c
+++ b/drivers/net/ixgbe/ixgbe_dcb_82598.c
@@ -27,6 +27,7 @@
 *******************************************************************************/
 
 
+#include "ixgbe.h"
 #include "ixgbe_type.h"
 #include "ixgbe_dcb.h"
 #include "ixgbe_dcb_82598.h"
diff --git a/drivers/net/ixgbe/ixgbe_ethtool.c b/drivers/net/ixgbe/ixgbe_ethtool.c
index 4e46377..944f669 100644
--- a/drivers/net/ixgbe/ixgbe_ethtool.c
+++ b/drivers/net/ixgbe/ixgbe_ethtool.c
@@ -97,7 +97,17 @@ static struct ixgbe_stats ixgbe_gstrings_stats[] = {
 		 ((struct ixgbe_adapter *)netdev->priv)->num_rx_queues) * \
 		 (sizeof(struct ixgbe_queue_stats) / sizeof(u64)))
 #define IXGBE_GLOBAL_STATS_LEN	ARRAY_SIZE(ixgbe_gstrings_stats)
-#define IXGBE_STATS_LEN (IXGBE_GLOBAL_STATS_LEN + IXGBE_QUEUE_STATS_LEN)
+#define IXGBE_PB_STATS_LEN ( \
+		(((struct ixgbe_adapter *)netdev->priv)->flags & \
+		 IXGBE_FLAG_DCB_ENABLED) ? \
+		 (sizeof(((struct ixgbe_adapter *)0)->stats.pxonrxc) + \
+		  sizeof(((struct ixgbe_adapter *)0)->stats.pxontxc) + \
+		  sizeof(((struct ixgbe_adapter *)0)->stats.pxoffrxc) + \
+		  sizeof(((struct ixgbe_adapter *)0)->stats.pxofftxc)) \
+		 / sizeof(u64) : 0)
+#define IXGBE_STATS_LEN (IXGBE_GLOBAL_STATS_LEN + \
+		IXGBE_PB_STATS_LEN + \
+		IXGBE_QUEUE_STATS_LEN)
 
 static int ixgbe_get_settings(struct net_device *netdev,
 			      struct ethtool_cmd *ecmd)
@@ -806,6 +816,16 @@ static void ixgbe_get_ethtool_stats(struct net_device *netdev,
 			data[i + k] = queue_stat[k];
 		i += k;
 	}
+	if (adapter->flags & IXGBE_FLAG_DCB_ENABLED) {
+		for (j = 0; j < MAX_TX_PACKET_BUFFERS; j++) {
+			data[i++] = adapter->stats.pxontxc[j];
+			data[i++] = adapter->stats.pxofftxc[j];
+		}
+		for (j = 0; j < MAX_RX_PACKET_BUFFERS; j++) {
+			data[i++] = adapter->stats.pxonrxc[j];
+			data[i++] = adapter->stats.pxoffrxc[j];
+		}
+	}
 }
 
 static void ixgbe_get_strings(struct net_device *netdev, u32 stringset,
@@ -834,6 +854,20 @@ static void ixgbe_get_strings(struct net_device *netdev, u32 stringset,
 			sprintf(p, "rx_queue_%u_bytes", i);
 			p += ETH_GSTRING_LEN;
 		}
+		if (adapter->flags & IXGBE_FLAG_DCB_ENABLED) {
+			for (i = 0; i < MAX_TX_PACKET_BUFFERS; i++) {
+				sprintf(p, "tx_pb_%u_pxon", i);
+				p += ETH_GSTRING_LEN;
+				sprintf(p, "tx_pb_%u_pxoff", i);
+				p += ETH_GSTRING_LEN;
+			}
+			for (i = 0; i < MAX_RX_PACKET_BUFFERS; i++) {
+				sprintf(p, "rx_pb_%u_pxon", i);
+				p += ETH_GSTRING_LEN;
+				sprintf(p, "rx_pb_%u_pxoff", i);
+				p += ETH_GSTRING_LEN;
+			}
+		}
 /*		BUG_ON(p - data != IXGBE_STATS_LEN * ETH_GSTRING_LEN); */
 		break;
 	}
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index 7b85922..82312d7 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -1,7 +1,7 @@
 /*******************************************************************************
 
   Intel 10 Gigabit PCI Express Linux driver
-  Copyright(c) 1999 - 2007 Intel Corporation.
+  Copyright(c) 1999 - 2008 Intel Corporation.
 
   This program is free software; you can redistribute it and/or modify it
   under the terms and conditions of the GNU General Public License,
@@ -48,7 +48,7 @@ char ixgbe_driver_name[] = "ixgbe";
 static const char ixgbe_driver_string[] =
 	"Intel(R) 10 Gigabit PCI Express Network Driver";
 
-#define DRV_VERSION "1.3.18-k2"
+#define DRV_VERSION "1.3.26-k2"
 const char ixgbe_driver_version[] = DRV_VERSION;
 static const char ixgbe_copyright[] =
 	 "Copyright (c) 1999-2007 Intel Corporation.";
@@ -397,13 +397,13 @@ static void ixgbe_receive_skb(struct ixgbe_adapter *adapter,
 			      u16 tag)
 {
 	if (!(adapter->flags & IXGBE_FLAG_IN_NETPOLL)) {
-		if (adapter->vlgrp && is_vlan)
+		if (adapter->vlgrp && is_vlan && (tag != 0))
 			vlan_hwaccel_receive_skb(skb, adapter->vlgrp, tag);
 		else
 			netif_receive_skb(skb);
 	} else {
 
-		if (adapter->vlgrp && is_vlan)
+		if (adapter->vlgrp && is_vlan && (tag != 0))
 			vlan_hwaccel_rx(skb, adapter->vlgrp, tag);
 		else
 			netif_rx(skb);
@@ -545,14 +545,13 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_adapter *adapter,
 	struct ixgbe_rx_buffer *rx_buffer_info, *next_buffer;
 	struct sk_buff *skb;
 	unsigned int i;
-	u32 upper_len, len, staterr;
+	u32 len, staterr;
 	u16 hdr_info, vlan_tag;
 	bool is_vlan, cleaned = false;
 	int cleaned_count = 0;
 	unsigned int total_rx_bytes = 0, total_rx_packets = 0;
 
 	i = rx_ring->next_to_clean;
-	upper_len = 0;
 	rx_desc = IXGBE_RX_DESC_ADV(*rx_ring, i);
 	staterr = le32_to_cpu(rx_desc->wb.upper.status_error);
 	rx_buffer_info = &rx_ring->rx_buffer_info[i];
@@ -560,6 +559,7 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_adapter *adapter,
 	vlan_tag = le16_to_cpu(rx_desc->wb.upper.vlan);
 
 	while (staterr & IXGBE_RXD_STAT_DD) {
+		u32 upper_len = 0;
 		if (*work_done >= work_to_do)
 			break;
 		(*work_done)++;
@@ -1343,7 +1343,7 @@ static void ixgbe_configure_msi_and_legacy(struct ixgbe_adapter *adapter)
 }
 
 /**
- * ixgbe_configure_tx - Configure 8254x Transmit Unit after Reset
+ * ixgbe_configure_tx - Configure 8259x Transmit Unit after Reset
  * @adapter: board private structure
  *
  * Configure the Tx unit of the MAC after a reset.
@@ -1371,9 +1371,9 @@ static void ixgbe_configure_tx(struct ixgbe_adapter *adapter)
 		/* Disable Tx Head Writeback RO bit, since this hoses
 		 * bookkeeping if things aren't delivered in order.
 		 */
-		txctrl = IXGBE_READ_REG(hw, IXGBE_DCA_TXCTRL(i));
+		txctrl = IXGBE_READ_REG(hw, IXGBE_DCA_TXCTRL(j));
 		txctrl &= ~IXGBE_DCA_TXCTRL_TX_WB_RO_EN;
-		IXGBE_WRITE_REG(hw, IXGBE_DCA_TXCTRL(i), txctrl);
+		IXGBE_WRITE_REG(hw, IXGBE_DCA_TXCTRL(j), txctrl);
 	}
 }
 
@@ -1382,7 +1382,7 @@ static void ixgbe_configure_tx(struct ixgbe_adapter *adapter)
 
 #define IXGBE_SRRCTL_BSIZEHDRSIZE_SHIFT			2
 /**
- * ixgbe_configure_rx - Configure 8254x Receive Unit after Reset
+ * ixgbe_configure_rx - Configure 8259x Receive Unit after Reset
  * @adapter: board private structure
  *
  * Configure the Rx unit of the MAC after a reset.
@@ -1529,6 +1529,16 @@ static void ixgbe_vlan_rx_register(struct net_device *netdev,
 		ixgbe_irq_disable(adapter);
 	adapter->vlgrp = grp;
 
+	/*
+	 * For a DCB driver, always enable VLAN tag stripping so we can
+	 * still receive traffic from a DCB-enabled host even if we're
+	 * not in DCB mode.
+	 */
+	ctrl = IXGBE_READ_REG(&adapter->hw, IXGBE_VLNCTRL);
+	ctrl |= IXGBE_VLNCTRL_VME;
+	ctrl &= ~IXGBE_VLNCTRL_CFIEN;
+	IXGBE_WRITE_REG(&adapter->hw, IXGBE_VLNCTRL, ctrl);
+
 	if (grp) {
 		/* enable VLAN tag insert/strip */
 		ctrl = IXGBE_READ_REG(&adapter->hw, IXGBE_VLNCTRL);
@@ -1672,6 +1682,42 @@ static void ixgbe_napi_disable_all(struct ixgbe_adapter *adapter)
 	}
 }
 
+/*
+ * ixgbe_configure_dcb - Configure DCB hardware
+ * @adapter: ixgbe adapter struct
+ *
+ * This is called by the driver on open to configure the DCB hardware.
+ * This is also called by the gennetlink interface when reconfiguring
+ * the DCB state.
+ */
+void ixgbe_configure_dcb(struct ixgbe_adapter *adapter)
+{
+	struct ixgbe_hw *hw = &adapter->hw;
+	u32 txdctl, vlnctrl;
+	int i, j;
+
+	ixgbe_dcb_check_config(&adapter->dcb_cfg);
+	ixgbe_dcb_calculate_tc_credits(&adapter->dcb_cfg, DCB_TX_CONFIG);
+	ixgbe_dcb_calculate_tc_credits(&adapter->dcb_cfg, DCB_RX_CONFIG);
+
+	/* reconfigure the hardware */
+	ixgbe_dcb_hw_config(&adapter->hw, &adapter->dcb_cfg);
+
+	for (i = 0; i < adapter->num_tx_queues; i++) {
+		j = adapter->tx_ring[i].reg_idx;
+		txdctl = IXGBE_READ_REG(hw, IXGBE_TXDCTL(j));
+		/* PThresh workaround for Tx hang with DFP enabled. */
+		txdctl |= 32;
+		IXGBE_WRITE_REG(hw, IXGBE_TXDCTL(j), txdctl);
+	}
+	/* Enable VLAN tag insert/strip */
+	vlnctrl = IXGBE_READ_REG(hw, IXGBE_VLNCTRL);
+	vlnctrl |= IXGBE_VLNCTRL_VME | IXGBE_VLNCTRL_VFE;
+	vlnctrl &= ~IXGBE_VLNCTRL_CFIEN;
+	IXGBE_WRITE_REG(hw, IXGBE_VLNCTRL, vlnctrl);
+	ixgbe_set_vfta(hw, 0, 0, true);
+}
+
 static void ixgbe_configure(struct ixgbe_adapter *adapter)
 {
 	struct net_device *netdev = adapter->netdev;
@@ -1680,6 +1726,12 @@ static void ixgbe_configure(struct ixgbe_adapter *adapter)
 	ixgbe_set_multi(netdev);
 
 	ixgbe_restore_vlan(adapter);
+	if (adapter->flags & IXGBE_FLAG_DCB_ENABLED) {
+		netif_set_gso_max_size(netdev, 32768);
+		ixgbe_configure_dcb(adapter);
+	} else {
+		netif_set_gso_max_size(netdev, 65536);
+	}
 
 	ixgbe_configure_tx(adapter);
 	ixgbe_configure_rx(adapter);
@@ -1699,6 +1751,11 @@ static int ixgbe_up_complete(struct ixgbe_adapter *adapter)
 
 	ixgbe_get_hw_control(adapter);
 
+#ifdef CONFIG_NETDEVICES_MULTIQUEUE
+	if (adapter->num_tx_queues > 1)
+		netdev->features |= NETIF_F_MULTI_QUEUE;
+#endif
+
 	if ((adapter->flags & IXGBE_FLAG_MSIX_ENABLED) ||
 	    (adapter->flags & IXGBE_FLAG_MSI_ENABLED)) {
 		if (adapter->flags & IXGBE_FLAG_MSIX_ENABLED) {
@@ -1943,36 +2000,44 @@ static void ixgbe_clean_all_tx_rings(struct ixgbe_adapter *adapter)
 void ixgbe_down(struct ixgbe_adapter *adapter)
 {
 	struct net_device *netdev = adapter->netdev;
+	struct ixgbe_hw *hw = &adapter->hw;
 	u32 rxctrl;
+	u32 txdctl;
+	int i, j;
 
 	/* signal that we are down to the interrupt handler */
 	set_bit(__IXGBE_DOWN, &adapter->state);
 
 	/* disable receives */
-	rxctrl = IXGBE_READ_REG(&adapter->hw, IXGBE_RXCTRL);
-	IXGBE_WRITE_REG(&adapter->hw, IXGBE_RXCTRL,
-			rxctrl & ~IXGBE_RXCTRL_RXEN);
-
-	netif_tx_disable(netdev);
-
-	/* disable transmits in the hardware */
+	rxctrl = IXGBE_READ_REG(hw, IXGBE_RXCTRL);
+	IXGBE_WRITE_REG(hw, IXGBE_RXCTRL, rxctrl & ~IXGBE_RXCTRL_RXEN);
 
-	/* flush both disables */
-	IXGBE_WRITE_FLUSH(&adapter->hw);
+	IXGBE_WRITE_FLUSH(hw);
 	msleep(10);
 
+	netif_stop_queue(netdev);
+	if (netif_is_multiqueue(netdev))
+		for (i = 0; i < adapter->num_tx_queues; i++)
+			netif_stop_subqueue(netdev, i);
+
 	ixgbe_irq_disable(adapter);
 
 	ixgbe_napi_disable_all(adapter);
 	del_timer_sync(&adapter->watchdog_timer);
 
+	/* disable transmits in the hardware now that interrupts are off */
+	for (i = 0; i < adapter->num_tx_queues; i++) {
+		j = adapter->tx_ring[i].reg_idx;
+		txdctl = IXGBE_READ_REG(hw, IXGBE_TXDCTL(j));
+		IXGBE_WRITE_REG(hw, IXGBE_TXDCTL(j),
+		                (txdctl & ~IXGBE_TXDCTL_ENABLE));
+	}
+
 	netif_carrier_off(netdev);
-	netif_stop_queue(netdev);
 
 	ixgbe_reset(adapter);
 	ixgbe_clean_all_tx_rings(adapter);
 	ixgbe_clean_all_rx_rings(adapter);
-
 }
 
 static int ixgbe_suspend(struct pci_dev *pdev, pm_message_t state)
@@ -2069,6 +2134,11 @@ static void ixgbe_reset_task(struct work_struct *work)
 	struct ixgbe_adapter *adapter;
 	adapter = container_of(work, struct ixgbe_adapter, reset_task);
 
+	/* If we're already down or resetting, just bail */
+	if (test_bit(__IXGBE_DOWN, &adapter->state) ||
+	    test_bit(__IXGBE_RESETTING, &adapter->state))
+		return;
+
 	adapter->tx_timeout_count++;
 
 	ixgbe_reinit_locked(adapter);
@@ -2112,6 +2182,7 @@ static void ixgbe_acquire_msix_vectors(struct ixgbe_adapter *adapter,
 		adapter->flags &= ~IXGBE_FLAG_MSIX_ENABLED;
 		kfree(adapter->msix_entries);
 		adapter->msix_entries = NULL;
+		adapter->flags &= ~IXGBE_FLAG_DCB_ENABLED;
 		adapter->flags &= ~IXGBE_FLAG_RSS_ENABLED;
 		adapter->num_tx_queues = 1;
 		adapter->num_rx_queues = 1;
@@ -2121,19 +2192,39 @@ static void ixgbe_acquire_msix_vectors(struct ixgbe_adapter *adapter,
 	}
 }
 
-static void __devinit ixgbe_set_num_queues(struct ixgbe_adapter *adapter)
+static void ixgbe_set_num_queues(struct ixgbe_adapter *adapter)
 {
-	int nrq, ntq;
+	int nrq = 1, ntq = 1;
 	int feature_mask = 0, rss_i, rss_m;
+	int dcb_i, dcb_m;
 
 	/* Number of supported queues */
 	switch (adapter->hw.mac.type) {
 	case ixgbe_mac_82598EB:
+		dcb_i = adapter->ring_feature[RING_F_DCB].indices;
+		dcb_m = 0;
 		rss_i = adapter->ring_feature[RING_F_RSS].indices;
 		rss_m = 0;
+		feature_mask |= IXGBE_FLAG_DCB_ENABLED;
 		feature_mask |= IXGBE_FLAG_RSS_ENABLED;
 
 		switch (adapter->flags & feature_mask) {
+		case (IXGBE_FLAG_DCB_ENABLED):
+#ifdef CONFIG_NETDEVICES_MULTIQUEUE
+			dcb_m = 0x7 << 3;
+			nrq = dcb_i;
+			ntq = dcb_i;
+#else
+			printk(KERN_INFO, "Kernel has no multiqueue "
+				"support, disabling DCB.\n");
+			/* Fall back onto RSS */
+			rss_m = 0xF;
+			nrq = rss_i;
+			ntq = 1;
+			dcb_m = 0;
+			dcb_i = 0;
+#endif
+			break;
 		case (IXGBE_FLAG_RSS_ENABLED):
 			rss_m = 0xF;
 			nrq = rss_i;
@@ -2145,6 +2236,8 @@ static void __devinit ixgbe_set_num_queues(struct ixgbe_adapter *adapter)
 			break;
 		case 0:
 		default:
+			dcb_i = 0;
+			dcb_m = 0;
 			rss_i = 0;
 			rss_m = 0;
 			nrq = 1;
@@ -2152,6 +2245,8 @@ static void __devinit ixgbe_set_num_queues(struct ixgbe_adapter *adapter)
 			break;
 		}
 
+		adapter->ring_feature[RING_F_DCB].indices = dcb_i;
+		adapter->ring_feature[RING_F_DCB].mask = dcb_m;
 		adapter->ring_feature[RING_F_RSS].indices = rss_i;
 		adapter->ring_feature[RING_F_RSS].mask = rss_m;
 		break;
@@ -2179,15 +2274,25 @@ static void __devinit ixgbe_cache_ring_register(struct ixgbe_adapter *adapter)
 	 */
 	int feature_mask = 0, rss_i;
 	int i, txr_idx, rxr_idx;
+	int dcb_i;
 
 	/* Number of supported queues */
 	switch (adapter->hw.mac.type) {
 	case ixgbe_mac_82598EB:
+		dcb_i = adapter->ring_feature[RING_F_DCB].indices;
 		rss_i = adapter->ring_feature[RING_F_RSS].indices;
 		txr_idx = 0;
 		rxr_idx = 0;
+		feature_mask |= IXGBE_FLAG_DCB_ENABLED;
 		feature_mask |= IXGBE_FLAG_RSS_ENABLED;
 		switch (adapter->flags & feature_mask) {
+		case (IXGBE_FLAG_DCB_ENABLED):
+			/* the number of queues is assumed to be symmetric */
+			for (i = 0; i < dcb_i; i++) {
+				adapter->rx_ring[i].reg_idx = i << 3;
+				adapter->tx_ring[i].reg_idx = i << 2;
+			}
+			break;
 		case (IXGBE_FLAG_RSS_ENABLED):
 			for (i = 0; i < adapter->num_rx_queues; i++)
 				adapter->rx_ring[i].reg_idx = i;
@@ -2212,7 +2317,7 @@ static void __devinit ixgbe_cache_ring_register(struct ixgbe_adapter *adapter)
  * number of queues at compile-time.  The polling_netdev array is
  * intended for Multiqueue, but should work fine with a single queue.
  **/
-static int __devinit ixgbe_alloc_queues(struct ixgbe_adapter *adapter)
+static int ixgbe_alloc_queues(struct ixgbe_adapter *adapter)
 {
 	int i;
 
@@ -2252,7 +2357,7 @@ err_tx_ring_allocation:
  * Attempt to configure the interrupts using the best available
  * capabilities of the hardware and the kernel.
  **/
-static int __devinit ixgbe_set_interrupt_capability(struct ixgbe_adapter
+static int ixgbe_set_interrupt_capability(struct ixgbe_adapter
 						    *adapter)
 {
 	int err = 0;
@@ -2281,6 +2386,7 @@ static int __devinit ixgbe_set_interrupt_capability(struct ixgbe_adapter
 	adapter->msix_entries = kcalloc(v_budget,
 					sizeof(struct msix_entry), GFP_KERNEL);
 	if (!adapter->msix_entries) {
+		adapter->flags &= ~IXGBE_FLAG_DCB_ENABLED;
 		adapter->flags &= ~IXGBE_FLAG_RSS_ENABLED;
 		ixgbe_set_num_queues(adapter);
 		kfree(adapter->tx_ring);
@@ -2323,7 +2429,7 @@ out:
 	return err;
 }
 
-static void ixgbe_reset_interrupt_capability(struct ixgbe_adapter *adapter)
+void ixgbe_reset_interrupt_capability(struct ixgbe_adapter *adapter)
 {
 	if (adapter->flags & IXGBE_FLAG_MSIX_ENABLED) {
 		adapter->flags &= ~IXGBE_FLAG_MSIX_ENABLED;
@@ -2347,7 +2453,7 @@ static void ixgbe_reset_interrupt_capability(struct ixgbe_adapter *adapter)
  * - Hardware queue count (num_*_queues)
  *   - defined by miscellaneous hardware support/features (RSS, etc.)
  **/
-static int __devinit ixgbe_init_interrupt_scheme(struct ixgbe_adapter *adapter)
+int ixgbe_init_interrupt_scheme(struct ixgbe_adapter *adapter)
 {
 	int err;
 
@@ -2395,11 +2501,27 @@ static int __devinit ixgbe_sw_init(struct ixgbe_adapter *adapter)
 	struct ixgbe_hw *hw = &adapter->hw;
 	struct pci_dev *pdev = adapter->pdev;
 	unsigned int rss;
+	int j;
+	struct tc_configuration *tc;
 
 	/* Set capability flags */
 	rss = min(IXGBE_MAX_RSS_INDICES, (int)num_online_cpus());
 	adapter->ring_feature[RING_F_RSS].indices = rss;
 	adapter->flags |= IXGBE_FLAG_RSS_ENABLED;
+	adapter->ring_feature[RING_F_DCB].indices = IXGBE_MAX_DCB_INDICES;
+	for (j = 0; j < MAX_TRAFFIC_CLASS; j++) {
+		tc = &adapter->dcb_cfg.tc_config[j];
+		tc->path[DCB_TX_CONFIG].bwg_id = 0;
+		tc->path[DCB_TX_CONFIG].bwg_percent = 12 + (j & 1);
+		tc->path[DCB_RX_CONFIG].bwg_id = 0;
+		tc->path[DCB_RX_CONFIG].bwg_percent = 12 + (j & 1);
+		tc->dcb_pfc = pfc_disabled;
+	}
+	adapter->dcb_cfg.bw_percentage[DCB_TX_CONFIG][0] = 100;
+	adapter->dcb_cfg.bw_percentage[DCB_RX_CONFIG][0] = 100;
+	adapter->dcb_cfg.rx_pba_cfg = pba_equal;
+	adapter->dcb_cfg.round_robin_enable = false;
+	adapter->dcb_set_bitmap = 0x00;
 
 	/* Enable Dynamic interrupt throttling by default */
 	adapter->rx_eitr = 1;
@@ -2681,7 +2803,7 @@ static int ixgbe_change_mtu(struct net_device *netdev, int new_mtu)
  * handler is registered with the OS, the watchdog timer is started,
  * and the stack is notified that the interface is ready.
  **/
-static int ixgbe_open(struct net_device *netdev)
+int ixgbe_open(struct net_device *netdev)
 {
 	struct ixgbe_adapter *adapter = netdev_priv(netdev);
 	int err;
@@ -2736,7 +2858,7 @@ err_setup_tx:
  * needs to be disabled.  A global MAC reset is issued to stop the
  * hardware, and all transmit and receive resources are freed.
  **/
-static int ixgbe_close(struct net_device *netdev)
+int ixgbe_close(struct net_device *netdev)
 {
 	struct ixgbe_adapter *adapter = netdev_priv(netdev);
 
@@ -2769,6 +2891,18 @@ void ixgbe_update_stats(struct ixgbe_adapter *adapter)
 		adapter->stats.mpc[i] += mpc;
 		total_mpc += adapter->stats.mpc[i];
 		adapter->stats.rnbc[i] += IXGBE_READ_REG(hw, IXGBE_RNBC(i));
+		adapter->stats.qptc[i] += IXGBE_READ_REG(hw, IXGBE_QPTC(i));
+		adapter->stats.qbtc[i] += IXGBE_READ_REG(hw, IXGBE_QBTC(i));
+		adapter->stats.qprc[i] += IXGBE_READ_REG(hw, IXGBE_QPRC(i));
+		adapter->stats.qbrc[i] += IXGBE_READ_REG(hw, IXGBE_QBRC(i));
+		adapter->stats.pxonrxc[i] += IXGBE_READ_REG(hw,
+		                                            IXGBE_PXONRXC(i));
+		adapter->stats.pxontxc[i] += IXGBE_READ_REG(hw,
+		                                            IXGBE_PXONTXC(i));
+		adapter->stats.pxoffrxc[i] += IXGBE_READ_REG(hw,
+		                                             IXGBE_PXOFFRXC(i));
+		adapter->stats.pxofftxc[i] += IXGBE_READ_REG(hw,
+		                                             IXGBE_PXOFFTXC(i));
 	}
 	adapter->stats.gprc += IXGBE_READ_REG(hw, IXGBE_GPRC);
 	/* work around hardware counting issue */
@@ -2865,10 +2999,9 @@ static void ixgbe_watchdog(unsigned long data)
 
 			netif_carrier_on(netdev);
 			netif_wake_queue(netdev);
-#ifdef CONFIG_NETDEVICES_MULTIQUEUE
-			for (i = 0; i < adapter->num_tx_queues; i++)
-				netif_wake_subqueue(netdev, i);
-#endif
+			if (netif_is_multiqueue(netdev))
+				for (i = 0; i < adapter->num_tx_queues; i++)
+					netif_wake_subqueue(netdev, i);
 		} else {
 			/* Force detection of hung controller */
 			adapter->detect_tx_hung = true;
@@ -2878,6 +3011,9 @@ static void ixgbe_watchdog(unsigned long data)
 			DPRINTK(LINK, INFO, "NIC Link is Down\n");
 			netif_carrier_off(netdev);
 			netif_stop_queue(netdev);
+			if (netif_is_multiqueue(netdev))
+				for (i = 0; i < adapter->num_tx_queues; i++)
+					netif_stop_subqueue(netdev, i);
 		}
 	}
 
@@ -2972,6 +3108,8 @@ static int ixgbe_tso(struct ixgbe_adapter *adapter,
 		mss_l4len_idx |=
 		    (skb_shinfo(skb)->gso_size << IXGBE_ADVTXD_MSS_SHIFT);
 		mss_l4len_idx |= (l4len << IXGBE_ADVTXD_L4LEN_SHIFT);
+		/* use index 1 for TSO */
+		mss_l4len_idx |= (1 << IXGBE_ADVTXD_IDX_SHIFT);
 		context_desc->mss_l4len_idx = cpu_to_le32(mss_l4len_idx);
 
 		tx_buffer_info->time_stamp = jiffies;
@@ -3044,6 +3182,7 @@ static bool ixgbe_tx_csum(struct ixgbe_adapter *adapter,
 		}
 
 		context_desc->type_tucmd_mlhl = cpu_to_le32(type_tucmd_mlhl);
+		/* use index zero for tx checksum offload */
 		context_desc->mss_l4len_idx = 0;
 
 		tx_buffer_info->time_stamp = jiffies;
@@ -3152,6 +3291,8 @@ static void ixgbe_tx_queue(struct ixgbe_adapter *adapter,
 		olinfo_status |= IXGBE_TXD_POPTS_TXSM <<
 						IXGBE_ADVTXD_POPTS_SHIFT;
 
+		/* use index 1 context for tso */
+		olinfo_status |= (1 << IXGBE_ADVTXD_IDX_SHIFT);
 		if (tx_flags & IXGBE_TX_FLAGS_IPV4)
 			olinfo_status |= IXGBE_TXD_POPTS_IXSM <<
 						IXGBE_ADVTXD_POPTS_SHIFT;
@@ -3195,11 +3336,11 @@ static int __ixgbe_maybe_stop_tx(struct net_device *netdev,
 {
 	struct ixgbe_adapter *adapter = netdev_priv(netdev);
 
-#ifdef CONFIG_NETDEVICES_MULTIQUEUE
-	netif_stop_subqueue(netdev, tx_ring->queue_index);
-#else
-	netif_stop_queue(netdev);
-#endif
+	if (netif_is_multiqueue(netdev))
+		netif_stop_subqueue(netdev, tx_ring->queue_index);
+	else
+		netif_stop_queue(netdev);
+
 	/* Herbert's original patch had:
 	 *  smp_mb__after_netif_stop_queue();
 	 * but since that doesn't exist yet, just open code it. */
@@ -3211,11 +3352,10 @@ static int __ixgbe_maybe_stop_tx(struct net_device *netdev,
 		return -EBUSY;
 
 	/* A reprieve! - use start_queue because it doesn't call schedule */
-#ifdef CONFIG_NETDEVICES_MULTIQUEUE
-	netif_wake_subqueue(netdev, tx_ring->queue_index);
-#else
-	netif_wake_queue(netdev);
-#endif
+	if (netif_is_multiqueue(netdev))
+		netif_start_subqueue(netdev, tx_ring->queue_index);
+	else
+		netif_start_queue(netdev);
 	++adapter->restart_queue;
 	return 0;
 }
@@ -3253,6 +3393,7 @@ static int ixgbe_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
 		dev_kfree_skb(skb);
 		return NETDEV_TX_OK;
 	}
+
 	mss = skb_shinfo(skb)->gso_size;
 
 	if (mss)
@@ -3269,8 +3410,21 @@ static int ixgbe_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
 		return NETDEV_TX_BUSY;
 	}
 	if (adapter->vlgrp && vlan_tx_tag_present(skb)) {
+		tx_flags |= vlan_tx_tag_get(skb);
+#ifdef CONFIG_NETDEVICES_MULTIQUEUE
+		if (adapter->flags & IXGBE_FLAG_DCB_ENABLED) {
+			tx_flags &= ~IXGBE_TX_FLAGS_VLAN_PRIO_MASK;
+			tx_flags |= (skb->queue_mapping << 13);
+		}
+#endif
+		tx_flags <<= IXGBE_TX_FLAGS_VLAN_SHIFT;
+		tx_flags |= IXGBE_TX_FLAGS_VLAN;
+#ifdef CONFIG_NETDEVICES_MULTIQUEUE
+	} else if (adapter->flags & IXGBE_FLAG_DCB_ENABLED) {
+		tx_flags |= (skb->queue_mapping << 13);
+		tx_flags <<= IXGBE_TX_FLAGS_VLAN_SHIFT;
 		tx_flags |= IXGBE_TX_FLAGS_VLAN;
-		tx_flags |= (vlan_tx_tag_get(skb) << IXGBE_TX_FLAGS_VLAN_SHIFT);
+#endif
 	}
 
 	if (skb->protocol == htons(ETH_P_IP))
@@ -3520,13 +3674,12 @@ static int __devinit ixgbe_probe(struct pci_dev *pdev,
 	netdev->features |= NETIF_F_TSO;
 
 	netdev->features |= NETIF_F_TSO6;
+	if (adapter->flags & IXGBE_FLAG_DCB_ENABLED)
+		adapter->flags &= ~IXGBE_FLAG_RSS_ENABLED;
+
 	if (pci_using_dac)
 		netdev->features |= NETIF_F_HIGHDMA;
 
-#ifdef CONFIG_NETDEVICES_MULTIQUEUE
-	netdev->features |= NETIF_F_MULTI_QUEUE;
-#endif
-
 	/* make sure the EEPROM is good */
 	if (ixgbe_validate_eeprom_checksum(hw, NULL) < 0) {
 		dev_err(&pdev->dev, "The EEPROM Checksum Is Not Valid\n");
@@ -3593,10 +3746,9 @@ static int __devinit ixgbe_probe(struct pci_dev *pdev,
 
 	netif_carrier_off(netdev);
 	netif_stop_queue(netdev);
-#ifdef CONFIG_NETDEVICES_MULTIQUEUE
-	for (i = 0; i < adapter->num_tx_queues; i++)
-		netif_stop_subqueue(netdev, i);
-#endif
+	if (netif_is_multiqueue(netdev))
+		for (i = 0; i < adapter->num_tx_queues; i++)
+			netif_stop_subqueue(netdev, i);
 
 	ixgbe_napi_add_all(adapter);
 
@@ -3774,6 +3926,11 @@ static struct pci_driver ixgbe_driver = {
 	.err_handler = &ixgbe_err_handler
 };
 
+bool ixgbe_is_ixgbe(struct pci_dev *pcidev)
+{
+	return (!(pci_dev_driver(pcidev) != &ixgbe_driver));
+}
+
 /**
  * ixgbe_init_module - Driver Registration Routine
  *
@@ -3782,18 +3939,17 @@ static struct pci_driver ixgbe_driver = {
  **/
 static int __init ixgbe_init_module(void)
 {
-	int ret;
 	printk(KERN_INFO "%s: %s - version %s\n", ixgbe_driver_name,
 	       ixgbe_driver_string, ixgbe_driver_version);
 
 	printk(KERN_INFO "%s: %s\n", ixgbe_driver_name, ixgbe_copyright);
 
+	ixgbe_dcb_netlink_register();
 #ifdef CONFIG_DCA
 	dca_register_notify(&dca_notifier);
 
 #endif
-	ret = pci_register_driver(&ixgbe_driver);
-	return ret;
+	return pci_register_driver(&ixgbe_driver);
 }
 module_init(ixgbe_init_module);
 
@@ -3808,6 +3964,7 @@ static void __exit ixgbe_exit_module(void)
 #ifdef CONFIG_DCA
 	dca_unregister_notify(&dca_notifier);
 #endif
+	ixgbe_dcb_netlink_unregister();
 	pci_unregister_driver(&ixgbe_driver);
 }
 


^ permalink raw reply related	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2008-06-12 10:17 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-05-27 14:13 [PATCH] NET: DCB generic netlink interface PJ Waskiewicz
2008-05-27 14:13 ` [PATCH 1/3] [NET-NEXT]: Add DCB netlink interface definition PJ Waskiewicz
2008-05-28  9:41   ` Thomas Graf
2008-05-28 16:03     ` Waskiewicz Jr, Peter P
2008-05-28 22:37       ` Thomas Graf
2008-06-01 12:16         ` Waskiewicz Jr, Peter P
2008-06-05 13:17   ` Patrick McHardy
2008-06-09 22:11     ` Waskiewicz Jr, Peter P
2008-06-10  7:14       ` Patrick McHardy
2008-05-27 14:13 ` [PATCH 2/3] ixgbe: Add Data Center Bridging hardware initialization code PJ Waskiewicz
2008-05-27 14:13 ` [PATCH 3/3] ixgbe: Enable Data Center Bridging (DCB) support PJ Waskiewicz
2008-06-04 18:44 ` [PATCH] NET: DCB generic netlink interface David Miller
2008-06-05  6:23   ` Waskiewicz Jr, Peter P
2008-06-05 14:43     ` David Miller
2008-06-05 20:29       ` Thomas Graf
2008-06-10 19:55       ` Waskiewicz Jr, Peter P
2008-06-10 20:07         ` David Miller
2008-06-11 17:51         ` Thomas Graf
2008-06-11 17:50           ` Patrick McHardy
2008-06-11 21:28             ` Thomas Graf
2008-06-12 10:17               ` Patrick McHardy
2008-06-11 18:28           ` Waskiewicz Jr, Peter P
2008-06-11 21:26             ` Thomas Graf
  -- strict thread matches above, loose matches on Subject: below --
2008-05-02  0:42 [ANNOUNCE] ixgbe: Data Center Bridging (DCB) support for ixgbe PJ Waskiewicz
2008-05-02  0:43 ` [PATCH 3/3] ixgbe: Enable Data Center Bridging (DCB) support PJ Waskiewicz

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).