netdev.vger.kernel.org archive mirror
* net: Add network priority cgroup
@ 2011-11-09 19:57 Neil Horman
  2011-11-09 19:57 ` [RFC PATCH 1/2] net: add network priority cgroup infrastructure Neil Horman
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Neil Horman @ 2011-11-09 19:57 UTC (permalink / raw)
  To: netdev; +Cc: Neil Horman, John Fastabend, Robert Love, David S. Miller

Data Center Bridging environments are currently somewhat limited in their
ability to provide a general mechanism for controlling traffic priority.
Specifically they are unable to administratively control the priority at which
various types of network traffic are sent.
 
Currently, the only ways to set the priority of a network buffer are:

1) Through the use of the SO_PRIORITY socket option
2) By using low level hooks, like a tc action

(1) is difficult from an administrative perspective because it requires that the
application be coded not to assume the default priority is sufficient, and that
it expose an administrative interface to allow priority adjustment.  Such a
solution does not scale in a DCB environment
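
For reference, option (1) amounts to every application carrying code like the
following (the setsockopt() call is the standard sockets API; the helper around
it is purely illustrative):

#include <sys/socket.h>

/* What option (1) requires of every application: an explicit,
 * per-socket priority set through the standard sockets API. */
static int set_app_priority(int sockfd, int prio)
{
	return setsockopt(sockfd, SOL_SOCKET, SO_PRIORITY,
			  &prio, sizeof(prio));
}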

(2) is also difficult, as it requires constant administrative oversight of
applications in order to build appropriate rules matching traffic belonging to
the various classes, so that priority can be set appropriately.  It is further
limited when DCB enabled hardware is in use, because tc rules are only run
after a root qdisc has been selected (DCB enabled hardware may reserve hw
queues for various traffic classes and needs the priority to be set prior to
selecting the root qdisc)


I've discussed various solutions with John Fastabend, and we saw a cgroup as
being a good general solution to this problem.  The network priority cgroup
allows a per-interface priority map to be built per cgroup.  Any traffic
originating from an application in a cgroup that does not explicitly set its
priority with SO_PRIORITY will have its priority assigned to the value
designated for that group on that interface.  This allows a user space daemon,
when conducting LLDP negotiation with a DCB enabled peer, to create a cgroup
based on the APP_TLV value received and administratively assign applications
to that priority using the existing cgroup utility infrastructure.
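
For illustration only (none of this is in the patches), such a daemon might
drive the interface described in patch 2/2 like so; the mount point and file
names come from that documentation, while the helper itself is hypothetical:

#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: create a net_prio cgroup for a negotiated
 * APP_TLV priority, program its per-interface map, and move an
 * application into it. */
static int assign_app_priority(const char *grp, const char *ifname,
			       int prio, pid_t pid)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/fs/cgroup/net_prio/%s", grp);
	if (mkdir(path, 0755) && errno != EEXIST)
		return -1;

	/* ifpriomap takes "<ifname> <prio>" tuples */
	snprintf(path, sizeof(path),
		 "/sys/fs/cgroup/net_prio/%s/net_prio.ifpriomap", grp);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%s %d\n", ifname, prio);
	fclose(f);

	/* attach the application to the group */
	snprintf(path, sizeof(path), "/sys/fs/cgroup/net_prio/%s/tasks", grp);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%d\n", (int)pid);
	fclose(f);
	return 0;
}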

Tested by John and myself, with good results

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: John Fastabend <john.r.fastabend@intel.com>
CC: Robert Love <robert.w.love@intel.com>
CC: "David S. Miller" <davem@davemloft.net>


* [RFC PATCH 1/2] net: add network priority cgroup infrastructure
  2011-11-09 19:57 net: Add network priority cgroup Neil Horman
@ 2011-11-09 19:57 ` Neil Horman
  2011-11-09 19:57 ` [RFC PATCH 2/2] net: add documentation for net_prio cgroups Neil Horman
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 15+ messages in thread
From: Neil Horman @ 2011-11-09 19:57 UTC (permalink / raw)
  To: netdev; +Cc: Neil Horman, John Fastabend, Robert Love, David S. Miller

This patch adds the infrastructure code to create the network priority
cgroup.  The cgroup, in addition to the standard processes file, creates two
control files:

1) prioidx - This is a read-only file that exports the index of this cgroup.
This is a value that is both arbitrary and unique to a cgroup in this subsystem,
and is used to index the per-device priority map.

2) priomap - This is a writeable file.  On read it reports a table of 2-tuples
<name:priority>, where name is the name of a network interface and priority
indicates the priority assigned to frames egressing on the named interface and
originating from a pid in this cgroup.

This cgroup allows for skb priority to be set prior to a root qdisc getting
selected.  This is beneficial for DCB enabled systems, in that it allows any
application to use DCB-configured priorities without application modification.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: John Fastabend <john.r.fastabend@intel.com>
CC: Robert Love <robert.w.love@intel.com>
CC: "David S. Miller" <davem@davemloft.net>
---
 include/linux/cgroup_subsys.h |    8 +
 include/linux/netdevice.h     |    4 +
 include/net/netprio_cgroup.h  |   66 ++++++++
 include/net/sock.h            |    3 +
 net/Kconfig                   |    7 +
 net/core/Makefile             |    1 +
 net/core/dev.c                |   13 ++
 net/core/netprio_cgroup.c     |  340 +++++++++++++++++++++++++++++++++++++++++
 net/core/sock.c               |   22 +++-
 net/socket.c                  |    2 +
 10 files changed, 465 insertions(+), 1 deletions(-)
 create mode 100644 include/net/netprio_cgroup.h
 create mode 100644 net/core/netprio_cgroup.c

diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index ac663c1..0bd390c 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -59,8 +59,16 @@ SUBSYS(net_cls)
 SUBSYS(blkio)
 #endif
 
+/* */
+
 #ifdef CONFIG_CGROUP_PERF
 SUBSYS(perf)
 #endif
 
 /* */
+
+#ifdef CONFIG_NETPRIO_CGROUP
+SUBSYS(net_prio)
+#endif
+
+/* */
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 0db1f5f..86e8c3f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -50,6 +50,7 @@
 #ifdef CONFIG_DCB
 #include <net/dcbnl.h>
 #endif
+#include <net/netprio_cgroup.h>
 
 struct vlan_group;
 struct netpoll_info;
@@ -1312,6 +1313,9 @@ struct net_device {
 	/* max exchange id for FCoE LRO by ddp */
 	unsigned int		fcoe_ddp_xid;
 #endif
+#if IS_ENABLED(CONFIG_NETPRIO_CGROUP)
+	struct netprio_map *priomap;
+#endif
 	/* phy device may attach itself for hardware timestamping */
 	struct phy_device *phydev;
 
diff --git a/include/net/netprio_cgroup.h b/include/net/netprio_cgroup.h
new file mode 100644
index 0000000..6b65936
--- /dev/null
+++ b/include/net/netprio_cgroup.h
@@ -0,0 +1,66 @@
+/*
+ * netprio_cgroup.h			Control Group Priority set 
+ *
+ *
+ * Authors:	Neil Horman <nhorman@tuxdriver.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ */
+
+#ifndef _NETPRIO_CGROUP_H
+#define _NETPRIO_CGROUP_H
+#include <linux/module.h>
+#include <linux/cgroup.h>
+#include <linux/hardirq.h>
+#include <linux/rcupdate.h>
+
+struct cgroup_netprio_state {
+	struct cgroup_subsys_state css;
+	u32 prioidx;
+};
+
+struct netprio_map {
+	struct rcu_head rcu;
+	u32 priomap_len;
+	u32 priomap[];
+};
+
+#ifdef CONFIG_CGROUPS
+
+#ifndef CONFIG_NETPRIO_CGROUP
+extern int net_prio_subsys_id;
+#endif
+
+extern void sock_update_netprioidx(struct sock *sk);
+
+#if IS_ENABLED(CONFIG_NETPRIO_CGROUP)
+extern void skb_update_prio(struct sk_buff *skb);
+#else
+#define skb_update_prio(skb)
+#endif
+
+static inline struct cgroup_netprio_state
+		*task_netprio_state(struct task_struct *p)
+{
+#if IS_ENABLED(CONFIG_NETPRIO_CGROUP)
+	return container_of(task_subsys_state(p, net_prio_subsys_id),
+			    struct cgroup_netprio_state, css);
+#else
+	return NULL;
+#endif
+}
+
+#else
+
+#define sock_update_netprioidx(sk)
+#define skb_update_prio(skb)
+
+static inline struct cgroup_netprio_state
+		*task_netprio_state(struct task_struct *p)
+{
+	return NULL;
+}
+
+#endif
+
+#endif	/* _NETPRIO_CGROUP_H */
diff --git a/include/net/sock.h b/include/net/sock.h
index 5ac682f..87b24aa 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -321,6 +321,9 @@ struct sock {
 	unsigned short		sk_ack_backlog;
 	unsigned short		sk_max_ack_backlog;
 	__u32			sk_priority;
+#ifdef CONFIG_CGROUPS
+	__u32			sk_cgrp_prioidx;
+#endif
 	struct pid		*sk_peer_pid;
 	const struct cred	*sk_peer_cred;
 	long			sk_rcvtimeo;
diff --git a/net/Kconfig b/net/Kconfig
index a073148..63d2c5d 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -232,6 +232,13 @@ config XPS
 	depends on SMP && SYSFS && USE_GENERIC_SMP_HELPERS
 	default y
 
+config NETPRIO_CGROUP
+	tristate "Network priority cgroup"
+	depends on CGROUPS
+	---help---
+	  Cgroup subsystem for use in assigning processes to network priorities on
+	  a per-interface basis
+
 config HAVE_BPF_JIT
 	bool
 
diff --git a/net/core/Makefile b/net/core/Makefile
index 0d357b1..3606d40 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -19,3 +19,4 @@ obj-$(CONFIG_FIB_RULES) += fib_rules.o
 obj-$(CONFIG_TRACEPOINTS) += net-traces.o
 obj-$(CONFIG_NET_DROP_MONITOR) += drop_monitor.o
 obj-$(CONFIG_NETWORK_PHY_TIMESTAMPING) += timestamping.o
+obj-$(CONFIG_NETPRIO_CGROUP) += netprio_cgroup.o
diff --git a/net/core/dev.c b/net/core/dev.c
index b7ba81a..a1dca83 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2456,6 +2456,17 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 	return rc;
 }
 
+#if IS_ENABLED(CONFIG_NETPRIO_CGROUP)
+void skb_update_prio(struct sk_buff *skb)
+{
+	/* called under rcu_read_lock_bh() from dev_queue_xmit() */
+	struct netprio_map *map = rcu_dereference_bh(skb->dev->priomap);
+
+	if (!skb->priority && skb->sk && map &&
+	    skb->sk->sk_cgrp_prioidx < map->priomap_len)
+		skb->priority = map->priomap[skb->sk->sk_cgrp_prioidx];
+}
+EXPORT_SYMBOL_GPL(skb_update_prio);
+#endif
+
 static DEFINE_PER_CPU(int, xmit_recursion);
 #define RECURSION_LIMIT 10
 
@@ -2496,6 +2507,8 @@ int dev_queue_xmit(struct sk_buff *skb)
 	 */
 	rcu_read_lock_bh();
 
+	skb_update_prio(skb);
+
 	txq = dev_pick_tx(dev, skb);
 	q = rcu_dereference_bh(txq->qdisc);
 
diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
new file mode 100644
index 0000000..14e896c
--- /dev/null
+++ b/net/core/netprio_cgroup.c
@@ -0,0 +1,340 @@
+/*
+ * net/core/netprio_cgroup.c	Priority Control Group
+ *
+ *		This program is free software; you can redistribute it and/or
+ *		modify it under the terms of the GNU General Public License
+ *		as published by the Free Software Foundation; either version
+ *		2 of the License, or (at your option) any later version.
+ *
+ * Authors:	Neil Horman <nhorman@tuxdriver.com>
+ */
+
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/skbuff.h>
+#include <linux/cgroup.h>
+#include <linux/rcupdate.h>
+#include <linux/atomic.h>
+#include <net/rtnetlink.h>
+#include <net/pkt_cls.h>
+#include <net/sock.h>
+#include <net/netprio_cgroup.h>
+
+static struct cgroup_subsys_state *cgrp_create(struct cgroup_subsys *ss,
+					       struct cgroup *cgrp);
+static void cgrp_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp);
+static int cgrp_populate(struct cgroup_subsys *ss, struct cgroup *cgrp);
+
+struct cgroup_subsys net_prio_subsys = {
+	.name		= "net_prio",
+	.create		= cgrp_create,
+	.destroy	= cgrp_destroy,
+	.populate	= cgrp_populate,
+#ifdef CONFIG_NETPRIO_CGROUP
+	.subsys_id	= net_prio_subsys_id,
+#endif
+	.module		= THIS_MODULE
+};
+
+#define PRIOIDX_SZ 128
+
+static unsigned long prioidx_map[PRIOIDX_SZ];
+static DEFINE_SPINLOCK(prioidx_map_lock);
+static atomic_t max_prioidx = ATOMIC_INIT(0);
+
+static inline struct cgroup_netprio_state *cgrp_netprio_state(struct cgroup *cgrp)
+{
+	return container_of(cgroup_subsys_state(cgrp, net_prio_subsys_id),
+			    struct cgroup_netprio_state, css);
+}
+
+static int get_prioidx(u32 *prio)
+{
+	unsigned long flags;
+	u32 prioidx;
+
+	spin_lock_irqsave(&prioidx_map_lock, flags);
+	/* find_first_zero_bit() takes a size in bits, not bytes */
+	prioidx = find_first_zero_bit(prioidx_map, PRIOIDX_SZ * BITS_PER_LONG);
+	if (prioidx == PRIOIDX_SZ * BITS_PER_LONG) {
+		spin_unlock_irqrestore(&prioidx_map_lock, flags);
+		return -ENOSPC;
+	}
+	set_bit(prioidx, prioidx_map);
+	spin_unlock_irqrestore(&prioidx_map_lock, flags);
+	atomic_set(&max_prioidx, prioidx);
+	*prio = prioidx;
+	return 0;
+}
+
+static void put_prioidx(u32 idx)
+{
+	unsigned long flags;
+	spin_lock_irqsave(&prioidx_map_lock, flags);
+	clear_bit(idx, prioidx_map);
+	spin_unlock_irqrestore(&prioidx_map_lock, flags);
+}
+
+static void extend_netdev_table(struct net_device *dev, u32 new_len)
+{
+	size_t new_size = sizeof(struct netprio_map) +
+			   ((sizeof(u32) * new_len));
+	struct netprio_map *new_priomap = kzalloc(new_size, GFP_KERNEL);
+	struct netprio_map *old_priomap;
+	int i;
+
+	old_priomap = rcu_dereference_protected(dev->priomap, 1);
+
+	if (!new_priomap) {
+		printk(KERN_WARNING "Unable to alloc new priomap!\n");
+		return;
+	}
+
+	for (i = 0;
+	     dev->priomap && (i < dev->priomap->priomap_len);
+	     i++)
+		new_priomap->priomap[i] = dev->priomap->priomap[i];
+
+	new_priomap->priomap_len = new_len;
+
+	rcu_assign_pointer(dev->priomap, new_priomap);
+	if (old_priomap)
+		kfree_rcu(old_priomap, rcu);
+
+}
+
+static void update_netdev_tables(void)
+{
+	struct net_device *dev;
+	/* tables must be able to hold index max_prioidx, hence the +1 */
+	u32 max_len = atomic_read(&max_prioidx) + 1;
+
+	rtnl_lock();
+
+	for_each_netdev(&init_net, dev) {
+		if ((!dev->priomap) ||
+		    (dev->priomap->priomap_len < max_len))
+			extend_netdev_table(dev, max_len);
+	}
+
+	rtnl_unlock();
+}
+
+static struct cgroup_subsys_state *cgrp_create(struct cgroup_subsys *ss,
+						 struct cgroup *cgrp)
+{
+	struct cgroup_netprio_state *cs;
+	int ret;
+
+	cs = kzalloc(sizeof(*cs), GFP_KERNEL);
+	if (!cs)
+		return ERR_PTR(-ENOMEM);
+
+	if (cgrp->parent && cgrp_netprio_state(cgrp->parent)->prioidx) {
+		kfree(cs);
+		return ERR_PTR(-EINVAL);
+	}
+
+	ret = get_prioidx(&cs->prioidx);
+	if (ret != 0) {
+		printk(KERN_WARNING "No space in priority index array\n");
+		kfree(cs);
+		return ERR_PTR(ret);
+	}
+
+	return &cs->css;
+}
+
+static void cgrp_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	struct cgroup_netprio_state *cs;
+	struct net_device *dev;
+
+	cs = cgrp_netprio_state(cgrp);
+	rtnl_lock();
+	for_each_netdev(&init_net, dev) {
+		if (dev->priomap &&
+		    cs->prioidx < dev->priomap->priomap_len)
+			dev->priomap->priomap[cs->prioidx] = 0;
+	}
+	rtnl_unlock();
+	put_prioidx(cs->prioidx);
+	kfree(cs);
+}
+
+static u64 read_prioidx(struct cgroup *cgrp, struct cftype *cft)
+{
+	return (u64)cgrp_netprio_state(cgrp)->prioidx;
+}
+
+static int read_priomap(struct cgroup *cont, struct cftype *cft,
+			struct cgroup_map_cb *cb)
+{
+	struct net_device *dev;
+	u32 prioidx = cgrp_netprio_state(cont)->prioidx;
+	u32 priority;
+
+	rcu_read_lock();
+	for_each_netdev_rcu(&init_net, dev) {
+		priority = (dev->priomap &&
+			    prioidx < dev->priomap->priomap_len) ?
+			   dev->priomap->priomap[prioidx] : 0;
+		cb->fill(cb, dev->name, priority);
+	}
+	rcu_read_unlock();
+	return 0;
+}
+
+static int write_priomap(struct cgroup *cgrp, struct cftype *cft,
+			 const char *buffer)
+{
+	char *devname;
+	int ret = -EINVAL;
+	u32 prioidx = cgrp_netprio_state(cgrp)->prioidx;
+	unsigned long priority;
+	char *priostr;
+	struct net_device *dev;
+
+	devname = kstrdup(buffer, GFP_KERNEL);
+	if (!devname)
+		return -ENOMEM;
+
+	/*
+	 * Minimally sized valid priomap string
+	 */
+	if (strlen(devname) < 3)
+		goto out_free_devname;
+
+	priostr = strstr(devname, " ");
+	if (!priostr)
+		goto out_free_devname;
+
+	/*
+	 * Separate the devname from the associated priority
+	 * and advance the priostr pointer to the priority value
+	 */
+	*priostr = '\0';
+	priostr++;
+
+	/*
+	 * If priostr points at the terminating NUL, we're at the end of
+	 * the passed-in string, and it's not a valid write
+	 */
+	if (*priostr == '\0')
+		goto out_free_devname;
+
+	ret = kstrtoul(priostr, 10, &priority);
+	if (ret < 0)
+		goto out_free_devname;
+
+	ret = -ENODEV;
+
+	dev = dev_get_by_name(&init_net, devname);
+	if (!dev)
+		goto out_free_devname;
+
+	update_netdev_tables();
+	ret = 0;
+	if (dev->priomap && prioidx < dev->priomap->priomap_len)
+		dev->priomap->priomap[prioidx] = priority;
+
+	dev_put(dev);
+
+out_free_devname:
+	kfree(devname);
+	return ret;
+}
+
+static struct cftype ss_files[] = {
+	{
+		.name = "prioidx",
+		.read_u64 = read_prioidx,
+	},
+	{
+		.name = "ifpriomap",
+		.read_map = read_priomap,
+		.write_string = write_priomap,
+	},
+};
+
+static int cgrp_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	return cgroup_add_files(cgrp, ss, ss_files, ARRAY_SIZE(ss_files));
+}
+
+static int netprio_device_event(struct notifier_block *unused,
+				unsigned long event, void *ptr)
+{
+	struct net_device *dev = ptr;
+	struct netprio_map *old;
+	u32 max_len = atomic_read(&max_prioidx);
+
+	old = rcu_dereference_protected(dev->priomap, 1);
+	/*
+	 * Note this is called with rtnl_lock held so we have update side
+	 * protection on our rcu assignments
+	 */
+
+	switch (event) {
+
+	case NETDEV_REGISTER:
+		if (max_len)
+			extend_netdev_table(dev, max_len);
+		break;
+	case NETDEV_UNREGISTER:
+		rcu_assign_pointer(dev->priomap, NULL);
+		if (old)
+			kfree_rcu(old, rcu);
+		break;
+	}
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block netprio_device_notifier = {
+	.notifier_call = netprio_device_event
+};
+
+static int __init init_cgroup_netprio(void)
+{
+	int ret;
+
+	ret = cgroup_load_subsys(&net_prio_subsys);
+	if (ret)
+		goto out;
+#ifndef CONFIG_NETPRIO_CGROUP
+	smp_wmb();
+	net_prio_subsys_id = net_prio_subsys.subsys_id;
+#endif
+
+	register_netdevice_notifier(&netprio_device_notifier);
+
+out:
+	return ret;
+}
+
+static void __exit exit_cgroup_netprio(void)
+{
+	struct netprio_map *old;
+	struct net_device *dev;
+
+	unregister_netdevice_notifier(&netprio_device_notifier);
+
+	cgroup_unload_subsys(&net_prio_subsys);
+
+#ifndef CONFIG_NETPRIO_CGROUP
+	net_prio_subsys_id = -1;
+	synchronize_rcu();
+#endif
+
+	rtnl_lock();
+	for_each_netdev(&init_net, dev) {
+		old = dev->priomap;
+		rcu_assign_pointer(dev->priomap, NULL);
+		if (old)
+			kfree_rcu(old, rcu);
+	}
+	rtnl_unlock();
+}
+
+module_init(init_cgroup_netprio);
+module_exit(exit_cgroup_netprio);
+MODULE_LICENSE("GPL v2");
diff --git a/net/core/sock.c b/net/core/sock.c
index 5a08762..77a4888 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -125,6 +125,7 @@
 #include <net/xfrm.h>
 #include <linux/ipsec.h>
 #include <net/cls_cgroup.h>
+#include <net/netprio_cgroup.h>
 
 #include <linux/filter.h>
 
@@ -221,10 +222,16 @@ __u32 sysctl_rmem_default __read_mostly = SK_RMEM_MAX;
 int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
 EXPORT_SYMBOL(sysctl_optmem_max);
 
-#if defined(CONFIG_CGROUPS) && !defined(CONFIG_NET_CLS_CGROUP)
+#if defined(CONFIG_CGROUPS)
+#if !defined(CONFIG_NET_CLS_CGROUP)
 int net_cls_subsys_id = -1;
 EXPORT_SYMBOL_GPL(net_cls_subsys_id);
 #endif
+#if !defined(CONFIG_NETPRIO_CGROUP)
+int net_prio_subsys_id = -1;
+EXPORT_SYMBOL_GPL(net_prio_subsys_id);
+#endif
+#endif
 
 static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)
 {
@@ -1111,6 +1118,18 @@ void sock_update_classid(struct sock *sk)
 		sk->sk_classid = classid;
 }
 EXPORT_SYMBOL(sock_update_classid);
+
+void sock_update_netprioidx(struct sock *sk)
+{
+	struct cgroup_netprio_state *state;
+	if (in_interrupt())
+		return;
+	rcu_read_lock();
+	state = task_netprio_state(current);
+	sk->sk_cgrp_prioidx = state ? state->prioidx : 0;
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL_GPL(sock_update_netprioidx);
 #endif
 
 /**
@@ -1138,6 +1157,7 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
 		atomic_set(&sk->sk_wmem_alloc, 1);
 
 		sock_update_classid(sk);
+		sock_update_netprioidx(sk);
 	}
 
 	return sk;
diff --git a/net/socket.c b/net/socket.c
index 2877647..108716f 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -549,6 +549,8 @@ static inline int __sock_sendmsg_nosec(struct kiocb *iocb, struct socket *sock,
 
 	sock_update_classid(sock->sk);
 
+	sock_update_netprioidx(sock->sk);
+
 	si->sock = sock;
 	si->scm = NULL;
 	si->msg = msg;
-- 
1.7.6.4


* [RFC PATCH 2/2] net: add documentation for net_prio cgroups
  2011-11-09 19:57 net: Add network priority cgroup Neil Horman
  2011-11-09 19:57 ` [RFC PATCH 1/2] net: add network priority cgroup infrastructure Neil Horman
@ 2011-11-09 19:57 ` Neil Horman
  2011-11-09 20:27 ` net: Add network priority cgroup Dave Taht
  2011-11-14 11:47 ` Neil Horman
  3 siblings, 0 replies; 15+ messages in thread
From: Neil Horman @ 2011-11-09 19:57 UTC (permalink / raw)
  To: netdev; +Cc: Neil Horman, John Fastabend, Robert Love, David S. Miller

Add the requisite documentation to explain to new users how net_prio cgroups work.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: John Fastabend <john.r.fastabend@intel.com>
CC: Robert Love <robert.w.love@intel.com>
CC: "David S. Miller" <davem@davemloft.net>
---
 Documentation/cgroups/net_prio.txt |   53 ++++++++++++++++++++++++++++++++++++
 1 files changed, 53 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/cgroups/net_prio.txt

diff --git a/Documentation/cgroups/net_prio.txt b/Documentation/cgroups/net_prio.txt
new file mode 100644
index 0000000..01b3226
--- /dev/null
+++ b/Documentation/cgroups/net_prio.txt
@@ -0,0 +1,53 @@
+Network priority cgroup
+-------------------------
+
+The Network priority cgroup provides an interface to allow an administrator to
+dynamically set the priority of network traffic generated by various
+applications.
+
+Nominally, an application would set the priority of its traffic via the
+SO_PRIORITY socket option.  This, however, is not always possible because:
+
+1) The application may not have been coded to set this value.
+2) The priority of application traffic is often a site-specific administrative
+   decision rather than an application defined one.
+
+This cgroup allows an administrator to assign a process to a group which defines
+the priority of egress traffic on a given interface. Network priority groups can
+be created by first mounting the cgroup filesystem.
+
+# mount -t cgroup -onet_prio none /sys/fs/cgroup/net_prio
+
+With the above step, the initial group acting as the parent accounting group
+becomes visible at '/sys/fs/cgroup/net_prio'.  This group includes all tasks in
+the system. '/sys/fs/cgroup/net_prio/tasks' lists the tasks in this cgroup.
+
+Each net_prio cgroup contains two files that are subsystem specific:
+
+net_prio.prioidx
+This file is read-only, and is simply informative.  It contains a unique integer
+value that the kernel uses as an internal representation of this cgroup.
+
+net_prio.ifpriomap
+This file contains a map of the priorities assigned to traffic originating from
+processes in this group and egressing the system on various interfaces.  It
+contains a list of tuples in the form <ifname priority>.  Contents of this file
+can be modified by echoing a string into the file using the same tuple format,
+for example:
+
+echo "eth0 5" > /sys/fs/cgroup/net_prio/iscsi/net_prio.ifpriomap
+
+This command would force any traffic originating from processes belonging to the
+iscsi net_prio cgroup and egressing on interface eth0 to have its priority set
+to the value 5.  The parent accounting group also has a writeable
+'net_prio.ifpriomap' file that can be used to set a system default priority.
+
+Priorities are set immediately prior to queueing a frame to the device's
+queueing discipline (qdisc), so priorities are assigned prior to the hardware
+queue selection being made.
+
+One use for the net_prio cgroup is with the mqprio qdisc, allowing application
+traffic to be steered to hardware/driver based traffic classes.  These mappings
+can then be managed by administrators or other networking protocols such as
+DCBX.
-- 
1.7.6.4


* Re: net: Add network priority cgroup
  2011-11-09 19:57 net: Add network priority cgroup Neil Horman
  2011-11-09 19:57 ` [RFC PATCH 1/2] net: add network priority cgroup infrastructure Neil Horman
  2011-11-09 19:57 ` [RFC PATCH 2/2] net: add documentation for net_prio cgroups Neil Horman
@ 2011-11-09 20:27 ` Dave Taht
  2011-11-09 21:09   ` Neil Horman
  2011-11-09 21:10   ` John Fastabend
  2011-11-14 11:47 ` Neil Horman
  3 siblings, 2 replies; 15+ messages in thread
From: Dave Taht @ 2011-11-09 20:27 UTC (permalink / raw)
  To: Neil Horman; +Cc: netdev, John Fastabend, Robert Love, David S. Miller

On Wed, Nov 9, 2011 at 8:57 PM, Neil Horman <nhorman@tuxdriver.com> wrote:
>
> Data Center Bridging environments are currently somewhat limited in their
> ability to provide a general mechanism for controlling traffic priority.



>
> Specifically they are unable to administratively control the priority at which
> various types of network traffic are sent.
>
> Currently, the only ways to set the priority of a network buffer are:
>
> 1) Through the use of the SO_PRIORITY socket option
> 2) By using low level hooks, like a tc action
>
2), above is a little vague.

There are dozens of ways to control the relative priorities of network
streams in addition to priority, notably diffserv, various forms of
fair queuing, and active queue management techniques like RED, Blue,
etc.

The priority field within the Linux skb is used for multiple purposes
- in addition to SO_PRIORITY it is also used for queue selection
within tc for a variety of queuing disciplines.  Certain bands are
reserved for vlan and wireless queueing (these features are rarely
used).

Twiddling with it on one level or creating a controller for it can and
will still be messed up by attempts to sanely use it elsewhere in the
stack.

>
> (1) is difficult from an administrative perspective because it requires that the
> application to be coded to not just assume the default priority is sufficient,
> and must expose an administrative interface to allow priority adjustment.  Such
> a solution is not scalable in a DCB environment
>

Nor any other complex environment. Or even a simple one.

>
> (2) is also difficult, as it requires constant administrative oversight of
> applications so as to build appropriate rules to match traffic belonging to

Yes, your description of option 2, as simplified above, is difficult.

However certain algorithms are intended to improve fairness between
flows that do not require as much oversight and classification.

However, even when RED or a newer queue management algorithm such as
QFQ or DRR is applied, classes of traffic exist that benefit from more
specialized diffserv or diffserv-like behavior.

However, the evidence for something more complex in server
environments than simple priority management is compelling at this
point.

> various classes, so that priority can be appropriately set. It is further
> limiting when DCB enabled hardware is in use, due to the fact that tc rules are
> only run after a root qdisc has been selected (DCB enabled hardware may reserve
> hw queues for various traffic classes and needs the priority to be set prior to
> selecting the root qdisc)
>

Multiple applications (somewhat) rightly set priorities according to
their view of the world.

background traffic and immediate traffic often set the appropriate
diffserv bits, other traffic can do the same, and at least a few apps
set the priority field also in the hope that that will do some good,
and perhaps more should.


>
> I've discussed various solutions with John Fastabend, and we saw a cgroup as
> being a good general solution to this problem.  The network priority cgroup

Not if you are wanting to apply queue management further down the stack!

>
> allows for a per-interface priority map to be built per cgroup.  Any traffic
> originating from an application in a cgroup, that does not explicitly set its
> priority with SO_PRIORITY will have its priority assigned to the value
> designated for that group on that interface.

> This allows a user space daemon,
> when conducting LLDP negotiation with a DCB enabled peer to create a cgroup
> based on the APP_TLV value received and administratively assign applications to
> that priority using the existing cgroup utility infrastructure.

I would like it if the many uses of the priority field were reduced to
one use per semantic grouping.

You are adding a controller to something that is already
ill-controlled and ill-defined, overly overloaded and both under- and
overused, to be managed in userspace by code to be designed later, and
then re-mapped once it exits a vm into another host or hardware queue
management system which may or may not share similar assumptions.

Don't get me wrong, I LIKE the controller idea, but think the priority
field needs to be un-overloaded first to avoid ill-effects elsewhere
in the users of the down-stream subsystems.

> Tested by John and myself, with good results

With what?

> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> CC: John Fastabend <john.r.fastabend@intel.com>
> CC: Robert Love <robert.w.love@intel.com>
> CC: "David S. Miller" <davem@davemloft.net>



--
Dave Täht
SKYPE: davetaht

http://www.bufferbloat.net


* Re: net: Add network priority cgroup
  2011-11-09 20:27 ` net: Add network priority cgroup Dave Taht
@ 2011-11-09 21:09   ` Neil Horman
  2011-11-09 21:10   ` John Fastabend
  1 sibling, 0 replies; 15+ messages in thread
From: Neil Horman @ 2011-11-09 21:09 UTC (permalink / raw)
  To: Dave Taht; +Cc: netdev, John Fastabend, Robert Love, David S. Miller

On Wed, Nov 09, 2011 at 09:27:08PM +0100, Dave Taht wrote:
> On Wed, Nov 9, 2011 at 8:57 PM, Neil Horman <nhorman@tuxdriver.com> wrote:
> >
> > Data Center Bridging environments are currently somewhat limited in their
> > ability to provide a general mechanism for controlling traffic priority.
> 
> 
> 
> >
> > Specifically they are unable to administratively control the priority at which
> > various types of network traffic are sent.
> >
> > Currently, the only ways to set the priority of a network buffer are:
> >
> > 1) Through the use of the SO_PRIORITY socket option
> > 2) By using low level hooks, like a tc action
> >
> 2), above is a little vague.
> 
> There are dozens of ways to control the relative priorities of network
> streams in addition to priority notably diffserv, various forms of
> fair queuing, and active queue management techniques like RED, Blue,
> etc.
> 
I'm referring explicitly to skb->priority here.  Sorry if I wasn't clear.

> The priority field within the Linux skb is used for multiple purposes
> - in addition to SO_PRIORITY it is also used for queue selection
> within tc for a variety of queuing disciplines. Certain bands are
> reserved for vlan and wireless queueing, (these features are rarely
> used)
> 
Yes.

> Twiddling with it on one level or creating a controller for it can and
> will still be messed up by attempts to sanely use it elsewhere in the
> stack.
> 
Why?  It's not like it can't already be twiddled with via SO_PRIORITY.  This does
exactly the same thing, it just lets us do it via an administrative interface
rather than a programmatic one.  I don't disagree that the use of skb->priority
is complex, but this doesn't add any complexity that isn't already there.  It
just gives us a general way to assign priorities for those that know how to use
it consistently, in a way that doesn't require application modification.  That's
something that DCB needs.

> >
> > (1) is difficult from an administrative perspective because it requires that the
> > application to be coded to not just assume the default priority is sufficient,
> > and must expose an administrative interface to allow priority adjustment.  Such
> > a solution is not scalable in a DCB environment
> >
> 
> Nor any other complex environment. Or even a simple one.
Yes.

> 
> >
> > (2) is also difficult, as it requires constant administrative oversight of
> > applications so as to build appropriate rules to match traffic belonging to
> 
> Yes, your description of option 2, as simplified above, is difficult.
> 
> However certain algorithms are intended to improve fairness between
> flows that do not require as much oversight and classification.
> 
Yes, but DCB is orthogonal to software traffic control.  It's hardware queueing
based on the priority value of an skb.  As such, when a DCB enabled multiqueue
adapter selects the output queues in dev_pick_tx, it needs to have the
skb->priority value set properly.  Since we don't run any of the tc filters or
classifiers until after that's complete, we can't use those to adjust the skb
priority, as the root qdisc is already selected.

> However, even when RED or a newer queue management algorithm such as
> QFQ or DRR is applied, classes of traffic exist that benefit from more
> specialized diffserv or diffserv-like behavior.
> 
I understand, but again, DCB is orthogonal to that.  DCB is a hardware based
solution that steers traffic to various output queues in the NIC based on the
skb->priority value.  Take a look at ixgbe_select_queue for an example.

> However, the evidence for something more complex in server
> environments than simple priority management is compelling at this
> point.
> 
> > various classes, so that priority can be appropriately set. It is further
> > limiting when DCB enabled hardware is in use, due to the fact that tc rules are
> > only run after a root qdisc has been selected (DCB enabled hardware may reserve
> > hw queues for various traffic classes and needs the priority to be set prior to
> > selecting the root qdisc)
> >
> 
> Multiple applications (somewhat) rightly set priorities according to
> their view of the world.
> 
> background traffic and immediate traffic often set the appropriate
> diffserv bits, other traffic can do the same, and at least a few apps
> set the priority field also in the hope that that will do some good,
> and perhaps more should.
> 
Agreed, and this patch respects that.  It only sets the priority of an skb that
doesn't already have its priority set.  See skb_update_prio.

> 
> >
> > I've discussed various solutions with John Fastabend, and we saw a cgroup as
> > being a good general solution to this problem.  The network priority cgroup
> 
> Not if you are wanting to apply queue management further down the stack!
> 
I'm not saying you can use the two together! I understand that this solution
interferes with the use of skb->priority in various queuing disciplines (just
like a program using SO_PRIORITY would), but the way those disciplines work is
incompatible with DCB at the moment.  You wouldn't use them all at the same
time.  I'd be happy to add some documentation to my patch to reflect that if you
like.

> >
> > allows for a per-interface priority map to be built per cgroup.  Any traffic
> > originating from an application in a cgroup, that does not explicitly set its
> > priority with SO_PRIORITY will have its priority assigned to the value
> > designated for that group on that interface.
> 
> > This allows a user space daemon,
> > when conducting LLDP negotiation with a DCB enabled peer to create a cgroup
> > based on the APP_TLV value received and administratively assign applications to
> > that priority using the existing cgroup utility infrastructure.
> 
> I would like it if the many uses of the priority field were reduced to
> one use per semantic grouping.
> 
> You are adding a controller to something that is already
> ill-controlled and ill-defined, overly overloaded and both under and
> over used, to be managed in userspace by code to be designed later, and
> then re-mapped once it exits a vm into another host or hardware queue
> management system which may or may not share similar assumptions.
> 
> Don't get me wrong, I LIKE the controller idea, but think the priority
> field needs to be un-overloaded first to avoid ill-effects elsewhere
> in the users of the down-stream subsystems.
> 
We can certainly discuss the idea of separating the various semantic uses of
skb->priority out, but I don't think this patch is the place to do it. The
DCB use case for priority already exists (it specifically uses the prio_tc_map
as indexed by skb->priority in __skb_tx_hash).  I'm just adding a means of
controlling it more easily and reliably. 
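
For reference, that path looks roughly like this (a simplified sketch modeled
on __skb_tx_hash()/netdev_get_prio_tc_map() from the mqprio work; details
assumed, not part of this patch set):

static u16 pick_dcb_txq(const struct net_device *dev,
			const struct sk_buff *skb, u16 hash)
{
	/* skb->priority selects the traffic class... */
	int tc = netdev_get_prio_tc_map(dev, skb->priority);

	/* ...and the class bounds the range of hardware tx queues */
	return dev->tc_to_txq[tc].offset + (hash % dev->tc_to_txq[tc].count);
}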

> > Tested by John and myself, with good results
> 
> With what?
> 
What else?  An ixgbe adapter and ping.  I created a test netprio cgroup, assigned a
priority value to it, and did a cgexec -g net_prio:test ping www.yahoo.com
with a printk in the ixgbe tx method to validate that the proper queue mapping
was selected.

Neil

> > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > CC: John Fastabend <john.r.fastabend@intel.com>
> > CC: Robert Love <robert.w.love@intel.com>
> > CC: "David S. Miller" <davem@davemloft.net>
> 
> 
> 
> --
> Dave Täht
> SKYPE: davetaht
> 
> http://www.bufferbloat.net
> 


* Re: net: Add network priority cgroup
  2011-11-09 20:27 ` net: Add network priority cgroup Dave Taht
  2011-11-09 21:09   ` Neil Horman
@ 2011-11-09 21:10   ` John Fastabend
  1 sibling, 0 replies; 15+ messages in thread
From: John Fastabend @ 2011-11-09 21:10 UTC (permalink / raw)
  To: Dave Taht
  Cc: Neil Horman, netdev@vger.kernel.org, Love, Robert W,
	David S. Miller

On 11/9/2011 12:27 PM, Dave Taht wrote:
> On Wed, Nov 9, 2011 at 8:57 PM, Neil Horman <nhorman@tuxdriver.com> wrote:
>>
>> Data Center Bridging environments are currently somewhat limited in their
>> ability to provide a general mechanism for controlling traffic priority.
> 
> 
> 
>>
>> Specifically they are unable to administratively control the priority at which
>> various types of network traffic are sent.
>>
>> Currently, the only ways to set the priority of a network buffer are:
>>
>> 1) Through the use of the SO_PRIORITY socket option
>> 2) By using low level hooks, like a tc action
>>
> 2), above is a little vague.
> 
> There are dozens of ways to control the relative priorities of network
> streams in addition to priority notably diffserv, various forms of
> fair queuing, and active queue management techniques like RED, Blue,
> etc.
> 

Maybe there are dozens of ways to control traffic using various combinations of
qdiscs, but I think for classification we have a small set of reasonably
well-defined mechanisms:

 - tc filter/action
 - netfilter infrastructure, e.g. CLASSIFY (iptables/ebtables)
 - socket options SO_PRIORITY and SO_TOS

By the way, setting the TOS bits also sets sk->priority.  What other
classification mechanisms did I miss?

> The priority field within the Linux skb is used for multiple purposes
> - in addition to SO_PRIORITY it is also used for queue selection
> within tc for a variety of queuing disciplines. Certain bands are
> reserved for vlan and wireless queueing, (these features are rarely
> used)
> 
> Twiddling with it on one level or creating a controller for it can and
> will still be messed up by attempts to sanely use it elsewhere in the
> stack.
> 

The skb->priority is used by some qdiscs and also with vlan egress_maps.

Without knowing the wireless situation, it seems you can either not manage
priority over wireless links if this is a problem, or perhaps we can clean
up the wireless queueing and integrate it with the appropriate qdisc.

Could the wireless skb->priority usage be tied into mqprio?

>>
>> (1) is difficult from an administrative perspective because it requires that the
>> application to be coded to not just assume the default priority is sufficient,
>> and must expose an administrative interface to allow priority adjustment.  Such
>> a solution is not scalable in a DCB environment
>>
> 
> Nor any other complex environment. Or even a simple one.
> 
>>
>> (2) is also difficult, as it requires constant administrative oversight of
>> applications so as to build appropriate rules to match traffic belonging to
> 
> Yes, your description of option 2, as simplified above, is difficult.
> 
> However certain algorithms are intended to improve fairness between
> flows that do not require as much oversight and classification.
> 
> However, even when RED or a newer queue management algorithm such as
> QFQ or DRR is applied, classes of traffic exist that benefit from more
> specialized diffserv or diffserv-like behavior.
> 
> However, the evidence for something more complex in server
> environments than simple priority management is compelling at this
> point.
> 
>> various classes, so that priority can be appropriately set. It is further
>> limiting when DCB enabled hardware is in use, due to the fact that tc rules are
>> only run after a root qdisc has been selected (DCB enabled hardware may reserve
>> hw queues for various traffic classes and needs the priority to be set prior to
>> selecting the root qdisc)
>>
> 
> Multiple applications (somewhat) rightly set priorities according to
> their view of the world.
> 
> background traffic and immediate traffic often set the appropriate
> diffserv bits, other traffic can do the same, and at least a few apps
> set the priority field also in the hope that that will do some good,
> and perhaps more should.

These patches do not overwrite existing priorities. So applications
that manage the priority can continue to do this.

> 
> 
>>
>> I've discussed various solutions with John Fastabend, and we saw a cgroup as
>> being a good general solution to this problem.  The network priority cgroup
> 
> Not if you are wanting to apply queue management further down the stack!
> 

I don't follow.  Here you're saying that you have queue management that the
QOS layer is unaware of?  OK, so any qdisc or priority mechanism is going to
interfere with 'further down the stack'.

>>
>> allows for a per-interface priority map to be built per cgroup.  Any traffic
>> originating from an application in a cgroup, that does not explicitly set its
>> priority with SO_PRIORITY will have its priority assigned to the value
>> designated for that group on that interface.
> 
>> This allows a user space daemon,
>> when conducting LLDP negotiation with a DCB enabled peer to create a cgroup
>> based on the APP_TLV value received and administratively assign applications to
>> that priority using the existing cgroup utility infrastructure.
> 
> I would like it if the many uses of the priority field were reduced to
> one use per semantic grouping.
> 
> You are adding a controller to something that is already
> ill-controlled and ill-defined, overly overloaded and both under and
> over used, to be managed in userspace by code to be designed later, and
> then re-mapped once it exits a vm into another host or hardware queue
> management system which may or may not share similar assumptions.
> 

I don't think it's ill-defined or ill-controlled.  The priority can be
set by well defined mechanisms.  We provide another mechanism to set
the priority without having to modify existing applications, and a
mechanism for administrators/tools to set it dynamically.

Overloaded?  Perhaps the egress_map is a bit of an overloading of this,
but it's existed for a long time.

IMHO hardware queue management systems should be integrated into the
qdisc layer if possible.  DCB enabled hardware had similar problems
trying to do hardware queue management without involving the OS, and
had to add hacks into select_queue() or hard-code traffic types
into the base drivers to work around this.  'mqprio' and dev support
for traffic classes was my take on a generic mechanism to expose this
to the OS.


> Don't get me wrong, I LIKE the controller idea, but think the priority
> field needs to be un-overloaded first to avoid ill-effects elsewhere
> in the users of the down-stream subsystems.
> 

But doesn't this help the down-stream subsystems as well? The priority
will eventually be pushed down the stack.

>> Tested by John and myself, with good results
> 
> With what?
> 

I tested this with mqprio using the net_prio cgroups to set the priority
and using mqprio to bind hardware queue sets to each priority. Then
I used netperf, ping, and the cg* tools to test I/O.

As a side note I expect you could also use this in conjunction with
the vlan egress_map to push applications onto 802.1Q priorities.
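
Roughly, that egress-map step amounts to the following standalone model (the
real net/8021q code differs in detail, so treat the shape here as assumed):

#include <stdint.h>

/* Model of the vlan egress map: skb->priority is looked up to find the
 * 802.1p (PCP) bits to place in the outgoing vlan tag. */
struct egress_map_entry {
	uint32_t skb_prio;		/* skb->priority value to match */
	uint16_t vlan_qos;		/* 802.1p bits for the vlan TCI */
	struct egress_map_entry *next;
};

static uint16_t egress_map_lookup(const struct egress_map_entry *map,
				  uint32_t skb_prio)
{
	for (; map; map = map->next)
		if (map->skb_prio == skb_prio)
			return map->vlan_qos;
	return 0;	/* unmapped priorities default to PCP 0 */
}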

>> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
>> CC: John Fastabend <john.r.fastabend@intel.com>
>> CC: Robert Love <robert.w.love@intel.com>
>> CC: "David S. Miller" <davem@davemloft.net>
> 
> 
> 
> --
> Dave Täht
> SKYPE: davetaht
> 
> http://www.bufferbloat.net


* Re: net: Add network priority cgroup
  2011-11-09 19:57 net: Add network priority cgroup Neil Horman
                   ` (2 preceding siblings ...)
  2011-11-09 20:27 ` net: Add network priority cgroup Dave Taht
@ 2011-11-14 11:47 ` Neil Horman
  2011-11-14 12:32   ` Dave Taht
  3 siblings, 1 reply; 15+ messages in thread
From: Neil Horman @ 2011-11-14 11:47 UTC (permalink / raw)
  To: netdev; +Cc: John Fastabend, Robert Love, David S. Miller

On Wed, Nov 09, 2011 at 02:57:33PM -0500, Neil Horman wrote:
> Data Center Bridging environments are currently somewhat limited in their
> ability to provide a general mechanism for controlling traffic priority.
> Specifically they are unable to administratively control the priority at which
> various types of network traffic are sent.
>  
> Currently, the only ways to set the priority of a network buffer are:
> 
> 1) Through the use of the SO_PRIORITY socket option
> 2) By using low level hooks, like a tc action
> 
> (1) is difficult from an administrative perspective because it requires that the
> application to be coded to not just assume the default priority is sufficient,
> and must expose an administrative interface to allow priority adjustment.  Such
> a solution is not scalable in a DCB environment
> 
> (2) is also difficult, as it requires constant administrative oversight of
> applications so as to build appropriate rules to match traffic belonging to
> various classes, so that priority can be appropriately set. It is further
> limiting when DCB enabled hardware is in use, due to the fact that tc rules are
> only run after a root qdisc has been selected (DCB enabled hardware may reserve
> hw queues for various traffic classes and needs the priority to be set prior to
> selecting the root qdisc)
> 
> 
> I've discussed various solutions with John Fastabend, and we saw a cgroup as
> being a good general solution to this problem.  The network priority cgroup
> allows for a per-interface priority map to be built per cgroup.  Any traffic
> originating from an application in a cgroup, that does not explicitly set its
> priority with SO_PRIORITY will have its priority assigned to the value
> designated for that group on that interface.  This allows a user space daemon,
> when conducting LLDP negotiation with a DCB enabled peer to create a cgroup
> based on the APP_TLV value received and administratively assign applications to
> that priority using the existing cgroup utility infrastructure.
> 
> Tested by John and myself, with good results
> 
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> CC: John Fastabend <john.r.fastabend@intel.com>
> CC: Robert Love <robert.w.love@intel.com>
> CC: "David S. Miller" <davem@davemloft.net>
> 

Bump, any other thoughts here?  Dave T. has some reasonable thoughts regarding
the use of skb->priority, but IMO they really seem orthogonal to the purpose of
this change.  Any other reviews would be welcome.

John, Robert, if you're supportive of these changes, some Acks would be
appreciated.


Regards
Neil


* Re: net: Add network priority cgroup
  2011-11-14 11:47 ` Neil Horman
@ 2011-11-14 12:32   ` Dave Taht
  2011-11-14 14:43     ` Neil Horman
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Taht @ 2011-11-14 12:32 UTC (permalink / raw)
  To: Neil Horman; +Cc: netdev, John Fastabend, Robert Love, David S. Miller

On Mon, Nov 14, 2011 at 12:47 PM, Neil Horman <nhorman@tuxdriver.com> wrote:
> On Wed, Nov 09, 2011 at 02:57:33PM -0500, Neil Horman wrote:
>> Data Center Bridging environments are currently somewhat limited in their
>> ability to provide a general mechanism for controlling traffic priority.
>> Specifically they are unable to administratively control the priority at which
>> various types of network traffic are sent.
>>
>> Currently, the only ways to set the priority of a network buffer are:
>>
>> 1) Through the use of the SO_PRIORITY socket option
>> 2) By using low level hooks, like a tc action
>>
>> (1) is difficult from an administrative perspective because it requires that the
>> application to be coded to not just assume the default priority is sufficient,
>> and must expose an administrative interface to allow priority adjustment.  Such
>> a solution is not scalable in a DCB environment
>>
>> (2) is also difficult, as it requires constant administrative oversight of
>> applications so as to build appropriate rules to match traffic belonging to
>> various classes, so that priority can be appropriately set. It is further
>> limiting when DCB enabled hardware is in use, due to the fact that tc rules are
>> only run after a root qdisc has been selected (DCB enabled hardware may reserve
>> hw queues for various traffic classes and needs the priority to be set prior to
>> selecting the root qdisc)
>>
>>
>> I've discussed various solutions with John Fastabend, and we saw a cgroup as
>> being a good general solution to this problem.  The network priority cgroup
>> allows for a per-interface priority map to be built per cgroup.  Any traffic
>> originating from an application in a cgroup, that does not explicitly set its
>> priority with SO_PRIORITY will have its priority assigned to the value
>> designated for that group on that interface.  This allows a user space daemon,
>> when conducting LLDP negotiation with a DCB enabled peer to create a cgroup
>> based on the APP_TLV value received and administratively assign applications to
>> that priority using the existing cgroup utility infrastructure.
>>
>> Tested by John and myself, with good results
>>
>> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
>> CC: John Fastabend <john.r.fastabend@intel.com>
>> CC: Robert Love <robert.w.love@intel.com>
>> CC: "David S. Miller" <davem@davemloft.net>
>>
>
> Bump, any other thoughts here?  Dave T. has some reasonable thoughts regarding
> the use of skb->priority, but IMO they really seem orthogonal to the purpose of
> this change.  Any other reviews would be welcome.

Well, in part I've been playing catch-up, in the hope that lldp and
openlldp and/or this dcb netlink layer that I don't know anything
about (pointers please?) could somehow help resolve the semantic
mess skb->priority has become in the first place.

I liked what was described here.

"What if we did at least carve out the DCB functionality away from
skb->priority?  Since, AIUI, we're only concerning ourselves with
locally generated traffic here, we're talking
about skbs that have a socket attached to them.  We could, instead of indexing
the prio_tc_map with skb->priority, we could index it with
skb->dev->priomap[skb->sk->prioidx] (as provided by this patch).  The cgroup
then could be, instead of a strict priority cgroup, a queue_selector cgroup (or
something more appropriately named), and we don't have to touch skb->priority at
all.  I'd really rather not start down that road until I got more opinions and
consensus on that, but it seems like a pretty good solution, one that would
allow hardware queue selection in systems that use things like DCB to co-exist
with software queueing features."
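
A rough sketch of what that quoted idea could look like, reusing the
structures from this patch set (hypothetical; neither patch implements this):

static inline u8 netprio_pick_tc(struct net_device *dev,
				 const struct sock *sk)
{
	struct netprio_map *map = rcu_dereference(dev->priomap);

	/* here priomap would hold traffic classes rather than priorities */
	if (map && sk->sk_cgrp_prioidx < map->priomap_len)
		return map->priomap[sk->sk_cgrp_prioidx];
	return 0;	/* default traffic class */
}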

The piece that still kind of bothered me about the original proposal
(and perhaps this one) was that setting SO_PRIORITY in an app means
'give my packets more mojo'.

Taking something that took unprioritized packets and assigned them and
*them only* to a hardware queue struck me as possibly deprioritizing
the 'more mojo wanted' packets in the app(s), as they would end up in
some other, possibly overloaded, hardware queue.

So a cgroup that moves all of the packets from an application into a
given hardware queue, and then gets scheduled normally according to
skb->priority and friends (software queue, default of pfifo_fast,
etc), seems to make some sense to me. (I wouldn't mind if we had
abstractions for software queues, too, like, I need a software queue
with these properties, find me a place for it on the hardware - but
I'm dreaming)

One open question is where do packets generated from other subsystems
end up, if you are using a cgroup for the app? arp, dns, etc?

So to rephrase your original description from this:

>> Any traffic originating from an application in a cgroup, that does not explicitly set its
>> priority with SO_PRIORITY will have its priority assigned to the value
>> designated for that group on that interface.  This allows a user space daemon,
>> when conducting LLDP negotiation with a DCB enabled peer to create a cgroup
>> based on the APP_TLV value received and administratively assign applications to
>> that priority using the existing cgroup utility infrastructure.

To this:

"Any traffic originating from an application in a cgroup,  will have
its hardware queue  assigned to the value designated for that group on
that interface.  This allows a user space daemon, when conducting LLDP
negotiation with a DCB enabled peer to create a cgroup based on the
APP_TLV value received and administratively assign applications to
that hardware queue using the existing cgroup utility infrastructure."

Assuming we're on the same page here, what the heck is APP_TLV?

> John, Robert, if you're supportive of these changes, some Acks would be
> appreciated.


>
>
> Regards
> Neil
>
>



-- 
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
FR Tel: 0638645374
http://www.bufferbloat.net


* Re: net: Add network priority cgroup
  2011-11-14 12:32   ` Dave Taht
@ 2011-11-14 14:43     ` Neil Horman
  2011-11-14 16:38       ` John Fastabend
  2011-11-14 16:43       ` Shyam_Iyer
  0 siblings, 2 replies; 15+ messages in thread
From: Neil Horman @ 2011-11-14 14:43 UTC (permalink / raw)
  To: Dave Taht; +Cc: netdev, John Fastabend, Robert Love, David S. Miller

On Mon, Nov 14, 2011 at 01:32:04PM +0100, Dave Taht wrote:
> On Mon, Nov 14, 2011 at 12:47 PM, Neil Horman <nhorman@tuxdriver.com> wrote:
> > On Wed, Nov 09, 2011 at 02:57:33PM -0500, Neil Horman wrote:
> >> Data Center Bridging environments are currently somewhat limited in their
> >> ability to provide a general mechanism for controlling traffic priority.
> >> Specifically they are unable to administratively control the priority at which
> >> various types of network traffic are sent.
> >>
> >> Currently, the only ways to set the priority of a network buffer are:
> >>
> >> 1) Through the use of the SO_PRIORITY socket option
> >> 2) By using low level hooks, like a tc action
> >>
> >> (1) is difficult from an administrative perspective because it requires that the
> >> application to be coded to not just assume the default priority is sufficient,
> >> and must expose an administrative interface to allow priority adjustment.  Such
> >> a solution is not scalable in a DCB environment
> >>
> >> (2) is also difficult, as it requires constant administrative oversight of
> >> applications so as to build appropriate rules to match traffic belonging to
> >> various classes, so that priority can be appropriately set. It is further
> >> limiting when DCB enabled hardware is in use, due to the fact that tc rules are
> >> only run after a root qdisc has been selected (DCB enabled hardware may reserve
> >> hw queues for various traffic classes and needs the priority to be set prior to
> >> selecting the root qdisc)
> >>
> >>
> >> I've discussed various solutions with John Fastabend, and we saw a cgroup as
> >> being a good general solution to this problem.  The network priority cgroup
> >> allows for a per-interface priority map to be built per cgroup.  Any traffic
> >> originating from an application in a cgroup, that does not explicitly set its
> >> priority with SO_PRIORITY will have its priority assigned to the value
> >> designated for that group on that interface.  This allows a user space daemon,
> >> when conducting LLDP negotiation with a DCB enabled peer to create a cgroup
> >> based on the APP_TLV value received and administratively assign applications to
> >> that priority using the existing cgroup utility infrastructure.
> >>
> >> Tested by John and myself, with good results
> >>
> >> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> >> CC: John Fastabend <john.r.fastabend@intel.com>
> >> CC: Robert Love <robert.w.love@intel.com>
> >> CC: "David S. Miller" <davem@davemloft.net>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe netdev" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>
> >
> > Bump, any other thoughts here?  Dave T. has some reasonable thoughts regarding
> > the use of skb->priority, but IMO they really seem orthogonal to the purpose of
> > this change.  Any other reviews would be welcome.
> 
> Well, in part I've been playing catchup in the hope that lldp and
> openlldp and/or this dcb netlink layer that I don't know anything
> about (pointers please?) could help somehow to resolve the semantic
> mess skb->priority has become in the first place.
> 
> I liked what was described here.
> 
> "What if we did at least carve out the DCB functionality away from
> skb->priority?  Since, AIUI, we're only concerning ourselves with
> locally generated traffic here, we're talking
> about skbs that have a socket attached to them.  We could, instead of indexing
> the prio_tc_map with skb->priority, we could index it with
> skb->dev->priomap[skb->sk->prioidx] (as provided by this patch).  The cgroup
> then could be, instead of a strict priority cgroup, a queue_selector cgroup (or
> something more appropriately named), and we don't have to touch skb->priority at
> all.  I'd really rather not start down that road until I got more opinions and
> consensus on that, but it seems like a pretty good solution, one that would
> allow hardware queue selection in systems that use things like DCB to co-exist
> with software queueing features."
> 
I was initially OK with this, but the more I think about it, the more I think
it's just not needed (see further down in this email for my reasoning).  John,
Rob, do you have any thoughts here?

> The piece that still kind of bothered me about the original proposal
> (and perhaps this one) was that setting SO_PRIORITY in an app means
> 'give my packets more mojo'.
> 
> Taking something that took unprioritized packets and assigned them and
> *them only* to a hardware queue struck me as possibly deprioritizing
> the 'more mojo wanted' packets in the app(s), as they would end up in
> some other, possibly overloaded, hardware queue.
> 
I don't really see what you mean by this at all.  Taking packets with no
priority and assigning them a priority doesn't really have an effect on
pre-prioritized packets.  Or rather it shouldn't.  You can certainly create a
problem by having apps prioritized according to conflicting semantic rules, but
that strikes me as administrative error.  Garbage in...Garbage out.

> So a cgroup that moves all of the packets from an application into a
> given hardware queue, and then gets scheduled normally according to
> skb->priority and friends (software queue, default of pfifo_fast,
> etc), seems to make some sense to me. (I wouldn't mind if we had
> abstractions for software queues, too, like, I need a software queue
> with these properties, find me a place for it on the hardware - but
> I'm dreaming)
> 
> One open question is where do packets generated from other subsystems
> end up, if you are using a cgroup for the app? arp, dns, etc?
> 
The overriding rule is the association of an skb with a socket.  If a transmitted
frame has skb->sk set in dev_queue_xmit, then we interrogate its priority index,
as set when we passed through the sendmsg code at the top of the stack.
Otherwise the frame is handled exactly as it is today.
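
(To make that rule concrete, here is a toy, self-contained C model of the
lookup described above.  The structure and field names -- prioidx, priomap --
follow the cover letter's description and are illustrative only, not the
patch's actual code.)

#include <stdio.h>

/* Toy stand-ins for the kernel structures; names are illustrative only. */
struct toy_sock   { unsigned int prioidx; unsigned int sk_priority; };
struct toy_netdev { unsigned int priomap[4]; };   /* indexed by cgroup prioidx */
struct toy_skb    { struct toy_sock *sk; struct toy_netdev *dev;
                    unsigned int priority; };

/* The rule: socket-backed frames that did not set SO_PRIORITY inherit
 * the per-device priority configured for their cgroup. */
static void toy_update_prio(struct toy_skb *skb)
{
        if (!skb->sk)                   /* arp, forwarded frames, etc. */
                return;
        if (skb->sk->sk_priority)       /* explicit SO_PRIORITY wins */
                return;
        skb->priority = skb->dev->priomap[skb->sk->prioidx];
}

int main(void)
{
        struct toy_netdev eth0 = { .priomap = { 0, 4, 0, 0 } }; /* cgroup 1 -> prio 4 */
        struct toy_sock sk = { .prioidx = 1, .sk_priority = 0 };
        struct toy_skb skb = { .sk = &sk, .dev = &eth0, .priority = 0 };

        toy_update_prio(&skb);
        printf("assigned priority %u\n", skb.priority);  /* prints 4 */
        return 0;
}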

> So to rephrase your original description from this:
> 
> >> Any traffic originating from an application in a cgroup, that does not explicitly set its
> >> priority with SO_PRIORITY will have its priority assigned to the value
> >> designated for that group on that interface.  This allows a user space daemon,
> >> when conducting LLDP negotiation with a DCB enabled peer to create a cgroup
> >> based on the APP_TLV value received and administratively assign applications to
> >> that priority using the existing cgroup utility infrastructure.
> > John, Robert, if you're supportive of these changes, some Acks would be
> > appreciated.
> 
> To this:
> 
> "Any traffic originating from an application in a cgroup,  will have
> its hardware queue  assigned to the value designated for that group on
> that interface.  This allows a user space daemon, when conducting LLDP
> negotiation with a DCB enabled peer to create a cgroup based on the
> APP_TLV value received and administratively assign applications to
> that hardware queue using the existing cgroup utility infrastructure."
> 
As above, I'm split-brained about this.  I'm OK with the idea of making this a
queue selection cgroup and separating it from priority, but at the same time,
in the context of DCB, we really are assigning priority here, so it seems a bit
false to do something that is not priority.  I also like the fact that it
provides administrative control in a way that netfilter and tc don't really
enable.

> Assuming we're on the same page here, what the heck is APP_TLV?
> 
LLDP does layer 2 discovery with peer networking devices.  It does so using sets
of Type/Length/Value (TLV) tuples.  The types carry various bits of information,
such as which priority groups are available on the network.  The APP TLV conveys
application- or feature-specific information.  For instance, there is an iSCSI
APP TLV that tells the host "on the interface you received this TLV, iSCSI
traffic must be sent at priority X".  The idea is that, on receipt of this TLV,
the DCB daemon can create an iSCSI network priority cgroup instance and augment
the cgroup rules file such that, when the user space iscsi daemon is started,
its traffic automatically transmits at the appropriate priority.
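
(For the curious, here is a small self-contained sketch of what one parsed
APP TLV entry boils down to, modeled loosely on the kernel's struct dcb_app
from include/linux/dcbnl.h.  The selector, port, and priority values are
illustrative placeholders.)

#include <stdint.h>
#include <stdio.h>

/* Loosely modeled on struct dcb_app: a parsed APP TLV entry binds a
 * protocol identifier to an 802.1p priority. */
struct app_entry {
        uint8_t  selector;   /* how 'protocol' is interpreted (per 802.1Qaz) */
        uint16_t protocol;   /* 3260 = conventional iSCSI target port */
        uint8_t  priority;   /* priority mandated by the peer */
};

int main(void)
{
        /* "On this interface, iSCSI traffic must be sent at priority 4." */
        struct app_entry iscsi = { .selector = 2, .protocol = 3260, .priority = 4 };

        /* On receipt, the DCB daemon would create a net_prio cgroup and
         * program its per-interface priority map accordingly. */
        printf("protocol %u -> priority %u\n",
               (unsigned)iscsi.protocol, (unsigned)iscsi.priority);
        return 0;
}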

Actually, as I sit here thinking about it, I'm less supportive of separating the
assignment of hw queue / root qdisc from priority.  I say that because it's
really just not necessary, at least not in a DCB environment.  In a DCB
environment, we use priority exclusively to select the root qdisc and the
associated hardware queue.  Any software queue policing/grooming that takes
place after that by definition can't make any decisions based on skb->priority,
as the priority will be the same for each frame in the given parent queue (since
we used skb->priority to assign the parent queue).  You can still use RED or
some other queue discipline to manage queue length, but you have to understand
that priority won't be a packet-differentiating factor there.  If you made this
a separate queue selection variable, you could manipulate priority independently,
but it would be nonsensical, because the intent would still be to send all
packets on that queue at the same priority, since that's what that hardware queue
is configured for.
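
(To illustrate that chain -- priority picks the traffic class, and the class
owns a fixed range of hardware queues -- here is a toy, self-contained C
sketch.  The map and queue layout are made up for the example, not taken from
any driver.)

#include <stdio.h>

#define TC_PRIO_MAX 15

/* Toy model of the DCB selection chain: skb->priority picks a traffic
 * class; the class owns a fixed range of hardware queues. */
struct toy_dev {
        unsigned char prio_tc_map[TC_PRIO_MAX + 1]; /* priority -> tc */
        unsigned int  tc_offset[8], tc_count[8];    /* tc -> queue range */
};

static unsigned int select_queue(const struct toy_dev *dev,
                                 unsigned int priority, unsigned int hash)
{
        unsigned int tc = dev->prio_tc_map[priority & TC_PRIO_MAX];

        /* The hash only spreads within the class's queues; the class
         * itself is fixed by priority, so every frame in the class
         * shares the same priority, as argued above. */
        return dev->tc_offset[tc] + (hash % dev->tc_count[tc]);
}

int main(void)
{
        struct toy_dev dev = {
                .prio_tc_map = { [4] = 1 },          /* priority 4 -> tc 1   */
                .tc_offset   = { 0, 4 },             /* tc 1 owns queues 4-7 */
                .tc_count    = { 4, 4 },
        };
        printf("queue %u\n", select_queue(&dev, 4, 42)); /* 4 + 42%4 = 6 */
        return 0;
}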

In the alternative, if you don't use DCB, then this cgroup becomes somewhat
worthless anyway, because no one has a need to select a specific hardware queue.
If you want to categorize traffic into classes, you can already use the net_cls
cgroup, which works quite well for bandwidth allocation, and that still frees up
the skb->priority assignment to use as you see fit.

In the end, I think it's just plain old more useful to assign priority here than
some new thing.

Neil
 
> > John, Robert, if you're supportive of these changes, some Acks would be
> > appreciated.
> 
> 
> >
> >
> > Regards
> > Neil
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> 
> 
> 
> -- 
> Dave Täht
> SKYPE: davetaht
> US Tel: 1-239-829-5608
> FR Tel: 0638645374
> http://www.bufferbloat.net
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: net: Add network priority cgroup
  2011-11-14 14:43     ` Neil Horman
@ 2011-11-14 16:38       ` John Fastabend
  2011-11-14 16:44         ` David Täht
  2011-11-14 16:43       ` Shyam_Iyer
  1 sibling, 1 reply; 15+ messages in thread
From: John Fastabend @ 2011-11-14 16:38 UTC (permalink / raw)
  To: Neil Horman
  Cc: Dave Taht, netdev@vger.kernel.org, Love, Robert W,
	David S. Miller

On 11/14/2011 6:43 AM, Neil Horman wrote:
> On Mon, Nov 14, 2011 at 01:32:04PM +0100, Dave Taht wrote:
>> On Mon, Nov 14, 2011 at 12:47 PM, Neil Horman <nhorman@tuxdriver.com> wrote:
>>> On Wed, Nov 09, 2011 at 02:57:33PM -0500, Neil Horman wrote:
>>>> Data Center Bridging environments are currently somewhat limited in their
>>>> ability to provide a general mechanism for controlling traffic priority.
>>>> Specifically they are unable to administratively control the priority at which
>>>> various types of network traffic are sent.
>>>>
>>>> Currently, the only ways to set the priority of a network buffer are:
>>>>
>>>> 1) Through the use of the SO_PRIORITY socket option
>>>> 2) By using low level hooks, like a tc action
>>>>
>>>> (1) is difficult from an administrative perspective because it requires that the
>>>> application to be coded to not just assume the default priority is sufficient,
>>>> and must expose an administrative interface to allow priority adjustment.  Such
>>>> a solution is not scalable in a DCB environment
>>>>
>>>> (2) is also difficult, as it requires constant administrative oversight of
>>>> applications so as to build appropriate rules to match traffic belonging to
>>>> various classes, so that priority can be appropriately set. It is further
>>>> limiting when DCB enabled hardware is in use, due to the fact that tc rules are
>>>> only run after a root qdisc has been selected (DCB enabled hardware may reserve
>>>> hw queues for various traffic classes and needs the priority to be set prior to
>>>> selecting the root qdisc)
>>>>
>>>>
>>>> I've discussed various solutions with John Fastabend, and we saw a cgroup as
>>>> being a good general solution to this problem.  The network priority cgroup
>>>> allows for a per-interface priority map to be built per cgroup.  Any traffic
>>>> originating from an application in a cgroup, that does not explicitly set its
>>>> priority with SO_PRIORITY will have its priority assigned to the value
>>>> designated for that group on that interface.  This allows a user space daemon,
>>>> when conducting LLDP negotiation with a DCB enabled peer to create a cgroup
>>>> based on the APP_TLV value received and administratively assign applications to
>>>> that priority using the existing cgroup utility infrastructure.
>>>>
>>>> Tested by John and myself, with good results
>>>>
>>>> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
>>>> CC: John Fastabend <john.r.fastabend@intel.com>
>>>> CC: Robert Love <robert.w.love@intel.com>
>>>> CC: "David S. Miller" <davem@davemloft.net>
>>>> --

Acked-by: John Fastabend <john.r.fastabend@intel.com>

>>>
>>> Bump, any other thoughts here?  Dave T. has some reasonable thoughts regarding
>>> the use of skb->priority, but IMO they really seem orthogonal to the purpose of
>>> this change.  Any other reviews would be welcome.
>>
>> Well, in part I've been playing catchup in the hope that lldp and
>> openlldp and/or this dcb netlink layer that I don't know anything
>> about (pointers please?) could help somehow to resolve the semantic
>> mess skb->priority has become in the first place.
>>
>> I liked what was described here.
>>
>> "What if we did at least carve out the DCB functionality away from
>> skb->priority?  Since, AIUI, we're only concerning ourselves with
>> locally generated traffic here, we're talking
>> about skbs that have a socket attached to them.  We could, instead of indexing
>> the prio_tc_map with skb->priority, we could index it with
>> skb->dev->priomap[skb->sk->prioidx] (as provided by this patch).  The cgroup
>> then could be, instead of a strict priority cgroup, a queue_selector cgroup (or
>> something more appropriately named), and we don't have to touch skb->priority at
>> all.  I'd really rather not start down that road until I got more opinions and
>> consensus on that, but it seems like a pretty good solution, one that would
>> allow hardware queue selection in systems that use things like DCB to co-exist
>> with software queueing features."
>>
> I was initially ok with this, but the more I think about it, the more I think
> its just not needed (see further down in this email for my reasoning).  John,
> Rob, do you have any thoughts here?
> 

I agree the original mechanism seems sufficient.  skb->priority already has
all the qdisc and netfilter infrastructure in place to be used, and using
it to prioritize ("steer") packets at queues seems reasonable to me.  Using
skb->priority this way is not new: pfifo_fast uses it to pick a band,
sch_prio does some similar prioritization, and mqprio is a multiqueue
variant.
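
(To make that concrete: pfifo_fast picks one of three bands from a static
priority-to-band table indexed by skb->priority & TC_PRIO_MAX.  A toy,
self-contained version follows; the table mirrors the one in
net/sched/sch_generic.c as best I recall, so treat the exact values as
illustrative.)

#include <stdio.h>

#define TC_PRIO_MAX 15

/* Toy version of pfifo_fast band selection: band 0 is dequeued first. */
static const unsigned char prio2band[TC_PRIO_MAX + 1] = {
        1, 2, 2, 2, 1, 2, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1
};

int main(void)
{
        unsigned int prio = 6;  /* TC_PRIO_INTERACTIVE */
        printf("priority %u -> band %u\n", prio, prio2band[prio & TC_PRIO_MAX]);
        return 0;
}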

>> The piece that still kind of bothered me about the original proposal
>> (and perhaps this one) was that setting SO_PRIORITY in an app means
>> 'give my packets more mojo'.
>>
>> Taking something that took unprioritized packets and assigned them and
>> *them only* to a hardware queue struck me as possibly deprioritizing
>> the 'more mojo wanted' packets in the app(s), as they would end up in
>> some other, possibly overloaded, hardware queue.
>>
> I don't really see what you mean by this at all.  Taking packets with no
> priority and assigning them a priority doesn't really have an effect on
> pre-prioritized packets.  Or rather it shouldn't.  You can certainly create a
> problem by having apps prioritized according to conflicting semantic rules, but
> that strikes me as administrative error.  Garbage in...Garbage out.
> 
>> So a cgroup that moves all of the packets from an application into a
>> given hardware queue, and then gets scheduled normally according to
>> skb->priority and friends (software queue, default of pfifo_fast,
>> etc), seems to make some sense to me. (I wouldn't mind if we had
>> abstractions for software queues, too, like, I need a software queue
>> with these properties, find me a place for it on the hardware - but
>> I'm dreaming)
>>
>> One open question is where do packets generated from other subsystems
>> end up, if you are using a cgroup for the app? arp, dns, etc?
>>
> The overriding rule is the association of an skb to a socket.  If a transmitted
> frame has skb->sk set in dev_queue_xmit, then we interrogate its priority index
> as set when we passed through the sendmsg code at the top of the stack.
> Otherwise its behavior is unchanged from its current standpoint.
> 

Having a queue selection (skb->queue_mapping?) cgroup also would defeat
any hashing across multiple queues.  With mqprio we can assign many hardware
queues to a single skb->priority.

W.r.t. software queue abstractions, don't we already have this?  mq and
mqprio enumerate a software qdisc per hardware queue.  You can attach
your favorite qdisc to these.  This is likely off-topic for this thread, though.

[...]

> In the end, I think its just plain old more useful to assign priorty here than
> some new thing.

I agree.

Thanks,
John


> 
> Neil
>  
>>> John, Robert, if you're supportive of these changes, some Acks would be
>>> appreciated.
>>
>>
>>>
>>>
>>> Regards
>>> Neil
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
>>
>>
>> -- 
>> Dave Täht
>> SKYPE: davetaht
>> US Tel: 1-239-829-5608
>> FR Tel: 0638645374
>> http://www.bufferbloat.net
>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: net: Add network priority cgroup
  2011-11-14 14:43     ` Neil Horman
  2011-11-14 16:38       ` John Fastabend
@ 2011-11-14 16:43       ` Shyam_Iyer
  2011-11-14 17:24         ` Neil Horman
  1 sibling, 1 reply; 15+ messages in thread
From: Shyam_Iyer @ 2011-11-14 16:43 UTC (permalink / raw)
  To: nhorman, dave.taht; +Cc: netdev, john.r.fastabend, robert.w.love, davem



> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-
> owner@vger.kernel.org] On Behalf Of Neil Horman
> Sent: Monday, November 14, 2011 9:44 AM
> To: Dave Taht
> Cc: netdev@vger.kernel.org; John Fastabend; Robert Love; David S.
> Miller
> Subject: Re: net: Add network priority cgroup
> 
> On Mon, Nov 14, 2011 at 01:32:04PM +0100, Dave Taht wrote:
> > On Mon, Nov 14, 2011 at 12:47 PM, Neil Horman <nhorman@tuxdriver.com>
> wrote:
> > > On Wed, Nov 09, 2011 at 02:57:33PM -0500, Neil Horman wrote:
> > >> Data Center Bridging environments are currently somewhat limited
> in their
> > >> ability to provide a general mechanism for controlling traffic
> priority.
> > >> Specifically they are unable to administratively control the
> priority at which
> > >> various types of network traffic are sent.
> > >>
> > >> Currently, the only ways to set the priority of a network buffer
> are:
> > >>
> > >> 1) Through the use of the SO_PRIORITY socket option
> > >> 2) By using low level hooks, like a tc action
> > >>
> > >> (1) is difficult from an administrative perspective because it
> requires that the
> > >> application to be coded to not just assume the default priority is
> sufficient,
> > >> and must expose an administrative interface to allow priority
> adjustment.  Such
> > >> a solution is not scalable in a DCB environment
> > >>
> > >> (2) is also difficult, as it requires constant administrative
> oversight of
> > >> applications so as to build appropriate rules to match traffic
> belonging to
> > >> various classes, so that priority can be appropriately set. It is
> further
> > >> limiting when DCB enabled hardware is in use, due to the fact that
> tc rules are
> > >> only run after a root qdisc has been selected (DCB enabled
> hardware may reserve
> > >> hw queues for various traffic classes and needs the priority to be
> set prior to
> > >> selecting the root qdisc)
> > >>
> > >>
> > >> I've discussed various solutions with John Fastabend, and we saw a
> cgroup as
> > >> being a good general solution to this problem.  The network
> priority cgroup
> > >> allows for a per-interface priority map to be built per cgroup.
>  Any traffic
> > >> originating from an application in a cgroup, that does not
> explicitly set its
> > >> priority with SO_PRIORITY will have its priority assigned to the
> value
> > >> designated for that group on that interface.  This allows a user
> space daemon,
> > >> when conducting LLDP negotiation with a DCB enabled peer to create
> a cgroup
> > >> based on the APP_TLV value received and administratively assign
> applications to
> > >> that priority using the existing cgroup utility infrastructure.
> > >>
> > >> Tested by John and myself, with good results
> > >>
> > >> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > >> CC: John Fastabend <john.r.fastabend@intel.com>
> > >> CC: Robert Love <robert.w.love@intel.com>
> > >> CC: "David S. Miller" <davem@davemloft.net>
> > >> --
> > >> To unsubscribe from this list: send the line "unsubscribe netdev"
> in
> > >> the body of a message to majordomo@vger.kernel.org
> > >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > >>
> > >
> > > Bump, any other thoughts here?  Dave T. has some reasonable
> thoughts regarding
> > > the use of skb->priority, but IMO they really seem orthogonal to
> the purpose of
> > > this change.  Any other reviews would be welcome.
> >
> > Well, in part I've been playing catchup in the hope that lldp and
> > openlldp and/or this dcb netlink layer that I don't know anything
> > about (pointers please?) could help somehow to resolve the semantic
> > mess skb->priority has become in the first place.
> >
> > I liked what was described here.
> >
> > "What if we did at least carve out the DCB functionality away from
> > skb->priority?  Since, AIUI, we're only concerning ourselves with
> > locally generated traffic here, we're talking
> > about skbs that have a socket attached to them.  We could, instead of
> indexing
> > the prio_tc_map with skb->priority, we could index it with
> > skb->dev->priomap[skb->sk->prioidx] (as provided by this patch).  The
> cgroup
> > then could be, instead of a strict priority cgroup, a queue_selector
> cgroup (or
> > something more appropriately named), and we don't have to touch skb-
> >priority at
> > all.  I'd really rather not start down that road until I got more
> opinions and
> > consensus on that, but it seems like a pretty good solution, one that
> would
> > allow hardware queue selection in systems that use things like DCB to
> co-exist
> > with software queueing features."
> >
> I was initially ok with this, but the more I think about it, the more I
> think
> its just not needed (see further down in this email for my reasoning).
> John,
> Rob, do you have any thoughts here?
> 
> > The piece that still kind of bothered me about the original proposal
> > (and perhaps this one) was that setting SO_PRIORITY in an app means
> > 'give my packets more mojo'.
> >
> > Taking something that took unprioritized packets and assigned them
> and
> > *them only* to a hardware queue struck me as possibly deprioritizing
> > the 'more mojo wanted' packets in the app(s), as they would end up in
> > some other, possibly overloaded, hardware queue.
> >
> I don't really see what you mean by this at all.  Taking packets with
> no
> priority and assigning them a priority doesn't really have an effect on
> pre-prioritized packets.  Or rather it shouldn't.  You can certainly
> create a
> problem by having apps prioritized according to conflicting semantic
> rules, but
> that strikes me as administrative error.  Garbage in...Garbage out.
> 
> > So a cgroup that moves all of the packets from an application into a
> > given hardware queue, and then gets scheduled normally according to
> > skb->priority and friends (software queue, default of pfifo_fast,
> > etc), seems to make some sense to me. (I wouldn't mind if we had
> > abstractions for software queues, too, like, I need a software queue
> > with these properties, find me a place for it on the hardware - but
> > I'm dreaming)
> >
> > One open question is where do packets generated from other subsystems
> > end up, if you are using a cgroup for the app? arp, dns, etc?
> >
> The overriding rule is the association of an skb to a socket.  If a
> transmitted
> frame has skb->sk set in dev_queue_xmit, then we interrogate its
> priority index
> as set when we passed through the sendmsg code at the top of the stack.
> Otherwise its behavior is unchanged from its current standpoint.
> 
> > So to rephrase your original description from this:
> >
> > >> Any traffic originating from an application in a cgroup, that does
> not explicitly set its
> > >> priority with SO_PRIORITY will have its priority assigned to the
> value
> > >> designated for that group on that interface.  This allows a user
> space daemon,
> > >> when conducting LLDP negotiation with a DCB enabled peer to create
> a cgroup
> > >> based on the APP_TLV value received and administratively assign
> applications to
> > >> that priority using the existing cgroup utility infrastructure.
> > > John, Robert, if you're supportive of these changes, some Acks
> would be
> > > appreciated.
> >
> > To this:
> >
> > "Any traffic originating from an application in a cgroup,  will have
> > its hardware queue  assigned to the value designated for that group
> on
> > that interface.  This allows a user space daemon, when conducting
> LLDP
> > negotiation with a DCB enabled peer to create a cgroup based on the
> > APP_TLV value received and administratively assign applications to
> > that hardware queue using the existing cgroup utility
> infrastructure."
> >
> As above, I'm split brained about this.  I'm ok with the idea of making
> this a
> queue selection cgroup, and separating it from priority, but at the
> same time,
> in the context of DCB, we really are assigning priority here, so it
> seems a bit
> false to do something that is not priority.  I also like the fact that
> it
> provides administrative control in a way that netfilter and tc don't
> really
> enable.
> 
> > Assuming we're on the same page here, what the heck is APP_TLV?
> >
> LLDP does layer 2 discovery with peer networking devices. It does so
> using sets
> of Type/length/value tuples.  The types carry various bits of
> information, such
> as which priority groups are available on the network.  The APP tlv
> conveys
> application or feature specific information.  for instance, There is an
> ISCSI
> app tlv that tells the host that  "on the interface you received this
> tlv, iscsi
> traffic must be sent at priority X".  The idea being that, on receipt
> of this
> tlv, the DCB daemon can create an ISCSI network priority cgroup
> instance, and
> augment the cgroup rules file such that, when the user space iscsi
> daemon is
> started, its traffic automatically transmits at the appropriate
> priority.

Love this!

I guess if this is integrated into libvirt via libcgroups, VMs could be assigned
a network priority..

http://linuxplumbersconf.org/2010/ocw/proposals/843

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: net: Add network priority cgroup
  2011-11-14 16:38       ` John Fastabend
@ 2011-11-14 16:44         ` David Täht
  0 siblings, 0 replies; 15+ messages in thread
From: David Täht @ 2011-11-14 16:44 UTC (permalink / raw)
  To: John Fastabend
  Cc: Neil Horman, netdev@vger.kernel.org, Love, Robert W,
	David S. Miller

[-- Attachment #1: Type: text/plain, Size: 304 bytes --]

On 11/14/2011 05:38 PM, John Fastabend wrote:
> Acked-by: John Fastabend <john.r.fastabend@intel.com> 

Alright. Do as you will.

It would comfort me to know whether 802.1Qau (QCN) is going to be enabled by
default.

http://blog.ioshints.info/2010/11/data-center-bridging-dcb-congestion.html

-- 
Dave Täht



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: net: Add network priority cgroup
  2011-11-14 16:43       ` Shyam_Iyer
@ 2011-11-14 17:24         ` Neil Horman
  2011-11-14 20:02           ` Shyam_Iyer
  0 siblings, 1 reply; 15+ messages in thread
From: Neil Horman @ 2011-11-14 17:24 UTC (permalink / raw)
  To: Shyam_Iyer; +Cc: dave.taht, netdev, john.r.fastabend, robert.w.love, davem

On Mon, Nov 14, 2011 at 10:13:37PM +0530, Shyam_Iyer@Dell.com wrote:
> 
> 
> > -----Original Message-----
> > From: netdev-owner@vger.kernel.org [mailto:netdev-
> > owner@vger.kernel.org] On Behalf Of Neil Horman
> > Sent: Monday, November 14, 2011 9:44 AM
> > To: Dave Taht
> > Cc: netdev@vger.kernel.org; John Fastabend; Robert Love; David S.
> > Miller
> > Subject: Re: net: Add network priority cgroup
> > 
> > On Mon, Nov 14, 2011 at 01:32:04PM +0100, Dave Taht wrote:
> > > On Mon, Nov 14, 2011 at 12:47 PM, Neil Horman <nhorman@tuxdriver.com>
> > wrote:
> > > > On Wed, Nov 09, 2011 at 02:57:33PM -0500, Neil Horman wrote:
> > > >> Data Center Bridging environments are currently somewhat limited
> > in their
> > > >> ability to provide a general mechanism for controlling traffic
> > priority.
> > > >> Specifically they are unable to administratively control the
> > priority at which
> > > >> various types of network traffic are sent.
> > > >>
> > > >> Currently, the only ways to set the priority of a network buffer
> > are:
> > > >>
> > > >> 1) Through the use of the SO_PRIORITY socket option
> > > >> 2) By using low level hooks, like a tc action
> > > >>
> > > >> (1) is difficult from an administrative perspective because it
> > requires that the
> > > >> application to be coded to not just assume the default priority is
> > sufficient,
> > > >> and must expose an administrative interface to allow priority
> > adjustment.  Such
> > > >> a solution is not scalable in a DCB environment
> > > >>
> > > >> (2) is also difficult, as it requires constant administrative
> > oversight of
> > > >> applications so as to build appropriate rules to match traffic
> > belonging to
> > > >> various classes, so that priority can be appropriately set. It is
> > further
> > > >> limiting when DCB enabled hardware is in use, due to the fact that
> > tc rules are
> > > >> only run after a root qdisc has been selected (DCB enabled
> > hardware may reserve
> > > >> hw queues for various traffic classes and needs the priority to be
> > set prior to
> > > >> selecting the root qdisc)
> > > >>
> > > >>
> > > >> I've discussed various solutions with John Fastabend, and we saw a
> > cgroup as
> > > >> being a good general solution to this problem.  The network
> > priority cgroup
> > > >> allows for a per-interface priority map to be built per cgroup.
> >  Any traffic
> > > >> originating from an application in a cgroup, that does not
> > explicitly set its
> > > >> priority with SO_PRIORITY will have its priority assigned to the
> > value
> > > >> designated for that group on that interface.  This allows a user
> > space daemon,
> > > >> when conducting LLDP negotiation with a DCB enabled peer to create
> > a cgroup
> > > >> based on the APP_TLV value received and administratively assign
> > applications to
> > > >> that priority using the existing cgroup utility infrastructure.
> > > >>
> > > >> Tested by John and myself, with good results
> > > >>
> > > >> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > > >> CC: John Fastabend <john.r.fastabend@intel.com>
> > > >> CC: Robert Love <robert.w.love@intel.com>
> > > >> CC: "David S. Miller" <davem@davemloft.net>
> > > >> --
> > > >> To unsubscribe from this list: send the line "unsubscribe netdev"
> > in
> > > >> the body of a message to majordomo@vger.kernel.org
> > > >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > >>
> > > >
> > > > Bump, any other thoughts here?  Dave T. has some reasonable
> > thoughts regarding
> > > > the use of skb->priority, but IMO they really seem orthogonal to
> > the purpose of
> > > > this change.  Any other reviews would be welcome.
> > >
> > > Well, in part I've been playing catchup in the hope that lldp and
> > > openlldp and/or this dcb netlink layer that I don't know anything
> > > about (pointers please?) could help somehow to resolve the semantic
> > > mess skb->priority has become in the first place.
> > >
> > > I liked what was described here.
> > >
> > > "What if we did at least carve out the DCB functionality away from
> > > skb->priority?  Since, AIUI, we're only concerning ourselves with
> > > locally generated traffic here, we're talking
> > > about skbs that have a socket attached to them.  We could, instead of
> > indexing
> > > the prio_tc_map with skb->priority, we could index it with
> > > skb->dev->priomap[skb->sk->prioidx] (as provided by this patch).  The
> > cgroup
> > > then could be, instead of a strict priority cgroup, a queue_selector
> > cgroup (or
> > > something more appropriately named), and we don't have to touch skb-
> > >priority at
> > > all.  I'd really rather not start down that road until I got more
> > opinions and
> > > consensus on that, but it seems like a pretty good solution, one that
> > would
> > > allow hardware queue selection in systems that use things like DCB to
> > co-exist
> > > with software queueing features."
> > >
> > I was initially ok with this, but the more I think about it, the more I
> > think
> > its just not needed (see further down in this email for my reasoning).
> > John,
> > Rob, do you have any thoughts here?
> > 
> > > The piece that still kind of bothered me about the original proposal
> > > (and perhaps this one) was that setting SO_PRIORITY in an app means
> > > 'give my packets more mojo'.
> > >
> > > Taking something that took unprioritized packets and assigned them
> > and
> > > *them only* to a hardware queue struck me as possibly deprioritizing
> > > the 'more mojo wanted' packets in the app(s), as they would end up in
> > > some other, possibly overloaded, hardware queue.
> > >
> > I don't really see what you mean by this at all.  Taking packets with
> > no
> > priority and assigning them a priority doesn't really have an effect on
> > pre-prioritized packets.  Or rather it shouldn't.  You can certainly
> > create a
> > problem by having apps prioritized according to conflicting semantic
> > rules, but
> > that strikes me as administrative error.  Garbage in...Garbage out.
> > 
> > > So a cgroup that moves all of the packets from an application into a
> > > given hardware queue, and then gets scheduled normally according to
> > > skb->priority and friends (software queue, default of pfifo_fast,
> > > etc), seems to make some sense to me. (I wouldn't mind if we had
> > > abstractions for software queues, too, like, I need a software queue
> > > with these properties, find me a place for it on the hardware - but
> > > I'm dreaming)
> > >
> > > One open question is where do packets generated from other subsystems
> > > end up, if you are using a cgroup for the app? arp, dns, etc?
> > >
> > The overriding rule is the association of an skb to a socket.  If a
> > transmitted
> > frame has skb->sk set in dev_queue_xmit, then we interrogate its
> > priority index
> > as set when we passed through the sendmsg code at the top of the stack.
> > Otherwise its behavior is unchanged from its current standpoint.
> > 
> > > So to rephrase your original description from this:
> > >
> > > >> Any traffic originating from an application in a cgroup, that does
> > not explicitly set its
> > > >> priority with SO_PRIORITY will have its priority assigned to the
> > value
> > > >> designated for that group on that interface.  This allows a user
> > space daemon,
> > > >> when conducting LLDP negotiation with a DCB enabled peer to create
> > a cgroup
> > > >> based on the APP_TLV value received and administratively assign
> > applications to
> > > >> that priority using the existing cgroup utility infrastructure.
> > > > John, Robert, if you're supportive of these changes, some Acks
> > would be
> > > > appreciated.
> > >
> > > To this:
> > >
> > > "Any traffic originating from an application in a cgroup,  will have
> > > its hardware queue  assigned to the value designated for that group
> > on
> > > that interface.  This allows a user space daemon, when conducting
> > LLDP
> > > negotiation with a DCB enabled peer to create a cgroup based on the
> > > APP_TLV value received and administratively assign applications to
> > > that hardware queue using the existing cgroup utility
> > infrastructure."
> > >
> > As above, I'm split brained about this.  I'm ok with the idea of making
> > this a
> > queue selection cgroup, and separating it from priority, but at the
> > same time,
> > in the context of DCB, we really are assigning priority here, so it
> > seems a bit
> > false to do something that is not priority.  I also like the fact that
> > it
> > provides administrative control in a way that netfilter and tc don't
> > really
> > enable.
> > 
> > > Assuming we're on the same page here, what the heck is APP_TLV?
> > >
> > LLDP does layer 2 discovery with peer networking devices. It does so
> > using sets
> > of Type/length/value tuples.  The types carry various bits of
> > information, such
> > as which priority groups are available on the network.  The APP tlv
> > conveys
> > application or feature specific information.  for instance, There is an
> > ISCSI
> > app tlv that tells the host that  "on the interface you received this
> > tlv, iscsi
> > traffic must be sent at priority X".  The idea being that, on receipt
> > of this
> > tlv, the DCB daemon can create an ISCSI network priority cgroup
> > instance, and
> > augment the cgroup rules file such that, when the user space iscsi
> > daemon is
> > started, its traffic automatically transmits at the appropriate
> > priority.
> 
> Love this !
> 
> I guess if this is integrated to libvirt via libcgroups VMs could be assigned a network priority..
> 
As the patch stands currently, absolutely.  Just drop a qemu process into the
appropriate cgroup, and (assuming you're using a tun/tap type device) all the
traffic from that VM will get assigned the corresponding priority.  You can do
the same thing with classification using net_cls already.  I did a video of it
here:
http://www.youtube.com/watch?v=KX5QV4LId_c
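
(Roughly, the daemon/libvirt side reduces to two writes.  A hedged sketch
follows, assuming the control-file names described in this patch and a
conventional cgroup mount point; the pid and priority values below are
placeholders.)

#include <stdio.h>

/* Hedged sketch of "drop a qemu process into the appropriate cgroup":
 * write the pid into the group's tasks file, then give the group a
 * priority on the VM's uplink.  Paths assume this patch's cgroup layout
 * and a conventional mount point; pid 4242 and priority 4 are made up. */
int main(void)
{
        const char *grp = "/sys/fs/cgroup/net_prio/vm1";
        char path[256];
        FILE *f;

        snprintf(path, sizeof(path), "%s/tasks", grp);
        f = fopen(path, "w");
        if (f) { fprintf(f, "%d\n", 4242); fclose(f); }

        snprintf(path, sizeof(path), "%s/priomap", grp);
        f = fopen(path, "w");
        if (f) { fprintf(f, "eth0 4\n"); fclose(f); }
        return 0;
}
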
Neil

> http://linuxplumbersconf.org/2010/ocw/proposals/843
> 
> 
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: net: Add network priority cgroup
  2011-11-14 17:24         ` Neil Horman
@ 2011-11-14 20:02           ` Shyam_Iyer
  2011-11-14 20:56             ` Neil Horman
  0 siblings, 1 reply; 15+ messages in thread
From: Shyam_Iyer @ 2011-11-14 20:02 UTC (permalink / raw)
  To: nhorman; +Cc: dave.taht, netdev, john.r.fastabend, robert.w.love, davem

> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-
> owner@vger.kernel.org] On Behalf Of Neil Horman
> Sent: Monday, November 14, 2011 12:24 PM
> Subject: Re: net: Add network priority cgroup
> > >
> > > On Mon, Nov 14, 2011 at 01:32:04PM +0100, Dave Taht wrote:
> > > > On Mon, Nov 14, 2011 at 12:47 PM, Neil Horman
> <nhorman@tuxdriver.com>
> > > wrote:
> > > > > On Wed, Nov 09, 2011 at 02:57:33PM -0500, Neil Horman wrote:
> > > > >> Data Center Bridging environments are currently somewhat
> limited
> > > in their
> > > > >> ability to provide a general mechanism for controlling traffic
> > > priority.
> > > > >> Specifically they are unable to administratively control the
> > > priority at which
> > > > >> various types of network traffic are sent.
> > > > >>
> > > > >> Currently, the only ways to set the priority of a network
> buffer
> > > are:
> > > > >>
> > > > >> 1) Through the use of the SO_PRIORITY socket option
> > > > >> 2) By using low level hooks, like a tc action
> > > > >>
> > > > >> (1) is difficult from an administrative perspective because it
> > > requires that the
> > > > >> application to be coded to not just assume the default
> priority is
> > > sufficient,
> > > > >> and must expose an administrative interface to allow priority
> > > adjustment.  Such
> > > > >> a solution is not scalable in a DCB environment
> > > > >>
> > > > >> (2) is also difficult, as it requires constant administrative
> > > oversight of
> > > > >> applications so as to build appropriate rules to match traffic
> > > belonging to
> > > > >> various classes, so that priority can be appropriately set. It
> is
> > > further
> > > > >> limiting when DCB enabled hardware is in use, due to the fact
> that
> > > tc rules are
> > > > >> only run after a root qdisc has been selected (DCB enabled
> > > hardware may reserve
> > > > >> hw queues for various traffic classes and needs the priority
> to be
> > > set prior to
> > > > >> selecting the root qdisc)
> > > > >>
> > > > >>
> > > > >> I've discussed various solutions with John Fastabend, and we
> saw a
> > > cgroup as
> > > > >> being a good general solution to this problem.  The network
> > > priority cgroup
> > > > >> allows for a per-interface priority map to be built per
> cgroup.
> > >  Any traffic
> > > > >> originating from an application in a cgroup, that does not
> > > explicitly set its
> > > > >> priority with SO_PRIORITY will have its priority assigned to
> the
> > > value
> > > > >> designated for that group on that interface.  This allows a
> user
> > > space daemon,
> > > > >> when conducting LLDP negotiation with a DCB enabled peer to
> create
> > > a cgroup
> > > > >> based on the APP_TLV value received and administratively
> assign
> > > applications to
> > > > >> that priority using the existing cgroup utility
> infrastructure.
> > > > >>
> > > > >> Tested by John and myself, with good results
> > > > >>
> > > > >> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > > > >> CC: John Fastabend <john.r.fastabend@intel.com>
> > > > >> CC: Robert Love <robert.w.love@intel.com>
> > > > >> CC: "David S. Miller" <davem@davemloft.net>
> > > > >> --
> > > > >> To unsubscribe from this list: send the line "unsubscribe
> netdev"
> > > in
> > > > >> the body of a message to majordomo@vger.kernel.org
> > > > >> More majordomo info at  http://vger.kernel.org/majordomo-
> info.html
> > > > >>
> > > > >
> > > > > Bump, any other thoughts here?  Dave T. has some reasonable
> > > thoughts regarding
> > > > > the use of skb->priority, but IMO they really seem orthogonal
> to
> > > the purpose of
> > > > > this change.  Any other reviews would be welcome.
> > > >
> > > > Well, in part I've been playing catchup in the hope that lldp and
> > > > openlldp and/or this dcb netlink layer that I don't know anything
> > > > about (pointers please?) could help somehow to resolve the
> semantic
> > > > mess skb->priority has become in the first place.
> > > >
> > > > I liked what was described here.
> > > >
> > > > "What if we did at least carve out the DCB functionality away
> from
> > > > skb->priority?  Since, AIUI, we're only concerning ourselves with
> > > > locally generated traffic here, we're talking
> > > > about skbs that have a socket attached to them.  We could,
> instead of
> > > indexing
> > > > the prio_tc_map with skb->priority, we could index it with
> > > > skb->dev->priomap[skb->sk->prioidx] (as provided by this patch).
> The
> > > cgroup
> > > > then could be, instead of a strict priority cgroup, a
> queue_selector
> > > cgroup (or
> > > > something more appropriately named), and we don't have to touch
> skb-
> > > >priority at
> > > > all.  I'd really rather not start down that road until I got more
> > > opinions and
> > > > consensus on that, but it seems like a pretty good solution, one
> that
> > > would
> > > > allow hardware queue selection in systems that use things like
> DCB to
> > > co-exist
> > > > with software queueing features."
> > > >
> > > I was initially ok with this, but the more I think about it, the
> more I
> > > think
> > > its just not needed (see further down in this email for my
> reasoning).
> > > John,
> > > Rob, do you have any thoughts here?
> > >
> > > > The piece that still kind of bothered me about the original
> proposal
> > > > (and perhaps this one) was that setting SO_PRIORITY in an app
> means
> > > > 'give my packets more mojo'.
> > > >
> > > > Taking something that took unprioritized packets and assigned
> them
> > > and
> > > > *them only* to a hardware queue struck me as possibly
> deprioritizing
> > > > the 'more mojo wanted' packets in the app(s), as they would end
> up in
> > > > some other, possibly overloaded, hardware queue.
> > > >
> > > I don't really see what you mean by this at all.  Taking packets
> with
> > > no
> > > priority and assigning them a priority doesn't really have an
> effect on
> > > pre-prioritized packets.  Or rather it shouldn't.  You can
> certainly
> > > create a
> > > problem by having apps prioritized according to conflicting
> semantic
> > > rules, but
> > > that strikes me as administrative error.  Garbage in...Garbage out.
> > >
> > > > So a cgroup that moves all of the packets from an application
> into a
> > > > given hardware queue, and then gets scheduled normally according
> to
> > > > skb->priority and friends (software queue, default of pfifo_fast,
> > > > etc), seems to make some sense to me. (I wouldn't mind if we had
> > > > abstractions for software queues, too, like, I need a software
> queue
> > > > with these properties, find me a place for it on the hardware -
> but
> > > > I'm dreaming)
> > > >
> > > > One open question is where do packets generated from other
> subsystems
> > > > end up, if you are using a cgroup for the app? arp, dns, etc?
> > > >
> > > The overriding rule is the association of an skb to a socket.  If a
> > > transmitted
> > > frame has skb->sk set in dev_queue_xmit, then we interrogate its
> > > priority index
> > > as set when we passed through the sendmsg code at the top of the
> stack.
> > > Otherwise its behavior is unchanged from its current standpoint.
> > >
> > > > So to rephrase your original description from this:
> > > >
> > > > >> Any traffic originating from an application in a cgroup, that
> does
> > > not explicitly set its
> > > > >> priority with SO_PRIORITY will have its priority assigned to
> the
> > > value
> > > > >> designated for that group on that interface.  This allows a
> user
> > > space daemon,
> > > > >> when conducting LLDP negotiation with a DCB enabled peer to
> create
> > > a cgroup
> > > > >> based on the APP_TLV value received and administratively
> assign
> > > applications to
> > > > >> that priority using the existing cgroup utility
> infrastructure.
> > > > > John, Robert, if you're supportive of these changes, some Acks
> > > would be
> > > > > appreciated.
> > > >
> > > > To this:
> > > >
> > > > "Any traffic originating from an application in a cgroup,  will
> have
> > > > its hardware queue  assigned to the value designated for that
> group
> > > on
> > > > that interface.  This allows a user space daemon, when conducting
> > > LLDP
> > > > negotiation with a DCB enabled peer to create a cgroup based on
> the
> > > > APP_TLV value received and administratively assign applications
> to
> > > > that hardware queue using the existing cgroup utility
> > > infrastructure."
> > > >
> > > As above, I'm split brained about this.  I'm ok with the idea of
> making
> > > this a
> > > queue selection cgroup, and separating it from priority, but at the
> > > same time,
> > > in the context of DCB, we really are assigning priority here, so it
> > > seems a bit
> > > false to do something that is not priority.  I also like the fact
> that
> > > it
> > > provides administrative control in a way that netfilter and tc
> don't
> > > really
> > > enable.
> > >
> > > > Assuming we're on the same page here, what the heck is APP_TLV?
> > > >
> > > LLDP does layer 2 discovery with peer networking devices. It does
> so
> > > using sets
> > > of Type/length/value tuples.  The types carry various bits of
> > > information, such
> > > as which priority groups are available on the network.  The APP tlv
> > > conveys
> > > application or feature specific information.  for instance, There
> is an
> > > ISCSI
> > > app tlv that tells the host that  "on the interface you received
> this
> > > tlv, iscsi
> > > traffic must be sent at priority X".  The idea being that, on
> receipt
> > > of this
> > > tlv, the DCB daemon can create an ISCSI network priority cgroup
> > > instance, and
> > > augment the cgroup rules file such that, when the user space iscsi
> > > daemon is
> > > started, its traffic automatically transmits at the appropriate
> > > priority.
> >
> > Love this !
> >
> > I guess if this is integrated to libvirt via libcgroups VMs could be
> assigned a network priority..
> >
> As the patch stand currently, absolutely.  Just drop a qemu process
> into the
> approriate cgroup, and (assuming you're using a tun/tap type device),
> all the
> traffic from that vm will get assigned the corresponding priority.  You
> can do
> the same thing with classification using net_cls already.  I did a
> video of it
> here:
> http://www.youtube.com/watch?v=KX5QV4LId_c

Yes, but I guess the present implementation allows the iSCSI daemon to be
grouped into an iSCSI network priority cgroup.  This is nicer, since we then
don't have to deal with a network cgroup plus a disk cgroup for the iSCSI use
case.

Now, since this rides on skb->priority, I suspect this could become a
per-session priority cgroup.

Is that the case, or does iscsid form just one cgroup?

> Neil
> > http://linuxplumbersconf.org/2010/ocw/proposals/843
> >
> >
> >
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: net: Add network priority cgroup
  2011-11-14 20:02           ` Shyam_Iyer
@ 2011-11-14 20:56             ` Neil Horman
  0 siblings, 0 replies; 15+ messages in thread
From: Neil Horman @ 2011-11-14 20:56 UTC (permalink / raw)
  To: Shyam_Iyer; +Cc: dave.taht, netdev, john.r.fastabend, robert.w.love, davem

On Mon, Nov 14, 2011 at 12:02:59PM -0800, Shyam_Iyer@Dell.com wrote:
> > -----Original Message-----
> > From: netdev-owner@vger.kernel.org [mailto:netdev-
> > owner@vger.kernel.org] On Behalf Of Neil Horman
> > Sent: Monday, November 14, 2011 12:24 PM
> > Subject: Re: net: Add network priority cgroup
> > > >
> > > > On Mon, Nov 14, 2011 at 01:32:04PM +0100, Dave Taht wrote:
> > > > > On Mon, Nov 14, 2011 at 12:47 PM, Neil Horman
> > <nhorman@tuxdriver.com>
> > > > wrote:
> > > > > > On Wed, Nov 09, 2011 at 02:57:33PM -0500, Neil Horman wrote:
> > > > > >> Data Center Bridging environments are currently somewhat
> > limited
> > > > in their
> > > > > >> ability to provide a general mechanism for controlling traffic
> > > > priority.
> > > > > >> Specifically they are unable to administratively control the
> > > > priority at which
> > > > > >> various types of network traffic are sent.
> > > > > >>
> > > > > >> Currently, the only ways to set the priority of a network
> > buffer
> > > > are:
> > > > > >>
> > > > > >> 1) Through the use of the SO_PRIORITY socket option
> > > > > >> 2) By using low level hooks, like a tc action
> > > > > >>
> > > > > >> (1) is difficult from an administrative perspective because it
> > > > requires that the
> > > > > >> application to be coded to not just assume the default
> > priority is
> > > > sufficient,
> > > > > >> and must expose an administrative interface to allow priority
> > > > adjustment.  Such
> > > > > >> a solution is not scalable in a DCB environment
> > > > > >>
> > > > > >> (2) is also difficult, as it requires constant administrative
> > > > oversight of
> > > > > >> applications so as to build appropriate rules to match traffic
> > > > belonging to
> > > > > >> various classes, so that priority can be appropriately set. It
> > is
> > > > further
> > > > > >> limiting when DCB enabled hardware is in use, due to the fact
> > that
> > > > tc rules are
> > > > > >> only run after a root qdisc has been selected (DCB enabled
> > > > hardware may reserve
> > > > > >> hw queues for various traffic classes and needs the priority
> > to be
> > > > set prior to
> > > > > >> selecting the root qdisc)
> > > > > >>
> > > > > >>
> > > > > >> I've discussed various solutions with John Fastabend, and we
> > saw a
> > > > cgroup as
> > > > > >> being a good general solution to this problem.  The network
> > > > priority cgroup
> > > > > >> allows for a per-interface priority map to be built per
> > cgroup.
> > > >  Any traffic
> > > > > >> originating from an application in a cgroup, that does not
> > > > explicitly set its
> > > > > >> priority with SO_PRIORITY will have its priority assigned to
> > the
> > > > value
> > > > > >> designated for that group on that interface.  This allows a
> > user
> > > > space daemon,
> > > > > >> when conducting LLDP negotiation with a DCB enabled peer to
> > create
> > > > a cgroup
> > > > > >> based on the APP_TLV value received and administratively
> > assign
> > > > applications to
> > > > > >> that priority using the existing cgroup utility
> > infrastructure.
> > > > > >>
> > > > > >> Tested by John and myself, with good results
> > > > > >>
> > > > > >> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > > > > >> CC: John Fastabend <john.r.fastabend@intel.com>
> > > > > >> CC: Robert Love <robert.w.love@intel.com>
> > > > > >> CC: "David S. Miller" <davem@davemloft.net>
> > > > > >> --
> > > > > >> To unsubscribe from this list: send the line "unsubscribe
> > netdev"
> > > > in
> > > > > >> the body of a message to majordomo@vger.kernel.org
> > > > > >> More majordomo info at  http://vger.kernel.org/majordomo-
> > info.html
> > > > > >>
> > > > > >
> > > > > > Bump, any other thoughts here?  Dave T. has some reasonable
> > > > thoughts regarding
> > > > > > the use of skb->priority, but IMO they really seem orthogonal
> > to
> > > > the purpose of
> > > > > > this change.  Any other reviews would be welcome.
> > > > >
> > > > > Well, in part I've been playing catchup in the hope that lldp and
> > > > > openlldp and/or this dcb netlink layer that I don't know anything
> > > > > about (pointers please?) could help somehow to resolve the
> > semantic
> > > > > mess skb->priority has become in the first place.
> > > > >
> > > > > I liked what was described here.
> > > > >
> > > > > "What if we did at least carve out the DCB functionality away
> > from
> > > > > skb->priority?  Since, AIUI, we're only concerning ourselves with
> > > > > locally generated traffic here, we're talking
> > > > > about skbs that have a socket attached to them.  We could,
> > instead of
> > > > indexing
> > > > > the prio_tc_map with skb->priority, we could index it with
> > > > > skb->dev->priomap[skb->sk->prioidx] (as provided by this patch).
> > The
> > > > cgroup
> > > > > then could be, instead of a strict priority cgroup, a
> > queue_selector
> > > > cgroup (or
> > > > > something more appropriately named), and we don't have to touch
> > skb-
> > > > >priority at
> > > > > all.  I'd really rather not start down that road until I got more
> > > > opinions and
> > > > > consensus on that, but it seems like a pretty good solution, one
> > that
> > > > would
> > > > > allow hardware queue selection in systems that use things like
> > DCB to
> > > > co-exist
> > > > > with software queueing features."
> > > > >
> > > > I was initially ok with this, but the more I think about it, the
> > more I
> > > > think
> > > > its just not needed (see further down in this email for my
> > reasoning).
> > > > John,
> > > > Rob, do you have any thoughts here?
> > > >
> > > > > The piece that still kind of bothered me about the original
> > proposal
> > > > > (and perhaps this one) was that setting SO_PRIORITY in an app
> > means
> > > > > 'give my packets more mojo'.
> > > > >
> > > > > Taking something that took unprioritized packets and assigned
> > them
> > > > and
> > > > > *them only* to a hardware queue struck me as possibly
> > deprioritizing
> > > > > the 'more mojo wanted' packets in the app(s), as they would end
> > up in
> > > > > some other, possibly overloaded, hardware queue.
> > > > >
> > > > I don't really see what you mean by this at all.  Taking packets
> > with
> > > > no
> > > > priority and assigning them a priority doesn't really have an
> > effect on
> > > > pre-prioritized packets.  Or rather it shouldn't.  You can
> > certainly
> > > > create a
> > > > problem by having apps prioritized according to conflicting
> > semantic
> > > > rules, but
> > > > that strikes me as administrative error.  Garbage in...Garbage out.
> > > >
> > > > > So a cgroup that moves all of the packets from an application
> > into a
> > > > > given hardware queue, and then gets scheduled normally according
> > to
> > > > > skb->priority and friends (software queue, default of pfifo_fast,
> > > > > etc), seems to make some sense to me. (I wouldn't mind if we had
> > > > > abstractions for software queues, too, like, I need a software
> > queue
> > > > > with these properties, find me a place for it on the hardware -
> > but
> > > > > I'm dreaming)
> > > > >
> > > > > One open question is where do packets generated from other
> > subsystems
> > > > > end up, if you are using a cgroup for the app? arp, dns, etc?
> > > > >
> > > > The overriding rule is the association of an skb to a socket.  If a
> > > > transmitted
> > > > frame has skb->sk set in dev_queue_xmit, then we interrogate its
> > > > priority index
> > > > as set when we passed through the sendmsg code at the top of the
> > stack.
> > > > Otherwise its behavior is unchanged from its current standpoint.
> > > >
> > > > > So to rephrase your original description from this:
> > > > >
> > > > > >> Any traffic originating from an application in a cgroup, that
> > > > > >> does not explicitly set its priority with SO_PRIORITY will
> > > > > >> have its priority assigned to the value designated for that
> > > > > >> group on that interface.  This allows a user space daemon,
> > > > > >> when conducting LLDP negotiation with a DCB enabled peer to
> > > > > >> create a cgroup based on the APP_TLV value received and
> > > > > >> administratively assign applications to that priority using
> > > > > >> the existing cgroup utility infrastructure.
> > > > > > John, Robert, if you're supportive of these changes, some Acks
> > > > > > would be appreciated.
> > > > >
> > > > > To this:
> > > > >
> > > > > "Any traffic originating from an application in a cgroup will
> > > > > have its hardware queue assigned to the value designated for
> > > > > that group on that interface.  This allows a user space daemon,
> > > > > when conducting LLDP negotiation with a DCB enabled peer to
> > > > > create a cgroup based on the APP_TLV value received and
> > > > > administratively assign applications to that hardware queue
> > > > > using the existing cgroup utility infrastructure."
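
For reference, the explicit override mentioned in the first version is
the standard SO_PRIORITY socket option; a socket that sets it keeps its
own priority regardless of the cgroup map:

#include <sys/socket.h>

/* Pin this socket's packets at priority 6; the cgroup default no
 * longer applies to traffic sent on fd. */
static int set_explicit_prio(int fd)
{
	int prio = 6;

	return setsockopt(fd, SOL_SOCKET, SO_PRIORITY, &prio,
			  sizeof(prio));
}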
> > > > >
> > > > As above, I'm split-brained about this.  I'm OK with the idea of
> > > > making this a queue selection cgroup and separating it from
> > > > priority, but at the same time, in the context of DCB, we really
> > > > are assigning priority here, so it seems a bit false to do
> > > > something that is not priority.  I also like the fact that it
> > > > provides administrative control in a way that netfilter and tc
> > > > don't really enable.
> > > >
> > > > > Assuming we're on the same page here, what the heck is APP_TLV?
> > > > >
> > > > LLDP does layer 2 discovery with peer networking devices.  It
> > > > does so using sets of type/length/value (TLV) tuples.  The types
> > > > carry various bits of information, such as which priority groups
> > > > are available on the network.  The APP TLV conveys application-
> > > > or feature-specific information.  For instance, there is an iSCSI
> > > > APP TLV that tells the host "on the interface you received this
> > > > TLV, iSCSI traffic must be sent at priority X".  The idea being
> > > > that, on receipt of this TLV, the DCB daemon can create an iSCSI
> > > > network priority cgroup instance and augment the cgroup rules
> > > > file such that, when the user space iSCSI daemon is started, its
> > > > traffic automatically transmits at the appropriate priority.
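
For concreteness, the daemon's reaction to such a TLV might look
roughly like this (the mount point, group name, and "ifname prio"
priomap format here are assumptions based on this RFC, not verified
code):

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical: create an iscsi net_prio group and map traffic from
 * that group on ifname to the priority carried in the APP TLV. */
static int apply_iscsi_app_tlv(const char *ifname, int prio)
{
	FILE *f;

	mkdir("/sys/fs/cgroup/net_prio/iscsi", 0755);
	f = fopen("/sys/fs/cgroup/net_prio/iscsi/priomap", "w");
	if (!f)
		return -1;
	fprintf(f, "%s %d\n", ifname, prio);
	return fclose(f);
}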
> > >
> > > Love this!
> > >
> > > I guess if this is integrated into libvirt via libcgroup, VMs could
> > > be assigned a network priority.
> > >
> > As the patch stands currently, absolutely.  Just drop a qemu process
> > into the appropriate cgroup and (assuming you're using a tun/tap type
> > device) all the traffic from that VM will get assigned the
> > corresponding priority.  You can do the same thing with
> > classification using net_cls already.  I did a video of it here:
> > http://www.youtube.com/watch?v=KX5QV4LId_c
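
Dropping a running process into a group is just a write of its PID to
the group's tasks file; roughly (the net_prio mount path and group name
are illustrative):

#include <stdio.h>
#include <sys/types.h>

/* Hypothetical: attach an already-running qemu PID to a net_prio
 * group; its tun/tap traffic then inherits that group's priority. */
static int attach_pid(const char *group, pid_t pid)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/fs/cgroup/net_prio/%s/tasks",
		 group);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%d\n", (int)pid);
	return fclose(f);
}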
> 
> Yes, but I guess the present implementation allows the iSCSI daemon to
> be grouped in an iSCSI network priority cgroup.  This is nicer, since
> then we don't have to deal with a network cgroup + disk cgroup for the
> iSCSI use case.
> 
That's my understanding, yes.

> Now, since this is on skb->priority, I suspect this could become a
> per-session priority cgroup.
> 
Correct.

> Is that the case, or does iscsid form just one cgroup?
> 
The cgroups are actually formed by administrative tools like
dcbx/lldpad/etc.  Placement into the appropriate cgroups is then handled
by suitable configuration of cgrules.conf (which may be done
administratively by hand, or by the aforementioned utilities).  So
iscsid, when started, would be assigned to the appropriate cgroup by
virtue of said configuration file.
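
A sketch of what such a rules entry might look like, assuming
libcgroup's cgrules.conf syntax (the net_prio controller and iscsi
group names are illustrative):

# /etc/cgrules.conf (illustrative)
# <user>:<process>   <controller>   <destination group>
*:iscsid             net_prio       iscsi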
Neil

> > Neil
> > > http://linuxplumbersconf.org/2010/ocw/proposals/843


end of thread, other threads:[~2011-11-14 20:56 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-11-09 19:57 net: Add network priority cgroup Neil Horman
2011-11-09 19:57 ` [RFC PATCH 1/2] net: add network priority cgroup infrastructure Neil Horman
2011-11-09 19:57 ` [RFC PATCH 2/2] net: add documentation for net_prio cgroups Neil Horman
2011-11-09 20:27 ` net: Add network priority cgroup Dave Taht
2011-11-09 21:09   ` Neil Horman
2011-11-09 21:10   ` John Fastabend
2011-11-14 11:47 ` Neil Horman
2011-11-14 12:32   ` Dave Taht
2011-11-14 14:43     ` Neil Horman
2011-11-14 16:38       ` John Fastabend
2011-11-14 16:44         ` David Täht
2011-11-14 16:43       ` Shyam_Iyer
2011-11-14 17:24         ` Neil Horman
2011-11-14 20:02           ` Shyam_Iyer
2011-11-14 20:56             ` Neil Horman
