netdev.vger.kernel.org archive mirror
* [PATCH v7] rps: Receive Packet Steering
@ 2010-03-12 20:13 Tom Herbert
  2010-03-12 21:28 ` Eric Dumazet
                   ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Tom Herbert @ 2010-03-12 20:13 UTC (permalink / raw)
  To: davem, netdev; +Cc: eric.dumazet

This patch implements software receive side packet steering (RPS).  RPS
distributes the load of received packet processing across multiple CPUs.

Problem statement: Protocol processing done in the NAPI context for received
packets is serialized per device queue and becomes a bottleneck under high
packet load.  This substantially limits pps that can be achieved on a single
queue NIC and provides no scaling with multiple cores.

This solution queues packets early on in the receive path on the backlog queues
of other CPUs.   This allows protocol processing (e.g. IP and TCP) to be
performed on packets in parallel.   For each device (or each receive queue in
a multi-queue device) a mask of CPUs is set to indicate the CPUs that can
process packets.  A CPU is selected on a per-packet basis by hashing the
contents of the packet header (e.g. the TCP or UDP 4-tuple) and using the
result to index into the CPU mask.  The IPI mechanism is used to raise
networking receive softirqs between CPUs.  This effectively emulates in
software what a multi-queue NIC can provide, but is generic, requiring no
device support.
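
As a rough sketch of the selection step (mirroring what get_rps_cpu() in this
patch does; the standalone types below are illustrative only, not the real
kernel structures):

	/*
	 * Minimal sketch of per-packet CPU selection: scale the 32-bit
	 * flow hash onto the configured CPU map.
	 */
	struct rps_map_sketch {
		unsigned int len;		/* number of CPUs in the map */
		unsigned short cpus[];		/* CPUs allowed to process packets */
	};

	static unsigned int select_cpu(const struct rps_map_sketch *map,
				       unsigned int rxhash)
	{
		/* (hash * len) >> 32 maps the hash uniformly onto [0, len) */
		return map->cpus[((unsigned long long)rxhash * map->len) >> 32];
	}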

Many devices now provide a hash over the 4-tuple on a per-packet basis
(e.g. the Toeplitz hash).  This patch allows drivers to set the HW-reported hash
in an skb field, and that value in turn is used to index into the RPS maps.
Using the HW generated hash can avoid cache misses on the packet when
steering it to a remote CPU.
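
For example, a driver whose hardware reports such a hash in its RX descriptor
could record it roughly as follows (kernel context, linux/skbuff.h assumed;
the descriptor layout is hypothetical, skb->rxhash is the field added by this
patch):

	/* Hypothetical driver RX path: propagate the HW-computed flow hash
	 * so RPS can steer the packet without touching its headers. */
	struct example_rx_desc {
		__le32 rss_hash;
		/* ... */
	};

	static void example_record_hw_hash(struct sk_buff *skb,
					   const struct example_rx_desc *desc)
	{
		skb->rxhash = le32_to_cpu(desc->rss_hash);
	}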

The CPU mask is set on a per-device and per-queue basis in the sysfs variable
/sys/class/net/<device>/queues/rx-<n>/rps_cpus.  These are canonical hex CPU
bit maps, one per receive queue in the device (numbered by <n>).  If a device
does not support multi-queue, a single variable is used for the device (rx-0).
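
As a minimal usage sketch (assuming a device named eth0), the mask can be set
from user space by writing a hex CPU bitmap to the rps_cpus file; the C
equivalent of "echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus" is:

	/* Allow CPUs 0-3 to process packets received on eth0's first RX queue. */
	#include <stdio.h>

	int main(void)
	{
		FILE *f = fopen("/sys/class/net/eth0/queues/rx-0/rps_cpus", "w");

		if (!f)
			return 1;
		fprintf(f, "f\n");	/* hex CPU bitmap: CPUs 0-3 */
		return fclose(f) ? 1 : 0;
	}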

Generally, we have found this technique increases the pps capability of a
single-queue device with good CPU utilization.  Optimal settings for the CPU
mask seem to depend on the architecture and cache hierarchy.  Below are some
results from running 500 instances of the netperf TCP_RR test with 1-byte
requests and responses.  Results show cumulative transaction rate and system
CPU utilization.

e1000e on 8 core Intel
   Without RPS: 108K tps at 33% CPU
   With RPS:    311K tps at 64% CPU

forcedeth on 16 core AMD
   Without RPS: 156K tps at 15% CPU
   With RPS:    404K tps at 49% CPU
   
bnx2x on 16 core AMD
   Without RPS: 567K tps at 61% CPU (4 HW RX queues)
   Without RPS: 738K tps at 96% CPU (8 HW RX queues)
   With RPS:    854K tps at 76% CPU (4 HW RX queues)

Caveats:
- The benefits of this patch are dependent on architecture and cache hierarchy.
Tuning the masks to get the best performance is probably necessary.
- This patch adds overhead to the path for processing a single packet.  On
a lightly loaded server this overhead may eliminate the advantages of
increased parallelism, and may even cause some relative performance degradation.
We have found that masks that are cache aware (sharing the same caches as
the interrupting CPU) mitigate much of this.
- The RPS masks can be changed dynamically; however, whenever a mask is changed
there is a possibility of generating out-of-order packets.  It's probably
best not to change the masks too frequently.

Signed-off-by: Tom Herbert <therbert@google.com>

 include/linux/netdevice.h |   32 ++++-
 include/linux/skbuff.h    |    3 +
 net/core/dev.c            |  330 +++++++++++++++++++++++++++++++++++++-------
 net/core/net-sysfs.c      |  225 ++++++++++++++++++++++++++++++-
 net/core/skbuff.c         |    2 +
 5 files changed, 536 insertions(+), 56 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index c79a88b..de1a52b 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -223,6 +223,7 @@ struct netif_rx_stats {
 	unsigned dropped;
 	unsigned time_squeeze;
 	unsigned cpu_collision;
+	unsigned received_rps;
 };
 
 DECLARE_PER_CPU(struct netif_rx_stats, netdev_rx_stat);
@@ -530,6 +531,24 @@ struct netdev_queue {
 	unsigned long		tx_dropped;
 } ____cacheline_aligned_in_smp;
 
+/*
+ * This structure holds an RPS map which can be of variable length.  The
+ * map is an array of CPUs.
+ */
+struct rps_map {
+	unsigned int len;
+	struct rcu_head rcu;
+	u16 cpus[0];
+};
+#define RPS_MAP_SIZE(_num) (sizeof(struct rps_map) + (_num * sizeof(u16)))
+
+/* This structure contains an instance of an RX queue. */
+struct netdev_rx_queue {
+	struct rps_map *rps_map;
+	struct kobject kobj;
+	struct netdev_rx_queue *first;
+	atomic_t count;
+} ____cacheline_aligned_in_smp;
 
 /*
  * This structure defines the management hooks for network devices.
@@ -878,6 +897,13 @@ struct net_device {
 
 	unsigned char		broadcast[MAX_ADDR_LEN];	/* hw bcast add	*/
 
+	struct kset		*queues_kset;
+
+	struct netdev_rx_queue	*_rx;
+
+	/* Number of RX queues allocated at alloc_netdev_mq() time  */
+	unsigned int		num_rx_queues;
+
 	struct netdev_queue	rx_queue;
 
 	struct netdev_queue	*_tx ____cacheline_aligned_in_smp;
@@ -1311,14 +1337,16 @@ static inline int unregister_gifconf(unsigned int family)
  */
 struct softnet_data {
 	struct Qdisc		*output_queue;
-	struct sk_buff_head	input_pkt_queue;
 	struct list_head	poll_list;
 	struct sk_buff		*completion_queue;
 
+	/* Elements below can be accessed between CPUs for RPS */
+	struct call_single_data	csd ____cacheline_aligned_in_smp;
+	struct sk_buff_head	input_pkt_queue;
 	struct napi_struct	backlog;
 };
 
-DECLARE_PER_CPU(struct softnet_data,softnet_data);
+DECLARE_PER_CPU_ALIGNED(struct softnet_data, softnet_data);
 
 #define HAVE_NETIF_QUEUE
 
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 03f816a..58e1dda 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -267,6 +267,7 @@ typedef unsigned char *sk_buff_data_t;
  *	@mac_header: Link layer header
  *	@_skb_dst: destination entry
  *	@sp: the security path, used for xfrm
+ *	@rxhash: the packet hash computed on receive
  *	@cb: Control buffer. Free for use by every layer. Put private vars here
  *	@len: Length of actual data
  *	@data_len: Data length
@@ -320,6 +321,8 @@ struct sk_buff {
 	struct sock		*sk;
 	struct net_device	*dev;
 
+	__u32			rxhash;
+
 	/*
 	 * This is the control buffer. It is free to use for every
 	 * layer. Please put your private variables there. If you
diff --git a/net/core/dev.c b/net/core/dev.c
index bcc490c..9f0508c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1931,7 +1931,7 @@ out_kfree_skb:
 	return rc;
 }
 
-static u32 skb_tx_hashrnd;
+static u32 hashrnd __read_mostly;
 
 u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb)
 {
@@ -1949,7 +1949,7 @@ u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb)
 	else
 		hash = skb->protocol;
 
-	hash = jhash_1word(hash, skb_tx_hashrnd);
+	hash = jhash_1word(hash, hashrnd);
 
 	return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32);
 }
@@ -2175,6 +2175,172 @@ int weight_p __read_mostly = 64;            /* old backlog weight */
 
 DEFINE_PER_CPU(struct netif_rx_stats, netdev_rx_stat) = { 0, };
 
+/*
+ * get_rps_cpu is called from netif_receive_skb and returns the target
+ * CPU from the RPS map of the receiving queue for a given skb.
+ */
+static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb)
+{
+	struct ipv6hdr *ip6;
+	struct iphdr *ip;
+	struct netdev_rx_queue *rxqueue;
+	struct rps_map *map;
+	int cpu = -1;
+	u8 ip_proto;
+	u32 addr1, addr2, ports, ihl;
+
+	rcu_read_lock();
+
+	if (skb_rx_queue_recorded(skb)) {
+		u16 index = skb_get_rx_queue(skb);
+		if (unlikely(index >= dev->num_rx_queues)) {
+			if (net_ratelimit()) {
+				WARN(1, "Received packet on %s for queue %u, "
+				    "but number of RX queues is %u\n",
+				     dev->name, index, dev->num_rx_queues);
+			}
+			goto done;
+		}
+		rxqueue = dev->_rx + index;
+	} else
+		rxqueue = dev->_rx;
+
+	if (!rxqueue->rps_map)
+		goto done;
+
+	if (skb->rxhash)
+		goto got_hash; /* Skip hash computation on packet header */
+
+	switch (skb->protocol) {
+	case __constant_htons(ETH_P_IP):
+		if (!pskb_may_pull(skb, sizeof(*ip)))
+			goto done;
+
+		ip = (struct iphdr *) skb->data;
+		ip_proto = ip->protocol;
+		addr1 = ip->saddr;
+		addr2 = ip->daddr;
+		ihl = ip->ihl;
+		break;
+	case __constant_htons(ETH_P_IPV6):
+		if (!pskb_may_pull(skb, sizeof(*ip6)))
+			goto done;
+
+		ip6 = (struct ipv6hdr *) skb->data;
+		ip_proto = ip6->nexthdr;
+		addr1 = ip6->saddr.s6_addr32[3];
+		addr2 = ip6->daddr.s6_addr32[3];
+		ihl = (40 >> 2);
+		break;
+	default:
+		goto done;
+	}
+	ports = 0;
+	switch (ip_proto) {
+	case IPPROTO_TCP:
+	case IPPROTO_UDP:
+	case IPPROTO_DCCP:
+	case IPPROTO_ESP:
+	case IPPROTO_AH:
+	case IPPROTO_SCTP:
+	case IPPROTO_UDPLITE:
+		if (pskb_may_pull(skb, (ihl * 4) + 4))
+			ports = *((u32 *) (skb->data + (ihl * 4)));
+		break;
+
+	default:
+		break;
+	}
+
+	skb->rxhash = jhash_3words(addr1, addr2, ports, hashrnd);
+	if (!skb->rxhash)
+		skb->rxhash = 1;
+
+got_hash:
+	map = rcu_dereference(rxqueue->rps_map);
+	if (map) {
+		u16 tcpu = map->cpus[((u64) skb->rxhash * map->len) >> 32];
+
+		if (cpu_online(tcpu)) {
+			cpu = tcpu;
+			goto done;
+		}
+	}
+
+done:
+	rcu_read_unlock();
+	return cpu;
+}
+
+/*
+ * This structure holds the per-CPU mask of CPUs for which IPIs are scheduled
+ * to be sent to kick remote softirq processing.  There are two masks since
+ * the sending of IPIs must be done with interrupts enabled.  The select field
+ * indicates the current mask that enqueue_backlog uses to schedule IPIs.
+ * select is flipped before net_rps_action is called while still under lock,
+ * net_rps_action then uses the non-selected mask to send the IPIs and clears
+ * it without conflicting with enqueue_backlog operation.
+ */
+struct rps_remote_softirq_cpus {
+	cpumask_t mask[2];
+	int select;
+};
+static DEFINE_PER_CPU(struct rps_remote_softirq_cpus, rps_remote_softirq_cpus);
+
+/* Called from hardirq (IPI) context */
+static void trigger_softirq(void *data)
+{
+	struct softnet_data *queue = data;
+	__napi_schedule(&queue->backlog);
+	__get_cpu_var(netdev_rx_stat).received_rps++;
+}
+
+/*
+ * enqueue_to_backlog is called to queue an skb to a per CPU backlog
+ * queue (may be a remote CPU queue).
+ */
+static int enqueue_to_backlog(struct sk_buff *skb, int cpu)
+{
+	struct softnet_data *queue;
+	unsigned long flags;
+
+	queue = &per_cpu(softnet_data, cpu);
+
+	local_irq_save(flags);
+	__get_cpu_var(netdev_rx_stat).total++;
+
+	spin_lock(&queue->input_pkt_queue.lock);
+	if (queue->input_pkt_queue.qlen <= netdev_max_backlog) {
+		if (queue->input_pkt_queue.qlen) {
+enqueue:
+			__skb_queue_tail(&queue->input_pkt_queue, skb);
+			spin_unlock_irqrestore(&queue->input_pkt_queue.lock,
+			    flags);
+			return NET_RX_SUCCESS;
+		}
+
+		/* Schedule NAPI for backlog device */
+		if (napi_schedule_prep(&queue->backlog)) {
+			if (cpu != smp_processor_id()) {
+				struct rps_remote_softirq_cpus *rcpus =
+				    &__get_cpu_var(rps_remote_softirq_cpus);
+
+				cpu_set(cpu, rcpus->mask[rcpus->select]);
+				__raise_softirq_irqoff(NET_RX_SOFTIRQ);
+			} else
+				__napi_schedule(&queue->backlog);
+		}
+		goto enqueue;
+	}
+
+	spin_unlock(&queue->input_pkt_queue.lock);
+
+	__get_cpu_var(netdev_rx_stat).dropped++;
+	local_irq_restore(flags);
+
+	kfree_skb(skb);
+	return NET_RX_DROP;
+}
 
 /**
  *	netif_rx	-	post buffer to the network code
@@ -2193,8 +2359,7 @@ DEFINE_PER_CPU(struct netif_rx_stats, netdev_rx_stat) = { 0, };
 
 int netif_rx(struct sk_buff *skb)
 {
-	struct softnet_data *queue;
-	unsigned long flags;
+	int cpu;
 
 	/* if netpoll wants it, pretend we never saw it */
 	if (netpoll_rx(skb))
@@ -2203,31 +2368,11 @@ int netif_rx(struct sk_buff *skb)
 	if (!skb->tstamp.tv64)
 		net_timestamp(skb);
 
-	/*
-	 * The code is rearranged so that the path is the most
-	 * short when CPU is congested, but is still operating.
-	 */
-	local_irq_save(flags);
-	queue = &__get_cpu_var(softnet_data);
-
-	__get_cpu_var(netdev_rx_stat).total++;
-	if (queue->input_pkt_queue.qlen <= netdev_max_backlog) {
-		if (queue->input_pkt_queue.qlen) {
-enqueue:
-			__skb_queue_tail(&queue->input_pkt_queue, skb);
-			local_irq_restore(flags);
-			return NET_RX_SUCCESS;
-		}
-
-		napi_schedule(&queue->backlog);
-		goto enqueue;
-	}
-
-	__get_cpu_var(netdev_rx_stat).dropped++;
-	local_irq_restore(flags);
+	cpu = get_rps_cpu(skb->dev, skb);
+	if (cpu < 0)
+		cpu = smp_processor_id();
 
-	kfree_skb(skb);
-	return NET_RX_DROP;
+	return enqueue_to_backlog(skb, cpu);
 }
 EXPORT_SYMBOL(netif_rx);
 
@@ -2464,22 +2609,7 @@ void netif_nit_deliver(struct sk_buff *skb)
 	rcu_read_unlock();
 }
 
-/**
- *	netif_receive_skb - process receive buffer from network
- *	@skb: buffer to process
- *
- *	netif_receive_skb() is the main receive data processing function.
- *	It always succeeds. The buffer may be dropped during processing
- *	for congestion control or by the protocol layers.
- *
- *	This function may only be called from softirq context and interrupts
- *	should be enabled.
- *
- *	Return values (usually ignored):
- *	NET_RX_SUCCESS: no congestion
- *	NET_RX_DROP: packet was dropped
- */
-int netif_receive_skb(struct sk_buff *skb)
+int __netif_receive_skb(struct sk_buff *skb)
 {
 	struct packet_type *ptype, *pt_prev;
 	struct net_device *orig_dev;
@@ -2588,6 +2718,33 @@ out:
 	rcu_read_unlock();
 	return ret;
 }
+
+/**
+ *	netif_receive_skb - process receive buffer from network
+ *	@skb: buffer to process
+ *
+ *	netif_receive_skb() is the main receive data processing function.
+ *	It always succeeds. The buffer may be dropped during processing
+ *	for congestion control or by the protocol layers.
+ *
+ *	This function may only be called from softirq context and interrupts
+ *	should be enabled.
+ *
+ *	Return values (usually ignored):
+ *	NET_RX_SUCCESS: no congestion
+ *	NET_RX_DROP: packet was dropped
+ */
+int netif_receive_skb(struct sk_buff *skb)
+{
+	int cpu;
+
+	cpu = get_rps_cpu(skb->dev, skb);
+
+	if (cpu < 0)
+		return __netif_receive_skb(skb);
+	else
+		return enqueue_to_backlog(skb, cpu);
+}
 EXPORT_SYMBOL(netif_receive_skb);
 
 /* Network device is going away, flush any packets still pending  */
@@ -2914,16 +3071,16 @@ static int process_backlog(struct napi_struct *napi, int quota)
 	do {
 		struct sk_buff *skb;
 
-		local_irq_disable();
+		spin_lock_irq(&queue->input_pkt_queue.lock);
 		skb = __skb_dequeue(&queue->input_pkt_queue);
 		if (!skb) {
 			__napi_complete(napi);
-			local_irq_enable();
+			spin_unlock_irq(&queue->input_pkt_queue.lock);
 			break;
 		}
-		local_irq_enable();
+		spin_unlock_irq(&queue->input_pkt_queue.lock);
 
-		netif_receive_skb(skb);
+		__netif_receive_skb(skb);
 	} while (++work < quota && jiffies == start_time);
 
 	return work;
@@ -3012,6 +3169,22 @@ void netif_napi_del(struct napi_struct *napi)
 }
 EXPORT_SYMBOL(netif_napi_del);
 
+/*
+ * net_rps_action sends any pending IPI's for rps.  This is only called from
+ * softirq and interrupts must be enabled.
+ */
+static void net_rps_action(cpumask_t *mask)
+{
+	int cpu;
+
+	/* Send pending IPI's to kick RPS processing on remote cpus. */
+	for_each_cpu_mask_nr(cpu, *mask) {
+		struct softnet_data *queue = &per_cpu(softnet_data, cpu);
+		if (cpu_online(cpu))
+			__smp_call_function_single(cpu, &queue->csd, 0);
+	}
+	cpus_clear(*mask);
+}
 
 static void net_rx_action(struct softirq_action *h)
 {
@@ -3019,6 +3192,8 @@ static void net_rx_action(struct softirq_action *h)
 	unsigned long time_limit = jiffies + 2;
 	int budget = netdev_budget;
 	void *have;
+	int select;
+	struct rps_remote_softirq_cpus *rcpus;
 
 	local_irq_disable();
 
@@ -3081,8 +3256,14 @@ static void net_rx_action(struct softirq_action *h)
 		netpoll_poll_unlock(have);
 	}
 out:
+	rcpus = &__get_cpu_var(rps_remote_softirq_cpus);
+	select = rcpus->select;
+	rcpus->select ^= 1;
+
 	local_irq_enable();
 
+	net_rps_action(&rcpus->mask[select]);
+
 #ifdef CONFIG_NET_DMA
 	/*
 	 * There may not be any more sk_buffs coming right now, so push
@@ -3327,10 +3508,10 @@ static int softnet_seq_show(struct seq_file *seq, void *v)
 {
 	struct netif_rx_stats *s = v;
 
-	seq_printf(seq, "%08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
+	seq_printf(seq, "%08x %08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
 		   s->total, s->dropped, s->time_squeeze, 0,
 		   0, 0, 0, 0, /* was fastroute */
-		   s->cpu_collision);
+		   s->cpu_collision, s->received_rps);
 	return 0;
 }
 
@@ -5067,6 +5248,23 @@ int register_netdevice(struct net_device *dev)
 
 	dev->iflink = -1;
 
+	if (!dev->num_rx_queues) {
+		/*
+		 * Allocate a single RX queue if driver never called
+		 * alloc_netdev_mq
+		 */
+
+		dev->_rx = kzalloc(sizeof(struct netdev_rx_queue), GFP_KERNEL);
+		if (!dev->_rx) {
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		dev->_rx->first = dev->_rx;
+		atomic_set(&dev->_rx->count, 1);
+		dev->num_rx_queues = 1;
+	}
+
 	/* Init, if this function is available */
 	if (dev->netdev_ops->ndo_init) {
 		ret = dev->netdev_ops->ndo_init(dev);
@@ -5424,9 +5622,11 @@ struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
 		void (*setup)(struct net_device *), unsigned int queue_count)
 {
 	struct netdev_queue *tx;
+	struct netdev_rx_queue *rx;
 	struct net_device *dev;
 	size_t alloc_size;
 	struct net_device *p;
+	int i;
 
 	BUG_ON(strlen(name) >= sizeof(dev->name));
 
@@ -5452,11 +5652,27 @@ struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
 		goto free_p;
 	}
 
+	rx = kcalloc(queue_count, sizeof(struct netdev_rx_queue), GFP_KERNEL);
+	if (!rx) {
+		printk(KERN_ERR "alloc_netdev: Unable to allocate "
+		       "rx queues.\n");
+		goto free_tx;
+	}
+
+	atomic_set(&rx->count, queue_count);
+
+	/*
+	 * Set a pointer to first element in the array which holds the
+	 * reference count.
+	 */
+	for (i = 0; i < queue_count; i++)
+		rx[i].first = rx;
+
 	dev = PTR_ALIGN(p, NETDEV_ALIGN);
 	dev->padded = (char *)dev - (char *)p;
 
 	if (dev_addr_init(dev))
-		goto free_tx;
+		goto free_rx;
 
 	dev_unicast_init(dev);
 
@@ -5466,6 +5682,9 @@ struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
 	dev->num_tx_queues = queue_count;
 	dev->real_num_tx_queues = queue_count;
 
+	dev->_rx = rx;
+	dev->num_rx_queues = queue_count;
+
 	dev->gso_max_size = GSO_MAX_SIZE;
 
 	netdev_init_queues(dev);
@@ -5480,9 +5699,10 @@ struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
 	strcpy(dev->name, name);
 	return dev;
 
+free_rx:
+	kfree(rx);
 free_tx:
 	kfree(tx);
-
 free_p:
 	kfree(p);
 	return NULL;
@@ -5985,6 +6205,10 @@ static int __init net_dev_init(void)
 		queue->completion_queue = NULL;
 		INIT_LIST_HEAD(&queue->poll_list);
 
+		queue->csd.func = trigger_softirq;
+		queue->csd.info = queue;
+		queue->csd.flags = 0;
+
 		queue->backlog.poll = process_backlog;
 		queue->backlog.weight = weight_p;
 		queue->backlog.gro_list = NULL;
@@ -6023,7 +6247,7 @@ subsys_initcall(net_dev_init);
 
 static int __init initialize_hashrnd(void)
 {
-	get_random_bytes(&skb_tx_hashrnd, sizeof(skb_tx_hashrnd));
+	get_random_bytes(&hashrnd, sizeof(hashrnd));
 	return 0;
 }
 
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 099c753..7a46343 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -466,6 +466,216 @@ static struct attribute_group wireless_group = {
 };
 #endif
 
+/*
+ * RX queue sysfs structures and functions.
+ */
+struct rx_queue_attribute {
+	struct attribute attr;
+	ssize_t (*show)(struct netdev_rx_queue *queue,
+	    struct rx_queue_attribute *attr, char *buf);
+	ssize_t (*store)(struct netdev_rx_queue *queue,
+	    struct rx_queue_attribute *attr, const char *buf, size_t len);
+};
+#define to_rx_queue_attr(_attr) container_of(_attr,		\
+    struct rx_queue_attribute, attr)
+
+#define to_rx_queue(obj) container_of(obj, struct netdev_rx_queue, kobj)
+
+static ssize_t rx_queue_attr_show(struct kobject *kobj, struct attribute *attr,
+				  char *buf)
+{
+	struct rx_queue_attribute *attribute = to_rx_queue_attr(attr);
+	struct netdev_rx_queue *queue = to_rx_queue(kobj);
+
+	if (!attribute->show)
+		return -EIO;
+
+	return attribute->show(queue, attribute, buf);
+}
+
+static ssize_t rx_queue_attr_store(struct kobject *kobj, struct attribute *attr,
+				   const char *buf, size_t count)
+{
+	struct rx_queue_attribute *attribute = to_rx_queue_attr(attr);
+	struct netdev_rx_queue *queue = to_rx_queue(kobj);
+
+	if (!attribute->store)
+		return -EIO;
+
+	return attribute->store(queue, attribute, buf, count);
+}
+
+static struct sysfs_ops rx_queue_sysfs_ops = {
+	.show = rx_queue_attr_show,
+	.store = rx_queue_attr_store,
+};
+
+static ssize_t show_rps_map(struct netdev_rx_queue *queue,
+			    struct rx_queue_attribute *attribute, char *buf)
+{
+	struct rps_map *map;
+	cpumask_var_t mask;
+	size_t len = 0;
+	int i;
+
+	if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
+		return -ENOMEM;
+
+	rcu_read_lock();
+	map = rcu_dereference(queue->rps_map);
+	if (map)
+		for (i = 0; i < map->len; i++)
+			cpumask_set_cpu(map->cpus[i], mask);
+
+	len += cpumask_scnprintf(buf + len, PAGE_SIZE, mask);
+	if (PAGE_SIZE - len < 3) {
+		rcu_read_unlock();
+		free_cpumask_var(mask);
+		return -EINVAL;
+	}
+	rcu_read_unlock();
+
+	free_cpumask_var(mask);
+	len += sprintf(buf + len, "\n");
+	return len;
+}
+
+static void rps_map_release(struct rcu_head *rcu)
+{
+	struct rps_map *map = container_of(rcu, struct rps_map, rcu);
+
+	kfree(map);
+}
+
+ssize_t store_rps_map(struct netdev_rx_queue *queue,
+		      struct rx_queue_attribute *attribute,
+		      const char *buf, size_t len)
+{
+	struct rps_map *old_map, *map;
+	cpumask_var_t mask;
+	int err, cpu, i;
+	static DEFINE_SPINLOCK(rps_map_lock);
+
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
+		return -ENOMEM;
+
+	err = bitmap_parse(buf, len, cpumask_bits(mask), nr_cpumask_bits);
+	if (err) {
+		free_cpumask_var(mask);
+		return err;
+	}
+
+	map = kzalloc(max_t(unsigned,
+	    RPS_MAP_SIZE(cpumask_weight(mask)), L1_CACHE_BYTES),
+	    GFP_KERNEL);
+	if (!map) {
+		free_cpumask_var(mask);
+		return -ENOMEM;
+	}
+
+	i = 0;
+	for_each_cpu_and(cpu, mask, cpu_online_mask)
+		map->cpus[i++] = cpu;
+
+	if (i)
+		map->len = i;
+	else {
+		kfree(map);
+		map = NULL;
+	}
+
+	spin_lock(&rps_map_lock);
+	old_map = queue->rps_map;
+	rcu_assign_pointer(queue->rps_map, map);
+	spin_unlock(&rps_map_lock);
+
+	if (old_map)
+		call_rcu(&old_map->rcu, rps_map_release);
+
+	free_cpumask_var(mask);
+	return len;
+}
+
+static struct rx_queue_attribute rps_cpus_attribute =
+	__ATTR(rps_cpus, S_IRUGO | S_IWUSR, show_rps_map, store_rps_map);
+
+static struct attribute *rx_queue_default_attrs[] = {
+	&rps_cpus_attribute.attr,
+	NULL
+};
+
+static void rx_queue_release(struct kobject *kobj)
+{
+	struct netdev_rx_queue *queue = to_rx_queue(kobj);
+	struct rps_map *map = queue->rps_map;
+	struct netdev_rx_queue *first = queue->first;
+
+	if (map)
+		call_rcu(&map->rcu, rps_map_release);
+
+	if (atomic_dec_and_test(&first->count))
+		kfree(first);
+}
+
+static struct kobj_type rx_queue_ktype = {
+	.sysfs_ops = &rx_queue_sysfs_ops,
+	.release = rx_queue_release,
+	.default_attrs = rx_queue_default_attrs,
+};
+
+static int rx_queue_add_kobject(struct net_device *net, int index)
+{
+	struct netdev_rx_queue *queue = net->_rx + index;
+	struct kobject *kobj = &queue->kobj;
+	int error = 0;
+
+	kobj->kset = net->queues_kset;
+	error = kobject_init_and_add(kobj, &rx_queue_ktype, NULL,
+	    "rx-%u", index);
+	if (error) {
+		kobject_put(kobj);
+		return error;
+	}
+
+	kobject_uevent(kobj, KOBJ_ADD);
+
+	return error;
+}
+
+static int rx_queue_register_kobjects(struct net_device *net)
+{
+	int i;
+	int error = 0;
+
+	net->queues_kset = kset_create_and_add("queues",
+	    NULL, &net->dev.kobj);
+	if (!net->queues_kset)
+		return -ENOMEM;
+	for (i = 0; i < net->num_rx_queues; i++) {
+		error = rx_queue_add_kobject(net, i);
+		if (error)
+			break;
+	}
+
+	if (error)
+		while (--i >= 0)
+			kobject_put(&net->_rx[i].kobj);
+
+	return error;
+}
+
+static void rx_queue_remove_kobjects(struct net_device *net)
+{
+	int i;
+
+	for (i = 0; i < net->num_rx_queues; i++)
+		kobject_put(&net->_rx[i].kobj);
+	kset_unregister(net->queues_kset);
+}
+
 #endif /* CONFIG_SYSFS */
 
 #ifdef CONFIG_HOTPLUG
@@ -529,6 +739,8 @@ void netdev_unregister_kobject(struct net_device * net)
 	if (!net_eq(dev_net(net), &init_net))
 		return;
 
+	rx_queue_remove_kobjects(net);
+
 	device_del(dev);
 }
 
@@ -537,6 +749,7 @@ int netdev_register_kobject(struct net_device *net)
 {
 	struct device *dev = &(net->dev);
 	const struct attribute_group **groups = net->sysfs_groups;
+	int error = 0;
 
 	dev->class = &net_class;
 	dev->platform_data = net;
@@ -563,7 +776,17 @@ int netdev_register_kobject(struct net_device *net)
 	if (!net_eq(dev_net(net), &init_net))
 		return 0;
 
-	return device_add(dev);
+	error = device_add(dev);
+	if (error)
+		return error;
+
+	error = rx_queue_register_kobjects(net);
+	if (error) {
+		device_del(dev);
+		return error;
+	}
+
+	return error;
 }
 
 int netdev_class_create_file(struct class_attribute *class_attr)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 93c4e06..bdea0ef 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -534,6 +534,7 @@ static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 	new->network_header	= old->network_header;
 	new->mac_header		= old->mac_header;
 	skb_dst_set(new, dst_clone(skb_dst(old)));
+	new->rxhash		= old->rxhash;
 #ifdef CONFIG_XFRM
 	new->sp			= secpath_get(old->sp);
 #endif
@@ -581,6 +582,7 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, struct sk_buff *skb)
 	C(len);
 	C(data_len);
 	C(mac_len);
+	C(rxhash);
 	n->hdr_len = skb->nohdr ? skb_headroom(skb) : skb->hdr_len;
 	n->cloned = 1;
 	n->nohdr = 0;

* Re: [PATCH v7] rps: Receive Packet Steering
  2010-03-12 20:13 [PATCH v7] rps: Receive Packet Steering Tom Herbert
@ 2010-03-12 21:28 ` Eric Dumazet
  2010-03-12 23:08   ` Tom Herbert
  2010-03-12 22:20 ` Stephen Hemminger
  2010-03-12 22:23 ` Stephen Hemminger
  2 siblings, 1 reply; 25+ messages in thread
From: Eric Dumazet @ 2010-03-12 21:28 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev

On Friday 12 March 2010 at 12:13 -0800, Tom Herbert wrote:
> This patch implements software receive side packet steering (RPS).  RPS
> distributes the load of received packet processing across multiple CPUs.
> 
> [...]

Excellent !

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

One last point about placement of rxhash in struct sk_buff, that I
missed in my previous review, sorry...

You put it right before cb[48], which is now aligned to 8 bytes (since
commit da3f5cf1, "skbuff: align sk_buff::cb to 64 bit and close some
potential holes"), so this adds a 4-byte hole.
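
For illustration (a standalone example, not the real sk_buff definition), a
lone 32-bit field placed just before an 8-byte-aligned member leaves padding:

	struct layout_example {
		unsigned int rxhash;				/* offset 0, size 4 */
								/* 4-byte hole */
		char cb[48] __attribute__((aligned(8)));	/* offset 8 */
	};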

Please put it elsewhere, possibly close to fields that are read in
get_rps_cpu() (skb->queue_mapping, skb->protocol, skb->data, ...) to
minimize the number of cache lines that the dispatching CPU has to bring
into its cache before giving the skb to another CPU for IP/TCP processing.


Thanks !



* Re: [PATCH v7] rps: Receive Packet Steering
  2010-03-12 20:13 [PATCH v7] rps: Receive Packet Steering Tom Herbert
  2010-03-12 21:28 ` Eric Dumazet
@ 2010-03-12 22:20 ` Stephen Hemminger
  2010-03-12 22:32   ` David Miller
  2010-03-12 22:23 ` Stephen Hemminger
  2 siblings, 1 reply; 25+ messages in thread
From: Stephen Hemminger @ 2010-03-12 22:20 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev, eric.dumazet

On Fri, 12 Mar 2010 12:13:12 -0800 (PST)
Tom Herbert <therbert@google.com> wrote:

> +++ b/include/linux/netdevice.h
> @@ -223,6 +223,7 @@ struct netif_rx_stats {
>  	unsigned dropped;
>  	unsigned time_squeeze;
>  	unsigned cpu_collision;
> +	unsigned received_rps;

Maybe received_rps should be unsigned long so that
it could be 64-bit on 64-bit platforms?

-- 

* Re: [PATCH v7] rps: Receive Packet Steering
  2010-03-12 20:13 [PATCH v7] rps: Receive Packet Steering Tom Herbert
  2010-03-12 21:28 ` Eric Dumazet
  2010-03-12 22:20 ` Stephen Hemminger
@ 2010-03-12 22:23 ` Stephen Hemminger
  2010-03-12 22:33   ` David Miller
  2010-03-12 23:05   ` Tom Herbert
  2 siblings, 2 replies; 25+ messages in thread
From: Stephen Hemminger @ 2010-03-12 22:23 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev, eric.dumazet

On Fri, 12 Mar 2010 12:13:12 -0800 (PST)
Tom Herbert <therbert@google.com> wrote:

> if (unlikely(index >= dev->num_rx_queues)) {
> +			if (net_ratelimit()) {
> +				WARN(1, "Received packet on %s for queue %u, "
> +				    "but number of RX queues is %u\n",
> +				     dev->name, index, dev->num_rx_queues);
> +			}
> +			goto done;


Use dev_WARN? or invent netdev_WARN?

* Re: [PATCH v7] rps: Receive Packet Steering
  2010-03-12 22:20 ` Stephen Hemminger
@ 2010-03-12 22:32   ` David Miller
  0 siblings, 0 replies; 25+ messages in thread
From: David Miller @ 2010-03-12 22:32 UTC (permalink / raw)
  To: shemminger; +Cc: therbert, netdev, eric.dumazet

From: Stephen Hemminger <shemminger@vyatta.com>
Date: Fri, 12 Mar 2010 14:20:01 -0800

> On Fri, 12 Mar 2010 12:13:12 -0800 (PST)
> Tom Herbert <therbert@google.com> wrote:
> 
>> +++ b/include/linux/netdevice.h
>> @@ -223,6 +223,7 @@ struct netif_rx_stats {
>>  	unsigned dropped;
>>  	unsigned time_squeeze;
>>  	unsigned cpu_collision;
>> +	unsigned received_rps;
> 
> Maybe received_rps should be unsigned long so that
> it could be 64 bit on 64 bit platforms?

They can all be made 64-bit since they are only
exported via /proc.

So this is completely separate from the RPS changes.

* Re: [PATCH v7] rps: Receive Packet Steering
  2010-03-12 22:23 ` Stephen Hemminger
@ 2010-03-12 22:33   ` David Miller
  2010-03-12 23:05   ` Tom Herbert
  1 sibling, 0 replies; 25+ messages in thread
From: David Miller @ 2010-03-12 22:33 UTC (permalink / raw)
  To: shemminger; +Cc: therbert, netdev, eric.dumazet

From: Stephen Hemminger <shemminger@vyatta.com>
Date: Fri, 12 Mar 2010 14:23:02 -0800

> On Fri, 12 Mar 2010 12:13:12 -0800 (PST)
> Tom Herbert <therbert@google.com> wrote:
> 
>> if (unlikely(index >= dev->num_rx_queues)) {
>> +			if (net_ratelimit()) {
>> +				WARN(1, "Received packet on %s for queue %u, "
>> +				    "but number of RX queues is %u\n",
>> +				     dev->name, index, dev->num_rx_queues);
>> +			}
>> +			goto done;
> 
> 
> Use dev_WARN? or invent netdev_WARN?

Why invent?  egrep netdev_warn include/linux/netdevice.h

* Re: [PATCH v7] rps: Receive Packet Steering
  2010-03-12 22:23 ` Stephen Hemminger
  2010-03-12 22:33   ` David Miller
@ 2010-03-12 23:05   ` Tom Herbert
  1 sibling, 0 replies; 25+ messages in thread
From: Tom Herbert @ 2010-03-12 23:05 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: davem, netdev, eric.dumazet

On Fri, Mar 12, 2010 at 2:23 PM, Stephen Hemminger
<shemminger@vyatta.com> wrote:
> On Fri, 12 Mar 2010 12:13:12 -0800 (PST)
> Tom Herbert <therbert@google.com> wrote:
>
>> if (unlikely(index >= dev->num_rx_queues)) {
>> +                     if (net_ratelimit()) {
>> +                             WARN(1, "Received packet on %s for queue %u, "
>> +                                 "but number of RX queues is %u\n",
>> +                                  dev->name, index, dev->num_rx_queues);
>> +                     }
>> +                     goto done;
>
>
> Use dev_WARN? or invent netdev_WARN?
>

netdev_warn looks good.  I'll use that here, and also in
dev_cap_txqueue where WARN is similarly used, to be consistent.

Tom

* Re: [PATCH v7] rps: Receive Packet Steering
  2010-03-12 21:28 ` Eric Dumazet
@ 2010-03-12 23:08   ` Tom Herbert
  2010-03-16 18:03     ` Tom Herbert
  0 siblings, 1 reply; 25+ messages in thread
From: Tom Herbert @ 2010-03-12 23:08 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: davem, netdev

> One last point about placement of rxhash in struct sk_buff, that I
> missed in my previous review, sorry...
>
> You put it right before cb[48], which is now aligned to 8 bytes (since
> commit da3f5cf1, "skbuff: align sk_buff::cb to 64 bit and close some
> potential holes"), so this adds a 4-byte hole.
>
> Please put it elsewhere, possibly close to fields that are read in
> get_rps_cpu() (skb->queue_mapping, skb->protocol, skb->data, ...) to
> minimize the number of cache lines that the dispatching CPU has to bring
> into its cache before giving the skb to another CPU for IP/TCP processing.
>
Looks like it will fit right before queue_mapping; I'll put it there.

Tom

* Re: [PATCH v7] rps: Receive Packet Steering
  2010-03-12 23:08   ` Tom Herbert
@ 2010-03-16 18:03     ` Tom Herbert
  2010-03-16 21:00       ` Eric Dumazet
  0 siblings, 1 reply; 25+ messages in thread
From: Tom Herbert @ 2010-03-16 18:03 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: davem, netdev

Tom Herbert wrote:
>> One last point about placement of rxhash in struct sk_buff, that I
>> missed in my previous review, sorry...
>>
>> You put it right before cb[48], which is now aligned to 8 bytes (since
>> commit da3f5cf1, "skbuff: align sk_buff::cb to 64 bit and close some
>> potential holes"), so this adds a 4-byte hole.
>>
>> Please put it elsewhere, possibly close to fields that are read in
>> get_rps_cpu() (skb->queue_mapping, skb->protocol, skb->data, ...) to
>> minimize the number of cache lines that the dispatching CPU has to bring
>> into its cache before giving the skb to another CPU for IP/TCP processing.
>>
> Looks like it will fit right before queue_mapping; I'll put it there.
> 
> Tom


This patch implements software receive side packet steering (RPS).  RPS
distributes the load of received packet processing across multiple CPUs.

Problem statement: Protocol processing done in the NAPI context for received
packets is serialized per device queue and becomes a bottleneck under high
packet load.  This substantially limits pps that can be achieved on a single
queue NIC and provides no scaling with multiple cores.

This solution queues packets early on in the receive path on the backlog queues
of other CPUs.   This allows protocol processing (e.g. IP and TCP) to be
performed on packets in parallel.   For each device (or each receive queue in
a multi-queue device) a mask of CPUs is set to indicate the CPUs that can
process packets.  A CPU is selected on a per-packet basis by hashing the
contents of the packet header (e.g. the TCP or UDP 4-tuple) and using the
result to index into the CPU mask.  The IPI mechanism is used to raise
networking receive softirqs between CPUs.  This effectively emulates in
software what a multi-queue NIC can provide, but is generic, requiring no
device support.

Many devices now provide a hash over the 4-tuple on a per-packet basis
(e.g. the Toeplitz hash).  This patch allows drivers to set the HW-reported hash
in an skb field, and that value in turn is used to index into the RPS maps.
Using the HW generated hash can avoid cache misses on the packet when
steering it to a remote CPU.

The CPU mask is set on a per-device and per-queue basis in the sysfs variable
/sys/class/net/<device>/queues/rx-<n>/rps_cpus.  These are canonical hex CPU
bit maps, one per receive queue in the device (numbered by <n>).  If a device
does not support multi-queue, a single variable is used for the device (rx-0).

Generally, we have found this technique increases the pps capability of a
single-queue device with good CPU utilization.  Optimal settings for the CPU
mask seem to depend on the architecture and cache hierarchy.  Below are some
results from running 500 instances of the netperf TCP_RR test with 1-byte
requests and responses.  Results show cumulative transaction rate and system
CPU utilization.

e1000e on 8 core Intel
   Without RPS: 108K tps at 33% CPU
   With RPS:    311K tps at 64% CPU

forcedeth on 16 core AMD
   Without RPS: 156K tps at 15% CPU
   With RPS:    404K tps at 49% CPU
   
bnx2x on 16 core AMD
   Without RPS: 567K tps at 61% CPU (4 HW RX queues)
   Without RPS: 738K tps at 96% CPU (8 HW RX queues)
   With RPS:    854K tps at 76% CPU (4 HW RX queues)

Caveats:
- The benefits of this patch are dependent on architecture and cache hierarchy.
Tuning the masks to get the best performance is probably necessary.
- This patch adds overhead to the path for processing a single packet.  On
a lightly loaded server this overhead may eliminate the advantages of
increased parallelism, and may even cause some relative performance degradation.
We have found that masks that are cache aware (sharing the same caches as
the interrupting CPU) mitigate much of this.
- The RPS masks can be changed dynamically; however, whenever a mask is changed
there is a possibility of generating out-of-order packets.  It's probably
best not to change the masks too frequently.

Signed-off-by: Tom Herbert <therbert@google.com>

 include/linux/netdevice.h |   32 ++++-
 include/linux/skbuff.h    |    3 +
 net/core/dev.c            |  335 +++++++++++++++++++++++++++++++++++++--------
 net/core/net-sysfs.c      |  225 ++++++++++++++++++++++++++++++-
 net/core/skbuff.c         |    2 +
 5 files changed, 538 insertions(+), 59 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index c79a88b..de1a52b 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -223,6 +223,7 @@ struct netif_rx_stats {
 	unsigned dropped;
 	unsigned time_squeeze;
 	unsigned cpu_collision;
+	unsigned received_rps;
 };
 
 DECLARE_PER_CPU(struct netif_rx_stats, netdev_rx_stat);
@@ -530,6 +531,24 @@ struct netdev_queue {
 	unsigned long		tx_dropped;
 } ____cacheline_aligned_in_smp;
 
+/*
+ * This structure holds an RPS map which can be of variable length.  The
+ * map is an array of CPUs.
+ */
+struct rps_map {
+	unsigned int len;
+	struct rcu_head rcu;
+	u16 cpus[0];
+};
+#define RPS_MAP_SIZE(_num) (sizeof(struct rps_map) + (_num * sizeof(u16)))
+
+/* This structure contains an instance of an RX queue. */
+struct netdev_rx_queue {
+	struct rps_map *rps_map;
+	struct kobject kobj;
+	struct netdev_rx_queue *first;
+	atomic_t count;
+} ____cacheline_aligned_in_smp;
 
 /*
  * This structure defines the management hooks for network devices.
@@ -878,6 +897,13 @@ struct net_device {
 
 	unsigned char		broadcast[MAX_ADDR_LEN];	/* hw bcast add	*/
 
+	struct kset		*queues_kset;
+
+	struct netdev_rx_queue	*_rx;
+
+	/* Number of RX queues allocated at alloc_netdev_mq() time  */
+	unsigned int		num_rx_queues;
+
 	struct netdev_queue	rx_queue;
 
 	struct netdev_queue	*_tx ____cacheline_aligned_in_smp;
@@ -1311,14 +1337,16 @@ static inline int unregister_gifconf(unsigned int family)
  */
 struct softnet_data {
 	struct Qdisc		*output_queue;
-	struct sk_buff_head	input_pkt_queue;
 	struct list_head	poll_list;
 	struct sk_buff		*completion_queue;
 
+	/* Elements below can be accessed between CPUs for RPS */
+	struct call_single_data	csd ____cacheline_aligned_in_smp;
+	struct sk_buff_head	input_pkt_queue;
 	struct napi_struct	backlog;
 };
 
-DECLARE_PER_CPU(struct softnet_data,softnet_data);
+DECLARE_PER_CPU_ALIGNED(struct softnet_data, softnet_data);
 
 #define HAVE_NETIF_QUEUE
 
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 03f816a..def10b0 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -300,6 +300,7 @@ typedef unsigned char *sk_buff_data_t;
  *	@nfct_reasm: netfilter conntrack re-assembly pointer
  *	@nf_bridge: Saved data about a bridged frame - see br_netfilter.c
  *	@skb_iif: ifindex of device we arrived on
+ *	@rxhash: the packet hash computed on receive
  *	@queue_mapping: Queue mapping for multiqueue devices
  *	@tc_index: Traffic control index
  *	@tc_verd: traffic control verdict
@@ -375,6 +376,8 @@ struct sk_buff {
 #endif
 #endif
 
+	__u32			rxhash;
+
 	kmemcheck_bitfield_begin(flags2);
 	__u16			queue_mapping:16;
 #ifdef CONFIG_IPV6_NDISC_NODETYPE
diff --git a/net/core/dev.c b/net/core/dev.c
index bcc490c..17b1686 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1931,7 +1931,7 @@ out_kfree_skb:
 	return rc;
 }
 
-static u32 skb_tx_hashrnd;
+static u32 hashrnd __read_mostly;
 
 u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb)
 {
@@ -1949,7 +1949,7 @@ u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb)
 	else
 		hash = skb->protocol;
 
-	hash = jhash_1word(hash, skb_tx_hashrnd);
+	hash = jhash_1word(hash, hashrnd);
 
 	return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32);
 }
@@ -1959,10 +1959,9 @@ static inline u16 dev_cap_txqueue(struct net_device *dev, u16 queue_index)
 {
 	if (unlikely(queue_index >= dev->real_num_tx_queues)) {
 		if (net_ratelimit()) {
-			WARN(1, "%s selects TX queue %d, but "
+			netdev_warn(dev, "selects TX queue %d, but "
 			     "real number of TX queues is %d\n",
-			     dev->name, queue_index,
-			     dev->real_num_tx_queues);
+			     queue_index, dev->real_num_tx_queues);
 		}
 		return 0;
 	}
@@ -2175,6 +2174,172 @@ int weight_p __read_mostly = 64;            /* old backlog weight */
 
 DEFINE_PER_CPU(struct netif_rx_stats, netdev_rx_stat) = { 0, };
 
+/*
+ * get_rps_cpu is called from netif_receive_skb and returns the target
+ * CPU from the RPS map of the receiving queue for a given skb.
+ */
+static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb)
+{
+	struct ipv6hdr *ip6;
+	struct iphdr *ip;
+	struct netdev_rx_queue *rxqueue;
+	struct rps_map *map;
+	int cpu = -1;
+	u8 ip_proto;
+	u32 addr1, addr2, ports, ihl;
+
+	rcu_read_lock();
+
+	if (skb_rx_queue_recorded(skb)) {
+		u16 index = skb_get_rx_queue(skb);
+		if (unlikely(index >= dev->num_rx_queues)) {
+			if (net_ratelimit()) {
+				netdev_warn(dev, "received packet on queue "
+				    "%u, but number of RX queues is %u\n",
+				     index, dev->num_rx_queues);
+			}
+			goto done;
+		}
+		rxqueue = dev->_rx + index;
+	} else
+		rxqueue = dev->_rx;
+
+	if (!rxqueue->rps_map)
+		goto done;
+
+	if (skb->rxhash)
+		goto got_hash; /* Skip hash computation on packet header */
+
+	switch (skb->protocol) {
+	case __constant_htons(ETH_P_IP):
+		if (!pskb_may_pull(skb, sizeof(*ip)))
+			goto done;
+
+		ip = (struct iphdr *) skb->data;
+		ip_proto = ip->protocol;
+		addr1 = ip->saddr;
+		addr2 = ip->daddr;
+		ihl = ip->ihl;
+		break;
+	case __constant_htons(ETH_P_IPV6):
+		if (!pskb_may_pull(skb, sizeof(*ip6)))
+			goto done;
+
+		ip6 = (struct ipv6hdr *) skb->data;
+		ip_proto = ip6->nexthdr;
+		addr1 = ip6->saddr.s6_addr32[3];
+		addr2 = ip6->daddr.s6_addr32[3];
+		ihl = (40 >> 2);
+		break;
+	default:
+		goto done;
+	}
+	ports = 0;
+	switch (ip_proto) {
+	case IPPROTO_TCP:
+	case IPPROTO_UDP:
+	case IPPROTO_DCCP:
+	case IPPROTO_ESP:
+	case IPPROTO_AH:
+	case IPPROTO_SCTP:
+	case IPPROTO_UDPLITE:
+		if (pskb_may_pull(skb, (ihl * 4) + 4))
+			ports = *((u32 *) (skb->data + (ihl * 4)));
+		break;
+
+	default:
+		break;
+	}
+
+	skb->rxhash = jhash_3words(addr1, addr2, ports, hashrnd);
+	if (!skb->rxhash)
+		skb->rxhash = 1;
+
+got_hash:
+	map = rcu_dereference(rxqueue->rps_map);
+	if (map) {
+		u16 tcpu = map->cpus[((u64) skb->rxhash * map->len) >> 32];
+
+		if (cpu_online(tcpu)) {
+			cpu = tcpu;
+			goto done;
+		}
+	}
+
+done:
+	rcu_read_unlock();
+	return cpu;
+}
+
+/*
+ * This structure holds the per-CPU mask of CPUs for which IPIs are scheduled
+ * to be sent to kick remote softirq processing.  There are two masks since
+ * the sending of IPIs must be done with interrupts enabled.  The select field
+ * indicates the current mask that enqueue_backlog uses to schedule IPIs.
+ * select is flipped before net_rps_action is called while still under lock,
+ * net_rps_action then uses the non-selected mask to send the IPIs and clears
+ * it without conflicting with enqueue_backlog operation.
+ */
+struct rps_remote_softirq_cpus {
+	cpumask_t mask[2];
+	int select;
+};
+static DEFINE_PER_CPU(struct rps_remote_softirq_cpus, rps_remote_softirq_cpus);
+
+/* Called from hardirq (IPI) context */
+static void trigger_softirq(void *data)
+{
+	struct softnet_data *queue = data;
+	__napi_schedule(&queue->backlog);
+	__get_cpu_var(netdev_rx_stat).received_rps++;
+}
+
+/*
+ * enqueue_to_backlog is called to queue an skb to a per CPU backlog
+ * queue (may be a remote CPU queue).
+ */
+static int enqueue_to_backlog(struct sk_buff *skb, int cpu)
+{
+	struct softnet_data *queue;
+	unsigned long flags;
+
+	queue = &per_cpu(softnet_data, cpu);
+
+	local_irq_save(flags);
+	__get_cpu_var(netdev_rx_stat).total++;
+
+	spin_lock(&queue->input_pkt_queue.lock);
+	if (queue->input_pkt_queue.qlen <= netdev_max_backlog) {
+		if (queue->input_pkt_queue.qlen) {
+enqueue:
+			__skb_queue_tail(&queue->input_pkt_queue, skb);
+			spin_unlock_irqrestore(&queue->input_pkt_queue.lock,
+			    flags);
+			return NET_RX_SUCCESS;
+		}
+
+		/* Schedule NAPI for backlog device */
+		if (napi_schedule_prep(&queue->backlog)) {
+			if (cpu != smp_processor_id()) {
+				struct rps_remote_softirq_cpus *rcpus =
+				    &__get_cpu_var(rps_remote_softirq_cpus);
+
+				cpu_set(cpu, rcpus->mask[rcpus->select]);
+				__raise_softirq_irqoff(NET_RX_SOFTIRQ);
+			} else
+				__napi_schedule(&queue->backlog);
+		}
+		goto enqueue;
+	}
+
+	spin_unlock(&queue->input_pkt_queue.lock);
+
+	__get_cpu_var(netdev_rx_stat).dropped++;
+	local_irq_restore(flags);
+
+	kfree_skb(skb);
+	return NET_RX_DROP;
+}
 
 /**
  *	netif_rx	-	post buffer to the network code
@@ -2193,8 +2358,7 @@ DEFINE_PER_CPU(struct netif_rx_stats, netdev_rx_stat) = { 0, };
 
 int netif_rx(struct sk_buff *skb)
 {
-	struct softnet_data *queue;
-	unsigned long flags;
+	int cpu;
 
 	/* if netpoll wants it, pretend we never saw it */
 	if (netpoll_rx(skb))
@@ -2203,31 +2367,11 @@ int netif_rx(struct sk_buff *skb)
 	if (!skb->tstamp.tv64)
 		net_timestamp(skb);
 
-	/*
-	 * The code is rearranged so that the path is the most
-	 * short when CPU is congested, but is still operating.
-	 */
-	local_irq_save(flags);
-	queue = &__get_cpu_var(softnet_data);
-
-	__get_cpu_var(netdev_rx_stat).total++;
-	if (queue->input_pkt_queue.qlen <= netdev_max_backlog) {
-		if (queue->input_pkt_queue.qlen) {
-enqueue:
-			__skb_queue_tail(&queue->input_pkt_queue, skb);
-			local_irq_restore(flags);
-			return NET_RX_SUCCESS;
-		}
-
-		napi_schedule(&queue->backlog);
-		goto enqueue;
-	}
-
-	__get_cpu_var(netdev_rx_stat).dropped++;
-	local_irq_restore(flags);
+	cpu = get_rps_cpu(skb->dev, skb);
+	if (cpu < 0)
+		cpu = smp_processor_id();
 
-	kfree_skb(skb);
-	return NET_RX_DROP;
+	return enqueue_to_backlog(skb, cpu);
 }
 EXPORT_SYMBOL(netif_rx);
 
@@ -2464,22 +2608,7 @@ void netif_nit_deliver(struct sk_buff *skb)
 	rcu_read_unlock();
 }
 
-/**
- *	netif_receive_skb - process receive buffer from network
- *	@skb: buffer to process
- *
- *	netif_receive_skb() is the main receive data processing function.
- *	It always succeeds. The buffer may be dropped during processing
- *	for congestion control or by the protocol layers.
- *
- *	This function may only be called from softirq context and interrupts
- *	should be enabled.
- *
- *	Return values (usually ignored):
- *	NET_RX_SUCCESS: no congestion
- *	NET_RX_DROP: packet was dropped
- */
-int netif_receive_skb(struct sk_buff *skb)
+int __netif_receive_skb(struct sk_buff *skb)
 {
 	struct packet_type *ptype, *pt_prev;
 	struct net_device *orig_dev;
@@ -2588,6 +2717,33 @@ out:
 	rcu_read_unlock();
 	return ret;
 }
+
+/**
+ *	netif_receive_skb - process receive buffer from network
+ *	@skb: buffer to process
+ *
+ *	netif_receive_skb() is the main receive data processing function.
+ *	It always succeeds. The buffer may be dropped during processing
+ *	for congestion control or by the protocol layers.
+ *
+ *	This function may only be called from softirq context and interrupts
+ *	should be enabled.
+ *
+ *	Return values (usually ignored):
+ *	NET_RX_SUCCESS: no congestion
+ *	NET_RX_DROP: packet was dropped
+ */
+int netif_receive_skb(struct sk_buff *skb)
+{
+	int cpu;
+
+	cpu = get_rps_cpu(skb->dev, skb);
+
+	if (cpu < 0)
+		return __netif_receive_skb(skb);
+	else
+		return enqueue_to_backlog(skb, cpu);
+}
 EXPORT_SYMBOL(netif_receive_skb);
 
 /* Network device is going away, flush any packets still pending  */
@@ -2914,16 +3070,16 @@ static int process_backlog(struct napi_struct *napi, int quota)
 	do {
 		struct sk_buff *skb;
 
-		local_irq_disable();
+		spin_lock_irq(&queue->input_pkt_queue.lock);
 		skb = __skb_dequeue(&queue->input_pkt_queue);
 		if (!skb) {
 			__napi_complete(napi);
-			local_irq_enable();
+			spin_unlock_irq(&queue->input_pkt_queue.lock);
 			break;
 		}
-		local_irq_enable();
+		spin_unlock_irq(&queue->input_pkt_queue.lock);
 
-		netif_receive_skb(skb);
+		__netif_receive_skb(skb);
 	} while (++work < quota && jiffies == start_time);
 
 	return work;
@@ -3012,6 +3168,22 @@ void netif_napi_del(struct napi_struct *napi)
 }
 EXPORT_SYMBOL(netif_napi_del);
 
+/*
+ * net_rps_action sends any pending IPI's for rps.  This is only called from
+ * softirq and interrupts must be enabled.
+ */
+static void net_rps_action(cpumask_t *mask)
+{
+	int cpu;
+
+	/* Send pending IPI's to kick RPS processing on remote cpus. */
+	for_each_cpu_mask_nr(cpu, *mask) {
+		struct softnet_data *queue = &per_cpu(softnet_data, cpu);
+		if (cpu_online(cpu))
+			__smp_call_function_single(cpu, &queue->csd, 0);
+	}
+	cpus_clear(*mask);
+}
 
 static void net_rx_action(struct softirq_action *h)
 {
@@ -3019,6 +3191,8 @@ static void net_rx_action(struct softirq_action *h)
 	unsigned long time_limit = jiffies + 2;
 	int budget = netdev_budget;
 	void *have;
+	int select;
+	struct rps_remote_softirq_cpus *rcpus;
 
 	local_irq_disable();
 
@@ -3081,8 +3255,14 @@ static void net_rx_action(struct softirq_action *h)
 		netpoll_poll_unlock(have);
 	}
 out:
+	rcpus = &__get_cpu_var(rps_remote_softirq_cpus);
+	select = rcpus->select;
+	rcpus->select ^= 1;
+
 	local_irq_enable();
 
+	net_rps_action(&rcpus->mask[select]);
+
 #ifdef CONFIG_NET_DMA
 	/*
 	 * There may not be any more sk_buffs coming right now, so push
@@ -3327,10 +3507,10 @@ static int softnet_seq_show(struct seq_file *seq, void *v)
 {
 	struct netif_rx_stats *s = v;
 
-	seq_printf(seq, "%08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
+	seq_printf(seq, "%08x %08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
 		   s->total, s->dropped, s->time_squeeze, 0,
 		   0, 0, 0, 0, /* was fastroute */
-		   s->cpu_collision);
+		   s->cpu_collision, s->received_rps);
 	return 0;
 }
 
@@ -5067,6 +5247,23 @@ int register_netdevice(struct net_device *dev)
 
 	dev->iflink = -1;
 
+	if (!dev->num_rx_queues) {
+		/*
+		 * Allocate a single RX queue if driver never called
+		 * alloc_netdev_mq
+		 */
+
+		dev->_rx = kzalloc(sizeof(struct netdev_rx_queue), GFP_KERNEL);
+		if (!dev->_rx) {
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		dev->_rx->first = dev->_rx;
+		atomic_set(&dev->_rx->count, 1);
+		dev->num_rx_queues = 1;
+	}
+
 	/* Init, if this function is available */
 	if (dev->netdev_ops->ndo_init) {
 		ret = dev->netdev_ops->ndo_init(dev);
@@ -5424,9 +5621,11 @@ struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
 		void (*setup)(struct net_device *), unsigned int queue_count)
 {
 	struct netdev_queue *tx;
+	struct netdev_rx_queue *rx;
 	struct net_device *dev;
 	size_t alloc_size;
 	struct net_device *p;
+	int i;
 
 	BUG_ON(strlen(name) >= sizeof(dev->name));
 
@@ -5452,11 +5651,27 @@ struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
 		goto free_p;
 	}
 
+	rx = kcalloc(queue_count, sizeof(struct netdev_rx_queue), GFP_KERNEL);
+	if (!rx) {
+		printk(KERN_ERR "alloc_netdev: Unable to allocate "
+		       "rx queues.\n");
+		goto free_tx;
+	}
+
+	atomic_set(&rx->count, queue_count);
+
+	/*
+	 * Set a pointer to first element in the array which holds the
+	 * reference count.
+	 */
+	for (i = 0; i < queue_count; i++)
+		rx[i].first = rx;
+
 	dev = PTR_ALIGN(p, NETDEV_ALIGN);
 	dev->padded = (char *)dev - (char *)p;
 
 	if (dev_addr_init(dev))
-		goto free_tx;
+		goto free_rx;
 
 	dev_unicast_init(dev);
 
@@ -5466,6 +5681,9 @@ struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
 	dev->num_tx_queues = queue_count;
 	dev->real_num_tx_queues = queue_count;
 
+	dev->_rx = rx;
+	dev->num_rx_queues = queue_count;
+
 	dev->gso_max_size = GSO_MAX_SIZE;
 
 	netdev_init_queues(dev);
@@ -5480,9 +5698,10 @@ struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
 	strcpy(dev->name, name);
 	return dev;
 
+free_rx:
+	kfree(rx);
 free_tx:
 	kfree(tx);
-
 free_p:
 	kfree(p);
 	return NULL;
@@ -5985,6 +6204,10 @@ static int __init net_dev_init(void)
 		queue->completion_queue = NULL;
 		INIT_LIST_HEAD(&queue->poll_list);
 
+		queue->csd.func = trigger_softirq;
+		queue->csd.info = queue;
+		queue->csd.flags = 0;
+
 		queue->backlog.poll = process_backlog;
 		queue->backlog.weight = weight_p;
 		queue->backlog.gro_list = NULL;
@@ -6023,7 +6246,7 @@ subsys_initcall(net_dev_init);
 
 static int __init initialize_hashrnd(void)
 {
-	get_random_bytes(&skb_tx_hashrnd, sizeof(skb_tx_hashrnd));
+	get_random_bytes(&hashrnd, sizeof(hashrnd));
 	return 0;
 }
 
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 099c753..7a46343 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -466,6 +466,216 @@ static struct attribute_group wireless_group = {
 };
 #endif
 
+/*
+ * RX queue sysfs structures and functions.
+ */
+struct rx_queue_attribute {
+	struct attribute attr;
+	ssize_t (*show)(struct netdev_rx_queue *queue,
+	    struct rx_queue_attribute *attr, char *buf);
+	ssize_t (*store)(struct netdev_rx_queue *queue,
+	    struct rx_queue_attribute *attr, const char *buf, size_t len);
+};
+#define to_rx_queue_attr(_attr) container_of(_attr,		\
+    struct rx_queue_attribute, attr)
+
+#define to_rx_queue(obj) container_of(obj, struct netdev_rx_queue, kobj)
+
+static ssize_t rx_queue_attr_show(struct kobject *kobj, struct attribute *attr,
+				  char *buf)
+{
+	struct rx_queue_attribute *attribute = to_rx_queue_attr(attr);
+	struct netdev_rx_queue *queue = to_rx_queue(kobj);
+
+	if (!attribute->show)
+		return -EIO;
+
+	return attribute->show(queue, attribute, buf);
+}
+
+static ssize_t rx_queue_attr_store(struct kobject *kobj, struct attribute *attr,
+				   const char *buf, size_t count)
+{
+	struct rx_queue_attribute *attribute = to_rx_queue_attr(attr);
+	struct netdev_rx_queue *queue = to_rx_queue(kobj);
+
+	if (!attribute->store)
+		return -EIO;
+
+	return attribute->store(queue, attribute, buf, count);
+}
+
+static struct sysfs_ops rx_queue_sysfs_ops = {
+	.show = rx_queue_attr_show,
+	.store = rx_queue_attr_store,
+};
+
+static ssize_t show_rps_map(struct netdev_rx_queue *queue,
+			    struct rx_queue_attribute *attribute, char *buf)
+{
+	struct rps_map *map;
+	cpumask_var_t mask;
+	size_t len = 0;
+	int i;
+
+	if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
+		return -ENOMEM;
+
+	rcu_read_lock();
+	map = rcu_dereference(queue->rps_map);
+	if (map)
+		for (i = 0; i < map->len; i++)
+			cpumask_set_cpu(map->cpus[i], mask);
+
+	len += cpumask_scnprintf(buf + len, PAGE_SIZE, mask);
+	if (PAGE_SIZE - len < 3) {
+		rcu_read_unlock();
+		free_cpumask_var(mask);
+		return -EINVAL;
+	}
+	rcu_read_unlock();
+
+	free_cpumask_var(mask);
+	len += sprintf(buf + len, "\n");
+	return len;
+}
+
+static void rps_map_release(struct rcu_head *rcu)
+{
+	struct rps_map *map = container_of(rcu, struct rps_map, rcu);
+
+	kfree(map);
+}
+
+ssize_t store_rps_map(struct netdev_rx_queue *queue,
+		      struct rx_queue_attribute *attribute,
+		      const char *buf, size_t len)
+{
+	struct rps_map *old_map, *map;
+	cpumask_var_t mask;
+	int err, cpu, i;
+	static DEFINE_SPINLOCK(rps_map_lock);
+
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
+		return -ENOMEM;
+
+	err = bitmap_parse(buf, len, cpumask_bits(mask), nr_cpumask_bits);
+	if (err) {
+		free_cpumask_var(mask);
+		return err;
+	}
+
+	map = kzalloc(max_t(unsigned,
+	    RPS_MAP_SIZE(cpumask_weight(mask)), L1_CACHE_BYTES),
+	    GFP_KERNEL);
+	if (!map) {
+		free_cpumask_var(mask);
+		return -ENOMEM;
+	}
+
+	i = 0;
+	for_each_cpu_and(cpu, mask, cpu_online_mask)
+		map->cpus[i++] = cpu;
+
+	if (i)
+		map->len = i;
+	else {
+		kfree(map);
+		map = NULL;
+	}
+
+	spin_lock(&rps_map_lock);
+	old_map = queue->rps_map;
+	rcu_assign_pointer(queue->rps_map, map);
+	spin_unlock(&rps_map_lock);
+
+	if (old_map)
+		call_rcu(&old_map->rcu, rps_map_release);
+
+	free_cpumask_var(mask);
+	return len;
+}
+
+static struct rx_queue_attribute rps_cpus_attribute =
+	__ATTR(rps_cpus, S_IRUGO | S_IWUSR, show_rps_map, store_rps_map);
+
+static struct attribute *rx_queue_default_attrs[] = {
+	&rps_cpus_attribute.attr,
+	NULL
+};
+
+static void rx_queue_release(struct kobject *kobj)
+{
+	struct netdev_rx_queue *queue = to_rx_queue(kobj);
+	struct rps_map *map = queue->rps_map;
+	struct netdev_rx_queue *first = queue->first;
+
+	if (map)
+		call_rcu(&map->rcu, rps_map_release);
+
+	if (atomic_dec_and_test(&first->count))
+		kfree(first);
+}
+
+static struct kobj_type rx_queue_ktype = {
+	.sysfs_ops = &rx_queue_sysfs_ops,
+	.release = rx_queue_release,
+	.default_attrs = rx_queue_default_attrs,
+};
+
+static int rx_queue_add_kobject(struct net_device *net, int index)
+{
+	struct netdev_rx_queue *queue = net->_rx + index;
+	struct kobject *kobj = &queue->kobj;
+	int error = 0;
+
+	kobj->kset = net->queues_kset;
+	error = kobject_init_and_add(kobj, &rx_queue_ktype, NULL,
+	    "rx-%u", index);
+	if (error) {
+		kobject_put(kobj);
+		return error;
+	}
+
+	kobject_uevent(kobj, KOBJ_ADD);
+
+	return error;
+}
+
+static int rx_queue_register_kobjects(struct net_device *net)
+{
+	int i;
+	int error = 0;
+
+	net->queues_kset = kset_create_and_add("queues",
+	    NULL, &net->dev.kobj);
+	if (!net->queues_kset)
+		return -ENOMEM;
+	for (i = 0; i < net->num_rx_queues; i++) {
+		error = rx_queue_add_kobject(net, i);
+		if (error)
+			break;
+	}
+
+	if (error)
+		while (--i >= 0)
+			kobject_put(&net->_rx[i].kobj);
+
+	return error;
+}
+
+static void rx_queue_remove_kobjects(struct net_device *net)
+{
+	int i;
+
+	for (i = 0; i < net->num_rx_queues; i++)
+		kobject_put(&net->_rx[i].kobj);
+	kset_unregister(net->queues_kset);
+}
+
 #endif /* CONFIG_SYSFS */
 
 #ifdef CONFIG_HOTPLUG
@@ -529,6 +739,8 @@ void netdev_unregister_kobject(struct net_device * net)
 	if (!net_eq(dev_net(net), &init_net))
 		return;
 
+	rx_queue_remove_kobjects(net);
+
 	device_del(dev);
 }
 
@@ -537,6 +749,7 @@ int netdev_register_kobject(struct net_device *net)
 {
 	struct device *dev = &(net->dev);
 	const struct attribute_group **groups = net->sysfs_groups;
+	int error = 0;
 
 	dev->class = &net_class;
 	dev->platform_data = net;
@@ -563,7 +776,17 @@ int netdev_register_kobject(struct net_device *net)
 	if (!net_eq(dev_net(net), &init_net))
 		return 0;
 
-	return device_add(dev);
+	error = device_add(dev);
+	if (error)
+		return error;
+
+	error = rx_queue_register_kobjects(net);
+	if (error) {
+		device_del(dev);
+		return error;
+	}
+
+	return error;
 }
 
 int netdev_class_create_file(struct class_attribute *class_attr)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 93c4e06..bdea0ef 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -534,6 +534,7 @@ static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 	new->network_header	= old->network_header;
 	new->mac_header		= old->mac_header;
 	skb_dst_set(new, dst_clone(skb_dst(old)));
+	new->rxhash		= old->rxhash;
 #ifdef CONFIG_XFRM
 	new->sp			= secpath_get(old->sp);
 #endif
@@ -581,6 +582,7 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, struct sk_buff *skb)
 	C(len);
 	C(data_len);
 	C(mac_len);
+	C(rxhash);
 	n->hdr_len = skb->nohdr ? skb_headroom(skb) : skb->hdr_len;
 	n->cloned = 1;
 	n->nohdr = 0;
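
(A side note on the new rxhash field copied above: a driver whose NIC reports a
per-packet hash can hand that value to the stack by filling the field in its RX
path.  The fragment below is a hypothetical sketch, not code from this patch;
'hw_hash' stands in for whatever a given NIC's RX descriptor provides.)

#include <linux/skbuff.h>

/*
 * Hypothetical driver RX fragment, for illustration only (not part of
 * the patch): copy a hardware-reported hash into the new skb->rxhash
 * field so RPS can steer the packet without touching its headers.
 * 'hw_hash' stands in for whatever a given NIC's RX descriptor
 * provides; leaving rxhash at 0 should let the stack fall back to
 * computing a hash in software.
 */
static inline void example_set_rxhash(struct sk_buff *skb, u32 hw_hash)
{
	skb->rxhash = hw_hash;
}
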

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v7] rps: Receive Packet Steering
  2010-03-16 18:03     ` Tom Herbert
@ 2010-03-16 21:00       ` Eric Dumazet
  2010-03-16 21:13         ` David Miller
  2010-03-17  4:26         ` David Miller
  0 siblings, 2 replies; 25+ messages in thread
From: Eric Dumazet @ 2010-03-16 21:00 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev

On Tuesday 16 March 2010 at 11:03 -0700, Tom Herbert wrote:
> This patch implements software receive side packet steering (RPS).  RPS
> distributes the load of received packet processing across multiple CPUs.
 ...
> 
> Signed-off-by: Tom Herbert <therbert@google.com>
> 

Well, I tested this on my dev machine, with a vlan over bonding setup,
tg3 and bnx2 drivers (single queue), and all is good. The UDP benchmark is
no longer using 100% of one cpu and dropping frames.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Thanks Herbert, this gives a very good fanout of network load across cpus;
this is a huge improvement, I cannot wait for 2.6.35 :)



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7] rps: Receive Packet Steering
  2010-03-16 21:00       ` Eric Dumazet
@ 2010-03-16 21:13         ` David Miller
  2010-03-17  1:54           ` Changli Gao
  2010-03-17  4:26         ` David Miller
  1 sibling, 1 reply; 25+ messages in thread
From: David Miller @ 2010-03-16 21:13 UTC (permalink / raw)
  To: eric.dumazet; +Cc: therbert, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 16 Mar 2010 22:00:27 +0100

> Well, I tested this on my dev machine, with a vlan over bonding setup,
> tg3 and bnx2 drivers (mono queue), and all is good. UDP bench not
> anymore using 100% of one cpu and dropping frames.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> 
> Thanks Herbert, this gives a very good fanout of network load on cpus,
> this is a huge improvement, I cannot wait 2.6.35 :)

I'll integrate this as soon as I open up net-next-2.6

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7] rps: Receive Packet Steering
  2010-03-16 21:13         ` David Miller
@ 2010-03-17  1:54           ` Changli Gao
  2010-03-17  7:07             ` Eric Dumazet
  0 siblings, 1 reply; 25+ messages in thread
From: Changli Gao @ 2010-03-17  1:54 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, therbert, netdev

On Wed, Mar 17, 2010 at 5:13 AM, David Miller <davem@davemloft.net> wrote:
>
> I'll integrate this as soon as I open up net-next-2.6

This is really good news. Now Linux can also dispatch packets the way
FreeBSD does via netisr. Can we go further and support weighted
distribution?

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7] rps: Receive Packet Steering
  2010-03-16 21:00       ` Eric Dumazet
  2010-03-16 21:13         ` David Miller
@ 2010-03-17  4:26         ` David Miller
  1 sibling, 0 replies; 25+ messages in thread
From: David Miller @ 2010-03-17  4:26 UTC (permalink / raw)
  To: eric.dumazet; +Cc: therbert, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 16 Mar 2010 22:00:27 +0100

> Le mardi 16 mars 2010 à 11:03 -0700, Tom Herbert a écrit :
>> This patch implements software receive side packet steering (RPS).  RPS
>> distributes the load of received packet processing across multiple CPUs.
 ...
> Well, I tested this on my dev machine, with a vlan over bonding setup,
> tg3 and bnx2 drivers (mono queue), and all is good. UDP bench not
> anymore using 100% of one cpu and dropping frames.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7] rps: Receive Packet Steering
  2010-03-17  1:54           ` Changli Gao
@ 2010-03-17  7:07             ` Eric Dumazet
  2010-03-17  7:59               ` Changli Gao
  0 siblings, 1 reply; 25+ messages in thread
From: Eric Dumazet @ 2010-03-17  7:07 UTC (permalink / raw)
  To: Changli Gao; +Cc: David Miller, therbert, netdev

On Wednesday 17 March 2010 at 09:54 +0800, Changli Gao wrote:
> On Wed, Mar 17, 2010 at 5:13 AM, David Miller <davem@davemloft.net> wrote:
> >
> > I'll integrate this as soon as I open up net-next-2.6
> 
> It is really a good news. Now linux also can dispatch packets as
> FreeBSD does via netisr. Can we walk farer, and support weighted
> distribution?
> 

May I ask why? What would be the goal?

If you perform too much work on behalf of the first cpu (the one actually
dealing with the device and dispatching packets to other cpus), you risk
bringing the whole packet into its cache and consuming too many cpu cycles.

For instance, I would prefer basic spreading, and try to reorganise
sk_buffs so that this cpu touches only one cache line in the sk_buff, and
reads only one cache line of packet data to compute rxhash.
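
(To make the single-cache-line hash described here concrete: the following is a
sketch, not code from the patch.  It assumes a linear IPv4 header, and a random
seed such as the hashrnd initialised in the patch would be passed in as 'seed'.)

#include <linux/ip.h>
#include <linux/jhash.h>
#include <linux/skbuff.h>
#include <linux/string.h>

/*
 * Illustrative only: hash the IPv4 addresses plus the 32 bits that
 * follow the IP header (the TCP/UDP ports), which normally sit in a
 * single cache line of packet data.  A real steering implementation
 * also has to cover IPv6, fragments and non-linear skbs, which this
 * sketch ignores.
 */
static u32 example_rxhash(const struct sk_buff *skb, u32 seed)
{
	const struct iphdr *iph = ip_hdr(skb);
	u32 ports;

	memcpy(&ports, (const u8 *)iph + iph->ihl * 4, sizeof(ports));
	return jhash_3words((__force u32)iph->saddr,
			    (__force u32)iph->daddr, ports, seed);
}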

Some NICs could also compute rxhash themselves and provide it in their
rx_desc. Of course multiqueue support is much better, and current
hardware prefers to implement the real thing.




^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7] rps: Receive Packet Steering
  2010-03-17  7:07             ` Eric Dumazet
@ 2010-03-17  7:59               ` Changli Gao
  2010-03-17 14:09                 ` Eric Dumazet
  0 siblings, 1 reply; 25+ messages in thread
From: Changli Gao @ 2010-03-17  7:59 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, therbert, netdev

On Wed, Mar 17, 2010 at 3:07 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le mercredi 17 mars 2010 à 09:54 +0800, Changli Gao a écrit :
>> On Wed, Mar 17, 2010 at 5:13 AM, David Miller <davem@davemloft.net> wrote:
>> >
>> > I'll integrate this as soon as I open up net-next-2.6
>>
>> It is really a good news. Now linux also can dispatch packets as
>> FreeBSD does via netisr. Can we walk farer, and support weighted
>> distribution?
>>
>
> May I ask why ? What would be the goal ?
>

For example, I have a firewall with a dual core CPU, and use core 0 for
IRQ handling and dispatching. If I use both core 0 and core 1 for the
remaining processing, core 0 will be overloaded, and if I use only core 1
for the remaining processing, core 0 will be lightly loaded. In order to
take full advantage of the hardware, I need weighted packet distribution.



-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7] rps: Receive Packet Steering
  2010-03-17  7:59               ` Changli Gao
@ 2010-03-17 14:09                 ` Eric Dumazet
  2010-03-17 15:01                   ` Tom Herbert
  0 siblings, 1 reply; 25+ messages in thread
From: Eric Dumazet @ 2010-03-17 14:09 UTC (permalink / raw)
  To: Changli Gao; +Cc: David Miller, therbert, netdev

On Wednesday 17 March 2010 at 15:59 +0800, Changli Gao wrote:
> On Wed, Mar 17, 2010 at 3:07 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > Le mercredi 17 mars 2010 à 09:54 +0800, Changli Gao a écrit :
> >> On Wed, Mar 17, 2010 at 5:13 AM, David Miller <davem@davemloft.net> wrote:
> >> >
> >> > I'll integrate this as soon as I open up net-next-2.6
> >>
> >> It is really a good news. Now linux also can dispatch packets as
> >> FreeBSD does via netisr. Can we walk farer, and support weighted
> >> distribution?
> >>
> >
> > May I ask why ? What would be the goal ?
> >
> 
> For example, I have a firewall with dual core CPU, and use core 0 for
> IRQ and dispatching. If I use both core 0 and core 1 for the left
> processing, core 0 will be overloaded, and if I use core 1 for the
> left processing, core 0 will be light load. In order to take full of
> advantage of hardware, I need weighted the packet distribution.

I would not use RPS at all, weighted or not, unless your firewall must
perform heavy-duty work (l7 or complex rules).

If the firewall setup is expensive, then IRQ processing has minor cost,
and RPS fits the bill (the cpu handling the IRQ will be a bit more loaded
than its buddy).

RPS is a win when _some_ TCP/UDP processing occurs, and we try to do this
processing on a cpu that will also run the user space thread with the data
in its cpu cache (if the process scheduler does a good job).
It is not as applicable to routers...


Anyway, the current sysfs RPS interface exposes
a /sys/class/net/eth0/queues/rx-0/rps_cpus bitmap.

I guess we could expose another file,
/sys/class/net/eth0/queues/rx-0/rps_map,
to give different weights to cpus:

echo "0 1 0 1 0 1 1 1 1 1" >/sys/class/net/eth0/queues/rx-0/rps_map

cpu0 would get 30% of the postprocessing load, cpu1 70%

Using the /sys/class/net/eth0/queues/rx-0/rps_cpus interface would give an
equal weight to each cpu:


# echo "0 1 0 1 0 1 1 1 1 1" >/sys/class/net/eth0/queues/rx-0/rps_map
# cat /sys/class/net/eth0/queues/rx-0/rps_cpus
3
# cat /sys/class/net/eth0/queues/rx-0/rps_map
0 1 0 1 0 1 1 1 1 1
# echo 3 >/sys/class/net/eth0/queues/rx-0/rps_cpus
# cat /sys/class/net/eth0/queues/rx-0/rps_map
0 1
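
(As an illustration of why repeated entries act as weights, and not code from
the patch: scaling the 32-bit flow hash into the map length gives each listed
slot an equal share of the hash space, so a cpu listed three times out of ten
gets roughly 30% of the flows, matching the split described above.)

#include <linux/types.h>

/*
 * Illustrative helper, not from the patch: pick a cpu from an
 * rps_map-style array by scaling the 32-bit flow hash into the array
 * length.  With the map "0 1 0 1 0 1 1 1 1 1", cpu 0 owns three of
 * the ten slots and therefore roughly 30% of the hash space.
 */
static u16 example_pick_cpu(const u16 *cpus, unsigned int len, u32 rxhash)
{
	return cpus[((u64)rxhash * len) >> 32];
}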




^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7] rps: Receive Packet Steering
  2010-03-17 14:09                 ` Eric Dumazet
@ 2010-03-17 15:01                   ` Tom Herbert
  2010-03-17 15:34                     ` Eric Dumazet
  2010-03-17 23:50                     ` Tom Herbert
  0 siblings, 2 replies; 25+ messages in thread
From: Tom Herbert @ 2010-03-17 15:01 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Changli Gao, David Miller, netdev

On Wed, Mar 17, 2010 at 7:09 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le mercredi 17 mars 2010 à 15:59 +0800, Changli Gao a écrit :
>> On Wed, Mar 17, 2010 at 3:07 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > Le mercredi 17 mars 2010 à 09:54 +0800, Changli Gao a écrit :
>> >> On Wed, Mar 17, 2010 at 5:13 AM, David Miller <davem@davemloft.net> wrote:
>> >> >
>> >> > I'll integrate this as soon as I open up net-next-2.6
>> >>
>> >> It is really a good news. Now linux also can dispatch packets as
>> >> FreeBSD does via netisr. Can we walk farer, and support weighted
>> >> distribution?
>> >>
>> >
>> > May I ask why ? What would be the goal ?
>> >
>>
>> For example, I have a firewall with dual core CPU, and use core 0 for
>> IRQ and dispatching. If I use both core 0 and core 1 for the left
>> processing, core 0 will be overloaded, and if I use core 1 for the
>> left processing, core 0 will be light load. In order to take full of
>> advantage of hardware, I need weighted the packet distribution.
>
> I would not use RPS at all, weighted or not, unless your firewall must
> perform heavy duty work (l7 or complex rules)
>
> If the firewall setup is expensive, then IRQ processing has minor cost,
> and RPS fits the bill (cpu handling IRQ will be a bit more loaded than
> its buddy).
>
> RPS is a win when _some_ TCP/UDP processing occurs, and we try to do
> this processing on a cpu that will also run user space thread with data
> in cpu cache (if process scheduler does a good job)
> Not much applicable for routers...
>
>
> Anyway, current sysfs RPS interface exposes
> a /sys/class/net/eth0/queues/rx-0/rps_cpus bitmap,
>
> I guess we could expose another file,
> /sys/class/net/eth0/queues/rx-0/rps_map
> to give different weight to cpus :
>
> echo "0 1 0 1 0 1 1 1 1 1" >/sys/class/net/eth0/queues/rx-0/rps_map
>
> cpu0 would get 30% of the postprocessing load, cpu1 70%
>
> Using /sys/class/net/eth0/queues/rx-0/rps_cpus interface would give an
> equal weight to each cpu :
>
>
> # echo "0 1 0 1 0 1 1 1 1 1" >/sys/class/net/eth0/queues/rx-0/rps_map
> # cat /sys/class/net/eth0/queues/rx-0/rps_cpus
> 3
> # cat /sys/class/net/eth0/queues/rx-0/rps_map
> 0 1 0 1 0 1 1 1 1 1
> # echo 3 >/sys/class/net/eth0/queues/rx-0/rps_cpus
> # cat /sys/class/net/eth0/queues/rx-0/rps_map
> 0 1

Alternatively, the rps_map could be specified explicitly, which would
allow weighting.  For example "0 0 0 0 2 10 10 10" would select CPUs
0, 2, 10 for the map with weights four, one, and three respectively.
This would go back to having sysfs files with multiple values in them,
so it might not be the right interface.

Tom

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7] rps: Receive Packet Steering
  2010-03-17 15:01                   ` Tom Herbert
@ 2010-03-17 15:34                     ` Eric Dumazet
  2010-03-17 23:50                     ` Tom Herbert
  1 sibling, 0 replies; 25+ messages in thread
From: Eric Dumazet @ 2010-03-17 15:34 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Changli Gao, David Miller, netdev

On Wednesday 17 March 2010 at 08:01 -0700, Tom Herbert wrote:
> On Wed, Mar 17, 2010 at 7:09 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:

> > Anyway, current sysfs RPS interface exposes
> > a /sys/class/net/eth0/queues/rx-0/rps_cpus bitmap,
> >
> > I guess we could expose another file,
> > /sys/class/net/eth0/queues/rx-0/rps_map
> > to give different weight to cpus :
> >
> > echo "0 1 0 1 0 1 1 1 1 1" >/sys/class/net/eth0/queues/rx-0/rps_map
> >
> > cpu0 would get 30% of the postprocessing load, cpu1 70%
> >
> > Using /sys/class/net/eth0/queues/rx-0/rps_cpus interface would give an
> > equal weight to each cpu :
> >
> >
> > # echo "0 1 0 1 0 1 1 1 1 1" >/sys/class/net/eth0/queues/rx-0/rps_map
> > # cat /sys/class/net/eth0/queues/rx-0/rps_cpus
> > 3
> > # cat /sys/class/net/eth0/queues/rx-0/rps_map
> > 0 1 0 1 0 1 1 1 1 1
> > # echo 3 >/sys/class/net/eth0/queues/rx-0/rps_cpus
> > # cat /sys/class/net/eth0/queues/rx-0/rps_map
> > 0 1
> 
> Alternatively, the rps_map could be specified explicitly, which will
> allow weighting.  For example "0 0 0 0 2 10 10 10"  would select CPUs
> 0, 2, 10 for the map with weights four, one, and three respectively.
> This would go back to have sysfs files with multiple values in them,
> so it might not be the right interface.
> 

Well, you describe the same idea... being able to do
echo "0 1 0 1 0 1 1 1 1 1" >rps_map
or
echo "0 0 0 1 1 1 1 1 1 1" >rps_map
or
echo "0 1 1 0 1 1 0 1 1 1" >rps_map

(filling the real map[] with given cpu numbers)

I was interleaving my cpus because I found it cool :)




^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7] rps: Receive Packet Steering
  2010-03-17 15:01                   ` Tom Herbert
  2010-03-17 15:34                     ` Eric Dumazet
@ 2010-03-17 23:50                     ` Tom Herbert
  2010-03-18  2:14                       ` Changli Gao
  2010-03-18  6:20                       ` Eric Dumazet
  1 sibling, 2 replies; 25+ messages in thread
From: Tom Herbert @ 2010-03-17 23:50 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Changli Gao, David Miller, netdev


>>
>> # echo "0 1 0 1 0 1 1 1 1 1" >/sys/class/net/eth0/queues/rx-0/rps_map
>> # cat /sys/class/net/eth0/queues/rx-0/rps_cpus
>> 3
>> # cat /sys/class/net/eth0/queues/rx-0/rps_map
>> 0 1 0 1 0 1 1 1 1 1
>> # echo 3 >/sys/class/net/eth0/queues/rx-0/rps_cpus
>> # cat /sys/class/net/eth0/queues/rx-0/rps_map
>> 0 1
> 
> Alternatively, the rps_map could be specified explicitly, which will
> allow weighting.  For example "0 0 0 0 2 10 10 10"  would select CPUs
> 0, 2, 10 for the map with weights four, one, and three respectively.
> This would go back to have sysfs files with multiple values in them,
> so it might not be the right interface.

Here is a patch for this...

Allow specification of CPUs in rps to be done with a vector instead of a bit map.  This allows relative weighting of CPUs in the map by repeating entries to give a higher weight.

For example "echo 0 0 0 3 4 4 4 4 > /sys/class/net/eth0/queues/rx-0/rps_cpus"

assigns CPUs 0, 3, and 4 to the RPS mask with relative weights 3, 1, and 4 respectively.

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 7a46343..41956a5 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -17,6 +17,7 @@
 #include <linux/rtnetlink.h>
 #include <linux/wireless.h>
 #include <net/wext.h>
+#include <linux/ctype.h>
 
 #include "net-sysfs.h"
 
@@ -514,30 +515,20 @@ static ssize_t show_rps_map(struct netdev_rx_queue *queue,
 			    struct rx_queue_attribute *attribute, char *buf)
 {
 	struct rps_map *map;
-	cpumask_var_t mask;
 	size_t len = 0;
 	int i;
 
-	if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
-		return -ENOMEM;
-
 	rcu_read_lock();
+
 	map = rcu_dereference(queue->rps_map);
 	if (map)
 		for (i = 0; i < map->len; i++)
-			cpumask_set_cpu(map->cpus[i], mask);
+			len += snprintf(buf + len, PAGE_SIZE - len, "%u%s",
+			    map->cpus[i], i + 1 < map->len ? " " : "\n");
 
-	len += cpumask_scnprintf(buf + len, PAGE_SIZE, mask);
-	if (PAGE_SIZE - len < 3) {
-		rcu_read_unlock();
-		free_cpumask_var(mask);
-		return -EINVAL;
-	}
 	rcu_read_unlock();
 
-	free_cpumask_var(mask);
-	len += sprintf(buf + len, "\n");
-	return len;
+	return len < PAGE_SIZE ? len : -EINVAL;
 }
 
 static void rps_map_release(struct rcu_head *rcu)
@@ -552,41 +543,50 @@ ssize_t store_rps_map(struct netdev_rx_queue *queue,
 		      const char *buf, size_t len)
 {
 	struct rps_map *old_map, *map;
-	cpumask_var_t mask;
-	int err, cpu, i;
+	int i, count = 0;
+	unsigned int val;
 	static DEFINE_SPINLOCK(rps_map_lock);
+	char *tbuf;
 
 	if (!capable(CAP_NET_ADMIN))
 		return -EPERM;
 
-	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
-		return -ENOMEM;
+	/* Validate and count the number of CPUs in the input list. */
+	tbuf = (char *)buf;
+	while (tbuf < buf + len) {
+		char *rbuf;
 
-	err = bitmap_parse(buf, len, cpumask_bits(mask), nr_cpumask_bits);
-	if (err) {
-		free_cpumask_var(mask);
-		return err;
-	}
+		if (isspace(*tbuf)) {
+			tbuf++;
+			continue;
+		}
 
-	map = kzalloc(max_t(unsigned,
-	    RPS_MAP_SIZE(cpumask_weight(mask)), L1_CACHE_BYTES),
-	    GFP_KERNEL);
-	if (!map) {
-		free_cpumask_var(mask);
-		return -ENOMEM;
-	}
+		val = simple_strtoul(tbuf, &rbuf, 0);
 
-	i = 0;
-	for_each_cpu_and(cpu, mask, cpu_online_mask)
-		map->cpus[i++] = cpu;
+		if ((tbuf == rbuf) || (val >= num_possible_cpus()))
+			return -EINVAL;
 
-	if (i)
-		map->len = i;
-	else {
-		kfree(map);
-		map = NULL;
+		tbuf = rbuf;
+		count++;
 	}
 
+	if (count) {
+		map = kzalloc(max_t(unsigned, RPS_MAP_SIZE(count),
+		    L1_CACHE_BYTES), GFP_KERNEL);
+		if (!map)
+			return -ENOMEM;
+
+		tbuf = (char *)buf;
+		for (i = 0; i < count; i++) {
+			while (isspace(*tbuf))
+				tbuf++;
+			map->cpus[i] = simple_strtoul(tbuf, &tbuf, 0);
+		}
+		map->len = count;
+	} else
+		map = NULL;
+
+
 	spin_lock(&rps_map_lock);
 	old_map = queue->rps_map;
 	rcu_assign_pointer(queue->rps_map, map);
@@ -595,7 +595,6 @@ ssize_t store_rps_map(struct netdev_rx_queue *queue,
 	if (old_map)
 		call_rcu(&old_map->rcu, rps_map_release);
 
-	free_cpumask_var(mask);
 	return len;
 }
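
(A trivial user-space sketch of driving the proposed vector interface, using
the same values as the echo example above; it assumes this patch is applied
and that eth0 exposes the rx-0 queue directory.)

#include <stdio.h>

int main(void)
{
	/* CPU 0 with weight 3, CPU 3 with weight 1 and CPU 4 with
	 * weight 4, exactly as in the echo example above. */
	FILE *f = fopen("/sys/class/net/eth0/queues/rx-0/rps_cpus", "w");

	if (!f) {
		perror("rps_cpus");
		return 1;
	}
	fprintf(f, "0 0 0 3 4 4 4 4\n");
	return fclose(f) != 0;
}
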

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v7] rps: Receive Packet Steering
  2010-03-17 23:50                     ` Tom Herbert
@ 2010-03-18  2:14                       ` Changli Gao
  2010-03-18  6:30                         ` Eric Dumazet
  2010-03-18  6:20                       ` Eric Dumazet
  1 sibling, 1 reply; 25+ messages in thread
From: Changli Gao @ 2010-03-18  2:14 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Eric Dumazet, David Miller, netdev

On Thu, Mar 18, 2010 at 7:50 AM, Tom Herbert <therbert@google.com> wrote:
>
>>>
>>> # echo "0 1 0 1 0 1 1 1 1 1" >/sys/class/net/eth0/queues/rx-0/rps_map
>>> # cat /sys/class/net/eth0/queues/rx-0/rps_cpus
>>> 3
>>> # cat /sys/class/net/eth0/queues/rx-0/rps_map
>>> 0 1 0 1 0 1 1 1 1 1
>>> # echo 3 >/sys/class/net/eth0/queues/rx-0/rps_cpus
>>> # cat /sys/class/net/eth0/queues/rx-0/rps_map
>>> 0 1
>>
>> Alternatively, the rps_map could be specified explicitly, which will
>> allow weighting.  For example "0 0 0 0 2 10 10 10"  would select CPUs
>> 0, 2, 10 for the map with weights four, one, and three respectively.
>> This would go back to have sysfs files with multiple values in them,
>> so it might not be the right interface.
>
> Here is a patch for this...
>
> Allow specification of CPUs in rps to be done with a vector instead of a bit map.  This allows relative weighting of CPUs in the map by repeating ones to give higher weight.
>
> For example "echo 0 0 0 3 4 4 4 4 > /sys/class/net/eth0/queues/rx-0/rps_cpus"
>
> assigns CPUs 0, 3, and 4 to the RPS mask with relative weights 3, 1, and 4 respectively.
>

If the weight of CPU0 is 100, I have to repeat 0 a hundred times. How
about using '*' to simplify the weight specification?

The above example would become "echo 0*3 3 4*4 >
/sys/class/net/eth0/queues/rx-0/rps_cpus"

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7] rps: Receive Packet Steering
  2010-03-17 23:50                     ` Tom Herbert
  2010-03-18  2:14                       ` Changli Gao
@ 2010-03-18  6:20                       ` Eric Dumazet
  2010-03-18  6:48                         ` Changli Gao
  1 sibling, 1 reply; 25+ messages in thread
From: Eric Dumazet @ 2010-03-18  6:20 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Changli Gao, David Miller, netdev

On Wednesday 17 March 2010 at 16:50 -0700, Tom Herbert wrote:
> >>
> >> # echo "0 1 0 1 0 1 1 1 1 1" >/sys/class/net/eth0/queues/rx-0/rps_map
> >> # cat /sys/class/net/eth0/queues/rx-0/rps_cpus
> >> 3
> >> # cat /sys/class/net/eth0/queues/rx-0/rps_map
> >> 0 1 0 1 0 1 1 1 1 1
> >> # echo 3 >/sys/class/net/eth0/queues/rx-0/rps_cpus
> >> # cat /sys/class/net/eth0/queues/rx-0/rps_map
> >> 0 1
> > 
> > Alternatively, the rps_map could be specified explicitly, which will
> > allow weighting.  For example "0 0 0 0 2 10 10 10"  would select CPUs
> > 0, 2, 10 for the map with weights four, one, and three respectively.
> > This would go back to have sysfs files with multiple values in them,
> > so it might not be the right interface.
> 
> Here is a patch for this...
> 
> Allow specification of CPUs in rps to be done with a vector instead of a bit map.  This allows relative weighting of CPUs in the map by repeating ones to give higher weight.
> 
> For example "echo 0 0 0 3 4 4 4 4 > /sys/class/net/eth0/queues/rx-0/rps_cpus"
> 
> assigns CPUs 0, 3, and 4 to the RPS mask with relative weights 3, 1, and 4 respectively.
> 

Hmm...

I believe we should keep the existing sysfs cpumask interface, because it's
the only workable thing on a PAGE_SIZE=4096 machine with 4096 cpus.

strlen("0 1 2 3 4 ... 4095") = 19369

Using base 16 instead of base 10 -> 16111
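
(The figures check out; a quick user-space sketch reproduces them:)

#include <stdio.h>

int main(void)
{
	/* Length of "0 1 2 ... 4095": one token plus one separating
	 * space per cpu, minus the trailing space. */
	unsigned long dec = 0, hex = 0;
	char buf[16];
	int cpu;

	for (cpu = 0; cpu < 4096; cpu++) {
		dec += snprintf(buf, sizeof(buf), "%d", cpu) + 1;
		hex += snprintf(buf, sizeof(buf), "%x", cpu) + 1;
	}
	printf("%lu %lu\n", dec - 1, hex - 1);	/* prints 19369 16111 */
	return 0;
}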




^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7] rps: Receive Packet Steering
  2010-03-18  2:14                       ` Changli Gao
@ 2010-03-18  6:30                         ` Eric Dumazet
  0 siblings, 0 replies; 25+ messages in thread
From: Eric Dumazet @ 2010-03-18  6:30 UTC (permalink / raw)
  To: Changli Gao; +Cc: Tom Herbert, David Miller, netdev

On Thursday 18 March 2010 at 10:14 +0800, Changli Gao wrote:
> On Thu, Mar 18, 2010 at 7:50 AM, Tom Herbert <therbert@google.com> wrote:
> >
> >>>
> >>> # echo "0 1 0 1 0 1 1 1 1 1" >/sys/class/net/eth0/queues/rx-0/rps_map
> >>> # cat /sys/class/net/eth0/queues/rx-0/rps_cpus
> >>> 3
> >>> # cat /sys/class/net/eth0/queues/rx-0/rps_map
> >>> 0 1 0 1 0 1 1 1 1 1
> >>> # echo 3 >/sys/class/net/eth0/queues/rx-0/rps_cpus
> >>> # cat /sys/class/net/eth0/queues/rx-0/rps_map
> >>> 0 1
> >>
> >> Alternatively, the rps_map could be specified explicitly, which will
> >> allow weighting.  For example "0 0 0 0 2 10 10 10"  would select CPUs
> >> 0, 2, 10 for the map with weights four, one, and three respectively.
> >> This would go back to have sysfs files with multiple values in them,
> >> so it might not be the right interface.
> >
> > Here is a patch for this...
> >
> > Allow specification of CPUs in rps to be done with a vector instead of a bit map.  This allows relative weighting of CPUs in the map by repeating ones to give higher weight.
> >
> > For example "echo 0 0 0 3 4 4 4 4 > /sys/class/net/eth0/queues/rx-0/rps_cpus"
> >
> > assigns CPUs 0, 3, and 4 to the RPS mask with relative weights 3, 1, and 4 respectively.
> >
> 
> If the weight of CPU0 is 100, I have to repeat 0 100 times. How about
> using the * to simplify the weight.

This would make the RPS footprint quite large, with an rps_map size of 100
shorts -> 200 bytes.

I doubt we need high precision in weighting; I would limit values to 4
bits (1 -> 16) per cpu.

Anyway, sysfs is limited to PAGE_SIZE. If we provide weight
capabilities, we should also cope with machines with 4096 cpus.

sysfs might be overkill for that (we would need one file per cpu)



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7] rps: Receive Packet Steering
  2010-03-18  6:20                       ` Eric Dumazet
@ 2010-03-18  6:48                         ` Changli Gao
  2010-03-18 20:37                           ` Eric Dumazet
  0 siblings, 1 reply; 25+ messages in thread
From: Changli Gao @ 2010-03-18  6:48 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Tom Herbert, David Miller, netdev

On Thu, Mar 18, 2010 at 2:20 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le mercredi 17 mars 2010 à 16:50 -0700, Tom Herbert a écrit :
>>
>> Here is a patch for this...
>>
>> Allow specification of CPUs in rps to be done with a vector instead of a bit map.  This allows relative weighting of CPUs in the map by repeating ones to give higher weight.
>>
>> For example "echo 0 0 0 3 4 4 4 4 > /sys/class/net/eth0/queues/rx-0/rps_cpus"
>>
>> assigns CPUs 0, 3, and 4 to the RPS mask with relative weights 3, 1, and 4 respectively.
>>
>
> Hmm...
>
> I believe we should keep existing sysfs cpumask interface, because its
> the only workable thing on a PAGE_SIZE=4096 machine with 4096 cpus.
>
> strlen("0 1 2 3 4 ... 4095") = 19369
>
> Using base 16 instead of base 10 -> 16111
>

Sigh! How about adding a file for each cpu's weight setting?

.../rx-0/rps_cpu0...n

BTW: I think exporting a hook for the hash function would help in some
cases, so users can choose which hash to use depending on their
applications. I know FreeBSD supports hashing based on flow, source or
CPU. Some network applications have multiple instances to take full
advantage of SMP/multi-core hardware, and each instance binds to a
specific CPU/core, so they need some kind of load distribution algorithm
for load balancing.

For example, memcached uses a hash based on the key, and its developers
could implement a matching hash function for RPS. Then it would apply the
following iptables rules:

iptables -A PREROUTING -t nat -m cpu --cpuid 0 -m tcp --dport 1234
--REDIRECT 8081
iptables -A PREROUTING -t nat -m cpu --cpuid 0 -m tcp --dport 1234
--REDIRECT 8082
...

With nothing else to change, it can take full advantage of the
underlying hardware transparently.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7] rps: Receive Packet Steering
  2010-03-18  6:48                         ` Changli Gao
@ 2010-03-18 20:37                           ` Eric Dumazet
  2010-03-18 21:23                             ` Stephen Hemminger
  0 siblings, 1 reply; 25+ messages in thread
From: Eric Dumazet @ 2010-03-18 20:37 UTC (permalink / raw)
  To: Changli Gao; +Cc: Tom Herbert, David Miller, netdev

On Thursday 18 March 2010 at 14:48 +0800, Changli Gao wrote:
> sigh! How about adding file for each cpu weight setting.
> 
> .../rx-0/rps_cpu0...n
> 
> BTW: I think exporting the hook of hash function will help in some
> case. So users can choose which hash to use depend on their
> applications. I know FreeBSD supports hash based on flow, source or
> CPU. Some network application have multiple instances for taking full
> advantage of the SMP/C hardware, and each instance binds to a special
> CPU/Core, so they need some kind of load distributing algorithm for
> load balancing.
> 

Exporting skb->rxhash would not be that interesting, but the number of the
last cpu handling the skb and queuing it on the socket might be useful.

> For example, memcached uses hash based on key, and its developer may
> implement a hash function for RPS. Then it apply the following
> iptables rule:
> 
> iptables -A PREROUTING -t nat -m cpu --cpuid 0 -m tcp --dport 1234
> --REDIRECT 8081
> iptables -A PREROUTING -t nat -m cpu --cpuid 0 -m tcp --dport 1234
> --REDIRECT 8082

Well, this would work only if load is evenly distributed to all cpus.
But you understand this kind of setup has nothing to do with RPS.
Going through REDIRECT (and conntrack) would kill performance, and would
not work for unprivileged users (iptables changes forbidden).
It won't scale for future machines with 64 or 128 cores.

Maybe some extension of the REDIRECT target, being able to add the cpu
number to the destination port:

iptables -A PREROUTING -t nat -m tcp --dport 1234 --REDIRECT 1234+cpu


> ...
> 
> No other things to change, it can take full advantage of the
> underlying hardware transparently.
> 

What comes to mind is a new socket operation, "bind to cpu", like the
"bind to device" operation.

This would work without the need for netfilter (or permission to change
its rules).

But it would require changes to applications to fully exploit the SMP
capabilities of the machine.
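
(Purely a hypothetical sketch of what such an option might look like from an
application, by analogy with SO_BINDTODEVICE; SO_BINDTOCPU and its value are
invented here for illustration and exist in no kernel.)

#include <stdio.h>
#include <sys/socket.h>

/* Invented option, for illustration only -- no kernel implements it. */
#ifndef SO_BINDTOCPU
#define SO_BINDTOCPU 999
#endif

int main(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	int cpu = 2;	/* the cpu this instance is pinned to */

	if (fd < 0) {
		perror("socket");
		return 1;
	}
	/* With a real implementation, this would ask the kernel to steer
	 * this socket's packets to 'cpu'; today it simply fails. */
	if (setsockopt(fd, SOL_SOCKET, SO_BINDTOCPU, &cpu, sizeof(cpu)) < 0)
		perror("setsockopt");
	return 0;
}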




^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7] rps: Receive Packet Steering
  2010-03-18 20:37                           ` Eric Dumazet
@ 2010-03-18 21:23                             ` Stephen Hemminger
  0 siblings, 0 replies; 25+ messages in thread
From: Stephen Hemminger @ 2010-03-18 21:23 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Changli Gao, Tom Herbert, David Miller, netdev

On Thu, 18 Mar 2010 21:37:01 +0100
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Le jeudi 18 mars 2010 à 14:48 +0800, Changli Gao a écrit :
> > sigh! How about adding file for each cpu weight setting.
> > 
> > .../rx-0/rps_cpu0...n
> > 
> > BTW: I think exporting the hook of hash function will help in some
> > case. So users can choose which hash to use depend on their
> > applications. I know FreeBSD supports hash based on flow, source or
> > CPU. Some network application have multiple instances for taking full
> > advantage of the SMP/C hardware, and each instance binds to a special
> > CPU/Core, so they need some kind of load distributing algorithm for
> > load balancing.
> > 
> 
> exporting skb->rxhash would not be that interesting, but the cpu number
> of last cpu handling the skb and queuing it on socket might be usefull.
> 
> > For example, memcached uses hash based on key, and its developer may
> > implement a hash function for RPS. Then it apply the following
> > iptables rule:
> > 
> > iptables -A PREROUTING -t nat -m cpu --cpuid 0 -m tcp --dport 1234
> > --REDIRECT 8081
> > iptables -A PREROUTING -t nat -m cpu --cpuid 0 -m tcp --dport 1234
> > --REDIRECT 8082
> 
> Well, this would work only if load is evenly distributed to all cpus.
> But you understand this kind of setup has nothing to do with RPS.
> Going through REDIRECT (and conntrack) would kill performance, and would
> not work for unpriviledged users (iptables changes forbidden).
> It wont scale for future machines with 64 or 128 cores.
> 
> maybe some extension of REDIRECT target, being able to add cpu number to
> destination port :
> 
> iptables -A PREROUTING -t nat -m tcp --dport 1234 --REDIRECT 1234+cpu
> 
> 
> > ...
> > 
> > No other things to change, it can take full advantage of the
> > underlying hardware transparently.
> > 
> 
> Coming to mind would be a new socket operation, "bind to cpu", like the
> "bind to device" operation.
> 
> This would work without need for netfilter (and permission to change its
> rules)
> 
> But it would require changes to applications, to fully exploit SMP
> capabilities of machine.

Let's not make a useful feature (RPS) unusable by making it so complex
that mortals can't understand it.


-- 

^ permalink raw reply	[flat|nested] 25+ messages in thread

Thread overview: 25+ messages
2010-03-12 20:13 [PATCH v7] rps: Receive Packet Steering Tom Herbert
2010-03-12 21:28 ` Eric Dumazet
2010-03-12 23:08   ` Tom Herbert
2010-03-16 18:03     ` Tom Herbert
2010-03-16 21:00       ` Eric Dumazet
2010-03-16 21:13         ` David Miller
2010-03-17  1:54           ` Changli Gao
2010-03-17  7:07             ` Eric Dumazet
2010-03-17  7:59               ` Changli Gao
2010-03-17 14:09                 ` Eric Dumazet
2010-03-17 15:01                   ` Tom Herbert
2010-03-17 15:34                     ` Eric Dumazet
2010-03-17 23:50                     ` Tom Herbert
2010-03-18  2:14                       ` Changli Gao
2010-03-18  6:30                         ` Eric Dumazet
2010-03-18  6:20                       ` Eric Dumazet
2010-03-18  6:48                         ` Changli Gao
2010-03-18 20:37                           ` Eric Dumazet
2010-03-18 21:23                             ` Stephen Hemminger
2010-03-17  4:26         ` David Miller
2010-03-12 22:20 ` Stephen Hemminger
2010-03-12 22:32   ` David Miller
2010-03-12 22:23 ` Stephen Hemminger
2010-03-12 22:33   ` David Miller
2010-03-12 23:05   ` Tom Herbert
