Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] net: batch skb dequeueing from softnet input_pkt_queue
From: David Miller @ 2010-04-13  4:44 UTC (permalink / raw)
  To: xiaosuo; +Cc: eric.dumazet, netdev
In-Reply-To: <1271155268-2999-1-git-send-email-xiaosuo@gmail.com>

From: Changli Gao <xiaosuo@gmail.com>
Date: Tue, 13 Apr 2010 18:41:08 +0800

> batch skb dequeueing from softnet input_pkt_queue to reduce potential lock
> contention and irq disabling/enabling.

In exchange for a bunch of new atomic operations.

No, thanks.

^ permalink raw reply

* Re: [PATCH] net: batch skb dequeueing from softnet input_pkt_queue
From: Changli Gao @ 2010-04-13  5:19 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev
In-Reply-To: <20100412.214453.57473458.davem@davemloft.net>

On Tue, Apr 13, 2010 at 12:44 PM, David Miller <davem@davemloft.net> wrote:
> From: Changli Gao <xiaosuo@gmail.com>
> Date: Tue, 13 Apr 2010 18:41:08 +0800
>
>> batch skb dequeueing from softnet input_pkt_queue to reduce potential lock
>> contention and irq disabling/enabling.
>
> In exchange for a bunch of new atomic operations.
>
> No, thanks.
>

Oh, I can eliminate atomic operations totally, if I use another
variable instead. I'll submit another version later.

^ permalink raw reply

* Re: [PATCH] ARM: dmabounce: fix partial sync in dma_sync_single_* API
From: FUJITA Tomonori @ 2010-04-13  5:27 UTC (permalink / raw)
  To: linux; +Cc: fujita.tomonori, netdev, davem, linux-arm-kernel, linux-kernel
In-Reply-To: <20100412193536.GO3048@n2100.arm.linux.org.uk>

On Mon, 12 Apr 2010 20:35:36 +0100
Russell King - ARM Linux <linux@arm.linux.org.uk> wrote:

> On Mon, Apr 05, 2010 at 12:39:32PM +0900, FUJITA Tomonori wrote:
> > I don't have arm hardware that uses dmabounce so I can't confirm the
> > problem but seems that dmabounce doesn't work for some drivers...
> 
> Patch reviews fine, except for one niggle.  I too don't have hardware
> I can test (well, I do except the kernel stopped supporting the UDA1341
> audio codec on the SA1110 Neponset.)

Thanks for reviewing.

> > @@ -171,10 +172,17 @@ find_safe_buffer(struct dmabounce_device_info *device_info, dma_addr_t safe_dma_
> >  	read_lock_irqsave(&device_info->lock, flags);
> >  
> >  	list_for_each_entry(b, &device_info->safe_buffers, node)
> > -		if (b->safe_dma_addr == safe_dma_addr) {
> > -			rb = b;
> > -			break;
> > -		}
> > +		if (for_sync) {
> > +			if (b->safe_dma_addr <= safe_dma_addr &&
> > +			    safe_dma_addr < b->safe_dma_addr + b->size) {
> > +				rb = b;
> > +				break;
> > +			}
> > +		} else
> > +			if (b->safe_dma_addr == safe_dma_addr) {
> > +				rb = b;
> > +				break;
> > +			}
> 
> This is the niggle; I don't like this indentation style.  If you want to
> indent this if () statement, then please format like this:
> 
> 		} else {
> 			if (b->safe...) {
> 				...
> 			}
> 		}
> 
> or format it as:
> 
> 		} else if (b->safe...) {
> 			...
> 		}

ok, here's the fixed patch.

=
From: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Subject: [PATCH] ARM: dmabounce: fix partial sync in dma_sync_single_* API

Some network drivers do a partial sync with
dma_sync_single_for_{device|cpu}. The dma_addr argument might not be
the same as one as passed into the mapping API.

This adds some tricks to find_safe_buffer() for
dma_sync_single_for_{device|cpu}.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
---
 arch/arm/common/dmabounce.c |   30 +++++++++++++++++++++---------
 1 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/arch/arm/common/dmabounce.c b/arch/arm/common/dmabounce.c
index cc0a932..2e6deec 100644
--- a/arch/arm/common/dmabounce.c
+++ b/arch/arm/common/dmabounce.c
@@ -163,7 +163,8 @@ alloc_safe_buffer(struct dmabounce_device_info *device_info, void *ptr,
 
 /* determine if a buffer is from our "safe" pool */
 static inline struct safe_buffer *
-find_safe_buffer(struct dmabounce_device_info *device_info, dma_addr_t safe_dma_addr)
+find_safe_buffer(struct dmabounce_device_info *device_info, dma_addr_t safe_dma_addr,
+		 int for_sync)
 {
 	struct safe_buffer *b, *rb = NULL;
 	unsigned long flags;
@@ -171,9 +172,17 @@ find_safe_buffer(struct dmabounce_device_info *device_info, dma_addr_t safe_dma_
 	read_lock_irqsave(&device_info->lock, flags);
 
 	list_for_each_entry(b, &device_info->safe_buffers, node)
-		if (b->safe_dma_addr == safe_dma_addr) {
-			rb = b;
-			break;
+		if (for_sync) {
+			if (b->safe_dma_addr <= safe_dma_addr &&
+			    safe_dma_addr < b->safe_dma_addr + b->size) {
+				rb = b;
+				break;
+			}
+		} else {
+			if (b->safe_dma_addr == safe_dma_addr) {
+				rb = b;
+				break;
+			}
 		}
 
 	read_unlock_irqrestore(&device_info->lock, flags);
@@ -205,7 +214,8 @@ free_safe_buffer(struct dmabounce_device_info *device_info, struct safe_buffer *
 /* ************************************************** */
 
 static struct safe_buffer *find_safe_buffer_dev(struct device *dev,
-		dma_addr_t dma_addr, const char *where)
+						dma_addr_t dma_addr, const char *where,
+						int for_sync)
 {
 	if (!dev || !dev->archdata.dmabounce)
 		return NULL;
@@ -216,7 +226,7 @@ static struct safe_buffer *find_safe_buffer_dev(struct device *dev,
 			pr_err("unknown device: Trying to %s invalid mapping\n", where);
 		return NULL;
 	}
-	return find_safe_buffer(dev->archdata.dmabounce, dma_addr);
+	return find_safe_buffer(dev->archdata.dmabounce, dma_addr, for_sync);
 }
 
 static inline dma_addr_t map_single(struct device *dev, void *ptr, size_t size,
@@ -286,7 +296,7 @@ static inline dma_addr_t map_single(struct device *dev, void *ptr, size_t size,
 static inline void unmap_single(struct device *dev, dma_addr_t dma_addr,
 		size_t size, enum dma_data_direction dir)
 {
-	struct safe_buffer *buf = find_safe_buffer_dev(dev, dma_addr, "unmap");
+	struct safe_buffer *buf = find_safe_buffer_dev(dev, dma_addr, "unmap", 0);
 
 	if (buf) {
 		BUG_ON(buf->size != size);
@@ -398,7 +408,7 @@ int dmabounce_sync_for_cpu(struct device *dev, dma_addr_t addr,
 	dev_dbg(dev, "%s(dma=%#x,off=%#lx,sz=%zx,dir=%x)\n",
 		__func__, addr, off, sz, dir);
 
-	buf = find_safe_buffer_dev(dev, addr, __func__);
+	buf = find_safe_buffer_dev(dev, addr, __func__, 1);
 	if (!buf)
 		return 1;
 
@@ -411,6 +421,8 @@ int dmabounce_sync_for_cpu(struct device *dev, dma_addr_t addr,
 	DO_STATS(dev->archdata.dmabounce->bounce_count++);
 
 	if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL) {
+		if (addr != buf->safe_dma_addr)
+			off = addr - buf->safe_dma_addr;
 		dev_dbg(dev, "%s: copy back safe %p to unsafe %p size %d\n",
 			__func__, buf->safe + off, buf->ptr + off, sz);
 		memcpy(buf->ptr + off, buf->safe + off, sz);
@@ -427,7 +439,7 @@ int dmabounce_sync_for_device(struct device *dev, dma_addr_t addr,
 	dev_dbg(dev, "%s(dma=%#x,off=%#lx,sz=%zx,dir=%x)\n",
 		__func__, addr, off, sz, dir);
 
-	buf = find_safe_buffer_dev(dev, addr, __func__);
+	buf = find_safe_buffer_dev(dev, addr, __func__, 1);
 	if (!buf)
 		return 1;
 
-- 
1.6.5

^ permalink raw reply related

* Re: Strange packet drops with heavy firewalling
From: Eric Dumazet @ 2010-04-13  5:56 UTC (permalink / raw)
  To: Changli Gao; +Cc: Benny Amorsen, zhigang gong, netdev
In-Reply-To: <u2y412e6f7f1004121618p6d6eff30q8a45a03faa59a912@mail.gmail.com>

Le mardi 13 avril 2010 à 07:18 +0800, Changli Gao a écrit :
> On Tue, Apr 13, 2010 at 1:06 AM, Benny Amorsen <benny+usenet@amorsen.dk> wrote:
> >
> >  99:         24    1306226          3          2   PCI-MSI-edge      eth1-tx-0
> >  100:      15735    1648774          3          7   PCI-MSI-edge      eth1-tx-1
> >  101:          8         11          9    1083022   PCI-MSI-edge      eth1-tx-2
> >  102:          0          0          0          0   PCI-MSI-edge      eth1-tx-3
> >  103:         18         15       6131    1095383   PCI-MSI-edge      eth1-rx-0
> >  104:        217         32      46544    1335325   PCI-MSI-edge      eth1-rx-1
> >  105:        154    1305595        218         16   PCI-MSI-edge      eth1-rx-2
> >  106:         17         16       8229    1467509   PCI-MSI-edge      eth1-rx-3
> >  107:          0          0          1          0   PCI-MSI-edge      eth1
> >  108:          2         14         15    1003053   PCI-MSI-edge      eth0-tx-0
> >  109:       8226    1668924        478        487   PCI-MSI-edge      eth0-tx-1
> >  110:          3    1188874         17         12   PCI-MSI-edge      eth0-tx-2
> >  111:          0          0          0          0   PCI-MSI-edge      eth0-tx-3
> >  112:        203        185       5324    1015263   PCI-MSI-edge      eth0-rx-0
> >  113:       4141    1600793        153        159   PCI-MSI-edge      eth0-rx-1
> >  114:      16242    1210108        436       3124   PCI-MSI-edge      eth0-rx-2
> >  115:        267       4173      19471    1321252   PCI-MSI-edge      eth0-rx-3
> >  116:          0          1          0          0   PCI-MSI-edge      eth0
> >
> >
> > irqbalanced seems to have picked CPU1 and CPU3 for all the interrupts,
> > which to my mind should cause the same problem as before (where CPU1 and
> > CPU3 was handling all packets). Yet the box clearly works much better
> > than before.
> 
> irqbalanced? I don't think it can work properly. Try RPS in netdev and
> linux-next tree, and if cpu load isn't even, try this patch:
> http://patchwork.ozlabs.org/patch/49915/ .
> 
> 

Dont try RPS on multiqueue devices !

If number of queues matches CPU numbers, it brings nothing but extra
latencies !

Benny, I am not sure your irqbalance is up2date with multiqueue devices,
you might need to disable it and manually irqaffine each interrupt

echo 01 >/proc/irq/100/smp_affinity
echo 02 >/proc/irq/101/smp_affinity
echo 04 >/proc/irq/102/smp_affinity
echo 08 >/proc/irq/103/smp_affinity
echo 10 >/proc/irq/104/smp_affinity
echo 20 >/proc/irq/105/smp_affinity
echo 40 >/proc/irq/106/smp_affinity
echo 80 >/proc/irq/107/smp_affinity

echo 01 >/proc/irq/108/smp_affinity
echo 02 >/proc/irq/109/smp_affinity
echo 04 >/proc/irq/110/smp_affinity
echo 08 >/proc/irq/111/smp_affinity
echo 10 >/proc/irq/112/smp_affinity
echo 20 >/proc/irq/113/smp_affinity
echo 40 >/proc/irq/114/smp_affinity
echo 80 >/proc/irq/115/smp_affinity



^ permalink raw reply

* [PATCH] CONFIG_SMP should be CONFIG_RPS
From: Changli Gao @ 2010-04-13 14:36 UTC (permalink / raw)
  To: David S. Miller; +Cc: Eric Dumazet, netdev, Tom Herbert, Changli Gao

CONFIG_SMP should be CONFIG_RPS

CONFIG_SMP should be CONFIG_RPS

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
----
 include/linux/netdevice.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index d1a21b5..0efb36e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1331,7 +1331,7 @@ struct softnet_data {
 	struct sk_buff		*completion_queue;
 
 	/* Elements below can be accessed between CPUs for RPS */
-#ifdef CONFIG_SMP
+#ifdef CONFIG_RPS
 	struct call_single_data	csd ____cacheline_aligned_in_smp;
 #endif
 	struct sk_buff_head	input_pkt_queue;

^ permalink raw reply related

* Re: [PATCH] CONFIG_SMP should be CONFIG_RPS
From: Eric Dumazet @ 2010-04-13  6:59 UTC (permalink / raw)
  To: Changli Gao; +Cc: David S. Miller, netdev, Tom Herbert
In-Reply-To: <1271169406-8115-1-git-send-email-xiaosuo@gmail.com>

Le mardi 13 avril 2010 à 22:36 +0800, Changli Gao a écrit :
> CONFIG_SMP should be CONFIG_RPS
> 
> CONFIG_SMP should be CONFIG_RPS
> 
> Signed-off-by: Changli Gao <xiaosuo@gmail.com>
> ----
>  include/linux/netdevice.h |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index d1a21b5..0efb36e 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1331,7 +1331,7 @@ struct softnet_data {
>  	struct sk_buff		*completion_queue;
>  
>  	/* Elements below can be accessed between CPUs for RPS */
> -#ifdef CONFIG_SMP
> +#ifdef CONFIG_RPS
>  	struct call_single_data	csd ____cacheline_aligned_in_smp;
>  #endif
>  	struct sk_buff_head	input_pkt_queue;

Thanks Changli, this is part of RFS patches :)



^ permalink raw reply

* Re: [Patch 3/3] net: reserve ports for applications using fixed port numbers
From: Cong Wang @ 2010-04-13  7:13 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: opurdila, eric.dumazet, netdev, nhorman, davem, ebiederm,
	linux-kernel
In-Reply-To: <201004130121.o3D1Lhh7099571@www262.sakura.ne.jp>

Tetsuo Handa wrote:
> Hello.
> 
>> --- linux-2.6.orig/drivers/infiniband/core/cma.c
>> +++ linux-2.6/drivers/infiniband/core/cma.c
>> @@ -1980,6 +1980,8 @@ retry:
>>  	/* FIXME: add proper port randomization per like inet_csk_get_port */
>>  	do {
>>  		ret = idr_get_new_above(ps, bind_list, next_port, &port);
>> +		if (!ret && inet_is_reserved_local_port(port))
>> +			ret = -EAGAIN;
>>  	} while ((ret == -EAGAIN) && idr_pre_get(ps, GFP_KERNEL));
>>  
>>  	if (ret)
>>
> I think above part is wrong. Below program
...
> 
> This result suggests that above loop will continue until idr_pre_get() fails
> due to out of memory if all ports were reserved.
> 
> Also, if idr_get_new_above() returned 0, bind_list (which is a kmalloc()ed
> pointer) is already installed into a free slot (see comment on
> idr_get_new_above_int()). Thus, simply calling idr_get_new_above() again will
> install the same pointer into multiple slots. I guess it will malfunction later.

Thanks for testing!

How about:

+		if (!ret && inet_is_reserved_local_port(port))
+			ret = -EBUSY;

? So that it will break the loop and return error.

^ permalink raw reply

* Re: BUG: using smp_processor_id() in preemptible [00000000] code: avahi-daemon: caller is netif_rx
From: Eric Dumazet @ 2010-04-13  7:14 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Eric Paris, netdev, David Miller
In-Reply-To: <m2m65634d661004121354ta0189542l80473b82911e43f4@mail.gmail.com>

Le lundi 12 avril 2010 à 13:54 -0700, Tom Herbert a écrit :

> Would it be better to disable preemption in netif_rx?  Also note that
> with RFS we would be taking rcu_read_lock in netif_rx anyway and that
> could cover all the instances of smp_processor_id().
> 

Ok that makes sense.

What do you think applying a small fix before RFS integration, it is
better to have smaller patches anyway :)

[PATCH net-next-2.6] net: netif_rx() must disable preemption

Eric Paris reported netif_rx() is calling smp_processor_id() from
preemptible context, in particular when caller is
ip_dev_loopback_xmit().

RPS commit added this smp_processor_id() call, this patch makes sure
preemption is disabled. rps_get_cpus() wants rcu_read_lock() anyway, we
can dot it a bit earlier.

Reported-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/core/dev.c |   19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index a10a216..a96ea6a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2206,6 +2206,7 @@ DEFINE_PER_CPU(struct netif_rx_stats, netdev_rx_stat) = { 0, };
 /*
  * get_rps_cpu is called from netif_receive_skb and returns the target
  * CPU from the RPS map of the receiving queue for a given skb.
+ * rcu_read_lock must be held on entry.
  */
 static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb)
 {
@@ -2217,8 +2218,6 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb)
 	u8 ip_proto;
 	u32 addr1, addr2, ports, ihl;
 
-	rcu_read_lock();
-
 	if (skb_rx_queue_recorded(skb)) {
 		u16 index = skb_get_rx_queue(skb);
 		if (unlikely(index >= dev->num_rx_queues)) {
@@ -2296,7 +2295,6 @@ got_hash:
 	}
 
 done:
-	rcu_read_unlock();
 	return cpu;
 }
 
@@ -2392,7 +2390,7 @@ enqueue:
 
 int netif_rx(struct sk_buff *skb)
 {
-	int cpu;
+	int ret;
 
 	/* if netpoll wants it, pretend we never saw it */
 	if (netpoll_rx(skb))
@@ -2402,14 +2400,21 @@ int netif_rx(struct sk_buff *skb)
 		net_timestamp(skb);
 
 #ifdef CONFIG_RPS
+	{
+	int cpu;
+
+	rcu_read_lock();
 	cpu = get_rps_cpu(skb->dev, skb);
 	if (cpu < 0)
 		cpu = smp_processor_id();
+	ret = enqueue_to_backlog(skb, cpu);
+	rcu_read_unlock();
+	}
 #else
-	cpu = smp_processor_id();
+	ret = enqueue_to_backlog(skb, get_cpu());
+	put_cpu();
 #endif
-
-	return enqueue_to_backlog(skb, cpu);
+	return ret;
 }
 EXPORT_SYMBOL(netif_rx);
 



^ permalink raw reply related

* Re: [Patch 1/3] sysctl: refactor integer handling proc code
From: Cong Wang @ 2010-04-13  7:35 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: linux-kernel, Octavian Purdila, Eric Dumazet, penguin-kernel,
	netdev, Neil Horman, ebiederm, David Miller
In-Reply-To: <20100413111814.GB4396@x200>

Alexey Dobriyan wrote:
> On Mon, Apr 12, 2010 at 06:04:04AM -0400, Amerigo Wang wrote:
>> As we are about to add another integer handling proc function a little
>> bit of cleanup is in order: add a few helper functions to improve code
>> readability and decrease code duplication.
>>
>> In the process a bug is also fixed: if the user specifies a number
>> with more then 20 digits it will be interpreted as two integers
>> (e.g. 10000...13 will be interpreted as 100.... and 13).
> 
> ULONG_MAX is not 22 digits always.
> 
> The fix is to not rely on simple_strtoul()
> 
> I guess it's time to finally remove it. :-(


Or use strict_strtoul()?

> 
> Also, it's better to copy_from user stuff once.
> Without looking at non-trivial users, one page should be enough.

It seems that all proc code assumes that the input buffer will
not exceed one page size.


> 
>> Behavior for EFAULT handling was changed as well. Previous to this
>> patch, when an EFAULT error occurred in the middle of a write
>> operation, although some of the elements were set, that was not
>> acknowledged to the user (by shorting the write and returning the
>> number of bytes accepted). EFAULT is now treated just like any other
>> errors by acknowledging the amount of bytes accepted.
> 
>> +static int proc_skip_wspace(char __user **buf, size_t *size)
>> +{
>> +	char c;
>> +
>> +	while (*size) {
>> +		if (get_user(c, *buf))
>> +			return -EFAULT;
>> +		if (!isspace(c))
>> +			break;
>> +		(*size)--;
>> +		(*buf)++;
>> +	}
>> +
>> +	return 0;
>> +}
> 
> yeah, copy_from_user once, so we won't have this.

Ok.

> 
>> +static bool isanyof(char c, const char *v, unsigned len)
> 
> A what?
> this is memchr()
> 

Hmm, right, it should be memchr(v, c, len).

Thanks!

^ permalink raw reply

* [PATCH v2] net: batch skb dequeueing from softnet input_pkt_queue
From: Changli Gao @ 2010-04-13 15:38 UTC (permalink / raw)
  To: David S. Miller; +Cc: Eric Dumazet, netdev, Changli Gao

batch skb dequeueing from softnet input_pkt_queue

batch skb dequeueing from softnet input_pkt_queue to reduce potential lock
contention and irq disabling/enabling.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
----
 include/linux/netdevice.h |    2 ++
 net/core/dev.c            |   45 +++++++++++++++++++++++++++++++++++++++------
 2 files changed, 41 insertions(+), 6 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index d1a21b5..bc7a0d7 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1335,6 +1335,8 @@ struct softnet_data {
 	struct call_single_data	csd ____cacheline_aligned_in_smp;
 #endif
 	struct sk_buff_head	input_pkt_queue;
+	struct sk_buff_head	processing_queue;
+	volatile bool		flush_processing_queue;
 	struct napi_struct	backlog;
 };
 
diff --git a/net/core/dev.c b/net/core/dev.c
index a10a216..ac24293 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2324,6 +2324,11 @@ static void trigger_softirq(void *data)
 }
 #endif /* CONFIG_SMP */
 
+static inline u32 softnet_input_qlen(struct softnet_data *queue)
+{
+	return queue->input_pkt_queue.qlen + queue->processing_queue.qlen;
+}
+
 /*
  * enqueue_to_backlog is called to queue an skb to a per CPU backlog
  * queue (may be a remote CPU queue).
@@ -2339,8 +2344,8 @@ static int enqueue_to_backlog(struct sk_buff *skb, int cpu)
 	__get_cpu_var(netdev_rx_stat).total++;
 
 	rps_lock(queue);
-	if (queue->input_pkt_queue.qlen <= netdev_max_backlog) {
-		if (queue->input_pkt_queue.qlen) {
+	if (softnet_input_qlen(queue) <= netdev_max_backlog) {
+		if (softnet_input_qlen(queue)) {
 enqueue:
 			__skb_queue_tail(&queue->input_pkt_queue, skb);
 			rps_unlock(queue);
@@ -2803,6 +2808,7 @@ static void flush_backlog(void *arg)
 			__skb_unlink(skb, &queue->input_pkt_queue);
 			kfree_skb(skb);
 		}
+	queue->flush_processing_queue = true;
 	rps_unlock(queue);
 }
 
@@ -3112,14 +3118,23 @@ static int process_backlog(struct napi_struct *napi, int quota)
 	struct softnet_data *queue = &__get_cpu_var(softnet_data);
 	unsigned long start_time = jiffies;
 
+	if (queue->flush_processing_queue) {
+		struct sk_buff *skb;
+
+		queue->flush_processing_queue = false;
+		while ((skb = __skb_dequeue(&queue->processing_queue)))
+			kfree_skb(skb);
+	}
+
 	napi->weight = weight_p;
 	do {
 		struct sk_buff *skb;
 
 		local_irq_disable();
 		rps_lock(queue);
-		skb = __skb_dequeue(&queue->input_pkt_queue);
-		if (!skb) {
+		skb_queue_splice_tail_init(&queue->input_pkt_queue,
+					   &queue->processing_queue);
+		if (skb_queue_empty(&queue->processing_queue)) {
 			__napi_complete(napi);
 			rps_unlock(queue);
 			local_irq_enable();
@@ -3128,9 +3143,22 @@ static int process_backlog(struct napi_struct *napi, int quota)
 		rps_unlock(queue);
 		local_irq_enable();
 
-		__netif_receive_skb(skb);
-	} while (++work < quota && jiffies == start_time);
+		while ((skb = __skb_dequeue(&queue->processing_queue))) {
+			__netif_receive_skb(skb);
+			if (++work < quota && jiffies == start_time)
+				continue;
+			if (!queue->flush_processing_queue)
+				goto out;
+			queue->flush_processing_queue = false;
+			while ((skb = __skb_dequeue(&queue->processing_queue))) {
+				__netif_receive_skb(skb);
+				++work;
+			}
+			goto out;
+		}
+	} while (1);
 
+out:
 	return work;
 }
 
@@ -5487,6 +5515,9 @@ static int dev_cpu_callback(struct notifier_block *nfb,
 	raise_softirq_irqoff(NET_TX_SOFTIRQ);
 	local_irq_enable();
 
+	while ((skb = __skb_dequeue(&oldsd->processing_queue)))
+		netif_rx(skb);
+
 	/* Process offline CPU's input_pkt_queue */
 	while ((skb = __skb_dequeue(&oldsd->input_pkt_queue)))
 		netif_rx(skb);
@@ -5709,6 +5740,8 @@ static int __init net_dev_init(void)
 
 		queue = &per_cpu(softnet_data, i);
 		skb_queue_head_init(&queue->input_pkt_queue);
+		skb_queue_head_init(&queue->processing_queue);
+		queue->flush_processing_queue = false;
 		queue->completion_queue = NULL;
 		INIT_LIST_HEAD(&queue->poll_list);
 

^ permalink raw reply related

* Re: Strange packet drops with heavy firewalling
From: Benny Amorsen @ 2010-04-13  7:56 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Changli Gao, zhigang gong, netdev
In-Reply-To: <1271138186.16881.168.camel@edumazet-laptop>

Eric Dumazet <eric.dumazet@gmail.com> writes:

> Benny, I am not sure your irqbalance is up2date with multiqueue devices,
> you might need to disable it and manually irqaffine each interrupt

True, that would probably help. Irqbalance might just believe that the
load is so low that it isn't worth rebalancing. The CPU's are spending
more than 90% of their time idling.

I'll keep monitoring the server, and if it starts dropping packets again
or load increases I'll check whether irqbalanced does the right thing,
and if not I'll implement your suggestion.

Thank you very much!

/Benny

^ permalink raw reply

* Re: [PATCH v2] net: batch skb dequeueing from softnet input_pkt_queue
From: Eric Dumazet @ 2010-04-13  8:08 UTC (permalink / raw)
  To: Changli Gao; +Cc: David S. Miller, netdev
In-Reply-To: <1271173102-2980-1-git-send-email-xiaosuo@gmail.com>

Le mardi 13 avril 2010 à 23:38 +0800, Changli Gao a écrit :
> batch skb dequeueing from softnet input_pkt_queue
> 
> batch skb dequeueing from softnet input_pkt_queue to reduce potential lock
> contention and irq disabling/enabling.
> 

Very interesting idea, but implementation is too complex, and probably
buggy, in a area that too few people understand today.

Could you keep it as simple as possible ?

> Signed-off-by: Changli Gao <xiaosuo@gmail.com>
> ----
>  include/linux/netdevice.h |    2 ++
>  net/core/dev.c            |   45 +++++++++++++++++++++++++++++++++++++++------
>  2 files changed, 41 insertions(+), 6 deletions(-)
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index d1a21b5..bc7a0d7 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1335,6 +1335,8 @@ struct softnet_data {
>  	struct call_single_data	csd ____cacheline_aligned_in_smp;
>  #endif
>  	struct sk_buff_head	input_pkt_queue;
> +	struct sk_buff_head	processing_queue;

	Probably not necessary.

> +	volatile bool		flush_processing_queue;

Use of 'volatile' is strongly discouraged, I would say, forbidden. 

Its usually a sign of 'I dont exactly what memory ordering I need, so I
throw a volatile just in case'. We live in a world full of RCU, read ,
write, full barriers. And these apis are well documented.

>  	struct napi_struct	backlog;
>  };
>  
> diff --git a/net/core/dev.c b/net/core/dev.c
> index a10a216..ac24293 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2324,6 +2324,11 @@ static void trigger_softirq(void *data)
>  }
>  #endif /* CONFIG_SMP */
>  
> +static inline u32 softnet_input_qlen(struct softnet_data *queue)
> +{
> +	return queue->input_pkt_queue.qlen + queue->processing_queue.qlen;
> +}
> +
>  /*
>   * enqueue_to_backlog is called to queue an skb to a per CPU backlog
>   * queue (may be a remote CPU queue).
> @@ -2339,8 +2344,8 @@ static int enqueue_to_backlog(struct sk_buff *skb, int cpu)
>  	__get_cpu_var(netdev_rx_stat).total++;
>  
>  	rps_lock(queue);
> -	if (queue->input_pkt_queue.qlen <= netdev_max_backlog) {
> -		if (queue->input_pkt_queue.qlen) {
> +	if (softnet_input_qlen(queue) <= netdev_max_backlog) {
> +		if (softnet_input_qlen(queue)) {
>  enqueue:
>  			__skb_queue_tail(&queue->input_pkt_queue, skb);
>  			rps_unlock(queue);
> @@ -2803,6 +2808,7 @@ static void flush_backlog(void *arg)
>  			__skb_unlink(skb, &queue->input_pkt_queue);
>  			kfree_skb(skb);
>  		}
> +	queue->flush_processing_queue = true;

	Probably not necessary

>  	rps_unlock(queue);
>  }
>  
> @@ -3112,14 +3118,23 @@ static int process_backlog(struct napi_struct *napi, int quota)
>  	struct softnet_data *queue = &__get_cpu_var(softnet_data);
>  	unsigned long start_time = jiffies;
>  
> +	if (queue->flush_processing_queue) {

Really... this is bloat IMHO

> +		struct sk_buff *skb;
> +
> +		queue->flush_processing_queue = false;
> +		while ((skb = __skb_dequeue(&queue->processing_queue)))
> +			kfree_skb(skb);
> +	}
> +
>  	napi->weight = weight_p;
>  	do {
>  		struct sk_buff *skb;
>  
>  		local_irq_disable();
>  		rps_lock(queue);
> -		skb = __skb_dequeue(&queue->input_pkt_queue);
> -		if (!skb) {
> +		skb_queue_splice_tail_init(&queue->input_pkt_queue,
> +					   &queue->processing_queue);
> +		if (skb_queue_empty(&queue->processing_queue)) {
>  			__napi_complete(napi);
>  			rps_unlock(queue);
>  			local_irq_enable();
> @@ -3128,9 +3143,22 @@ static int process_backlog(struct napi_struct *napi, int quota)
>  		rps_unlock(queue);
>  		local_irq_enable();
>  
> -		__netif_receive_skb(skb);
> -	} while (++work < quota && jiffies == start_time);
> +		while ((skb = __skb_dequeue(&queue->processing_queue))) {
> +			__netif_receive_skb(skb);
> +			if (++work < quota && jiffies == start_time)
> +				continue;
> +			if (!queue->flush_processing_queue)
> +				goto out;
> +			queue->flush_processing_queue = false;

once again, ... so much code for a unlikely event...

> +			while ((skb = __skb_dequeue(&queue->processing_queue))) {
> +				__netif_receive_skb(skb);
> +				++work;
> +			}
> +			goto out;
> +		}
> +	} while (1);
>  
> +out:
>  	return work;
>  }
>  
> @@ -5487,6 +5515,9 @@ static int dev_cpu_callback(struct notifier_block *nfb,
>  	raise_softirq_irqoff(NET_TX_SOFTIRQ);
>  	local_irq_enable();
>  
> +	while ((skb = __skb_dequeue(&oldsd->processing_queue)))
> +		netif_rx(skb);
> +
>  	/* Process offline CPU's input_pkt_queue */
>  	while ((skb = __skb_dequeue(&oldsd->input_pkt_queue)))
>  		netif_rx(skb);
> @@ -5709,6 +5740,8 @@ static int __init net_dev_init(void)
>  
>  		queue = &per_cpu(softnet_data, i);
>  		skb_queue_head_init(&queue->input_pkt_queue);
> +		skb_queue_head_init(&queue->processing_queue);
> +		queue->flush_processing_queue = false;
>  		queue->completion_queue = NULL;
>  		INIT_LIST_HEAD(&queue->poll_list);
>  

I advise to keep it simple.

My suggestion would be to limit this patch only to process_backlog().

Really if you touch other areas, there is too much risk.

Perform sort of skb_queue_splice_tail_init() into a local (stack) queue,
but the trick is to not touch input_pkt_queue.qlen, so that we dont slow
down enqueue_to_backlog().

Process at most 'quota' skbs (or jiffies limit).

relock queue.
input_pkt_queue.qlen -= number_of_handled_skbs;

In the unlikely event we have unprocessed skbs in local queue,
re-insert the remaining skbs at head of input_pt_queue.

Consider if input_pkt_queue.qlen is 0 or not, to call
__napi_complete(napi); or not :)




^ permalink raw reply

* Re: [PATCH Resubmission] drivers/net/usb: Add new driver ipheth
From: David Miller @ 2010-04-13  8:15 UTC (permalink / raw)
  To: agimenez
  Cc: linux-kernel, dgiagio, dborca, gregkh, jonas.sjoquist,
	steve.glendinning, torgny.johansson, dbrownell, omar.oberthur,
	linux-usb, netdev
In-Reply-To: <1270678281-20750-1-git-send-email-agimenez@sysvalve.es>

From: L. Alberto Giménez <agimenez@sysvalve.es>
Date: Thu,  8 Apr 2010 00:11:20 +0200

> From: dborca@yahoo.com
> 
> Add new driver to use tethering with an iPhone device. After initial submission,
> apply fixes to fit the new driver into the kernel standards.
> 
> There are still a couple of minor (almost cosmetic-level) issues, but the driver
> is fully functional right now.
> 
> Signed-off-by: L. Alberto Giménez <agimenez@sysvalve.es>

I can't apply this.

Unless you add a rule to drivers/net/Makefile, the build won't
actually get to your driver unless one of the other USB networking
devices are configured.

Please fix this up and resubmit.

^ permalink raw reply

* Re: [PATCH net-next-2.6 1/3 (TAKE 3 RESENT)] ipv6 mcast: Introduce include/net/mld.h for MLD definitions.
From: David Miller @ 2010-04-13  8:21 UTC (permalink / raw)
  To: shemminger; +Cc: yoshfuji, netdev
In-Reply-To: <20100408094753.596bb61c@nehalam>

From: Stephen Hemminger <shemminger@vyatta.com>
Date: Thu, 8 Apr 2010 09:47:53 -0700

> Is there a piece missing. I tried this against net-next-2.6
> 
> 
> -- 
>   CC [M]  net/ipv6/mcast.o
> net/ipv6/mcast.c: In function ‘igmp6_event_query’:
> net/ipv6/mcast.c:1134: error: implicit declaration of function ‘mld_msg’
> net/ipv6/mcast.c:1134: warning: assignment makes pointer from integer without a cast
> net/ipv6/mcast.c: In function ‘igmp6_event_report’:
> net/ipv6/mcast.c:1254: warning: assignment makes pointer from integer without a cast

Yeah I also can't see how this patch set ever compiled successfully.

I'm tossing this entire set.

^ permalink raw reply

* Re: [PATCH v2] can: Add esd board support to plx_pci CAN driver
From: David Miller @ 2010-04-13  8:23 UTC (permalink / raw)
  To: wg; +Cc: matthias.fuchs, netdev, Socketcan-core
In-Reply-To: <4BC1E1C6.1050405@grandegger.com>

From: Wolfgang Grandegger <wg@grandegger.com>
Date: Sun, 11 Apr 2010 16:50:46 +0200

> Matthias Fuchs wrote:
>> This patch adds support for SJA1000 based PCI CAN interface cards
>> from electronic system design gmbh.
>> 
>> Some changes have been done on the common code:
>>  - esd boards must not have the 2nd local interupt enabled (PLX9030/9050)
>>  - a new path for PLX9056/PEX8311 chips has been added
>>  - new plx9056 reset function has been implemented
>>  - struct plx_card_info got a reset function entry
>> 
>> In detail the following additional boards are now supported:
>> 
>>         CAN-PCI/200 (PCI)
>>         CAN-PCI/266 (PCI)
>>         CAN-PMC266 (PMC module)
>>         CAN-PCIe/2000 (PCI Express)
>>         CAN-CPCI/200 (Compact PCI, 3U)
>>         CAN-PCI104 (PCI104)
>> 
>> Signed-off-by: Matthias Fuchs <matthias.fuchs@esd.eu>
>> ---
>> v2:
>>  - update Kconfig
>>  - add proper plx9056 reset function
>>  - add reset function pointer to plx_card_info structure
>>  - use card's reset function in plx_pci_del_card()
> 
> Acked-by: Wolfgang Grandegger <wg@grandegger.com>

Applied to net-next-2.6, thanks.

^ permalink raw reply

* Re: KS8851: Possible NULL dereferenced in ks8851_rx_pkts
From: David Miller @ 2010-04-13  8:28 UTC (permalink / raw)
  To: abraham.arce.moreno; +Cc: netdev
In-Reply-To: <z2ocb8016981004080003h7f5ca358r7a1a4106809909ce@mail.gmail.com>

From: Abraham Arce <abraham.arce.moreno@gmail.com>
Date: Thu, 8 Apr 2010 02:03:57 -0500

> Hi,
> 
> These changes avoid a possible dereference in skb_reserve when skb is
> NULL. I am increasing rx dropped packet count but not sure about how
> to handle the dump of frames. Any advice is appreciated.

This isn't sufficient to handle the NULL pointer.

At a minimum you're going to have to do something about
the fact that the chip has already been told to start
bringing the packet into the RX fifo.

If we just return without finishing up the hardware
operation, the chip is probably going to hang the next
time we come in here and tell it to start another RX
DMA operation without having completed the previous one.

The bug definitely needs to be fixed, however, but someone who knows
this hardware and has access to it for testing will need to implement
the fix.

^ permalink raw reply

* Re: [PATCH] r6040: use (dev|netdev)_<level> macros helpers
From: David Miller @ 2010-04-13  8:29 UTC (permalink / raw)
  To: florian; +Cc: netdev
In-Reply-To: <201004080939.27917.florian@openwrt.org>

From: Florian Fainelli <florian@openwrt.org>
Date: Thu, 8 Apr 2010 09:39:27 +0200

> 
> Signed-off-by: Florian Fainelli <florian@openwrt.org>

Applied to net-next-2.6, thanks.

^ permalink raw reply

* Re: [PATCH v2] packet: support for TX time stamps on RAW sockets
From: David Miller @ 2010-04-13  8:30 UTC (permalink / raw)
  To: richardcochran; +Cc: netdev
In-Reply-To: <20100408084128.GA17098@riccoc20.at.omicron.at>

From: Richard Cochran <richardcochran@gmail.com>
Date: Thu, 8 Apr 2010 10:41:28 +0200

> Enable the SO_TIMESTAMPING socket infrastructure for raw packet sockets.
> We introduce PACKET_TX_TIMESTAMP for the control message cmsg_type.
> 
> Similar support for UDP and CAN sockets was added in commit
> 51f31cabe3ce5345b51e4a4f82138b38c4d5dc91
> 
> Signed-off-by: Richard Cochran <richard.cochran@omicron.at>

Looks good, applied to net-next-2.6, thanks.

^ permalink raw reply

* Re: Linux arp flux problem
From: Julian Anastasov @ 2010-04-13  8:26 UTC (permalink / raw)
  To: Ming-Ching Tiew; +Cc: Net Dev
In-Reply-To: <908408.23515.qm@web31502.mail.mud.yahoo.com>


	Hello,

On Sun, 11 Apr 2010, Ming-Ching Tiew wrote:

> The following link explains the Linux arp flux problem pretty well, and I myself have been burnt badly by a life site where the "arp_filter" does not help at all.
> 
>          http://linux-ip.net/html/ether-arp.html
> 
> And I tested the kernel patch by Julian Anastasov, and it works 100% reliably :-
> 
>          http://www.ssi.bg/~ja/#hidden
> 
> My question is the patches has been around for many years, why has it not been included into the kernel ? Is it that Linux is supposed to have this "side effects" of arp linux on purpose ?

	The "hidden" flag is obsolete, kernel already has arp_ignore
and arp_announce vars, see Documentation/networking/ip-sysctl.txt
for more information. May be what you need is arp_announce=1/2
and (arp_ignore=1/2 or arp_filter=1 or rp_filter=1).

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply

* Re: [PATCH 2/2] [V5] Add non-Virtex5 support for LL TEMAC driver
From: David Miller @ 2010-04-13  8:34 UTC (permalink / raw)
  To: grant.likely
  Cc: john.linn, netdev, linuxppc-dev, jwboyer, eric.dumazet,
	john.williams, michal.simek, jtyner
In-Reply-To: <x2sfa686aa41004091110l87cd2234rc68906080fc4fe03@mail.gmail.com>

From: Grant Likely <grant.likely@secretlab.ca>
Date: Fri, 9 Apr 2010 12:10:21 -0600

> On Thu, Apr 8, 2010 at 11:08 AM, John Linn <john.linn@xilinx.com> wrote:
>> This patch adds support for using the LL TEMAC Ethernet driver on
>> non-Virtex 5 platforms by adding support for accessing the Soft DMA
>> registers as if they were memory mapped instead of solely through the
>> DCR's (available on the Virtex 5).
>>
>> The patch also updates the driver so that it runs on the MicroBlaze.
>> The changes were tested on the PowerPC 440, PowerPC 405, and the
>> MicroBlaze platforms.
>>
>> Signed-off-by: John Tyner <jtyner@cs.ucr.edu>
>> Signed-off-by: John Linn <john.linn@xilinx.com>
> 
> Picked up and build tested both patches on 405, 440, 60x and ppc64.
> No build problems found either built-in or as a module.
> 
> for both:
> Acked-by: Grant Likely <grant.likely@secretlab.ca>

Ok, both applied to net-next-2.6, thanks everyone for sorting this
out.

^ permalink raw reply

* Re: [PATCH] Fix some #includes in CAN drivers
From: David Miller @ 2010-04-13  8:36 UTC (permalink / raw)
  To: hjk
  Cc: socketcan-core, netdev, kernel, 21cnbao, celston, chripell, wg,
	haas, per.dalen, P.B.Cheblakov, oliver.hartkopp, anantgole
In-Reply-To: <20100408192545.GD2004@bluebox.local>

From: "Hans J. Koch" <hjk@linutronix.de>
Date: Thu, 8 Apr 2010 21:25:45 +0200

> In the current implementation, CAN drivers need to #include <linux/can.h>
> _before_ they #include <linux/can/dev.h>, which is both ugly and
> unnecessary.
> 
> Fix this by including <linux/can.h> in <linux/can/dev.h> and remove the
> #include <linux/can.h> lines from drivers.
> 
> Signed-off-by: Hans J. Koch <hjk@linutronix.de>

This doesn't apply cleanly to net-next-2.6, please respin your patch
and resubmit.

Thanks.

^ permalink raw reply

* Re: [Patch 3/3] net: reserve ports for applications using fixed port numbers
From: Cong Wang @ 2010-04-13  8:48 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: opurdila, eric.dumazet, netdev, nhorman, davem, ebiederm,
	linux-kernel
In-Reply-To: <4BC41994.7030707@redhat.com>

Cong Wang wrote:
> Tetsuo Handa wrote:
>> Hello.
>>
>>> --- linux-2.6.orig/drivers/infiniband/core/cma.c
>>> +++ linux-2.6/drivers/infiniband/core/cma.c
>>> @@ -1980,6 +1980,8 @@ retry:
>>>  	/* FIXME: add proper port randomization per like inet_csk_get_port */
>>>  	do {
>>>  		ret = idr_get_new_above(ps, bind_list, next_port, &port);
>>> +		if (!ret && inet_is_reserved_local_port(port))
>>> +			ret = -EAGAIN;
>>>  	} while ((ret == -EAGAIN) && idr_pre_get(ps, GFP_KERNEL));
>>>  
>>>  	if (ret)
>>>
>> I think above part is wrong. Below program
> ...
>> This result suggests that above loop will continue until idr_pre_get() fails
>> due to out of memory if all ports were reserved.
>>
>> Also, if idr_get_new_above() returned 0, bind_list (which is a kmalloc()ed
>> pointer) is already installed into a free slot (see comment on
>> idr_get_new_above_int()). Thus, simply calling idr_get_new_above() again will
>> install the same pointer into multiple slots. I guess it will malfunction later.
> 
> Thanks for testing!
> 
> How about:
> 
> +		if (!ret && inet_is_reserved_local_port(port))
> +			ret = -EBUSY;
> 
> ? So that it will break the loop and return error.
> 

Or use the similar trick:

 int tries = 10;
...

 if(!ret && inet_is_reserved_local_port(port)) {
   if (tries--)
     ret = -EAGAIN;
   else
     ret = -EBUSY;
 }

Any comments?

^ permalink raw reply

* Re: [PATCH v4] rfs: Receive Flow Steering
From: Eric Dumazet @ 2010-04-13  8:45 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev
In-Reply-To: <alpine.DEB.1.00.1004121651460.31468@pokey.mtv.corp.google.com>

Le lundi 12 avril 2010 à 17:03 -0700, Tom Herbert a écrit :
> Version 4 of RFS:
> - Use a mutex in rps_sock_flow_sysctl for mutual exclusion between
> concurrent writers and allows calling vmalloc.
> - Removed extra space before "rc = sock_queue_rcv_skb(sk, skb);"
> - Make changelog < 70 chars
> - Ensure calls to smp_processor_id in netif_rx are called in
> non-preemptable region
> ---
> This patch implements receive flow steering (RFS).  RFS steers
> received packets for layer 3 and 4 processing to the CPU where
> the application for the corresponding flow is running.  RFS is an
> extension of Receive Packet Steering (RPS).
> 
> The basic idea of RFS is that when an application calls recvmsg
> (or sendmsg) the application's running CPU is stored in a hash
> table that is indexed by the connection's rxhash which is stored in
> the socket structure.  The rxhash is passed in skb's received on
> the connection from netif_receive_skb.  For each received packet,
> the associated rxhash is used to look up the CPU in the hash table,
> if a valid CPU is set then the packet is steered to that CPU using
> the RPS mechanisms.
> 
> The convolution of the simple approach is that it would potentially
> allow OOO packets.  If threads are thrashing around CPUs or multiple
> threads are trying to read from the same sockets, a quickly changing
> CPU value in the hash table could cause rampant OOO packets--
> we consider this a non-starter.
> 
> To avoid OOO packets, this solution implements two types of hash
> tables: rps_sock_flow_table and rps_dev_flow_table.
> 
> rps_sock_table is a global hash table.  Each entry is just a CPU
> number and it is populated in recvmsg and sendmsg as described above.
> This table contains the "desired" CPUs for flows.
> 
> rps_dev_flow_table is specific to each device queue.  Each entry
> contains a CPU and a tail queue counter.  The CPU is the "current"
> CPU for a matching flow.  The tail queue counter holds the value
> of a tail queue counter for the associated CPU's backlog queue at
> the time of last enqueue for a flow matching the entry.
> 
> Each backlog queue has a queue head counter which is incremented
> on dequeue, and so a queue tail counter is computed as queue head
> count + queue length.  When a packet is enqueued on a backlog queue,
> the current value of the queue tail counter is saved in the hash
> entry of the rps_dev_flow_table.
> 
> And now the trick: when selecting the CPU for RPS (get_rps_cpu)
> the rps_sock_flow table and the rps_dev_flow table for the RX queue
> are consulted.  When the desired CPU for the flow (found in the
> rps_sock_flow table) does not match the current CPU (found in the
> rps_dev_flow table), the current CPU is changed to the desired CPU
> if one of the following is true:
> 
> - The current CPU is unset (equal to RPS_NO_CPU)
> - Current CPU is offline
> - The current CPU's queue head counter >= queue tail counter in the
> rps_dev_flow table.  This checks if the queue tail has advanced
> beyond the last packet that was enqueued using this table entry.
> This guarantees that all packets queued using this entry have been
> dequeued, thus preserving in order delivery.
> 
> Making each queue have its own rps_dev_flow table has two advantages:
> 1) the tail queue counters will be written on each receive, so
> keeping the table local to interrupting CPU s good for locality.  2)
> this allows lockless access to the table-- the CPU number and queue
> tail counter need to be accessed together under mutual exclusion
> from netif_receive_skb, we assume that this is only called from
> device napi_poll which is non-reentrant.
> 
> This patch implements RFS for TCP and connected UDP sockets.
> It should be usable for other flow oriented protocols.
> 
> There are two configuration parameters for RFS.  The
> "rps_flow_entries" kernel init parameter sets the number of
> entries in the rps_sock_flow_table, the per rxqueue sysfs entry
> "rps_flow_cnt" contains the number of entries in the rps_dev_flow
> table for the rxqueue.  Both are rounded to power of two.
> 
> The obvious benefit of RFS (over just RPS) is that it achieves
> CPU locality between the receive processing for a flow and the
> applications processing; this can result in increased performance
> (higher pps, lower latency).
> 
> The benefits of RFS are dependent on cache hierarchy, application
> load, and other factors.  On simple benchmarks, we don't necessarily
> see improvement and sometimes see degradation.  However, for more
> complex benchmarks and for applications where cache pressure is
> much higher this technique seems to perform very well.
> 
> Below are some benchmark results which show the potential benfit of
> this patch.  The netperf test has 500 instances of netperf TCP_RR
> test with 1 byte req. and resp.  The RPC test is an request/response
> test similar in structure to netperf RR test ith 100 threads on
> each host, but does more work in userspace that netperf.
> 
> e1000e on 8 core Intel
>    No RFS or RPS		104K tps at 30% CPU
>    No RFS (best RPS config):    290K tps at 63% CPU
>    RFS				303K tps at 61% CPU
> 
> RPC test	tps	CPU%	50/90/99% usec latency	Latency StdDev
>   No RFS/RPS	103K	48%	757/900/3185		4472.35
>   RPS only:	174K	73%	415/993/2468		491.66
>   RFS		223K	73%	379/651/1382		315.61
> 
> Signed-off-by: Tom Herbert <therbert@google.com> ---
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index d1a21b5..573e775 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -530,14 +530,77 @@ struct rps_map {
>  };
>  #define RPS_MAP_SIZE(_num) (sizeof(struct rps_map) + (_num * sizeof(u16)))
>  
> +/*
> + * The rps_dev_flow structure contains the mapping of a flow to a CPU and the
> + * tail pointer for that CPU's input queue at the time of last enqueue.
> + */
> +struct rps_dev_flow {
> +	u16 cpu;
> +	u16 fill;
> +	unsigned int last_qtail;
> +};
> +
> +/*
> + * The rps_dev_flow_table structure contains a table of flow mappings.
> + */
> +struct rps_dev_flow_table {
> +	unsigned int mask;
> +	struct rcu_head rcu;
> +	struct work_struct free_work;
> +	struct rps_dev_flow flows[0];
> +};
> +#define RPS_DEV_FLOW_TABLE_SIZE(_num) (sizeof(struct rps_dev_flow_table) + \
> +    (_num * sizeof(struct rps_dev_flow)))
> +
> +/*
> + * The rps_sock_flow_table contains mappings of flows to the last CPU
> + * on which they were processed by the application (set in recvmsg).
> + */
> +struct rps_sock_flow_table {
> +	unsigned int mask;
> +	u16 ents[0];
> +};
> +#define	RPS_SOCK_FLOW_TABLE_SIZE(_num) (sizeof(struct rps_sock_flow_table) + \
> +    (_num * sizeof(u16)))
> +
> +extern int rps_sock_flow_sysctl(ctl_table *table, int write,
> +				void __user *buffer, size_t *lenp,
> +				loff_t *ppos);

Hmm... ctl_table is not available in all contexts here.

  CC      fs/lockd/host.o
In file included from include/linux/icmpv6.h:173,
                 from include/linux/ipv6.h:216,
                 from include/net/ipv6.h:16,
                 from include/linux/sunrpc/clnt.h:25,
                 from fs/lockd/host.c:15:
include/linux/netdevice.h:566: error: expected ‘)’ before ‘*’ token
make[2]: *** [fs/lockd/host.o] Erreur 1
make[1]: *** [fs/lockd] Erreur 2
make: *** [fs] Erreur 2


Maybe rps_sock_flow_sysctl could be static in
net/core/sysctl_net_core.c ?


> +
> +#define RPS_NO_CPU 0xffff
> +
> +static inline void rps_record_sock_flow(struct rps_sock_flow_table *table,
> +					u32 hash)
> +{
> +	if (table && hash) {
> +		unsigned int cpu, index = hash & table->mask;
> +
> +		/* We only give a hint, preemption can change cpu under us */
> +		cpu = raw_smp_processor_id();
> +
> +		if (table->ents[index] != cpu)
> +			table->ents[index] = cpu;
> +	}
> +}
> +
> +static inline void rps_reset_sock_flow(struct rps_sock_flow_table *table,
> +				       u32 hash)
> +{
> +	if (table && hash)
> +		table->ents[hash & table->mask] = RPS_NO_CPU;
> +}
> +
> +extern struct rps_sock_flow_table *rps_sock_flow_table;
> +
>  /* This structure contains an instance of an RX queue. */
>  struct netdev_rx_queue {
>  	struct rps_map *rps_map;
> +	struct rps_dev_flow_table *rps_flow_table;
>  	struct kobject kobj;
>  	struct netdev_rx_queue *first;
>  	atomic_t count;
>  } ____cacheline_aligned_in_smp;
> -#endif
> +#endif /* CONFIG_RPS */
>  
>  /*
>   * This structure defines the management hooks for network devices.
> @@ -1331,13 +1394,21 @@ struct softnet_data {
>  	struct sk_buff		*completion_queue;
>  
>  	/* Elements below can be accessed between CPUs for RPS */
> -#ifdef CONFIG_SMP
> +#ifdef CONFIG_RPS
>  	struct call_single_data	csd ____cacheline_aligned_in_smp;
> +	unsigned int		input_queue_head;
>  #endif
>  	struct sk_buff_head	input_pkt_queue;
>  	struct napi_struct	backlog;
>  };
>  
> +static inline void incr_input_queue_head(struct softnet_data *queue)
> +{
> +#ifdef CONFIG_RPS
> +	queue->input_queue_head++;
> +#endif
> +}
> +
>  DECLARE_PER_CPU_ALIGNED(struct softnet_data, softnet_data);
>  
>  #define HAVE_NETIF_QUEUE
> diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
> index 83fd344..b487bc1 100644
> --- a/include/net/inet_sock.h
> +++ b/include/net/inet_sock.h
> @@ -21,6 +21,7 @@
>  #include <linux/string.h>
>  #include <linux/types.h>
>  #include <linux/jhash.h>
> +#include <linux/netdevice.h>
>  
>  #include <net/flow.h>
>  #include <net/sock.h>
> @@ -101,6 +102,7 @@ struct rtable;
>   * @uc_ttl - Unicast TTL
>   * @inet_sport - Source port
>   * @inet_id - ID counter for DF pkts
> + * @rxhash - flow hash received from netif layer
>   * @tos - TOS
>   * @mc_ttl - Multicasting TTL
>   * @is_icsk - is this an inet_connection_sock?
> @@ -124,6 +126,9 @@ struct inet_sock {
>  	__u16			cmsg_flags;
>  	__be16			inet_sport;
>  	__u16			inet_id;
> +#ifdef CONFIG_RPS
> +	__u32			rxhash;
> +#endif
>  
>  	struct ip_options	*opt;
>  	__u8			tos;
> @@ -219,4 +224,37 @@ static inline __u8 inet_sk_flowi_flags(const struct sock *sk)
>  	return inet_sk(sk)->transparent ? FLOWI_FLAG_ANYSRC : 0;
>  }
>  
> +static inline void inet_rps_record_flow(const struct sock *sk)
> +{
> +#ifdef CONFIG_RPS
> +	struct rps_sock_flow_table *sock_flow_table;
> +
> +	rcu_read_lock();
> +	sock_flow_table = rcu_dereference(rps_sock_flow_table);
> +	rps_record_sock_flow(sock_flow_table, inet_sk(sk)->rxhash);
> +	rcu_read_unlock();
> +#endif
> +}
> +
> +static inline void inet_rps_reset_flow(const struct sock *sk)
> +{
> +#ifdef CONFIG_RPS
> +	struct rps_sock_flow_table *sock_flow_table;
> +
> +	rcu_read_lock();
> +	sock_flow_table = rcu_dereference(rps_sock_flow_table);
> +	rps_reset_sock_flow(sock_flow_table, inet_sk(sk)->rxhash);
> +	rcu_read_unlock();
> +#endif
> +}
> +
> +static inline void inet_rps_save_rxhash(const struct sock *sk, u32 rxhash)
> +{
> +#ifdef CONFIG_RPS
> +	if (unlikely(inet_sk(sk)->rxhash != rxhash)) {
> +		inet_rps_reset_flow(sk);
> +		inet_sk(sk)->rxhash = rxhash;
> +	}
> +#endif
> +}
>  #endif	/* _INET_SOCK_H */
> diff --git a/net/core/dev.c b/net/core/dev.c
> index a10a216..7dbe64e 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2203,22 +2203,81 @@ int weight_p __read_mostly = 64;            /* old backlog weight */
>  DEFINE_PER_CPU(struct netif_rx_stats, netdev_rx_stat) = { 0, };
>  
>  #ifdef CONFIG_RPS
> +/* One global table that all flow-based protocols share. */
> +struct rps_sock_flow_table *rps_sock_flow_table;
> +EXPORT_SYMBOL(rps_sock_flow_table);
> +
> +int rps_sock_flow_sysctl(ctl_table *table, int write, void __user *buffer,
> +			 size_t *lenp, loff_t *ppos)
> +{
> +	unsigned int orig_size, size;
> +	int ret, i;
> +	ctl_table tmp = {
> +		.data = &size,
> +		.maxlen = sizeof(size),
> +		.mode = table->mode
> +	};
> +	struct rps_sock_flow_table *orig_sock_table, *sock_table;
> +	static DEFINE_MUTEX(sock_flow_mutex);
> +
> +	mutex_lock(&sock_flow_mutex);
> +
> +	orig_sock_table = rps_sock_flow_table;
> +	size = orig_size = orig_sock_table ? orig_sock_table->mask + 1 : 0;
> +
> +	ret = proc_dointvec(&tmp, write, buffer, lenp, ppos);
> +
> +	if (write) {
> +		if (size) {
> +			size = roundup_pow_of_two(size);
> +			if (size != orig_size) {
> +				sock_table =
> +				    vmalloc(RPS_SOCK_FLOW_TABLE_SIZE(size));

Please take a look at overflows in this macro

On a 32 bit machine, what happens if someone does

echo 2147483648 >/proc/sys/net/core/rps_sock_flow_entries

(I bet for a crash :( )

> +				if (!sock_table) {
> +					mutex_unlock(&sock_flow_mutex);
> +					return -ENOMEM;
> +				}
> +
> +				sock_table->mask = size - 1;
> +			} else
> +				sock_table = orig_sock_table;
> +
> +			for (i = 0; i < size; i++)
> +				sock_table->ents[i] = RPS_NO_CPU;
> +		} else
> +			sock_table = NULL;
> +
> +		if (sock_table != orig_sock_table) {
> +			rcu_assign_pointer(rps_sock_flow_table, sock_table);
> +			synchronize_rcu();
> +			vfree(orig_sock_table);
> +		}
> +	}
> +
> +	mutex_unlock(&sock_flow_mutex);
> +
> +	return ret;
> +}
> +
>  /*
>   * get_rps_cpu is called from netif_receive_skb and returns the target
>   * CPU from the RPS map of the receiving queue for a given skb.
> + * rcu_read_lock must be held on entry.
>   */
> -static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb)
> +static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
> +		       struct rps_dev_flow **rflowp)
>  {
>  	struct ipv6hdr *ip6;
>  	struct iphdr *ip;
>  	struct netdev_rx_queue *rxqueue;
>  	struct rps_map *map;
> +	struct rps_dev_flow_table *flow_table;
> +	struct rps_sock_flow_table *sock_flow_table;
>  	int cpu = -1;
>  	u8 ip_proto;
> +	u16 tcpu;
>  	u32 addr1, addr2, ports, ihl;
>  
> -	rcu_read_lock();
> -
>  	if (skb_rx_queue_recorded(skb)) {
>  		u16 index = skb_get_rx_queue(skb);
>  		if (unlikely(index >= dev->num_rx_queues)) {
> @@ -2233,7 +2292,7 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb)
>  	} else
>  		rxqueue = dev->_rx;
>  
> -	if (!rxqueue->rps_map)
> +	if (!rxqueue->rps_map && !rxqueue->rps_flow_table)
>  		goto done;
>  
>  	if (skb->rxhash)
> @@ -2285,9 +2344,48 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb)
>  		skb->rxhash = 1;
>  
>  got_hash:
> +	flow_table = rcu_dereference(rxqueue->rps_flow_table);
> +	sock_flow_table = rcu_dereference(rps_sock_flow_table);
> +	if (flow_table && sock_flow_table) {
> +		u16 next_cpu;
> +		struct rps_dev_flow *rflow;
> +
> +		rflow = &flow_table->flows[skb->rxhash & flow_table->mask];
> +		tcpu = rflow->cpu;
> +
> +		next_cpu = sock_flow_table->ents[skb->rxhash &
> +		    sock_flow_table->mask];
> +
> +		/*
> +		 * If the desired CPU (where last recvmsg was done) is
> +		 * different from current CPU (one in the rx-queue flow
> +		 * table entry), switch if one of the following holds:
> +		 *   - Current CPU is unset (equal to RPS_NO_CPU).
> +		 *   - Current CPU is offline.
> +		 *   - The current CPU's queue tail has advanced beyond the
> +		 *     last packet that was enqueued using this table entry.
> +		 *     This guarantees that all previous packets for the flow
> +		 *     have been dequeued, thus preserving in order delivery.
> +		 */
> +		if (unlikely(tcpu != next_cpu) &&
> +		    (tcpu == RPS_NO_CPU || !cpu_online(tcpu) ||
> +		     ((int)(per_cpu(softnet_data, tcpu).input_queue_head -
> +		      rflow->last_qtail)) >= 0)) {
> +			tcpu = rflow->cpu = next_cpu;
> +			if (tcpu != RPS_NO_CPU)
> +				rflow->last_qtail = per_cpu(softnet_data,
> +				    tcpu).input_queue_head;
> +		}
> +		if (tcpu != RPS_NO_CPU && cpu_online(tcpu)) {
> +			*rflowp = rflow;
> +			cpu = tcpu;
> +			goto done;
> +		}
> +	}
> +
>  	map = rcu_dereference(rxqueue->rps_map);
>  	if (map) {
> -		u16 tcpu = map->cpus[((u64) skb->rxhash * map->len) >> 32];
> +		tcpu = map->cpus[((u64) skb->rxhash * map->len) >> 32];
>  
>  		if (cpu_online(tcpu)) {
>  			cpu = tcpu;
> @@ -2296,7 +2394,6 @@ got_hash:
>  	}
>  
>  done:
> -	rcu_read_unlock();
>  	return cpu;
>  }
>  
> @@ -2322,13 +2419,14 @@ static void trigger_softirq(void *data)
>  	__napi_schedule(&queue->backlog);
>  	__get_cpu_var(netdev_rx_stat).received_rps++;
>  }
> -#endif /* CONFIG_SMP */
> +#endif /* CONFIG_RPS */
>  
>  /*
>   * enqueue_to_backlog is called to queue an skb to a per CPU backlog
>   * queue (may be a remote CPU queue).
>   */
> -static int enqueue_to_backlog(struct sk_buff *skb, int cpu)
> +static int enqueue_to_backlog(struct sk_buff *skb, int cpu,
> +			      unsigned int *qtail)
>  {
>  	struct softnet_data *queue;
>  	unsigned long flags;
> @@ -2343,6 +2441,10 @@ static int enqueue_to_backlog(struct sk_buff *skb, int cpu)
>  		if (queue->input_pkt_queue.qlen) {
>  enqueue:
>  			__skb_queue_tail(&queue->input_pkt_queue, skb);
> +#ifdef CONFIG_RPS
> +			*qtail = queue->input_queue_head +
> +			    queue->input_pkt_queue.qlen;
> +#endif
>  			rps_unlock(queue);
>  			local_irq_restore(flags);
>  			return NET_RX_SUCCESS;
> @@ -2357,11 +2459,10 @@ enqueue:
>  
>  				cpu_set(cpu, rcpus->mask[rcpus->select]);
>  				__raise_softirq_irqoff(NET_RX_SOFTIRQ);
> -			} else
> -				__napi_schedule(&queue->backlog);
> -#else
> -			__napi_schedule(&queue->backlog);
> +				goto enqueue;
> +			}
>  #endif
> +			__napi_schedule(&queue->backlog);
>  		}
>  		goto enqueue;
>  	}
> @@ -2392,7 +2493,8 @@ enqueue:
>  
>  int netif_rx(struct sk_buff *skb)
>  {
> -	int cpu;
> +	unsigned int qtail;
> +	int err;
>  
>  	/* if netpoll wants it, pretend we never saw it */
>  	if (netpoll_rx(skb))
> @@ -2402,14 +2504,26 @@ int netif_rx(struct sk_buff *skb)
>  		net_timestamp(skb);
>  
>  #ifdef CONFIG_RPS
> -	cpu = get_rps_cpu(skb->dev, skb);
> -	if (cpu < 0)
> -		cpu = smp_processor_id();
> +	{
> +		struct rps_dev_flow voidflow, *rflow = &voidflow;
> +		int cpu;
> +
> +		rcu_read_lock();
> +
> +		cpu = get_rps_cpu(skb->dev, skb, &rflow);
> +		if (cpu < 0)
> +			cpu = smp_processor_id();
> +
> +		err = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
> +
> +		rcu_read_unlock();
> +	}
>  #else
> -	cpu = smp_processor_id();
> +	preempt_disable();
> +	err = enqueue_to_backlog(skb, smp_processor_id(), &qtail);
> +	preempt_enable();
>  #endif
> -
> -	return enqueue_to_backlog(skb, cpu);
> +	return err;
>  }
>  EXPORT_SYMBOL(netif_rx);
>  
> @@ -2776,17 +2890,22 @@ out:
>  int netif_receive_skb(struct sk_buff *skb)
>  {
>  #ifdef CONFIG_RPS
> -	int cpu;
> +	struct rps_dev_flow voidflow, *rflow = &voidflow;
> +	int cpu, err;
> +
> +	rcu_read_lock();
>  
> -	cpu = get_rps_cpu(skb->dev, skb);
> +	cpu = get_rps_cpu(skb->dev, skb, &rflow);
>  
> -	if (cpu < 0)
> -		return __netif_receive_skb(skb);
> -	else
> -		return enqueue_to_backlog(skb, cpu);
> -#else
> -	return __netif_receive_skb(skb);
> +	if (cpu >= 0) {
> +		err = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
> +		rcu_read_unlock();
> +		return err;
> +	}
> +
> +	rcu_read_unlock();
>  #endif
> +	return __netif_receive_skb(skb);
>  }
>  EXPORT_SYMBOL(netif_receive_skb);
>  
> @@ -2802,6 +2921,7 @@ static void flush_backlog(void *arg)
>  		if (skb->dev == dev) {
>  			__skb_unlink(skb, &queue->input_pkt_queue);
>  			kfree_skb(skb);
> +			incr_input_queue_head(queue);
>  		}
>  	rps_unlock(queue);
>  }
> @@ -3125,6 +3245,7 @@ static int process_backlog(struct napi_struct *napi, int quota)
>  			local_irq_enable();
>  			break;
>  		}
> +		incr_input_queue_head(queue);
>  		rps_unlock(queue);
>  		local_irq_enable();
>  
> @@ -5488,8 +5609,10 @@ static int dev_cpu_callback(struct notifier_block *nfb,
>  	local_irq_enable();
>  
>  	/* Process offline CPU's input_pkt_queue */
> -	while ((skb = __skb_dequeue(&oldsd->input_pkt_queue)))
> +	while ((skb = __skb_dequeue(&oldsd->input_pkt_queue))) {
>  		netif_rx(skb);
> +		incr_input_queue_head(oldsd);
> +	}
>  
>  	return NOTIFY_OK;
>  }
> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> index 96ed690..e518bee 100644
> --- a/net/core/net-sysfs.c
> +++ b/net/core/net-sysfs.c
> @@ -601,22 +601,105 @@ ssize_t store_rps_map(struct netdev_rx_queue *queue,
>  	return len;
>  }
>  
> +static ssize_t show_rps_dev_flow_table_cnt(struct netdev_rx_queue *queue,
> +					   struct rx_queue_attribute *attr,
> +					   char *buf)
> +{
> +	struct rps_dev_flow_table *flow_table;
> +	unsigned int val = 0;
> +
> +	rcu_read_lock();
> +	flow_table = rcu_dereference(queue->rps_flow_table);
> +	if (flow_table)
> +		val = flow_table->mask + 1;
> +	rcu_read_unlock();
> +
> +	return sprintf(buf, "%u\n", val);
> +}
> +
> +static void rps_dev_flow_table_release_work(struct work_struct *work)
> +{
> +	struct rps_dev_flow_table *table = container_of(work,
> +	    struct rps_dev_flow_table, free_work);
> +
> +	vfree(table);
> +}
> +
> +static void rps_dev_flow_table_release(struct rcu_head *rcu)
> +{
> +	struct rps_dev_flow_table *table = container_of(rcu,
> +	    struct rps_dev_flow_table, rcu);
> +
> +	INIT_WORK(&table->free_work, rps_dev_flow_table_release_work);
> +	schedule_work(&table->free_work);
> +}
> +
> +ssize_t store_rps_dev_flow_table_cnt(struct netdev_rx_queue *queue,
> +				     struct rx_queue_attribute *attr,
> +				     const char *buf, size_t len)
> +{
> +	unsigned int count;
> +	char *endp;
> +	struct rps_dev_flow_table *table, *old_table;
> +	static DEFINE_SPINLOCK(rps_dev_flow_lock);
> +
> +	if (!capable(CAP_NET_ADMIN))
> +		return -EPERM;
> +
> +	count = simple_strtoul(buf, &endp, 0);
> +	if (endp == buf)
> +		return -EINVAL;
> +
> +	if (count) {
> +		int i;
> +
> +		count = roundup_pow_of_two(count);
> +		table = vmalloc(RPS_DEV_FLOW_TABLE_SIZE(count));


Same overflow problem here

> +		if (!table)
> +			return -ENOMEM;
> +
> +		table->mask = count - 1;
> +		for (i = 0; i < count; i++)
> +			table->flows[i].cpu = RPS_NO_CPU;
> +	} else
> +		table = NULL;
> +
> +	spin_lock(&rps_dev_flow_lock);
> +	old_table = queue->rps_flow_table;
> +	rcu_assign_pointer(queue->rps_flow_table, table);
> +	spin_unlock(&rps_dev_flow_lock);
> +
> +	if (old_table)
> +		call_rcu(&old_table->rcu, rps_dev_flow_table_release);
> +
> +	return len;
> +}
> +
>  static struct rx_queue_attribute rps_cpus_attribute =
>  	__ATTR(rps_cpus, S_IRUGO | S_IWUSR, show_rps_map, store_rps_map);
>  
> +
> +static struct rx_queue_attribute rps_dev_flow_table_cnt_attribute =
> +	__ATTR(rps_flow_cnt, S_IRUGO | S_IWUSR,
> +	    show_rps_dev_flow_table_cnt, store_rps_dev_flow_table_cnt);
> +
>  static struct attribute *rx_queue_default_attrs[] = {
>  	&rps_cpus_attribute.attr,
> +	&rps_dev_flow_table_cnt_attribute.attr,
>  	NULL
>  };
>  
>  static void rx_queue_release(struct kobject *kobj)
>  {
>  	struct netdev_rx_queue *queue = to_rx_queue(kobj);
> -	struct rps_map *map = queue->rps_map;
>  	struct netdev_rx_queue *first = queue->first;
>  
> -	if (map)
> -		call_rcu(&map->rcu, rps_map_release);
> +	if (queue->rps_map)
> +		call_rcu(&queue->rps_map->rcu, rps_map_release);
> +
> +	if (queue->rps_flow_table)
> +		call_rcu(&queue->rps_flow_table->rcu,
> +		    rps_dev_flow_table_release);
>  
>  	if (atomic_dec_and_test(&first->count))
>  		kfree(first);
> diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
> index b7b6b82..9eb2f67 100644
> --- a/net/core/sysctl_net_core.c
> +++ b/net/core/sysctl_net_core.c
> @@ -82,6 +82,14 @@ static struct ctl_table net_core_table[] = {
>  		.mode		= 0644,
>  		.proc_handler	= proc_dointvec
>  	},
> +#ifdef CONFIG_RPS
> +	{
> +		.procname	= "rps_sock_flow_entries",
> +		.maxlen		= sizeof(int),
> +		.mode		= 0644,
> +		.proc_handler	= rps_sock_flow_sysctl
> +	},
> +#endif
>  #endif /* CONFIG_NET */
>  	{
>  		.procname	= "netdev_budget",
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index a0beb32..3703b5e 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -419,6 +419,8 @@ int inet_release(struct socket *sock)
>  	if (sk) {
>  		long timeout;
>  
> +		inet_rps_reset_flow(sk);
> +
>  		/* Applications forget to leave groups before exiting */
>  		ip_mc_drop_socket(sk);
>  
> @@ -720,6 +722,8 @@ int inet_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
>  {
>  	struct sock *sk = sock->sk;
>  
> +	inet_rps_record_flow(sk);
> +
>  	/* We may need to bind the socket. */
>  	if (!inet_sk(sk)->inet_num && inet_autobind(sk))
>  		return -EAGAIN;
> @@ -728,12 +732,13 @@ int inet_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
>  }
>  EXPORT_SYMBOL(inet_sendmsg);
>  
> -
>  static ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
>  			     size_t size, int flags)
>  {
>  	struct sock *sk = sock->sk;
>  
> +	inet_rps_record_flow(sk);
> +
>  	/* We may need to bind the socket. */
>  	if (!inet_sk(sk)->inet_num && inet_autobind(sk))
>  		return -EAGAIN;
> @@ -743,6 +748,22 @@ static ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
>  	return sock_no_sendpage(sock, page, offset, size, flags);
>  }
>  
> +int inet_recvmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
> +		 size_t size, int flags)
> +{
> +	struct sock *sk = sock->sk;
> +	int addr_len = 0;
> +	int err;
> +
> +	inet_rps_record_flow(sk);
> +
> +	err = sk->sk_prot->recvmsg(iocb, sk, msg, size, flags & MSG_DONTWAIT,
> +				   flags & ~MSG_DONTWAIT, &addr_len);
> +	if (err >= 0)
> +		msg->msg_namelen = addr_len;
> +	return err;
> +}
> +EXPORT_SYMBOL(inet_recvmsg);
>  
>  int inet_shutdown(struct socket *sock, int how)
>  {
> @@ -872,7 +893,7 @@ const struct proto_ops inet_stream_ops = {
>  	.setsockopt	   = sock_common_setsockopt,
>  	.getsockopt	   = sock_common_getsockopt,
>  	.sendmsg	   = tcp_sendmsg,
> -	.recvmsg	   = sock_common_recvmsg,
> +	.recvmsg	   = inet_recvmsg,
>  	.mmap		   = sock_no_mmap,
>  	.sendpage	   = tcp_sendpage,
>  	.splice_read	   = tcp_splice_read,
> @@ -899,7 +920,7 @@ const struct proto_ops inet_dgram_ops = {
>  	.setsockopt	   = sock_common_setsockopt,
>  	.getsockopt	   = sock_common_getsockopt,
>  	.sendmsg	   = inet_sendmsg,
> -	.recvmsg	   = sock_common_recvmsg,
> +	.recvmsg	   = inet_recvmsg,
>  	.mmap		   = sock_no_mmap,
>  	.sendpage	   = inet_sendpage,
>  #ifdef CONFIG_COMPAT
> @@ -929,7 +950,7 @@ static const struct proto_ops inet_sockraw_ops = {
>  	.setsockopt	   = sock_common_setsockopt,
>  	.getsockopt	   = sock_common_getsockopt,
>  	.sendmsg	   = inet_sendmsg,
> -	.recvmsg	   = sock_common_recvmsg,
> +	.recvmsg	   = inet_recvmsg,
>  	.mmap		   = sock_no_mmap,
>  	.sendpage	   = inet_sendpage,
>  #ifdef CONFIG_COMPAT
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index a24995c..ad08392 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1672,6 +1672,8 @@ process:
>  
>  	skb->dev = NULL;
>  
> +	inet_rps_save_rxhash(sk, skb->rxhash);
> +
>  	bh_lock_sock_nested(sk);
>  	ret = 0;
>  	if (!sock_owned_by_user(sk)) {
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 8fef859..666b963 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1217,6 +1217,7 @@ int udp_disconnect(struct sock *sk, int flags)
>  	sk->sk_state = TCP_CLOSE;
>  	inet->inet_daddr = 0;
>  	inet->inet_dport = 0;
> +	inet_rps_save_rxhash(sk, 0);
>  	sk->sk_bound_dev_if = 0;
>  	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
>  		inet_reset_saddr(sk);
> @@ -1258,8 +1259,12 @@ EXPORT_SYMBOL(udp_lib_unhash);
>  
>  static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
>  {
> -	int rc = sock_queue_rcv_skb(sk, skb);
> +	int rc;
> +
> +	if (inet_sk(sk)->inet_daddr)
> +		inet_rps_save_rxhash(sk, skb->rxhash);
>  
> +	rc = sock_queue_rcv_skb(sk, skb);
>  	if (rc < 0) {
>  		int is_udplite = IS_UDPLITE(sk);
>  



^ permalink raw reply

* Re: [PATCH net-next-2.6] net: Dont use netdev_warn()
From: David Miller @ 2010-04-13  8:52 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, therbert, joe
In-Reply-To: <1270797973.2623.28.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 09 Apr 2010 09:26:13 +0200

> Dont use netdev_warn() in dev_cap_txqueue() and get_rps_cpu() so that we
> can catch following warnings without crash.
> 
> bond0.2240 received packet on queue 6, but number of RX queues is 1
> bond0.2240 received packet on queue 11, but number of RX queues is 1
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied to net-next-2.6, thanks Eric.

^ permalink raw reply

* Re: [PATCH net-next-2.6] net: sk_dst_cache RCUification
From: David Miller @ 2010-04-13  8:52 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, paulmck
In-Reply-To: <1270803809.2623.69.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 09 Apr 2010 11:03:29 +0200

> With latest CONFIG_PROVE_RCU stuff, I felt more comfortable to make this
> work.
> 
> sk->sk_dst_cache is currently protected by a rwlock (sk_dst_lock)
> 
> This rwlock is readlocked for a very small amount of time, and dst
> entries are already freed after RCU grace period. This calls for RCU
> again :)
> 
> This patch converts sk_dst_lock to a spinlock, and use RCU for readers.
> 
> __sk_dst_get() is supposed to be called with rcu_read_lock() or if
> socket locked by user, so use appropriate rcu_dereference_check()
> condition (rcu_read_lock_held() || sock_owned_by_user(sk))
> 
> This patch avoids two atomic ops per tx packet on UDP connected sockets,
> for example, and permits sk_dst_lock to be much less dirtied.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied, thanks for doing this work Eric.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox