Netdev List
* Re: another cleanup patch gone wrong
From: David Miller @ 2010-04-16  3:01 UTC
  To: fthain; +Cc: joe, p_gortmaker, netdev, linux-kernel, linux-m68k
In-Reply-To: <alpine.OSX.2.00.1004161214270.271@localhost>

From: Finn Thain <fthain@telegraphics.com.au>
Date: Fri, 16 Apr 2010 12:34:24 +1000 (EST)

> 
> ...but this one was already merged, unfortunately.
> 
>> Use printk_once
>> Add #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>> Convert printks without KERN_<level> to pr_info and pr_cont
>> 
>> Signed-off-by: Joe Perches <joe@perches.com>
>> Signed-off-by: David S. Miller <davem@davemloft.net>
>> 
>> 
>> diff --git a/drivers/net/mac8390.c b/drivers/net/mac8390.c
>> index 517cee4..8bd09e2 100644 (file)
>> --- a/drivers/net/mac8390.c
>> +++ b/drivers/net/mac8390.c
>> @@ -17,6 +17,8 @@
>>  /* 2002-12-30: Try to support more cards, some clues from NetBSD driver */
>>  /* 2003-12-26: Make sure Asante cards always work. */
>>  
>> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>> +
> 
> Why the macro? You only used it once.

It gets expanded internally into all of the pr_*() calls.
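
As a rough user-space illustration of that expansion (not the kernel's
actual printk.h/kernel.h code), the sketch below models two pr_<level>
macros that both pick up the per-file pr_fmt() prefix. The
sketch_pr_info()/sketch_pr_err() names and the hard-coded KBUILD_MODNAME
are assumptions made for the example; the "<6>" and "<3>" strings stand
in for KERN_INFO and KERN_ERR.

```c
#include <stdio.h>

/* Assumption: the kernel build system normally supplies KBUILD_MODNAME. */
#define KBUILD_MODNAME "mac8390"

/* The line the patch adds: a per-file prefix for every pr_<level> format. */
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

/*
 * Hypothetical stand-ins for pr_info()/pr_err() that write into a buffer
 * instead of the kernel log. Adjacent string literals are concatenated at
 * compile time, so the level marker, the pr_fmt() prefix and the caller's
 * format become one format string.
 * ##__VA_ARGS__ is a GNU extension (accepted by gcc and clang).
 */
#define sketch_pr_info(buf, len, fmt, ...) \
	snprintf(buf, len, "<6>" pr_fmt(fmt), ##__VA_ARGS__)
#define sketch_pr_err(buf, len, fmt, ...) \
	snprintf(buf, len, "<3>" pr_fmt(fmt), ##__VA_ARGS__)
```

Calling sketch_pr_info(buf, sizeof(buf), "unable to get IRQ %d.\n", 5)
fills buf with "<6>mac8390: unable to get IRQ 5.\n", so a single
#define affects every call site in the file.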

> The pr_xxx naming convention belongs to a kernel-wide include file. Is it 
> really a good idea to start repurposing it in .c files?

This is exactly how it can be used, and there is much
precedent for this now.

>> -                       printk("Don't know how to access card memory!\n");
>> +                       pr_info("Don't know how to access card memory!\n");
> 
> No, this is pr_err. The driver sets dev->mem_start expecting it to work, 
> obviously.

It was an unspecified printk() so Joe's conversion is equal
and that's a good way for him to have made these changes.

If we want to mark this as KERN_ERR or whatever, that's entirely
a separate change.

I think your objections to Joe's changes are completely uncalled
for and his changes were good ones.


* Re: another cleanup patch gone wrong
From: Joe Perches @ 2010-04-16  3:11 UTC
  To: Finn Thain
  Cc: David S. Miller, Paul Gortmaker, netdev,
	Linux Kernel Mailing List, Linux/m68k
In-Reply-To: <alpine.OSX.2.00.1004161214270.271@localhost>

On Fri, 2010-04-16 at 12:34 +1000, Finn Thain wrote:
> ...but this one was already merged, unfortunately.
> 
> > Use printk_once
> > Add #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> > Convert printks without KERN_<level> to pr_info and pr_cont
> > 
> > Signed-off-by: Joe Perches <joe@perches.com>
> > Signed-off-by: David S. Miller <davem@davemloft.net>
> > 
> > 
> > diff --git a/drivers/net/mac8390.c b/drivers/net/mac8390.c
> > index 517cee4..8bd09e2 100644 (file)
> > --- a/drivers/net/mac8390.c
> > +++ b/drivers/net/mac8390.c
> > @@ -17,6 +17,8 @@
> >  /* 2002-12-30: Try to support more cards, some clues from NetBSD driver */
> >  /* 2003-12-26: Make sure Asante cards always work. */
> >  
> > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> > +
> 
> Why the macro? You only used it once.

It's used embedded in the pr_<level> functions.
It's used more than once.

> The pr_xxx naming convention belongs to a kernel-wide include file. Is it 
> really a good idea to start repurposing it in .c files?

It's in kernel.h, and yes, it is.
http://lkml.org/lkml/2008/11/12/297

> No, this is pr_err. The driver sets dev->mem_start expecting it to work, 
> obviously.

I suggest you change the levels to what you desire.

You could add yourself to the MAINTAINERS entry for this file.

cheers, Joe


* Re: another cleanup patch gone wrong
From: Finn Thain @ 2010-04-16  3:21 UTC
  To: Joe Perches
  Cc: David S. Miller, Paul Gortmaker, netdev,
	Linux Kernel Mailing List, Linux/m68k
In-Reply-To: <1271387506.2298.17.camel@Joe-Laptop.home>


On Thu, 15 Apr 2010, Joe Perches wrote:

> > Why the macro? You only used it once.
> 
> It's used embedded in the pr_<level> functions.
> It's used more than once.
> 
> > The pr_xxx naming convention belongs to a kernel-wide include file. Is it 
> > really a good idea to start repurposing it in .c files?
> 
> It's in kernel.h, and yes, it is.
> http://lkml.org/lkml/2008/11/12/297

My mistake.

Finn


* [PATCH] mac8390: fix pr_info() calls, was Re: another cleanup patch gone wrong
From: Finn Thain @ 2010-04-16  3:45 UTC
  To: David Miller; +Cc: joe, p_gortmaker, netdev, linux-kernel, linux-m68k
In-Reply-To: <20100415.200113.215578006.davem@davemloft.net>


On Thu, 15 Apr 2010, David Miller wrote:

> >> -                       printk("Don't know how to access card memory!\n");
> >> +                       pr_info("Don't know how to access card memory!\n");
> > 
> > No, this is pr_err. The driver sets dev->mem_start expecting it to work, 
> > obviously.
> 
> It was an unspecified printk() so Joe's conversion is equal
> and that's a good way for him to have made these changes.

Seems to me that the code went from unspecified to wrong. But I can see 
your point of view.

> If we want to mark this as KERN_ERR or whatever, that's entirely a 
> separate change.
>
> I think your objections to Joe's changes are completely uncalled for and 
> his changes were good ones.

Here's a patch, both uncalled-for and untested.


Signed-off-by: Finn Thain <fthain@telegraphics.com.au>

--- a/drivers/net/mac8390.c	2010-04-16 13:31:04.000000000 +1000
+++ b/drivers/net/mac8390.c	2010-04-16 13:35:06.000000000 +1000
@@ -554,7 +554,7 @@
 	case MAC8390_APPLE:
 		switch (mac8390_testio(dev->mem_start)) {
 		case ACCESS_UNKNOWN:
-			pr_info("Don't know how to access card memory!\n");
+			pr_err("Don't know how to access card memory!\n");
 			return -ENODEV;
 			break;
 
@@ -643,8 +643,8 @@
 {
 	__ei_open(dev);
 	if (request_irq(dev->irq, __ei_interrupt, 0, "8390 Ethernet", dev)) {
-		pr_info("%s: unable to get IRQ %d.\n", dev->name, dev->irq);
-		return -EAGAIN;
+		pr_err("%s: unable to get IRQ %d.\n", dev->name, dev->irq);
+		return -EBUSY;
 	}
 	return 0;
 }
@@ -660,7 +660,7 @@
 {
 	ei_status.txing = 0;
 	if (ei_debug > 1)
-		pr_info("reset not supported\n");
+		pr_debug("reset not supported\n");
 	return;
 }
 
@@ -668,7 +668,7 @@
 {
 	unsigned char *target = nubus_slot_addr(IRQ2SLOT(dev->irq));
 	if (ei_debug > 1)
-		pr_info("Need to reset the NS8390 t=%lu...", jiffies);
+		pr_debug("Need to reset the NS8390 t=%lu...", jiffies);
 	ei_status.txing = 0;
 	target[0xC0000] = 0;
 	if (ei_debug > 1)


* Re: Network multiqueue question
From: George B. @ 2010-04-16  3:54 UTC
  To: Jay Vosburgh; +Cc: netdev
In-Reply-To: <21433.1271354986@death.nxdomain.ibm.com>

On Thu, Apr 15, 2010 at 11:09 AM, Jay Vosburgh <fubar@us.ibm.com> wrote:


>        The question I have about it (and the above patch), is: what
> does multi-queue "awareness" really mean for a bonding device?  How does
> allocating a bunch of TX queues help, given that the determination of
> the transmitting device hasn't necessarily been made?

Good point.

>        I haven't had the chance to acquire some multi-queue network
> cards and check things out with bonding, so I'm not really sure how it
> should work.  Should the bond look, from a multi-queue perspective, like
> the largest slave, or should it look like the sum of the slaves?  Some
> of this may be mode-specific, as well.

I would say that making the number of bands the number of cores or 4,
whichever is smaller, would be a good start.  That is probably fine
for GigE.  The network cards we have that support multiqueue have
either 4 or 8 bands.  In an optimal world, you would have as many
bands as are available at the physical ethernet level, but changing
that on the fly when the set of available interfaces changes might be
more trouble than it is worth.

Four or eight would seem to be a good number to start with as I don't
think I have seen an ethernet card with fewer than 4.  If you have
fewer than 4 CPUs there probably isn't much utility in having more
bands than processors, or maybe that utility rapidly diminishes as the
number of bands increases beyond the number of CPUs.  At that point
you have probably just spent a lot of work building a bigger buffer.

I would be happy with 4 bands.  I guess it just depends on where you
want the bottleneck.  If you have 8 bands on the bond driver (another
reasonable alternative) and only 4 bands available for output, you
have just moved the contention down a layer to between the bond and
the ethernet driver.  But I am a fan of moving the point of contention
as far away from the application interface as possible.  If I have one
big lock around the bond driver and have 6 things waiting to talk to
the network, those are six things that can't be doing anything else.
I would rather have the application handle its network task and get
back to other things.  Now if you have 8 bands of bond and only 4
bands of ethernet, or even one band of ethernet, oh well.  Maybe have
1 to 8 bands configurable by an option to the driver that could be set
explicitly and defaults to, say, 4?
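
The sizing rule suggested above can be sketched as a small helper. This
is only an illustration of the proposal, not bonding driver code: the
function name, the "0 means unset" convention and the two constants are
assumptions introduced for the example.

```c
/* Hypothetical limits taken from the discussion above, not from bonding code. */
#define SKETCH_MAX_BANDS     8
#define SKETCH_DEFAULT_BANDS 4

/*
 * requested == 0 means "option not set": fall back to min(ncpus, 4).
 * Any explicit request is clamped to the range 1..8.
 */
unsigned int sketch_bond_bands(unsigned int requested, unsigned int ncpus)
{
	unsigned int bands;

	if (requested == 0)
		bands = ncpus < SKETCH_DEFAULT_BANDS ? ncpus : SKETCH_DEFAULT_BANDS;
	else
		bands = requested;

	/* Never return fewer than one band or more than the upper limit. */
	if (bands < 1)
		bands = 1;
	if (bands > SKETCH_MAX_BANDS)
		bands = SKETCH_MAX_BANDS;
	return bands;
}
```

With no option set, an 8-core machine would get 4 bands and a 2-core
machine would get 2; an explicit request for 16 would be clamped to 8.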

Thanks for taking the time to answer.

George


* Re: [PATCH] mac8390: fix pr_info() calls, was Re: another cleanup patch gone wrong
From: Joe Perches @ 2010-04-16  3:54 UTC
  To: Finn Thain; +Cc: David Miller, p_gortmaker, netdev, linux-kernel, linux-m68k
In-Reply-To: <alpine.OSX.2.00.1004161323340.271@localhost>

On Fri, 2010-04-16 at 13:45 +1000, Finn Thain wrote:
> Here's a patch, both uncalled-for and untested.
> Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
> 
> --- a/drivers/net/mac8390.c	2010-04-16 13:31:04.000000000 +1000
> +++ b/drivers/net/mac8390.c	2010-04-16 13:35:06.000000000 +1000
> @@ -643,8 +643,8 @@
>  {
>  	__ei_open(dev);
>  	if (request_irq(dev->irq, __ei_interrupt, 0, "8390 Ethernet", dev)) {
> -		pr_info("%s: unable to get IRQ %d.\n", dev->name, dev->irq);
> -		return -EAGAIN;
> +		pr_err("%s: unable to get IRQ %d.\n", dev->name, dev->irq);
> +		return -EBUSY;

You should document this in the changelog.

>  	}
>  	return 0;
>  }
> @@ -660,7 +660,7 @@
>  {
>  	ei_status.txing = 0;
>  	if (ei_debug > 1)
> -		pr_info("reset not supported\n");
> +		pr_debug("reset not supported\n");

You'll need to add
#define DEBUG
for this to print.

> -		pr_info("Need to reset the NS8390 t=%lu...", jiffies);
> +		pr_debug("Need to reset the NS8390 t=%lu...", jiffies);

This also now doesn't print.
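
A simplified user-space model of why the message disappears, loosely
following the shape of the kernel's pr_debug() fallback: without DEBUG
the call is compiled out, although its arguments are still type-checked.
The sketch_ names are hypothetical, and the sketch ignores
CONFIG_DYNAMIC_DEBUG, which can enable pr_debug() sites at run time.

```c
#include <stdio.h>

/* DEBUG is deliberately left undefined; add "#define DEBUG" above to print. */
#ifdef DEBUG
#define sketch_pr_debug(fmt, ...) printf("<7>" fmt, ##__VA_ARGS__)
#else
/* The printf() is never reached, so nothing is emitted and the value is 0. */
#define sketch_pr_debug(fmt, ...) (0 ? printf("<7>" fmt, ##__VA_ARGS__) : 0)
#endif

/* Mirrors the "if (ei_debug > 1)" guard in the patch; returns chars printed. */
int sketch_reset(int ei_debug)
{
	int printed = 0;

	if (ei_debug > 1)
		printed = sketch_pr_debug("reset not supported\n");
	return printed;
}
```

Built as shown, sketch_reset(2) returns 0 because the message is
compiled out even though the ei_debug check passes.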

cheers, Joe


* Re: [PATCH] mac8390: fix pr_info() calls, was Re: another cleanup patch gone wrong
From: Finn Thain @ 2010-04-16  3:59 UTC
  To: Joe Perches; +Cc: David Miller, p_gortmaker, netdev, linux-kernel, linux-m68k
In-Reply-To: <1271390080.2298.25.camel@Joe-Laptop.home>


On Thu, 15 Apr 2010, Joe Perches wrote:

> >  	return 0;
> >  }
> > @@ -660,7 +660,7 @@
> >  {
> >  	ei_status.txing = 0;
> >  	if (ei_debug > 1)
> > -		pr_info("reset not supported\n");
> > +		pr_debug("reset not supported\n");
> 
> You'll need to add
> #define DEBUG
> for this to print.
> 
> > -		pr_info("Need to reset the NS8390 t=%lu...", jiffies);
> > +		pr_debug("Need to reset the NS8390 t=%lu...", jiffies);
> 
> This also now doesn't print.
> 

Oops. Thanks for spotting that. I'll resend.

Finn

> cheers, Joe


* Re: Network multiqueue question
From: George B. @ 2010-04-16  4:00 UTC
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1271353637.16881.2846.camel@edumazet-laptop>

On Thu, Apr 15, 2010 at 10:47 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:



> Since this bothers me a bit, I will probably work on this in a near
> future. (adding real multiqueue capability and RCU to bonding fast
> paths)
>
> Ref: http://permalink.gmane.org/gmane.linux.network/152987

That would be great and you would have my sincere thanks.  And if
anyone is interested, what we do is take a pair of "top of rack"
switches and cluster them together so they appear as one switch.
Configure a LAG consisting of a port on each physical switch to a pair
of bonded interfaces on the server and use mode 2 bonding.  In normal
operation, both interfaces are active.  Should one switch experience a
power or interface failure, the server sees one of the interfaces fail
but just keeps working on the remaining interface.  There is no
"failover" event going on.

Thanks,

George


* Re: [PATCH] mac8390: fix pr_info() calls and change return code
From: Finn Thain @ 2010-04-16  4:21 UTC
  To: David Miller; +Cc: joe, p_gortmaker, netdev, linux-kernel, linux-m68k
In-Reply-To: <alpine.OSX.2.00.1004161323340.271@localhost>


Signed-off-by: Finn Thain <fthain@telegraphics.com.au>

--- a/drivers/net/mac8390.c	2010-04-16 13:31:04.000000000 +1000
+++ b/drivers/net/mac8390.c	2010-04-16 14:02:29.000000000 +1000
@@ -554,7 +554,7 @@
 	case MAC8390_APPLE:
 		switch (mac8390_testio(dev->mem_start)) {
 		case ACCESS_UNKNOWN:
-			pr_info("Don't know how to access card memory!\n");
+			pr_err("Don't know how to access card memory!\n");
 			return -ENODEV;
 			break;
 
@@ -643,8 +643,8 @@
 {
 	__ei_open(dev);
 	if (request_irq(dev->irq, __ei_interrupt, 0, "8390 Ethernet", dev)) {
-		pr_info("%s: unable to get IRQ %d.\n", dev->name, dev->irq);
-		return -EAGAIN;
+		pr_err("%s: unable to get IRQ %d.\n", dev->name, dev->irq);
+		return -EBUSY;
 	}
 	return 0;
 }
@@ -660,7 +660,7 @@
 {
 	ei_status.txing = 0;
 	if (ei_debug > 1)
-		pr_info("reset not supported\n");
+		printk(KERN_DEBUG "reset not supported\n");
 	return;
 }
 
@@ -668,11 +668,11 @@
 {
 	unsigned char *target = nubus_slot_addr(IRQ2SLOT(dev->irq));
 	if (ei_debug > 1)
-		pr_info("Need to reset the NS8390 t=%lu...", jiffies);
+		printk(KERN_DEBUG "Need to reset the NS8390 t=%lu...", jiffies);
 	ei_status.txing = 0;
 	target[0xC0000] = 0;
 	if (ei_debug > 1)
-		pr_cont("reset complete\n");
+		printk(KERN_CONT "reset complete\n");
 	return;
 }
 


* Re: [PATCH] mac8390: fix pr_info() calls and change return code
From: Joe Perches @ 2010-04-16  4:34 UTC
  To: Finn Thain; +Cc: David Miller, p_gortmaker, netdev, linux-kernel, linux-m68k
In-Reply-To: <alpine.OSX.2.00.1004161403160.271@localhost>

On Fri, 2010-04-16 at 14:21 +1000, Finn Thain wrote:
> Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
> 
> --- a/drivers/net/mac8390.c	2010-04-16 13:31:04.000000000 +1000
> +++ b/drivers/net/mac8390.c	2010-04-16 14:02:29.000000000 +1000
> @@ -554,7 +554,7 @@
>  	case MAC8390_APPLE:
>  		switch (mac8390_testio(dev->mem_start)) {
>  		case ACCESS_UNKNOWN:
> -			pr_info("Don't know how to access card memory!\n");
> +			pr_err("Don't know how to access card memory!\n");
>  			return -ENODEV;
>  			break;
>  
> @@ -643,8 +643,8 @@
>  {
>  	__ei_open(dev);
>  	if (request_irq(dev->irq, __ei_interrupt, 0, "8390 Ethernet", dev)) {
> -		pr_info("%s: unable to get IRQ %d.\n", dev->name, dev->irq);
> -		return -EAGAIN;
> +		pr_err("%s: unable to get IRQ %d.\n", dev->name, dev->irq);
> +		return -EBUSY;

You should document the reason for the
return code change in the changelog.
Why is it better to use -EBUSY?

>  	}
>  	return 0;
>  }
> @@ -660,7 +660,7 @@
>  {
>  	ei_status.txing = 0;
>  	if (ei_debug > 1)
> -		pr_info("reset not supported\n");
> +		printk(KERN_DEBUG "reset not supported\n");

It'd be better to prefix this with the driver name
or use something like netdev_dbg with #define DEBUG
otherwise it's "huh? what device emits this message?"
when reading the logs.

Something like:
	printk(KERN_DEBUG pr_fmt("reset not supported\n"));
or
#define DEBUG
	netdev_dbg(dev, "reset not supported\n");
or
#define DEBUG
	pr_debug("reset not supported\n");

>  	if (ei_debug > 1)
> -		pr_cont("reset complete\n");
> +		printk(KERN_CONT "reset complete\n");

unnecessary conversion.


* Re: Network multiqueue question
From: Eric Dumazet @ 2010-04-16  4:53 UTC
  To: George B.; +Cc: netdev
In-Reply-To: <j2gb65cae941004152100je5a3c3c9lba9e96ecb95bf04c@mail.gmail.com>

On Thursday 15 April 2010 at 21:00 -0700, George B. wrote:
> On Thu, Apr 15, 2010 at 10:47 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> 
> 
> > Since this bothers me a bit, I will probably work on this in a near
> > future. (adding real multiqueue capability and RCU to bonding fast
> > paths)
> >
> > Ref: http://permalink.gmane.org/gmane.linux.network/152987
> 
> That would be great and you would have my sincere thanks.  And if
> anyone is interested, what we do is take a pair of "top of rack"
> switches and cluster them together so they appear as one switch.
> Configure a LAG consisting of a port on each physical switch to a pair
> of bonded interfaces on the server and use mode 2 bonding.  In normal
> operation, both interfaces are active.  Should one switch experience a
> power or interface failure, the server sees one of the interfaces fail
> but just keeps working on the remaining interface.  There is no
> "failover" event going on.
> 

What kind of traffic do your machines manage, exactly?

On the server, do you use two ports of the same kind (same number of queues)?




* Re: rps perfomance WAS(Re: rps: question
From: Eric Dumazet @ 2010-04-16  5:18 UTC
  To: Changli Gao
  Cc: hadi, Rick Jones, David Miller, therbert, netdev, robert, andi
In-Reply-To: <n2p412e6f7f1004151656q5f3f2cbeh324a859b88688398@mail.gmail.com>

On Friday 16 April 2010 at 07:56 +0800, Changli Gao wrote:
> On Fri, Apr 16, 2010 at 4:16 AM, jamal <hadi@cyberus.ca> wrote:
> >
> > Sounds interesting.
> > Wikipedia information overload. Any arch description of the HP9000?
> > Did your scheme use IPIs to message the other CPUs?
> >
> 
> If you doubt the cost of smp_call_function_single(), how about trying
> another patch of mine, which implements something similar to RPS but
> uses kernel threads instead, so no explicit IPI.
> 
> http://patchwork.ozlabs.org/patch/38319/
> 
> 

Come on Changli.

How do you wake up a thread on a remote cpu ?

To answer Jamal's question, we need to measure the timing cost of
IPIs.

A kernel module might do this, and it could be integrated into perf bench
so that we can regression-test upcoming kernels.





* [PATCH v5] rfs: Receive Flow Steering
From: Tom Herbert @ 2010-04-16  5:47 UTC
  To: davem, netdev, eric.dumazet

Version 5 of RFS:
- Moved rps_sock_flow_sysctl into net/core/sysctl_net_core.c as a
static function.
- Apply limits to rps_sock_flow_entries sysctl and rps_flow_count
sysfs variable.
---
This patch implements receive flow steering (RFS).  RFS steers
received packets for layer 3 and 4 processing to the CPU where
the application for the corresponding flow is running.  RFS is an
extension of Receive Packet Steering (RPS).

The basic idea of RFS is that when an application calls recvmsg
(or sendmsg) the application's running CPU is stored in a hash
table that is indexed by the connection's rxhash which is stored in
the socket structure.  The rxhash is passed in skb's received on
the connection from netif_receive_skb.  For each received packet,
the associated rxhash is used to look up the CPU in the hash table;
if a valid CPU is set, the packet is steered to that CPU using
the RPS mechanisms.
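
A minimal user-space model of that lookup, assuming a fixed-size table
and a caller that passes the CPU number explicitly (the patch itself
uses raw_smp_processor_id() and an RCU-protected global table). All
sketch_ names are hypothetical.

```c
#include <stdint.h>

#define SKETCH_NO_CPU  0xffff	/* same sentinel value as RPS_NO_CPU in the patch */
#define SKETCH_ENTRIES 8	/* must be a power of two so the mask works */

struct sketch_sock_flow_table {
	unsigned int mask;
	uint16_t ents[SKETCH_ENTRIES];
};

/* Mark every slot as "no CPU recorded yet". */
void sketch_table_init(struct sketch_sock_flow_table *t)
{
	unsigned int i;

	t->mask = SKETCH_ENTRIES - 1;
	for (i = 0; i < SKETCH_ENTRIES; i++)
		t->ents[i] = SKETCH_NO_CPU;
}

/* recvmsg/sendmsg side: remember the CPU the application is running on. */
void sketch_record_flow(struct sketch_sock_flow_table *t, uint32_t rxhash,
			uint16_t cpu)
{
	if (rxhash)	/* a hash of 0 is treated as "no hash known" */
		t->ents[rxhash & t->mask] = cpu;
}

/* Receive side: look up the CPU the application would like packets on. */
uint16_t sketch_desired_cpu(const struct sketch_sock_flow_table *t,
			    uint32_t rxhash)
{
	return rxhash ? t->ents[rxhash & t->mask] : SKETCH_NO_CPU;
}
```

Because the table is indexed by the low bits of the hash, two flows
whose hashes share those bits also share a slot; the entry is only a
hint, and the last writer wins.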

The complication with this simple approach is that it could allow
out-of-order (OOO) packets.  If threads are thrashing around CPUs or
multiple threads are trying to read from the same sockets, a quickly
changing CPU value in the hash table could cause rampant OOO packets--
we consider this a non-starter.

To avoid OOO packets, this solution implements two types of hash
tables: rps_sock_flow_table and rps_dev_flow_table.

rps_sock_flow_table is a global hash table.  Each entry is just a CPU
number and it is populated in recvmsg and sendmsg as described above.
This table contains the "desired" CPUs for flows.

rps_dev_flow_table is specific to each device queue.  Each entry
contains a CPU and a tail queue counter.  The CPU is the "current"
CPU for a matching flow.  The tail queue counter holds the value
of a tail queue counter for the associated CPU's backlog queue at
the time of last enqueue for a flow matching the entry.

Each backlog queue has a queue head counter which is incremented
on dequeue, and so a queue tail counter is computed as queue head
count + queue length.  When a packet is enqueued on a backlog queue,
the current value of the queue tail counter is saved in the hash
entry of the rps_dev_flow_table.
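
The counter bookkeeping can be modelled in a few lines. This is a
sketch only: it tracks counts rather than real skbs, and the struct and
function names are hypothetical.

```c
struct sketch_backlog {
	unsigned int head;	/* incremented on every dequeue */
	unsigned int qlen;	/* packets currently queued */
};

/* Enqueue one packet and return the tail counter to store in the flow entry. */
unsigned int sketch_enqueue(struct sketch_backlog *q)
{
	q->qlen++;
	return q->head + q->qlen;
}

/* Dequeue one packet if there is one; returns 1 if a packet was removed. */
int sketch_dequeue(struct sketch_backlog *q)
{
	if (q->qlen == 0)
		return 0;
	q->qlen--;
	q->head++;
	return 1;
}
```

Once head has caught up with a saved tail value, every packet that was
queued at or before that enqueue has left the queue.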

And now the trick: when selecting the CPU for RPS (get_rps_cpu)
the rps_sock_flow table and the rps_dev_flow table for the RX queue
are consulted.  When the desired CPU for the flow (found in the
rps_sock_flow table) does not match the current CPU (found in the
rps_dev_flow table), the current CPU is changed to the desired CPU
if one of the following is true:

- The current CPU is unset (equal to RPS_NO_CPU)
- The current CPU is offline
- The current CPU's queue head counter >= queue tail counter in the
rps_dev_flow table.  This checks if the queue tail has advanced
beyond the last packet that was enqueued using this table entry.
This guarantees that all packets queued using this entry have been
dequeued, thus preserving in order delivery.

Making each queue have its own rps_dev_flow table has two advantages:
1) the tail queue counters will be written on each receive, so
keeping the table local to the interrupting CPU is good for locality.
2) this allows lockless access to the table.  The CPU number and queue
tail counter need to be accessed together under mutual exclusion
from netif_receive_skb; we assume that this is only called from
device napi_poll, which is non-reentrant.

This patch implements RFS for TCP and connected UDP sockets.
It should be usable for other flow oriented protocols.

There are two configuration parameters for RFS.  The
"rps_flow_entries" kernel init parameter sets the number of
entries in the rps_sock_flow_table; the per-rxqueue sysfs entry
"rps_flow_cnt" contains the number of entries in the rps_dev_flow
table for the rxqueue.  Both are rounded to power of two.
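
The rounding matters because both tables are indexed with
"hash & mask".  The sketch below shows the idea in plain C; the rounding
code the patch actually uses is not shown in this part of the posting,
so the helper names here are hypothetical and 32-bit unsigned ints are
assumed.

```c
/* Round n up to the next power of two; returns 0 if n is 0 or too large. */
unsigned int sketch_roundup_pow2(unsigned int n)
{
	unsigned int p = 1;

	if (n == 0 || n > (1u << 31))
		return 0;
	while (p < n)
		p <<= 1;
	return p;
}

/* With a power-of-two table size, "hash & mask" replaces "hash % size". */
unsigned int sketch_flow_index(unsigned int hash, unsigned int table_size)
{
	return hash & (table_size - 1);
}
```

A requested size of 1000 entries would become 1024 under this scheme,
giving a mask of 1023.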

The obvious benefit of RFS (over just RPS) is that it achieves
CPU locality between the receive processing for a flow and the
applications processing; this can result in increased performance
(higher pps, lower latency).

The benefits of RFS are dependent on cache hierarchy, application
load, and other factors.  On simple benchmarks, we don't necessarily
see improvement and sometimes see degradation.  However, for more
complex benchmarks and for applications where cache pressure is
much higher this technique seems to perform very well.

Below are some benchmark results which show the potential benefit of
this patch.  The netperf test has 500 instances of netperf TCP_RR
test with 1 byte req. and resp.  The RPC test is a request/response
test similar in structure to the netperf RR test, with 100 threads on
each host, but it does more work in userspace than netperf.

e1000e on 8 core Intel
   No RFS or RPS                104K tps at 30% CPU
   No RFS (best RPS config):    290K tps at 63% CPU
   RFS                          303K tps at 61% CPU

RPC test       tps     CPU%    50/90/99% usec latency  Latency StdDev
  No RFS/RPS   103K    48%     757/900/3185            4472.35
  RPS only:    174K    73%     415/993/2468            491.66
  RFS          223K    73%     379/651/1382            315.61

Signed-off-by: Tom Herbert <therbert@google.com>
---
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 55c2086..649a025 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -530,14 +530,73 @@ struct rps_map {
 };
 #define RPS_MAP_SIZE(_num) (sizeof(struct rps_map) + (_num * sizeof(u16)))
 
+/*
+ * The rps_dev_flow structure contains the mapping of a flow to a CPU and the
+ * tail pointer for that CPU's input queue at the time of last enqueue.
+ */
+struct rps_dev_flow {
+	u16 cpu;
+	u16 fill;
+	unsigned int last_qtail;
+};
+
+/*
+ * The rps_dev_flow_table structure contains a table of flow mappings.
+ */
+struct rps_dev_flow_table {
+	unsigned int mask;
+	struct rcu_head rcu;
+	struct work_struct free_work;
+	struct rps_dev_flow flows[0];
+};
+#define RPS_DEV_FLOW_TABLE_SIZE(_num) (sizeof(struct rps_dev_flow_table) + \
+    (_num * sizeof(struct rps_dev_flow)))
+
+/*
+ * The rps_sock_flow_table contains mappings of flows to the last CPU
+ * on which they were processed by the application (set in recvmsg).
+ */
+struct rps_sock_flow_table {
+	unsigned int mask;
+	u16 ents[0];
+};
+#define	RPS_SOCK_FLOW_TABLE_SIZE(_num) (sizeof(struct rps_sock_flow_table) + \
+    (_num * sizeof(u16)))
+
+#define RPS_NO_CPU 0xffff
+
+static inline void rps_record_sock_flow(struct rps_sock_flow_table *table,
+					u32 hash)
+{
+	if (table && hash) {
+		unsigned int cpu, index = hash & table->mask;
+
+		/* We only give a hint, preemption can change cpu under us */
+		cpu = raw_smp_processor_id();
+
+		if (table->ents[index] != cpu)
+			table->ents[index] = cpu;
+	}
+}
+
+static inline void rps_reset_sock_flow(struct rps_sock_flow_table *table,
+				       u32 hash)
+{
+	if (table && hash)
+		table->ents[hash & table->mask] = RPS_NO_CPU;
+}
+
+extern struct rps_sock_flow_table *rps_sock_flow_table;
+
 /* This structure contains an instance of an RX queue. */
 struct netdev_rx_queue {
 	struct rps_map *rps_map;
+	struct rps_dev_flow_table *rps_flow_table;
 	struct kobject kobj;
 	struct netdev_rx_queue *first;
 	atomic_t count;
 } ____cacheline_aligned_in_smp;
-#endif
+#endif /* CONFIG_RPS */
 
 /*
  * This structure defines the management hooks for network devices.
@@ -1333,11 +1392,19 @@ struct softnet_data {
 	/* Elements below can be accessed between CPUs for RPS */
 #ifdef CONFIG_RPS
 	struct call_single_data	csd ____cacheline_aligned_in_smp;
+	unsigned int		input_queue_head;
 #endif
 	struct sk_buff_head	input_pkt_queue;
 	struct napi_struct	backlog;
 };
 
+static inline void incr_input_queue_head(struct softnet_data *queue)
+{
+#ifdef CONFIG_RPS
+	queue->input_queue_head++;
+#endif
+}
+
 DECLARE_PER_CPU_ALIGNED(struct softnet_data, softnet_data);
 
 #define HAVE_NETIF_QUEUE
diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index 83fd344..b487bc1 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -21,6 +21,7 @@
 #include <linux/string.h>
 #include <linux/types.h>
 #include <linux/jhash.h>
+#include <linux/netdevice.h>
 
 #include <net/flow.h>
 #include <net/sock.h>
@@ -101,6 +102,7 @@ struct rtable;
  * @uc_ttl - Unicast TTL
  * @inet_sport - Source port
  * @inet_id - ID counter for DF pkts
+ * @rxhash - flow hash received from netif layer
  * @tos - TOS
  * @mc_ttl - Multicasting TTL
  * @is_icsk - is this an inet_connection_sock?
@@ -124,6 +126,9 @@ struct inet_sock {
 	__u16			cmsg_flags;
 	__be16			inet_sport;
 	__u16			inet_id;
+#ifdef CONFIG_RPS
+	__u32			rxhash;
+#endif
 
 	struct ip_options	*opt;
 	__u8			tos;
@@ -219,4 +224,37 @@ static inline __u8 inet_sk_flowi_flags(const struct sock *sk)
 	return inet_sk(sk)->transparent ? FLOWI_FLAG_ANYSRC : 0;
 }
 
+static inline void inet_rps_record_flow(const struct sock *sk)
+{
+#ifdef CONFIG_RPS
+	struct rps_sock_flow_table *sock_flow_table;
+
+	rcu_read_lock();
+	sock_flow_table = rcu_dereference(rps_sock_flow_table);
+	rps_record_sock_flow(sock_flow_table, inet_sk(sk)->rxhash);
+	rcu_read_unlock();
+#endif
+}
+
+static inline void inet_rps_reset_flow(const struct sock *sk)
+{
+#ifdef CONFIG_RPS
+	struct rps_sock_flow_table *sock_flow_table;
+
+	rcu_read_lock();
+	sock_flow_table = rcu_dereference(rps_sock_flow_table);
+	rps_reset_sock_flow(sock_flow_table, inet_sk(sk)->rxhash);
+	rcu_read_unlock();
+#endif
+}
+
+static inline void inet_rps_save_rxhash(const struct sock *sk, u32 rxhash)
+{
+#ifdef CONFIG_RPS
+	if (unlikely(inet_sk(sk)->rxhash != rxhash)) {
+		inet_rps_reset_flow(sk);
+		inet_sk(sk)->rxhash = rxhash;
+	}
+#endif
+}
 #endif	/* _INET_SOCK_H */
diff --git a/net/core/dev.c b/net/core/dev.c
index e8041eb..d7107ac 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2203,19 +2203,28 @@ int weight_p __read_mostly = 64;            /* old backlog weight */
 DEFINE_PER_CPU(struct netif_rx_stats, netdev_rx_stat) = { 0, };
 
 #ifdef CONFIG_RPS
+
+/* One global table that all flow-based protocols share. */
+struct rps_sock_flow_table *rps_sock_flow_table;
+EXPORT_SYMBOL(rps_sock_flow_table);
+
 /*
  * get_rps_cpu is called from netif_receive_skb and returns the target
  * CPU from the RPS map of the receiving queue for a given skb.
  * rcu_read_lock must be held on entry.
  */
-static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb)
+static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
+		       struct rps_dev_flow **rflowp)
 {
 	struct ipv6hdr *ip6;
 	struct iphdr *ip;
 	struct netdev_rx_queue *rxqueue;
 	struct rps_map *map;
+	struct rps_dev_flow_table *flow_table;
+	struct rps_sock_flow_table *sock_flow_table;
 	int cpu = -1;
 	u8 ip_proto;
+	u16 tcpu;
 	u32 addr1, addr2, ports, ihl;
 
 	if (skb_rx_queue_recorded(skb)) {
@@ -2232,7 +2241,7 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb)
 	} else
 		rxqueue = dev->_rx;
 
-	if (!rxqueue->rps_map)
+	if (!rxqueue->rps_map && !rxqueue->rps_flow_table)
 		goto done;
 
 	if (skb->rxhash)
@@ -2284,9 +2293,48 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb)
 		skb->rxhash = 1;
 
 got_hash:
+	flow_table = rcu_dereference(rxqueue->rps_flow_table);
+	sock_flow_table = rcu_dereference(rps_sock_flow_table);
+	if (flow_table && sock_flow_table) {
+		u16 next_cpu;
+		struct rps_dev_flow *rflow;
+
+		rflow = &flow_table->flows[skb->rxhash & flow_table->mask];
+		tcpu = rflow->cpu;
+
+		next_cpu = sock_flow_table->ents[skb->rxhash &
+		    sock_flow_table->mask];
+
+		/*
+		 * If the desired CPU (where last recvmsg was done) is
+		 * different from current CPU (one in the rx-queue flow
+		 * table entry), switch if one of the following holds:
+		 *   - Current CPU is unset (equal to RPS_NO_CPU).
+		 *   - Current CPU is offline.
+		 *   - The current CPU's queue tail has advanced beyond the
+		 *     last packet that was enqueued using this table entry.
+		 *     This guarantees that all previous packets for the flow
+		 *     have been dequeued, thus preserving in order delivery.
+		 */
+		if (unlikely(tcpu != next_cpu) &&
+		    (tcpu == RPS_NO_CPU || !cpu_online(tcpu) ||
+		     ((int)(per_cpu(softnet_data, tcpu).input_queue_head -
+		      rflow->last_qtail)) >= 0)) {
+			tcpu = rflow->cpu = next_cpu;
+			if (tcpu != RPS_NO_CPU)
+				rflow->last_qtail = per_cpu(softnet_data,
+				    tcpu).input_queue_head;
+		}
+		if (tcpu != RPS_NO_CPU && cpu_online(tcpu)) {
+			*rflowp = rflow;
+			cpu = tcpu;
+			goto done;
+		}
+	}
+
 	map = rcu_dereference(rxqueue->rps_map);
 	if (map) {
-		u16 tcpu = map->cpus[((u64) skb->rxhash * map->len) >> 32];
+		tcpu = map->cpus[((u64) skb->rxhash * map->len) >> 32];
 
 		if (cpu_online(tcpu)) {
 			cpu = tcpu;
@@ -2320,13 +2368,14 @@ static void trigger_softirq(void *data)
 	__napi_schedule(&queue->backlog);
 	__get_cpu_var(netdev_rx_stat).received_rps++;
 }
-#endif /* CONFIG_SMP */
+#endif /* CONFIG_RPS */
 
 /*
  * enqueue_to_backlog is called to queue an skb to a per CPU backlog
  * queue (may be a remote CPU queue).
  */
-static int enqueue_to_backlog(struct sk_buff *skb, int cpu)
+static int enqueue_to_backlog(struct sk_buff *skb, int cpu,
+			      unsigned int *qtail)
 {
 	struct softnet_data *queue;
 	unsigned long flags;
@@ -2341,6 +2390,10 @@ static int enqueue_to_backlog(struct sk_buff *skb, int cpu)
 		if (queue->input_pkt_queue.qlen) {
 enqueue:
 			__skb_queue_tail(&queue->input_pkt_queue, skb);
+#ifdef CONFIG_RPS
+			*qtail = queue->input_queue_head +
+			    queue->input_pkt_queue.qlen;
+#endif
 			rps_unlock(queue);
 			local_irq_restore(flags);
 			return NET_RX_SUCCESS;
@@ -2355,11 +2408,10 @@ enqueue:
 
 				cpu_set(cpu, rcpus->mask[rcpus->select]);
 				__raise_softirq_irqoff(NET_RX_SOFTIRQ);
-			} else
-				__napi_schedule(&queue->backlog);
-#else
-			__napi_schedule(&queue->backlog);
+				goto enqueue;
+			}
 #endif
+			__napi_schedule(&queue->backlog);
 		}
 		goto enqueue;
 	}
@@ -2401,18 +2453,25 @@ int netif_rx(struct sk_buff *skb)
 
 #ifdef CONFIG_RPS
 	{
+		struct rps_dev_flow voidflow, *rflow = &voidflow;
 		int cpu;
 
 		rcu_read_lock();
-		cpu = get_rps_cpu(skb->dev, skb);
+
+		cpu = get_rps_cpu(skb->dev, skb, &rflow);
 		if (cpu < 0)
 			cpu = smp_processor_id();
-		ret = enqueue_to_backlog(skb, cpu);
+
+		ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
+
 		rcu_read_unlock();
 	}
 #else
-	ret = enqueue_to_backlog(skb, get_cpu());
-	put_cpu();
+	{
+		unsigned int qtail;
+		ret = enqueue_to_backlog(skb, get_cpu(), &qtail);
+		put_cpu();
+	}
 #endif
 	return ret;
 }
@@ -2830,14 +2889,22 @@ out:
 int netif_receive_skb(struct sk_buff *skb)
 {
 #ifdef CONFIG_RPS
-	int cpu;
+	struct rps_dev_flow voidflow, *rflow = &voidflow;
+	int cpu, ret;
+
+	rcu_read_lock();
 
-	cpu = get_rps_cpu(skb->dev, skb);
+	cpu = get_rps_cpu(skb->dev, skb, &rflow);
 
-	if (cpu < 0)
-		return __netif_receive_skb(skb);
-	else
-		return enqueue_to_backlog(skb, cpu);
+	if (cpu >= 0) {
+		ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
+		rcu_read_unlock();
+	} else {
+		rcu_read_unlock();
+		ret = __netif_receive_skb(skb);
+	}
+
+	return ret;
 #else
 	return __netif_receive_skb(skb);
 #endif
@@ -2856,6 +2923,7 @@ static void flush_backlog(void *arg)
 		if (skb->dev == dev) {
 			__skb_unlink(skb, &queue->input_pkt_queue);
 			kfree_skb(skb);
+			incr_input_queue_head(queue);
 		}
 	rps_unlock(queue);
 }
@@ -3179,6 +3247,7 @@ static int process_backlog(struct napi_struct *napi, int quota)
 			local_irq_enable();
 			break;
 		}
+		incr_input_queue_head(queue);
 		rps_unlock(queue);
 		local_irq_enable();
 
@@ -5542,8 +5611,10 @@ static int dev_cpu_callback(struct notifier_block *nfb,
 	local_irq_enable();
 
 	/* Process offline CPU's input_pkt_queue */
-	while ((skb = __skb_dequeue(&oldsd->input_pkt_queue)))
+	while ((skb = __skb_dequeue(&oldsd->input_pkt_queue))) {
 		netif_rx(skb);
+		incr_input_queue_head(oldsd);
+	}
 
 	return NOTIFY_OK;
 }
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 96ed690..f0f1bb7 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -601,22 +601,109 @@ ssize_t store_rps_map(struct netdev_rx_queue *queue,
 	return len;
 }
 
+static ssize_t show_rps_dev_flow_table_cnt(struct netdev_rx_queue *queue,
+					   struct rx_queue_attribute *attr,
+					   char *buf)
+{
+	struct rps_dev_flow_table *flow_table;
+	unsigned int val = 0;
+
+	rcu_read_lock();
+	flow_table = rcu_dereference(queue->rps_flow_table);
+	if (flow_table)
+		val = flow_table->mask + 1;
+	rcu_read_unlock();
+
+	return sprintf(buf, "%u\n", val);
+}
+
+static void rps_dev_flow_table_release_work(struct work_struct *work)
+{
+	struct rps_dev_flow_table *table = container_of(work,
+	    struct rps_dev_flow_table, free_work);
+
+	vfree(table);
+}
+
+static void rps_dev_flow_table_release(struct rcu_head *rcu)
+{
+	struct rps_dev_flow_table *table = container_of(rcu,
+	    struct rps_dev_flow_table, rcu);
+
+	INIT_WORK(&table->free_work, rps_dev_flow_table_release_work);
+	schedule_work(&table->free_work);
+}
+
+ssize_t store_rps_dev_flow_table_cnt(struct netdev_rx_queue *queue,
+				     struct rx_queue_attribute *attr,
+				     const char *buf, size_t len)
+{
+	unsigned int count;
+	char *endp;
+	struct rps_dev_flow_table *table, *old_table;
+	static DEFINE_SPINLOCK(rps_dev_flow_lock);
+
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	count = simple_strtoul(buf, &endp, 0);
+	if (endp == buf)
+		return -EINVAL;
+
+	if (count) {
+		int i;
+
+		if (count > 1<<30) {
+			/* Enforce a limit to prevent overflow */
+			return -EINVAL;
+		}
+		count = roundup_pow_of_two(count);
+		table = vmalloc(RPS_DEV_FLOW_TABLE_SIZE(count));
+		if (!table)
+			return -ENOMEM;
+
+		table->mask = count - 1;
+		for (i = 0; i < count; i++)
+			table->flows[i].cpu = RPS_NO_CPU;
+	} else
+		table = NULL;
+
+	spin_lock(&rps_dev_flow_lock);
+	old_table = queue->rps_flow_table;
+	rcu_assign_pointer(queue->rps_flow_table, table);
+	spin_unlock(&rps_dev_flow_lock);
+
+	if (old_table)
+		call_rcu(&old_table->rcu, rps_dev_flow_table_release);
+
+	return len;
+}
+
 static struct rx_queue_attribute rps_cpus_attribute =
 	__ATTR(rps_cpus, S_IRUGO | S_IWUSR, show_rps_map, store_rps_map);
 
+
+static struct rx_queue_attribute rps_dev_flow_table_cnt_attribute =
+	__ATTR(rps_flow_cnt, S_IRUGO | S_IWUSR,
+	    show_rps_dev_flow_table_cnt, store_rps_dev_flow_table_cnt);
+
 static struct attribute *rx_queue_default_attrs[] = {
 	&rps_cpus_attribute.attr,
+	&rps_dev_flow_table_cnt_attribute.attr,
 	NULL
 };
 
 static void rx_queue_release(struct kobject *kobj)
 {
 	struct netdev_rx_queue *queue = to_rx_queue(kobj);
-	struct rps_map *map = queue->rps_map;
 	struct netdev_rx_queue *first = queue->first;
 
-	if (map)
-		call_rcu(&map->rcu, rps_map_release);
+	if (queue->rps_map)
+		call_rcu(&queue->rps_map->rcu, rps_map_release);
+
+	if (queue->rps_flow_table)
+		call_rcu(&queue->rps_flow_table->rcu,
+		    rps_dev_flow_table_release);
 
 	if (atomic_dec_and_test(&first->count))
 		kfree(first);
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index b7b6b82..e023c93 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -17,6 +17,65 @@
 #include <net/ip.h>
 #include <net/sock.h>
 
+#ifdef CONFIG_RPS
+static int rps_sock_flow_sysctl(ctl_table *table, int write,
+				void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	unsigned int orig_size, size;
+	int ret, i;
+	ctl_table tmp = {
+		.data = &size,
+		.maxlen = sizeof(size),
+		.mode = table->mode
+	};
+	struct rps_sock_flow_table *orig_sock_table, *sock_table;
+	static DEFINE_MUTEX(sock_flow_mutex);
+
+	mutex_lock(&sock_flow_mutex);
+
+	orig_sock_table = rps_sock_flow_table;
+	size = orig_size = orig_sock_table ? orig_sock_table->mask + 1 : 0;
+
+	ret = proc_dointvec(&tmp, write, buffer, lenp, ppos);
+
+	if (write) {
+		if (size) {
+			if (size > 1<<30) {
+				/* Enforce limit to prevent overflow */
+				mutex_unlock(&sock_flow_mutex);
+				return -EINVAL;
+			}
+			size = roundup_pow_of_two(size);
+			if (size != orig_size) {
+				sock_table =
+				    vmalloc(RPS_SOCK_FLOW_TABLE_SIZE(size));
+				if (!sock_table) {
+					mutex_unlock(&sock_flow_mutex);
+					return -ENOMEM;
+				}
+
+				sock_table->mask = size - 1;
+			} else
+				sock_table = orig_sock_table;
+
+			for (i = 0; i < size; i++)
+				sock_table->ents[i] = RPS_NO_CPU;
+		} else
+			sock_table = NULL;
+
+		if (sock_table != orig_sock_table) {
+			rcu_assign_pointer(rps_sock_flow_table, sock_table);
+			synchronize_rcu();
+			vfree(orig_sock_table);
+		}
+	}
+
+	mutex_unlock(&sock_flow_mutex);
+
+	return ret;
+}
+#endif /* CONFIG_RPS */
+
 static struct ctl_table net_core_table[] = {
 #ifdef CONFIG_NET
 	{
@@ -82,6 +141,14 @@ static struct ctl_table net_core_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec
 	},
+#ifdef CONFIG_RPS
+	{
+		.procname	= "rps_sock_flow_entries",
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= rps_sock_flow_sysctl
+	},
+#endif
 #endif /* CONFIG_NET */
 	{
 		.procname	= "netdev_budget",
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 193dcd6..c5376c7 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -419,6 +419,8 @@ int inet_release(struct socket *sock)
 	if (sk) {
 		long timeout;
 
+		inet_rps_reset_flow(sk);
+
 		/* Applications forget to leave groups before exiting */
 		ip_mc_drop_socket(sk);
 
@@ -720,6 +722,8 @@ int inet_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
 {
 	struct sock *sk = sock->sk;
 
+	inet_rps_record_flow(sk);
+
 	/* We may need to bind the socket. */
 	if (!inet_sk(sk)->inet_num && inet_autobind(sk))
 		return -EAGAIN;
@@ -728,12 +732,13 @@ int inet_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
 }
 EXPORT_SYMBOL(inet_sendmsg);
 
-
 static ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
 			     size_t size, int flags)
 {
 	struct sock *sk = sock->sk;
 
+	inet_rps_record_flow(sk);
+
 	/* We may need to bind the socket. */
 	if (!inet_sk(sk)->inet_num && inet_autobind(sk))
 		return -EAGAIN;
@@ -743,6 +748,22 @@ static ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
 	return sock_no_sendpage(sock, page, offset, size, flags);
 }
 
+int inet_recvmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
+		 size_t size, int flags)
+{
+	struct sock *sk = sock->sk;
+	int addr_len = 0;
+	int err;
+
+	inet_rps_record_flow(sk);
+
+	err = sk->sk_prot->recvmsg(iocb, sk, msg, size, flags & MSG_DONTWAIT,
+				   flags & ~MSG_DONTWAIT, &addr_len);
+	if (err >= 0)
+		msg->msg_namelen = addr_len;
+	return err;
+}
+EXPORT_SYMBOL(inet_recvmsg);
 
 int inet_shutdown(struct socket *sock, int how)
 {
@@ -872,7 +893,7 @@ const struct proto_ops inet_stream_ops = {
 	.setsockopt	   = sock_common_setsockopt,
 	.getsockopt	   = sock_common_getsockopt,
 	.sendmsg	   = tcp_sendmsg,
-	.recvmsg	   = sock_common_recvmsg,
+	.recvmsg	   = inet_recvmsg,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = tcp_sendpage,
 	.splice_read	   = tcp_splice_read,
@@ -899,7 +920,7 @@ const struct proto_ops inet_dgram_ops = {
 	.setsockopt	   = sock_common_setsockopt,
 	.getsockopt	   = sock_common_getsockopt,
 	.sendmsg	   = inet_sendmsg,
-	.recvmsg	   = sock_common_recvmsg,
+	.recvmsg	   = inet_recvmsg,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = inet_sendpage,
 #ifdef CONFIG_COMPAT
@@ -929,7 +950,7 @@ static const struct proto_ops inet_sockraw_ops = {
 	.setsockopt	   = sock_common_setsockopt,
 	.getsockopt	   = sock_common_getsockopt,
 	.sendmsg	   = inet_sendmsg,
-	.recvmsg	   = sock_common_recvmsg,
+	.recvmsg	   = inet_recvmsg,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = inet_sendpage,
 #ifdef CONFIG_COMPAT
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index a24995c..ad08392 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1672,6 +1672,8 @@ process:
 
 	skb->dev = NULL;
 
+	inet_rps_save_rxhash(sk, skb->rxhash);
+
 	bh_lock_sock_nested(sk);
 	ret = 0;
 	if (!sock_owned_by_user(sk)) {
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 8fef859..666b963 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1217,6 +1217,7 @@ int udp_disconnect(struct sock *sk, int flags)
 	sk->sk_state = TCP_CLOSE;
 	inet->inet_daddr = 0;
 	inet->inet_dport = 0;
+	inet_rps_save_rxhash(sk, 0);
 	sk->sk_bound_dev_if = 0;
 	if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
 		inet_reset_saddr(sk);
@@ -1258,8 +1259,12 @@ EXPORT_SYMBOL(udp_lib_unhash);
 
 static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 {
-	int rc = sock_queue_rcv_skb(sk, skb);
+	int rc;
+
+	if (inet_sk(sk)->inet_daddr)
+		inet_rps_save_rxhash(sk, skb->rxhash);
 
+	rc = sock_queue_rcv_skb(sk, skb);
 	if (rc < 0) {
 		int is_udplite = IS_UDPLITE(sk);
 

^ permalink raw reply related
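
The in-order delivery test in get_rps_cpu() above compares two free-running unsigned counters through a signed difference, so it should keep working after the counters wrap. A minimal user-space sketch of that comparison, assuming 32-bit counters; the helper name queue_head_passed() is hypothetical and does not appear in the patch:

```c
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patch): returns 1 when the dequeue
 * counter `head` has reached or passed `last_qtail`, meaning every packet
 * queued up to `last_qtail` has already been dequeued.
 *
 * The subtraction is done in unsigned arithmetic, which wraps modulo 2^32,
 * and the result is then read as a signed value. This assumes the two
 * counters are never more than 2^31 apart and that the conversion to
 * int32_t behaves as two's complement, as it does on GCC and Clang.
 */
static inline int queue_head_passed(uint32_t head, uint32_t last_qtail)
{
	return (int32_t)(head - last_qtail) >= 0;
}
```

With these assumptions, queue_head_passed(100, 90) is true, queue_head_passed(90, 100) is false, and queue_head_passed(5, 0xfffffff0) is still true even though the head counter has wrapped past zero.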

* Re: [PATCH] mac8390: fix pr_info() calls and change return code
From: David Miller @ 2010-04-16  5:53 UTC (permalink / raw)
  To: fthain; +Cc: joe, p_gortmaker, netdev, linux-kernel, linux-m68k
In-Reply-To: <alpine.OSX.2.00.1004161403160.271@localhost>

From: Finn Thain <fthain@telegraphics.com.au>
Date: Fri, 16 Apr 2010 14:21:00 +1000 (EST)

>  
> @@ -668,11 +668,11 @@
>  {
>  	unsigned char *target = nubus_slot_addr(IRQ2SLOT(dev->irq));
>  	if (ei_debug > 1)
> -		pr_info("Need to reset the NS8390 t=%lu...", jiffies);
> +		printk(KERN_DEBUG "Need to reset the NS8390 t=%lu...", jiffies);
>  	ei_status.txing = 0;
>  	target[0xC0000] = 0;
>  	if (ei_debug > 1)
> -		pr_cont("reset complete\n");
> +		printk(KERN_CONT "reset complete\n");
>  	return;

You're missing the whole point of using pr_info() et al., which is that
they include the bits we define for pr_fmt at the top of the file.

Also, you write absolutely no commit log message entry for your
change explaining why you're doing the things you are doing.

And finally you are doing two completely unrelated things at once
(changing error return values and changing log message levels).

^ permalink raw reply
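
The pr_fmt point above can be illustrated outside the kernel. The following is only a simplified user-space model, not the kernel's implementation: KBUILD_MODNAME is normally supplied by the build system (the value here is an assumption), and pr_info_to() is a hypothetical stand-in that formats into a buffer instead of calling printk().

```c
#include <stdio.h>

/* Assumed stand-in for the build-provided module name. */
#define KBUILD_MODNAME "mac8390"

/* Defined once at the top of the file, before any logging macro is used. */
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

/*
 * Hypothetical model of a pr_*() macro: the format string is passed through
 * pr_fmt(), so adjacent string literals are concatenated at compile time and
 * every message picks up the prefix. ##__VA_ARGS__ is a GNU extension
 * accepted by GCC and Clang.
 */
#define pr_info_to(buf, len, fmt, ...) \
	snprintf(buf, len, pr_fmt(fmt), ##__VA_ARGS__)
```

Under these assumptions, pr_info_to(buf, sizeof(buf), "Need to reset the NS8390 t=%lu...", 42UL) leaves "mac8390: Need to reset the NS8390 t=42..." in buf, whereas a bare printk()-style call with the same format would carry no prefix.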

* Re: rps perfomance WAS(Re: rps: question
From: Changli Gao @ 2010-04-16  6:02 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: hadi, Rick Jones, David Miller, therbert, netdev, robert, andi
In-Reply-To: <1271395106.16881.3645.camel@edumazet-laptop>

On Fri, Apr 16, 2010 at 1:18 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le vendredi 16 avril 2010 à 07:56 +0800, Changli Gao a écrit :
>> On Fri, Apr 16, 2010 at 4:16 AM, jamal <hadi@cyberus.ca> wrote:
>> >
>> > Sounds interesting.
>> > Wikipedia information overload. Any arch description of the HP9000?
>> > Did your scheme use IPIs to message the other CPUs?
>> >
>>
>> If you doubt the cost of smp_call_function_single(), how about having
>> a try with my another patch, which implements the similar of RPS, but
>> uses kernel threads instead, so no explicit IPI.
>>
>> http://patchwork.ozlabs.org/patch/38319/
>>
>>
>
> Come on Changli.
>
> How do you wake up a thread on a remote cpu ?
>

A resched IPI, apparently. But it is fully asynchronous, and its IRQ
handler is lighter.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* Re: rps perfomance WAS(Re: rps: question
From: Tom Herbert @ 2010-04-16  6:28 UTC (permalink / raw)
  To: Changli Gao
  Cc: Eric Dumazet, hadi, Rick Jones, David Miller, netdev, robert,
	andi
In-Reply-To: <o2v412e6f7f1004152302j1aca5edam9d53d01781ddbe9d@mail.gmail.com>

>> How do you wake up a thread on a remote cpu ?
>>
>
> resched IPI, apparently. But it is async absolutely. and its IRQ
> handler is lighter.
>
The IPI used in RPS is done asynchronously.

> --
> Regards,
> Changli Gao(xiaosuo@gmail.com)
>

^ permalink raw reply

* Re: rps perfomance WAS(Re: rps: question
From: Eric Dumazet @ 2010-04-16  6:32 UTC (permalink / raw)
  To: Changli Gao
  Cc: hadi, Rick Jones, David Miller, therbert, netdev, robert, andi
In-Reply-To: <o2v412e6f7f1004152302j1aca5edam9d53d01781ddbe9d@mail.gmail.com>

Le vendredi 16 avril 2010 à 14:02 +0800, Changli Gao a écrit :

> resched IPI, apparently. But it is async absolutely. and its IRQ
> handler is lighter.
> 

You still don't answer the question, and your claims are grounded not
in hard facts, but in your interpretation of the code.



^ permalink raw reply

* Re: [PATCH v5] rfs: Receive Flow Steering
From: David Miller @ 2010-04-16  6:33 UTC (permalink / raw)
  To: therbert; +Cc: netdev, eric.dumazet
In-Reply-To: <alpine.DEB.1.00.1004152243470.15102@pokey.mtv.corp.google.com>

From: Tom Herbert <therbert@google.com>
Date: Thu, 15 Apr 2010 22:47:08 -0700 (PDT)

> Version 5 of RFS:
> - Moved rps_sock_flow_sysctl into net/core/sysctl_net_core.c as a
> static function.
> - Apply limits to rps_sock_flow_entries sysctl and rps_flow_cnt
> sysfs variable.

I've read this over a few times and I think it's ready to go into
net-next-2.6, we can tweak things as-needed from here on out.

Eric, what do you think?

^ permalink raw reply
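
The limits mentioned in the v5 changelog can be sketched in user space as follows. This is a rough model under stated assumptions, not the patch's code: round_up_pow2() stands in for the kernel's roundup_pow_of_two(), flow_table_mask() and flow_index() are hypothetical helper names, and a count of zero is simply rejected here, whereas the patch treats zero as "remove the table".

```c
#include <stdint.h>

/* Same upper bound as the patch: keeps the rounded count within 32 bits. */
#define FLOW_TABLE_MAX_ENTRIES (1u << 30)

/* Smallest power of two >= n, valid for 1 <= n <= 2^31. */
static inline uint32_t round_up_pow2(uint32_t n)
{
	uint32_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/*
 * Hypothetical helper: validates a requested entry count and stores the
 * index mask (rounded size minus one). Returns 0 on success, -1 if the
 * count is zero or above the limit.
 */
static inline int flow_table_mask(uint32_t count, uint32_t *mask)
{
	if (count == 0 || count > FLOW_TABLE_MAX_ENTRIES)
		return -1;
	*mask = round_up_pow2(count) - 1;
	return 0;
}

/* With a power-of-two table size, a hash maps to a slot with one AND. */
static inline uint32_t flow_index(uint32_t hash, uint32_t mask)
{
	return hash & mask;
}
```

For example, a request for 1000 entries yields a mask of 1023 (a 1024-entry table), and a request for (1 << 30) + 1 entries is refused before any rounding takes place.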

* Re: [net-next-2.6 PATCH 1/3 v2] ipv6: cancel to setting local_df in ip6_xmit()
From: David Miller @ 2010-04-16  6:37 UTC (permalink / raw)
  To: shanwei
  Cc: herbert, emils.tantilov, kuznet, pekkas, jmorris, yoshfuji, kaber,
	eric.dumazet, netdev
In-Reply-To: <4BC7CDD2.5020004@cn.fujitsu.com>

From: Shan Wei <shanwei@cn.fujitsu.com>
Date: Fri, 16 Apr 2010 10:39:14 +0800

> commit f88037(sctp: Drop ipfargok in sctp_xmit function)
> has dropped ipfragok and set local_df value properly.
> 
> So the change of commit 77e2f1(ipv6: Fix ip6_xmit to 
> send fragments if ipfragok is true) is not needed. 
> So this patch removes it.
> 
> Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>

Applied.

^ permalink raw reply

* Re: [net-next-2.6 PATCH 2/3 v2] net: replace ipfragok with skb->local_df
From: David Miller @ 2010-04-16  6:37 UTC (permalink / raw)
  To: shanwei
  Cc: herbert, yinghai.lu, kuznet, pekkas, jmorris, yoshfuji, kaber,
	netdev, dccp, linux-sctp, jchapman, mostrows
In-Reply-To: <4BC7CEBC.70200@cn.fujitsu.com>

From: Shan Wei <shanwei@cn.fujitsu.com>
Date: Fri, 16 Apr 2010 10:43:08 +0800

> As Herbert Xu said: we should be able to simply replace ipfragok
> with skb->local_df. commit f88037(sctp: Drop ipfargok in sctp_xmit function)
> has dropped ipfragok and set local_df value properly.
> 
> The patch kills the ipfragok parameter of .queue_xmit().
> 
> 
> Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>

Applied.

^ permalink raw reply

* Re: [net-next-2.6 PATCH 3/3 v2] ipv6: fix the comment of ip6_xmit()
From: David Miller @ 2010-04-16  6:37 UTC (permalink / raw)
  To: shanwei; +Cc: netdev
In-Reply-To: <4BC7D010.7080008@cn.fujitsu.com>

From: Shan Wei <shanwei@cn.fujitsu.com>
Date: Fri, 16 Apr 2010 10:48:48 +0800

> 
> ip6_xmit() is used by upper transport protocol.
> 
> Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>

Applied.

^ permalink raw reply

* Re: [net-next-2.6 PATCH 2/2] net: replace ipfragok with skb->local_df
From: Herbert Xu @ 2010-04-16  6:43 UTC (permalink / raw)
  To: Shan Wei
  Cc: David Miller, yinghai.lu, kuznet, pekkas, jmorris,
	yoshfuji@linux-ipv6.org >> YOSHIFUJI Hideaki,
	Patrick McHardy, netdev@vger.kernel.org, dccp, linux-sctp,
	kleptog, jchapman, mostrows, acme
In-Reply-To: <4BC7CAC1.4000803@cn.fujitsu.com>

On Fri, Apr 16, 2010 at 10:26:09AM +0800, Shan Wei wrote:
> 
> Now, PPPoX/PPPoL2TP driver still use ip_queue_xmit to send packets with ipfragok == 1.
> So, now we can't remove the && ... bit. 

Huh? If they still call ip_queue_xmit with ipfragok then surely
the build will fail after your patch as it removes the ipfragok
argument?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH Resubmission v2] drivers/net/usb: Add new driver ipheth
From: David Miller @ 2010-04-16  6:44 UTC (permalink / raw)
  To: agimenez
  Cc: dgiagio, dborca, James.Bottomley, ralf, gregkh, jonas.sjoquist,
	torgny.johansson, steve.glendinning, dbrownell, omar.oberthur,
	remi.denis-courmont, netdev, linux-kernel, linux-usb
In-Reply-To: <1271360791-30312-1-git-send-email-agimenez@sysvalve.es>

From: L. Alberto Giménez <agimenez@sysvalve.es>
Date: Thu, 15 Apr 2010 21:46:29 +0200

> From: dborca@yahoo.com
> 
> Add new driver to use tethering with an iPhone device. After initial submission,
> apply fixes to fit the new driver into the kernel standards.
> 
> There are still a couple of minor (almost cosmetic-level) issues, but the driver
> is fully functional right now.
> 
> Signed-off-by: L. Alberto Giménez <agimenez@sysvalve.es>

I'm very confused about the authorship of this driver.

Who wrote it?

You added a "From: " line specifying Daniel Borca (btw,
when you add these "From: " lines you need to specify it in
the form "From: NAME <EMAIL>" not just "From: EMAIL" so in
this case we want to see "From: Daniel Borca <dborca@yahoo.com>")

The code itself gives copyright to Diego Giagio <diego@giagio.com>
and he is also the one listed in the MODULE_AUTHOR().

And you're the one submitting the code, and also the only person
actually giving a signoff in the commit message.

It's too confusing and ambiguous, and if there are any problems
down the road the last thing we need is for the authorship to
be ambiguous.

I would really appreciate it if the authorship was clearly stated, and
the actual author of the code actually gives a "Signed-off-by: " line
in the commit message for the inclusion of this driver.

Please fix this up and resubmit, thank you.

Thanks.

^ permalink raw reply

* Re: [net-next-2.6 PATCH 2/2] net: replace ipfragok with skb->local_df
From: Herbert Xu @ 2010-04-16  6:49 UTC (permalink / raw)
  To: Shan Wei
  Cc: David Miller, yinghai.lu, kuznet, pekkas, jmorris,
	yoshfuji@linux-ipv6.org >> YOSHIFUJI Hideaki,
	Patrick McHardy, netdev@vger.kernel.org, dccp, linux-sctp,
	kleptog, jchapman, mostrows, acme
In-Reply-To: <20100416064344.GA12412@gondor.apana.org.au>

On Fri, Apr 16, 2010 at 02:43:44PM +0800, Herbert Xu wrote:
> On Fri, Apr 16, 2010 at 10:26:09AM +0800, Shan Wei wrote:
> > 
> > Now, PPPoX/PPPoL2TP driver still use ip_queue_xmit to send packets with ipfragok == 1.
> > So, now we can't remove the && ... bit. 
> 
> Huh? If they still call ip_queue_xmit with ipfragok then surely
> the build will fail after your patch as it removes the ipfragok
> argument?

Never mind, I was looking at the wrong tree.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Duplicate IP false alerts from arping
From: unni krishnan @ 2010-04-16  6:51 UTC (permalink / raw)
  To: netdev

Hi,

I am trying to find a duplicate IP in the network using arping.

-------------------------
[root@vps1 ~]# ping -c 3 192.168.1.212
PING 192.168.1.212 (192.168.1.212) 56(84) bytes of data.
64 bytes from 192.168.1.212: icmp_seq=1 ttl=64 time=1.33 ms
64 bytes from 192.168.1.212: icmp_seq=2 ttl=64 time=0.280 ms
64 bytes from 192.168.1.212: icmp_seq=3 ttl=64 time=0.306 ms

--- 192.168.1.212 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.280/0.641/1.339/0.494 ms
[root@vps1 ~]# arping -D -I eth0 -c 5 192.168.1.212 ; echo $?
ARPING 192.168.1.212 from 0.0.0.0 eth0
0
-------------------------


According to arping, that IP is a duplicate. But if I go ahead and ifdown
the IP in the known location, I can't ping that IP (does that mean the IP
is not duplicated?). This is the result after shutting down the IP.

--------------------------
[root@vps1 ~]# ping -c 3 192.168.1.212
PING 192.168.1.212 (192.168.1.212) 56(84) bytes of data.
From 192.168.1.63 icmp_seq=1 Destination Host Unreachable
From 192.168.1.63 icmp_seq=2 Destination Host Unreachable
From 192.168.1.63 icmp_seq=3 Destination Host Unreachable

--- 192.168.1.212 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2001ms
, pipe 3
[root@vps1 ~]# arping -D -I eth0 -c 5 192.168.1.212 ; echo $?
ARPING 192.168.1.212 from 0.0.0.0 eth0
Sent 5 probes (5 broadcast(s))
Received 0 response(s)
0
[root@vps1 ~]#
--------------------------

My question is: in this case IP 192.168.1.212 is not duplicated, but
arping still reports it as a duplicate. Why is that?

-- 
Regards,
Unni
http://mutexes.org/
http://twitter.com/webofunni

^ permalink raw reply

