Netdev List


* Re: rps perfomance WAS(Re: rps: question
From: Changli Gao @ 2010-04-16 14:10 UTC (permalink / raw)
  To: hadi; +Cc: Eric Dumazet, Rick Jones, David Miller, therbert, netdev, robert,
	andi
In-Reply-To: <1271425753.4606.65.camel@bigi>

On Fri, Apr 16, 2010 at 9:49 PM, jamal <hadi@cyberus.ca> wrote:
> On Fri, 2010-04-16 at 21:34 +0800, Changli Gao wrote:
>
>
> my observation is:
> s->total is the sum of all packets received by cpu (some directly from
> ethernet)

It is meaningless currently. If rps is enabled, it may be twice the
number of packets received, because one packet may be counted twice:
once in enqueue_to_backlog(), and once in __netif_receive_skb(). I
have posted a patch to solve this problem.

http://patchwork.ozlabs.org/patch/50217/

If you don't apply my patch, you'd better refer to /proc/net/dev for
the total number.

> s->received_rps was what the count receiver cpu saw incoming if they
> were sent by another cpu.

Maybe its name confused you.

/* Called from hardirq (IPI) context */
static void trigger_softirq(void *data)
{
        struct softnet_data *queue = data;
        __napi_schedule(&queue->backlog);
        __get_cpu_var(netdev_rx_stat).received_rps++;
}

The function above is called in the IRQ context of the IPI. It counts
the number of IPIs received. It is actually ipi_rps you need.

> s-> ipi_rps is the times we tried to enq to remote cpu but found it to
> be empty and had to send an IPI.
> ipi_rps can be < received_rps if we receive > 1 packet without
> generating an IPI. What did i miss?
>

-- 
Regards，
Changli Gao(xiaosuo@gmail.com)


* [PATCH] mac8390: change an error return code and some cleanup
From: Finn Thain @ 2010-04-16 14:14 UTC (permalink / raw)
  To: David Miller; +Cc: joe, p_gortmaker, netdev, linux-kernel, linux-m68k
In-Reply-To: <alpine.OSX.2.00.1004161403160.271@localhost>


Change an error return from EAGAIN to EBUSY since the former is 
misleading. Also promote the log message. Likewise some other KERN_INFO 
log messages.

Signed-off-by: Finn Thain <fthain@telegraphics.com.au>

--- a/drivers/net/mac8390.c	2010-04-16 13:31:04.000000000 +1000
+++ b/drivers/net/mac8390.c	2010-04-16 23:50:39.000000000 +1000
@@ -554,7 +554,7 @@
 	case MAC8390_APPLE:
 		switch (mac8390_testio(dev->mem_start)) {
 		case ACCESS_UNKNOWN:
-			pr_info("Don't know how to access card memory!\n");
+			pr_err("Don't know how to access card memory!\n");
 			return -ENODEV;
 			break;
 
@@ -643,8 +643,8 @@
 {
 	__ei_open(dev);
 	if (request_irq(dev->irq, __ei_interrupt, 0, "8390 Ethernet", dev)) {
-		pr_info("%s: unable to get IRQ %d.\n", dev->name, dev->irq);
-		return -EAGAIN;
+		pr_err("%s: unable to get IRQ %d.\n", dev->name, dev->irq);
+		return -EBUSY;
 	}
 	return 0;
 }
@@ -660,7 +660,7 @@
 {
 	ei_status.txing = 0;
 	if (ei_debug > 1)
-		pr_info("reset not supported\n");
+		printk(KERN_DEBUG pr_fmt("reset not supported\n"));
 	return;
 }
 
@@ -668,11 +668,11 @@
 {
 	unsigned char *target = nubus_slot_addr(IRQ2SLOT(dev->irq));
 	if (ei_debug > 1)
-		pr_info("Need to reset the NS8390 t=%lu...", jiffies);
+		printk(KERN_DEBUG pr_fmt("Need to reset the NS8390 t=%lu..."), jiffies);
 	ei_status.txing = 0;
 	target[0xC0000] = 0;
 	if (ei_debug > 1)
-		pr_cont("reset complete\n");
+		printk(KERN_CONT "reset complete\n");
 	return;
 }
 


* Re: rps perfomance WAS(Re: rps: question
From: jamal @ 2010-04-16 14:43 UTC (permalink / raw)
  To: Changli Gao
  Cc: Eric Dumazet, Rick Jones, David Miller, therbert, netdev, robert,
	andi
In-Reply-To: <r2q412e6f7f1004160710j6d575f36t8e39a283328cf2d7@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1331 bytes --]

On Fri, 2010-04-16 at 22:10 +0800, Changli Gao wrote:
> On Fri, Apr 16, 2010 at 9:49 PM, jamal <hadi@cyberus.ca> wrote:

> > my observation is:
> > s->total is the sum of all packets received by cpu (some directly from
> > ethernet)
> 
> It is meaningless currently. If rps is enabled, it may be twice of the
> number of the packets received, because one packet may be count twice:
> one in enqueue_to_backlog(), and the other in __netif_receive_skb(). 

You are probably right - you made me look at my collected data ;->
i will look closely later, but it seems they are accounting for
different cpus, no? 
Example, attached are some of the stats i captured when i was running
the tests redirecting from CPU0 to CPU1 1M packets at about 20Kpps (just
cut to the first and last two columns):

cpu  | Total    |rps_recv |rps_ipi
-----+----------+---------+---------
cpu0 | 002dc7f1 |00000000 |000f4246
cpu1 | 002dc804 |000f4240 |00000000
-------------------------------------

So: cpu0 received 0x2dc7f1 pkts cumulatively over time and
redirected them to cpu1 (mostly; the extra 5 may be leftovers since i clear
the data), and during the test it generated an IPI 0xf4246 times. It can be
seen that the running total for CPU1 is 0x2dc804, but in this one run it
received 1M packets (0xf4240).
i.e. i don't see the double accounting..

cheers,
jamal

[-- Attachment #2: st1 --]
[-- Type: text/plain, Size: 792 bytes --]

002dc7f1 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 000f4246
002dc804 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 000f4240 00000000
00000004 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000004 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000006 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
0000003c 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000004 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
0000003e 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
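
For reading the attachment above, a minimal Python sketch follows. It assumes the layout described in the message (whitespace-separated hexadecimal counters, one row per CPU, with the first column being the total and the last two being rps_recv and rps_ipi). The helper name `decode_row` is hypothetical, and the column meanings come from the table in this message, not from any released kernel.

```python
# Decode rows of a softnet_stat-style dump into decimal counters.
# Assumption (from the table in the message): column 0 is the per-CPU total,
# the second-to-last column is rps_recv and the last column is rps_ipi.

def decode_row(line):
    """Parse one row of hex counters and pick out the three columns of interest."""
    fields = [int(field, 16) for field in line.split()]
    return {"total": fields[0], "rps_recv": fields[-2], "rps_ipi": fields[-1]}

# First two rows of the attachment (cpu0 and cpu1).
rows = [
    "002dc7f1 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 000f4246",
    "002dc804 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 000f4240 00000000",
]

stats = [decode_row(row) for row in rows]
for cpu, s in enumerate(stats):
    print(f"cpu{cpu}: total={s['total']} rps_recv={s['rps_recv']} rps_ipi={s['rps_ipi']}")
```

Converting to decimal makes the "1M packets" figure easy to check: 0xf4240 is exactly 1000000.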


* Re: rps perfomance WAS(Re: rps: question
From: Changli Gao @ 2010-04-16 14:58 UTC (permalink / raw)
  To: hadi; +Cc: Eric Dumazet, Rick Jones, David Miller, therbert, netdev, robert,
	andi
In-Reply-To: <1271429025.4606.149.camel@bigi>

On Fri, Apr 16, 2010 at 10:43 PM, jamal <hadi@cyberus.ca> wrote:
> On Fri, 2010-04-16 at 22:10 +0800, Changli Gao wrote:
>> On Fri, Apr 16, 2010 at 9:49 PM, jamal <hadi@cyberus.ca> wrote:
>
>> > my observation is:
>> > s->total is the sum of all packets received by cpu (some directly from
>> > ethernet)
>>
>> It is meaningless currently. If rps is enabled, it may be twice of the
>> number of the packets received, because one packet may be count twice:
>> one in enqueue_to_backlog(), and the other in __netif_receive_skb().
>
> You are probably right - you made me look at my collected data ;->
> i will look closely later, but it seems they are accounting for
> different cpus, no?
> Example, attached are some of the stats i captured when i was running
> the tests redirecting from CPU0 to CPU1 1M packets at about 20Kpps (just
> cut to the first and last two columns):
>
> cpu   Total     |rps_recv |rps_ipi
> -----+----------+---------+---------
> cpu0 | 002dc7f1 |00000000 |000f4246
> cpu1 | 002dc804 |000f4240 |00000000
> -------------------------------------
>
> So: cpu0 receive 0x2dc7f1 pkts accummulative over time and
> redirected to cpu1 (mostly, the extra 5 maybe to leftover since i clear
> the data) and for the test 0xf4246 times it generated an IPI. It can be
> seen that total running for CPU1 is 0x2dc804 but in this one run it
> received 1M packets (0xf4240).

I remember you redirected all the traffic from cpu0 to cpu1, and the data shows:

about 0x2dc7f1 packets were processed, and about 0xf4240 IPIs were generated.

> i.e i dont see the double accounting..
>

A single packet is counted twice, once by CPU0 and once by CPU1. If you change the RPS setting with:

echo 1 > ..../rps_cpus

you will find the total number is doubled.


-- 
Regards，
Changli Gao(xiaosuo@gmail.com)
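
The double counting described above can be illustrated with a toy model. This is only a sketch of the argument in this message, not kernel code: it assumes each packet bumps the per-CPU total once on the CPU that receives it (the enqueue_to_backlog() step) and once on the CPU that processes it (the __netif_receive_skb() step). The function name `simulate` is hypothetical.

```python
from collections import Counter

def simulate(packets, rx_cpu, target_cpu):
    """Toy model of the per-CPU 'total' counter with RPS enabled."""
    total = Counter()
    for _ in range(packets):
        total[rx_cpu] += 1      # assumed: counted when the packet is enqueued
        total[target_cpu] += 1  # assumed: counted again when it is processed
    return total

# Redirecting from CPU0 to CPU1: each CPU shows the packet count once.
print(dict(simulate(1_000_000, rx_cpu=0, target_cpu=1)))

# Steering back to CPU0 (rps_cpus = 1): both increments land on CPU0.
print(dict(simulate(1_000_000, rx_cpu=0, target_cpu=0)))
```

Under these assumptions the two-CPU case hides the problem, because the two increments land in different per-CPU counters, while the single-CPU case shows the total doubled.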


* Re: [PATCH net-2.6] packet : remove init_net restriction
From: Daniel Lezcano @ 2010-04-16 15:04 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <1271322674-21726-1-git-send-email-daniel.lezcano@free.fr>

Daniel Lezcano wrote:
> The af_packet protocol is used by Perl to do ioctls as reported by
> Stephane Riviere:
>
> "Net::RawIP relies on SIOCGIFADDR and SIOCGIFHWADDR to get the IP and MAC
> addresses of the network interface."
>
> But in a new network namespace these ioctl fail because it is disabled for
> a namespace different from the init_net_ns.
>
> These two lines should not be there as af_inet and af_packet are
> namespace aware since a long time now. I suppose we forget to remove these
> lines because we sent the af_packet first, before af_inet was supported.
>
> Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
> Reported-by: Stephane Riviere <stephane.riviere@regis-dgac.net>
> ---
>  net/packet/af_packet.c |    2 --
>  1 files changed, 0 insertions(+), 2 deletions(-)
>
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index cc90363..243946d 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -2169,8 +2169,6 @@ static int packet_ioctl(struct socket *sock, unsigned int cmd,
>  	case SIOCGIFDSTADDR:
>  	case SIOCSIFDSTADDR:
>  	case SIOCSIFFLAGS:
> -		if (!net_eq(sock_net(sk), &init_net))
> -			return -ENOIOCTLCMD;
>  		return inet_dgram_ops.ioctl(sock, cmd, arg);
>  #endif
>  
>   
Shall I send it against net-next-2.6 ?

Thanks
  -- Daniel


* TCP keepalive question
From: Flavio Leitner @ 2010-04-16 15:06 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 700 bytes --]


Hi,


I'm reading RFC 1122 and it says the following:
...
           Keep-alive packets MUST only be sent when no data or
           acknowledgement packets have been received for the
           connection within an interval.  
...

The received-acknowledgement part seems to be ok and is handled
by tcp_keepalive_timer() when it does
elapsed = tcp_time_stamp - tp->rcv_tstamp;

However, if one side just receives data and replies with ACKs, the
keepalive probes are sent anyway (seen on 2.6.32.9-70.fc12.i686.PAE).

Is there any reason not to reset the keepalive timer when data is received?

Socket options used:
SO_KEEPALIVE, TCP_KEEPIDLE=40, TCP_KEEPCNT=6, TCP_KEEPINTVL=5

Traffic dump attached.

thanks!
-- 
Flavio

[-- Attachment #2: keepalive.pcap.gz --]
[-- Type: application/octet-stream, Size: 552 bytes --]
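
For anyone wanting to reproduce the setup, the socket options listed above can be applied as in the following Python sketch. It is Linux-specific (the TCP_KEEP* constants are not available on every platform), and the helper name `enable_keepalive` is hypothetical; the original test program is not shown in the message, so this is only one way to set the same values.

```python
import socket

def enable_keepalive(sock, idle=40, count=6, interval=5):
    """Enable TCP keepalive with the values quoted in the message (seconds/probes)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Idle time before the first probe is sent.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    # Number of unanswered probes before the connection is dropped.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    # Interval between successive probes.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
print("keepidle:", sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE))
sock.close()
```

The options only take effect once the socket is connected; the sketch sets them on an unconnected socket purely to show the calls.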


* Re: [PATCH v5] rfs: Receive Flow Steering
From: Andi Kleen @ 2010-04-16 15:28 UTC (permalink / raw)
  To: jamal; +Cc: Andi Kleen, Andi Kleen, Tom Herbert, davem, netdev, eric.dumazet
In-Reply-To: <1271426715.4606.97.camel@bigi>

On Fri, Apr 16, 2010 at 10:05:15AM -0400, jamal wrote:
> On Fri, 2010-04-16 at 15:42 +0200, Andi Kleen wrote:
> > On Fri, Apr 16, 2010 at 09:32:06AM -0400, jamal wrote:
> > > How are you going to schedule the net softirq on an empty queue if you
> > > do this?
> > 
> > Sorry don't understand the question? 
> > 
> > You can always do the flow as if rps was not there.
> 
> Meaning you schedule the other side netrx softirq if queue is empty?

You handle the packet as if rps wasn't enabled: the softirq runs on the
current CPU and queues it on the socket.

> > I meant an IPI to a sibling is not useful. You send it to the IPI
> > to get cache locality in the target, but if the target has the same
> > cache locality as you you can as well avoid the cost of the IPI
> > and process directly.
> > 
> 
> Isnt the purpose of the IPI to signal remote side that theres something
> for it to do? 

The current CPU can queue on that socket as well.

The whole point of the IPI is to do it with cache locality.
But if cache locality is already there on the current CPU you don't
need the IPI.

> Does it also sync the remote cache?

No, the caches are always coherent.

> 
> > For thread sibling I'm pretty sure it's useless. Not full sure about
> > socket sibling. Maybe.
> > 
> 
> Agreed, the SMT threads share L2. All the cores share L3. And it is
> inclusive, so if it is missing it is in L1 of one thread it must be
> present in L2 of shared cache as well as L3. Across the QPI i dont think
> that is true.
> But if you speacial case this - arent you being specific to Nehalem?

Other CPUs have SMT too (Niagara, POWER 6/7, mips, ...). It should
be the same there.

Assuming L3 affinity helps, it might need to be a CPU-specific tunable,
yes. The scheduler has some information about this.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
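
One reading of the suggestion in this message is a simple check before steering: if the target CPU is the current CPU or one of its SMT thread siblings, process locally and skip the IPI. The Python sketch below models only that decision. The function name `needs_ipi` and the sibling map are hypothetical; on a real Linux system the sibling sets could be read from /sys/devices/system/cpu/cpuN/topology/thread_siblings_list, which this sketch does not do.

```python
def needs_ipi(current_cpu, target_cpu, siblings):
    """Return True if steering to target_cpu would justify sending an IPI.

    siblings maps a CPU number to the set of CPUs sharing its core
    (an assumed, hand-written topology in this example).
    """
    if target_cpu == current_cpu:
        return False  # already on the right CPU
    # A thread sibling shares the cache, so local processing is assumed
    # to be as good as remote processing.
    return target_cpu not in siblings.get(current_cpu, set())

# Hypothetical topology: CPUs 0/8 share a core, CPUs 1/9 share a core.
siblings = {0: {0, 8}, 8: {0, 8}, 1: {1, 9}, 9: {1, 9}}

for target in (0, 8, 1):
    print(f"cpu0 -> cpu{target}: IPI needed = {needs_ipi(0, target, siblings)}")
```

Whether the same shortcut is worthwhile for cores that share only L3 is left open in the message, so the sketch covers thread siblings only.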


* Re: [PATCH v5] rfs: Receive Flow Steering
From: Tom Herbert @ 2010-04-16 15:35 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1271401007.16881.3762.camel@edumazet-laptop>

Eric, thanks for testing that.  Admittedly, we have not looked at enabling
RFS/RPS over loopback.   I'll look at that today also.


On Thu, Apr 15, 2010 at 11:56 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le jeudi 15 avril 2010 à 23:33 -0700, David Miller a écrit :
>> From: Tom Herbert <therbert@google.com>
>> Date: Thu, 15 Apr 2010 22:47:08 -0700 (PDT)
>>
>> > Version 5 of RFS:
>> > - Moved rps_sock_flow_sysctl into net/core/sysctl_net_core.c as a
>> > static function.
>> > - Apply limits to rps_sock_flow_entires systcl and rps_flow_count
>> > sysfs variable.
>>
>> I've read this over a few times and I think it's ready to go into
>> net-next-2.6, we can tweak things as-needed from here on out.
>>
>> Eric, what do you think?
>
> I read the patch and found no error.
>
> I booted a test machine and performed some tests
>
> I am a bit worried of a tbench regression I am looking at right now.
>
> if RFS disabled , tbench 16   ->  4408.63 MB/sec
>
>
> # grep . /sys/class/net/lo/queues/rx-0/*
> /sys/class/net/lo/queues/rx-0/rps_cpus:00000000
> /sys/class/net/lo/queues/rx-0/rps_flow_cnt:8192
> # cat /proc/sys/net/core/rps_sock_flow_entries
> 8192
>
>
> echo ffff >/sys/class/net/lo/queues/rx-0/rps_cpus
>
> tbench 16 -> 2336.32 MB/sec
>
>
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>   PerfTop:   14561 irqs/sec  kernel:86.3% [1000Hz cycles],  (all, 16 CPUs)
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>
>             samples  pcnt function                       DSO
>             _______ _____ ______________________________ __________________________________________________________
>
>             2664.00  5.1% copy_user_generic_string       /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>             2323.00  4.4% acpi_os_read_port              /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>             1641.00  3.1% _raw_spin_lock_irqsave         /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>             1260.00  2.4% schedule                       /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>             1159.00  2.2% _raw_spin_lock                 /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>             1051.00  2.0% tcp_ack                        /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              991.00  1.9% tcp_sendmsg                    /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              922.00  1.8% tcp_recvmsg                    /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              821.00  1.6% child_run                      /usr/bin/tbench
>              766.00  1.5% all_string_sub                 /usr/bin/tbench
>              630.00  1.2% __switch_to                    /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              608.00  1.2% __GI_strchr                    /lib/tls/libc-2.3.4.so
>              606.00  1.2% ipt_do_table                   /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              600.00  1.1% __GI_strstr                    /lib/tls/libc-2.3.4.so
>              556.00  1.1% __netif_receive_skb            /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              504.00  1.0% tcp_transmit_skb               /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              502.00  1.0% tick_nohz_stop_sched_tick      /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              481.00  0.9% _raw_spin_unlock_irqrestore    /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              473.00  0.9% next_token                     /usr/bin/tbench
>              449.00  0.9% ip_rcv                         /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              423.00  0.8% call_function_single_interrupt /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              422.00  0.8% ia32_sysenter_target           /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              420.00  0.8% compat_sys_socketcall          /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              401.00  0.8% mod_timer                      /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              400.00  0.8% process_backlog                /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              399.00  0.8% ip_queue_xmit                  /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              387.00  0.7% select_task_rq_fair            /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              377.00  0.7% _raw_spin_lock_bh              /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              360.00  0.7% tcp_v4_rcv                     /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>
> But if RFS is on, why activating rps_cpus change tbench ?
>
>
>
>
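
The rps_cpus values in the quoted test (00000000 and ffff) are hexadecimal CPU bitmasks, with bit N selecting CPU N. A small Python sketch of the conversion follows; the helper names are hypothetical, and the sketch handles only a single hex word (sysfs writes wider masks as comma-separated groups, which are not handled here).

```python
def cpus_to_mask(cpus):
    """Build the hex string to write to rps_cpus from an iterable of CPU numbers."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")

def mask_to_cpus(mask_hex):
    """List the CPU numbers selected by a single-word hex mask."""
    mask = int(mask_hex, 16)
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]

print(cpus_to_mask(range(16)))   # all 16 CPUs of the test machine
print(mask_to_cpus("00000000"))  # RPS disabled: no CPUs selected
print(mask_to_cpus("ff"))        # first 8 CPUs
```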


* Re: [PATCH v4] rfs: Receive Flow Steering
From: Tom Herbert @ 2010-04-16 15:51 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: davem, netdev, eric.dumazet, Ingo Molnar, Paul Turner
In-Reply-To: <20100412171205.561a1aec@nehalam>

> There are two sometimes conflicting models:
>
> One model is to have the flow's be dispersed and let the scheduler
> be smarter about running the applications on the right CPU's where
> the packets arrive.
>
> The other is to have the flows redirected to the CPU where the application
> previously ran which is what RFS does.
>
> For benchmarks and private fixed configuration systems it is tempting
> to just nail everything down: i.e. use hard SMP affinity, for hardware, processes,
> and flows.  But this is the wrong solution for general purpose systems with
> varying workloads and requirements.  How well does RFS really work when
> applications, processes, and sockets come and go or get migrated among
> CPU's by the scheduler? My concern is this is overlapping scheduler
> design and might be a step backwards.
>
This is true.  There is a fundamental question of whether scheduler
should lead networking or vice versa.  The advantages of networking
following scheduler seem to become more apparent on heavily loaded
systems or with threads that handle more than one flow.

I'm not sure these two models have to be mutually exclusive; we are
looking at some ways to make a hybrid model.

The statement about pinning down resources is also true; we are
actively trying to squash any instances of this in our applications!

Tom

>
> --
>


* Re: rps perfomance WAS(Re: rps: question
From: Tom Herbert @ 2010-04-16 15:57 UTC (permalink / raw)
  To: hadi; +Cc: Eric Dumazet, netdev, robert, David Miller, Changli Gao,
	Andi Kleen
In-Reply-To: <1271271222.4567.51.camel@bigi>

> It would be valuable to have something like Documentation/networking/rps
> to detail things a little more.
>

Working on it.  Will try to post data for several platforms soon.

> cheers,
> jamal
>
>


* Re: [PATCH v4] rfs: Receive Flow Steering
From: Rick Jones @ 2010-04-16 17:33 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Stephen Hemminger, davem, netdev, eric.dumazet, Ingo Molnar,
	Paul Turner
In-Reply-To: <o2k65634d661004160851wc00c609p7136a22fd07503c1@mail.gmail.com>

> 
> This is true.  There is a fundamental question of whether scheduler
> should lead networking or vice versa.  The advantages of networking
> following scheduler seem to become more apparent on heavily loaded
> systems or with threads that handle more than one flow.

I will confess to being in the "networking should follow the scheduler" camp :)

> I'm not sure these two models have to be mutually exclusive, we are
> looking at some ways to make a hybrid model.

It is perhaps too speculative on my part, but if the host has no control over 
the remote addressing of the connections to/from it, doesn't that suggest that 
allowing networking to lead the scheduler gives "external forces" more say in 
intra-system resource consumption than we might want them to have?

rick jones


* Re: [PATCH v4] rfs: Receive Flow Steering
From: Paul Turner @ 2010-04-16 17:59 UTC (permalink / raw)
  To: Rick Jones
  Cc: Tom Herbert, Stephen Hemminger, davem, netdev, eric.dumazet,
	Ingo Molnar
In-Reply-To: <4BC89F6D.2080604@hp.com>

On Fri, Apr 16, 2010 at 10:33 AM, Rick Jones <rick.jones2@hp.com> wrote:
>>
>> This is true.  There is a fundamental question of whether scheduler
>> should lead networking or vice versa.  The advantages of networking
>> following scheduler seem to become more apparent on heavily loaded
>> systems or with threads that handle more than one flow.
>
> I will confess to being in the networking should follow the scheduler camp
> :)
>
>> I'm not sure these two models have to be mutually exclusive, we are
>> looking at some ways to make a hybrid model.
>
> It is perhaps too speculative on my part, but if the host has no control
> over the remote addressing of the connections to/from it, doesn't that
> suggest that allowing networking to lead the scheduler gives "external
> forces" more say in intra-system resource consumption than we might want
> them to have?
>
> rick jones
>

Even under a hybrid model I think phrasing it as networking leading
the scheduler here is a little strong.  The scheduler is in both cases
the most 'informed' place to make these decisions, but I think it
could benefit from more knowledge.  In the 'virgin' single flow case
without any steering the network stack is currently able to implicitly
hint to the scheduler where flows could be most efficiently served due
to wake-affine balancing behaviors.  This is a natural side-effect of
wake-ups being sourced by the networking cpus.

I think the win here would be allowing this (naturally existing)
hinting to be a little more explicit so that the scheduler and
load-balancer are able to gracefully 'collapse' back down onto the
network cpu socket under low stress conditions, even if previous
processing was balanced away from it due to load.

This would actually then look very much like today's model under loads
where you don't need scaling via parallelism.  One way to think about
making it an explicit hint could be: should the rx cpu sourcing the
wake-up in this case be the target for wake-affine as opposed to the
current bottom-half delegate?

- Paul


* Re: [PATCH v5] rfs: Receive Flow Steering
From: Eric Dumazet @ 2010-04-16 18:15 UTC (permalink / raw)
  To: Tom Herbert; +Cc: David Miller, netdev
In-Reply-To: <w2t65634d661004160835z4a604ee7pb5f9d395fe61b5db@mail.gmail.com>

Le vendredi 16 avril 2010 à 08:35 -0700, Tom Herbert a écrit :
> Eric, thanks for testing that.  Admittedly, we have not looked at enabling
> RFS/RPS over loopback.   I'll look at that today also.
> 
> 

Hi Tom

I am sorry, but I could not work on this today. I hope I can find some
time a bit later.



> On Thu, Apr 15, 2010 at 11:56 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > Le jeudi 15 avril 2010 à 23:33 -0700, David Miller a écrit :
> >> From: Tom Herbert <therbert@google.com>
> >> Date: Thu, 15 Apr 2010 22:47:08 -0700 (PDT)
> >>
> >> > Version 5 of RFS:
> >> > - Moved rps_sock_flow_sysctl into net/core/sysctl_net_core.c as a
> >> > static function.
> >> > - Apply limits to rps_sock_flow_entires systcl and rps_flow_count
> >> > sysfs variable.
> >>
> >> I've read this over a few times and I think it's ready to go into
> >> net-next-2.6, we can tweak things as-needed from here on out.
> >>
> >> Eric, what do you think?
> >
> > I read the patch and found no error.
> >
> > I booted a test machine and performed some tests
> >
> > I am a bit worried of a tbench regression I am looking at right now.
> >
> > if RFS disabled , tbench 16   ->  4408.63 MB/sec
> >
> >
> > # grep . /sys/class/net/lo/queues/rx-0/*
> > /sys/class/net/lo/queues/rx-0/rps_cpus:00000000
> > /sys/class/net/lo/queues/rx-0/rps_flow_cnt:8192
> > # cat /proc/sys/net/core/rps_sock_flow_entries
> > 8192
> >
> >
> > echo ffff >/sys/class/net/lo/queues/rx-0/rps_cpus
> >
> > tbench 16 -> 2336.32 MB/sec
> >
> >
> > -----------------------------------------------------------------------------------------------------------------------------------------------------
> >   PerfTop:   14561 irqs/sec  kernel:86.3% [1000Hz cycles],  (all, 16 CPUs)
> > -----------------------------------------------------------------------------------------------------------------------------------------------------
> >
> >             samples  pcnt function                       DSO
> >             _______ _____ ______________________________ __________________________________________________________
> >
> >             2664.00  5.1% copy_user_generic_string       /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
> >             2323.00  4.4% acpi_os_read_port              /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
> >             1641.00  3.1% _raw_spin_lock_irqsave         /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
> >             1260.00  2.4% schedule                       /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
> >             1159.00  2.2% _raw_spin_lock                 /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
> >             1051.00  2.0% tcp_ack                        /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
> >              991.00  1.9% tcp_sendmsg                    /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
> >              922.00  1.8% tcp_recvmsg                    /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
> >              821.00  1.6% child_run                      /usr/bin/tbench
> >              766.00  1.5% all_string_sub                 /usr/bin/tbench
> >              630.00  1.2% __switch_to                    /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
> >              608.00  1.2% __GI_strchr                    /lib/tls/libc-2.3.4.so
> >              606.00  1.2% ipt_do_table                   /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
> >              600.00  1.1% __GI_strstr                    /lib/tls/libc-2.3.4.so
> >              556.00  1.1% __netif_receive_skb            /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
> >              504.00  1.0% tcp_transmit_skb               /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
> >              502.00  1.0% tick_nohz_stop_sched_tick      /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
> >              481.00  0.9% _raw_spin_unlock_irqrestore    /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
> >              473.00  0.9% next_token                     /usr/bin/tbench
> >              449.00  0.9% ip_rcv                         /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
> >              423.00  0.8% call_function_single_interrupt /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
> >              422.00  0.8% ia32_sysenter_target           /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
> >              420.00  0.8% compat_sys_socketcall          /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
> >              401.00  0.8% mod_timer                      /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
> >              400.00  0.8% process_backlog                /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
> >              399.00  0.8% ip_queue_xmit                  /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
> >              387.00  0.7% select_task_rq_fair            /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
> >              377.00  0.7% _raw_spin_lock_bh              /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
> >              360.00  0.7% tcp_v4_rcv                     /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
> >
> > But if RFS is on, why activating rps_cpus change tbench ?
> >
> >
> >
> >




* Re: [PATCH v4] rfs: Receive Flow Steering
From: Rick Jones @ 2010-04-16 18:32 UTC (permalink / raw)
  To: Paul Turner
  Cc: Tom Herbert, Stephen Hemminger, davem, netdev, eric.dumazet,
	Ingo Molnar
In-Reply-To: <i2oed628a921004161059z65a3cf1aq5f3cd2194f40a811@mail.gmail.com>

> Even under a hybrid model I think phrasing it as networking leading
> the scheduler here is a little strong.  The scheduler is in both cases
> the most 'informed' place to make these decisions, but I think it
> could benefit from more knowledge.  In the 'virgin' single flow case
> without any steering the network stack is currently able to implicitly
> hint to the scheduler where flows could be most efficiently served due
> to wake-affine balancing behaviors.  This is a natural side-effect of
> wake-ups being sourced by the networking cpus.

Hinting to the scheduler is fine - so long as the scheduler has the final say.
Presumably it is the thing that knows about the other forces tugging at where to 
run the thread - where its memory is allocated, what other flows are coming to 
it etc.

rick jones


* Re: [PATCH v5] rfs: Receive Flow Steering
From: Tom Herbert @ 2010-04-16 18:35 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1271401007.16881.3762.camel@edumazet-laptop>

Results with "tbench 16" on an 8 core Intel machine.

No RPS/RFS:  2155 MB/sec
RPS (0ff mask): 1700 MB/sec
RFS: 1097 MB/sec

I am not particularly surprised by the results: using the loopback
interface already provides good parallelism, and RPS/RFS really would
only add overhead and more trips between CPUs (the last part is why RFS <
RPS, I suspect). I guess this is why we've never enabled RPS on
loopback :-)

Eric, do you have a particular concern that this could affect a real workload?

Tom
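
To put the figures above side by side with the earlier 16-CPU report, the relative drop from each baseline can be computed as in this short Python sketch. It assumes the RFS figure is in MB/sec like the other two; the function name `drop_percent` is hypothetical.

```python
def drop_percent(baseline, measured):
    """Throughput lost relative to the baseline, as a percentage."""
    return (baseline - measured) / baseline * 100.0

# 8-core results from this message (MB/sec; unit assumed for the RFS figure).
baseline = 2155.0
results = {"RPS": 1700.0, "RFS": 1097.0}
for name, mbps in results.items():
    print(f"{name}: {drop_percent(baseline, mbps):.1f}% below no RPS/RFS")

# 16-CPU result quoted below: rps_cpus=00000000 versus rps_cpus=ffff.
print(f"16-CPU run: {drop_percent(4408.63, 2336.32):.1f}% below baseline")
```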


On Thu, Apr 15, 2010 at 11:56 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le jeudi 15 avril 2010 à 23:33 -0700, David Miller a écrit :
>> From: Tom Herbert <therbert@google.com>
>> Date: Thu, 15 Apr 2010 22:47:08 -0700 (PDT)
>>
>> > Version 5 of RFS:
>> > - Moved rps_sock_flow_sysctl into net/core/sysctl_net_core.c as a
>> > static function.
>> > - Apply limits to rps_sock_flow_entires systcl and rps_flow_count
>> > sysfs variable.
>>
>> I've read this over a few times and I think it's ready to go into
>> net-next-2.6, we can tweak things as-needed from here on out.
>>
>> Eric, what do you think?
>
> I read the patch and found no error.
>
> I booted a test machine and performed some tests
>
> I am a bit worried of a tbench regression I am looking at right now.
>
> if RFS disabled , tbench 16   ->  4408.63 MB/sec
>
>
> # grep . /sys/class/net/lo/queues/rx-0/*
> /sys/class/net/lo/queues/rx-0/rps_cpus:00000000
> /sys/class/net/lo/queues/rx-0/rps_flow_cnt:8192
> # cat /proc/sys/net/core/rps_sock_flow_entries
> 8192
>
>
> echo ffff >/sys/class/net/lo/queues/rx-0/rps_cpus
>
> tbench 16 -> 2336.32 MB/sec
>
>
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>   PerfTop:   14561 irqs/sec  kernel:86.3% [1000Hz cycles],  (all, 16 CPUs)
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>
>             samples  pcnt function                       DSO
>             _______ _____ ______________________________ __________________________________________________________
>
>             2664.00  5.1% copy_user_generic_string       /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>             2323.00  4.4% acpi_os_read_port              /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>             1641.00  3.1% _raw_spin_lock_irqsave         /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>             1260.00  2.4% schedule                       /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>             1159.00  2.2% _raw_spin_lock                 /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>             1051.00  2.0% tcp_ack                        /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              991.00  1.9% tcp_sendmsg                    /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              922.00  1.8% tcp_recvmsg                    /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              821.00  1.6% child_run                      /usr/bin/tbench
>              766.00  1.5% all_string_sub                 /usr/bin/tbench
>              630.00  1.2% __switch_to                    /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              608.00  1.2% __GI_strchr                    /lib/tls/libc-2.3.4.so
>              606.00  1.2% ipt_do_table                   /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              600.00  1.1% __GI_strstr                    /lib/tls/libc-2.3.4.so
>              556.00  1.1% __netif_receive_skb            /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              504.00  1.0% tcp_transmit_skb               /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              502.00  1.0% tick_nohz_stop_sched_tick      /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              481.00  0.9% _raw_spin_unlock_irqrestore    /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              473.00  0.9% next_token                     /usr/bin/tbench
>              449.00  0.9% ip_rcv                         /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              423.00  0.8% call_function_single_interrupt /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              422.00  0.8% ia32_sysenter_target           /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              420.00  0.8% compat_sys_socketcall          /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              401.00  0.8% mod_timer                      /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              400.00  0.8% process_backlog                /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              399.00  0.8% ip_queue_xmit                  /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              387.00  0.7% select_task_rq_fair            /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              377.00  0.7% _raw_spin_lock_bh              /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              360.00  0.7% tcp_v4_rcv                     /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>
> But if RFS is on, why does activating rps_cpus change the tbench result?
>
>
>
>

^ permalink raw reply

* Re: [PATCH v5] rfs: Receive Flow Steering
From: Eric Dumazet @ 2010-04-16 18:53 UTC (permalink / raw)
  To: Tom Herbert; +Cc: David Miller, netdev
In-Reply-To: <l2m65634d661004161135h1c1466afi54787022bfc2ce12@mail.gmail.com>

Le vendredi 16 avril 2010 à 11:35 -0700, Tom Herbert a écrit :
> Results with "tbench 16" on an 8 core Intel machine.
> 
> No RPS/RFS:  2155 MB/sec
> RPS (0ff mask): 1700 MB/sec
> RFS: 1097
> 
> I am not particularly surprised by the results, using loopback
> interface already provides good parallelism and RPS/RFS really would
> only add overhead and more trips between CPUs (last part is why RPS <
> RFS I suspect)-- I guess this is why we've never enabled RPS on
> loopback :-)
> 
> Eric, do you have a particular concern that this could affect a real workload?
> 

I was expecting RFS to be better than RPS at least, for this particular
workload (tcp over loopback)

With RPS, the hash of (127.0.0.1, port1, 127.0.0.1, port2) differs from
the hash of (127.0.0.1, port2, 127.0.0.1, port1), so we basically force
the server to run on a different processor than the client.

However, I was expecting that with RFS, client and server would run on
the same cpu.

Maybe we could change (for a test) the hash function to use (sport ^ dport)
instead of (sport << 16) + dport
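
To illustrate the symmetry point, here is a minimal userspace sketch. It is
not the kernel's code: port_key_ordered(), port_key_symmetric() and
key_to_cpu() are hypothetical names, and the modulo step is only a stand-in
for whatever mixing the real hash performs. On loopback both addresses are
127.0.0.1, so only the port ordering distinguishes the two directions of a
connection:

```c
#include <assert.h>
#include <stdint.h>

/* Direction-sensitive key: swapping sport and dport changes the value. */
static uint32_t port_key_ordered(uint16_t sport, uint16_t dport)
{
	return ((uint32_t)sport << 16) + dport;
}

/* Direction-insensitive key: XOR is commutative, so both directions match. */
static uint32_t port_key_symmetric(uint16_t sport, uint16_t dport)
{
	return (uint32_t)(sport ^ dport);
}

/* Hypothetical stand-in for the final "pick a cpu from the hash" step. */
static unsigned int key_to_cpu(uint32_t key, unsigned int ncpus)
{
	assert(ncpus > 0);
	return key % ncpus;
}
```

With ports 1000 and 2000, the ordered key differs between the two
directions, while the XOR key is identical, so in this model both directions
of the flow would be steered to the same cpu.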




^ permalink raw reply

* Re: [PATCH v5] rfs: Receive Flow Steering
From: Eric Dumazet @ 2010-04-16 19:37 UTC (permalink / raw)
  To: David Miller; +Cc: therbert, netdev
In-Reply-To: <20100415.233334.242114544.davem@davemloft.net>

Le jeudi 15 avril 2010 à 23:33 -0700, David Miller a écrit :
> From: Tom Herbert <therbert@google.com>
> Date: Thu, 15 Apr 2010 22:47:08 -0700 (PDT)
> 
> > Version 5 of RFS:
> > - Moved rps_sock_flow_sysctl into net/core/sysctl_net_core.c as a
> > static function.
> > - Apply limits to rps_sock_flow_entries sysctl and rps_flow_cnt
> > sysfs variable.
> 
> I've read this over a few times and I think it's ready to go into
> net-next-2.6, we can tweak things as-needed from here on out.
> 
> Eric, what do you think?

I think I can give my Sob, and we have time to fully test it and tweak
it if necessary.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Thanks Tom !



^ permalink raw reply

* Re: [PATCH net-2.6] packet : remove init_net restriction
From: David Miller @ 2010-04-16 20:23 UTC (permalink / raw)
  To: daniel.lezcano; +Cc: netdev
In-Reply-To: <4BC87C7C.4060407@free.fr>

From: Daniel Lezcano <daniel.lezcano@free.fr>
Date: Fri, 16 Apr 2010 17:04:28 +0200

> Shall I send it against net-next-2.6 ?

No, I'll likely add it to net-2.6, I just haven't gotten around
to it yet.

Thanks.

^ permalink raw reply

* Re: [PATCH] mac8390: fix pr_info() calls and change return code
From: David Miller @ 2010-04-16 20:28 UTC (permalink / raw)
  To: fthain; +Cc: joe, p_gortmaker, netdev, linux-kernel, linux-m68k
In-Reply-To: <alpine.OSX.2.00.1004162340370.271@localhost>

From: Finn Thain <fthain@telegraphics.com.au>
Date: Fri, 16 Apr 2010 23:57:34 +1000 (EST)

> 
> On Thu, 15 Apr 2010, Joe Perches wrote:
> 
>> ...Why is it better to use -EBUSY?
> 
> Nubus slots are geographically addressed and their irqs are equally 
> inflexible. -EAGAIN is misleading because retrying will not help fix 
> whatever bug caused the irq to be unavailable.

This is exactly the kind of background information and verbose
explanation that belongs in the commit message.

Yet in your recent version of the patch, you're still being extremely
terse about the reasoning for using -EBUSY.

Just saying it's "misleading" doesn't tell anyone anything if they
have to go back in the commit history and try to figure out why this
change was made if it's causing problems later.

Please put the verbose and complete explanation in your commit
message, and resubmit your patch.

I just want to point out that with all the trouble you gave about
Joe's work, you're having one heck of a time even submitting your
changes properly. :-)

Thanks.

^ permalink raw reply

* Re: [PATCH] rdma/cm: Randomize local port allocation.
From: David Miller @ 2010-04-16 20:30 UTC (permalink / raw)
  To: penguin-kernel-JPay3/Yim36HaxMnTkn67Xf5DAMn2ifp
  Cc: amwang-H+wXaHxf7aLQT0dZR+AlfA, sean.hefty-ral2JQCrhuEAvxtiuMwx3w,
	opurdila-+zzKsuq53OdBDgjK7y7TUQ,
	eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w,
	netdev-u79uwXL29TY76Z2rM5mHXA, nhorman-2XuSBdqkA4R54TAoqtyWWQ,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	rolandd-FYB4Gu1CFyUAvxtiuMwx3w, linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <201004162254.FJF73478.SHOOMOFtQFVJLF-JPay3/Yim36HaxMnTkn67Xf5DAMn2ifp@public.gmane.org>

From: Tetsuo Handa <penguin-kernel-JPay3/Yim36HaxMnTkn67Xf5DAMn2ifp@public.gmane.org>
Date: Fri, 16 Apr 2010 22:54:22 +0900

> Cong Wang wrote:
>> Sean Hefty wrote:
>> > I like this version, thanks!  I'm not sure which tree to merge it through.
>> > Are you needing this for 2.6.34, or is 2.6.35 okay?
>> > 
>> 
>> As soon as possible, so 2.6.34. :)
>> 
> Cong, merge window for 2.6.34 was already closed.
> You need to make your patchset towards 2.6.35 (using net-next-2.6 tree)
> rather than 2.6.34 (using linux-2.6 tree). Therefore, this patch being
> queued for 2.6.35 (through net-next-2.6 tree) should be okay for you.

I don't take RDMA patches into net-next-2.6; the less I touch this
stack-avoiding stuff the better, and Roland has been taking this stuff
into his own tree for some time now.

^ permalink raw reply

* Re: [PATCH v5] rfs: Receive Flow Steering
From: Tom Herbert @ 2010-04-16 20:42 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1271443994.16881.4249.camel@edumazet-laptop>

On Fri, Apr 16, 2010 at 11:53 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le vendredi 16 avril 2010 à 11:35 -0700, Tom Herbert a écrit :
>> Results with "tbench 16" on an 8 core Intel machine.
>>
>> No RPS/RFS:  2155 MB/sec
>> RPS (0ff mask): 1700 MB/sec
>> RFS: 1097
>>

Blah, I mistakenly reported that... should have been:

No RPS/RFS:  2155 MB/sec
RPS (0ff mask): 1097 MB/sec
RFS: 1700 MB/sec

Sorry about that!

>> I am not particularly surprised by the results, using loopback
>> interface already provides good parallelism and RPS/RFS really would
>> only add overhead and more trips between CPUs (last part is why RPS <
>> RFS I suspect)-- I guess this is why we've never enabled RPS on
>> loopback :-)
>>
>> Eric, do you have a particular concern that this could affect a real workload?
>>
>
> I was expecting RFS to be better than RPS at least, for this particular
> workload (tcp over loopback)
>
This was my expectation too, and what my "corrected" numbers show :-)
But, I take it this is different in your results?

Tom

^ permalink raw reply

* Re: [PATCH v5] rfs: Receive Flow Steering
From: Eric Dumazet @ 2010-04-16 21:12 UTC (permalink / raw)
  To: Tom Herbert; +Cc: David Miller, netdev
In-Reply-To: <u2t65634d661004161342zeadb5602w73c369ec717dc6e1@mail.gmail.com>

Le vendredi 16 avril 2010 à 13:42 -0700, Tom Herbert a écrit :
> On Fri, Apr 16, 2010 at 11:53 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > Le vendredi 16 avril 2010 à 11:35 -0700, Tom Herbert a écrit :
> >> Results with "tbench 16" on an 8 core Intel machine.
> >>
> >> No RPS/RFS:  2155 MB/sec
> >> RPS (0ff mask): 1700 MB/sec
> >> RFS: 1097
> >>
> 
> Blah, I mistakenly reported that... should have been:
> 
> No RPS/RFS:  2155 MB/sec
> RPS (0ff mask): 1097 MB/sec
> RFS: 1700 MB/sec
> 
> Sorry about that!

> This was my expectation too, and what my "corrected" numbers show :-)
> But, I take it this is different in your results?


My results are for "tbench 16" on a dual X5570 @ 2.93GHz
(16 logical cpus).

No RPS, no RFS: 4448.14 MB/sec
RPS: 2298.00 MB/sec (but a lot of variation)
RFS: 2600 MB/sec

Maybe my RFS setup is bad ?
(8192 flows)



^ permalink raw reply

* Re: [PATCH v5] rfs: Receive Flow Steering
From: Eric Dumazet @ 2010-04-16 21:25 UTC (permalink / raw)
  To: Tom Herbert; +Cc: David Miller, netdev
In-Reply-To: <1271452358.16881.4486.camel@edumazet-laptop>

Le vendredi 16 avril 2010 à 23:12 +0200, Eric Dumazet a écrit :
> Le vendredi 16 avril 2010 à 13:42 -0700, Tom Herbert a écrit :
> > On Fri, Apr 16, 2010 at 11:53 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > Le vendredi 16 avril 2010 à 11:35 -0700, Tom Herbert a écrit :
> > >> Results with "tbench 16" on an 8 core Intel machine.
> > >>
> > >> No RPS/RFS:  2155 MB/sec
> > >> RPS (0ff mask): 1700 MB/sec
> > >> RFS: 1097
> > >>
> > 
> > Blah, I mistakenly reported that... should have been:
> > 
> > No RPS/RFS:  2155 MB/sec
> > RPS (0ff mask): 1097 MB/sec
> > RFS: 1700 MB/sec
> > 
> > Sorry about that!
> 
> > This was my expectation too, and what my "corrected" numbers show :-)
> > But, I take it this is different in your results?
> 
> 
> My results are for "tbench 16" on a dual X5570 @ 2.93GHz
> (16 logical cpus).
>
> No RPS, no RFS: 4448.14 MB/sec
> RPS: 2298.00 MB/sec (but a lot of variation)
> RFS: 2600 MB/sec
> 
> Maybe my RFS setup is bad ?
> (8192 flows)
> 

Very strange, a second tbench-16 RFS=y run gave me 2134.08 MB/sec 

A third run gave me 1813.21 MB/sec 
A fourth run gave me 2472.91 MB/sec 

Hmm...
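
To put a number on that run-to-run variation, here is a small sketch that
summarizes the four RFS=y results quoted in this thread. The struct and
function names (run_stats, summarize_runs) are made up for illustration;
only the four MB/sec figures come from the runs above:

```c
#include <assert.h>
#include <stddef.h>

struct run_stats {
	double min;
	double max;
	double mean;
};

/* Compute min, max and arithmetic mean of n throughput samples. */
static struct run_stats summarize_runs(const double *mbps, size_t n)
{
	struct run_stats s;
	double sum = 0.0;
	size_t i;

	assert(n > 0);
	s.min = mbps[0];
	s.max = mbps[0];
	for (i = 0; i < n; i++) {
		if (mbps[i] < s.min)
			s.min = mbps[i];
		if (mbps[i] > s.max)
			s.max = mbps[i];
		sum += mbps[i];
	}
	s.mean = sum / (double)n;
	return s;
}

/* The four "tbench 16" RFS=y runs reported above, in MB/sec. */
static const double rfs_runs[] = { 2600.0, 2134.08, 1813.21, 2472.91 };
```

Calling summarize_runs(rfs_runs, 4) gives a mean of about 2255 MB/sec with
a min-to-max spread of roughly 787 MB/sec, i.e. about a third of the mean,
so a single run says little about RFS versus RPS on this machine.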





^ permalink raw reply

* [PATCH] gigaset: include cleanup cleanup
From: Tilman Schmidt @ 2010-04-16 22:08 UTC (permalink / raw)
  To: Karsten Keil, David Miller
  Cc: Tejun Heo, Hansjoerg Lipp, i4ldeveloper, netdev, linux-kernel

Commit 5a0e3ad causes slab.h to be included twice in many of the
Gigaset driver's source files, first via the common include file
gigaset.h and then a second time directly. Drop the spares, and
use the opportunity to clean up a few more similar cases.

Impact: cleanup, no functional change
Signed-off-by: Tilman Schmidt <tilman@imap.cc>
CC: Tejun Heo <tj@kernel.org>
---
Seeing that the "include cleanup" patch triggering this was accepted
after the merge window, I have hopes this one will be accepted, too.

 drivers/isdn/gigaset/bas-gigaset.c |    5 -----
 drivers/isdn/gigaset/capi.c        |    2 --
 drivers/isdn/gigaset/common.c      |    2 --
 drivers/isdn/gigaset/gigaset.h     |    2 +-
 drivers/isdn/gigaset/i4l.c         |    1 -
 drivers/isdn/gigaset/interface.c   |    1 -
 drivers/isdn/gigaset/proc.c        |    1 -
 drivers/isdn/gigaset/ser-gigaset.c |    3 ---
 drivers/isdn/gigaset/usb-gigaset.c |    4 ----
 9 files changed, 1 insertions(+), 20 deletions(-)

diff --git a/drivers/isdn/gigaset/bas-gigaset.c b/drivers/isdn/gigaset/bas-gigaset.c
index 0be15c7..47a5ffe 100644
--- a/drivers/isdn/gigaset/bas-gigaset.c
+++ b/drivers/isdn/gigaset/bas-gigaset.c
@@ -14,11 +14,6 @@
  */
 
 #include "gigaset.h"
-
-#include <linux/errno.h>
-#include <linux/init.h>
-#include <linux/slab.h>
-#include <linux/timer.h>
 #include <linux/usb.h>
 #include <linux/module.h>
 #include <linux/moduleparam.h>
diff --git a/drivers/isdn/gigaset/capi.c b/drivers/isdn/gigaset/capi.c
index eb7e271..964a55f 100644
--- a/drivers/isdn/gigaset/capi.c
+++ b/drivers/isdn/gigaset/capi.c
@@ -12,8 +12,6 @@
  */
 
 #include "gigaset.h"
-#include <linux/slab.h>
-#include <linux/ctype.h>
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
 #include <linux/isdn/capilli.h>
diff --git a/drivers/isdn/gigaset/common.c b/drivers/isdn/gigaset/common.c
index 0b39b38..f6f45f2 100644
--- a/drivers/isdn/gigaset/common.c
+++ b/drivers/isdn/gigaset/common.c
@@ -14,10 +14,8 @@
  */
 
 #include "gigaset.h"
-#include <linux/ctype.h>
 #include <linux/module.h>
 #include <linux/moduleparam.h>
-#include <linux/slab.h>
 
 /* Version Information */
 #define DRIVER_AUTHOR "Hansjoerg Lipp <hjlipp@web.de>, Tilman Schmidt <tilman@imap.cc>, Stefan Eilers"
diff --git a/drivers/isdn/gigaset/gigaset.h b/drivers/isdn/gigaset/gigaset.h
index 9ef5b04..d32efb6 100644
--- a/drivers/isdn/gigaset/gigaset.h
+++ b/drivers/isdn/gigaset/gigaset.h
@@ -22,9 +22,9 @@
 #include <linux/kernel.h>
 #include <linux/compiler.h>
 #include <linux/types.h>
+#include <linux/ctype.h>
 #include <linux/slab.h>
 #include <linux/spinlock.h>
-#include <linux/usb.h>
 #include <linux/skbuff.h>
 #include <linux/netdevice.h>
 #include <linux/ppp_defs.h>
diff --git a/drivers/isdn/gigaset/i4l.c b/drivers/isdn/gigaset/i4l.c
index c99fb97..c22e5ac 100644
--- a/drivers/isdn/gigaset/i4l.c
+++ b/drivers/isdn/gigaset/i4l.c
@@ -15,7 +15,6 @@
 
 #include "gigaset.h"
 #include <linux/isdnif.h>
-#include <linux/slab.h>
 
 #define HW_HDR_LEN	2	/* Header size used to store ack info */
 
diff --git a/drivers/isdn/gigaset/interface.c b/drivers/isdn/gigaset/interface.c
index f0dc6c9..c9f28dd 100644
--- a/drivers/isdn/gigaset/interface.c
+++ b/drivers/isdn/gigaset/interface.c
@@ -13,7 +13,6 @@
 
 #include "gigaset.h"
 #include <linux/gigaset_dev.h>
-#include <linux/tty.h>
 #include <linux/tty_flip.h>
 
 /*** our ioctls ***/
diff --git a/drivers/isdn/gigaset/proc.c b/drivers/isdn/gigaset/proc.c
index b69f73a..b943efb 100644
--- a/drivers/isdn/gigaset/proc.c
+++ b/drivers/isdn/gigaset/proc.c
@@ -14,7 +14,6 @@
  */
 
 #include "gigaset.h"
-#include <linux/ctype.h>
 
 static ssize_t show_cidmode(struct device *dev,
 			    struct device_attribute *attr, char *buf)
diff --git a/drivers/isdn/gigaset/ser-gigaset.c b/drivers/isdn/gigaset/ser-gigaset.c
index 8b0afd2..e96c058 100644
--- a/drivers/isdn/gigaset/ser-gigaset.c
+++ b/drivers/isdn/gigaset/ser-gigaset.c
@@ -11,13 +11,10 @@
  */
 
 #include "gigaset.h"
-
 #include <linux/module.h>
 #include <linux/moduleparam.h>
 #include <linux/platform_device.h>
-#include <linux/tty.h>
 #include <linux/completion.h>
-#include <linux/slab.h>
 
 /* Version Information */
 #define DRIVER_AUTHOR "Tilman Schmidt"
diff --git a/drivers/isdn/gigaset/usb-gigaset.c b/drivers/isdn/gigaset/usb-gigaset.c
index 9430a2b..76dbb20 100644
--- a/drivers/isdn/gigaset/usb-gigaset.c
+++ b/drivers/isdn/gigaset/usb-gigaset.c
@@ -16,10 +16,6 @@
  */
 
 #include "gigaset.h"
-
-#include <linux/errno.h>
-#include <linux/init.h>
-#include <linux/slab.h>
 #include <linux/usb.h>
 #include <linux/module.h>
 #include <linux/moduleparam.h>
-- 
1.6.5.3.298.g39add

^ permalink raw reply related

* [PATCH net-next-2.6] net: Introduce skb_orphan_try()
From: Eric Dumazet @ 2010-04-16 22:18 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20100415.143321.200497785.davem@davemloft.net>

Le jeudi 15 avril 2010 à 14:33 -0700, David Miller a écrit :

> If it's not legal to skb_orphan() here then it would not be legal for
> the drivers to unconditionally skb_orphan(), which they do.
> 
> So either your test is unnecessary, or we have a big existing problem
> :-)

I cooked up the following patch, introducing a skb_orphan_try() helper, to
document all known exceptions.

I have a possible followup for this patch:

Orphaning skbs earlier could also make dev_kfree_skb_irq() faster.
Instead of queueing the skb into completion_queue and triggering
NET_TX_SOFTIRQ, could we directly free an orphaned skb?

[PATCH net-next-2.6] net: Introduce skb_orphan_try()

A transmitted skb might be attached to a socket and a destructor, for
memory accounting purposes.

Traditionally, this destructor is called at tx completion time, when the
skb is freed.

When tx completion is performed by a different cpu than the sender, this
forces some cache lines to change ownership. XPS was an attempt to give
tx completion to the initial cpu.

David's idea is to call the destructor right before giving the skb to the
device (call to ndo_start_xmit()). Because device queues are usually
small, orphaning the skb before tx completion is not a big deal. Some
drivers already do this; we could do it at the upper level.

There is one known exception to this early orphaning: tx timestamping.
It needs to keep a reference to the socket until the device can give a
hardware or software timestamp.

This patch adds a skb_orphan_try() helper, to centralize all exceptions
to early orphaning in one spot, and uses it in dev_hard_start_xmit().

"tbench 16" results on a Nehalem machine (2 X5570  @ 2.93GHz)
before: Throughput 4428.9 MB/sec 16 procs
after: Throughput 4448.14 MB/sec 16 procs

UDP should get even better results, its destructor being more complex,
since SOCK_USE_WRITE_QUEUE is not set (four atomic ops instead of one)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
diff --git a/net/core/dev.c b/net/core/dev.c
index e8041eb..acae5fe 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1880,6 +1880,17 @@ static int dev_gso_segment(struct sk_buff *skb)
 	return 0;
 }

+/*
+ * Try to orphan skb early, right before transmission by the device.
+ * We cannot orphan skb if tx timestamp is requested, since
+ * drivers need to call skb_tstamp_tx() to send the timestamp.
+ */
+static inline void skb_orphan_try(struct sk_buff *skb)
+{
+	if (!skb_tx(skb)->flags)
+		skb_orphan(skb);
+}
+
 int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 			struct netdev_queue *txq)
 {
@@ -1904,23 +1915,10 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 		if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
 			skb_dst_drop(skb);

+		skb_orphan_try(skb);
 		rc = ops->ndo_start_xmit(skb, dev);
 		if (rc == NETDEV_TX_OK)
 			txq_trans_update(txq);
-		/*
-		 * TODO: if skb_orphan() was called by
-		 * dev->hard_start_xmit() (for example, the unmodified
-		 * igb driver does that; bnx2 doesn't), then
-		 * skb_tx_software_timestamp() will be unable to send
-		 * back the time stamp.
-		 *
-		 * How can this be prevented? Always create another
-		 * reference to the socket before calling
-		 * dev->hard_start_xmit()? Prevent that skb_orphan()
-		 * does anything in dev->hard_start_xmit() by clearing
-		 * the skb destructor before the call and restoring it
-		 * afterwards, then doing the skb_orphan() ourselves?
-		 */
 		return rc;
 	}

@@ -1938,6 +1936,7 @@ gso:
 		if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
 			skb_dst_drop(nskb);

+		skb_orphan_try(nskb);
 		rc = ops->ndo_start_xmit(nskb, dev);
 		if (unlikely(rc != NETDEV_TX_OK)) {
 			if (rc & ~NETDEV_TX_MASK)

^ permalink raw reply related
