Netdev List

* Re: sierra_net default m
From: David Miller @ 2010-04-30 18:49 UTC
  To: epasheva-ywE8TTl5eJHWpu6QEFMNjNBPR1lH4CV8
  Cc: jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	rfiler-ywE8TTl5eJHWpu6QEFMNjNBPR1lH4CV8,
	netdev-u79uwXL29TY76Z2rM5mHXA, linux-usb-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1272653150.9050.20.camel@Linuxdev3>

From: Elina Pasheva <epasheva-ywE8TTl5eJHWpu6QEFMNjNBPR1lH4CV8@public.gmane.org>
Date: Fri, 30 Apr 2010 11:45:50 -0700

> OK. Makes sense, so that the kernel does not build the driver if the
> configuration option is not explicitly set.

All non-core drivers should default to "n".

The default controls what the user is suggested to choose when they
run the configuration without an explicit setting in their existing
config.

They should never be told to turn on a driver by default for an
odd-ball piece of hardware.

Even the IDE and ATA disk layers don't default to anything explicitly.

Just remove the default tag altogether.  I see that a bunch of other USB
networking drivers have all sorts of default tags, I wish you didn't
follow their lead. :-)

* Re: [PATCH v6] net: batch skb dequeueing from softnet input_pkt_queue
From: Eric Dumazet @ 2010-04-30  5:25 UTC
  To: Andi Kleen
  Cc: Andi Kleen, hadi, Changli Gao, David S. Miller, Tom Herbert,
	Stephen Hemminger, netdev, lenb, arjan
In-Reply-To: <20100429214144.GA10663@gargoyle.fritz.box>

On Thursday 29 April 2010 at 23:41 +0200, Andi Kleen wrote:
> On Thu, Apr 29, 2010 at 09:12:27PM +0200, Eric Dumazet wrote:
> > Yes, mostly, but about 200,000 wakeups per second I would say...
> > 
> > If a cpu in a deep state receives an IPI and processes a softirq, should
> > it come back to the deep state immediately, or should it wait for some
> > milliseconds ?
> 
> In principle the cpuidle governor should detect this and not put the target into
> the slow deep C states. One change that was done recently to fix a similar
> problem for disk IO was to take processes that wait for IO into account
> (see 69d25870). But it doesn't work for networking.
> 
> Here's an untested patch that might help: tell the cpuidle governor
> networking is waiting for IO. This will tell it not to go down so deeply.
> 
> I might have missed some schedule() paths, feel free to add more.
> 
> Actually it's probably too aggressive because it will avoid C states even for
> a closed window on the other side, which might last for hours. Better would
> be some heuristic to only do this when you're really expecting IO shortly.
> 
> Also does your workload even sleep at all? If not we would need to increase
> the iowait counters in recvmsg() itself.
> 

Yes, my workload uses blocking recvmsg() calls, but Jamal's uses
epoll(), so I guess the problem is more generic than that. We should have
an estimate of the number of wakeups (IO or not...) per second (or
sub-second) so that cpuidle can avoid these deep states ?

> Anyways might be still worth a try.
> 
> For routing we probably need some other solution though, there are no 
> schedules there.
> 
> > 
> > > Perhaps need to feed some information to cpuidle's governor to prevent this problem.
> > > 
> > > idle=poll is very drastic, better to limit to C1 
> > > 
> > 
> > How can I do this ?
> 
> processor.max_cstate=1 or using /dev/network_latency 
> (see Documentation/power/pm_qos_interface.txt)
> 
> -Andi
> 

Thanks, I'll play with this today !
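
The PM QoS interface Andi points to works, as far as the documentation he
cites describes it, by writing a 32-bit latency target in microseconds to
the device node and keeping the file descriptor open for as long as the
constraint should apply. A minimal C sketch of that pattern follows. It
assumes the node path is passed in by the caller (Andi names
/dev/network_latency; which PM QoS class the cpuidle governor honours on a
given kernel should be checked against that documentation), and
pm_qos_hold() is a hypothetical helper name, not an existing API.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Write a 32-bit latency target (microseconds) to a PM QoS device node and
 * return the open descriptor, or -1 on error. The request only stays in
 * force while the descriptor remains open, so the caller must keep it
 * until the latency-sensitive work is done and then close() it.
 *
 * pm_qos_hold() is a hypothetical helper name, not a kernel or libc API.
 */
int pm_qos_hold(const char *path, int32_t latency_us)
{
	int fd;

	assert(path != NULL);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	if (write(fd, &latency_us, sizeof(latency_us)) !=
	    (ssize_t)sizeof(latency_us)) {
		close(fd);
		return -1;
	}
	return fd;	/* keep open for as long as the constraint is needed */
}
```

A benchmark would call something like pm_qos_hold("/dev/network_latency", 20)
before starting and close() the returned descriptor afterwards; the 20 us
figure is only an example value.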

> 
> 
> commit 810227a7c24ecae2bb4aac320490a7115ac33be8
> Author: Andi Kleen <ak@linux.intel.com>
> Date:   Thu Apr 29 23:33:18 2010 +0200
> 
>     Use io_schedule() in network stack to tell cpuidle governour to guarantee lower latencies
> 
>     XXX: probably too aggressive, some of these sleeps are not under high load.
> 
>     Based on a bug report from Eric Dumazet.
>     
>     Signed-off-by: Andi Kleen <ak@linux.intel.com>
> 
> diff --git a/net/core/sock.c b/net/core/sock.c
> index c5812bb..c246d6c 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1402,7 +1402,7 @@ static long sock_wait_for_wmem(struct sock *sk, long timeo)
>  			break;
>  		if (sk->sk_err)
>  			break;
> -		timeo = schedule_timeout(timeo);
> +		timeo = io_schedule_timeout(timeo);
>  	}
>  	finish_wait(sk->sk_sleep, &wait);
>  	return timeo;
> @@ -1512,7 +1512,7 @@ static void __lock_sock(struct sock *sk)
>  		prepare_to_wait_exclusive(&sk->sk_lock.wq, &wait,
>  					TASK_UNINTERRUPTIBLE);
>  		spin_unlock_bh(&sk->sk_lock.slock);
> -		schedule();
> +		io_schedule();
>  		spin_lock_bh(&sk->sk_lock.slock);
>  		if (!sock_owned_by_user(sk))
>  			break;
> 
> > 
> > Thanks !
> > 
> > 



* Re: OFT - reserving CPU's for networking
From: David Miller @ 2010-04-30 18:57 UTC
  To: tglx; +Cc: shemminger, eric.dumazet, ak, netdev, andi, peterz
In-Reply-To: <alpine.LFD.2.00.1004292055471.2951@localhost.localdomain>

From: Thomas Gleixner <tglx@linutronix.de>
Date: Thu, 29 Apr 2010 21:19:36 +0200 (CEST)

> Aside of that I seriously doubt that you can do networking w/o time
> and timers.

You're right that we need timestamps and the like.

But only if we actually process the packets on these restricted cpus :-)

If we use RPS and farm out all packets to other cpus, ie. just doing
the driver work and the remote cpu dispatch on these "offline" cpus,
it is doable.

Then we can do cool tricks like having the cpu spin on a mwait() on the
network device's status descriptor in memory.

In any event I agree with you, it's a cool idea at best, and likely
not really practical.

* Re: [PATCH linux-next v2 1/2] irq: Add CPU mask affinity hint
From: Thomas Gleixner @ 2010-04-30 19:04 UTC
  To: Peter P Waskiewicz Jr
  Cc: davem@davemloft.net, arjan@linux.jf.intel.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <Pine.WNT.4.64.1004301059230.6264@PPWASKIE-MOBL2.amr.corp.intel.com>

On Fri, 30 Apr 2010, Peter P Waskiewicz Jr wrote:
> On Fri, 30 Apr 2010, Thomas Gleixner wrote:
> > > +extern int irq_register_affinity_hint(unsigned int irq,
> > > +                                      const struct cpumask *m);
> > 
> > I think we can do with a single function irq_set_affinity_hint() and
> > let the caller set the pointer to NULL.
> 
> Ok, I've been running into some issues.  If CONFIG_CPUMASK_OFFSTACK is not
> set, then cpumask_var_t structs are single-element arrays that cannot be
> NULL'd out.  I'm pretty sure I need to keep the unregister part of the API.
> Thoughts?

extern int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m);

So why should calling irq_set_affinity_hint(irqnr, NULL) not work ?
 
> I just looked at the original show_affinity function, and it does not grab
> desc->lock before copying mask out of desc.  Should I follow that model, or
> should I fix that function to honor desc->lock?

desc->affinity can only race against something changing the affinity
bits, so that just might return some random data.

In the hint case the irq could be shut down and the affinity hint
could be freed while you are accessing it. Not a good idea :)

Thanks,

	tglx
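
The two points above can be sketched in userspace C. This is only an
illustration under stated assumptions: struct mask, mask_var_t, struct
desc, set_hint() and read_hint() are made-up stand-ins for struct cpumask,
cpumask_var_t, irq_desc and the proposed irq_set_affinity_hint(), and a
pthread mutex plays the role of desc->lock. It shows that an array-typed
mask still decays to a pointer, so a single setter can take NULL to clear
the hint, and that the reader copies the mask out under the lock so the
hint cannot be cleared and freed in the middle of the read.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

/* Userspace stand-ins for the kernel types; all names are illustrative. */
struct mask { unsigned long bits; };
typedef struct mask mask_var_t[1];	/* like cpumask_var_t without OFFSTACK */

struct desc {
	pthread_mutex_t lock;		/* plays the role of desc->lock */
	const struct mask *hint;	/* NULL means "no hint registered" */
};

/* Single setter: pass a mask to register a hint, NULL to clear it. */
int set_hint(struct desc *d, const struct mask *m)
{
	assert(d != NULL);
	pthread_mutex_lock(&d->lock);
	d->hint = m;
	pthread_mutex_unlock(&d->lock);
	return 0;
}

/*
 * Copy the hint out while holding the lock, so the owner cannot clear and
 * free it halfway through the read. Returns 1 if a hint was copied, 0 if
 * none is set.
 */
int read_hint(struct desc *d, struct mask *out)
{
	int found = 0;

	assert(d != NULL && out != NULL);
	pthread_mutex_lock(&d->lock);
	if (d->hint) {
		memcpy(out, d->hint, sizeof(*out));
		found = 1;
	}
	pthread_mutex_unlock(&d->lock);
	return found;
}
```

Passing a mask_var_t registers it and passing NULL unregisters it, so no
separate unregister call is needed in this sketch.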

* Re: [PATCH linux-next v2 1/2] irq: Add CPU mask affinity hint
From: Peter P Waskiewicz Jr @ 2010-04-30 19:13 UTC
  To: Thomas Gleixner
  Cc: davem@davemloft.net, arjan@linux.jf.intel.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <alpine.LFD.2.00.1004302038140.2951@localhost.localdomain>

On Fri, 30 Apr 2010, Thomas Gleixner wrote:

> On Fri, 30 Apr 2010, Peter P Waskiewicz Jr wrote:
>> On Fri, 30 Apr 2010, Thomas Gleixner wrote:
>>>> +extern int irq_register_affinity_hint(unsigned int irq,
>>>> +                                      const struct cpumask *m);
>>>
>>> I think we can do with a single function irq_set_affinity_hint() and
>>> let the caller set the pointer to NULL.
>>
>> Ok, I've been running into some issues.  If CONFIG_CPUMASK_OFFSTACK is not
>> set, then cpumask_var_t structs are single-element arrays that cannot be
>> NULL'd out.  I'm pretty sure I need to keep the unregister part of the API.
>> Thoughts?
>
> extern int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m);
>
> So why should calling irq_set_affinity_hint(irqnr, NULL) not work ?

What was that you said about coffee and brain cells?  :-)

>
>> I just looked at the original show_affinity function, and it does not grab
>> desc->lock before copying mask out of desc.  Should I follow that model, or
>> should I fix that function to honor desc->lock?
>
> desc->affinity can only race against something changing the affinity
> bits, so that just might return some random data.
>
> In the hint case the irq could be shut down and the affinity hint
> could be freed while you are accessing it. Not a good idea :)

Good point.

Latest spin coming shortly.  Thanks for the quick feedback!

-PJ

* RE: [net-2.6 PATCH] e1000e: enable/disable ASPM L0s and L1 and ERT according to hardware errata
From: Allan, Bruce W @ 2010-04-30 19:27 UTC
  To: David Miller
  Cc: anton@samba.org, Kirsher, Jeffrey T, netdev@vger.kernel.org,
	gospo@redhat.com, mjg@redhat.com
In-Reply-To: <20100429.120416.184828647.davem@davemloft.net>

On Thursday, April 29, 2010 12:04 PM, David Miller wrote:
> From: "Allan, Bruce W" <bruce.w.allan@intel.com>
> Date: Thu, 29 Apr 2010 10:19:56 -0700
> 
>> Your patch is probably the correct thing to do but I'm not all that
>> familiar with the ppc64 architecture.  Would you please provide the
>> output of 'lspci -t' and 'lspci -vvv -xxx'.
> 
> You're not guaranteed to have a pci_dev backing the top-level
> host controller, at the very least.  Some platforms don't even
> implement the PCI config space for the host controller, whilst on
> others access to it is protected by the hypervisor.
> 
> So you can't go poking around the PCI host controller registers
> unconditionally.
> 
> The same OOPS probably would happen on Sparc64 in some configurations
> too.  Although all of my PCI-E slots do have PCI-E switch port
> nodes, so maybe it wouldn't trigger here.

In that case, I agree with Anton's patch.

Reviewed-by: Bruce Allan <bruce.w.allan@intel.com>

Thanks,
Bruce.

* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: Eric Dumazet @ 2010-04-29  4:09 UTC
  To: hadi
  Cc: David Miller, xiaosuo, therbert, shemminger, netdev,
	Eilon Greenstein, Brian Bloniarz
In-Reply-To: <1272498293.4258.121.camel@bigi>

On Wednesday 28 April 2010 at 19:44 -0400, jamal wrote:
> On Wed, 2010-04-28 at 16:06 +0200, Eric Dumazet wrote:
> 
> > Here it is ;)
> 
> Sorry - things got a little hectic with TheMan.
> 
> I am afraid i dont have good news.
> Actually, I should say i dont have good news in regards to rps.
> For my sample app, two things seem to be happening:
> a) The overall performance has gotten better for both rps
> and non-rps.
> b) non-rps is now performing relatively better
> 
> This is just what i see in net-next not related to your patch.
> It seems the kernels i tested prior to April 23 showed rps better.
> The one i tested on Apr23 showed rps being about the same as non-rps.
> As i stated in my last result posting, I thought i didnt test properly
> but i did again today and saw the same thing. And now non-rps is
> _consistently_ better.
> So some regression is going on...
> 
> Your patch has improved the performance of rps relative to what is in
> net-next very lightly; but it has also improved the performance of
> non-rps;->
> My traces look different for the app cpu than yours - likely because of
> the apps being different.
> 
> At the moment i dont have time to dig deeper into code, but i could
> test as cycles show up.
> 
> I am attaching the profile traces and results.
> 
> cheers,
> jamal

Hi Jamal

I don't see the number of pps, the number of udp ports, or the number of
flows in your results.

In my latest results, I can handle more pps than before, regardless of
rps being on or off, and with various numbers of udp ports (one user
thread per port) and flows (many src addrs so that rps spreads
packets over many cpus).

If/when contention windows are smaller, a cpu can run uncontended and can
consume more cycles to process more frames ?

With a not-yet-published patch, I can even reach 600,000 pps in DDOS
situations, instead of 400,000.

Thanks !



* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: jamal @ 2010-04-30 19:30 UTC
  To: Eric Dumazet
  Cc: Changli Gao, David Miller, therbert, shemminger, netdev,
	Eilon Greenstein, Brian Bloniarz
In-Reply-To: <1272573383.3969.8.camel@bigi>

[-- Attachment #1: Type: text/plain, Size: 1322 bytes --]

Eric!

I managed to mod your program to look conceptually similar to mine
and i reproduced the results with same test kernel from yesterday. 
So it is likely the issue is in using epoll vs not using any async as
in your case.
Results attached as well as modified program.

Note: the key thing to remember is that rps with this program has gotten
worse over time, across different net-next kernels since Apr14 (look at
the graph i supplied). Sorry, I am really busy-ed out to dig any
further.

cheers,
jamal



On Thu, 2010-04-29 at 16:36 -0400, jamal wrote:
> On Thu, 2010-04-29 at 09:56 -0400, jamal wrote:
> 
> > 
> > I will try your program instead so we can reduce the variables
> 
> Results attached.
> With your app rps does a hell of a lot better and non-rps worse ;->
> With my proggie, non-rps does much better than yours and rps does
> a lot worse for same setup. I see the scheduler kicking in quite a bit in
> non-rps for you...
> 
> The main difference between us as i see it is:
> a) i use epoll - actually linked to libevent (1.0.something)
> b) I fork processes and you use pthreads.
> 
> I dont have time to chase it today, but 1) I am either going to change
> yours to use libevent or make mine get rid of it then 2) move towards
> pthreads or have yours fork..
> then observe if that makes any difference..
> 
> 
> cheers,
> jamal

[-- Attachment #2: apr30-ericmod --]
[-- Type: text/plain, Size: 8919 bytes --]


First a few runs with Eric's code + epoll/libevent

-------------------------------------------------------------------------------
   PerfTop:    4009 irqs/sec  kernel:83.4% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ____________________

             2097.00  8.6% sky2_poll                   [sky2]              
             1742.00  7.2% _raw_spin_lock_irqsave      [kernel]            
              831.00  3.4% system_call                 [kernel]            
              654.00  2.7% copy_user_generic_string    [kernel]            
              654.00  2.7% datagram_poll               [kernel]            
              647.00  2.7% fget                        [kernel]            
              623.00  2.6% _raw_spin_unlock_irqrestore [kernel]            
              547.00  2.3% _raw_spin_lock_bh           [kernel]            
              506.00  2.1% sys_epoll_ctl               [kernel]            
              475.00  2.0% kmem_cache_free             [kernel]            
              466.00  1.9% schedule                    [kernel]            
              436.00  1.8% vread_tsc                   [kernel].vsyscall_fn
              417.00  1.7% fput                        [kernel]            
              415.00  1.7% sys_epoll_wait              [kernel]            
              402.00  1.7% _raw_spin_lock              [kernel]            


-------------------------------------------------------------------------------
   PerfTop:     616 irqs/sec  kernel:98.7% [1000Hz cycles],  (all, cpu: 0)
-------------------------------------------------------------------------------

             samples  pcnt function               DSO
             _______ _____ ______________________ ________

             2534.00 28.6% sky2_poll              [sky2]  
              503.00  5.7% ip_route_input         [kernel]
              438.00  4.9% _raw_spin_lock_irqsave [kernel]
              418.00  4.7% __udp4_lib_lookup      [kernel]
              378.00  4.3% __alloc_skb            [kernel]
              364.00  4.1% ip_rcv                 [kernel]
              323.00  3.6% _raw_spin_lock         [kernel]
              315.00  3.5% sock_queue_rcv_skb     [kernel]
              284.00  3.2% __netif_receive_skb    [kernel]
              281.00  3.2% __udp4_lib_rcv         [kernel]
              266.00  3.0% __wake_up_common       [kernel]
              238.00  2.7% sock_def_readable      [kernel]
              181.00  2.0% __kmalloc              [kernel]
              163.00  1.8% kmem_cache_alloc       [kernel]
              150.00  1.7% ep_poll_callback       [kernel]


-------------------------------------------------------------------------------
   PerfTop:     854 irqs/sec  kernel:80.2% [1000Hz cycles],  (all, cpu: 2)
-------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ____________________

              341.00  8.0% _raw_spin_lock_irqsave      [kernel]            
              235.00  5.5% system_call                 [kernel]            
              174.00  4.1% datagram_poll               [kernel]            
              174.00  4.1% fget                        [kernel]            
              173.00  4.1% copy_user_generic_string    [kernel]            
              135.00  3.2% _raw_spin_unlock_irqrestore [kernel]            
              125.00  2.9% _raw_spin_lock_bh           [kernel]            
              122.00  2.9% schedule                    [kernel]            
              113.00  2.6% sys_epoll_ctl               [kernel]            
              113.00  2.6% kmem_cache_free             [kernel]            
              108.00  2.5% vread_tsc                   [kernel].vsyscall_fn
              105.00  2.5% sys_epoll_wait              [kernel]            
              102.00  2.4% udp_recvmsg                 [kernel]            
               95.00  2.2% mutex_lock                  [kernel]            

Average 97.55% of 10M packets at 750Kpps

Turn on rps mask ee and irq affinity to cpu0

-------------------------------------------------------------------------------
   PerfTop:    3885 irqs/sec  kernel:83.6% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

             samples  pcnt function                       DSO
             _______ _____ ______________________________ ________

             2945.00 16.7% sky2_poll                      [sky2]  
              653.00  3.7% _raw_spin_lock_irqsave         [kernel]
              460.00  2.6% system_call                    [kernel]
              420.00  2.4% _raw_spin_unlock_irqrestore    [kernel]
              414.00  2.3% sky2_intr                      [sky2]  
              392.00  2.2% fget                           [kernel]
              360.00  2.0% ip_rcv                         [kernel]
              324.00  1.8% sys_epoll_ctl                  [kernel]
              323.00  1.8% __netif_receive_skb            [kernel]
              310.00  1.8% schedule                       [kernel]
              292.00  1.7% ip_route_input                 [kernel]
              292.00  1.7% _raw_spin_lock                 [kernel]
              291.00  1.7% copy_user_generic_string       [kernel]
              284.00  1.6% kmem_cache_free                [kernel]
              262.00  1.5% call_function_single_interrupt [kernel]

-------------------------------------------------------------------------------
   PerfTop:    1000 irqs/sec  kernel:98.1% [1000Hz cycles],  (all, cpu: 0)
-------------------------------------------------------------------------------

             samples  pcnt function                            DSO
             _______ _____ ___________________________________ ________

             4170.00 61.9% sky2_poll                           [sky2]  
              723.00 10.7% sky2_intr                           [sky2]  
              159.00  2.4% __alloc_skb                         [kernel]
              140.00  2.1% get_rps_cpu                         [kernel]
              106.00  1.6% __kmalloc                           [kernel]
               95.00  1.4% enqueue_to_backlog                  [kernel]
               86.00  1.3% kmem_cache_alloc                    [kernel]
               85.00  1.3% irq_entries_start                   [kernel]
               85.00  1.3% _raw_spin_lock_irqsave              [kernel]
               82.00  1.2% _raw_spin_lock                      [kernel]
               66.00  1.0% swiotlb_sync_single                 [kernel]
               58.00  0.9% sky2_remove                         [sky2]  
               49.00  0.7% default_send_IPI_mask_sequence_phys [kernel]
               47.00  0.7% sky2_rx_submit                      [sky2]  
               36.00  0.5% _raw_spin_unlock_irqrestore         [kernel]

-------------------------------------------------------------------------------
   PerfTop:     344 irqs/sec  kernel:84.3% [1000Hz cycles],  (all, cpu: 2)
-------------------------------------------------------------------------------

             samples  pcnt function                       DSO
             _______ _____ ______________________________ ____________________

              114.00  5.2% _raw_spin_lock_irqsave         [kernel]            
               79.00  3.6% fget                           [kernel]            
               78.00  3.6% ip_rcv                         [kernel]            
               78.00  3.6% system_call                    [kernel]            
               75.00  3.4% _raw_spin_unlock_irqrestore    [kernel]            
               67.00  3.1% sys_epoll_ctl                  [kernel]            
               65.00  3.0% schedule                       [kernel]            
               61.00  2.8% ip_route_input                 [kernel]            
               48.00  2.2% vread_tsc                      [kernel].vsyscall_fn
               48.00  2.2% call_function_single_interrupt [kernel]            
               46.00  2.1% kmem_cache_free                [kernel]            
               45.00  2.1% __netif_receive_skb            [kernel]            
               41.00  1.9% process_recv                   snkudp              
               40.00  1.8% kfree                          [kernel]            
               39.00  1.8% _raw_spin_lock                 [kernel]            

92.97% of 10M packets at 750Kpps


Ok, so this is exactly what i saw with my app. non-rps is better.
To summarize: It used to be the opposite on net-next before around
Apr14. rps has gotten worse.
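
For reference, the rps mask ee used in the second set of runs is a
hexadecimal CPU bitmap, with bit n standing for CPU n. A small C sketch of
the decoding follows; mask_to_cpus() is a hypothetical helper name, not a
kernel interface. On the 8-CPU box above, ee works out to CPUs 1, 2, 3, 5,
6 and 7, so CPU 0 (which takes the interrupts) and CPU 4 get no RPS work.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Decode a hexadecimal CPU mask such as "ee" into a list of CPU numbers.
 * Bit n of the mask stands for CPU n. Returns how many CPUs were written
 * to cpus[], never more than max.
 *
 * mask_to_cpus() is a hypothetical helper, not part of any kernel interface.
 */
int mask_to_cpus(const char *hex, int *cpus, int max)
{
	unsigned long mask;
	int cpu, n = 0;

	assert(hex != NULL && cpus != NULL);
	mask = strtoul(hex, NULL, 16);
	for (cpu = 0; mask != 0 && n < max; cpu++, mask >>= 1) {
		if (mask & 1UL)
			cpus[n++] = cpu;
	}
	return n;
}
```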

[-- Attachment #3: udpsnkfrk.c --]
[-- Type: text/x-csrc, Size: 3650 bytes --]

/*
 *  Usage: udpsink [ -p baseport] nbports
*/
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <stdio.h>
#include <errno.h>
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
#include <pthread.h>
#include <event.h>

struct worker_data {
	struct event *snk_ev;
	struct event_base *base;
	struct timeval t;
	unsigned long pack_count;
	unsigned long bytes_count;
	unsigned long tout;
	int fd;			/* move to avoid hole on 64-bit */
	int pad1;		/*64B - let Eric figure the math;-> */
	//unsigned long _padd[16 - 3]; /* alignment */ 
};

void usage(int code)
{
	fprintf(stderr, "Usage: udpsink [-p baseport] nbports\n");
	exit(code);
}

void process_recv(int fd, short ev, void *arg)
{
	char buffer[4096];
	struct sockaddr_in addr;
	socklen_t len = sizeof(addr);
	struct worker_data *wdata = (struct worker_data *)arg;
	int lu = 0;

	if ((event_add(wdata->snk_ev, &wdata->t)) < 0) {
		perror("cb event_add");
		return;
	}

	if (ev == EV_TIMEOUT) {
		wdata->tout++;
	} else {
		lu = recvfrom(wdata->fd, buffer, sizeof(buffer), 0,
			      (struct sockaddr *)&addr, &len);
		if (lu > 0) {
			wdata->pack_count++;
			wdata->bytes_count += lu;
		}
	}
}

int prep_thread(struct worker_data *wdata)
{
	wdata->t.tv_sec = 1;
	wdata->t.tv_usec = random() % 50000L;

	wdata->base = event_init();
	event_set(wdata->snk_ev, wdata->fd, EV_READ, process_recv, wdata);
	event_base_set(wdata->base, wdata->snk_ev);
	if ((event_add(wdata->snk_ev, &wdata->t)) < 0) {
		perror("event_add");
		return -1;
	}
	return 0;
}

void *worker_func(void *arg)
{
	struct worker_data *wdata = (struct worker_data *)arg;

	return (void *)(long)event_base_loop(wdata->base, 0);
}

int main(int argc, char *argv[])
{
	int c;
	int baseport = 4000;
	int nbthreads;
	struct worker_data *wdata;
	unsigned long ototal = 0;
	int concurrent = 0;
	int verbose = 0;
	int i;
	while ((c = getopt(argc, argv, "cvp:")) != -1) {
		if (c == 'p')
			baseport = atoi(optarg);
		else if (c == 'c')
			concurrent = 1;
		else if (c == 'v')
			verbose++;
		else
			usage(1);
	}
	if (optind == argc)
		usage(1);
	nbthreads = atoi(argv[optind]);
	wdata = calloc(nbthreads, sizeof(struct worker_data));
	if (!wdata) {
		perror("calloc");
		return 1;
	}

	for (i = 0; i < nbthreads; i++) {
		struct sockaddr_in addr;
		pthread_t tid;

		/* Every worker needs its own event, even when sharing a socket */
		wdata[i].snk_ev = malloc(sizeof(struct event));
		if (!wdata[i].snk_ev)
			return 1;
		memset(wdata[i].snk_ev, 0, sizeof(struct event));

		if (i && concurrent) {
			wdata[i].fd = wdata[0].fd;
		} else {

			wdata[i].fd = socket(PF_INET, SOCK_DGRAM, 0);
			if (wdata[i].fd == -1) {
				free(wdata[i].snk_ev);
				perror("socket");
				return 1;
			}
			memset(&addr, 0, sizeof(addr));
			addr.sin_family = AF_INET;
//                      addr.sin_addr.s_addr = inet_addr(argv[optind]);
			addr.sin_port = htons(baseport + i);
			if (bind
			    (wdata[i].fd, (struct sockaddr *)&addr,
			     sizeof(addr)) < 0) {
				free(wdata[i].snk_ev);
				perror("bind");
				return 1;
			}
//                      fcntl(wdata[i].fd, F_SETFL, O_NDELAY);
		}
		if (prep_thread(wdata + i)) {
			printf("failed to allocate thread %d, exit\n", i);
			exit(1);
		}
		pthread_create(&tid, NULL, worker_func, wdata + i);
	}

	for (;;) {
		unsigned long total;
		long delta;

		sleep(1);
		total = 0;
		for (i = 0; i < nbthreads; i++) {
			total += wdata[i].pack_count;
		}
		delta = total - ototal;
		if (delta) {
			printf("%lu pps (%lu", delta, total);
			if (verbose) {
				for (i = 0; i < nbthreads; i++) {
					if (wdata[i].pack_count)
						printf(" %d:%lu", i,
						       wdata[i].pack_count);
				}
			}
			printf(")\n");
		}
		ototal = total;
	}
}

* [RFC] BPF program access to transport header
From: Paul LeoNerd Evans @ 2010-04-30 19:39 UTC
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 1816 bytes --]

Via the SKF_NET_OFF extension area, a BPF program has nice easy access
to the network header, wherever it might happen to be in the packet.
This makes it simpler to write filters on e.g. IPv4 headers, knowing
that fields will always be at simple offsets relative to SKF_NET_OFF.
Using the data at WORD[SKF_AD_PROTO] it's easy also to find out what
network protocol this is.

I would like to provide something similar for the transport header. Without
doing so, it is very hard to parse e.g. UDP or TCP headers that may be
contained within IPv6 packets, because of the linked-list way IPv6
headers chain on to each other. BPF doesn't provide a while() loop or
any kind of backward jump, meaning the filter program has to be
loop-unrolled a static number of times. This quickly leads to very large
programs.
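
To make the chaining problem concrete, here is roughly the loop a filter
would need in order to find the transport header, written as ordinary C.
It is a sketch under assumptions: ipv6_transport_offset() is a
hypothetical name, only the hop-by-hop, routing, fragment,
destination-options and AH headers are skipped, and non-initial fragments
are not treated specially. Classic BPF has no backward jump, so each
iteration of this loop would have to be unrolled by hand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* IANA protocol numbers for the IPv6 extension headers handled here. */
enum { NH_HOPOPTS = 0, NH_ROUTING = 43, NH_FRAGMENT = 44, NH_AH = 51,
       NH_DSTOPTS = 60 };

/*
 * Walk the IPv6 extension-header chain starting at pkt, which must point
 * at the fixed 40-byte IPv6 header. On success, store the transport
 * protocol number in *proto and return the transport header's byte
 * offset; return -1 if the packet is truncated.
 */
long ipv6_transport_offset(const uint8_t *pkt, size_t len, uint8_t *proto)
{
	size_t off = 40;
	uint8_t nh;

	assert(pkt != NULL && proto != NULL);
	if (len < 40)
		return -1;
	nh = pkt[6];		/* Next Header field of the fixed header */

	for (;;) {		/* the loop classic BPF cannot express */
		size_t hlen;

		switch (nh) {
		case NH_HOPOPTS:
		case NH_ROUTING:
		case NH_DSTOPTS:
			if (off + 2 > len)
				return -1;
			/* length in 8-octet units, first 8 not counted */
			hlen = ((size_t)pkt[off + 1] + 1) * 8;
			break;
		case NH_FRAGMENT:
			hlen = 8;	/* fixed-size header */
			break;
		case NH_AH:
			if (off + 2 > len)
				return -1;
			/* length in 4-octet units, minus 2 */
			hlen = ((size_t)pkt[off + 1] + 2) * 4;
			break;
		default:	/* anything else is treated as the transport */
			*proto = nh;
			return (long)off;
		}
		if (off + hlen > len)
			return -1;
		nh = pkt[off];	/* each header names the one that follows */
		off += hlen;
	}
}
```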

I foresee a number of issues with trying to provide this:

 * How to provide the protocol number (e.g. 6 for TCP, 1 for ICMP) to
   the BPF program

 * How to obtain the transport offset - AIUI, the value returned by
   skb_transport_offset() won't actually be set yet by the time the
   filter program runs.

 * What to do if the underlying protocol doesn't support a transport
   layer above it - e.g. ARP.

Ideally, this would make it easy to filter, say, TCP destination port
80, by doing the following:

  LD WORD[SKF_AD_PROTO]
  JEQ ETHERTYPE_IPV4, 1, 0
  JEQ ETHERTYPE_IPV6, 0, fail

  LD WORD[SKF_AD_TRANSPROTO]
  JEQ IPPROTO_TCP, 0, fail

  LD HALF[SKF_TRANS_OFF+2]
  JEQ 80, 0, fail

  LD len
  RET A

fail:
  RET 0

In this short simple BPF program we've avoided all the issues involved
with trying to parse IPv6 headers.

Can we make this work?

-- 
Paul "LeoNerd" Evans

leonerd@leonerd.org.uk
ICQ# 4135350       |  Registered Linux# 179460
http://www.leonerd.org.uk/

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

* Re: [net-2.6 PATCH] e1000e: enable/disable ASPM L0s and L1 and ERT according to hardware errata
From: David Miller @ 2010-04-30 19:51 UTC
  To: bruce.w.allan; +Cc: anton, jeffrey.t.kirsher, netdev, gospo, mjg
In-Reply-To: <8DD2590731AB5D4C9DBF71A877482A90622DE3C2@orsmsx509.amr.corp.intel.com>

From: "Allan, Bruce W" <bruce.w.allan@intel.com>
Date: Fri, 30 Apr 2010 12:27:02 -0700

> In that case, I agree with Anton's patch.
> 
> Reviewed-by: Bruce Allan <bruce.w.allan@intel.com>

Great, applied, thanks guys.

* Re: pull request: wireless-2.6 2010-04-30
From: David Miller @ 2010-04-30 19:54 UTC
  To: linville; +Cc: linux-wireless, netdev, linux-kernel
In-Reply-To: <20100430180840.GD4120@tuxdriver.com>

From: "John W. Linville" <linville@tuxdriver.com>
Date: Fri, 30 Apr 2010 14:08:40 -0400

> One more for 2.6.34...it avoids some DMA mapping-related failures.

Pulled, thanks John.

* Re: OFT - reserving CPU's for networking
From: Thomas Gleixner @ 2010-04-30 19:58 UTC
  To: David Miller; +Cc: shemminger, eric.dumazet, ak, netdev, andi, peterz
In-Reply-To: <20100430.115715.216750975.davem@davemloft.net>

Dave,

On Fri, 30 Apr 2010, David Miller wrote:

> From: Thomas Gleixner <tglx@linutronix.de>
> Date: Thu, 29 Apr 2010 21:19:36 +0200 (CEST)
> 
> > Aside of that I seriously doubt that you can do networking w/o time
> > and timers.
> 
> You're right that we need timestamps and the like.
> 
> But only if we actually process the packets on these restricted cpus :-)
> 
> If we use RPS and farm out all packets to other cpus, ie. just doing
> the driver work and the remote cpu dispatch on these "offline" cpus,
> it is doable.
> 
> Then we can do cool tricks like having the cpu spin on a mwait() on the
> network device's status descriptor in memory.
> 
> In any event I agree with you, it's a cool idea at best, and likely
> not really practical.

Well, it might be worth experimenting with that once we get the basic
infrastructure in place to "isolate" cores under full kernel control. 

It's not too hard to solve the problems, but it seems nobody has a
free time slot to tackle them.

Thanks

	tglx

* Re: [PATCH] macvtap: add ioctl to modify vnet header size
From: Arnd Bergmann @ 2010-04-29 14:40 UTC
  To: Michael S. Tsirkin
  Cc: David S. Miller, Sridhar Samudrala, Eric Dumazet, netdev,
	linux-kernel, David Stevens
In-Reply-To: <20100429135158.GA26303@redhat.com>

On Thursday 29 April 2010, Michael S. Tsirkin wrote:
> This adds TUNSETVNETHDRSZ/TUNGETVNETHDRSZ support
> to macvtap.

Looks good, thanks Michael!

> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

Acked-by: Arnd Bergmann <arnd@arndb.de>

* [PATCH linux-next v3 1/2] irq: Add CPU mask affinity hint
From: Peter P Waskiewicz Jr @ 2010-04-30 20:23 UTC (permalink / raw)
  To: tglx, davem, arjan; +Cc: netdev, linux-kernel

This patch adds a cpumask affinity hint to the irq_desc
structure, along with a registration function and a read-only
proc entry for each interrupt.

This affinity_hint handle for each interrupt can be used by
underlying drivers that need a better mechanism to control
interrupt affinity.  The underlying driver can register a
cpumask for the interrupt, which will allow the driver to
provide the CPU mask for the interrupt to anything that
requests it.  The intent is to extend the userspace daemon,
irqbalance, to help hint to it a preferred CPU mask to balance
the interrupt into.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
---

 include/linux/interrupt.h |    7 +++++++
 include/linux/irq.h       |    1 +
 kernel/irq/manage.c       |   19 +++++++++++++++++++
 kernel/irq/proc.c         |   42 ++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 69 insertions(+), 0 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 75f3f00..4d1df9b 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -209,6 +209,8 @@ extern int irq_set_affinity(unsigned int irq, const struct cpumask *cpumask);
 extern int irq_can_set_affinity(unsigned int irq);
 extern int irq_select_affinity(unsigned int irq);
 
+extern int irq_register_affinity_hint(unsigned int irq,
+                                      const struct cpumask *m);
 #else /* CONFIG_SMP */
 
 static inline int irq_set_affinity(unsigned int irq, const struct cpumask *m)
@@ -223,6 +225,11 @@ static inline int irq_can_set_affinity(unsigned int irq)
 
 static inline int irq_select_affinity(unsigned int irq)  { return 0; }
 
+static inline int irq_register_affinity_hint(unsigned int irq,
+                                             const struct cpumask *m)
+{
+	return -EINVAL;
+}
 #endif /* CONFIG_SMP && CONFIG_GENERIC_HARDIRQS */
 
 #ifdef CONFIG_GENERIC_HARDIRQS
diff --git a/include/linux/irq.h b/include/linux/irq.h
index 707ab12..83b16d7 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -206,6 +206,7 @@ struct irq_desc {
 	struct proc_dir_entry	*dir;
 #endif
 	const char		*name;
+	const struct cpumask	*affinity_hint;
 } ____cacheline_internodealigned_in_smp;
 
 extern void arch_init_copy_chip_data(struct irq_desc *old_desc,
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 704e488..1354fc9 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -138,6 +138,22 @@ int irq_set_affinity(unsigned int irq, const struct cpumask *cpumask)
 	return 0;
 }
 
+int irq_register_affinity_hint(unsigned int irq, const struct cpumask *m)
+{
+	struct irq_desc *desc = irq_to_desc(irq);
+	unsigned long flags;
+
+	if (!desc)
+		return -EINVAL;
+
+	raw_spin_lock_irqsave(&desc->lock, flags);
+	desc->affinity_hint = m;
+	raw_spin_unlock_irqrestore(&desc->lock, flags);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(irq_register_affinity_hint);
+
 #ifndef CONFIG_AUTO_IRQ_AFFINITY
 /*
  * Generic version of the affinity autoselector.
@@ -916,6 +932,9 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
 			desc->chip->disable(irq);
 	}
 
+	/* make sure affinity_hint is cleaned up */
+	desc->affinity_hint = NULL;
+
 	raw_spin_unlock_irqrestore(&desc->lock, flags);
 
 	unregister_handler_proc(irq, action);
diff --git a/kernel/irq/proc.c b/kernel/irq/proc.c
index 7a6eb04..1aa7939 100644
--- a/kernel/irq/proc.c
+++ b/kernel/irq/proc.c
@@ -32,6 +32,32 @@ static int irq_affinity_proc_show(struct seq_file *m, void *v)
 	return 0;
 }
 
+static int irq_affinity_hint_proc_show(struct seq_file *m, void *v)
+{
+	struct irq_desc *desc = irq_to_desc((long)m->private);
+	unsigned long flags;
+	cpumask_var_t mask;
+	int ret = -EINVAL;
+
+	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
+		return -ENOMEM;
+
+	raw_spin_lock_irqsave(&desc->lock, flags);
+	if (desc->affinity_hint) {
+		cpumask_copy(mask, desc->affinity_hint);
+		ret = 0;
+	}
+	raw_spin_unlock_irqrestore(&desc->lock, flags);
+
+	if (!ret) {
+		seq_cpumask(m, mask);
+		seq_putc(m, '\n');
+	}
+	free_cpumask_var(mask);
+
+	return ret;
+}
+
 #ifndef is_affinity_mask_valid
 #define is_affinity_mask_valid(val) 1
 #endif
@@ -84,6 +110,11 @@ static int irq_affinity_proc_open(struct inode *inode, struct file *file)
 	return single_open(file, irq_affinity_proc_show, PDE(inode)->data);
 }
 
+static int irq_affinity_hint_proc_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, irq_affinity_hint_proc_show, PDE(inode)->data);
+}
+
 static const struct file_operations irq_affinity_proc_fops = {
 	.open		= irq_affinity_proc_open,
 	.read		= seq_read,
@@ -92,6 +123,13 @@ static const struct file_operations irq_affinity_proc_fops = {
 	.write		= irq_affinity_proc_write,
 };
 
+static const struct file_operations irq_affinity_hint_proc_fops = {
+	.open		= irq_affinity_hint_proc_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
 static int default_affinity_show(struct seq_file *m, void *v)
 {
 	seq_cpumask(m, irq_default_affinity);
@@ -231,6 +269,10 @@ void register_irq_proc(unsigned int irq, struct irq_desc *desc)
 	/* create /proc/irq/<irq>/smp_affinity */
 	proc_create_data("smp_affinity", 0600, desc->dir,
 			 &irq_affinity_proc_fops, (void *)(long)irq);
+
+	/* create /proc/irq/<irq>/affinity_hint */
+	proc_create_data("affinity_hint", 0400, desc->dir,
+			 &irq_affinity_hint_proc_fops, (void *)(long)irq);
 #endif
 
 	proc_create_data("spurious", 0444, desc->dir,
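
The read-only /proc/irq/<irq>/affinity_hint entry above is meant to be
consumed by userspace such as irqbalance. A minimal user-space sketch of
testing one CPU in such a mask follows. It assumes the text produced by
seq_cpumask() is the usual comma-separated groups of hex digits with the
lowest-numbered CPUs in the rightmost digit (for example
"00000000,0000000f"). The function name affinity_hint_has_cpu() is
hypothetical and not part of the patch.

```c
#include <ctype.h>
#include <string.h>

/*
 * Return 1 if `cpu` is set in the textual mask, 0 if it is clear,
 * and -1 if the string is empty or contains an unexpected character.
 * Assumed input format: hex digits, optional ',' separators, optional
 * trailing '\n', least significant digit last.
 */
int affinity_hint_has_cpu(const char *mask, unsigned int cpu)
{
	size_t len = strlen(mask);
	unsigned int digit = 0;	/* hex digit index, counted from the right */
	int result = 0;

	while (len > 0) {
		char c = mask[--len];
		unsigned int val;

		if (c == ',' || c == '\n')
			continue;	/* separators carry no bits */
		if (!isxdigit((unsigned char)c))
			return -1;
		if (isdigit((unsigned char)c))
			val = (unsigned int)(c - '0');
		else
			val = (unsigned int)(tolower((unsigned char)c) - 'a') + 10;

		/* each hex digit covers four consecutive CPUs */
		if (digit == cpu / 4 && (val & (1u << (cpu % 4))))
			result = 1;
		digit++;
	}
	return digit ? result : -1;
}
```

A balancer could read the file into a buffer and call
affinity_hint_has_cpu(buf, n) for each candidate CPU n. A read that fails
with EINVAL means the driver has not registered a hint for that interrupt.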


^ permalink raw reply related

* Re: [RFC] BPF program access to transport header
From: Patrick McHardy @ 2010-04-30 20:15 UTC (permalink / raw)
  To: Paul LeoNerd Evans; +Cc: netdev
In-Reply-To: <20100430193916.GZ19334@cel.leo>

Paul LeoNerd Evans wrote:
> Via the SKF_NET_OFF extension area, a BPF program has nice easy access
> to the network header, wherever it might happen to be in the packet.
> This makes it simpler to write filters on e.g. IPv4 headers, knowing
> that fields will always be at simple offsets relative to SKF_NET_OFF.
> Using the data at WORD[SKF_AD_PROTO] it's easy also to find out what
> network protocol this is.
> 
> I would like to provide similar for the transport header. Without doing
> so, it is very hard to parse e.g. UDP or TCP headers that may be
> contained within IPv6 protocol, because of the linked-list way IPv6
> headers chain on to each other. BPF doesn't provide a while() loop or
> any kind of backward jump, meaning the filter program has to be
> loop-unrolled a static number of times. This quickly leads to very large
> programs.
> 
> I foresee a number of issues with trying to provide this:
> 
>  * How to provide the protocol number (e.g. 6 for TCP, 1 for ICMP) to
>    the BPF program

Using one of the registers?

>  * How to obtain the transport offset - AIUI, the skf_transport_offset()
>    won't actually be set yet by the time the filter program runs.

For IPv4 it's trivial. For IPv6 you could use ipv6_skip_exthdr().
A slightly more flexible way would be to use something like the
netfilter ipv6_find_hdr() function to get the offset of any header
type. The protocol number could be returned in one of the registers
(the other one would contain the offset).

>  * What to do if the underlying protocol doesn't support a transport
>    layer above it - e.g. ARP.

I'd say simply abort the filter.

> Ideally, this would make it easy to filter, say, TCP destination port
> 80, by doing the following:
> 
>   LD WORD[SKF_AD_PROTO]
>   JEQ ETHERTYPE_IPV4, 1, 0
>   JEQ ETHERTYPE_IPV6, 0, fail
> 
>   LD WORD[SKF_AD_TRANSPROTO]
>   JEQ IPPROTO_TCP, 0, fail
> 
>   LD HALF[SKF_TRANS_OFF+2]
>   JEQ 80, 0, fail
> 
>   LD len
>   RET A
> 
> fail:
>   RET 0
> 
> In this short simple BPF program we've avoided all the issues involved
> with trying to parse IPv6 headers.
> 
> Can we make this work?
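
The extension-header walk that makes this hard in BPF, and that a helper
such as ipv6_skip_exthdr() would do on the filter's behalf, can be
sketched in user-space C as below. This is a simplified illustration, not
the kernel's implementation: the function name ipv6_transport_offset() and
the NEXTHDR_* macros are defined here for the example, only the most
common extension headers are handled, and non-first fragments are not
treated specially.

```c
#include <stddef.h>
#include <stdint.h>

/* IPv6 "next header" values for the extension headers handled here */
#define NEXTHDR_HOP       0
#define NEXTHDR_ROUTING  43
#define NEXTHDR_FRAGMENT 44
#define NEXTHDR_AUTH     51
#define NEXTHDR_DEST     60

/*
 * `pkt` points at the fixed 40-byte IPv6 header, `len` is the number of
 * valid bytes. On success return the offset of the first header that is
 * not a known extension header and store its protocol number in *proto.
 * Return -1 if the packet is too short to finish the walk.
 */
int ipv6_transport_offset(const uint8_t *pkt, size_t len, uint8_t *proto)
{
	size_t off = 40;	/* size of the fixed IPv6 header */
	uint8_t nexthdr;

	if (len < 40)
		return -1;
	nexthdr = pkt[6];	/* Next Header field of the fixed header */

	for (;;) {
		size_t hdrlen;

		switch (nexthdr) {
		case NEXTHDR_HOP:
		case NEXTHDR_ROUTING:
		case NEXTHDR_DEST:
			if (off + 2 > len)
				return -1;
			/* length field counts 8-byte units beyond the first */
			hdrlen = ((size_t)pkt[off + 1] + 1) * 8;
			break;
		case NEXTHDR_FRAGMENT:
			hdrlen = 8;	/* fragment header has a fixed size */
			break;
		case NEXTHDR_AUTH:
			if (off + 2 > len)
				return -1;
			/* AH length field counts 4-byte units minus two */
			hdrlen = ((size_t)pkt[off + 1] + 2) * 4;
			break;
		default:
			*proto = nexthdr;
			return (int)off;
		}
		if (off + hdrlen > len)
			return -1;
		nexthdr = pkt[off];	/* first byte of each header is Next Header */
		off += hdrlen;
	}
}
```

The loop is exactly what classic BPF cannot express without unrolling,
which is why the offset and protocol number would have to be computed by
the kernel and handed to the filter.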


^ permalink raw reply

* [PATCH linux-next v3 2/2] ixgbe: Example usage of the new IRQ affinity_hint callback
From: Peter P Waskiewicz Jr @ 2010-04-30 20:24 UTC (permalink / raw)
  To: tglx, davem, arjan; +Cc: netdev, linux-kernel
In-Reply-To: <20100430202343.4591.66240.stgit@ppwaskie-hc2.jf.intel.com>

This patch uses the new IRQ affinity_hint callback mechanism.
It serves purely as an example of how a low-level driver can
utilize this new interface.

An official ixgbe patch will be pushed through netdev once the
IRQ patches have been accepted and merged.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
---

 drivers/net/ixgbe/ixgbe.h      |    2 ++
 drivers/net/ixgbe/ixgbe_main.c |   21 ++++++++++++++++++++-
 2 files changed, 22 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe.h b/drivers/net/ixgbe/ixgbe.h
index 79c35ae..c220b9f 100644
--- a/drivers/net/ixgbe/ixgbe.h
+++ b/drivers/net/ixgbe/ixgbe.h
@@ -32,6 +32,7 @@
 #include <linux/pci.h>
 #include <linux/netdevice.h>
 #include <linux/aer.h>
+#include <linux/cpumask.h>
 
 #include "ixgbe_type.h"
 #include "ixgbe_common.h"
@@ -236,6 +237,7 @@ struct ixgbe_q_vector {
 	u8 tx_itr;
 	u8 rx_itr;
 	u32 eitr;
+	cpumask_var_t affinity_mask;
 };
 
 /* Helper macros to switch between ints/sec and what the register uses.
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index 1b1419c..8f84bb8 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -1083,6 +1083,16 @@ static void ixgbe_configure_msix(struct ixgbe_adapter *adapter)
 			q_vector->eitr = adapter->rx_eitr_param;
 
 		ixgbe_write_eitr(q_vector);
+
+		/*
+		 * Allocate the affinity_hint cpumask, assign the mask for
+		 * this vector, and register our affinity_hint for this irq.
+		 */
+		if (!zalloc_cpumask_var(&q_vector->affinity_mask, GFP_KERNEL))
+			return;
+		cpumask_set_cpu(v_idx, q_vector->affinity_mask);
+		irq_register_affinity_hint(adapter->msix_entries[v_idx].vector,
+		                           q_vector->affinity_mask);
 	}
 
 	if (adapter->hw.mac.type == ixgbe_mac_82598EB)
@@ -3218,7 +3228,7 @@ void ixgbe_down(struct ixgbe_adapter *adapter)
 	struct ixgbe_hw *hw = &adapter->hw;
 	u32 rxctrl;
 	u32 txdctl;
-	int i, j;
+	int i, j, num_q_vectors = adapter->num_msix_vectors - NON_Q_VECTORS;
 
 	/* signal that we are down to the interrupt handler */
 	set_bit(__IXGBE_DOWN, &adapter->state);
@@ -3251,6 +3261,15 @@ void ixgbe_down(struct ixgbe_adapter *adapter)
 
 	ixgbe_napi_disable_all(adapter);
 
+	for (i = 0; i < num_q_vectors; i++) {
+		struct ixgbe_q_vector *q_vector = adapter->q_vector[i];
+		/* clear the affinity_mask in the IRQ descriptor first */
+		irq_register_affinity_hint(adapter->msix_entries[i].vector,
+		                           NULL);
+		/* then release the CPU mask memory it pointed to */
+		free_cpumask_var(q_vector->affinity_mask);
+	}
+
 	clear_bit(__IXGBE_SFP_MODULE_NOT_FOUND, &adapter->state);
 	del_timer_sync(&adapter->sfp_timer);
 	del_timer_sync(&adapter->watchdog_timer);


^ permalink raw reply related

* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: Eric Dumazet @ 2010-04-29 12:45 UTC (permalink / raw)
  To: Changli Gao
  Cc: hadi, David Miller, therbert, shemminger, netdev,
	Eilon Greenstein, Brian Bloniarz
In-Reply-To: <u2l412e6f7f1004290512x7fcdb5c3w591c6446d676502@mail.gmail.com>

Le jeudi 29 avril 2010 à 20:12 +0800, Changli Gao a écrit :
> On Thu, Apr 29, 2010 at 7:35 PM, jamal <hadi@cyberus.ca> wrote:
> >
> > Same here - even in my worst case scenario 88.5% of 750Kpps > 600Kpps.
> > Attached is history results to make more sense of what i am saying:
> > we have net-next kernels from apr14, apr23, apr23 with changlis change,
> > apr28, apr28 with your change. What you'll see is non-rps (blue) gets
> > better and rps (Orange) gets better slowly then by apr28 it is worse.
> 
> Did the number of IPIs increase in the apr28 test? The final patch
> with Eric's change may introduce more IPIs. And I am wondering why
> 23rdcl-non-rps is better than before. Maybe it is the side effect of
> my patch: enlarge the netdev_max_backlog.
> 
> 

Changli, I wonder how you can cook "performance" patches without testing
them at all for real... This cannot be true ?

When the cpu doing the device softirq is flooded, it handles 300 packets
per net_rx_action() round (netdev_budget), so sends at most 6 IPIs per
300 packets, with or without my patch, with or without your patch as
well.

(At most because if remote cpus are flooded as well, they don't
napi_complete so no IPI needed at all)

(My patch had an effect only on normal load, ie one packet received in a
while... up to 50.000 pps I would say). And it also has a nice effect on
non RPS loads (mostly the more typical load for following years).
If a second packet comes 3us after the first one, and before 2nd CPU
handled it, we _can_ afford an extra IPI.

750.000/50 = 15.000 IPI per second.
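
The arithmetic behind that figure can be written out as a small helper.
This is only a sketch of the bound quoted above: the function name
max_ipis_per_sec() is made up for illustration, and the figure of 6 IPIs
per round is taken from this mail as-is.

```c
/*
 * Upper bound on IPIs sent per second by the CPU handling the device
 * softirq, assuming it processes `budget` packets per net_rx_action()
 * round and sends at most `ipis_per_round` IPIs in each round.
 */
unsigned long long max_ipis_per_sec(unsigned long long pps,
				    unsigned long long budget,
				    unsigned long long ipis_per_round)
{
	if (budget == 0)
		return 0;	/* avoid dividing by zero on a bogus budget */
	return pps * ipis_per_round / budget;
}
```

With 750.000 packets per second, a budget of 300 and 6 IPIs per round,
this gives 750000 * 6 / 300 = 15000, i.e. one IPI per 50 packets.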

Even with 200.000 IPI per second, 'perf top -C CPU_IPI_sender' shows
that sending IPI is very cheap (maybe ~1% of cpu cycles)

# Samples: 32033467127
#
# Overhead         Command      Shared Object  Symbol
# ........  ..............  .................  ......
#
    18.05%            init  [kernel.kallsyms]  [k] poll_idle
    10.91%            init  [kernel.kallsyms]  [k] bnx2x_rx_int
    10.42%            init  [kernel.kallsyms]  [k] eth_type_trans
     5.72%            init  [kernel.kallsyms]  [k] kmem_cache_alloc_node
     5.43%            init  [kernel.kallsyms]  [k] __memset
     5.20%            init  [kernel.kallsyms]  [k] get_rps_cpu
     4.82%            init  [kernel.kallsyms]  [k] __slab_alloc
     4.34%            init  [kernel.kallsyms]  [k] get_partial_node
     4.22%            init  [kernel.kallsyms]  [k] _raw_spin_lock
     3.41%            init  [kernel.kallsyms]  [k] __kmalloc_node_track_caller
     3.01%            init  [kernel.kallsyms]  [k] __alloc_skb
     2.22%            init  [kernel.kallsyms]  [k] enqueue_to_backlog
     2.10%            init  [kernel.kallsyms]  [k] vlan_gro_common
     1.34%            init  [kernel.kallsyms]  [k] swiotlb_map_page
     1.25%            init  [kernel.kallsyms]  [k] skb_put
     1.06%            init  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
     0.92%            init  [kernel.kallsyms]  [k] dev_gro_receive
     0.88%            init  [kernel.kallsyms]  [k] swiotlb_dma_mapping_error
     0.83%            init  [kernel.kallsyms]  [k] vlan_gro_receive
     0.83%            init  [kernel.kallsyms]  [k] __phys_addr
     0.83%            init  [kernel.kallsyms]  [k] __napi_complete
     0.83%            init  [kernel.kallsyms]  [k] default_send_IPI_mask_sequence_phys
     0.77%            init  [kernel.kallsyms]  [k] is_swiotlb_buffer
     0.76%            init  [kernel.kallsyms]  [k] __netdev_alloc_skb
     0.74%            init  [kernel.kallsyms]  [k] deactivate_slab
     0.73%            init  [kernel.kallsyms]  [k] netif_receive_skb
     0.72%            init  [kernel.kallsyms]  [k] unmap_single
     0.69%            init  [kernel.kallsyms]  [k] csd_lock
     0.63%            init  [kernel.kallsyms]  [k] bnx2x_poll
     0.61%            init  [kernel.kallsyms]  [k] bnx2x_msix_fp_int
     0.59%            init  [kernel.kallsyms]  [k] irq_entries_start
     0.59%            init  [kernel.kallsyms]  [k] swiotlb_sync_single
     0.54%            init  [kernel.kallsyms]  [k] get_slab
     0.46%            init  [kernel.kallsyms]  [k] napi_skb_finish




^ permalink raw reply

* Re: [PATCH linux-next v3 1/2] irq: Add CPU mask affinity hint
From: Thomas Gleixner @ 2010-04-30 20:24 UTC (permalink / raw)
  To: Peter P Waskiewicz Jr; +Cc: davem, arjan, netdev, linux-kernel
In-Reply-To: <20100430202343.4591.66240.stgit@ppwaskie-hc2.jf.intel.com>

Peter,

On Fri, 30 Apr 2010, Peter P Waskiewicz Jr wrote:
>  
> +extern int irq_register_affinity_hint(unsigned int irq,
> +                                      const struct cpumask *m);

One last nitpick. 

Can you please rename to irq_set_affinity_hint() ?

Otherwise, I'm happy with the outcome. Thanks for your patience!

Thanks,

	tglx

^ permalink raw reply

* Re: [PATCH linux-next v3 1/2] irq: Add CPU mask affinity hint
From: Peter P Waskiewicz Jr @ 2010-04-30 20:30 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: davem@davemloft.net, arjan@linux.jf.intel.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <alpine.LFD.2.00.1004302220270.2951@localhost.localdomain>

On Fri, 30 Apr 2010, Thomas Gleixner wrote:

> Peter,
>
> On Fri, 30 Apr 2010, Peter P Waskiewicz Jr wrote:
>>
>> +extern int irq_register_affinity_hint(unsigned int irq,
>> +                                      const struct cpumask *m);
>
> One last nitpick.
>
> Can you please rename to irq_set_affinity_hint() ?

Not a problem.  It's a better name anyways now that we've dropped the 
callback function.  Patch update incoming.

>
> Otherwise, I'm happy with the outcome. Thanks for your patience!

Thanks for your help getting this pulled together.  This is a big step for 
our performance tuning goals.

Cheers,
-PJ

^ permalink raw reply

* Re: [net-next-2.6 PATCH 2/2] add ndo_set_port_profile op support for enic dynamic vnics
From: Scott Feldman @ 2010-04-30 20:34 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: davem, netdev, chrisw, Jens Osterkamp
In-Reply-To: <201004291748.38702.arnd@arndb.de>

On 4/29/10 8:48 AM, "Arnd Bergmann" <arnd@arndb.de> wrote:

> I believe Chris is the one that was pushing most for having a single interface
> for both VDP/LLDPAD and enic.
> While I now understand your reasons for doing it in firmware and requiring the
> kernel interface in addition to the user interface, my doubts on whether VDP
> and your protocol should be part of the same interface are increasing.
> 
> While I'm convinced that you can make it work for both now, the alternative
> to split the two may turn out to be cleaner. We'd still be able to do
> either of the two in kernel or user space. Using iproute2 syntax to describe
> this again, it would mean an interface like
> 
>    ip iov set  port-profile DEVICE [ base BASE-DEVICE ] name PORT-PROFILE
>                                       [ host_uuid HOST_UUID ]
>                                       [ client_name CLIENT_NAME ]
>                                       [ client_uuid CLIENT_UUID ]
>    ip iov set  vsi { associate | pre-associate | pre-associate-rr } BASE-DEVICE
>                                       vsi MGR:VTID:VER
>                                       mac LLADDR [ vlan VID ]
>                                       client_uuid CLIENT_UUID
> 
>    ip iov del  port_profile DEVICE      [ base BASE-DEVICE ]
>    ip iov del  vsi          BASE-DEVICE [ mac LLADDR [ vlan VID ] ]
>                                         [ client_uuid CLIENT_UUID ]
> 
>    ip iov show port_profile DEVICE      [ base BASE-DEVICE ]
>    ip iov show vsi          BASE-DEVICE [ mac LLADDR [ vlan VID ] ]
>                                         [ client_uuid CLIENT_UUID ]
> 
> You would obviously only implement the kernel support for the port-profile
> stuff as callbacks, because no driver yet does VDP in the kernel, but we should
> have a common netlink header that defines both variants.
> 
> Chris, any opinion on this interface as opposed to the combined one?
> Either one should work, but splitting it seems cleaner to me.

I haven't seen Chris's response, but it seems vger was down for awhile, so
maybe it's coming.  Assuming we go for the split design, we're still talking
about using RTM_SETLINK/RTM_GETLINK/RTM_DELLINK for these netlink msgs?  Or
are you suggesting by your cmd syntax that we return to
RTM_SETIOV/RTM_GETIOV like in the first iovnl patch?  RTM_SET/GET/DELLINK is
probably the simpler, cleaner patch.

-scott


^ permalink raw reply

* Re: [PATCH] [RFC] C/R: inet4 and inet6 unicast routes (v2)
From: Daniel Lezcano @ 2010-04-30 20:35 UTC (permalink / raw)
  To: Dan Smith; +Cc: containers, Vlad Yasevich, netdev, David Miller
In-Reply-To: <1272646855-17327-1-git-send-email-danms@us.ibm.com>

Dan Smith wrote:
> This patch adds support for checkpointing and restoring route information.
> It keeps enough information to restore basic routes at the level of detail
> of /proc/net/route.  It uses RTNETLINK to extract the information during
> checkpoint and also to insert it back during restore.  This gives us a
> nice layer of isolation between us and the various "fib" implementations.
>
> Changes in v2:
>
> This version of the patch actually moves the current task into the
> desired network namespace temporarily, for the purposes of examining and
> restoring the route information.  This is instead of creating a
> cross-namespace socket to do the job, as was done in v1.
>
> This is just an RFC to see if this is an acceptable method.  For a final
> version, adding a helper to nsproxy.c would allow us to create a new
> nsproxy with the desired netns instead of creating one with
> copy_namespaces() just to kill it off and use the target one.
>
> I still think the previous method is cleaner, but this way may violate
> fewer namespace boundaries (I'm still undecided :)
>
> Signed-off-by: Dan Smith <danms@us.ibm.com>
> Cc: David Miller <davem@davemloft.net>
> Cc: Vlad Yasevich <vladislav.yasevich@hp.com>
> Cc: jamal <hadi@cyberus.ca>
> ---
Hi Dan,

Eric did a patchset (as Jamal mentioned) that lets a process
enter a specific namespace from userspace.

http://git.kernel.org/?p=linux/kernel/git/ebiederm/linux-2.6.33-nsfd-v5.git;a=commit;h=9c2f86a44d9ca93e78fd8e81a4e2a8c2a4cdb054

Is it possible to enter the namespace and dump/restore the routes with
NETLINK_ROUTE from userspace? Or is that not possible?

Thanks
  -- Daniel



^ permalink raw reply

* Re: [net-next-2.6 PATCH 2/2] add ndo_set_port_profile op support for enic dynamic vnics
From: Arnd Bergmann @ 2010-04-29 15:48 UTC (permalink / raw)
  To: Scott Feldman; +Cc: davem, netdev, chrisw, Jens Osterkamp
In-Reply-To: <C7FEE68A.2CBEF%scofeldm@cisco.com>

On Thursday 29 April 2010, Scott Feldman wrote:
> On 4/29/10 5:27 AM, "Arnd Bergmann" <arnd@arndb.de> wrote:
> 
> I don't believe those links are available at this time.
> 
> > Is it possible or planned to implement the same protocol in Linux so you
> > can do it with Cisco switches and cheap non-IOV NICs?
> 
> That seems very possible from a technical standpoint.  I don't think the
> port-profile netlink API we're specing out excludes that option.

Ok, good.

> >>    ip port_profile set DEVICE [ base DEVICE ] [ { pre_associate |
> >>                                                   pre_associate_rr } ]
> >>                               { name PORT-PROFILE | vsi MGR:VTID:VER }
> 
> BTW, I was meaning to ask: is there a way to roll the vsi tuple and the
> flags up into a single identifier, say a string like PORT-PROFILE?  I'm
> asking because it seems awkward from an admin's perspective to know how to
> construct a vsi tuple or to know what pre_associate_rr means. I have to
> admit I didn't fully grok what pre_associate_rr means myself.  Even if there
> was a simple local database to map named port-profiles to the underlying
> {vsi tuple, flags}, that would bring us closer to a more consistent user
> interface.  Is this possible?

I think that's technically possible but may not be helpful to make the
user interface easier. Some background on pre-associate:

The purpose of this is to assist guest migration. A single VSI (i.e. guest
network adapter) may only be connected to a single switch port at any
given time. The VSI is identified by its UUID and it has a unique
MAC address.

When migrating a guest to a new hypervisor, we need to ask the switch
to associate that VSI at the destination switch port (which may or may
not be on the same switch as the source port). This operation
may fail for a number of reasons and can take some time. Since we want
migration to always succeed and take as little time as possible, we
do a pre-associate-with-resource-reservation before the migration and
only start the actual guest migration if that completes successfully.

After a successful pre-associate-with-resource-reservation step, we
know that the actual associate step will be both fast and successful.
After it completes, the VSI is known to be on the destination
and all traffic goes there (replacing the gratuitous ARP method we do
today).

I don't think we'd ever do a pre-associate without the
resource-reservation, but the standard defines both. In theory,
we could do a pre-associate at every switch in the data center
in order to find out if it's possible to migrate there.

If you want to have more details, please look at the draft spec at
http://www.ieee802.org/1/files/public/docs2010/bg-joint-evb-0410v1.pdf
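
The ordering described above can be modelled as a small state machine.
This is only an illustrative sketch: the enum names and vsi_next() are
hypothetical and appear in neither the draft spec nor any patch here, and
it assumes the associate step is issued once the guest has been moved.

```c
/* Where the VSI currently stands in the migration sequence */
enum vsi_state {
	VSI_ON_SOURCE,		/* associated at the source switch port */
	VSI_RESERVED_ON_DEST,	/* pre-associate-rr succeeded at the destination */
	VSI_GUEST_MOVED,	/* guest migrated, associate still pending */
	VSI_ON_DEST		/* associated at the destination, traffic follows */
};

enum vsi_event {
	EV_PRE_ASSOCIATE_RR_OK,
	EV_PRE_ASSOCIATE_RR_FAILED,
	EV_GUEST_MIGRATED,
	EV_ASSOCIATE_OK
};

/*
 * Return the state reached from `s` on event `e`. Events that are not
 * expected in a given state leave the state unchanged, so a failed
 * reservation simply keeps the guest running on the source.
 */
enum vsi_state vsi_next(enum vsi_state s, enum vsi_event e)
{
	switch (s) {
	case VSI_ON_SOURCE:
		if (e == EV_PRE_ASSOCIATE_RR_OK)
			return VSI_RESERVED_ON_DEST;
		break;
	case VSI_RESERVED_ON_DEST:
		/* migration may only start once resources are reserved */
		if (e == EV_GUEST_MIGRATED)
			return VSI_GUEST_MOVED;
		break;
	case VSI_GUEST_MOVED:
		if (e == EV_ASSOCIATE_OK)
			return VSI_ON_DEST;
		break;
	case VSI_ON_DEST:
		break;
	}
	return s;
}
```

The point of the model is that the guest migration cannot be reached
without a successful reservation first, which is what makes the final
associate step fast and predictable.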

> >> 2. Future enic for pass-thru case where base != target.  We get:
> >> 
> >>     ip port_profile set eth1 base eth0 name joes-garage ...
> >> 
> >> And
> >> 
> >>     eth0:ndo_set_port_profile(eth1, ...)
> > 
> > Is eth1 the static device and eth0 the dynamic device in this scenario
> > or the other way round?
> 
> eth0 is the static and eth1 is the dynamic.  So eth0 is the base device.
> (The PF in SR-IOV parlance).

ok.

> > Wouldn't you still require access to both devices from the host root
> > network namespace here or do you just ignore the identifier for the
> > dynamic device here?
> 
> The dynamic device is the one to apply the port-profile to (well, I should
> say to apply to the dynamic device's switch port).  So we need the dynamic
> device identified.

What I mean is: how do you identify it when it belongs to someone else?
Do we always have a proxy netdev for an SR-IOV VF that is assigned to
the guest?

For the separate network namespace case, I guess we could still require
doing it before assigning the device to the guest namespace, but it's
still not ideal.

> >> Does this work?  I want to get agreement before coding up patch attempt #4.
> > 
> > Seems ok for all I can see at this point, other than the complexity
> > that results from doing two network protocols through a single netlink
> > protocol. Maybe Jens and Chris can comment some more on this.
> 
> Ok, thanks Arnd.  I'll start coding this up now, hedging that the design is
> set before hearing back from Jens/Chris.

I believe Chris is the one that was pushing most for having a single interface
for both VDP/LLDPAD and enic.
While I now understand your reasons for doing it in firmware and requiring the
kernel interface in addition to the user interface, my doubts on whether VDP
and your protocol should be part of the same interface are increasing.

While I'm convinced that you can make it work for both now, the alternative
to split the two may turn out to be cleaner. We'd still be able to do
either of the two in kernel or user space. Using iproute2 syntax to describe
this again, it would mean an interface like

   ip iov set  port-profile DEVICE [ base BASE-DEVICE ] name PORT-PROFILE
	                              [ host_uuid HOST_UUID ]
        	                      [ client_name CLIENT_NAME ]
                                      [ client_uuid CLIENT_UUID ]
   ip iov set  vsi { associate | pre-associate | pre-associate-rr } BASE-DEVICE
                                      vsi MGR:VTID:VER
                                      mac LLADDR [ vlan VID ]
                                      client_uuid CLIENT_UUID

   ip iov del  port_profile DEVICE      [ base BASE-DEVICE ]
   ip iov del  vsi          BASE-DEVICE [ mac LLADDR [ vlan VID ] ]
				        [ client_uuid CLIENT_UUID ]

   ip iov show port_profile DEVICE      [ base BASE-DEVICE ]
   ip iov show vsi          BASE-DEVICE [ mac LLADDR [ vlan VID ] ]
					[ client_uuid CLIENT_UUID ]

You would obviously only implement the kernel support for the port-profile
stuff as callbacks, because no driver yet does VDP in the kernel, but we should
have a common netlink header that defines both variants.

Chris, any opinion on this interface as opposed to the combined one?
Either one should work, but splitting it seems cleaner to me.

	Arnd

^ permalink raw reply

* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: Eric Dumazet @ 2010-04-30 20:40 UTC (permalink / raw)
  To: hadi
  Cc: Changli Gao, David Miller, therbert, shemminger, netdev,
	Eilon Greenstein, Brian Bloniarz
In-Reply-To: <1272655814.3879.8.camel@bigi>

Le vendredi 30 avril 2010 à 15:30 -0400, jamal a écrit :
> Eric!
> 
> I managed to mod your program to look conceptually similar to mine
> and i reproduced the results with same test kernel from yesterday. 
> So it is likely the issue is in using epoll vs not using any async as
> in your case.
> Results attached as well as modified program.
> 
> Note: the key things to remember:
> rps with this program gets worse over time and different net-next
> kernels since Apr14 (look at graph i supplied). Sorry, I am really
> too busy to dig any further.
> 
> cheers,
> jamal
> 

I am lost.

I used your program, and with RPS off, I can get at most 220.000 pps
with my "old" hardware. I don't understand how you can reach 700.000 pps
with RPS off. Or is it with your Nehalem?




^ permalink raw reply

* Re: [PATCH 0/3] [RFC] ptp: IEEE 1588 clock support
From: Richard Cochran @ 2010-04-29 15:34 UTC (permalink / raw)
  To: Wolfgang Grandegger; +Cc: netdev
In-Reply-To: <4BD97573.5050101@grandegger.com>

On Thu, Apr 29, 2010 at 02:02:59PM +0200, Wolfgang Grandegger wrote:
> 
> I realized two other netdev drivers already supporting PTP timestamping:
> igb and bfin_mac. From the PTP developer point of view, the interface
> looks rather complete to me and it works fine on my MPC8313 setup.

Do you know whether these two also have PTP clocks? If so, is the API
that I suggested going to work for controlling those clocks, too?

> The only thing I stumbled over was that PTP clock registration
> failed when PTP support is statically linked into the kernel.

Okay, will look into that...

Thanks,
Richard


^ permalink raw reply

* Re: OFT - reserving CPU's for networking
From: Andi Kleen @ 2010-04-30 21:01 UTC (permalink / raw)
  To: David Miller; +Cc: tglx, shemminger, eric.dumazet, netdev, peterz
In-Reply-To: <20100430.115715.216750975.davem@davemloft.net>

> Then we can do cool tricks like having the cpu spin on a mwait() on the
> network device's status descriptor in memory.

When you specify a deep C state in that mwait then it will also have the long 
wakeup latency in the idle case.  When you don't then you just killed higher
Turbo mode on that socket and give away a lot of performance on the other
cores.

So you have to solve the idle state governor issue anyways, and then
you likely don't need it anymore.

Besides it seems to me that dispatching is something the NIC should
just do directly. "RPS only CPU" would be essentially just an 
interrupt mitigation/flow redirection scheme that a lot of NICs
do anyways.

> In any event I agree with you, it's a cool idea at best, and likely
> not really practical.

s/cool//

-Andi

^ permalink raw reply
