Netdev List


* Re: [PATCH] macvtap: add ioctl to modify vnet header size
From: Arnd Bergmann @ 2010-04-29 14:40 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David S. Miller, Sridhar Samudrala, Eric Dumazet, netdev,
	linux-kernel, David Stevens
In-Reply-To: <20100429135158.GA26303@redhat.com>

On Thursday 29 April 2010, Michael S. Tsirkin wrote:
> This adds TUNSETVNETHDRSZ/TUNGETVNETHDRSZ support
> to macvtap.

Looks good, thanks Michael!

> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

Acked-by: Arnd Bergmann <arnd@arndb.de>


* Re: OFT - reserving CPU's for networking
From: Thomas Gleixner @ 2010-04-30 19:58 UTC (permalink / raw)
  To: David Miller; +Cc: shemminger, eric.dumazet, ak, netdev, andi, peterz
In-Reply-To: <20100430.115715.216750975.davem@davemloft.net>

Dave,

On Fri, 30 Apr 2010, David Miller wrote:

> From: Thomas Gleixner <tglx@linutronix.de>
> Date: Thu, 29 Apr 2010 21:19:36 +0200 (CEST)
> 
> > Aside of that I seriously doubt that you can do networking w/o time
> > and timers.
> 
> You're right that we need timestamps and the like.
> 
> But only if we actually process the packets on these restricted cpus :-)
> 
> If we use RPS and farm out all packets to other cpus, ie. just doing
> the driver work and the remote cpu dispatch on these "offline" cpus,
> it is doable.
> 
> Then we can do cool tricks like having the cpu spin on a mwait() on the
> network device's status descriptor in memory.
> 
> In any event I agree with you, it's a cool idea at best, and likely
> not really practical.

Well, it might be worth experimenting with that once we get the basic
infrastructure in place to "isolate" cores under full kernel control.

It's not too hard to solve the problems, but it seems nobody has a
free time slot to tackle them.

Thanks

	tglx


* Re: pull request: wireless-2.6 2010-04-30
From: David Miller @ 2010-04-30 19:54 UTC (permalink / raw)
  To: linville; +Cc: linux-wireless, netdev, linux-kernel
In-Reply-To: <20100430180840.GD4120@tuxdriver.com>

From: "John W. Linville" <linville@tuxdriver.com>
Date: Fri, 30 Apr 2010 14:08:40 -0400

> One more for 2.6.34...it avoids some DMA mapping-related failures.

Pulled, thanks John.


* Re: [net-2.6 PATCH] e1000e: enable/disable ASPM L0s and L1 and ERT according to hardware errata
From: David Miller @ 2010-04-30 19:51 UTC (permalink / raw)
  To: bruce.w.allan; +Cc: anton, jeffrey.t.kirsher, netdev, gospo, mjg
In-Reply-To: <8DD2590731AB5D4C9DBF71A877482A90622DE3C2@orsmsx509.amr.corp.intel.com>

From: "Allan, Bruce W" <bruce.w.allan@intel.com>
Date: Fri, 30 Apr 2010 12:27:02 -0700

> In that case, I agree with Anton's patch.
> 
> Reviewed-by: Bruce Allan <bruce.w.allan@intel.com>

Great, applied, thanks guys.


* [RFC] BPF program access to transport header
From: Paul LeoNerd Evans @ 2010-04-30 19:39 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 1816 bytes --]

Via the SKF_NET_OFF extension area, a BPF program has nice easy access
to the network header, wherever it might happen to be in the packet.
This makes it simpler to write filters on e.g. IPv4 headers, knowing
that fields will always be at simple offsets relative to SKF_NET_OFF.
Using the data at WORD[SKF_AD_PROTO] it's easy also to find out what
network protocol this is.

I would like to provide something similar for the transport header. Without
doing so, it is very hard to parse e.g. UDP or TCP headers that may be
contained within IPv6, because of the linked-list way IPv6
headers chain on to each other. BPF doesn't provide a while() loop or
any kind of backward jump, meaning the filter program has to be
loop-unrolled a static number of times. This quickly leads to very large
programs.
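
To illustrate why the chain is awkward for a loop-free filter, here is a
minimal user-space sketch in C of the walk a filter would have to unroll.
The function name ipv6_find_transport is made up for this note, and the
sketch only handles the hop-by-hop, routing, destination-options and
fragment headers; anything else (including AH and ESP) is simply reported
as the transport protocol. It assumes pkt points at the fixed IPv6 header.

```c
#include <stddef.h>
#include <stdint.h>
#include <netinet/in.h>		/* IPPROTO_* constants */

/*
 * Walk the IPv6 extension-header chain in a raw packet buffer.
 * On success, returns the offset of the first header that is not one of
 * the extension headers handled below and stores its protocol number in
 * *proto. Returns -1 if the buffer is too short.
 */
static int ipv6_find_transport(const uint8_t *pkt, size_t len, uint8_t *proto)
{
	size_t off = 40;		/* size of the fixed IPv6 header */
	uint8_t next;

	if (len < off)
		return -1;
	next = pkt[6];			/* Next Header field */

	for (;;) {			/* the backward jump classic BPF lacks */
		switch (next) {
		case IPPROTO_HOPOPTS:
		case IPPROTO_ROUTING:
		case IPPROTO_DSTOPTS:
			if (len < off + 2)
				return -1;
			next = pkt[off];
			/* length byte counts 8-octet units beyond the first 8 */
			off += ((size_t)pkt[off + 1] + 1) * 8;
			break;
		case IPPROTO_FRAGMENT:
			if (len < off + 8)
				return -1;
			next = pkt[off];
			off += 8;	/* fragment header has a fixed size */
			break;
		default:
			if (off > len)
				return -1;
			*proto = next;
			return (int)off;
		}
	}
}
```

Each pass through the loop corresponds to one unrolled copy of the same
test-and-skip sequence in a BPF program, which is where the program size
comes from.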

I foresee a number of issues with trying to provide this:

 * How to provide the protocol number (e.g. 6 for TCP, 1 for ICMP) to
   the BPF program

 * How to obtain the transport offset - AIUI, the value behind
   skb_transport_offset() won't actually be set yet by the time the
   filter program runs.

 * What to do if the underlying protocol doesn't support a transport
   layer above it - e.g. ARP.

Ideally, this would make it easy to filter, say, TCP destination port
80, by doing the following:

  LD WORD[SKF_AD_PROTO]
  JEQ ETHERTYPE_IPV4, 1, 0
  JEQ ETHERTYPE_IPV6, 0, fail

  LD WORD[SKF_AD_TRANSPROTO]
  JEQ IPPROTO_TCP, 0, fail

  LD HALF[SKF_TRANS_OFF+2]
  JEQ 80, 0, fail

  LD len
  RET A

fail:
  RET 0

In this short simple BPF program we've avoided all the issues involved
with trying to parse IPv6 headers.

Can we make this work?

-- 
Paul "LeoNerd" Evans

leonerd@leonerd.org.uk
ICQ# 4135350       |  Registered Linux# 179460
http://www.leonerd.org.uk/

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 190 bytes --]


* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: jamal @ 2010-04-30 19:30 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Changli Gao, David Miller, therbert, shemminger, netdev,
	Eilon Greenstein, Brian Bloniarz
In-Reply-To: <1272573383.3969.8.camel@bigi>

[-- Attachment #1: Type: text/plain, Size: 1322 bytes --]

Eric!

I managed to mod your program to look conceptually similar to mine
and i reproduced the results with same test kernel from yesterday. 
So it is likely the issue is in using epoll vs not using any async as
in your case.
Results attached as well as modified program.

Note, the key thing to remember:
rps with this program has been getting worse across net-next
kernels since Apr 14 (look at the graph I supplied). Sorry, I am really
too busy to dig any further.

cheers,
jamal



On Thu, 2010-04-29 at 16:36 -0400, jamal wrote:
> On Thu, 2010-04-29 at 09:56 -0400, jamal wrote:
> 
> > 
> > I will try your program instead so we can reduce the variables
> 
> Results attached.
> With your app rps does a hell lot better and non-rps worse ;->
> With my proggie, non-rps does much better than yours and rps does
> a lot worse for same setup. I see the scheduler kicking in quite a bit in
> non-rps for you...
> 
> The main difference between us as i see it is:
> a) i use epoll - actually linked to libevent (1.0.something)
> b) I fork processes and you use pthreads.
> 
> I dont have time to chase it today, but 1) I am either going to change
> yours to use libevent or make mine get rid of it then 2) move towards
> pthreads or have yours fork..
> then observe if that makes any difference..
> 
> 
> cheers,
> jamal

[-- Attachment #2: apr30-ericmod --]
[-- Type: text/plain, Size: 8919 bytes --]


First a few runs with Eric's code + epoll/libevent

-------------------------------------------------------------------------------
   PerfTop:    4009 irqs/sec  kernel:83.4% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ____________________

             2097.00  8.6% sky2_poll                   [sky2]              
             1742.00  7.2% _raw_spin_lock_irqsave      [kernel]            
              831.00  3.4% system_call                 [kernel]            
              654.00  2.7% copy_user_generic_string    [kernel]            
              654.00  2.7% datagram_poll               [kernel]            
              647.00  2.7% fget                        [kernel]            
              623.00  2.6% _raw_spin_unlock_irqrestore [kernel]            
              547.00  2.3% _raw_spin_lock_bh           [kernel]            
              506.00  2.1% sys_epoll_ctl               [kernel]            
              475.00  2.0% kmem_cache_free             [kernel]            
              466.00  1.9% schedule                    [kernel]            
              436.00  1.8% vread_tsc                   [kernel].vsyscall_fn
              417.00  1.7% fput                        [kernel]            
              415.00  1.7% sys_epoll_wait              [kernel]            
              402.00  1.7% _raw_spin_lock              [kernel]            


-------------------------------------------------------------------------------
   PerfTop:     616 irqs/sec  kernel:98.7% [1000Hz cycles],  (all, cpu: 0)
-------------------------------------------------------------------------------

             samples  pcnt function               DSO
             _______ _____ ______________________ ________

             2534.00 28.6% sky2_poll              [sky2]  
              503.00  5.7% ip_route_input         [kernel]
              438.00  4.9% _raw_spin_lock_irqsave [kernel]
              418.00  4.7% __udp4_lib_lookup      [kernel]
              378.00  4.3% __alloc_skb            [kernel]
              364.00  4.1% ip_rcv                 [kernel]
              323.00  3.6% _raw_spin_lock         [kernel]
              315.00  3.5% sock_queue_rcv_skb     [kernel]
              284.00  3.2% __netif_receive_skb    [kernel]
              281.00  3.2% __udp4_lib_rcv         [kernel]
              266.00  3.0% __wake_up_common       [kernel]
              238.00  2.7% sock_def_readable      [kernel]
              181.00  2.0% __kmalloc              [kernel]
              163.00  1.8% kmem_cache_alloc       [kernel]
              150.00  1.7% ep_poll_callback       [kernel]


-------------------------------------------------------------------------------
   PerfTop:     854 irqs/sec  kernel:80.2% [1000Hz cycles],  (all, cpu: 2)
-------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ____________________

              341.00  8.0% _raw_spin_lock_irqsave      [kernel]            
              235.00  5.5% system_call                 [kernel]            
              174.00  4.1% datagram_poll               [kernel]            
              174.00  4.1% fget                        [kernel]            
              173.00  4.1% copy_user_generic_string    [kernel]            
              135.00  3.2% _raw_spin_unlock_irqrestore [kernel]            
              125.00  2.9% _raw_spin_lock_bh           [kernel]            
              122.00  2.9% schedule                    [kernel]            
              113.00  2.6% sys_epoll_ctl               [kernel]            
              113.00  2.6% kmem_cache_free             [kernel]            
              108.00  2.5% vread_tsc                   [kernel].vsyscall_fn
              105.00  2.5% sys_epoll_wait              [kernel]            
              102.00  2.4% udp_recvmsg                 [kernel]            
               95.00  2.2% mutex_lock                  [kernel]            

Average 97.55% of 10M packets at 750Kpps

Turn on rps mask ee and irq affinity to cpu0
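
As a worked example of the mask arithmetic: 0xee is 11101110 in binary,
so on this 8-CPU box it selects CPUs 1, 2, 3, 5, 6 and 7 and leaves out
cpu0 (which takes the interrupts) and cpu4. The C sketch below does the
same expansion; the helper name mask_to_cpus is made up for illustration,
and it is an assumption that the mask was written to the per-queue
rps_cpus file in sysfs, since the note does not say so explicitly.

```c
#include <stddef.h>

/*
 * Expand a CPU bitmask, such as the hex value given for RPS, into a
 * list of CPU numbers. Stores at most max entries in cpus and returns
 * the number of bits set in the mask.
 */
static size_t mask_to_cpus(unsigned long mask, int *cpus, size_t max)
{
	size_t n = 0;
	int cpu;

	for (cpu = 0; mask; cpu++, mask >>= 1) {
		if (!(mask & 1))
			continue;
		if (n < max)
			cpus[n] = cpu;
		n++;
	}
	return n;
}
```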

-------------------------------------------------------------------------------
   PerfTop:    3885 irqs/sec  kernel:83.6% [1000Hz cycles],  (all, 8 CPUs)
-------------------------------------------------------------------------------

             samples  pcnt function                       DSO
             _______ _____ ______________________________ ________

             2945.00 16.7% sky2_poll                      [sky2]  
              653.00  3.7% _raw_spin_lock_irqsave         [kernel]
              460.00  2.6% system_call                    [kernel]
              420.00  2.4% _raw_spin_unlock_irqrestore    [kernel]
              414.00  2.3% sky2_intr                      [sky2]  
              392.00  2.2% fget                           [kernel]
              360.00  2.0% ip_rcv                         [kernel]
              324.00  1.8% sys_epoll_ctl                  [kernel]
              323.00  1.8% __netif_receive_skb            [kernel]
              310.00  1.8% schedule                       [kernel]
              292.00  1.7% ip_route_input                 [kernel]
              292.00  1.7% _raw_spin_lock                 [kernel]
              291.00  1.7% copy_user_generic_string       [kernel]
              284.00  1.6% kmem_cache_free                [kernel]
              262.00  1.5% call_function_single_interrupt [kernel]

-------------------------------------------------------------------------------
   PerfTop:    1000 irqs/sec  kernel:98.1% [1000Hz cycles],  (all, cpu: 0)
-------------------------------------------------------------------------------

             samples  pcnt function                            DSO
             _______ _____ ___________________________________ ________

             4170.00 61.9% sky2_poll                           [sky2]  
              723.00 10.7% sky2_intr                           [sky2]  
              159.00  2.4% __alloc_skb                         [kernel]
              140.00  2.1% get_rps_cpu                         [kernel]
              106.00  1.6% __kmalloc                           [kernel]
               95.00  1.4% enqueue_to_backlog                  [kernel]
               86.00  1.3% kmem_cache_alloc                    [kernel]
               85.00  1.3% irq_entries_start                   [kernel]
               85.00  1.3% _raw_spin_lock_irqsave              [kernel]
               82.00  1.2% _raw_spin_lock                      [kernel]
               66.00  1.0% swiotlb_sync_single                 [kernel]
               58.00  0.9% sky2_remove                         [sky2]  
               49.00  0.7% default_send_IPI_mask_sequence_phys [kernel]
               47.00  0.7% sky2_rx_submit                      [sky2]  
               36.00  0.5% _raw_spin_unlock_irqrestore         [kernel]

-------------------------------------------------------------------------------
   PerfTop:     344 irqs/sec  kernel:84.3% [1000Hz cycles],  (all, cpu: 2)
-------------------------------------------------------------------------------

             samples  pcnt function                       DSO
             _______ _____ ______________________________ ____________________

              114.00  5.2% _raw_spin_lock_irqsave         [kernel]            
               79.00  3.6% fget                           [kernel]            
               78.00  3.6% ip_rcv                         [kernel]            
               78.00  3.6% system_call                    [kernel]            
               75.00  3.4% _raw_spin_unlock_irqrestore    [kernel]            
               67.00  3.1% sys_epoll_ctl                  [kernel]            
               65.00  3.0% schedule                       [kernel]            
               61.00  2.8% ip_route_input                 [kernel]            
               48.00  2.2% vread_tsc                      [kernel].vsyscall_fn
               48.00  2.2% call_function_single_interrupt [kernel]            
               46.00  2.1% kmem_cache_free                [kernel]            
               45.00  2.1% __netif_receive_skb            [kernel]            
               41.00  1.9% process_recv                   snkudp              
               40.00  1.8% kfree                          [kernel]            
               39.00  1.8% _raw_spin_lock                 [kernel]            

92.97% of 10M packets at 750Kpps


Ok, so this is exactly what i saw with my app. non-rps is better.
To summarize: It used to be the opposite on net-next before around
Apr14. rps has gotten worse.

[-- Attachment #3: udpsnkfrk.c --]
[-- Type: text/x-csrc, Size: 3650 bytes --]

/*
 *  Usage: udpsink [ -p baseport] nbports
*/
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <stdio.h>
#include <errno.h>
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
#include <pthread.h>
#include <event.h>

struct worker_data {
	struct event *snk_ev;
	struct event_base *base;
	struct timeval t;
	unsigned long pack_count;
	unsigned long bytes_count;
	unsigned long tout;
	int fd;			/* move to avoid hole on 64-bit */
	int pad1;		/*64B - let Eric figure the math;-> */
	//unsigned long _padd[16 - 3]; /* alignment */ 
};

void usage(int code)
{
	fprintf(stderr, "Usage: udpsink [-p baseport] nbports\n");
	exit(code);
}

void process_recv(int fd, short ev, void *arg)
{
	char buffer[4096];
	struct sockaddr_in addr;
	socklen_t len = sizeof(addr);
	struct worker_data *wdata = (struct worker_data *)arg;
	int lu = 0;

	if ((event_add(wdata->snk_ev, &wdata->t)) < 0) {
		perror("cb event_add");
		return;
	}

	if (ev == EV_TIMEOUT) {
		wdata->tout++;
	} else {
		lu = recvfrom(wdata->fd, buffer, sizeof(buffer), 0,
			      (struct sockaddr *)&addr, &len);
		if (lu > 0) {
			wdata->pack_count++;
			wdata->bytes_count += lu;
		}
	}
}

int prep_thread(struct worker_data *wdata)
{
	wdata->t.tv_sec = 1;
	wdata->t.tv_usec = random() % 50000L;

	wdata->base = event_init();
	event_set(wdata->snk_ev, wdata->fd, EV_READ, process_recv, wdata);
	event_base_set(wdata->base, wdata->snk_ev);
	if ((event_add(wdata->snk_ev, &wdata->t)) < 0) {
		perror("event_add");
		return -1;
	}
	return 0;
}

void *worker_func(void *arg)
{
	struct worker_data *wdata = (struct worker_data *)arg;

	return (void *)(long)event_base_loop(wdata->base, 0);
}

int main(int argc, char *argv[])
{
	int c;
	int baseport = 4000;
	int nbthreads;
	struct worker_data *wdata;
	unsigned long ototal = 0;
	int concurrent = 0;
	int verbose = 0;
	int i;
	while ((c = getopt(argc, argv, "cvp:")) != -1) {
		if (c == 'p')
			baseport = atoi(optarg);
		else if (c == 'c')
			concurrent = 1;
		else if (c == 'v')
			verbose++;
		else
			usage(1);
	}
	if (optind == argc)
		usage(1);
	nbthreads = atoi(argv[optind]);
	wdata = calloc(sizeof(struct worker_data), nbthreads);
	if (!wdata) {
		perror("calloc");
		return 1;
	}

	for (i = 0; i < nbthreads; i++) {
		struct sockaddr_in addr;
		pthread_t tid;

		/* Each worker needs its own event, even when sharing worker 0's socket. */
		wdata[i].snk_ev = calloc(1, sizeof(struct event));
		if (!wdata[i].snk_ev)
			return 1;

		if (i && concurrent) {
			wdata[i].fd = wdata[0].fd;
		} else {

			wdata[i].fd = socket(PF_INET, SOCK_DGRAM, 0);
			if (wdata[i].fd == -1) {
				free(wdata[i].snk_ev);
				perror("socket");
				return 1;
			}
			memset(&addr, 0, sizeof(addr));
			addr.sin_family = AF_INET;
//                      addr.sin_addr.s_addr = inet_addr(argv[optind]);
			addr.sin_port = htons(baseport + i);
			if (bind
			    (wdata[i].fd, (struct sockaddr *)&addr,
			     sizeof(addr)) < 0) {
				free(wdata[i].snk_ev);
				perror("bind");
				return 1;
			}
//                      fcntl(wdata[i].fd, F_SETFL, O_NDELAY);
		}
		if (prep_thread(wdata + i)) {
			printf("failed to allocate thread %d, exit\n", i);
			exit(1);
		}
		pthread_create(&tid, NULL, worker_func, wdata + i);
	}

	for (;;) {
		unsigned long total;
		long delta;

		sleep(1);
		total = 0;
		for (i = 0; i < nbthreads; i++) {
			total += wdata[i].pack_count;
		}
		delta = total - ototal;
		if (delta) {
			printf("%ld pps (%lu", delta, total);
			if (verbose) {
				for (i = 0; i < nbthreads; i++) {
					if (wdata[i].pack_count)
						printf(" %d:%lu", i,
						       wdata[i].pack_count);
				}
			}
			printf(")\n");
		}
		ototal = total;
	}
}


* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: Eric Dumazet @ 2010-04-29  4:09 UTC (permalink / raw)
  To: hadi
  Cc: David Miller, xiaosuo, therbert, shemminger, netdev,
	Eilon Greenstein, Brian Bloniarz
In-Reply-To: <1272498293.4258.121.camel@bigi>

On Wednesday 28 April 2010 at 19:44 -0400, jamal wrote:
> On Wed, 2010-04-28 at 16:06 +0200, Eric Dumazet wrote:
> 
> > Here it is ;)
> 
> Sorry - things got a little hectic with TheMan.
> 
> I am afraid i dont have good news.
> Actually, I should say i dont have good news in regards to rps.
> For my sample app, two things seem to be happening:
> a) The overall performance has gotten better for both rps
> and non-rps.
> b) non-rps is now performing relatively better
> 
> This is just what i see in net-next not related to your patch.
> It seems the kernels i tested prior to April 23 showed rps better.
> The one i tested on Apr23 showed rps being about the same as non-rps.
> As i stated in my last result posting, I thought i didnt test properly
> but i did again today and saw the same thing. And now non-rps is
> _consistently_ better.
> So some regression is going on...
> 
> Your patch has improved the performance of rps relative to what is in
> net-next very slightly; but it has also improved the performance of
> non-rps;->
> My traces look different for the app cpu than yours - likely because of
> the apps being different.
> 
> At the moment i dont have time to dig deeper into code, but i could
> test as cycles show up.
> 
> I am attaching the profile traces and results.
> 
> cheers,
> jamal

Hi Jamal

I don't see in your results the number of pps, the number of udp ports,
or the number of flows.

In my latest results, I can handle more pps than before, regardless of
rps being on or off, with various numbers of udp ports (one user
thread per port) and numbers of flows (many src addrs, so that rps spreads
packets over many cpus).

If/when contention windows are smaller, a cpu can run uncontended and can
consume more cycles to process more frames?

With a not yet published patch, I can even reach 600,000 pps in DDoS
situations, instead of 400,000.

Thanks !




* RE: [net-2.6 PATCH] e1000e: enable/disable ASPM L0s and L1 and ERT according to hardware errata
From: Allan, Bruce W @ 2010-04-30 19:27 UTC (permalink / raw)
  To: David Miller
  Cc: anton@samba.org, Kirsher, Jeffrey T, netdev@vger.kernel.org,
	gospo@redhat.com, mjg@redhat.com
In-Reply-To: <20100429.120416.184828647.davem@davemloft.net>

On Thursday, April 29, 2010 12:04 PM, David Miller wrote:
> From: "Allan, Bruce W" <bruce.w.allan@intel.com>
> Date: Thu, 29 Apr 2010 10:19:56 -0700
> 
>> Your patch is probably the correct thing to do but I'm not all that
>> familiar with the ppc64 architecture.  Would you please provide the
>> output of 'lspci -t' and 'lspci -vvv -xxx'.
> 
> You're not guaranteed for there to be a pci_dev backing the top-level
> host controller, at the very least.  Some platforms don't even
> implement the PCI config space for the host controller, whilst on
> others access to them is protected by the hypervisor.
> 
> So you can't go poking around the PCI host controller registers
> unconditionally.
> 
> The same OOPS probably would happen on Sparc64 in some configurations
> too.  Although all of my PCI-E slots do have PCI-E express switch port
> nodes, so maybe it wouldn't trigger here.

In that case, I agree with Anton's patch.

Reviewed-by: Bruce Allan <bruce.w.allan@intel.com>

Thanks,
Bruce.


* Re: [PATCH linux-next v2 1/2] irq: Add CPU mask affinity hint
From: Peter P Waskiewicz Jr @ 2010-04-30 19:13 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: davem@davemloft.net, arjan@linux.jf.intel.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <alpine.LFD.2.00.1004302038140.2951@localhost.localdomain>

On Fri, 30 Apr 2010, Thomas Gleixner wrote:

> On Fri, 30 Apr 2010, Peter P Waskiewicz Jr wrote:
>> On Fri, 30 Apr 2010, Thomas Gleixner wrote:
>>>> +extern int irq_register_affinity_hint(unsigned int irq,
>>>> +                                      const struct cpumask *m);
>>>
>>> I think we can do with a single function irq_set_affinity_hint() and
>>> let the caller set the pointer to NULL.
>>
>> Ok, I've been running into some issues.  If CONFIG_CPUMASK_OFFSTACK is not
>> set, then cpumask_var_t structs are single-element arrays that cannot be
>> NULL'd out.  I'm pretty sure I need to keep the unregister part of the API.
>> Thoughts?
>
> extern int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m);
>
> So why should calling irq_set_affinity_hint(irqnr, NULL) not work ?

What was that you said about coffee and brain cells?  :-)

>
>> I just looked at the original show_affinity function, and it does not grab
>> desc->lock before copying mask out of desc.  Should I follow that model, or
>> should I fix that function to honor desc->lock?
>
> desc->affinity can only race against something changing the affinity
> bits, so that just might return some random data.
>
> In the hint case the irq could be shut down and the affinity hint
> could be freed while you are accessing it. Not a good idea :)

Good point.

Latest spin coming shortly.  Thanks for the quick feedback!

-PJ


* Re: [PATCH linux-next v2 1/2] irq: Add CPU mask affinity hint
From: Thomas Gleixner @ 2010-04-30 19:04 UTC (permalink / raw)
  To: Peter P Waskiewicz Jr
  Cc: davem@davemloft.net, arjan@linux.jf.intel.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <Pine.WNT.4.64.1004301059230.6264@PPWASKIE-MOBL2.amr.corp.intel.com>

On Fri, 30 Apr 2010, Peter P Waskiewicz Jr wrote:
> On Fri, 30 Apr 2010, Thomas Gleixner wrote:
> > > +extern int irq_register_affinity_hint(unsigned int irq,
> > > +                                      const struct cpumask *m);
> > 
> > > I think we can do with a single function irq_set_affinity_hint() and
> > let the caller set the pointer to NULL.
> 
> Ok, I've been running into some issues.  If CONFIG_CPUMASK_OFFSTACK is not
> set, then cpumask_var_t structs are single-element arrays that cannot be
> NULL'd out.  I'm pretty sure I need to keep the unregister part of the API.
> Thoughts?

extern int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m);

So why should calling irq_set_affinity_hint(irqnr, NULL) not work ?
 
> I just looked at the original show_affinity function, and it does not grab
> desc->lock before copying mask out of desc.  Should I follow that model, or
> should I fix that function to honor desc->lock?

desc->affinity can only race against something changing the affinity
bits, so that just might return some random data.

In the hint case the irq could be shut down and the affinity hint
could be freed while you are accessing it. Not a good idea :)
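
The two points above can be sketched in user-space C. Everything here is
a stand-in made up for illustration: struct irq_desc_model, the one-word
struct cpumask, the helper names, and a pthread mutex playing the role of
desc->lock. Only the shape of the setter (a const struct cpumask pointer
that may be NULL) comes from the prototype quoted above. Because the
setter takes a pointer rather than a cpumask_var_t, passing NULL works
whether or not CONFIG_CPUMASK_OFFSTACK is set.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins for the kernel types. */
struct cpumask {
	unsigned long bits;
};

struct irq_desc_model {
	pthread_mutex_t lock;			/* plays the role of desc->lock */
	const struct cpumask *affinity_hint;	/* NULL means "no hint" */
};

/* One setter covers both register and unregister: pass NULL to clear. */
static int set_affinity_hint(struct irq_desc_model *desc,
			     const struct cpumask *m)
{
	pthread_mutex_lock(&desc->lock);
	desc->affinity_hint = m;
	pthread_mutex_unlock(&desc->lock);
	return 0;
}

/*
 * Copy the hint out while holding the lock, so the owner cannot clear
 * and free it halfway through the read. Returns 0 and fills *out if a
 * hint is set, -1 otherwise.
 */
static int read_affinity_hint(struct irq_desc_model *desc,
			      struct cpumask *out)
{
	int ret = -1;

	pthread_mutex_lock(&desc->lock);
	if (desc->affinity_hint) {
		*out = *desc->affinity_hint;
		ret = 0;
	}
	pthread_mutex_unlock(&desc->lock);
	return ret;
}
```

In this model the owner of the mask must call the setter with NULL before
freeing it; the reader never holds the pointer outside the lock.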

Thanks,

	tglx


* Re: OFT - reserving CPU's for networking
From: David Miller @ 2010-04-30 18:57 UTC (permalink / raw)
  To: tglx; +Cc: shemminger, eric.dumazet, ak, netdev, andi, peterz
In-Reply-To: <alpine.LFD.2.00.1004292055471.2951@localhost.localdomain>

From: Thomas Gleixner <tglx@linutronix.de>
Date: Thu, 29 Apr 2010 21:19:36 +0200 (CEST)

> Aside of that I seriously doubt that you can do networking w/o time
> and timers.

You're right that we need timestamps and the like.

But only if we actually process the packets on these restricted cpus :-)

If we use RPS and farm out all packets to other cpus, ie. just doing
the driver work and the remote cpu dispatch on these "offline" cpus,
it is doable.

Then we can do cool tricks like having the cpu spin on a mwait() on the
network device's status descriptor in memory.

In any event I agree with you, it's a cool idea at best, and likely
not really practical.


* Re: [PATCH v6] net: batch skb dequeueing from softnet input_pkt_queue
From: Eric Dumazet @ 2010-04-30  5:25 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andi Kleen, hadi, Changli Gao, David S. Miller, Tom Herbert,
	Stephen Hemminger, netdev, lenb, arjan
In-Reply-To: <20100429214144.GA10663@gargoyle.fritz.box>

On Thursday 29 April 2010 at 23:41 +0200, Andi Kleen wrote:
> On Thu, Apr 29, 2010 at 09:12:27PM +0200, Eric Dumazet wrote:
> > Yes, mostly, but about 200.000 wakeups per second I would say...
> > 
> > If a cpu in deep state receives an IPI, process a softirq, should it
> > come back to deep state immediately, or should it wait for some
> > milliseconds ?
> 
> In principle the cpuidle governor should detect this and not put the target into
> the slow deep C states. One change that was done recently to fix a similar
> problem for disk IO was to take processes that wait for IO into account
> (see 69d25870). But it doesn't work for networking.
> 
> Here's an untested patch that might help: tell the cpuidle governor
> networking is waiting for IO. This will tell it not to go down so deeply.
> 
> I might have missed some schedule() paths, feel free to add more.
> 
> Actually it's probably too aggressive because it will avoid C states even for
> a closed window on the other side which might be hours. Better would
> be some heuristic to only do this when you're really expecting IO shortly.
> 
> Also does your workload even sleep at all? If not we would need to increase
> the iowait counters in recvmsg() itself.
> 

Yes, my workload uses blocking recvmsg() calls, but Jamal's uses
epoll(), so I guess the problem is more generic than that. We should have an
estimate of the number of wakeups (IO or not...) per second (or
sub-second) so that cpuidle can avoid these deep states?

> Anyways might be still worth a try.
> 
> For routing we probably need some other solution though, there are no 
> schedules there.
> 
> > 
> > > Perhaps need to feed some information to cpuidle's governor to prevent this problem.
> > > 
> > > idle=poll is very drastic, better to limit to C1 
> > > 
> > 
> > How can I do this ?
> 
> processor.max_cstate=1 or using /dev/network_latency 
> (see Documentation/power/pm_qos_interface.txt)
> 
> -Andi
> 

Thanks, I'll play with this today !
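
For reference, the PM QoS device nodes are used by opening the node,
writing a 32-bit value, and keeping the descriptor open for as long as
the requirement should hold. Below is a minimal C sketch of that
sequence; the helper name pm_qos_request is made up for this note, and
which node the cpuidle governor actually honours on a given kernel is not
something this thread establishes.

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Register a latency requirement by writing a 32-bit value (in
 * microseconds) to a PM QoS device node. The requirement holds only
 * while the returned descriptor stays open; close() drops it.
 * Returns the open descriptor, or -1 on failure.
 */
static int pm_qos_request(const char *node, int32_t usec)
{
	int fd = open(node, O_WRONLY);

	if (fd < 0)
		return -1;
	if (write(fd, &usec, sizeof(usec)) != (ssize_t)sizeof(usec)) {
		close(fd);
		return -1;
	}
	return fd;
}
```

A benchmark would call something like
pm_qos_request("/dev/network_latency", 10) at startup (this normally
needs root) and close the descriptor on exit.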

> 
> 
> commit 810227a7c24ecae2bb4aac320490a7115ac33be8
> Author: Andi Kleen <ak@linux.intel.com>
> Date:   Thu Apr 29 23:33:18 2010 +0200
> 
>     Use io_schedule() in network stack to tell cpuidle governour to guarantee lower latencies
> 
>     XXX: probably too aggressive, some of these sleeps are not under high load.
> 
>     Based on a bug report from Eric Dumazet.
>     
>     Signed-off-by: Andi Kleen <ak@linux.intel.com>
> 
> diff --git a/net/core/sock.c b/net/core/sock.c
> index c5812bb..c246d6c 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1402,7 +1402,7 @@ static long sock_wait_for_wmem(struct sock *sk, long timeo)
>  			break;
>  		if (sk->sk_err)
>  			break;
> -		timeo = schedule_timeout(timeo);
> +		timeo = io_schedule_timeout(timeo);
>  	}
>  	finish_wait(sk->sk_sleep, &wait);
>  	return timeo;
> @@ -1512,7 +1512,7 @@ static void __lock_sock(struct sock *sk)
>  		prepare_to_wait_exclusive(&sk->sk_lock.wq, &wait,
>  					TASK_UNINTERRUPTIBLE);
>  		spin_unlock_bh(&sk->sk_lock.slock);
> -		schedule();
> +		io_schedule();
>  		spin_lock_bh(&sk->sk_lock.slock);
>  		if (!sock_owned_by_user(sk))
>  			break;
> 
> > 
> > Thanks !
> > 
> > 




* Re: sierra_net default m
From: David Miller @ 2010-04-30 18:49 UTC (permalink / raw)
  To: epasheva-ywE8TTl5eJHWpu6QEFMNjNBPR1lH4CV8
  Cc: jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	rfiler-ywE8TTl5eJHWpu6QEFMNjNBPR1lH4CV8,
	netdev-u79uwXL29TY76Z2rM5mHXA, linux-usb-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1272653150.9050.20.camel@Linuxdev3>

From: Elina Pasheva <epasheva-ywE8TTl5eJHWpu6QEFMNjNBPR1lH4CV8@public.gmane.org>
Date: Fri, 30 Apr 2010 11:45:50 -0700

> OK. Makes sense, so that the kernel does not build the driver if the
> configuration option is not explicitly set.

All non-core drivers should default to "n".

The default controls what the user is suggested to choose when they
run the configuration without an explicit setting in their existing
config.

They should never be told to turn on a driver by default for an
odd-ball piece of hardware.

Even the IDE and ATA disk layers don't default to anything explicitly.

Just remove the default tag altogether.  I see that a bunch of other USB
networking drivers have all sorts of default tags, I wish you hadn't
followed their lead. :-)


* Re: [PATCH v6] net: batch skb dequeueing from softnet input_pkt_queue
From: Eric Dumazet @ 2010-04-29 17:56 UTC (permalink / raw)
  To: Andi Kleen
  Cc: hadi, Changli Gao, David S. Miller, Tom Herbert,
	Stephen Hemminger, netdev, Andi Kleen
In-Reply-To: <20100429174056.GA8044@gargoyle.fritz.box>

On Thursday 29 April 2010 at 19:42 +0200, Andi Kleen wrote:
> > Andi, what do you think of this one ?
> > Dont we have a function to send an IPI to an individual cpu instead ?
> 
> That's what this function already does. You only set a single CPU 
> in the target mask, right?
> 
> IPIs are unfortunately always a bit slow. Nehalem-EX systems have X2APIC
> which is a bit faster for this, but that's not available in the lower
> end Nehalems. But even then it's not exactly fast.
> 
> I don't think the IPI primitive can be optimized much. It's not a cheap 
> operation.
> 
> If it's a problem do it less often and batch IPIs.
> 
> It's essentially the same problem as interrupt mitigation or NAPI 
> are solving for NICs. I guess just need a suitable mitigation mechanism.
> 
> Of course that would move more work to the sending CPU again, but 
> perhaps there's no alternative. I guess you could make it cheaper it by
> minimizing access to packet data.
> 
> -Andi

Well, IPIs are already batched, and the rate is auto-adaptive.

After various changes, things seem to be going better; maybe there is
something related to cache line thrashing.

I 'solved' it by using idle=poll, but you might take a look at the
clockevents_notify (acpi_idle_enter_bm) abuse of a shared and highly
contended spinlock...




    23.52%            init  [kernel.kallsyms]             [k] _raw_spin_lock_irqsave
                      |
                      --- _raw_spin_lock_irqsave
                         |          
                         |--94.74%-- clockevents_notify
                         |          lapic_timer_state_broadcast
                         |          acpi_idle_enter_bm
                         |          cpuidle_idle_call
                         |          cpu_idle
                         |          start_secondary
                         |          
                         |--4.10%-- tick_broadcast_oneshot_control
                         |          tick_notify
                         |          notifier_call_chain
                         |          __raw_notifier_call_chain
                         |          raw_notifier_call_chain
                         |          clockevents_do_notify
                         |          clockevents_notify
                         |          lapic_timer_state_broadcast
                         |          acpi_idle_enter_bm
                         |          cpuidle_idle_call
                         |          cpu_idle
                         |          start_secondary
                         |          


^ permalink raw reply

* Re: sierra_net default m
From: Elina Pasheva @ 2010-04-30 18:45 UTC (permalink / raw)
  To: Jens Axboe
  Cc: davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org, Rory Filer,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	epasheva-ywE8TTl5eJHWpu6QEFMNjNBPR1lH4CV8, linux-usb
In-Reply-To: <20100430172838.GJ27497-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>

On Fri, 2010-04-30 at 10:28 -0700, Jens Axboe wrote:
> Hi,
> 
> Can we please not add new drivers that default to y/m? Any new driver
> should default to 'n'.
> 
Hi,
OK. Makes sense, so that the kernel does not build the driver if the
configuration option is not explicitly set.


David, would you like me to issue a patch for that for sierra_net
driver?
Is this the general rule one should follow when adding a new driver?

Thanks,
Elina


^ permalink raw reply

* [PATCH 2/2] ppp_generic: linearise skbs before passing them to pppd
From: Simon Arlott @ 2010-04-30 18:41 UTC (permalink / raw)
  To: netdev; +Cc: paulus, linux-ppp
In-Reply-To: <4BDB244D.40800@simon.arlott.org.uk>

Frequently when using PPPoE with an interface MTU greater than 1500,
the skb is likely to be non-linear. If the skb needs to be passed to
pppd then the skb must be linearised first.

The previous commit fixes an issue with accidentally sending skbs
to pppd based on an invalid read of the protocol type. When that
error occurred pppd was reading invalid skb data too.

Signed-off-by: Simon Arlott <simon@fire.lp0.eu>
---
Tested with PPPoE over e1000 at MTU 16110.

 drivers/net/ppp_generic.c |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ppp_generic.c b/drivers/net/ppp_generic.c
index fdd8deb..6855a7b 100644
--- a/drivers/net/ppp_generic.c
+++ b/drivers/net/ppp_generic.c
@@ -1222,6 +1222,8 @@ ppp_send_frame(struct ppp *ppp, struct sk_buff *skb)
 	if (ppp->flags & SC_LOOP_TRAFFIC) {
 		if (ppp->file.rq.qlen > PPP_MAX_RQLEN)
 			goto drop;
+		if (skb_linearize(skb))
+			goto drop;
 		skb_queue_tail(&ppp->file.rq, skb);
 		wake_up_interruptible(&ppp->file.rwait);
 		return;
@@ -1586,6 +1588,12 @@ ppp_input(struct ppp_channel *chan, struct sk_buff *skb)
 	proto = PPP_PROTO(skb);
 	if (!pch->ppp || proto >= 0xc000 || proto == PPP_CCPFRAG) {
 		/* put it on the channel queue */
+		if (skb_linearize(skb)) {
+			kfree_skb(skb);
+			if (pch->ppp)
+				ppp_receive_error(pch->ppp);
+			goto done;
+		}
 		skb_queue_tail(&pch->file.rq, skb);
 		/* drop old frames if queue too long */
 		while (pch->file.rq.qlen > PPP_MAX_RQLEN &&
@@ -1733,6 +1741,8 @@ ppp_receive_nonmp_frame(struct ppp *ppp, struct sk_buff *skb)
 	npi = proto_to_npindex(proto);
 	if (npi < 0) {
 		/* control or unknown frame - pass it to pppd */
+		if (skb_linearize(skb))
+			goto err;
 		skb_queue_tail(&ppp->file.rq, skb);
 		/* limit queue length by dropping old frames */
 		while (ppp->file.rq.qlen > PPP_MAX_RQLEN &&
-- 
1.7.0.4

-- 
Simon Arlott

^ permalink raw reply related

* [PATCH 1/2] ppp_generic: pull 2 bytes so that PPP_PROTO(skb) is valid
From: Simon Arlott @ 2010-04-30 18:41 UTC (permalink / raw)
  To: netdev; +Cc: paulus, linux-ppp

In ppp_input(), PPP_PROTO(skb) may refer to invalid data in the skb.

If this happens and (proto >= 0xc000 || proto == PPP_CCPFRAG) then
the packet is passed directly to pppd.

This occurs frequently when using PPPoE with an interface MTU
greater than 1500 because the skb is more likely to be non-linear.

The next 2 bytes need to be pulled in ppp_input(). The pull of 2
bytes in ppp_receive_frame() has been removed as it is no longer
required.

Signed-off-by: Simon Arlott <simon@fire.lp0.eu>
---
Tested with PPPoE over e1000 at MTU 16110.

 drivers/net/ppp_generic.c |   28 ++++++++++++++++++----------
 1 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ppp_generic.c b/drivers/net/ppp_generic.c
index 6e281bc..fdd8deb 100644
--- a/drivers/net/ppp_generic.c
+++ b/drivers/net/ppp_generic.c
@@ -1572,8 +1572,18 @@ ppp_input(struct ppp_channel *chan, struct sk_buff *skb)
 		return;
 	}
 
-	proto = PPP_PROTO(skb);
+
 	read_lock_bh(&pch->upl);
+	if (!pskb_may_pull(skb, 2)) {
+		kfree_skb(skb);
+		if (pch->ppp) {
+			++pch->ppp->dev->stats.rx_length_errors;
+			ppp_receive_error(pch->ppp);
+		}
+		goto done;
+	}
+
+	proto = PPP_PROTO(skb);
 	if (!pch->ppp || proto >= 0xc000 || proto == PPP_CCPFRAG) {
 		/* put it on the channel queue */
 		skb_queue_tail(&pch->file.rq, skb);
@@ -1585,6 +1595,8 @@ ppp_input(struct ppp_channel *chan, struct sk_buff *skb)
 	} else {
 		ppp_do_recv(pch->ppp, skb, pch);
 	}
+
+done:
 	read_unlock_bh(&pch->upl);
 }
 
@@ -1617,7 +1629,8 @@ ppp_input_error(struct ppp_channel *chan, int code)
 static void
 ppp_receive_frame(struct ppp *ppp, struct sk_buff *skb, struct channel *pch)
 {
-	if (pskb_may_pull(skb, 2)) {
+	/* note: a 0-length skb is used as an error indication */
+	if (skb->len > 0) {
 #ifdef CONFIG_PPP_MULTILINK
 		/* XXX do channel-level decompression here */
 		if (PPP_PROTO(skb) == PPP_MP)
@@ -1625,15 +1638,10 @@ ppp_receive_frame(struct ppp *ppp, struct sk_buff *skb, struct channel *pch)
 		else
 #endif /* CONFIG_PPP_MULTILINK */
 			ppp_receive_nonmp_frame(ppp, skb);
-		return;
+	} else {
+		kfree_skb(skb);
+		ppp_receive_error(ppp);
 	}
-
-	if (skb->len > 0)
-		/* note: a 0-length skb is used as an error indication */
-		++ppp->dev->stats.rx_length_errors;
-
-	kfree_skb(skb);
-	ppp_receive_error(ppp);
 }
 
 static void
-- 
1.7.0.4

-- 
Simon Arlott

^ permalink raw reply related

* Re: [PATCH 5/5] sctp: Fix oops when sending queued ASCONF chunks
From: Shuaijun Zhang @ 2010-04-29 14:09 UTC (permalink / raw)
  To: Vlad Yasevich; +Cc: netdev, davem, linux-sctp, Yuansong Qiao
In-Reply-To: <1272480442-32673-6-git-send-email-vladislav.yasevich@hp.com>

Vlad Yasevich wrote:
> When we finish processing ASCONF_ACK chunk, we try to send
> the next queued ASCONF.  This action runs the sctp state
> machine recursively and it's not prepared to do so.
>
> kernel BUG at kernel/timer.c:790!
> invalid opcode: 0000 [#1] SMP
> last sysfs file: /sys/module/ipv6/initstate
> Modules linked in: sha256_generic sctp libcrc32c ipv6 dm_multipath
> uinput 8139too i2c_piix4 8139cp mii i2c_core pcspkr virtio_net joydev
> floppy virtio_blk virtio_pci [last unloaded: scsi_wait_scan]
>
> Pid: 0, comm: swapper Not tainted 2.6.34-rc4 #15 /Bochs
> EIP: 0060:[<c044a2ef>] EFLAGS: 00010286 CPU: 0
> EIP is at add_timer+0xd/0x1b
> EAX: cecbab14 EBX: 000000f0 ECX: c0957b1c EDX: 03595cf4
> ESI: cecba800 EDI: cf276f00 EBP: c0957aa0 ESP: c0957aa0
>  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
> Process swapper (pid: 0, ti=c0956000 task=c0988ba0 task.ti=c0956000)
> Stack:
>  c0957ae0 d1851214 c0ab62e4 c0ab5f26 0500ffff 00000004 00000005 00000004
> <0> 00000000 d18694fd 00000004 1666b892 cecba800 cecba800 c0957b14
> 00000004
> <0> c0957b94 d1851b11 ceda8b00 cecba800 cf276f00 00000001 c0957b14
> 000000d0
>   
According to the call trace below, it seems that our modification did
not take effect.
sctp_primitive_ASCONF should be invoked after sctp_side_effects().
Our code fixed the same problem in kernel 2.6.27.28.
I am not sure what differs between the 2.6.34-rc4 kernel and the
2.6.27.28 kernel.
> Call Trace:
>  [<d1851214>] ? sctp_side_effects+0x607/0xdfc [sctp]
>  [<d1851b11>] ? sctp_do_sm+0x108/0x159 [sctp]
>  [<d1863386>] ? sctp_pname+0x0/0x1d [sctp]
>  [<d1861a56>] ? sctp_primitive_ASCONF+0x36/0x3b [sctp]		<--- sctp_side_effects() should show up here before send next asconf
>  [<d185657c>] ? sctp_process_asconf_ack+0x2a4/0x2d3 [sctp]
>  [<d184e35c>] ? sctp_sf_do_asconf_ack+0x1dd/0x2b4 [sctp]
>  [<d1851ac1>] ? sctp_do_sm+0xb8/0x159 [sctp]
>  [<d1863334>] ? sctp_cname+0x0/0x52 [sctp]
>  [<d1854377>] ? sctp_assoc_bh_rcv+0xac/0xe1 [sctp]
>  [<d1858f0f>] ? sctp_inq_push+0x2d/0x30 [sctp]
>  [<d186329d>] ? sctp_rcv+0x797/0x82e [sctp]
>
> Tested-by: Wei Yongjun <yjwei@cn.fujitsu.com>
> Signed-off-by: Yuansong Qiao <ysqiao@research.ait.ie>
> Signed-off-by: Shuaijun Zhang <szhang@research.ait.ie>
> Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
> ---
>  include/net/sctp/command.h |    1 +
>  net/sctp/sm_make_chunk.c   |   15 ---------------
>  net/sctp/sm_sideeffect.c   |   26 ++++++++++++++++++++++++++
>  net/sctp/sm_statefuns.c    |    8 +++++++-
>  4 files changed, 34 insertions(+), 16 deletions(-)
>
> diff --git a/include/net/sctp/command.h b/include/net/sctp/command.h
> index 8be5135..2c55a7e 100644
> --- a/include/net/sctp/command.h
> +++ b/include/net/sctp/command.h
> @@ -107,6 +107,7 @@ typedef enum {
>  	SCTP_CMD_T1_RETRAN,	 /* Mark for retransmission after T1 timeout  */
>  	SCTP_CMD_UPDATE_INITTAG, /* Update peer inittag */
>  	SCTP_CMD_SEND_MSG,	 /* Send the whole use message */
> +	SCTP_CMD_SEND_NEXT_ASCONF, /* Send the next ASCONF after ACK */
>  	SCTP_CMD_LAST
>  } sctp_verb_t;
>  
> diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
> index f6fc5c1..0fd5b4c 100644
> --- a/net/sctp/sm_make_chunk.c
> +++ b/net/sctp/sm_make_chunk.c
> @@ -3318,21 +3318,6 @@ int sctp_process_asconf_ack(struct sctp_association *asoc,
>  	sctp_chunk_free(asconf);
>  	asoc->addip_last_asconf = NULL;
>  
> -	/* Send the next asconf chunk from the addip chunk queue. */
> -	if (!list_empty(&asoc->addip_chunk_list)) {
> -		struct list_head *entry = asoc->addip_chunk_list.next;
> -		asconf = list_entry(entry, struct sctp_chunk, list);
> -
> -		list_del_init(entry);
> -
> -		/* Hold the chunk until an ASCONF_ACK is received. */
> -		sctp_chunk_hold(asconf);
> -		if (sctp_primitive_ASCONF(asoc, asconf))
> -			sctp_chunk_free(asconf);
> -		else
> -			asoc->addip_last_asconf = asconf;
> -	}
> -
>  	return retval;
>  }
>  
> diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c
> index 4c5bed9..d5ae450 100644
> --- a/net/sctp/sm_sideeffect.c
> +++ b/net/sctp/sm_sideeffect.c
> @@ -962,6 +962,29 @@ static int sctp_cmd_send_msg(struct sctp_association *asoc,
>  }
>  
>  
> +/* Send the next ASCONF packet currently stored in the association.
> + * This happens after the ASCONF_ACK was successfully processed.
> + */
> +static void sctp_cmd_send_asconf(struct sctp_association *asoc)
> +{
> +	/* Send the next asconf chunk from the addip chunk
> +	 * queue.
> +	 */
> +	if (!list_empty(&asoc->addip_chunk_list)) {
> +		struct list_head *entry = asoc->addip_chunk_list.next;
> +		struct sctp_chunk *asconf = list_entry(entry,
> +						struct sctp_chunk, list);
> +		list_del_init(entry);
> +
> +		/* Hold the chunk until an ASCONF_ACK is received. */
> +		sctp_chunk_hold(asconf);
> +		if (sctp_primitive_ASCONF(asoc, asconf))
> +			sctp_chunk_free(asconf);
> +		else
> +			asoc->addip_last_asconf = asconf;
> +	}
> +}
> +
>  
>  /* These three macros allow us to pull the debugging code out of the
>   * main flow of sctp_do_sm() to keep attention focused on the real
> @@ -1617,6 +1640,9 @@ static int sctp_cmd_interpreter(sctp_event_t event_type,
>  			}
>  			error = sctp_cmd_send_msg(asoc, cmd->obj.msg);
>  			break;
> +		case SCTP_CMD_SEND_NEXT_ASCONF:
> +			sctp_cmd_send_asconf(asoc);
> +			break;
>  		default:
>  			printk(KERN_WARNING "Impossible command: %u, %p\n",
>  			       cmd->verb, cmd->obj.ptr);
> diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
> index abf601a..24b2cd5 100644
> --- a/net/sctp/sm_statefuns.c
> +++ b/net/sctp/sm_statefuns.c
> @@ -3676,8 +3676,14 @@ sctp_disposition_t sctp_sf_do_asconf_ack(const struct sctp_endpoint *ep,
>  				SCTP_TO(SCTP_EVENT_TIMEOUT_T4_RTO));
>  
>  		if (!sctp_process_asconf_ack((struct sctp_association *)asoc,
> -					     asconf_ack))
> +					     asconf_ack)) {
> +			/* Successfully processed ASCONF_ACK.  We can
> +			 * release the next asconf if we have one.
> +			 */
> +			sctp_add_cmd_sf(commands, SCTP_CMD_SEND_NEXT_ASCONF,
> +					SCTP_NULL());
>  			return SCTP_DISPOSITION_CONSUME;
> +		}
>  
>  		abort = sctp_make_abort(asoc, asconf_ack,
>  					sizeof(sctp_errhdr_t));
>   


^ permalink raw reply

* Re: [PATCH] [RFC] C/R: inet4 and inet6 unicast routes (v2)
From: Serge E. Hallyn @ 2010-04-30 18:37 UTC (permalink / raw)
  To: Dan Smith; +Cc: containers, Vlad Yasevich, netdev, David Miller
In-Reply-To: <87mxwkztjf.fsf@caffeine.danplanet.com>

Quoting Dan Smith (danms@us.ibm.com):
> SH> So I'm afraid you're going to have to do a slightly uglier thing
> SH> where you unshare_nsproxy_namespaces() and then
> SH> switch_task_namespaces() to the new nsproxy.
> 
> Well, I think that would be hidden in the nicer helper function I
> think I'll need, which I alluded to in the patch header.  This is just
> an RFC proof that it can be done in this manner, but I think a
> separate helper in nsproxy.c is in order to make it nice (and avoid
> the extra alloc/free of the netns that copy_namespaces() will create).
> Agreed?

Yup - thanks!

-serge

^ permalink raw reply

* Re: [PATCH] [RFC] C/R: inet4 and inet6 unicast routes (v2)
From: Dan Smith @ 2010-04-30 18:25 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: containers, Vlad Yasevich, netdev, David Miller
In-Reply-To: <20100430181946.GA26761@us.ibm.com>

SH> So I'm afraid you're going to have to do a slightly uglier thing
SH> where you unshare_nsproxy_namespaces() and then
SH> switch_task_namespaces() to the new nsproxy.

Well, I think that would be hidden in the nicer helper function I
think I'll need, which I alluded to in the patch header.  This is just
an RFC proof that it can be done in this manner, but I think a
separate helper in nsproxy.c is in order to make it nice (and avoid
the extra alloc/free of the netns that copy_namespaces() will create).
Agreed?

Thanks!

-- 
Dan Smith
IBM Linux Technology Center
email: danms@us.ibm.com

^ permalink raw reply

* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: Eric Dumazet @ 2010-04-29 13:49 UTC (permalink / raw)
  To: hadi
  Cc: Changli Gao, David Miller, therbert, shemminger, netdev,
	Eilon Greenstein, Brian Bloniarz
In-Reply-To: <1272548258.4258.185.camel@bigi>

On Thursday 29 April 2010 at 09:37 -0400, jamal wrote:
> On Thu, 2010-04-29 at 15:21 +0200, Eric Dumazet wrote:
> 
> 
> > 
> > You could try following program :
> > 
> 
> Will do later today (test machine is not on the network and is about 20
> minutes from here; so worst case i will get you results by end of day)
> I guess this program is good enough since it tells me the system wide
> ipi count - what my patch did was also to break it down by which cpu got
> how many IPIs (served to check if there was uneven distribution)
> 
> > 
> > Is your application mono threaded and receiving data to 8 sockets ?
> > 
> 
> I fork one instance per detected cpu and bind to different ports each
> time. Example bind to port 8200 on cpu0, 8201 on cpu1, etc.
> 

I guess this is the problem ;)

With RPS, you should not bind your threads to a cpu.
The rps hash will decide for you.


I am using the following program :

/*
 *  Usage: udpsink [ -p baseport] nbports
 *
 */
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <stdio.h>
#include <errno.h>
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
#include <pthread.h>

struct worker_data {
	int fd;
	unsigned long pack_count;
	unsigned long bytes_count;
	unsigned long _padd[16 - 3]; /* alignment */ 
};

void usage(int code)
{
	fprintf(stderr, "Usage: udpsink [-p baseport] nbports\n");
	exit(code);
}

void *worker_func(void *arg)
{
	struct worker_data *wdata = (struct worker_data *)arg;
	char buffer[4096];
	struct sockaddr_in addr;
	int lu;

	while (1) {
		socklen_t len = sizeof(addr);
		lu = recvfrom(wdata->fd, buffer, sizeof(buffer), 0, (struct sockaddr *)&addr, &len);
		if (lu > 0) {
			wdata->pack_count++;
			wdata->bytes_count += lu;
		}
	}
}

int main(int argc, char *argv[])
{
int c;
int baseport = 4000;
int nbthreads;
struct worker_data *wdata;
unsigned long ototal = 0;
int concurrent = 0;
int verbose = 0;
int i;
	while ((c = getopt(argc, argv, "cvp:")) != -1) {
		if (c == 'p')
			baseport = atoi(optarg);
		else if (c == 'c')
			concurrent = 1;
		else if (c == 'v')
			verbose++;
		else usage(1);
	}
	if (optind == argc)
		usage(1);
	nbthreads = atoi(argv[optind]);
	wdata = calloc(sizeof(struct worker_data), nbthreads);
	if (!wdata) {
		perror("calloc");
		return 1;
	}
	for (i = 0; i < nbthreads; i++) {
		struct sockaddr_in addr;
		pthread_t tid;

		if (i && concurrent) {
			wdata[i].fd = wdata[0].fd ;
		} else {
			wdata[i].fd = socket(PF_INET, SOCK_DGRAM, 0);
			if (wdata[i].fd == -1) {
				perror("socket");
				return 1;
			}
			memset(&addr, 0, sizeof(addr));
			addr.sin_family = AF_INET;
//			addr.sin_addr.s_addr = inet_addr(argv[optind]);
			addr.sin_port = htons(baseport + i);
			if (bind(wdata[i].fd, (struct sockaddr *) &addr, sizeof(addr)) < 0) {
				perror("bind");
				return 1;
				}
//			fcntl(wdata[i].fd, F_SETFL, O_NDELAY);
			}
		pthread_create(&tid, NULL, worker_func, wdata + i);
	}
	for (;;) {
		unsigned long total;
		long delta;

		sleep(1);
		total = 0;
		for (i = 0; i < nbthreads;i++) {
			total += wdata[i].pack_count;
		}
		delta = total - ototal;
		if (delta) {
			printf("%lu pps (%lu", delta, total);
			if (verbose) {
				for (i = 0; i < nbthreads;i++) { 
					if (wdata[i].pack_count)
						printf(" %d:%lu", i, wdata[i].pack_count);
				}
			}
			printf(")\n");
		}
		ototal = total;
	}
}




^ permalink raw reply

* [PATCH net-next-2.6] net: sock_def_readable() and friends RCU conversion
From: Eric Dumazet @ 2010-04-29 21:01 UTC (permalink / raw)
  To: hadi
  Cc: Changli Gao, David Miller, therbert, shemminger, netdev,
	Eilon Greenstein, Brian Bloniarz
In-Reply-To: <1272573383.3969.8.camel@bigi>

On Thursday 29 April 2010 at 16:36 -0400, jamal wrote:

> Results attached.
> With your app rps does a hell of a lot better and non-rps worse ;->
> With my proggie, non-rps does much better than yours and rps does
> a lot worse for the same setup. I see the scheduler kicking in quite a bit in
> non-rps for you...
> 
> The main difference between us as i see it is:
> a) i use epoll - actually linked to libevent (1.0.something)
> b) I fork processes and you use pthreads.
> 
> I don't have time to chase it today, but 1) I am either going to change
> yours to use libevent or make mine get rid of it then 2) move towards
> pthreads or have yours fork..
> then observe if that makes any difference..
> 

Thanks !

Here is the last 'patch of the day' for me ;)

Next one will be able to coalesce wakeup calls (they'll be delayed at
the end of net_rx_action(), like a patch I did last year to help
multicast reception)

vger seems to be down, I suspect I'll have to resend it later.

[PATCH net-next-2.6] net: sock_def_readable() and friends RCU conversion

The sk_callback_lock rwlock actually protects the sk->sk_sleep pointer,
so we need two atomic operations (and the associated cache line
dirtying) per incoming packet.

RCU conversion is pretty much needed :

1) Add a new structure, called "struct socket_wq" to hold all fields
that will need rcu_read_lock() protection (currently: a
wait_queue_head_t and a struct fasync_struct pointer).

[Future patch will add a list anchor for wakeup coalescing]

2) Attach one such structure to each "struct socket" created in
sock_alloc_inode().

3) Respect the RCU grace period when freeing a "struct socket_wq".

4) Replace the sk_sleep pointer in "struct sock" with sk_wq, a pointer
to "struct socket_wq".

5) Change the sk_sleep() function to use the new sk->sk_wq instead of
sk->sk_sleep.

6) Change sk_has_sleeper() to wq_has_sleeper(), which must be used
inside an rcu_read_lock() section.

7) Change all sk_has_sleeper() callers to:
  - Use rcu_read_lock() instead of read_lock(&sk->sk_callback_lock)
  - Use wq_has_sleeper() to wake up tasks if any are waiting.
  - Use rcu_read_unlock() instead of read_unlock(&sk->sk_callback_lock)

8) sock_wake_async() is modified to use RCU protection as well.

9) Exceptions:
  macvtap, drivers/net/tun.c and af_unix use an integrated "struct socket_wq"
instead of dynamically allocated ones. They don't need RCU freeing.

Some cleanups or followups are probably needed, (possible
sk_callback_lock conversion to a spinlock for example...).

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
--- 
 drivers/net/macvtap.c |   13 +++++++---
 drivers/net/tun.c     |   21 +++++++++-------
 include/linux/net.h   |   14 +++++++----
 include/net/af_unix.h |   20 ++++++++--------
 include/net/sock.h    |   40 ++++++++++++++++----------------
 net/atm/common.c      |   22 +++++++++++------
 net/core/sock.c       |   50 ++++++++++++++++++++++++----------------
 net/core/stream.c     |   10 +++++---
 net/dccp/output.c     |   10 ++++----
 net/iucv/af_iucv.c    |   11 +++++---
 net/phonet/pep.c      |    8 +++---
 net/phonet/socket.c   |    2 -
 net/rxrpc/af_rxrpc.c  |   10 ++++----
 net/sctp/socket.c     |    2 -
 net/socket.c          |   47 ++++++++++++++++++++++++++++---------
 net/unix/af_unix.c    |   17 ++++++-------
 16 files changed, 182 insertions(+), 115 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index d97e1fd..1c4110d 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -37,6 +37,7 @@
 struct macvtap_queue {
 	struct sock sk;
 	struct socket sock;
+	struct socket_wq wq;
 	struct macvlan_dev *vlan;
 	struct file *file;
 	unsigned int flags;
@@ -242,12 +243,15 @@ static struct rtnl_link_ops macvtap_link_ops __read_mostly = {
 
 static void macvtap_sock_write_space(struct sock *sk)
 {
+	wait_queue_head_t *wqueue;
+
 	if (!sock_writeable(sk) ||
 	    !test_and_clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags))
 		return;
 
-	if (sk_sleep(sk) && waitqueue_active(sk_sleep(sk)))
-		wake_up_interruptible_poll(sk_sleep(sk), POLLOUT | POLLWRNORM | POLLWRBAND);
+	wqueue = sk_sleep(sk);
+	if (wqueue && waitqueue_active(wqueue))
+		wake_up_interruptible_poll(wqueue, POLLOUT | POLLWRNORM | POLLWRBAND);
 }
 
 static int macvtap_open(struct inode *inode, struct file *file)
@@ -272,7 +276,8 @@ static int macvtap_open(struct inode *inode, struct file *file)
 	if (!q)
 		goto out;
 
-	init_waitqueue_head(&q->sock.wait);
+	q->sock.wq = &q->wq;
+	init_waitqueue_head(&q->wq.wait);
 	q->sock.type = SOCK_RAW;
 	q->sock.state = SS_CONNECTED;
 	q->sock.file = file;
@@ -308,7 +313,7 @@ static unsigned int macvtap_poll(struct file *file, poll_table * wait)
 		goto out;
 
 	mask = 0;
-	poll_wait(file, &q->sock.wait, wait);
+	poll_wait(file, &q->wq.wait, wait);
 
 	if (!skb_queue_empty(&q->sk.sk_receive_queue))
 		mask |= POLLIN | POLLRDNORM;
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 20a1793..e525a6c 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -109,7 +109,7 @@ struct tun_struct {
 
 	struct tap_filter       txflt;
 	struct socket		socket;
-
+	struct socket_wq	wq;
 #ifdef TUN_DEBUG
 	int debug;
 #endif
@@ -323,7 +323,7 @@ static void tun_net_uninit(struct net_device *dev)
 	/* Inform the methods they need to stop using the dev.
 	 */
 	if (tfile) {
-		wake_up_all(&tun->socket.wait);
+		wake_up_all(&tun->wq.wait);
 		if (atomic_dec_and_test(&tfile->count))
 			__tun_detach(tun);
 	}
@@ -398,7 +398,7 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 	/* Notify and wake up reader process */
 	if (tun->flags & TUN_FASYNC)
 		kill_fasync(&tun->fasync, SIGIO, POLL_IN);
-	wake_up_interruptible_poll(&tun->socket.wait, POLLIN |
+	wake_up_interruptible_poll(&tun->wq.wait, POLLIN |
 				   POLLRDNORM | POLLRDBAND);
 	return NETDEV_TX_OK;
 
@@ -498,7 +498,7 @@ static unsigned int tun_chr_poll(struct file *file, poll_table * wait)
 
 	DBG(KERN_INFO "%s: tun_chr_poll\n", tun->dev->name);
 
-	poll_wait(file, &tun->socket.wait, wait);
+	poll_wait(file, &tun->wq.wait, wait);
 
 	if (!skb_queue_empty(&sk->sk_receive_queue))
 		mask |= POLLIN | POLLRDNORM;
@@ -773,7 +773,7 @@ static ssize_t tun_do_read(struct tun_struct *tun,
 
 	DBG(KERN_INFO "%s: tun_chr_read\n", tun->dev->name);
 
-	add_wait_queue(&tun->socket.wait, &wait);
+	add_wait_queue(&tun->wq.wait, &wait);
 	while (len) {
 		current->state = TASK_INTERRUPTIBLE;
 
@@ -804,7 +804,7 @@ static ssize_t tun_do_read(struct tun_struct *tun,
 	}
 
 	current->state = TASK_RUNNING;
-	remove_wait_queue(&tun->socket.wait, &wait);
+	remove_wait_queue(&tun->wq.wait, &wait);
 
 	return ret;
 }
@@ -861,6 +861,7 @@ static struct rtnl_link_ops tun_link_ops __read_mostly = {
 static void tun_sock_write_space(struct sock *sk)
 {
 	struct tun_struct *tun;
+	wait_queue_head_t *wqueue;
 
 	if (!sock_writeable(sk))
 		return;
@@ -868,8 +869,9 @@ static void tun_sock_write_space(struct sock *sk)
 	if (!test_and_clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags))
 		return;
 
-	if (sk_sleep(sk) && waitqueue_active(sk_sleep(sk)))
-		wake_up_interruptible_sync_poll(sk_sleep(sk), POLLOUT |
+	wqueue = sk_sleep(sk);
+	if (wqueue && waitqueue_active(wqueue))
+		wake_up_interruptible_sync_poll(wqueue, POLLOUT |
 						POLLWRNORM | POLLWRBAND);
 
 	tun = tun_sk(sk)->tun;
@@ -1039,7 +1041,8 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
 		if (!sk)
 			goto err_free_dev;
 
-		init_waitqueue_head(&tun->socket.wait);
+		tun->socket.wq = &tun->wq;
+		init_waitqueue_head(&tun->wq.wait);
 		tun->socket.ops = &tun_socket_ops;
 		sock_init_data(&tun->socket, sk);
 		sk->sk_write_space = tun_sock_write_space;
diff --git a/include/linux/net.h b/include/linux/net.h
index 4157b5d..2b4deee 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -59,6 +59,7 @@ typedef enum {
 #include <linux/wait.h>
 #include <linux/fcntl.h>	/* For O_CLOEXEC and O_NONBLOCK */
 #include <linux/kmemcheck.h>
+#include <linux/rcupdate.h>
 
 struct poll_table_struct;
 struct pipe_inode_info;
@@ -116,6 +117,12 @@ enum sock_shutdown_cmd {
 	SHUT_RDWR	= 2,
 };
 
+struct socket_wq {
+	wait_queue_head_t	wait;
+	struct fasync_struct	*fasync_list;
+	struct rcu_head		rcu;
+} ____cacheline_aligned_in_smp;
+
 /**
  *  struct socket - general BSD socket
  *  @state: socket state (%SS_CONNECTED, etc)
@@ -135,11 +142,8 @@ struct socket {
 	kmemcheck_bitfield_end(type);
 
 	unsigned long		flags;
-	/*
-	 * Please keep fasync_list & wait fields in the same cache line
-	 */
-	struct fasync_struct	*fasync_list;
-	wait_queue_head_t	wait;
+
+	struct socket_wq	*wq;
 
 	struct file		*file;
 	struct sock		*sk;
diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 1614d78..20725e2 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -30,7 +30,7 @@ struct unix_skb_parms {
 #endif
 };
 
-#define UNIXCB(skb) 	(*(struct unix_skb_parms*)&((skb)->cb))
+#define UNIXCB(skb) 	(*(struct unix_skb_parms *)&((skb)->cb))
 #define UNIXCREDS(skb)	(&UNIXCB((skb)).creds)
 #define UNIXSID(skb)	(&UNIXCB((skb)).secid)
 
@@ -45,21 +45,23 @@ struct unix_skb_parms {
 struct unix_sock {
 	/* WARNING: sk has to be the first member */
 	struct sock		sk;
-        struct unix_address     *addr;
-        struct dentry		*dentry;
-        struct vfsmount		*mnt;
+	struct unix_address     *addr;
+	struct dentry		*dentry;
+	struct vfsmount		*mnt;
 	struct mutex		readlock;
-        struct sock		*peer;
-        struct sock		*other;
+	struct sock		*peer;
+	struct sock		*other;
 	struct list_head	link;
-        atomic_long_t           inflight;
-        spinlock_t		lock;
+	atomic_long_t		inflight;
+	spinlock_t		lock;
 	unsigned int		gc_candidate : 1;
 	unsigned int		gc_maybe_cycle : 1;
-        wait_queue_head_t       peer_wait;
+	struct socket_wq	peer_wq;
 };
 #define unix_sk(__sk) ((struct unix_sock *)__sk)
 
+#define peer_wait peer_wq.wait
+
 #ifdef CONFIG_SYSCTL
 extern int unix_sysctl_register(struct net *net);
 extern void unix_sysctl_unregister(struct net *net);
diff --git a/include/net/sock.h b/include/net/sock.h
index d361c77..03d0046 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -159,7 +159,7 @@ struct sock_common {
   *	@sk_userlocks: %SO_SNDBUF and %SO_RCVBUF settings
   *	@sk_lock:	synchronizer
   *	@sk_rcvbuf: size of receive buffer in bytes
-  *	@sk_sleep: sock wait queue
+  *	@sk_wq: sock wait queue and async head
   *	@sk_dst_cache: destination cache
   *	@sk_dst_lock: destination cache lock
   *	@sk_policy: flow policy
@@ -257,7 +257,7 @@ struct sock {
 		struct sk_buff *tail;
 		int len;
 	} sk_backlog;
-	wait_queue_head_t	*sk_sleep;
+	struct socket_wq	*sk_wq;
 	struct dst_entry	*sk_dst_cache;
 #ifdef CONFIG_XFRM
 	struct xfrm_policy	*sk_policy[2];
@@ -1219,7 +1219,7 @@ static inline void sk_set_socket(struct sock *sk, struct socket *sock)
 
 static inline wait_queue_head_t *sk_sleep(struct sock *sk)
 {
-	return sk->sk_sleep;
+	return &sk->sk_wq->wait;
 }
 /* Detach socket from process context.
  * Announce socket dead, detach it from wait queue and inode.
@@ -1233,14 +1233,14 @@ static inline void sock_orphan(struct sock *sk)
 	write_lock_bh(&sk->sk_callback_lock);
 	sock_set_flag(sk, SOCK_DEAD);
 	sk_set_socket(sk, NULL);
-	sk->sk_sleep  = NULL;
+	sk->sk_wq  = NULL;
 	write_unlock_bh(&sk->sk_callback_lock);
 }
 
 static inline void sock_graft(struct sock *sk, struct socket *parent)
 {
 	write_lock_bh(&sk->sk_callback_lock);
-	sk->sk_sleep = &parent->wait;
+	rcu_assign_pointer(sk->sk_wq, parent->wq);
 	parent->sk = sk;
 	sk_set_socket(sk, parent);
 	security_sock_graft(sk, parent);
@@ -1392,12 +1392,12 @@ static inline int sk_has_allocations(const struct sock *sk)
 }
 
 /**
- * sk_has_sleeper - check if there are any waiting processes
- * @sk: socket
+ * wq_has_sleeper - check if there are any waiting processes
+ * @wq: struct socket_wq
  *
- * Returns true if socket has waiting processes
+ * Returns true if socket_wq has waiting processes
  *
- * The purpose of the sk_has_sleeper and sock_poll_wait is to wrap the memory
+ * The purpose of the wq_has_sleeper and sock_poll_wait is to wrap the memory
  * barrier call. They were added due to the race found within the tcp code.
  *
  * Consider following tcp code paths:
@@ -1410,9 +1410,10 @@ static inline int sk_has_allocations(const struct sock *sk)
  *   ...                 ...
  *   tp->rcv_nxt check   sock_def_readable
  *   ...                 {
- *   schedule               ...
- *                          if (sk_sleep(sk) && waitqueue_active(sk_sleep(sk)))
- *                              wake_up_interruptible(sk_sleep(sk))
+ *   schedule               rcu_read_lock();
+ *                          wq = rcu_dereference(sk->sk_wq);
+ *                          if (wq && waitqueue_active(&wq->wait))
+ *                              wake_up_interruptible(&wq->wait)
  *                          ...
  *                       }
  *
@@ -1421,28 +1422,27 @@ static inline int sk_has_allocations(const struct sock *sk)
  * could then endup calling schedule and sleep forever if there are no more
  * data on the socket.
  *
- * The sk_has_sleeper is always called right after a call to read_lock, so we
- * can use smp_mb__after_lock barrier.
  */
-static inline int sk_has_sleeper(struct sock *sk)
+static inline bool wq_has_sleeper(struct socket_wq *wq)
 {
+
 	/*
 	 * We need to be sure we are in sync with the
 	 * add_wait_queue modifications to the wait queue.
 	 *
 	 * This memory barrier is paired in the sock_poll_wait.
 	 */
-	smp_mb__after_lock();
-	return sk_sleep(sk) && waitqueue_active(sk_sleep(sk));
+	smp_mb();
+	return wq && waitqueue_active(&wq->wait);
 }
-
+ 
 /**
  * sock_poll_wait - place memory barrier behind the poll_wait call.
  * @filp:           file
  * @wait_address:   socket wait queue
  * @p:              poll_table
  *
- * See the comments in the sk_has_sleeper function.
+ * See the comments in the wq_has_sleeper function.
  */
 static inline void sock_poll_wait(struct file *filp,
 		wait_queue_head_t *wait_address, poll_table *p)
@@ -1453,7 +1453,7 @@ static inline void sock_poll_wait(struct file *filp,
 		 * We need to be sure we are in sync with the
 		 * socket flags modification.
 		 *
-		 * This memory barrier is paired in the sk_has_sleeper.
+		 * This memory barrier is paired in the wq_has_sleeper.
 		*/
 		smp_mb();
 	}
diff --git a/net/atm/common.c b/net/atm/common.c
index e3e10e6..b43feb1 100644
--- a/net/atm/common.c
+++ b/net/atm/common.c
@@ -90,10 +90,13 @@ static void vcc_sock_destruct(struct sock *sk)
 
 static void vcc_def_wakeup(struct sock *sk)
 {
-	read_lock(&sk->sk_callback_lock);
-	if (sk_has_sleeper(sk))
-		wake_up(sk_sleep(sk));
-	read_unlock(&sk->sk_callback_lock);
+	struct socket_wq *wq;
+
+	rcu_read_lock();
+	wq = rcu_dereference(sk->sk_wq);
+	if (wq_has_sleeper(wq))
+		wake_up(&wq->wait);
+	rcu_read_unlock();
 }
 
 static inline int vcc_writable(struct sock *sk)
@@ -106,16 +109,19 @@ static inline int vcc_writable(struct sock *sk)
 
 static void vcc_write_space(struct sock *sk)
 {
-	read_lock(&sk->sk_callback_lock);
+	struct socket_wq *wq;
+
+	rcu_read_lock();
 
 	if (vcc_writable(sk)) {
-		if (sk_has_sleeper(sk))
-			wake_up_interruptible(sk_sleep(sk));
+		wq = rcu_dereference(sk->sk_wq);
+		if (wq_has_sleeper(wq))
+			wake_up_interruptible(&wq->wait);
 
 		sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
 	}
 
-	read_unlock(&sk->sk_callback_lock);
+	rcu_read_unlock();
 }
 
 static struct proto vcc_proto = {
diff --git a/net/core/sock.c b/net/core/sock.c
index 5104175..94c4aff 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1211,7 +1211,7 @@ struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
 		 */
 		sk_refcnt_debug_inc(newsk);
 		sk_set_socket(newsk, NULL);
-		newsk->sk_sleep	 = NULL;
+		newsk->sk_wq = NULL;
 
 		if (newsk->sk_prot->sockets_allocated)
 			percpu_counter_inc(newsk->sk_prot->sockets_allocated);
@@ -1800,41 +1800,53 @@ EXPORT_SYMBOL(sock_no_sendpage);
 
 static void sock_def_wakeup(struct sock *sk)
 {
-	read_lock(&sk->sk_callback_lock);
-	if (sk_has_sleeper(sk))
-		wake_up_interruptible_all(sk_sleep(sk));
-	read_unlock(&sk->sk_callback_lock);
+	struct socket_wq *wq;
+
+	rcu_read_lock();
+	wq = rcu_dereference(sk->sk_wq);
+	if (wq_has_sleeper(wq))
+		wake_up_interruptible_all(&wq->wait);
+	rcu_read_unlock();
 }
 
 static void sock_def_error_report(struct sock *sk)
 {
-	read_lock(&sk->sk_callback_lock);
-	if (sk_has_sleeper(sk))
-		wake_up_interruptible_poll(sk_sleep(sk), POLLERR);
+	struct socket_wq *wq;
+
+	rcu_read_lock();
+	wq = rcu_dereference(sk->sk_wq);
+	if (wq_has_sleeper(wq))
+		wake_up_interruptible_poll(&wq->wait, POLLERR);
 	sk_wake_async(sk, SOCK_WAKE_IO, POLL_ERR);
-	read_unlock(&sk->sk_callback_lock);
+	rcu_read_unlock();
 }
 
 static void sock_def_readable(struct sock *sk, int len)
 {
-	read_lock(&sk->sk_callback_lock);
-	if (sk_has_sleeper(sk))
-		wake_up_interruptible_sync_poll(sk_sleep(sk), POLLIN |
+	struct socket_wq *wq;
+
+	rcu_read_lock();
+	wq = rcu_dereference(sk->sk_wq);
+	if (wq_has_sleeper(wq))
+		wake_up_interruptible_sync_poll(&wq->wait, POLLIN |
 						POLLRDNORM | POLLRDBAND);
 	sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
-	read_unlock(&sk->sk_callback_lock);
+	rcu_read_unlock();
 }
 
 static void sock_def_write_space(struct sock *sk)
 {
-	read_lock(&sk->sk_callback_lock);
+	struct socket_wq *wq;
+
+	rcu_read_lock();
 
 	/* Do not wake up a writer until he can make "significant"
 	 * progress.  --DaveM
 	 */
 	if ((atomic_read(&sk->sk_wmem_alloc) << 1) <= sk->sk_sndbuf) {
-		if (sk_has_sleeper(sk))
-			wake_up_interruptible_sync_poll(sk_sleep(sk), POLLOUT |
+		wq = rcu_dereference(sk->sk_wq);
+		if (wq_has_sleeper(wq))
+			wake_up_interruptible_sync_poll(&wq->wait, POLLOUT |
 						POLLWRNORM | POLLWRBAND);
 
 		/* Should agree with poll, otherwise some programs break */
@@ -1842,7 +1854,7 @@ static void sock_def_write_space(struct sock *sk)
 			sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
 	}
 
-	read_unlock(&sk->sk_callback_lock);
+	rcu_read_unlock();
 }
 
 static void sock_def_destruct(struct sock *sk)
@@ -1896,10 +1908,10 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 
 	if (sock) {
 		sk->sk_type	=	sock->type;
-		sk->sk_sleep	=	&sock->wait;
+		sk->sk_wq	=	sock->wq;
 		sock->sk	=	sk;
 	} else
-		sk->sk_sleep	=	NULL;
+		sk->sk_wq	=	NULL;
 
 	spin_lock_init(&sk->sk_dst_lock);
 	rwlock_init(&sk->sk_callback_lock);
diff --git a/net/core/stream.c b/net/core/stream.c
index 7b3c3f3..cc196f4 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -28,15 +28,19 @@
 void sk_stream_write_space(struct sock *sk)
 {
 	struct socket *sock = sk->sk_socket;
+	struct socket_wq *wq;
 
 	if (sk_stream_wspace(sk) >= sk_stream_min_wspace(sk) && sock) {
 		clear_bit(SOCK_NOSPACE, &sock->flags);
 
-		if (sk_sleep(sk) && waitqueue_active(sk_sleep(sk)))
-			wake_up_interruptible_poll(sk_sleep(sk), POLLOUT |
+		rcu_read_lock();
+		wq = rcu_dereference(sk->sk_wq);
+		if (wq_has_sleeper(wq))
+			wake_up_interruptible_poll(&wq->wait, POLLOUT |
 						POLLWRNORM | POLLWRBAND);
-		if (sock->fasync_list && !(sk->sk_shutdown & SEND_SHUTDOWN))
+		if (wq && wq->fasync_list && !(sk->sk_shutdown & SEND_SHUTDOWN))
 			sock_wake_async(sock, SOCK_WAKE_SPACE, POLL_OUT);
+		rcu_read_unlock();
 	}
 }
 
diff --git a/net/dccp/output.c b/net/dccp/output.c
index 2d3dcb3..aadbdb5 100644
--- a/net/dccp/output.c
+++ b/net/dccp/output.c
@@ -195,15 +195,17 @@ EXPORT_SYMBOL_GPL(dccp_sync_mss);
 
 void dccp_write_space(struct sock *sk)
 {
-	read_lock(&sk->sk_callback_lock);
+	struct socket_wq *wq;
 
-	if (sk_has_sleeper(sk))
-		wake_up_interruptible(sk_sleep(sk));
+	rcu_read_lock();
+	wq = rcu_dereference(sk->sk_wq);
+	if (wq_has_sleeper(wq))
+		wake_up_interruptible(&wq->wait);
 	/* Should agree with poll, otherwise some programs break */
 	if (sock_writeable(sk))
 		sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
 
-	read_unlock(&sk->sk_callback_lock);
+	rcu_read_unlock();
 }
 
 /**
diff --git a/net/iucv/af_iucv.c b/net/iucv/af_iucv.c
index 9636b7d..8be324f 100644
--- a/net/iucv/af_iucv.c
+++ b/net/iucv/af_iucv.c
@@ -305,11 +305,14 @@ static inline int iucv_below_msglim(struct sock *sk)
  */
 static void iucv_sock_wake_msglim(struct sock *sk)
 {
-	read_lock(&sk->sk_callback_lock);
-	if (sk_has_sleeper(sk))
-		wake_up_interruptible_all(sk_sleep(sk));
+	struct socket_wq *wq;
+
+	rcu_read_lock();
+	wq = rcu_dereference(sk->sk_wq);
+	if (wq_has_sleeper(wq))
+		wake_up_interruptible_all(&wq->wait);
 	sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
-	read_unlock(&sk->sk_callback_lock);
+	rcu_read_unlock();
 }
 
 /* Timers */
diff --git a/net/phonet/pep.c b/net/phonet/pep.c
index e2a9576..af4d38b 100644
--- a/net/phonet/pep.c
+++ b/net/phonet/pep.c
@@ -664,12 +664,12 @@ static int pep_wait_connreq(struct sock *sk, int noblock)
 		if (signal_pending(tsk))
 			return sock_intr_errno(timeo);
 
-		prepare_to_wait_exclusive(&sk->sk_socket->wait, &wait,
+		prepare_to_wait_exclusive(sk_sleep(sk), &wait,
 						TASK_INTERRUPTIBLE);
 		release_sock(sk);
 		timeo = schedule_timeout(timeo);
 		lock_sock(sk);
-		finish_wait(&sk->sk_socket->wait, &wait);
+		finish_wait(sk_sleep(sk), &wait);
 	}
 
 	return 0;
@@ -910,10 +910,10 @@ disabled:
 			goto out;
 		}
 
-		prepare_to_wait(&sk->sk_socket->wait, &wait,
+		prepare_to_wait(sk_sleep(sk), &wait,
 				TASK_INTERRUPTIBLE);
 		done = sk_wait_event(sk, &timeo, atomic_read(&pn->tx_credits));
-		finish_wait(&sk->sk_socket->wait, &wait);
+		finish_wait(sk_sleep(sk), &wait);
 
 		if (sk->sk_state != TCP_ESTABLISHED)
 			goto disabled;
diff --git a/net/phonet/socket.c b/net/phonet/socket.c
index c785bfd..6e9848b 100644
--- a/net/phonet/socket.c
+++ b/net/phonet/socket.c
@@ -265,7 +265,7 @@ static unsigned int pn_socket_poll(struct file *file, struct socket *sock,
 	struct pep_sock *pn = pep_sk(sk);
 	unsigned int mask = 0;
 
-	poll_wait(file, &sock->wait, wait);
+	poll_wait(file, sk_sleep(sk), wait);
 
 	switch (sk->sk_state) {
 	case TCP_LISTEN:
diff --git a/net/rxrpc/af_rxrpc.c b/net/rxrpc/af_rxrpc.c
index c432d76..0b9bb20 100644
--- a/net/rxrpc/af_rxrpc.c
+++ b/net/rxrpc/af_rxrpc.c
@@ -62,13 +62,15 @@ static inline int rxrpc_writable(struct sock *sk)
 static void rxrpc_write_space(struct sock *sk)
 {
 	_enter("%p", sk);
-	read_lock(&sk->sk_callback_lock);
+	rcu_read_lock();
 	if (rxrpc_writable(sk)) {
-		if (sk_has_sleeper(sk))
-			wake_up_interruptible(sk_sleep(sk));
+		struct socket_wq *wq = rcu_dereference(sk->sk_wq);
+
+		if (wq_has_sleeper(wq))
+			wake_up_interruptible(&wq->wait);
 		sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
 	}
-	read_unlock(&sk->sk_callback_lock);
+	rcu_read_unlock();
 }
 
 /*
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 13d8229..d54700a 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -6065,7 +6065,7 @@ static void __sctp_write_space(struct sctp_association *asoc)
 			 * here by modeling from the current TCP/UDP code.
 			 * We have not tested with it yet.
 			 */
-			if (sock->fasync_list &&
+			if (sock->wq->fasync_list &&
 			    !(sk->sk_shutdown & SEND_SHUTDOWN))
 				sock_wake_async(sock,
 						SOCK_WAKE_SPACE, POLL_OUT);
diff --git a/net/socket.c b/net/socket.c
index 9822081..a0a59cb 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -252,9 +252,14 @@ static struct inode *sock_alloc_inode(struct super_block *sb)
 	ei = kmem_cache_alloc(sock_inode_cachep, GFP_KERNEL);
 	if (!ei)
 		return NULL;
-	init_waitqueue_head(&ei->socket.wait);
+	ei->socket.wq = kmalloc(sizeof(struct socket_wq), GFP_KERNEL);
+	if (!ei->socket.wq) {
+		kmem_cache_free(sock_inode_cachep, ei);
+		return NULL;
+	}
+	init_waitqueue_head(&ei->socket.wq->wait);
+	ei->socket.wq->fasync_list = NULL;
 
-	ei->socket.fasync_list = NULL;
 	ei->socket.state = SS_UNCONNECTED;
 	ei->socket.flags = 0;
 	ei->socket.ops = NULL;
@@ -264,10 +269,21 @@ static struct inode *sock_alloc_inode(struct super_block *sb)
 	return &ei->vfs_inode;
 }
 
+
+static void wq_free_rcu(struct rcu_head *head)
+{
+	struct socket_wq *wq = container_of(head, struct socket_wq, rcu);
+
+	kfree(wq);
+}
+
 static void sock_destroy_inode(struct inode *inode)
 {
-	kmem_cache_free(sock_inode_cachep,
-			container_of(inode, struct socket_alloc, vfs_inode));
+	struct socket_alloc *ei;
+
+	ei = container_of(inode, struct socket_alloc, vfs_inode);
+	call_rcu(&ei->socket.wq->rcu, wq_free_rcu);
+	kmem_cache_free(sock_inode_cachep, ei);
 }
 
 static void init_once(void *foo)
@@ -513,7 +529,7 @@ void sock_release(struct socket *sock)
 		module_put(owner);
 	}
 
-	if (sock->fasync_list)
+	if (sock->wq->fasync_list)
 		printk(KERN_ERR "sock_release: fasync list not empty!\n");
 
 	percpu_sub(sockets_in_use, 1);
@@ -1080,9 +1096,9 @@ static int sock_fasync(int fd, struct file *filp, int on)
 
 	lock_sock(sk);
 
-	fasync_helper(fd, filp, on, &sock->fasync_list);
+	fasync_helper(fd, filp, on, &sock->wq->fasync_list);
 
-	if (!sock->fasync_list)
+	if (!sock->wq->fasync_list)
 		sock_reset_flag(sk, SOCK_FASYNC);
 	else
 		sock_set_flag(sk, SOCK_FASYNC);
@@ -1091,12 +1107,20 @@ static int sock_fasync(int fd, struct file *filp, int on)
 	return 0;
 }
 
-/* This function may be called only under socket lock or callback_lock */
+/* This function may be called only under socket lock or callback_lock or rcu_lock */
 
 int sock_wake_async(struct socket *sock, int how, int band)
 {
-	if (!sock || !sock->fasync_list)
+	struct socket_wq *wq;
+
+	if (!sock)
 		return -1;
+	rcu_read_lock();
+	wq = rcu_dereference(sock->wq);
+	if (!wq || !wq->fasync_list) {
+		rcu_read_unlock();
+		return -1;
+	}
 	switch (how) {
 	case SOCK_WAKE_WAITD:
 		if (test_bit(SOCK_ASYNC_WAITDATA, &sock->flags))
@@ -1108,11 +1132,12 @@ int sock_wake_async(struct socket *sock, int how, int band)
 		/* fall through */
 	case SOCK_WAKE_IO:
 call_kill:
-		kill_fasync(&sock->fasync_list, SIGIO, band);
+		kill_fasync(&wq->fasync_list, SIGIO, band);
 		break;
 	case SOCK_WAKE_URG:
-		kill_fasync(&sock->fasync_list, SIGURG, band);
+		kill_fasync(&wq->fasync_list, SIGURG, band);
 	}
+	rcu_read_unlock();
 	return 0;
 }
 
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 87c0360..fef2cc5 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -313,13 +313,16 @@ static inline int unix_writable(struct sock *sk)
 
 static void unix_write_space(struct sock *sk)
 {
-	read_lock(&sk->sk_callback_lock);
+	struct socket_wq *wq;
+
+	rcu_read_lock();
 	if (unix_writable(sk)) {
-		if (sk_has_sleeper(sk))
-			wake_up_interruptible_sync(sk_sleep(sk));
+		wq = rcu_dereference(sk->sk_wq);
+		if (wq_has_sleeper(wq))
+			wake_up_interruptible_sync(&wq->wait);
 		sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
 	}
-	read_unlock(&sk->sk_callback_lock);
+	rcu_read_unlock();
 }
 
 /* When dgram socket disconnects (or changes its peer), we clear its receive
@@ -406,9 +409,7 @@ static int unix_release_sock(struct sock *sk, int embrion)
 				skpair->sk_err = ECONNRESET;
 			unix_state_unlock(skpair);
 			skpair->sk_state_change(skpair);
-			read_lock(&skpair->sk_callback_lock);
 			sk_wake_async(skpair, SOCK_WAKE_WAITD, POLL_HUP);
-			read_unlock(&skpair->sk_callback_lock);
 		}
 		sock_put(skpair); /* It may now die */
 		unix_peer(sk) = NULL;
@@ -1142,7 +1143,7 @@ restart:
 	newsk->sk_peercred.pid	= task_tgid_vnr(current);
 	current_euid_egid(&newsk->sk_peercred.uid, &newsk->sk_peercred.gid);
 	newu = unix_sk(newsk);
-	newsk->sk_sleep		= &newu->peer_wait;
+	newsk->sk_wq		= &newu->peer_wq;
 	otheru = unix_sk(other);
 
 	/* copy address information from listening to new sock*/
@@ -1931,12 +1932,10 @@ static int unix_shutdown(struct socket *sock, int mode)
 			other->sk_shutdown |= peer_mode;
 			unix_state_unlock(other);
 			other->sk_state_change(other);
-			read_lock(&other->sk_callback_lock);
 			if (peer_mode == SHUTDOWN_MASK)
 				sk_wake_async(other, SOCK_WAKE_WAITD, POLL_HUP);
 			else if (peer_mode & RCV_SHUTDOWN)
 				sk_wake_async(other, SOCK_WAKE_WAITD, POLL_IN);
-			read_unlock(&other->sk_callback_lock);
 		}
 		if (other)
 			sock_put(other);
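
The barrier pairing described in the wq_has_sleeper()/sock_poll_wait() comments above can be modelled in userspace. What follows is only a rough sketch using C11 atomics, not kernel code: every model_* name is hypothetical, a counter stands in for the wait queue, and atomic_thread_fence(memory_order_seq_cst) stands in for smp_mb(). The point is that each side does its store, then a full barrier, then its load, so at least one side observes the other.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct socket_wq: a count of queued sleepers. */
struct model_wq {
	atomic_int sleepers;		/* models a non-empty wait queue */
};

struct model_sock {
	_Atomic(struct model_wq *) wq;	/* models sk->sk_wq */
	atomic_int data_ready;		/* models the condition the sleeper checks */
};

/* Waker-side check: full barrier, then look at the queue. */
static bool model_wq_has_sleeper(struct model_wq *wq)
{
	atomic_thread_fence(memory_order_seq_cst);
	return wq && atomic_load_explicit(&wq->sleepers, memory_order_relaxed) > 0;
}

/* Sleeper side: enqueue, full barrier, then re-check the condition.
 * Returns true if the caller would have to go to sleep. */
static bool model_must_sleep(struct model_sock *sk)
{
	struct model_wq *wq = atomic_load_explicit(&sk->wq, memory_order_acquire);

	if (wq)
		atomic_fetch_add_explicit(&wq->sleepers, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&sk->data_ready, memory_order_relaxed) == 0;
}

/* Waker side: publish the condition, then report whether a wakeup is needed. */
static bool model_wake(struct model_sock *sk)
{
	struct model_wq *wq;

	atomic_store_explicit(&sk->data_ready, 1, memory_order_relaxed);
	wq = atomic_load_explicit(&sk->wq, memory_order_acquire);
	return model_wq_has_sleeper(wq);
}
```

If either fence were dropped, the sleeper could miss data_ready while the waker misses the queued sleeper, which is the "sleep forever" case the comment warns about. The NULL check in model_wq_has_sleeper() mirrors the patch's handling of an orphaned socket whose sk_wq has been cleared.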



^ permalink raw reply related

* Re: [PATCH] [RFC] C/R: inet4 and inet6 unicast routes (v2)
From: Serge E. Hallyn @ 2010-04-30 18:19 UTC (permalink / raw)
  To: Dan Smith; +Cc: containers, Vlad Yasevich, netdev, David Miller
In-Reply-To: <1272646855-17327-1-git-send-email-danms@us.ibm.com>

Quoting Dan Smith (danms@us.ibm.com):
> +static int temp_netns_enter(struct net *net)
> +{
> +	int ret;
> +	struct net *tmp_netns;
> +
> +	ret = copy_namespaces(CLONE_NEWNET, current);
> +	if (ret)
> +		return ret;

Actually there is one problem here - copy_namespaces() is
specifically used only by clone() and it expects tsk to
not yet be live.  So it just does

	tsk->nsproxy = new_ns

Since you're doing this on current which is live, it would
have to use rcu_assign_pointer() to be safe.

So I'm afraid you're going to have to do a slightly uglier
thing where you unshare_nsproxy_namespaces() and then
switch_task_namespaces() to the new nsproxy.
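
To illustrate the difference for anyone following along, here is a userspace analogy, assuming C11 atomics in place of rcu_assign_pointer(). It is not kernel code and every model_* name is hypothetical. It shows the shape of the suggestion: build a private copy, publish it with a release store, and hand back the previous pointer so it can be restored later, rather than writing into the structure a live task is already using.

```c
#include <stdatomic.h>
#include <stdlib.h>

struct model_nsproxy {
	int net_ns;	/* stands in for struct net * */
};

struct model_task {
	_Atomic(struct model_nsproxy *) nsproxy;	/* models tsk->nsproxy */
};

/* Copy the current nsproxy, change the copy, then publish it with a release
 * store so a reader never sees a half-initialised structure. Returns the
 * previous pointer, or NULL if the allocation failed. */
static struct model_nsproxy *model_switch_netns(struct model_task *tsk, int net_ns)
{
	struct model_nsproxy *old =
		atomic_load_explicit(&tsk->nsproxy, memory_order_relaxed);
	struct model_nsproxy *new_ns = malloc(sizeof(*new_ns));

	if (!new_ns)
		return NULL;
	*new_ns = *old;
	new_ns->net_ns = net_ns;
	atomic_store_explicit(&tsk->nsproxy, new_ns, memory_order_release);
	return old;
}

/* Put the saved pointer back and free the temporary copy. Real RCU code would
 * defer the free until readers are done; this model has no concurrent readers. */
static void model_restore(struct model_task *tsk, struct model_nsproxy *prev)
{
	struct model_nsproxy *tmp =
		atomic_exchange_explicit(&tsk->nsproxy, prev, memory_order_acq_rel);

	free(tmp);
}
```

The original structure is never modified, so the saved pointer stays valid for the restore step, which is the property the enter/exit pair relies on.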

> +
> +	tmp_netns = current->nsproxy->net_ns;
> +	get_net(net);
> +	current->nsproxy->net_ns = net;
> +	put_net(tmp_netns);
> +
> +	return 0;
> +}

Otherwise it looks good to me.  My only other comment would be to soothe
readers' anxieties by putting a comment right here explaining that
switch_task_namespaces() will drop your ref to current->nsproxy->net_ns,
and that you had never dropped the ref to prev so it will be safe.

> +static void temp_netns_exit(struct nsproxy *prev)
> +{
> +	switch_task_namespaces(current, prev);
> +}

thanks,
-serge

^ permalink raw reply

* r8169 INFO: inconsistent lock state
From: Sergey Senozhatsky @ 2010-04-30 18:20 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Oleg Nesterov, David Miller, Ingo Molnar, Francois Romieu,
	Peter Zijlstra, netdev, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 6536 bytes --]

Hello,

Yet another one (during resume):

kernel: [ 1968.334646] 
kernel: [ 1968.334648] =================================
kernel: [ 1968.334651] [ INFO: inconsistent lock state ]
kernel: [ 1968.334654] 2.6.34-rc6-dbg #105
kernel: [ 1968.334656] ---------------------------------
kernel: [ 1968.334659] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
kernel: [ 1968.334663] events/1/3854 [HC0[0]:SC0[0]:HE1:SE1] takes:
kernel: [ 1968.334666]  (&(&table->hash[i].lock)->rlock){+.?...}, at: [<c1292ec4>] __udp4_lib_mcast_deliver+0x3c/0x143
kernel: [ 1968.334678] {IN-SOFTIRQ-W} state was registered at:
kernel: [ 1968.334681]   [<c104fc8d>] __lock_acquire+0x2ba/0xc01
kernel: [ 1968.334688]   [<c10509df>] lock_acquire+0x5e/0x75
kernel: [ 1968.334693]   [<c12c366a>] _raw_spin_lock+0x28/0x58
kernel: [ 1968.334699]   [<c1292ec4>] __udp4_lib_mcast_deliver+0x3c/0x143
kernel: [ 1968.334704]   [<c12931a7>] __udp4_lib_rcv+0x1dc/0x3ac
kernel: [ 1968.334708]   [<c1293389>] udp_rcv+0x12/0x14
kernel: [ 1968.334713]   [<c127605f>] ip_local_deliver_finish+0xd2/0x137
kernel: [ 1968.334719]   [<c12760fc>] NF_HOOK.clone.1+0x38/0x3f
kernel: [ 1968.334724]   [<c1276220>] ip_local_deliver+0x3c/0x42
kernel: [ 1968.334728]   [<c1275f2e>] ip_rcv_finish+0x25c/0x27e
kernel: [ 1968.334733]   [<c12760fc>] NF_HOOK.clone.1+0x38/0x3f
kernel: [ 1968.334737]   [<c12763c9>] ip_rcv+0x1a3/0x1c6
kernel: [ 1968.334741]   [<c12593d7>] netif_receive_skb+0x38b/0x3ab
kernel: [ 1968.334747]   [<fd20f911>] rtl8169_rx_interrupt+0x2de/0x3eb [r8169]
kernel: [ 1968.334756]   [<fd211cde>] rtl8169_poll+0x28/0x15d [r8169]
kernel: [ 1968.334763]   [<c12596b3>] net_rx_action+0x93/0x181
kernel: [ 1968.334767]   [<c1032a72>] __do_softirq+0x88/0x10c
kernel: [ 1968.334773]   [<c1032b25>] do_softirq+0x2f/0x47
kernel: [ 1968.334778]   [<c1032de2>] irq_exit+0x38/0x75
kernel: [ 1968.334782]   [<c1004489>] do_IRQ+0x79/0x8d
kernel: [ 1968.334787]   [<c1002db5>] common_interrupt+0x35/0x3c
kernel: [ 1968.334791]   [<c1246f43>] cpuidle_idle_call+0x6a/0xa0
kernel: [ 1968.334799]   [<c100171b>] cpu_idle+0x89/0xbe
kernel: [ 1968.334802]   [<c12b3d49>] rest_init+0xd1/0xd6
kernel: [ 1968.334807]   [<c147e7bd>] start_kernel+0x339/0x33e
kernel: [ 1968.334813]   [<c147e0c9>] i386_start_kernel+0xc9/0xd0
kernel: [ 1968.334818] irq event stamp: 63
kernel: [ 1968.334820] hardirqs last  enabled at (63): [<c109d7ff>] kmem_cache_free+0x83/0x8f
kernel: [ 1968.334828] hardirqs last disabled at (62): [<c109d7a6>] kmem_cache_free+0x2a/0x8f
kernel: [ 1968.334833] softirqs last  enabled at (60): [<c126400a>] rcu_read_unlock_bh+0x1c/0x1e
kernel: [ 1968.334839] softirqs last disabled at (58): [<c1263faf>] rcu_read_lock_bh+0x8/0x26
kernel: [ 1968.334845] 
kernel: [ 1968.334846] other info that might help us debug this:
kernel: [ 1968.334849] 5 locks held by events/1/3854:
kernel: [ 1968.334851]  #0:  (events){+.+.+.}, at: [<c103c8e9>] worker_thread+0x128/0x23c
kernel: [ 1968.334859]  #1:  ((&(&tp->task)->work)){+.+...}, at: [<c103c8e9>] worker_thread+0x128/0x23c
kernel: [ 1968.334865]  #2:  (rtnl_mutex){+.+.+.}, at: [<c1262b8f>] rtnl_lock+0xf/0x11
kernel: [ 1968.334871]  #3:  (rcu_read_lock){.+.+..}, at: [<c125784b>] rcu_read_lock+0x0/0x2b
kernel: [ 1968.334877]  #4:  (rcu_read_lock){.+.+..}, at: [<c1275c56>] rcu_read_lock+0x0/0x2b
kernel: [ 1968.334884] 
kernel: [ 1968.334885] stack backtrace:
kernel: [ 1968.334888] Pid: 3854, comm: events/1 Not tainted 2.6.34-rc6-dbg #105
kernel: [ 1968.334891] Call Trace:
kernel: [ 1968.334895]  [<c12c1906>] ? printk+0xf/0x11
kernel: [ 1968.334901]  [<c104e7d9>] valid_state+0x133/0x141
kernel: [ 1968.334906]  [<c104e8b6>] mark_lock+0xcf/0x1bc
kernel: [ 1968.334911]  [<c104e11f>] ? check_usage_backwards+0x0/0x72
kernel: [ 1968.334915]  [<c104fcff>] __lock_acquire+0x32c/0xc01
kernel: [ 1968.334922]  [<c129ee2d>] ? fib_table_lookup+0x81/0x8e
kernel: [ 1968.334927]  [<c100772e>] ? __cycles_2_ns+0xf/0x3e
kernel: [ 1968.334932]  [<c12671b6>] ? rcu_read_unlock+0x0/0x38
kernel: [ 1968.334937]  [<c1007a30>] ? native_sched_clock+0x49/0x4f
kernel: [ 1968.334943]  [<c10443a9>] ? sched_clock_local+0x11/0x11f
kernel: [ 1968.334948]  [<c10509df>] lock_acquire+0x5e/0x75
kernel: [ 1968.334953]  [<c1292ec4>] ? __udp4_lib_mcast_deliver+0x3c/0x143
kernel: [ 1968.334958]  [<c12c366a>] _raw_spin_lock+0x28/0x58
kernel: [ 1968.334963]  [<c1292ec4>] ? __udp4_lib_mcast_deliver+0x3c/0x143
kernel: [ 1968.334967]  [<c1292ec4>] __udp4_lib_mcast_deliver+0x3c/0x143
kernel: [ 1968.334973]  [<c104463c>] ? sched_clock_cpu+0x121/0x131
kernel: [ 1968.334978]  [<c12735b5>] ? rcu_read_unlock+0x0/0x38
kernel: [ 1968.334983]  [<c104463c>] ? sched_clock_cpu+0x121/0x131
kernel: [ 1968.334988]  [<c10505c5>] ? __lock_acquire+0xbf2/0xc01
kernel: [ 1968.334994]  [<c12735e2>] ? rcu_read_unlock+0x2d/0x38
kernel: [ 1968.334998]  [<c1274034>] ? ip_route_input+0x101/0xaf4
kernel: [ 1968.335003]  [<c12931a7>] __udp4_lib_rcv+0x1dc/0x3ac
kernel: [ 1968.335008]  [<c1293389>] udp_rcv+0x12/0x14
kernel: [ 1968.335013]  [<c127605f>] ip_local_deliver_finish+0xd2/0x137
kernel: [ 1968.335017]  [<c1275f8d>] ? ip_local_deliver_finish+0x0/0x137
kernel: [ 1968.335022]  [<c12760fc>] NF_HOOK.clone.1+0x38/0x3f
kernel: [ 1968.335026]  [<c1276220>] ip_local_deliver+0x3c/0x42
kernel: [ 1968.335031]  [<c1275f8d>] ? ip_local_deliver_finish+0x0/0x137
kernel: [ 1968.335035]  [<c1275f2e>] ip_rcv_finish+0x25c/0x27e
kernel: [ 1968.335040]  [<c1275cd2>] ? ip_rcv_finish+0x0/0x27e
kernel: [ 1968.335044]  [<c12760fc>] NF_HOOK.clone.1+0x38/0x3f
kernel: [ 1968.335048]  [<c12763c9>] ip_rcv+0x1a3/0x1c6
kernel: [ 1968.335052]  [<c1275cd2>] ? ip_rcv_finish+0x0/0x27e
kernel: [ 1968.335057]  [<c12593d7>] netif_receive_skb+0x38b/0x3ab
kernel: [ 1968.335066]  [<fd20f911>] rtl8169_rx_interrupt+0x2de/0x3eb [r8169]
kernel: [ 1968.335073]  [<fd20fc9b>] rtl8169_reset_task+0x33/0xe8 [r8169]
kernel: [ 1968.335077]  [<c103c92b>] worker_thread+0x16a/0x23c
kernel: [ 1968.335082]  [<c103c8e9>] ? worker_thread+0x128/0x23c
kernel: [ 1968.335088]  [<fd20fc68>] ? rtl8169_reset_task+0x0/0xe8 [r8169]
kernel: [ 1968.335095]  [<c103fa46>] ? autoremove_wake_function+0x0/0x2f
kernel: [ 1968.335099]  [<c103c7c1>] ? worker_thread+0x0/0x23c
kernel: [ 1968.335103]  [<c103f76a>] kthread+0x6a/0x6f
kernel: [ 1968.335108]  [<c103f700>] ? kthread+0x0/0x6f
kernel: [ 1968.335112]  [<c1002dc2>] kernel_thread_helper+0x6/0x10
kernel: [ 1968.335282] r8169 0000:02:00.0: eth0: link down


	Sergey


^ permalink raw reply

* avoid compiler warning when !CONFIG_SYSCTL in netfilter dccp
From: Mathieu Lacage @ 2010-04-29  9:25 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 333 bytes --]

The attached trivial patch (generated against davem/net-next-2.6 as of
this morning) avoids this compiler warning:
../src/process-manager/linux/net/netfilter/nf_conntrack_proto_dccp.c: In
function ‘dccp_net_exit’:
../src/process-manager/linux/net/netfilter/nf_conntrack_proto_dccp.c:845: error: unused variable ‘dn’

Mathieu

[-- Attachment #2: nf-dccp-sysctl.patch --]
[-- Type: text/x-patch, Size: 541 bytes --]

diff --git a/net/netfilter/nf_conntrack_proto_dccp.c b/net/netfilter/nf_conntrack_proto_dccp.c
index 5292560..cd078df 100644
--- a/net/netfilter/nf_conntrack_proto_dccp.c
+++ b/net/netfilter/nf_conntrack_proto_dccp.c
@@ -842,8 +842,8 @@ static __net_init int dccp_net_init(struct net *net)
 
 static __net_exit void dccp_net_exit(struct net *net)
 {
-	struct dccp_net *dn = dccp_pernet(net);
 #ifdef CONFIG_SYSCTL
+	struct dccp_net *dn = dccp_pernet(net);
 	unregister_net_sysctl_table(dn->sysctl_header);
 	kfree(dn->sysctl_table);
 #endif
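
The same pattern in isolation, as a minimal sketch: MODEL_CONFIG_SYSCTL, struct model_net and model_net_exit() are hypothetical stand-ins rather than kernel names. A local that is only used inside a conditional block is declared inside that block, so a build without the option never sees it.

```c
#include <stddef.h>

/* Hypothetical option; comment this out to build the "disabled" variant. */
#define MODEL_CONFIG_SYSCTL 1

struct model_net {
	int sysctl_registered;
};

static void model_net_exit(struct model_net *net_state)
{
#ifdef MODEL_CONFIG_SYSCTL
	/* Declared inside the guard, so it cannot be reported as unused
	 * when the option is disabled. */
	struct model_net *dn = net_state;

	dn->sysctl_registered = 0;
#else
	(void)net_state;	/* parameter is unused in this configuration */
#endif
}
```

With the option defined, the function clears the flag; with it undefined, the body is empty and the compiler has nothing to warn about.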

^ permalink raw reply related
