Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] net: replace min with min_t in pktgen_if_write
From: David Miller @ 2010-11-09 16:40 UTC (permalink / raw)
  To: dave; +Cc: robert.olsson, linux-kernel, netdev
In-Reply-To: <1289320633.2260.12.camel@cowboy>

From: Davidlohr Bueso <dave@gnu.org>
Date: Tue, 09 Nov 2010 13:37:13 -0300

> From: Davidlohr Bueso <dave@gnu.org>
> 
> This addresses the following compiler (gcc 4.4.5) warning:
> 
>   CC [M]  net/core/pktgen.o
> net/core/pktgen.c: In function ‘pktgen_if_write’:
> net/core/pktgen.c:890: warning: comparison of distinct pointer types lacks a cast
> 
> Signed-off-by: Davidlohr Bueso <dave@gnu.org>

Already fixed in net-2.6:

commit 86c2c0a8a4965ae5cbc0ff97ed39a4472e8e9b23
Author: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Date:   Sat Nov 6 20:11:38 2010 +0000

    NET: pktgen - fix compile warning
    
    This should fix the following warning:
    
    net/core/pktgen.c: In function ‘pktgen_if_write’:
    net/core/pktgen.c:890: warning: comparison of distinct pointer types lacks a cast
    
    Signed-off-by: Dmitry Torokhov <dtor@mail.ru>
    Reviewed-by: Nelson Elhage <nelhage@ksplice.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index fbce4b0..1992cd0 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -887,7 +887,7 @@ static ssize_t pktgen_if_write(struct file *file,
 	i += len;
 
 	if (debug) {
-		size_t copy = min(count, 1023);
+		size_t copy = min_t(size_t, count, 1023);
 		char tb[copy + 1];
 		if (copy_from_user(tb, user_buffer, copy))
 			return -EFAULT;

^ permalink raw reply related

* Re: [PATCH] Fix CAN info leak/minor heap overflow
From: David Miller @ 2010-11-09 17:05 UTC (permalink / raw)
  To: socketcan; +Cc: urs, netdev, drosenberg, security, torvalds
In-Reply-To: <4CD8FDB5.6060905@hartkopp.net>

From: Oliver Hartkopp <socketcan@hartkopp.net>
Date: Tue, 09 Nov 2010 08:52:21 +0100

> Once this patch is applied (and the procfs layout is changed anyway), i'd also
> like to send a patch from my backlog that would extend the procfs output for
> can-bcm with an additional drop counter.

I find this kind of discussion extremely disappointing.

All of this stuff you CAN guys do with procfs files and version
strings is completely wrong and bogus.

Once you create a procfs file layout, you're basically stuck and you
can at best only reasonably add new fields at the end, you can't
really change existing fields.

And sysfs would have been a lot more appropriate, you could use
attributes for each value you want to export and then just add new
sysfs attributes when you want to export new values which has very
clear semantics and backwards compatability implications.

^ permalink raw reply

* Re: [PATCH v15 00/17] Provide a zero-copy method on KVM virtio-net.
From: David Miller @ 2010-11-09 17:15 UTC (permalink / raw)
  To: xiaohui.xin; +Cc: netdev, kvm, linux-kernel, mst, mingo, herbert, jdike
In-Reply-To: <1289293402-4791-1-git-send-email-xiaohui.xin@intel.com>

From: xiaohui.xin@intel.com
Date: Tue,  9 Nov 2010 17:03:05 +0800

> We provide an zero-copy method which driver side may get external
> buffers to DMA. Here external means driver don't use kernel space
> to allocate skb buffers. Currently the external buffer can be from
> guest virtio-net driver.

1) Please use kernel-doc style comments to document structure
   members.

2) The idea to key off of skb->dev in skb_release_data() is
   fundamentally flawed since many actions can change skb->dev on you,
   which will end up causing a leak of your external data areas.

^ permalink raw reply

* Re: [PATCH] drivers/net: normalize TX_TIMEOUT
From: David Miller @ 2010-11-09 17:18 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1288860575.2659.34.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 04 Nov 2010 09:49:35 +0100

> Some network drivers use old TX_TIMEOUT definitions, assuming HZ=100 of
> old kernels.
> 
> Convert these definitions to include HZ, since HZ can be 1000 these
> days.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied, thanks Eric.

^ permalink raw reply

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
From: Krishna Kumar2 @ 2010-11-09 17:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, netdev, rusty
In-Reply-To: <20101109153325.GB26153@redhat.com>

"Michael S. Tsirkin" <mst@redhat.com> wrote on 11/09/2010 09:03:25 PM:

> > > Something strange here, right?
> > > 1. You are consistently getting >10G/s here, and even with a single
> > stream?
> >
> > Sorry, I should have mentioned this though I had stated in my
> > earlier mails. Each test result has two iterations, each of 60
> > seconds, except when #netperfs is 1 for which I do 10 iteration
> > (sum across 10 iterations).
>
> So need to divide the number by 10?

Yes, that is what I get with 512/1K macvtap I/O size :)

> >  I started doing many more iterations
> > for 1 netperf after finding the issue earlier with single stream.
> > So the BW is only 4.5-7 Gbps.
> >
> > > 2. With 2 streams, is where we get < 10G/s originally. Instead of
> > >    doubling that we get a marginal improvement with 2 queues and
> > >    about 30% worse with 1 queue.
> >
> > (doubling happens consistently for guest -> host, but never for
> > remote host) I tried 512/txqs=2 and 1024/txqs=8 to get a varied
> > testing scenario. In first case, there is a slight improvement in
> > BW and good reduction in SD. In the second case, only SD improves
> > (though BW drops for 2 stream for some reason).  In both cases,
> > BW and SD improves as the number of sessions increase.
>
> I guess this is another indication that something's wrong.

The patch - both virtio-net and vhost-net, doesn't have any
locking/mutex's/ or any synchronization method. Guest -> host
performance improvement of upto 100% shows the patch is not
doing anything wrong.

> We are quite far from line rate, the fact BW does not scale
> means there's some contention in the code.

Attaining line speed with macvtap seems to be a generic issue
and unrelated to my patch specifically. IMHO if there is nothing
wrong in the code (review) and is accepted, it will benefit as
others can also help to find what needs to be implemented in
vhost/macvtap/qemu to get line speed for guest->remote-host.

PS: bare-metal performance for host->remote-host is also
    2.7 Gbps and 2.8 Gbps for 512/1024 for the same card.

Thanks,

- KK

^ permalink raw reply

* Re: [PATCH 36/39] net/ipv4/tcp.c: Update WARN uses
From: David Miller @ 2010-11-09 17:23 UTC (permalink / raw)
  To: joe; +Cc: trivial, kuznet, pekkas, jmorris, yoshfuji, kaber, netdev,
	linux-kernel
In-Reply-To: <2afee732758b93d7464e6bb07a52b3bc67f164a4.1288471898.git.joe@perches.com>

From: Joe Perches <joe@perches.com>
Date: Sat, 30 Oct 2010 14:08:53 -0700

> Coalesce long formats.
> Align arguments.
> Remove KERN_<level>.
> 
> Signed-off-by: Joe Perches <joe@perches.com>

Applied.

^ permalink raw reply

* Re: [PATCH 35/39] net/core/dev.c: Update WARN uses
From: David Miller @ 2010-11-09 17:23 UTC (permalink / raw)
  To: joe; +Cc: trivial, netdev, linux-kernel
In-Reply-To: <2f05cefa4dc86594a2d7335c7978b1ae781bef0e.1288471898.git.joe@perches.com>

From: Joe Perches <joe@perches.com>
Date: Sat, 30 Oct 2010 14:08:52 -0700

> Coalesce long formats.
> Add missing newlines.
> 
> Signed-off-by: Joe Perches <joe@perches.com>

Applied.

^ permalink raw reply

* Re: [PATCH 17/39] drivers/net/usb: Update WARN uses
From: David Miller @ 2010-11-09 17:23 UTC (permalink / raw)
  To: joe; +Cc: trivial, gregkh, linux-usb, netdev, linux-kernel
In-Reply-To: <4ef13c27899febe0cef4f1633321ddedbe97cfcf.1288471898.git.joe@perches.com>

From: Joe Perches <joe@perches.com>
Date: Sat, 30 Oct 2010 14:08:34 -0700

> Add missing newlines.
> 
> Signed-off-by: Joe Perches <joe@perches.com>

Applied.

^ permalink raw reply

* Re: [PATCH 16/39] drivers/net/can: Update WARN uses
From: David Miller @ 2010-11-09 17:23 UTC (permalink / raw)
  To: joe-6d6DIl74uiNBDgjK7y7TUQ
  Cc: socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	netdev-u79uwXL29TY76Z2rM5mHXA, trivial-DgEjT+Ai2ygdnm+yROfE0A,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, wg-5Yr1BZd7O62+XT7JhA+gdA
In-Reply-To: <2a00e2a377a44e544cec15356cf3794abe47b306.1288471898.git.joe-6d6DIl74uiNBDgjK7y7TUQ@public.gmane.org>

From: Joe Perches <joe-6d6DIl74uiNBDgjK7y7TUQ@public.gmane.org>
Date: Sat, 30 Oct 2010 14:08:33 -0700

> Add missing newlines.
> 
> Signed-off-by: Joe Perches <joe-6d6DIl74uiNBDgjK7y7TUQ@public.gmane.org>

Applied.

^ permalink raw reply

* Re: [PATCH 3/3] net: tipc: fix information leak to userland
From: David Miller @ 2010-11-09 17:26 UTC (permalink / raw)
  To: segooon
  Cc: jon.maloy, netdev, kernel-janitors, linux-kernel, tipc-discussion,
	allan.stephens
In-Reply-To: <1288545032-16481-1-git-send-email-segooon@gmail.com>

From: Vasiliy Kulikov <segooon@gmail.com>
Date: Sun, 31 Oct 2010 20:10:32 +0300

> Structure sockaddr_tipc is copied to userland with padding bytes after
> "id" field in union field "name" unitialized.  It leads to leaking of
> contents of kernel stack memory.  We have to initialize them to zero.
> 
> Signed-off-by: Vasiliy Kulikov <segooon@gmail.com>

Applied.

Patches #1 and #2 were given feedback which I need you to integrate
and submit new patches based upon, thanks.

------------------------------------------------------------------------------
The Next 800 Companies to Lead America's Growth: New Video Whitepaper
David G. Thomson, author of the best-selling book "Blueprint to a 
Billion" shares his insights and actions to help propel your 
business during the next growth cycle. Listen Now!
http://p.sf.net/sfu/SAP-dev2dev

^ permalink raw reply

* [PATCH] Enhance AF_PACKET implementation to not require high order contiguous memory allocation (v2)
From: nhorman @ 2010-11-09 17:46 UTC (permalink / raw)
  To: netdev; +Cc: davem, eric.dumazet, zenczykowski, Neil Horman
In-Reply-To: <1288033566-2091-1-git-send-email-nhorman@tuxdriver.com>

From: Neil Horman <nhorman@tuxdriver.com>

Version 2 of this patch.  Sorry its been awhile, but I've had other issues come
up.  I've re-written this patch taking into account the change notes from the
last round

Change notes:
1) Rewrote the patch to not do all my previous stupid vmaps on single page
arrays.  Instead we just use vmalloc now.

2) Checked into the behavior of tcpdump on high order allocations in the failing
case.  Tcpdump (more specifically I think libpcap) does attempt to minimize the
allocation order on the ring buffer, unfortunately the lowest it will push the
ordrer too given a sufficiently large buffer is order 5, so its still a very
large contiguous allocation, and that says nothing of other malicious
applications that might try to do more.

Summary:
It was shown to me recently that systems under high load were driven very deep
into swap when tcpdump was run.  The reason this happened was because the
AF_PACKET protocol has a SET_RINGBUFFER socket option that allows the user space
application to specify how many entries an AF_PACKET socket will have and how
large each entry will be.  It seems the default setting for tcpdump is to set
the ring buffer to 32 entries of 64 Kb each, which implies 32 order 5
allocation.  Thats difficult under good circumstances, and horrid under memory
pressure.

I thought it would be good to make that a bit more usable.  I was going to do a
simple conversion of the ring buffer from contigous pages to iovecs, but
unfortunately, the metadata which AF_PACKET places in these buffers can easily
span a page boundary, and given that these buffers get mapped into user space,
and the data layout doesn't easily allow for a change to padding between frames
to avoid that, a simple iovec change is just going to break user space ABI
consistency.

So I've done this, I've added a three tiered mechanism to the af_packet set_ring
socket option.  It attempts to allocate memory in the following order:

1) Using __get_free_pages with GFP_NORETRY set, so as to fail quickly without
digging into swap

2) Using vmalloc

3) Using __get_free_pages with GFP_NORETRY clear, causing us to try as hard as
needed to get the memory

The effect is that we don't disturb the system as much when we're under load,
while still being able to conduct tcpdumps effectively.

Tested successfully by me.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
---
 net/packet/af_packet.c |   84 ++++++++++++++++++++++++++++++++++++++---------
 1 files changed, 68 insertions(+), 16 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 3616f27..d1148ac 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -163,8 +163,14 @@ struct packet_mreq_max {
 static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
 		int closing, int tx_ring);
 
+#define PGV_FROM_VMALLOC 1
+struct pgv {
+	char *buffer;
+	unsigned char flags;
+};
+
 struct packet_ring_buffer {
-	char			**pg_vec;
+	struct pgv		*pg_vec;
 	unsigned int		head;
 	unsigned int		frames_per_block;
 	unsigned int		frame_size;
@@ -283,7 +289,8 @@ static void *packet_lookup_frame(struct packet_sock *po,
 	pg_vec_pos = position / rb->frames_per_block;
 	frame_offset = position % rb->frames_per_block;
 
-	h.raw = rb->pg_vec[pg_vec_pos] + (frame_offset * rb->frame_size);
+	h.raw = rb->pg_vec[pg_vec_pos].buffer +
+		(frame_offset * rb->frame_size);
 
 	if (status != __packet_get_status(po, h.raw))
 		return NULL;
@@ -2322,37 +2329,74 @@ static const struct vm_operations_struct packet_mmap_ops = {
 	.close	=	packet_mm_close,
 };
 
-static void free_pg_vec(char **pg_vec, unsigned int order, unsigned int len)
+static void free_pg_vec(struct pgv *pg_vec, unsigned int order,
+			unsigned int len)
 {
 	int i;
 
 	for (i = 0; i < len; i++) {
-		if (likely(pg_vec[i]))
-			free_pages((unsigned long) pg_vec[i], order);
+		if (likely(pg_vec[i].buffer)) {
+			if (pg_vec[i].flags & PGV_FROM_VMALLOC)
+				vfree(pg_vec[i].buffer);
+			else
+				free_pages((unsigned long)pg_vec[i].buffer,
+					   order);
+			pg_vec[i].buffer = NULL;
+		}
 	}
 	kfree(pg_vec);
 }
 
-static inline char *alloc_one_pg_vec_page(unsigned long order)
+static inline char *alloc_one_pg_vec_page(unsigned long order,
+					  unsigned char *flags)
 {
-	gfp_t gfp_flags = GFP_KERNEL | __GFP_COMP | __GFP_ZERO | __GFP_NOWARN;
+	char *buffer = NULL;
+	gfp_t gfp_flags = GFP_KERNEL | __GFP_COMP |
+			  __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY;
+
+	buffer = (char *) __get_free_pages(gfp_flags, order);
+
+	if (buffer)
+		return buffer;
+
+	/*
+	 * __get_free_pages failed, fall back to vmalloc
+	 */
+	*flags |= PGV_FROM_VMALLOC;
+	buffer = vmalloc((1 << order) * PAGE_SIZE);
 
-	return (char *) __get_free_pages(gfp_flags, order);
+	if (buffer)
+		return buffer;
+
+	/*
+	 * vmalloc failed, lets dig into swap here
+	 */
+	*flags = 0;
+	gfp_flags &= ~__GFP_NORETRY;
+	buffer = (char *)__get_free_pages(gfp_flags, order);
+	if (buffer)
+		return buffer;
+
+	/*
+	 * complete and utter failure
+	 */
+	return NULL;
 }
 
-static char **alloc_pg_vec(struct tpacket_req *req, int order)
+static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order)
 {
 	unsigned int block_nr = req->tp_block_nr;
-	char **pg_vec;
+	struct pgv *pg_vec;
 	int i;
 
-	pg_vec = kzalloc(block_nr * sizeof(char *), GFP_KERNEL);
+	pg_vec = kzalloc(block_nr * sizeof(struct pgv), GFP_KERNEL);
 	if (unlikely(!pg_vec))
 		goto out;
 
 	for (i = 0; i < block_nr; i++) {
-		pg_vec[i] = alloc_one_pg_vec_page(order);
-		if (unlikely(!pg_vec[i]))
+		pg_vec[i].buffer = alloc_one_pg_vec_page(order,
+							 &pg_vec[i].flags);
+		if (unlikely(!pg_vec[i].buffer))
 			goto out_free_pgvec;
 	}
 
@@ -2361,6 +2405,7 @@ out:
 
 out_free_pgvec:
 	free_pg_vec(pg_vec, order, block_nr);
+	kfree(pg_vec);
 	pg_vec = NULL;
 	goto out;
 }
@@ -2368,7 +2413,7 @@ out_free_pgvec:
 static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
 		int closing, int tx_ring)
 {
-	char **pg_vec = NULL;
+	struct pgv *pg_vec = NULL;
 	struct packet_sock *po = pkt_sk(sk);
 	int was_running, order = 0;
 	struct packet_ring_buffer *rb;
@@ -2530,15 +2575,22 @@ static int packet_mmap(struct file *file, struct socket *sock,
 			continue;
 
 		for (i = 0; i < rb->pg_vec_len; i++) {
-			struct page *page = virt_to_page(rb->pg_vec[i]);
+			struct page *page;
+			void *kaddr = rb->pg_vec[i].buffer;
 			int pg_num;
 
 			for (pg_num = 0; pg_num < rb->pg_vec_pages;
-					pg_num++, page++) {
+					pg_num++) {
+				if (rb->pg_vec[i].flags & PGV_FROM_VMALLOC)
+					page = vmalloc_to_page(kaddr);
+				else
+					page = virt_to_page(kaddr);
+
 				err = vm_insert_page(vma, start, page);
 				if (unlikely(err))
 					goto out;
 				start += PAGE_SIZE;
+				kaddr += PAGE_SIZE;
 			}
 		}
 	}
-- 
1.7.2.3


^ permalink raw reply related

* Re: [PATCH] Enhance AF_PACKET implementation to not require high order contiguous memory allocation (v2)
From: Eric Dumazet @ 2010-11-09 18:02 UTC (permalink / raw)
  To: nhorman; +Cc: netdev, davem, zenczykowski
In-Reply-To: <1289324799-2256-1-git-send-email-nhorman@tuxdriver.com>

Le mardi 09 novembre 2010 à 12:46 -0500, nhorman@tuxdriver.com a écrit :
> From: Neil Horman <nhorman@tuxdriver.com>
> 
> Version 2 of this patch.  Sorry its been awhile, but I've had other issues come
> up.  I've re-written this patch taking into account the change notes from the
> last round
> 
> Change notes:
> 1) Rewrote the patch to not do all my previous stupid vmaps on single page
> arrays.  Instead we just use vmalloc now.
> 
> 2) Checked into the behavior of tcpdump on high order allocations in the failing
> case.  Tcpdump (more specifically I think libpcap) does attempt to minimize the
> allocation order on the ring buffer, unfortunately the lowest it will push the
> ordrer too given a sufficiently large buffer is order 5, so its still a very
> large contiguous allocation, and that says nothing of other malicious
> applications that might try to do more.
> 
> Summary:
> It was shown to me recently that systems under high load were driven very deep
> into swap when tcpdump was run.  The reason this happened was because the
> AF_PACKET protocol has a SET_RINGBUFFER socket option that allows the user space
> application to specify how many entries an AF_PACKET socket will have and how
> large each entry will be.  It seems the default setting for tcpdump is to set
> the ring buffer to 32 entries of 64 Kb each, which implies 32 order 5
> allocation.  Thats difficult under good circumstances, and horrid under memory
> pressure.
> 
> I thought it would be good to make that a bit more usable.  I was going to do a
> simple conversion of the ring buffer from contigous pages to iovecs, but
> unfortunately, the metadata which AF_PACKET places in these buffers can easily
> span a page boundary, and given that these buffers get mapped into user space,
> and the data layout doesn't easily allow for a change to padding between frames
> to avoid that, a simple iovec change is just going to break user space ABI
> consistency.
> 
> So I've done this, I've added a three tiered mechanism to the af_packet set_ring
> socket option.  It attempts to allocate memory in the following order:
> 
> 1) Using __get_free_pages with GFP_NORETRY set, so as to fail quickly without
> digging into swap
> 
> 2) Using vmalloc
> 
> 3) Using __get_free_pages with GFP_NORETRY clear, causing us to try as hard as
> needed to get the memory
> 
> The effect is that we don't disturb the system as much when we're under load,
> while still being able to conduct tcpdumps effectively.
> 
> Tested successfully by me.
> 
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> ---
>  net/packet/af_packet.c |   84 ++++++++++++++++++++++++++++++++++++++---------
>  1 files changed, 68 insertions(+), 16 deletions(-)
> 
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index 3616f27..d1148ac 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -163,8 +163,14 @@ struct packet_mreq_max {
>  static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
>  		int closing, int tx_ring);
>  
> +#define PGV_FROM_VMALLOC 1
> +struct pgv {
> +	char *buffer;
> +	unsigned char flags;
> +};
> +
>  struct packet_ring_buffer {
> -	char			**pg_vec;
> +	struct pgv		*pg_vec;
>  	unsigned int		head;
>  	unsigned int		frames_per_block;
>  	unsigned int		frame_size;
> @@ -283,7 +289,8 @@ static void *packet_lookup_frame(struct packet_sock *po,
>  	pg_vec_pos = position / rb->frames_per_block;
>  	frame_offset = position % rb->frames_per_block;
>  
> -	h.raw = rb->pg_vec[pg_vec_pos] + (frame_offset * rb->frame_size);
> +	h.raw = rb->pg_vec[pg_vec_pos].buffer +
> +		(frame_offset * rb->frame_size);
>  
>  	if (status != __packet_get_status(po, h.raw))
>  		return NULL;
> @@ -2322,37 +2329,74 @@ static const struct vm_operations_struct packet_mmap_ops = {
>  	.close	=	packet_mm_close,
>  };
>  
> -static void free_pg_vec(char **pg_vec, unsigned int order, unsigned int len)
> +static void free_pg_vec(struct pgv *pg_vec, unsigned int order,
> +			unsigned int len)
>  {
>  	int i;
>  
>  	for (i = 0; i < len; i++) {
> -		if (likely(pg_vec[i]))
> -			free_pages((unsigned long) pg_vec[i], order);
> +		if (likely(pg_vec[i].buffer)) {
> +			if (pg_vec[i].flags & PGV_FROM_VMALLOC)
> +				vfree(pg_vec[i].buffer);
> +			else
> +				free_pages((unsigned long)pg_vec[i].buffer,
> +					   order);
> +			pg_vec[i].buffer = NULL;
> +		}
>  	}
>  	kfree(pg_vec);
>  }
>  
> -static inline char *alloc_one_pg_vec_page(unsigned long order)
> +static inline char *alloc_one_pg_vec_page(unsigned long order,
> +					  unsigned char *flags)
>  {
> -	gfp_t gfp_flags = GFP_KERNEL | __GFP_COMP | __GFP_ZERO | __GFP_NOWARN;
> +	char *buffer = NULL;
> +	gfp_t gfp_flags = GFP_KERNEL | __GFP_COMP |
> +			  __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY;
> +
> +	buffer = (char *) __get_free_pages(gfp_flags, order);
> +
> +	if (buffer)
> +		return buffer;
> +
> +	/*
> +	 * __get_free_pages failed, fall back to vmalloc
> +	 */
> +	*flags |= PGV_FROM_VMALLOC;
> +	buffer = vmalloc((1 << order) * PAGE_SIZE);
>  
> -	return (char *) __get_free_pages(gfp_flags, order);
> +	if (buffer)
> +		return buffer;
> +
> +	/*
> +	 * vmalloc failed, lets dig into swap here
> +	 */
> +	*flags = 0;
> +	gfp_flags &= ~__GFP_NORETRY;
> +	buffer = (char *)__get_free_pages(gfp_flags, order);
> +	if (buffer)
> +		return buffer;
> +
> +	/*
> +	 * complete and utter failure
> +	 */
> +	return NULL;
>  }
>  
> -static char **alloc_pg_vec(struct tpacket_req *req, int order)
> +static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order)
>  {
>  	unsigned int block_nr = req->tp_block_nr;
> -	char **pg_vec;
> +	struct pgv *pg_vec;
>  	int i;
>  
> -	pg_vec = kzalloc(block_nr * sizeof(char *), GFP_KERNEL);
> +	pg_vec = kzalloc(block_nr * sizeof(struct pgv), GFP_KERNEL);

While we are at it, we could check block_nr being a sane value here ;)

Nice stuff Neil !



^ permalink raw reply

* Re: radvd and auto-ipv6 address regression from 2.6.31 to 2.6.34+
From: Ben Greear @ 2010-11-09 18:07 UTC (permalink / raw)
  To: Brian Haley; +Cc: Mikael Abrahamsson, NetDev
In-Reply-To: <4CD81D24.1040609@hp.com>

On 11/08/2010 07:54 AM, Brian Haley wrote:
> On 11/06/2010 02:17 AM, Mikael Abrahamsson wrote:
>> I would not be surprised if the IPv6 stack on Linux would gain IPv6
>> addresses using SLAAC from any interface if it sees RAs on it,
>> regardless if these are locally generated or not.
>>
>> I guess any Linux device running radvd should turn off autoconf on those
>> interface that radvd is acting on?
>>
>> net.ipv6.conf.veth0.accept_ra=0 should do the trick?
>
> I believe radvd will turn-on IPv6 forwarding on all the interfaces, at
> least it does on Debian in /etc/init.d/radvd, which essentially disables
> the reception of RA's for address configuration purposes.  I'm curious
> what these values are set to, and if something just got missed.
>
> It might be something specific to veth too, not sure how packets are
> copied/looped-back on these devices from the stack.

Thanks for the hint.  I did not have ipv6 forwarding enabled for the
radvd interfaces.  Once I fixed that, it worked as expected with no
auto-created global IPv6 address.  I have been starting radvd manually,
so I wouldn't benefit from any OS specific start scripts.

Thanks,
Ben

>
> -Brian


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply

* Networking hangs when too many parallel requests are made at once
From: Luke Hutchison @ 2010-11-09 18:30 UTC (permalink / raw)
  To: netdev

Since around Linux kernel 2.6.33 or so (but maybe as early as
2.6.31, not sure exactly what version), when restoring a crashed or
closed browser session of either Firefox or Chrome where lots of tabs
(say 10-40) open simultaneously, the networking stack is brought to
its knees -- most or all the tabs eventually time out without data, or
a few tabs might get some data and then display a partial web page.
This behavior occurs with either wifi or ethernet, and occurs when
booting from Fedora 14 on liveusb, so it does not appear to be a
configuration problem. I have a Toshiba Satellite Pro S300M-S2142
laptop with a Core 2 Duo P8600 CPU, Intel GM45 gfx, Intel 82567V
Gigabit Ethernet and Intel 5100 Wifi, running kernel
kernel-2.6.36-1.1.fc15.x86_64 on top of Fedora 14.

Sorry for the length of the following bug report, but it's quite hard
to describe the behavior succinctly.

Even after all tabs have timed out, it's impossible to get data by
opening a new tab -- nothing seems able to access the network
connection.  Networking is broken for other processes too -- for
example, commandline tools like ping don't work either.  The
connection still shows as up in NetworkManager, and sometimes after
5-10 minutes goes back to normal, but not always. "service network
restart" and/or "service NetworkManager restart" and/or "ifdown eth0 ;
ifup eth0" sometimes fixes the problem, but sometimes normal network
activity isn't restored for several minutes and may not act completely
normal again until a reboot.

DNS resolution is the most obviously affected by this.  If I reopen a
browser session and wait a few seconds for networking to hang, I can't
usually ping by domain name but I can (usually) ping by IP address.
However new browser tabs will hang at either name resolution *or*
waiting for data, so I'm not convinced this is just a problem with DNS
resolution.

Also sometimes (but not always) whatever weird state the network stack
on my laptop gets into, things are funky enough to screw up my home
router (two different Motorola Surfboard cable modems/routers), and
the cable modem sometimes has to be reset to get the connection back
to full speed again.  However it is not a router problem in general,
because:

(1) all these symptoms (except this last one where the router somehow
gets screwed up by the laptop's odd behavior) are present whether I
use a wired or wireless connection, and regardless of which network I
am connected to (home or anywhere else, or even when tethered to my
Nexus One), and in multiple countries I have been to in the last 6
months (Portugal, Germany, China).

also
(2) I used to be able to reopen a closed browser session with 40 tabs
and they would all load up just fine.  Then at some point after a
Rawhide update, this broke.

I can't put my finger on exactly when this broke, because I was
dealing with worse breakage for a while since Fedora kernel 2.6.31.5,
as I reported at the following link:

https://bugzilla.redhat.com/show_bug.cgi?id=555213#c1

Synopsis of the above "worse" bug report:
Basically in the very same situation (opening lots of browser tabs),
the machine would lock up hard and the fan would immediately blow at
100% speed.  It took a couple of months of Rawhide updates for this
bug to go away, but by the time this lockup bug was fixed around the
release of Fedora 13 at kernel version 2.6.33, the other network
issues I have described above became evident, and were triggered in
the same way -- thus I believe the two bugs may be related somehow.

My computer has been close to unusable for moderate browsing activity
for about 8 months of the year so far, across nearly two releases of
Fedora (F13 and F14 beta).  I filed the above bug report but it was
never commented on by RedHat engineers.  I figured the bug was
probably visible enough that somebody else should notice it and I just
kept hoping the next update would contain a fix, but not yet.  I
emailed one of the Red Hat kernel engineers and he suggested I ask
upstream.

Please advise me as to how to debug this problem further.  (I haven't
seen anything that looks suspicious in
dmesg output or /var/log/messages, to start with.)

Thank you,
Luke Hutchison

^ permalink raw reply

* alloc_netdev_mq() and multiqueues
From: Kevin Wilson @ 2010-11-09 18:33 UTC (permalink / raw)
  To: netdev

Hello,
I have a short question about multiqueues  and I will appreciate if
somebody can answer shortly in 2-3 sentences.
When talking about multiqueues  I refer for example, to
http://nfws.edenwall.com/nfws_userday/David-Miller_Linux-Multiqueue-Networking.pdf,
alloc_netdev_mq() and friends.

1) Does an ordinary network driver code can be adjusted to use
multiqueues ? or do we need some
special hardware feature ?
2) How can I know if a certain device support multiqueus?\

3) Can anybody name some network cards which support  multiqueues?

Regards,
Kevin

^ permalink raw reply

* Re: alloc_netdev_mq() and multiqueues
From: Ben Hutchings @ 2010-11-09 18:40 UTC (permalink / raw)
  To: Kevin Wilson; +Cc: netdev
In-Reply-To: <AANLkTi=8PrpAB+ARt-H3niNjnmQgHF1Ass345e95_pdq@mail.gmail.com>

On Tue, 2010-11-09 at 20:33 +0200, Kevin Wilson wrote:
> Hello,
> I have a short question about multiqueues  and I will appreciate if
> somebody can answer shortly in 2-3 sentences.
> When talking about multiqueues  I refer for example, to
> http://nfws.edenwall.com/nfws_userday/David-Miller_Linux-Multiqueue-Networking.pdf,
> alloc_netdev_mq() and friends.
> 
> 1) Does an ordinary network driver code can be adjusted to use
> multiqueues ? or do we need some
> special hardware feature ?

This feature is only useful if the hardware has multiple transmit
queues.

> 2) How can I know if a certain device support multiqueus?\

Read the hardware specs.

> 3) Can anybody name some network cards which support  multiqueues?

'git grep -l dev_mq drivers/net' will show you which drivers do.  I have
no knowledge beyond this of which hardware has multiple queues.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [PATCH] Enhance AF_PACKET implementation to not require high order contiguous memory allocation (v2)
From: Neil Horman @ 2010-11-09 18:38 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, davem, zenczykowski
In-Reply-To: <1289325753.2774.20.camel@edumazet-laptop>

On Tue, Nov 09, 2010 at 07:02:32PM +0100, Eric Dumazet wrote:
> Le mardi 09 novembre 2010 à 12:46 -0500, nhorman@tuxdriver.com a écrit :
> > From: Neil Horman <nhorman@tuxdriver.com>
> > 
> > Version 2 of this patch.  Sorry its been awhile, but I've had other issues come
> > up.  I've re-written this patch taking into account the change notes from the
> > last round
> > 
> > Change notes:
> > 1) Rewrote the patch to not do all my previous stupid vmaps on single page
> > arrays.  Instead we just use vmalloc now.
> > 
> > 2) Checked into the behavior of tcpdump on high order allocations in the failing
> > case.  Tcpdump (more specifically I think libpcap) does attempt to minimize the
> > allocation order on the ring buffer, unfortunately the lowest it will push the
> > ordrer too given a sufficiently large buffer is order 5, so its still a very
> > large contiguous allocation, and that says nothing of other malicious
> > applications that might try to do more.
> > 
> > Summary:
> > It was shown to me recently that systems under high load were driven very deep
> > into swap when tcpdump was run.  The reason this happened was because the
> > AF_PACKET protocol has a SET_RINGBUFFER socket option that allows the user space
> > application to specify how many entries an AF_PACKET socket will have and how
> > large each entry will be.  It seems the default setting for tcpdump is to set
> > the ring buffer to 32 entries of 64 Kb each, which implies 32 order 5
> > allocation.  Thats difficult under good circumstances, and horrid under memory
> > pressure.
> > 
> > I thought it would be good to make that a bit more usable.  I was going to do a
> > simple conversion of the ring buffer from contigous pages to iovecs, but
> > unfortunately, the metadata which AF_PACKET places in these buffers can easily
> > span a page boundary, and given that these buffers get mapped into user space,
> > and the data layout doesn't easily allow for a change to padding between frames
> > to avoid that, a simple iovec change is just going to break user space ABI
> > consistency.
> > 
> > So I've done this, I've added a three tiered mechanism to the af_packet set_ring
> > socket option.  It attempts to allocate memory in the following order:
> > 
> > 1) Using __get_free_pages with GFP_NORETRY set, so as to fail quickly without
> > digging into swap
> > 
> > 2) Using vmalloc
> > 
> > 3) Using __get_free_pages with GFP_NORETRY clear, causing us to try as hard as
> > needed to get the memory
> > 
> > The effect is that we don't disturb the system as much when we're under load,
> > while still being able to conduct tcpdumps effectively.
> > 
> > Tested successfully by me.
> > 
> > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > ---
> >  net/packet/af_packet.c |   84 ++++++++++++++++++++++++++++++++++++++---------
> >  1 files changed, 68 insertions(+), 16 deletions(-)
> > 
> > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> > index 3616f27..d1148ac 100644
> > --- a/net/packet/af_packet.c
> > +++ b/net/packet/af_packet.c
> > @@ -163,8 +163,14 @@ struct packet_mreq_max {
> >  static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
> >  		int closing, int tx_ring);
> >  
> > +#define PGV_FROM_VMALLOC 1
> > +struct pgv {
> > +	char *buffer;
> > +	unsigned char flags;
> > +};
> > +
> >  struct packet_ring_buffer {
> > -	char			**pg_vec;
> > +	struct pgv		*pg_vec;
> >  	unsigned int		head;
> >  	unsigned int		frames_per_block;
> >  	unsigned int		frame_size;
> > @@ -283,7 +289,8 @@ static void *packet_lookup_frame(struct packet_sock *po,
> >  	pg_vec_pos = position / rb->frames_per_block;
> >  	frame_offset = position % rb->frames_per_block;
> >  
> > -	h.raw = rb->pg_vec[pg_vec_pos] + (frame_offset * rb->frame_size);
> > +	h.raw = rb->pg_vec[pg_vec_pos].buffer +
> > +		(frame_offset * rb->frame_size);
> >  
> >  	if (status != __packet_get_status(po, h.raw))
> >  		return NULL;
> > @@ -2322,37 +2329,74 @@ static const struct vm_operations_struct packet_mmap_ops = {
> >  	.close	=	packet_mm_close,
> >  };
> >  
> > -static void free_pg_vec(char **pg_vec, unsigned int order, unsigned int len)
> > +static void free_pg_vec(struct pgv *pg_vec, unsigned int order,
> > +			unsigned int len)
> >  {
> >  	int i;
> >  
> >  	for (i = 0; i < len; i++) {
> > -		if (likely(pg_vec[i]))
> > -			free_pages((unsigned long) pg_vec[i], order);
> > +		if (likely(pg_vec[i].buffer)) {
> > +			if (pg_vec[i].flags & PGV_FROM_VMALLOC)
> > +				vfree(pg_vec[i].buffer);
> > +			else
> > +				free_pages((unsigned long)pg_vec[i].buffer,
> > +					   order);
> > +			pg_vec[i].buffer = NULL;
> > +		}
> >  	}
> >  	kfree(pg_vec);
> >  }
> >  
> > -static inline char *alloc_one_pg_vec_page(unsigned long order)
> > +static inline char *alloc_one_pg_vec_page(unsigned long order,
> > +					  unsigned char *flags)
> >  {
> > -	gfp_t gfp_flags = GFP_KERNEL | __GFP_COMP | __GFP_ZERO | __GFP_NOWARN;
> > +	char *buffer = NULL;
> > +	gfp_t gfp_flags = GFP_KERNEL | __GFP_COMP |
> > +			  __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY;
> > +
> > +	buffer = (char *) __get_free_pages(gfp_flags, order);
> > +
> > +	if (buffer)
> > +		return buffer;
> > +
> > +	/*
> > +	 * __get_free_pages failed, fall back to vmalloc
> > +	 */
> > +	*flags |= PGV_FROM_VMALLOC;
> > +	buffer = vmalloc((1 << order) * PAGE_SIZE);
> >  
> > -	return (char *) __get_free_pages(gfp_flags, order);
> > +	if (buffer)
> > +		return buffer;
> > +
> > +	/*
> > +	 * vmalloc failed, lets dig into swap here
> > +	 */
> > +	*flags = 0;
> > +	gfp_flags &= ~__GFP_NORETRY;
> > +	buffer = (char *)__get_free_pages(gfp_flags, order);
> > +	if (buffer)
> > +		return buffer;
> > +
> > +	/*
> > +	 * complete and utter failure
> > +	 */
> > +	return NULL;
> >  }
> >  
> > -static char **alloc_pg_vec(struct tpacket_req *req, int order)
> > +static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order)
> >  {
> >  	unsigned int block_nr = req->tp_block_nr;
> > -	char **pg_vec;
> > +	struct pgv *pg_vec;
> >  	int i;
> >  
> > -	pg_vec = kzalloc(block_nr * sizeof(char *), GFP_KERNEL);
> > +	pg_vec = kzalloc(block_nr * sizeof(struct pgv), GFP_KERNEL);
> 
> While we are at it, we could check block_nr being a sane value here ;)
> 
This is true.  What do you think a reasonable sane value is?  libpcap seems to
limit itself to 32 order 5 entries in the ring, but that seems a bit arbitrary.
Perhaps we could check and limit allocations to being no more than order 8
(1Mb), and a total allocation of no more than perhaps max(32Mb, 1% of all ram)?
Just throwing it out there, open to any suggestions here

> Nice stuff Neil !
> 
Thanks!
Neil

> 
> 

^ permalink raw reply

* Re: alloc_netdev_mq() and multiqueues
From: David Miller @ 2010-11-09 18:42 UTC (permalink / raw)
  To: wkevils; +Cc: netdev
In-Reply-To: <AANLkTi=8PrpAB+ARt-H3niNjnmQgHF1Ass345e95_pdq@mail.gmail.com>

From: Kevin Wilson <wkevils@gmail.com>
Date: Tue, 9 Nov 2010 20:33:53 +0200

> I have a short question about multiqueues  and I will appreciate if
> somebody can answer shortly in 2-3 sentences.

When you want help from volunteers, and then you dictacte exactly how
people should give you help, you usually receive no help at all.

Just FYI..

^ permalink raw reply

* Re: Networking hangs when too many parallel requests are made at once
From: Luke Hutchison @ 2010-11-09 19:04 UTC (permalink / raw)
  To: netdev
In-Reply-To: <AANLkTikuCj=saH34dyO6xnU-y8rWv63hJaqowZ0ACLzk@mail.gmail.com>

On Tue, Nov 9, 2010 at 1:30 PM, Luke Hutchison <luke.hutch@gmail.com> wrote:
> Since around Linux kernel 2.6.33 or so (but maybe as early as
> 2.6.31, not sure exactly what version), when restoring a crashed or
> closed browser session of either Firefox or Chrome where lots of tabs
> (say 10-40) open simultaneously, the networking stack is brought to
> its knees -- most or all the tabs eventually time out without data, or
> a few tabs might get some data and then display a partial web page.

I forgot to mention, I have glibc-2.12.90-18.x86_64.

Also the following screenshot may be useful

http://web.mit.edu/~luke_h/www/dns-hang-problem.png

Basically in the usage depicted by the screenshot, I had Chrome open
with probably 30-50 tabs across several windows, I then started an scp
transfer of a large file and waited for it to stabilize, then closed
the browser and re-opened it, restoring the tabs.  Within a second or
two (after the first few lucky browser tabs got some content), DNS
hung, and pinging a domain name from the commandline no longer worked
(ruling out a bug in the browser itself).  However the scp transfer
continued at the same rate, and pinging an IP address directly
continued to work fine (in this case; at other times network
connections to already-resolved IP addresses can seem flaky I think,
but I haven't been able to reproduce these problems as easily as with
DNS, which has 100% reproducibility).  You can see that CPU usage
dropped from 100% to something like 50% when the browser tabs all
started blocking (but actually I'm surprised that CPU usage didn't
drop to zero).  In this instance, as soon as I shut down the browser,
pinging a domain name worked immediately again (although, as I
mentioned previously, sometimes it can take a minute or more after
killing the browser for name resolution to jump back into working
mode).  "ifdown eth0 ; ifup eth0" *usually* fixes the problem by
canceling all pending requests.

>From one of the RH engineers:

> It's not a driver issue, since it occurs
> with two different devices... it's not a configuration issue since it
> occurs on a LiveCD... For the same reason it's unlikely to be a
> userspace issue... It's unlikely to be a local network issue since you
> say it happens in multiple locations...
>
> Absolutely bizarre. :/

Any help greatly appreciated.

Thanks,
Luke

^ permalink raw reply

* Re: Networking hangs when too many parallel requests are made at once
From: Ben Greear @ 2010-11-09 19:16 UTC (permalink / raw)
  To: Luke Hutchison; +Cc: netdev
In-Reply-To: <AANLkTinRo_ozWFCYYqanOh0YrTJ6+KadutG2j7T726dY@mail.gmail.com>

On 11/09/2010 11:04 AM, Luke Hutchison wrote:
> On Tue, Nov 9, 2010 at 1:30 PM, Luke Hutchison<luke.hutch@gmail.com>  wrote:
>> Since around Linux kernel 2.6.33 or so (but maybe as early as
>> 2.6.31, not sure exactly what version), when restoring a crashed or
>> closed browser session of either Firefox or Chrome where lots of tabs
>> (say 10-40) open simultaneously, the networking stack is brought to
>> its knees -- most or all the tabs eventually time out without data, or
>> a few tabs might get some data and then display a partial web page.

Have you been able to reproduce this on any other machine?  I suspect
it might be an issue with your specific NIC or other hardware.

At the least, it's not a general problem with opening lots
of TCP connections, as we routinely test with thousands...

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply

* Re: [PATCH] Enhance AF_PACKET implementation to not require high order contiguous memory allocation (v2)
From: Eric Dumazet @ 2010-11-09 19:20 UTC (permalink / raw)
  To: Neil Horman; +Cc: netdev, davem, zenczykowski
In-Reply-To: <20101109183820.GA8069@hmsreliant.think-freely.org>

Le mardi 09 novembre 2010 à 13:38 -0500, Neil Horman a écrit :
> On Tue, Nov 09, 2010 at 07:02:32PM +0100, Eric Dumazet wrote:
> > Le mardi 09 novembre 2010 à 12:46 -0500, nhorman@tuxdriver.com a écrit :
> ic char **alloc_pg_vec(struct tpacket_req *req, int order)
> > > +static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order)
> > >  {
> > >  	unsigned int block_nr = req->tp_block_nr;
> > > -	char **pg_vec;
> > > +	struct pgv *pg_vec;
> > >  	int i;
> > >  
> > > -	pg_vec = kzalloc(block_nr * sizeof(char *), GFP_KERNEL);
> > > +	pg_vec = kzalloc(block_nr * sizeof(struct pgv), GFP_KERNEL);
> > 
> > While we are at it, we could check block_nr being a sane value here ;)
> > 
> This is true.  What do you think a reasonable sane value is?  libpcap seems to
> limit itself to 32 order 5 entries in the ring, but that seems a bit arbitrary.
> Perhaps we could check and limit allocations to being no more than order 8
> (1Mb), and a total allocation of no more than perhaps max(32Mb, 1% of all ram)?
> Just throwing it out there, open to any suggestions here

I was refering to a malicious/buggy program giving a big tp_block_nr so
that (block_nr * sizeof(struct pgv)) overflows the u32

One way to deal with that is to use

	kcalloc(block_nr, sizeof(struct pgv), GFP_KERNEL);

I am not sure consistency checks done in packet_set_ring() are enough to
properly detect such errors.





^ permalink raw reply

* Re: ping -I eth1 ....
From: Joakim Tjernlund @ 2010-11-09 19:33 UTC (permalink / raw)
  Cc: Thomas Graf, Eric Dumazet, netdev
In-Reply-To: <OF921B3329.67FE598A-ONC12577D3.0033387A-C12577D3.00355AC4@LocalDomain>

Joakim Tjernlund/Transmode wrote on 2010/11/06 10:42:46:
> Thomas Graf <tgr@infradead.org> wrote on 2010/11/05 21:31:50:
> >
> > On Fri, Nov 05, 2010 at 04:54:18PM +0100, Joakim Tjernlund wrote:
> > > Eric Dumazet <eric.dumazet@gmail.com> wrote on 2010/11/05 16:06:54:
> > > >
> > > > > Hopefully most of that is legacy or just plain wrong? Unless
> > > > > someone can say why only test IFF_UP one should consider changing them.
> > > > >
> > > >
> > > > Most of the places are hot path.
> > > >
> > > > You dont want to replace one test by four tests.
> > > >
> > > > _This_ would be wrong :)
> > >
> > > Wrong is wrong, even if it is in the hot path :)
> > > Perhaps it is time define and internal IFF_OPERATIONAL flag
> > > which is the sum of IFF_UP, IFF_RUNNING etc.? Tht
> > > way you still get one test in the hot path and can abstract
> > > what defines an operational link.
> >
> > You definitely don't want to have your send() call fail simply because
> > the carrier was off for a few msec or the routing daemon has put a link
> > down temporarly. Also, the outgoing interface looked up at routing
> > decision is not necessarly the interface used for sending in the end.
> > The packet may get mangled and rerouted by netfilter or tc on the way.
>
> But do you handle the case when the link is non operational for a long time?
>
> >
> > Personally I'm even ok with the current behaviour of sendto() while the
> > socket is bound to an interface but if we choose to return an error
> > if the interface is down we might as well do so based on the operational
> > status.

> Perhaps there is a better way. This all started when pppd hung because
> of ping -I <ppp interface>, then someone pulled the cable for the on the link.
>
> This is a strace where we have two ping -I,
> ping -I p1-2-1-2-2 .. and ping -I p1-2-3-2-4 ..
> Notice how pppd hangs for a long time in PPPIOCDETACH
> As far as I can tell this is due to ping -I has claimed the ppp interfaces
> and doesn't noticed that the link is down. Ideally ping should receive
> a ENODEV as soon as pppd calls PPPIOCDETACH.
>
>    0.000908 write(0, "Connection terminated.\n", 23) = 23
>      0.000481 gettimeofday({1288952770, 566048}, NULL) = 0
>      0.001553 ioctl(7, PPPIOCDETACH
> Message from syslogd@Brazil at Fri Nov  5 11:26:20 2010 ...
> Brazil kernel: unregister_netdevice: waiting for p1-2-1-2-2 to become free. Usage count = 3
> Message from syslogd@Brazil at Fri Nov  5 11:26:20 2010 ...
> Brazil kernel: unregister_netdevice: waiting for p1-2-3-2-4 to become free. Usage count = 3
> Message from syslogd@Brazil at Fri Nov  5 11:26:51 2010 ...
> Brazil last message repeated 3 times
> , 0xbfbc3398) = 0
>     66.559216 connect(9, {sa_family=AF_PPPOX, sa_data="\0\0\0\0\0\0\0\252\273\314\335\356hd"}, 30) = 0
>      0.000693 close(10)                 = 0
>      0.000449 close(7)                  = 0
>      0.009801 close(9)                  = 0

Any comment on this last strace? It is expected that ping -I should
hold pppd hostage?

 Jocke


^ permalink raw reply

* [PATCH] net/dst: dst_dev_event() called after other notifiers
From: Eric Dumazet @ 2010-11-09 19:37 UTC (permalink / raw)
  To: Ben Greear, David Miller; +Cc: NetDev
In-Reply-To: <4CD893C6.2030803@candelatech.com>

Le lundi 08 novembre 2010 à 16:20 -0800, Ben Greear a écrit :
> This is on an otherwise lightly loaded 2.6.36 + hacks system, 12 physical interfaces,
> and two VETH interfaces.
> 
> It's much faster to delete an interface when it has no IPv6 address:
> 
> [root@ct503-60 lanforge]# time ip link add link eth5 up name eth5#0 address 00:00:00:00:00:01 type macvlan
> 
> real	0m0.005s
> user	0m0.001s
> sys	0m0.004s
> [root@ct503-60 lanforge]# time ip link delete eth5#0
> 
> real	0m0.033s
> user	0m0.001s
> sys	0m0.005s
> [root@ct503-60 lanforge]# ip link add link eth5 up name eth5#0 address 00:00:00:00:00:01 type macvlan
> 
> [root@ct503-60 lanforge]# ip -6 addr add 2002::1/64 dev eth5#0
> [root@ct503-60 lanforge]# time ip link delete eth5#0
> 
> real	0m1.030s
> user	0m0.000s
> sys	0m0.013s
> 
> 
> Funny enough, if you explicitly remove the IPv6 addr first it seems
> to run at normal speed (adding both operation's times together)
> 
> [root@ct503-60 lanforge]# ip link add link eth5 up name eth5#0 address 00:00:00:00:00:01 type macvlan
> [root@ct503-60 lanforge]# ip -6 addr add 2002::1/64 dev eth5#0
> [root@ct503-60 lanforge]# time ip -6 addr delete 2002::1/64 dev eth5#0
> 
> real	0m0.001s
> user	0m0.000s
> sys	0m0.001s
> [root@ct503-60 lanforge]# time ip link delete eth5#0
> 
> real	0m0.028s
> user	0m0.001s
> sys	0m0.005s
> 

OK, I confirm I already knew how to correct the problem.

http://www.spinics.net/lists/netdev/msg140729.html

quote :

I also believe the order of netdevice notifiers is wrong (we dont set
priority), and that we should call fib_netdev_event() _before_
dst_dev_event(). This needs another patch.


Thanks

[PATCH] net/dst: dst_dev_event() called after other notifiers

Followup of commit ef885afbf8a37689 (net: use rcu_barrier() in
rollback_registered_many)

dst_dev_event() scans a garbage dst list that might be feeded by various
network notifiers at device dismantle time.

Its important to call dst_dev_event() after other notifiers, or we might
enter the infamous msleep(250) in netdev_wait_allrefs(), and wait one
second before calling again call_netdevice_notifiers(NETDEV_UNREGISTER,
dev) to properly remove last device references.

Use priority -10 to let dst_dev_notifier be called after other network
notifiers (they have the default 0 priority)

Reported-by: Ben Greear <greearb@candelatech.com>
Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reported-by: Octavian Purdila <opurdila@ixiacom.com>
Reported-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/core/dst.c |    1 +
 1 files changed, 1 insertion(+)

diff --git a/net/core/dst.c b/net/core/dst.c
index 8abe628..e234bf1 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -370,6 +370,7 @@ static int dst_dev_event(struct notifier_block *this, unsigned long event,
 
 static struct notifier_block dst_dev_notifier = {
 	.notifier_call  = dst_dev_event,
+	.priority = -10, /* must be called after other network notifiers */
 };
 
 void __init dst_init(void)



^ permalink raw reply related

* Re: [PATCH] net/dst: dst_dev_event() called after other notifiers
From: David Miller @ 2010-11-09 19:48 UTC (permalink / raw)
  To: eric.dumazet; +Cc: greearb, netdev
In-Reply-To: <1289331475.2774.41.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 09 Nov 2010 20:37:55 +0100

> [PATCH] net/dst: dst_dev_event() called after other notifiers

Nice, applied.

However, I had to apply this by hand:

>  static struct notifier_block dst_dev_notifier = {
>  	.notifier_call  = dst_dev_event,
> +	.priority = -10, /* must be called after other network notifiers */
>  };

The character after ".notifier_call" in my tree is a TAB character but
in your patch it is a sequence of spaces.  This isn't looking like the
usual email corruption, because the leading TAB characters on these
lines are properly there.

Please figure out why this happened so that it doesn't repeat in
future patches :-)

Thanks!

^ permalink raw reply

* Re: [PATCH] net/dst: dst_dev_event() called after other notifiers
From: Ben Greear @ 2010-11-09 20:11 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev
In-Reply-To: <20101109.114853.193732360.davem@davemloft.net>

On 11/09/2010 11:48 AM, David Miller wrote:
> From: Eric Dumazet<eric.dumazet@gmail.com>
> Date: Tue, 09 Nov 2010 20:37:55 +0100
>
>> [PATCH] net/dst: dst_dev_event() called after other notifiers
>
> Nice, applied.
>
> However, I had to apply this by hand:
>
>>   static struct notifier_block dst_dev_notifier = {
>>   	.notifier_call  = dst_dev_event,
>> +	.priority = -10, /* must be called after other network notifiers */
>>   };
>
> The character after ".notifier_call" in my tree is a TAB character but
> in your patch it is a sequence of spaces.  This isn't looking like the
> usual email corruption, because the leading TAB characters on these
> lines are properly there.
>
> Please figure out why this happened so that it doesn't repeat in
> future patches :-)

I manually applied this as well and can confirm that interface deletion
with a global IPv6 address on it is now comparable to any other device
delete (about 30ms).

Tested-by:  Ben Greear <greearb@candelatech.com>

I'd love to test patches that made all interface deletes faster,
btw :)

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox