Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] Enhance AF_PACKET implementation to not require high order contiguous memory allocation (v3)
From: Eric Dumazet @ 2010-11-10 18:27 UTC (permalink / raw)
  To: nhorman; +Cc: netdev, davem, zenczykowski
In-Reply-To: <1289413202-1773-1-git-send-email-nhorman@tuxdriver.com>

Le mercredi 10 novembre 2010 à 13:20 -0500, nhorman@tuxdriver.com a
écrit :
> From: Neil Horman <nhorman@tuxdriver.com>
> 
> Version 3 of this patch.
> 
> Change notes:
> 1) Updated pg_vec alloc mechanism to prevent user space from overflowing an int
> when configuring the ring buffer.
> 
> Summary:
> It was shown to me recently that systems under high load were driven very deep
> into swap when tcpdump was run.  The reason this happened was because the
> AF_PACKET protocol has a SET_RINGBUFFER socket option that allows the user space
> application to specify how many entries an AF_PACKET socket will have and how
> large each entry will be.  It seems the default setting for tcpdump is to set
> the ring buffer to 32 entries of 64 Kb each, which implies 32 order 5
> allocation.  Thats difficult under good circumstances, and horrid under memory
> pressure.
> 
> I thought it would be good to make that a bit more usable.  I was going to do a
> simple conversion of the ring buffer from contigous pages to iovecs, but
> unfortunately, the metadata which AF_PACKET places in these buffers can easily
> span a page boundary, and given that these buffers get mapped into user space,
> and the data layout doesn't easily allow for a change to padding between frames
> to avoid that, a simple iovec change is just going to break user space ABI
> consistency.
> 
> So I've done this, I've added a three tiered mechanism to the af_packet set_ring
> socket option.  It attempts to allocate memory in the following order:
> 
> 1) Using __get_free_pages with GFP_NORETRY set, so as to fail quickly without
> digging into swap
> 
> 2) Using vmalloc
> 
> 3) Using __get_free_pages with GFP_NORETRY clear, causing us to try as hard as
> needed to get the memory
> 
> The effect is that we don't disturb the system as much when we're under load,
> while still being able to conduct tcpdumps effectively.
> 
> Tested successfully by me.
> 
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> ---
>  net/packet/af_packet.c |   86 +++++++++++++++++++++++++++++++++++++++---------
>  1 files changed, 70 insertions(+), 16 deletions(-)
> 
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index 3616f27..5218e67 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -163,8 +163,14 @@ struct packet_mreq_max {
>  static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
>  		int closing, int tx_ring);
>  
> +#define PGV_FROM_VMALLOC 1
> +struct pgv {
> +	char *buffer;
> +	unsigned char flags;
> +};
> +
>  struct packet_ring_buffer {
> -	char			**pg_vec;
> +	struct pgv		*pg_vec;
>  	unsigned int		head;
>  	unsigned int		frames_per_block;
>  	unsigned int		frame_size;
> @@ -283,7 +289,8 @@ static void *packet_lookup_frame(struct packet_sock *po,
>  	pg_vec_pos = position / rb->frames_per_block;
>  	frame_offset = position % rb->frames_per_block;
>  
> -	h.raw = rb->pg_vec[pg_vec_pos] + (frame_offset * rb->frame_size);
> +	h.raw = rb->pg_vec[pg_vec_pos].buffer +
> +		(frame_offset * rb->frame_size);
>  
>  	if (status != __packet_get_status(po, h.raw))
>  		return NULL;
> @@ -2322,37 +2329,76 @@ static const struct vm_operations_struct packet_mmap_ops = {
>  	.close	=	packet_mm_close,
>  };
>  
> -static void free_pg_vec(char **pg_vec, unsigned int order, unsigned int len)
> +static void free_pg_vec(struct pgv *pg_vec, unsigned int order,
> +			unsigned int len)
>  {
>  	int i;
>  
>  	for (i = 0; i < len; i++) {
> -		if (likely(pg_vec[i]))
> -			free_pages((unsigned long) pg_vec[i], order);
> +		if (likely(pg_vec[i].buffer)) {
> +			if (pg_vec[i].flags & PGV_FROM_VMALLOC)
> +				vfree(pg_vec[i].buffer);
> +			else
> +				free_pages((unsigned long)pg_vec[i].buffer,
> +					   order);
> +			pg_vec[i].buffer = NULL;
> +		}
>  	}
>  	kfree(pg_vec);
>  }
>  
> -static inline char *alloc_one_pg_vec_page(unsigned long order)
> +static inline char *alloc_one_pg_vec_page(unsigned long order,
> +					  unsigned char *flags)
>  {
> -	gfp_t gfp_flags = GFP_KERNEL | __GFP_COMP | __GFP_ZERO | __GFP_NOWARN;
> +	char *buffer = NULL;
> +	gfp_t gfp_flags = GFP_KERNEL | __GFP_COMP |
> +			  __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY;
> +
> +	buffer = (char *) __get_free_pages(gfp_flags, order);
> +
> +	if (buffer)
> +		return buffer;
> +
> +	/*
> +	 * __get_free_pages failed, fall back to vmalloc
> +	 */
> +	*flags |= PGV_FROM_VMALLOC;
> +	buffer = vmalloc((1 << order) * PAGE_SIZE);
>  
> -	return (char *) __get_free_pages(gfp_flags, order);
> +	if (buffer)
> +		return buffer;
> +
> +	/*
> +	 * vmalloc failed, lets dig into swap here
> +	 */
> +	*flags = 0;
> +	gfp_flags &= ~__GFP_NORETRY;
> +	buffer = (char *)__get_free_pages(gfp_flags, order);
> +	if (buffer)
> +		return buffer;
> +
> +	/*
> +	 * complete and utter failure
> +	 */
> +	return NULL;
>  }
>  
> -static char **alloc_pg_vec(struct tpacket_req *req, int order)
> +static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order)
>  {
>  	unsigned int block_nr = req->tp_block_nr;
> -	char **pg_vec;
> +	struct pgv *pg_vec;
>  	int i;
>  
> -	pg_vec = kzalloc(block_nr * sizeof(char *), GFP_KERNEL);
> +	pg_vec = kcalloc(block_nr, sizeof(struct pgv), GFP_KERNEL);
>  	if (unlikely(!pg_vec))
>  		goto out;
>  
> +	memset(pg_vec, 0 , sizeof(struct pgv) * block_nr);
> +

Well, memset() is not needed here : kzalloc() or kcalloc() already
cleared your memory.

You added this bit, this was not in your previous patch ;)




^ permalink raw reply

* Re: [PATCH] Prevent reading uninitialized memory with socketfilters
From: Dan Rosenberg @ 2010-11-10 18:25 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, stable, security

Apologies for the terseness, I'm not looking for a fight.

I agree the memset is too expensive, and that Eric's fix is far better.

-Dan

------Original Message------
From: David Miller
To: drosenberg@vsecurity.com
Cc: netdev@vger.kernel.org
Cc: stable@kernel.org
Cc: security@kernel.org
Subject: Re: [PATCH] Prevent reading uninitialized memory with socketfilters
Sent: Nov 10, 2010 1:21 PM

From: "Dan Rosenberg" <drosenberg@vsecurity.com>
Date: Wed, 10 Nov 2010 18:18:08 +0000

> The code sample I linked to clearly demonstrates exactly how to
> accomplish this, if you had bothered to read it.

I told you why I didn't read it, if you had bothered to read my
reply properly :-)

Anyways, I realize we have to do something, but memset() is going
to completely kill performance.  I consider Eric's suggestion the
closest to acceptable cost at this point but even that is hard
to digest for me.

^ permalink raw reply

* Re: [PATCH] Prevent reading uninitialized memory with socketfilters
From: David Miller @ 2010-11-10 18:21 UTC (permalink / raw)
  To: drosenberg; +Cc: netdev, stable, security
In-Reply-To: <1695276347-1289413089-cardhu_decombobulator_blackberry.rim.net-434693855-@bda083.bisx.prod.on.blackberry>

From: "Dan Rosenberg" <drosenberg@vsecurity.com>
Date: Wed, 10 Nov 2010 18:18:08 +0000

> The code sample I linked to clearly demonstrates exactly how to
> accomplish this, if you had bothered to read it.

I told you why I didn't read it, if you had bothered to read my
reply properly :-)

Anyways, I realize we have to do something, but memset() is going
to completely kill performance.  I consider Eric's suggestion the
closest to acceptable cost at this point but even that is hard
to digest for me.

^ permalink raw reply

* [PATCH] Enhance AF_PACKET implementation to not require high order contiguous memory allocation (v3)
From: nhorman @ 2010-11-10 18:20 UTC (permalink / raw)
  To: netdev; +Cc: davem, eric.dumazet, zenczykowski, Neil Horman
In-Reply-To: <1288033566-2091-1-git-send-email-nhorman@tuxdriver.com>

From: Neil Horman <nhorman@tuxdriver.com>

Version 3 of this patch.

Change notes:
1) Updated pg_vec alloc mechanism to prevent user space from overflowing an int
when configuring the ring buffer.

Summary:
It was shown to me recently that systems under high load were driven very deep
into swap when tcpdump was run.  The reason this happened was because the
AF_PACKET protocol has a SET_RINGBUFFER socket option that allows the user space
application to specify how many entries an AF_PACKET socket will have and how
large each entry will be.  It seems the default setting for tcpdump is to set
the ring buffer to 32 entries of 64 Kb each, which implies 32 order 5
allocation.  Thats difficult under good circumstances, and horrid under memory
pressure.

I thought it would be good to make that a bit more usable.  I was going to do a
simple conversion of the ring buffer from contigous pages to iovecs, but
unfortunately, the metadata which AF_PACKET places in these buffers can easily
span a page boundary, and given that these buffers get mapped into user space,
and the data layout doesn't easily allow for a change to padding between frames
to avoid that, a simple iovec change is just going to break user space ABI
consistency.

So I've done this, I've added a three tiered mechanism to the af_packet set_ring
socket option.  It attempts to allocate memory in the following order:

1) Using __get_free_pages with GFP_NORETRY set, so as to fail quickly without
digging into swap

2) Using vmalloc

3) Using __get_free_pages with GFP_NORETRY clear, causing us to try as hard as
needed to get the memory

The effect is that we don't disturb the system as much when we're under load,
while still being able to conduct tcpdumps effectively.

Tested successfully by me.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
---
 net/packet/af_packet.c |   86 +++++++++++++++++++++++++++++++++++++++---------
 1 files changed, 70 insertions(+), 16 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 3616f27..5218e67 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -163,8 +163,14 @@ struct packet_mreq_max {
 static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
 		int closing, int tx_ring);
 
+#define PGV_FROM_VMALLOC 1
+struct pgv {
+	char *buffer;
+	unsigned char flags;
+};
+
 struct packet_ring_buffer {
-	char			**pg_vec;
+	struct pgv		*pg_vec;
 	unsigned int		head;
 	unsigned int		frames_per_block;
 	unsigned int		frame_size;
@@ -283,7 +289,8 @@ static void *packet_lookup_frame(struct packet_sock *po,
 	pg_vec_pos = position / rb->frames_per_block;
 	frame_offset = position % rb->frames_per_block;
 
-	h.raw = rb->pg_vec[pg_vec_pos] + (frame_offset * rb->frame_size);
+	h.raw = rb->pg_vec[pg_vec_pos].buffer +
+		(frame_offset * rb->frame_size);
 
 	if (status != __packet_get_status(po, h.raw))
 		return NULL;
@@ -2322,37 +2329,76 @@ static const struct vm_operations_struct packet_mmap_ops = {
 	.close	=	packet_mm_close,
 };
 
-static void free_pg_vec(char **pg_vec, unsigned int order, unsigned int len)
+static void free_pg_vec(struct pgv *pg_vec, unsigned int order,
+			unsigned int len)
 {
 	int i;
 
 	for (i = 0; i < len; i++) {
-		if (likely(pg_vec[i]))
-			free_pages((unsigned long) pg_vec[i], order);
+		if (likely(pg_vec[i].buffer)) {
+			if (pg_vec[i].flags & PGV_FROM_VMALLOC)
+				vfree(pg_vec[i].buffer);
+			else
+				free_pages((unsigned long)pg_vec[i].buffer,
+					   order);
+			pg_vec[i].buffer = NULL;
+		}
 	}
 	kfree(pg_vec);
 }
 
-static inline char *alloc_one_pg_vec_page(unsigned long order)
+static inline char *alloc_one_pg_vec_page(unsigned long order,
+					  unsigned char *flags)
 {
-	gfp_t gfp_flags = GFP_KERNEL | __GFP_COMP | __GFP_ZERO | __GFP_NOWARN;
+	char *buffer = NULL;
+	gfp_t gfp_flags = GFP_KERNEL | __GFP_COMP |
+			  __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY;
+
+	buffer = (char *) __get_free_pages(gfp_flags, order);
+
+	if (buffer)
+		return buffer;
+
+	/*
+	 * __get_free_pages failed, fall back to vmalloc
+	 */
+	*flags |= PGV_FROM_VMALLOC;
+	buffer = vmalloc((1 << order) * PAGE_SIZE);
 
-	return (char *) __get_free_pages(gfp_flags, order);
+	if (buffer)
+		return buffer;
+
+	/*
+	 * vmalloc failed, lets dig into swap here
+	 */
+	*flags = 0;
+	gfp_flags &= ~__GFP_NORETRY;
+	buffer = (char *)__get_free_pages(gfp_flags, order);
+	if (buffer)
+		return buffer;
+
+	/*
+	 * complete and utter failure
+	 */
+	return NULL;
 }
 
-static char **alloc_pg_vec(struct tpacket_req *req, int order)
+static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order)
 {
 	unsigned int block_nr = req->tp_block_nr;
-	char **pg_vec;
+	struct pgv *pg_vec;
 	int i;
 
-	pg_vec = kzalloc(block_nr * sizeof(char *), GFP_KERNEL);
+	pg_vec = kcalloc(block_nr, sizeof(struct pgv), GFP_KERNEL);
 	if (unlikely(!pg_vec))
 		goto out;
 
+	memset(pg_vec, 0 , sizeof(struct pgv) * block_nr);
+
 	for (i = 0; i < block_nr; i++) {
-		pg_vec[i] = alloc_one_pg_vec_page(order);
-		if (unlikely(!pg_vec[i]))
+		pg_vec[i].buffer = alloc_one_pg_vec_page(order,
+							 &pg_vec[i].flags);
+		if (unlikely(!pg_vec[i].buffer))
 			goto out_free_pgvec;
 	}
 
@@ -2361,6 +2407,7 @@ out:
 
 out_free_pgvec:
 	free_pg_vec(pg_vec, order, block_nr);
+	kfree(pg_vec);
 	pg_vec = NULL;
 	goto out;
 }
@@ -2368,7 +2415,7 @@ out_free_pgvec:
 static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
 		int closing, int tx_ring)
 {
-	char **pg_vec = NULL;
+	struct pgv *pg_vec = NULL;
 	struct packet_sock *po = pkt_sk(sk);
 	int was_running, order = 0;
 	struct packet_ring_buffer *rb;
@@ -2530,15 +2577,22 @@ static int packet_mmap(struct file *file, struct socket *sock,
 			continue;
 
 		for (i = 0; i < rb->pg_vec_len; i++) {
-			struct page *page = virt_to_page(rb->pg_vec[i]);
+			struct page *page;
+			void *kaddr = rb->pg_vec[i].buffer;
 			int pg_num;
 
 			for (pg_num = 0; pg_num < rb->pg_vec_pages;
-					pg_num++, page++) {
+					pg_num++) {
+				if (rb->pg_vec[i].flags & PGV_FROM_VMALLOC)
+					page = vmalloc_to_page(kaddr);
+				else
+					page = virt_to_page(kaddr);
+
 				err = vm_insert_page(vma, start, page);
 				if (unlikely(err))
 					goto out;
 				start += PAGE_SIZE;
+				kaddr += PAGE_SIZE;
 			}
 		}
 	}
-- 
1.7.2.3


^ permalink raw reply related

* Re: [PATCH 2/3] net: packet: fix information leak to userland
From: David Miller @ 2010-11-10 18:20 UTC (permalink / raw)
  To: segooon; +Cc: wharms, kernel-janitors, jpirko, eric.dumazet, netdev,
	linux-kernel
In-Reply-To: <20101110181641.GA11499@albatros>

From: Vasiliy Kulikov <segooon@gmail.com>
Date: Wed, 10 Nov 2010 21:16:42 +0300

> Are there any other suggestions about this patch?

None from me, so even if there are no changes please just repost
this.

Thanks.

^ permalink raw reply

* Re: [PATCH] macvlan: lockless tx path
From: Eric Dumazet @ 2010-11-10 18:18 UTC (permalink / raw)
  To: Ben Greear; +Cc: David Miller, Patrick McHardy, netdev
In-Reply-To: <4CDADC17.6070506@candelatech.com>

Le mercredi 10 novembre 2010 à 09:53 -0800, Ben Greear a écrit :

> I agree, but if these can be read from user-space, it can be tricky to make
> solid code to deal with wraps when the thing wrapping can be 32 or 64 bits,
> depending on whether the kernel is compiled 32-bit or 64-bit.
> 
> So, my preference is to use u32 or u64 so there is no guesswork involved.
> 
> To be sure, this problem exists in lots of places already (/proc/net/dev comes to mind),
> but the fewer places the better in my opinion.
> 

On a 32bit kernel, very few devices provide 64bit counters, so an
application reading /proc/net/dev should be prepared to handle 32 or
64bit counter.

On a 64bit kernel, many devices still provide 32, 36, 40 bit counters
(hardware based). Same conclusion for userspace.

So really, an SNMP application must be able to cope with any counter
width.

As percpu data is going to hurt in the 4096 cpu cases, we should try to
not make percpu structures too big.

^ permalink raw reply

* Re: [PATCH 2/3] net: packet: fix information leak to userland
From: Vasiliy Kulikov @ 2010-11-10 18:16 UTC (permalink / raw)
  To: walter harms
  Cc: kernel-janitors, David S. Miller, Jiri Pirko, Eric Dumazet,
	netdev, linux-kernel
In-Reply-To: <20101106143911.GA17428@albatros>

David,

Are there any other suggestions about this patch?


Thanks,

-- 
Vasiliy

^ permalink raw reply

* Re: [PATCH v2] net: ax25: fix information leak to userland
From: David Miller @ 2010-11-10 18:14 UTC (permalink / raw)
  To: segooon; +Cc: kernel-janitors, jreuter, ralf, linux-hams, netdev, linux-kernel
In-Reply-To: <1289412757-11411-1-git-send-email-segooon@gmail.com>

From: Vasiliy Kulikov <segooon@gmail.com>
Date: Wed, 10 Nov 2010 21:12:36 +0300

> Sometimes ax25_getname() doesn't initialize all members of fsa_digipeater
> field of fsa struct, also the struct has padding bytes between
> sax25_call and sax25_ndigis fields.  This structure is then copied to
> userland.  It leads to leaking of contents of kernel stack memory.
> 
> Signed-off-by: Vasiliy Kulikov <segooon@gmail.com>

Looks good to me, applied.

Thanks!

^ permalink raw reply

* [PATCH v2] net: ax25: fix information leak to userland
From: Vasiliy Kulikov @ 2010-11-10 18:12 UTC (permalink / raw)
  To: kernel-janitors
  Cc: Joerg Reuter, Ralf Baechle, David S. Miller, linux-hams, netdev,
	linux-kernel

Sometimes ax25_getname() doesn't initialize all members of fsa_digipeater
field of fsa struct, also the struct has padding bytes between
sax25_call and sax25_ndigis fields.  This structure is then copied to
userland.  It leads to leaking of contents of kernel stack memory.

Signed-off-by: Vasiliy Kulikov <segooon@gmail.com>
---
 Compile tested.

 net/ax25/af_ax25.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ax25/af_ax25.c b/net/ax25/af_ax25.c
index 26eaebf..bb86d29 100644
--- a/net/ax25/af_ax25.c
+++ b/net/ax25/af_ax25.c
@@ -1392,6 +1392,7 @@ static int ax25_getname(struct socket *sock, struct sockaddr *uaddr,
 	ax25_cb *ax25;
 	int err = 0;
 
+	memset(fsa, 0, sizeof(fsa));
 	lock_sock(sk);
 	ax25 = ax25_sk(sk);
 
@@ -1403,7 +1404,6 @@ static int ax25_getname(struct socket *sock, struct sockaddr *uaddr,
 
 		fsa->fsa_ax25.sax25_family = AF_AX25;
 		fsa->fsa_ax25.sax25_call   = ax25->dest_addr;
-		fsa->fsa_ax25.sax25_ndigis = 0;
 
 		if (ax25->digipeat != NULL) {
 			ndigi = ax25->digipeat->ndigi;
-- 
1.7.0.4


^ permalink raw reply related

* Re: [PATCH] Prevent reading uninitialized memory with socket filters
From: David Miller @ 2010-11-10 18:07 UTC (permalink / raw)
  To: drosenberg; +Cc: netdev, stable, security
In-Reply-To: <1289387567.7380.63.camel@dan>

From: Dan Rosenberg <drosenberg@vsecurity.com>
Date: Wed, 10 Nov 2010 06:12:47 -0500

> 
>> 
>> Prove it.
> 
> I hope this was a joke.

It absolutely is not.

You are very much not the first person ever to try and add an
expensive memset() here.

So the onus is really on you to prove this assertion and show the
exact code path by which the user can actually see any uninitialized
kernel stack memory (he can't, he can peek at certain values in a
certain extremely contrived range, making the leak useless), rather
than point us at some web external site archive of a list posting
which we cannot easily quote and reply to here.

I think you cannot do it, really.  Except in the AF_PACKET case, the
sockets can only see "0" or a negative error code, not the actual
sk_run_filter() return value.

In the one exception, AF_PACKET, the range of values the user can
see are in the range of MTU of the device being accessed, which
realistically is 1500 bytes.  This means the user cannot see any
kernel stack value outside of the range 0 to 1500, which isn't
worth using this expensive memset to guard against at all.

I don't even think it's worth adding all of the extra cpu cycles
incurred by Eric Dumazet's scheme of using a bitmap test on every
single memory buffer access.

^ permalink raw reply

* Re: [PATCH] macvlan: lockless tx path
From: Ben Greear @ 2010-11-10 17:53 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, Patrick McHardy, netdev
In-Reply-To: <1289411027.2860.248.camel@edumazet-laptop>

On 11/10/2010 09:43 AM, Eric Dumazet wrote:
> Le mercredi 10 novembre 2010 à 09:39 -0800, Ben Greear a écrit :
>
>>>    /**
>>> - *	struct macvlan_rx_stats - MACVLAN percpu rx stats
>>> + *	struct macvlan_pcpu_stats - MACVLAN percpu stats
>>>     *	@rx_packets: number of received packets
>>>     *	@rx_bytes: number of received bytes
>>>     *	@rx_multicast: number of received multicast packets
>>> + *	@tx_
>>
>> Minor nit..seems you missed a few there?
>>
>
> Arg... you're right !
>
>>>     *	@syncp: synchronization point for 64bit counters
>>>     *	@rx_errors: number of errors
>>>     */
>>> -struct macvlan_rx_stats {
>>> +struct macvlan_pcpu_stats {
>>>    	u64			rx_packets;
>>>    	u64			rx_bytes;
>>>    	u64			rx_multicast;
>>> +	u64			tx_packets;
>>> +	u64			tx_bytes;
>>>    	struct u64_stats_sync	syncp;
>>>    	unsigned long		rx_errors;
>>> +	unsigned long		tx_dropped;
>>
>> Any reason to not also make those u64?
>>
>
> Well, they are supposed to be not incremented often, and they are packet
> counts only, so a wrap around in less than 5 seconds is very unlikely.

I agree, but if these can be read from user-space, it can be tricky to make
solid code to deal with wraps when the thing wrapping can be 32 or 64 bits,
depending on whether the kernel is compiled 32-bit or 64-bit.

So, my preference is to use u32 or u64 so there is no guesswork involved.

To be sure, this problem exists in lots of places already (/proc/net/dev comes to mind),
but the fewer places the better in my opinion.

That said, I don't feel too strongly about it, so if you want to keep these
stats as they are, I shall argue no more :)

Thanks,
Ben


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply

* Re: [PATCH] Fix CAN info leak/minor heap overflow
From: David Miller @ 2010-11-10 17:51 UTC (permalink / raw)
  To: socketcan; +Cc: urs, netdev, drosenberg, security, torvalds
In-Reply-To: <4CDA412B.90900@hartkopp.net>

From: Oliver Hartkopp <socketcan@hartkopp.net>
Date: Wed, 10 Nov 2010 07:52:27 +0100

> IMHO the patch improves the historic situation and fixes the useless leakage
> of kernel addresses. Please consider to apply that procfs changes.

I'm only fine with fixing the kernel pointer fields in some way.

But moving forward any other change to the procfs file is simply
a waste of time.

You should create sysfs files and add logic to your tools to look
for them and use them if they exist.

Your forward path _SHOULD NOT_ be continuing this procfs versioning
madness.  Use something sane and do the work to make userland start
to be ready for this transition.

^ permalink raw reply

* Re: [PATCH v15 00/17] Provide a zero-copy method on KVM virtio-net.
From: David Miller @ 2010-11-10 17:46 UTC (permalink / raw)
  To: xiaohui.xin; +Cc: netdev, kvm, linux-kernel, mst, mingo, herbert, jdike
In-Reply-To: <1289381008-5484-1-git-send-email-xiaohui.xin@intel.com>

From: xiaohui.xin@intel.com
Date: Wed, 10 Nov 2010 17:23:28 +0800

> From: Xin Xiaohui <xiaohui.xin@intel.com>
> 
>>2) The idea to key off of skb->dev in skb_release_data() is
>>   fundamentally flawed since many actions can change skb->dev on you,
>>   which will end up causing a leak of your external data areas.
> 
> How about this one? If the destructor_arg is not a good candidate,
> then I have to add an apparent field in shinfo.

If destructor_arg is actually a net_device pointer or similar,
you will need to take a reference count on it or similar.

Which means --> good bye performance especially on SMP.

You're going to be adding new serialization points and at
least two new atomics per packet.

^ permalink raw reply

* Re: [PATCH] macvlan: lockless tx path
From: Eric Dumazet @ 2010-11-10 17:43 UTC (permalink / raw)
  To: Ben Greear; +Cc: David Miller, Patrick McHardy, netdev
In-Reply-To: <4CDAD8C8.20807@candelatech.com>

Le mercredi 10 novembre 2010 à 09:39 -0800, Ben Greear a écrit :

> >   /**
> > - *	struct macvlan_rx_stats - MACVLAN percpu rx stats
> > + *	struct macvlan_pcpu_stats - MACVLAN percpu stats
> >    *	@rx_packets: number of received packets
> >    *	@rx_bytes: number of received bytes
> >    *	@rx_multicast: number of received multicast packets
> > + *	@tx_
> 
> Minor nit..seems you missed a few there?
> 

Arg... you're right !

> >    *	@syncp: synchronization point for 64bit counters
> >    *	@rx_errors: number of errors
> >    */
> > -struct macvlan_rx_stats {
> > +struct macvlan_pcpu_stats {
> >   	u64			rx_packets;
> >   	u64			rx_bytes;
> >   	u64			rx_multicast;
> > +	u64			tx_packets;
> > +	u64			tx_bytes;
> >   	struct u64_stats_sync	syncp;
> >   	unsigned long		rx_errors;
> > +	unsigned long		tx_dropped;
> 
> Any reason to not also make those u64?
> 

Well, they are supposed to be not incremented often, and they are packet
counts only, so a wrap around in less than 5 seconds is very unlikely.




^ permalink raw reply

* Your email id has won 750,000 GBP in gnld xmas promo.send info
From: richard-mullen @ 2010-11-10 17:43 UTC (permalink / raw)



Name::::::






Country::::::::

^ permalink raw reply

* Re: [PATCH] macvlan: lockless tx path
From: Ben Greear @ 2010-11-10 17:39 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, Patrick McHardy, netdev
In-Reply-To: <1289403709.2860.216.camel@edumazet-laptop>

On 11/10/2010 07:41 AM, Eric Dumazet wrote:
> macvlan is a stacked device, like tunnels. We should use the lockless
> mechanism we are using in tunnels and loopback.
>
> This patch completely removes locking in TX path.
>
> tx stat counters are added into existing percpu stat structure, renamed
> from rx_stats to pcpu_stats.
>
> Note : this reverts commit 2c11455321f37 (macvlan: add multiqueue
> capability)
>
> Signed-off-by: Eric Dumazet<eric.dumazet@gmail.com>
> ---
>   drivers/net/macvlan.c      |   72 +++++++++++++----------------------
>   include/linux/if_macvlan.h |   26 +++++++-----
>   2 files changed, 43 insertions(+), 55 deletions(-)
>

> diff --git a/include/linux/if_macvlan.h b/include/linux/if_macvlan.h
> index 8a2fd66..3779b5f 100644
> --- a/include/linux/if_macvlan.h
> +++ b/include/linux/if_macvlan.h
> @@ -25,19 +25,23 @@ struct macvlan_port;
>   struct macvtap_queue;
>
>   /**
> - *	struct macvlan_rx_stats - MACVLAN percpu rx stats
> + *	struct macvlan_pcpu_stats - MACVLAN percpu stats
>    *	@rx_packets: number of received packets
>    *	@rx_bytes: number of received bytes
>    *	@rx_multicast: number of received multicast packets
> + *	@tx_

Minor nit..seems you missed a few there?

>    *	@syncp: synchronization point for 64bit counters
>    *	@rx_errors: number of errors
>    */
> -struct macvlan_rx_stats {
> +struct macvlan_pcpu_stats {
>   	u64			rx_packets;
>   	u64			rx_bytes;
>   	u64			rx_multicast;
> +	u64			tx_packets;
> +	u64			tx_bytes;
>   	struct u64_stats_sync	syncp;
>   	unsigned long		rx_errors;
> +	unsigned long		tx_dropped;

Any reason to not also make those u64?


Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply

* Re: sk->sk_socket seems to disappear before connection termination
From: Eric Dumazet @ 2010-11-10 17:37 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: David Howells, Netfilter Developer Mailing List, netdev,
	Rafał Maj
In-Reply-To: <alpine.LNX.2.01.1011101816330.2886@obet.zrqbmnf.qr>

Le mercredi 10 novembre 2010 à 18:17 +0100, Jan Engelhardt a écrit :
> On Wednesday 2010-11-10 17:44, Eric Dumazet wrote:
> >
> >[ 9920.234680] ipt_LOG: sk=ffff880118bd32c0 sk->sk_socket=ffff88011d0d8c00 file=ffff88011cd4e100
> >[ 9920.234731] IN= OUT=eth1 SRC=192.168.20.108 DST=192.168.20.110 LEN=52 TOS=0x10 PREC=0x00 TTL=64 ID=63704 DF PROTO=TCP SPT=60088 DPT=22 WINDOW=35 RES=0x00 ACK FIN URGP=0 UID=0 GID=0 
> >[ 9920.235221] ipt_LOG: sk=ffff880078998000 sk->sk_socket=ffff880078c58300 file=          (null)
> >[ 9920.235271] IN= OUT=eth1 SRC=192.168.20.108 DST=192.168.20.110 LEN=52 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=60088 DPT=22 WINDOW=35 RES=0x00 ACK URGP=0 
> >
> >You can see in my log, that the last packet seems to be from a different
> >socket ! (sk pointer changed to ffff880078998000 !)
> 
> Yes, that's it.
> 
> >Well well well, thats an ACK, in answer to FIN packet received from remote
> >side.
> 
> But why is it not handled by sk ffff880118bd32c0 anymore?
> It does have, after all, the same (addr,port) tuple.
> And it is sort of a hiccup for xt_owner users.

Its because of TIMEWAIT state : no more socket

We use a special tcp socket (net->ipv4.tcp_sock) in tcp_v4_send_ack()

Its yet another contention point on SMP machines :(




^ permalink raw reply

* Re: sk->sk_socket seems to disappear before connection termination
From: Jan Engelhardt @ 2010-11-10 17:17 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Howells, Netfilter Developer Mailing List, netdev,
	Rafał Maj
In-Reply-To: <1289407451.2860.239.camel@edumazet-laptop>


On Wednesday 2010-11-10 17:44, Eric Dumazet wrote:
>
>[ 9920.234680] ipt_LOG: sk=ffff880118bd32c0 sk->sk_socket=ffff88011d0d8c00 file=ffff88011cd4e100
>[ 9920.234731] IN= OUT=eth1 SRC=192.168.20.108 DST=192.168.20.110 LEN=52 TOS=0x10 PREC=0x00 TTL=64 ID=63704 DF PROTO=TCP SPT=60088 DPT=22 WINDOW=35 RES=0x00 ACK FIN URGP=0 UID=0 GID=0 
>[ 9920.235221] ipt_LOG: sk=ffff880078998000 sk->sk_socket=ffff880078c58300 file=          (null)
>[ 9920.235271] IN= OUT=eth1 SRC=192.168.20.108 DST=192.168.20.110 LEN=52 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=60088 DPT=22 WINDOW=35 RES=0x00 ACK URGP=0 
>
>You can see in my log, that the last packet seems to be from a different
>socket ! (sk pointer changed to ffff880078998000 !)

Yes, that's it.

>Well well well, thats an ACK, in answer to FIN packet received from remote
>side.

But why is it not handled by sk ffff880118bd32c0 anymore?
It does have, after all, the same (addr,port) tuple.
And it is sort of a hiccup for xt_owner users.

^ permalink raw reply

* Re: [PATCH] ucc_geth: Fix hung tasks.
From: Joakim Tjernlund @ 2010-11-10 16:57 UTC (permalink / raw)
  Cc: Anton Vorontsov, netdev
In-Reply-To: <OF7C87528F.1727C94B-ONC12577D7.004DB53C-C12577D7.004DF235@LocalDomain>

Joakim Tjernlund/Transmode wrote on 2010/11/10 15:11:22:
>
> Actually, there is something wrong anyway with TX timeout
> so don't use this patch. I must investigate more but
> it seems like cancel_work_sync hangs whenever an TX timeout
> occurs.

OK, found the problem. Currently ucc_geth bring the IF down and up
each time a TX timeout occurs which means you cannot do cancel_work_sync()
in ucc_geth_close as it will dead lock.

Looking at gianfar, it just reinits the controller and PHY and
I guess ucc_geth really should do the same.
This patch tries to do that but I am not sure it recovers
after a TX timeout.

Anton, what do think? If OK with you I will write up
a proper patch.

 diff --git a/drivers/net/ucc_geth.c b/drivers/net/ucc_geth.c
index 6647ed7..133aaba 100644
--- a/drivers/net/ucc_geth.c
+++ b/drivers/net/ucc_geth.c
@@ -2065,9 +2065,6 @@ static void ucc_geth_stop(struct ucc_geth_private *ugeth)
 	/* Disable Rx and Tx */
 	clrbits32(&ug_regs->maccfg1, MACCFG1_ENABLE_RX | MACCFG1_ENABLE_TX);

-	phy_disconnect(ugeth->phydev);
-	ugeth->phydev = NULL;
-
 	ucc_geth_memclean(ugeth);
 }

@@ -3558,6 +3555,8 @@ static int ucc_geth_close(struct net_device *dev)

 	cancel_work_sync(&ugeth->timeout_work);
 	ucc_geth_stop(ugeth);
+	phy_disconnect(ugeth->phydev);
+	ugeth->phydev = NULL;

 	free_irq(ugeth->ug_info->uf_info.irq, ugeth->ndev);

@@ -3586,8 +3585,12 @@ static void ucc_geth_timeout_work(struct work_struct *work)
 		 * Must reset MAC *and* PHY. This is done by reopening
 		 * the device.
 		 */
-		ucc_geth_close(dev);
-		ucc_geth_open(dev);
+		netif_tx_stop_all_queues(dev);
+		ucc_geth_stop(ugeth);
+		ucc_geth_init_mac(ugeth);
+		/* Must start PHY here */
+		phy_start(ugeth->phydev);
+		netif_tx_start_all_queues(dev);
 	}

 	netif_tx_schedule_all(dev);

>
> Joakim Tjernlund/Transmode wrote on 2010/11/10 13:05:28:
> >
> > Ping?
> >
> > Even though this patch didn't solve my hang it is still a bug.
> >
> >      Jocke
> >
> > Joakim Tjernlund <Joakim.Tjernlund@transmode.se> wrote on 2010/11/08 11:23:39:
> >
> > > From: Joakim Tjernlund <Joakim.Tjernlund@transmode.se>
> > > To: linuxppc-dev@lists.ozlabs.org, netdev@vger.kernel.org, Anton Vorontsov <avorontsov@ru.mvista.com>
> > > Cc: Joakim Tjernlund <Joakim.Tjernlund@transmode.se>
> > > Date: 2010/11/08 11:23
> > > Subject: [PATCH] ucc_geth: Fix hung tasks.
> > >
> > > We noticed a few hangs like this:
> > >
> > > INFO: task ifconfig:572 blocked for more than 120 seconds.
> > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > ifconfig      D 0ff65760     0   572    369 0x00000000
> > > Call Trace:
> > > [c6157be0] [c6008460] 0xc6008460 (unreliable)
> > > [c6157ca0] [c0008608] __switch_to+0x4c/0x6c
> > > [c6157cb0] [c028fecc] schedule+0x184/0x310
> > > [c6157ce0] [c0290e54] __mutex_lock_slowpath+0xa4/0x150
> > > [c6157d20] [c0290c48] mutex_lock+0x44/0x48
> > > [c6157d30] [c01aba74] phy_stop+0x20/0x70
> > > [c6157d40] [c01aef40] ucc_geth_stop+0x30/0x98
> > > [c6157d60] [c01b18fc] ucc_geth_close+0x9c/0xdc
> > > [c6157d80] [c01db0cc] __dev_close+0xa0/0xd0
> > > [c6157d90] [c01deddc] __dev_change_flags+0x8c/0x148
> > > [c6157db0] [c01def54] dev_change_flags+0x1c/0x64
> > > [c6157dd0] [c0237ac8] devinet_ioctl+0x678/0x784
> > > [c6157e50] [c0239a58] inet_ioctl+0xb0/0xbc
> > > [c6157e60] [c01cafa8] sock_ioctl+0x174/0x2a0
> > > [c6157e80] [c009a16c] vfs_ioctl+0xcc/0xe0
> > > [c6157ea0] [c009a998] do_vfs_ioctl+0xc4/0x79c
> > > [c6157f10] [c009b0b0] sys_ioctl+0x40/0x74
> > > [c6157f40] [c00117c4] ret_from_syscall+0x0/0x38
> > >
> > > I THINK this is due to a missing cancel_work_sync in the driver
> > > although we cannot be sure. I found this by comparing
> > > ucc_geth with gianfar.
> > >
> > > Signed-off-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se>
> > > ---
> > >  drivers/net/ucc_geth.c |    1 +
> > >  1 files changed, 1 insertions(+), 0 deletions(-)
> > >
> > > diff --git a/drivers/net/ucc_geth.c b/drivers/net/ucc_geth.c
> > > index 97f9f7d..6647ed7 100644
> > > --- a/drivers/net/ucc_geth.c
> > > +++ b/drivers/net/ucc_geth.c
> > > @@ -3556,6 +3556,7 @@ static int ucc_geth_close(struct net_device *dev)
> > >
> > >     napi_disable(&ugeth->napi);
> > >
> > > +   cancel_work_sync(&ugeth->timeout_work);
> > >     ucc_geth_stop(ugeth);
> > >
> > >     free_irq(ugeth->ug_info->uf_info.irq, ugeth->ndev);
> > > --
> > > 1.7.2.2
> > >


^ permalink raw reply related

* Re: sk->sk_socket seems to disappear before connection termination
From: Eric Dumazet @ 2010-11-10 16:44 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: David Howells, Netfilter Developer Mailing List, netdev,
	Rafał Maj
In-Reply-To: <alpine.LNX.2.01.1011101148060.18018@obet.zrqbmnf.qr>

Le mercredi 10 novembre 2010 à 11:53 +0100, Jan Engelhardt a écrit :
> On Wednesday 2010-11-10 06:47, Eric Dumazet wrote:
> >Le mercredi 10 novembre 2010 à 02:09 +0100, Jan Engelhardt a écrit :
> >> Hi,
> >> 
> >> Rafał reported this to us on IRC, paraphrasing what has been observed:
> >> 
> >> Using a simple rule like `iptables -A OUTPUT -p tcp --dport 80 -j LOG 
> >> --log-uid`, one can observe on creating a connection and terminating
> >> it that the trailing packets have skb->sk->sk_socket == NULL.
> >> Is this intended? Is the socket not retained until after TCP has
> >> sent out the closing exchange?
> >> 
> >> As I can reproduce:
> >> 
> >> $ telnet 134.76.13.21 80
> >> Trying 134.76.13.21...
> >> Connected to 134.76.13.21.
> >> Escape character is '^]'.
> >> ^]
> >> telnet> ^D
> >> Connection closed.
> >> 
> >> [491419.500978] IN= OUT=tun0 SRC=134.76.2.163 DST=134.76.13.21 LEN=60 TOS=0x10 PREC=0x00 TTL=64 ID=35420 DF PROTO=TCP SPT=58613 DPT=80 WINDOW=5488 RES=0x00 SYN URGP=0 UID=25121 GID=100 
> >> [491419.511533] IN= OUT=tun0 SRC=134.76.2.163 DST=134.76.13.21 LEN=52 TOS=0x10 PREC=0x00 TTL=64 ID=35421 DF PROTO=TCP SPT=58613 DPT=80 WINDOW=86 RES=0x00 ACK URGP=0 UID=25121 GID=100 
> >> [491420.052182] IN= OUT=tun0 SRC=134.76.2.163 DST=134.76.13.21 LEN=52 TOS=0x10 PREC=0x00 TTL=64 ID=35422 DF PROTO=TCP SPT=58613 DPT=80 WINDOW=86 RES=0x00 ACK FIN URGP=0 UID=25121 GID=100 
> >> [491420.063619] IN= OUT=tun0 SRC=134.76.2.163 DST=134.76.13.21 LEN=52 TOS=0x10 PREC=0x00 TTL=64 ID=35423 DF PROTO=TCP SPT=58613 DPT=80 WINDOW=86 RES=0x00 ACK URGP=0 
> >
> >Hmmm... skb->sk->sk_socket is really NULL ?
> >Are you sure its not skb->sk->sk_socket->file which is NULL ?
> 
> I am certain of it, having augmented ipt_LOG/xt_LOGMARK temporarily by 
> appropriate printks.
> 
> >In this case, you might need to use sock_i_uid() / sock_i_ino() as a
> >fallback ? (expensive because they take a rwlock)
> 
> No, sock_i_uid also uses sk->sk_socket. What is interesting though is 
> that sock_i_uid uses SOCK_INODE(sk->sk_socket)->i_uid, but xt_owner uses 
> sk->sk_socket->file->f_cred->fsuid. Would you have an idea as to why 
> that is?
> Dave Howells (cced) did the last change on it.
> 

Hmm, this is not what I get here. Could you please recheck ?

diff --git a/net/ipv4/netfilter/ipt_LOG.c b/net/ipv4/netfilter/ipt_LOG.c
index 72ffc8f..b0933b7 100644
--- a/net/ipv4/netfilter/ipt_LOG.c
+++ b/net/ipv4/netfilter/ipt_LOG.c
@@ -337,6 +337,8 @@ static void dump_packet(struct sbuff *m,
        /* Max length: 15 "UID=4294967295 " */
        if ((logflags & IPT_LOG_UID) && !iphoff && skb->sk) {
                read_lock_bh(&skb->sk->sk_callback_lock);
+               pr_err("sk=%p sk->sk_socket=%p file=%p\n",
+                       skb->sk, skb->sk->sk_socket, skb->sk->sk_socket ? skb->sk->sk_socket->file : NULL);
                if (skb->sk->sk_socket && skb->sk->sk_socket->file)
                        sb_add(m, "UID=%u GID=%u ",
                                skb->sk->sk_socket->file->f_cred->fsuid,


[ 9917.808796] ipt_LOG: sk=ffff880118bd32c0 sk->sk_socket=ffff88011d0d8c00 file=ffff88011cd4e100
[ 9917.808851] IN= OUT=eth1 SRC=192.168.20.108 DST=192.168.20.110 LEN=60 TOS=0x10 PREC=0x00 TTL=64 ID=63701 DF PROTO=TCP SPT=60088 DPT=22 WINDOW=4380 RES=0x00 SYN URGP=0 UID=0 GID=0 
[ 9917.809091] ipt_LOG: sk=ffff880118bd32c0 sk->sk_socket=ffff88011d0d8c00 file=ffff88011cd4e100
[ 9917.809142] IN= OUT=eth1 SRC=192.168.20.108 DST=192.168.20.110 LEN=52 TOS=0x10 PREC=0x00 TTL=64 ID=63702 DF PROTO=TCP SPT=60088 DPT=22 WINDOW=35 RES=0x00 ACK URGP=0 UID=0 GID=0 
[ 9917.814199] ipt_LOG: sk=ffff880118bd32c0 sk->sk_socket=ffff88011d0d8c00 file=ffff88011cd4e100
[ 9917.814251] IN= OUT=eth1 SRC=192.168.20.108 DST=192.168.20.110 LEN=52 TOS=0x10 PREC=0x00 TTL=64 ID=63703 DF PROTO=TCP SPT=60088 DPT=22 WINDOW=35 RES=0x00 ACK URGP=0 UID=0 GID=0 
[ 9920.234680] ipt_LOG: sk=ffff880118bd32c0 sk->sk_socket=ffff88011d0d8c00 file=ffff88011cd4e100
[ 9920.234731] IN= OUT=eth1 SRC=192.168.20.108 DST=192.168.20.110 LEN=52 TOS=0x10 PREC=0x00 TTL=64 ID=63704 DF PROTO=TCP SPT=60088 DPT=22 WINDOW=35 RES=0x00 ACK FIN URGP=0 UID=0 GID=0 
[ 9920.235221] ipt_LOG: sk=ffff880078998000 sk->sk_socket=ffff880078c58300 file=          (null)
[ 9920.235271] IN= OUT=eth1 SRC=192.168.20.108 DST=192.168.20.110 LEN=52 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=60088 DPT=22 WINDOW=35 RES=0x00 ACK URGP=0 


You can see in my log, that the last packet seems to be from a different socket !
(sk pointer changed to ffff880078998000 !)

Well well well, thats an ACK, in answer to FIN packet received from remote side.






^ permalink raw reply related

* Re: [PATCH 1/1] UDEV - Add 'udevlom' command line param to start_udev
From: Harald Hoyer @ 2010-11-10 16:37 UTC (permalink / raw)
  To: Narendra_K
  Cc: linux-hotplug, netdev, Matt_Domsch, Jordan_Hargrave, Charles_Rose
In-Reply-To: <4CDAC930.4010801@redhat.com>

On 11/10/2010 05:32 PM, Harald Hoyer wrote:
> On 11/03/2010 05:55 PM, Narendra_K@Dell.com wrote:
>> Hello,
>>
>> This patch allows users to specify if they want the onboard network
>> interfaces to be renamed to lomN by implementing a command line param
>> 'udevlom'.
>>
>> From: Narendra K<narendra_k@dell.com>
>> Subject: [PATCH] UDEV - Add 'udevlom' command line param to start_udev
>>
>> This patch implements 'udevlom' command line parameter, which
>> when passed, results in onboard network interfaces getting
>> renamed to lomN.
>>
>> Signed-off-by: Narendra K<narendra_k@dell.com>
>> ---
>> start_udev | 5 +++++
>> 1 files changed, 5 insertions(+), 0 deletions(-)
>>
>> diff --git a/start_udev b/start_udev
>> index 49fc286..57d60c9 100755
>> --- a/start_udev
>> +++ b/start_udev
>> @@ -32,6 +32,7 @@ export TZ=/etc/localtime
>> . /etc/init.d/functions
>>
>> prog=udev
>> +cmdline=`cat /proc/cmdline`
>>
>> touch_recursive() {
>> ( cd $1;
>> @@ -60,6 +61,10 @@ fi
>>
>> ret=$[$ret + $?]
>>
>> +if strstr "$cmdline" udevlom; then
>> + /sbin/udevadm control --env=UDEVLOM="y"
>> +fi
>> +
>> /sbin/udevadm trigger --type=subsystems --action=add
>> /sbin/udevadm trigger --type=devices --action=add
>> /sbin/udevadm settle
>
> start_udev is obsolete with the use of systemd service files anyway in Fedora>=15

not saying that we really should use "udevlom" on the kernel command line, but 
you could use:

IMPORT{cmdline}="udevlom"
KERNEL="eth*", ENV{udevlom}==1, ....

^ permalink raw reply

* Re: [PATCH 1/1] UDEV - Add 'udevlom' command line param to start_udev
From: Harald Hoyer @ 2010-11-10 16:32 UTC (permalink / raw)
  To: Narendra_K
  Cc: linux-hotplug, netdev, Matt_Domsch, Jordan_Hargrave, Charles_Rose
In-Reply-To: <20101103165505.GA3281@fedora-14-r710.oslab.blr.amer.dell.com>

On 11/03/2010 05:55 PM, Narendra_K@Dell.com wrote:
> Hello,
>
> This patch allows users to specify if they want the onboard network
> interfaces to be renamed to lomN by implementing a command line param
> 'udevlom'.
>
> From: Narendra K<narendra_k@dell.com>
> Subject: [PATCH] UDEV - Add 'udevlom' command line param to start_udev
>
> This patch implements 'udevlom' command line parameter, which
> when passed, results in onboard network interfaces getting
> renamed to lomN.
>
> Signed-off-by: Narendra K<narendra_k@dell.com>
> ---
>   start_udev |    5 +++++
>   1 files changed, 5 insertions(+), 0 deletions(-)
>
> diff --git a/start_udev b/start_udev
> index 49fc286..57d60c9 100755
> --- a/start_udev
> +++ b/start_udev
> @@ -32,6 +32,7 @@ export TZ=/etc/localtime
>   . /etc/init.d/functions
>
>   prog=udev
> +cmdline=`cat /proc/cmdline`
>
>   touch_recursive() {
>   	( cd $1;
> @@ -60,6 +61,10 @@ fi
>
>   ret=$[$ret + $?]
>
> +if strstr "$cmdline" udevlom; then
> +	/sbin/udevadm control --env=UDEVLOM="y"
> +fi
> +
>   /sbin/udevadm trigger --type=subsystems --action=add
>   /sbin/udevadm trigger --type=devices --action=add
>   /sbin/udevadm settle

start_udev is obsolete with the use of systemd service files anyway in Fedora>=15

^ permalink raw reply

* Re: [PATCH 0/2] net: Changes in queue allocation and freeing
From: Tom Herbert @ 2010-11-10 16:27 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: davem, netdev
In-Reply-To: <1289385696.2860.147.camel@edumazet-laptop>

On Wed, Nov 10, 2010 at 2:41 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le mardi 09 novembre 2010 à 12:47 -0800, Tom Herbert a écrit :
>> Changes to both RX and TX queue allocation.  In both cases allocate
>> in alloc_netdev_mq and free in free_netdev.  For RX the reference
>> couting also changed, the device reference count can now be used.
>
> Oh well :)
>
> Are they preliminary patches so that XPS also dont need the "reference
> counts specific to TX queues" ? ;)
>
Yes, this should allow the xps maps to be in the net_device also.
Sorry I neglected to mention that.

Also I noticed that the comment about RX queues refcnts is no longer
valid.  I can respin patch if necessary.

diff --git a/net/core/dev.c b/net/core/dev.c
index 87d89ba..34a42a8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5029,10 +5029,6 @@ static int netif_alloc_rx_queues(struct net_device *dev)
        }
        dev->_rx = rx;

-       /*
-        * Set a pointer to first element in the array which holds the
-        * reference count.
-        */
        for (i = 0; i < count; i++)
                rx[i].dev = dev;
 #endif

>
>
>

^ permalink raw reply related

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
From: Michael S. Tsirkin @ 2010-11-10 16:16 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, netdev, rusty
In-Reply-To: <OFFE63B7F4.D491BE12-ON652577D6.005B4090-652577D6.005F565A@in.ibm.com>

On Tue, Nov 09, 2010 at 10:54:57PM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote on 11/09/2010 09:03:25 PM:
> 
> > > > Something strange here, right?
> > > > 1. You are consistently getting >10G/s here, and even with a single
> > > stream?
> > >
> > > Sorry, I should have mentioned this though I had stated in my
> > > earlier mails. Each test result has two iterations, each of 60
> > > seconds, except when #netperfs is 1 for which I do 10 iteration
> > > (sum across 10 iterations).
> >
> > So need to divide the number by 10?
> 
> Yes, that is what I get with 512/1K macvtap I/O size :)
> 
> > >  I started doing many more iterations
> > > for 1 netperf after finding the issue earlier with single stream.
> > > So the BW is only 4.5-7 Gbps.
> > >
> > > > 2. With 2 streams, is where we get < 10G/s originally. Instead of
> > > >    doubling that we get a marginal improvement with 2 queues and
> > > >    about 30% worse with 1 queue.
> > >
> > > (doubling happens consistently for guest -> host, but never for
> > > remote host) I tried 512/txqs=2 and 1024/txqs=8 to get a varied
> > > testing scenario. In first case, there is a slight improvement in
> > > BW and good reduction in SD. In the second case, only SD improves
> > > (though BW drops for 2 stream for some reason).  In both cases,
> > > BW and SD improves as the number of sessions increase.
> >
> > I guess this is another indication that something's wrong.
> 
> The patch - both virtio-net and vhost-net, doesn't have any
> locking/mutex's/ or any synchronization method. Guest -> host
> performance improvement of upto 100% shows the patch is not
> doing anything wrong.

My concern is this: we don't seem to do anything in tap or macvtap to
help packets from separate virtio queues get to separate queues in the
hardware device and to avoid reordering when we do this.

- skb_tx_hash calculation will get different results
- hash math that e.g. tcp does will run on guest and seems to be discarded

etc

Maybe it's as simple as some tap/macvtap ioctls to set up the queue number
in skbs. Or maybe we need to pass the skb hash from guest to host.
It's this last option that should make us especially cautios as it'll
affect guest/host interface.

Also see d5a9e24afb4ab38110ebb777588ea0bd0eacbd0a: if we have
hardware which records an RX queue, it appears important to
pass that info to guest and to use that in selecting the TX queue.
Of course we won't see this in netperf runs but this needs to
be given thought too - supporting this seems to suggest either
sticking the hash in the virtio net header for both tx and rx,
or using multiplease RX queues.

> > We are quite far from line rate, the fact BW does not scale
> > means there's some contention in the code.
> 
> Attaining line speed with macvtap seems to be a generic issue
> and unrelated to my patch specifically. IMHO if there is nothing
> wrong in the code (review) and is accepted, it will benefit as
> others can also help to find what needs to be implemented in
> vhost/macvtap/qemu to get line speed for guest->remote-host.

No problem, I will queue these patches in some branch
to help enable cooperation, as well as help you
iterate with incremental patches instead of resending it all each time.

> PS: bare-metal performance for host->remote-host is also
>     2.7 Gbps and 2.8 Gbps for 512/1024 for the same card.
> 
> Thanks,

You mean native linux BW does not scale for your host with
# of connections either? I guess this just means need another
setup for testing?

> - KK

^ permalink raw reply

* Re: [PATCH 3/3] net: tipc: fix information leak to userland
From: Vasiliy Kulikov @ 2010-11-10 15:54 UTC (permalink / raw)
  To: walter harms
  Cc: David Miller, kernel-janitors, jon.maloy, allan.stephens,
	tipc-discussion, netdev, linux-kernel
In-Reply-To: <4CDA88FE.8040801@bfs.de>

On Wed, Nov 10, 2010 at 12:58 +0100, walter harms wrote:
> NTL the core problem was that sizeof sa_data is 14 while dev->name is IFNAMESZ=15.

With this code it is NOT a bug because the output buffer is much bigger
than 14 (128 bytes).  I think it was just designed to overflow 14 bytes,
assign sa_data[14] = 0 and ignore it (lack of snprintf() those days?).

Anywhere else sa_data[14] = ... is a bug.

-- 
Vasiliy

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox