Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: Unable to flush ICMP redirect routes in kernel 3.0+
From: Eric Dumazet @ 2011-11-17 16:45 UTC (permalink / raw)
  To: Flavio Leitner; +Cc: David Miller, Ivan Zahariev, netdev, Vasiliy Kulikov
In-Reply-To: <20111117144038.61b0525b@asterix.rh>

Le jeudi 17 novembre 2011 à 14:40 -0200, Flavio Leitner a écrit :
> On Thu, 17 Nov 2011 17:31:50 +0100
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > Le jeudi 17 novembre 2011 à 13:37 -0200, Flavio Leitner a écrit :
> > > On Thu, 17 Nov 2011 15:40:20 +0100
> > > Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > 
> > > > [PATCH] ping: dont increment ICMP_MIB_INERRORS
> > > > 
> > > > ping module incorrectly increments ICMP_MIB_INERRORS if feeded
> > > > with a frame not belonging to its own sockets.
> > > > 
> > > > RFC 2011 states that ICMP_MIB_INERRORS should count "the number of
> > > > ICMP messages which the entiry received but determined as having
> > > > ICMP-specific errors (bad ICMP checksums, bad length, etc.)."
> > > > 
> > > > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> > > > CC: Vasiliy Kulikov <segoon@openwall.com>
> > > 
> > > Yeah, they aren't ICMP specific errors and the callers already
> > > checked for checksum, lengths, and etc.. increasing that counter
> > > when necessary.
> > > 
> > > Acked-by: Flavio Leitner <fbl@redhat.com>
> > 
> > Thanks
> > 
> > By the way, redirects dont work at all in net-next
> 
> Could you be more specific? It seems to be working here.
> 
> > Probably coming from your commit 7cc9150ebe8ec0
> > (route: fix ICMP redirect validation)
> > 
> > Since calling __ip_route_output_key() will create the route with s =
> > 0, l = 0  (forcing saddr and dev->ifindex) selectors...
> > 
> > We have to add a 'create' parameter to __ip_route_output_key() so that
> > ip_rt_redirect() doesnt create a route, only find the existing one in
> > cache ?
> 
> It should receive redirect after sending a packet, so the route
> should be ready at this point. 

I receive the redirect, but the rt->peer is updated on a different route
than the one used by my ping command.

Its updated on the specific route (source address forced, output device
forced), not on the wildcarded route my ping is using.

So next packets are still sent on old gateway...

^ permalink raw reply

* Re: Unable to flush ICMP redirect routes in kernel 3.0+
From: Flavio Leitner @ 2011-11-17 16:40 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, Ivan Zahariev, netdev, Vasiliy Kulikov
In-Reply-To: <1321547510.2751.66.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

On Thu, 17 Nov 2011 17:31:50 +0100
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Le jeudi 17 novembre 2011 à 13:37 -0200, Flavio Leitner a écrit :
> > On Thu, 17 Nov 2011 15:40:20 +0100
> > Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > 
> > > [PATCH] ping: dont increment ICMP_MIB_INERRORS
> > > 
> > > ping module incorrectly increments ICMP_MIB_INERRORS if feeded
> > > with a frame not belonging to its own sockets.
> > > 
> > > RFC 2011 states that ICMP_MIB_INERRORS should count "the number of
> > > ICMP messages which the entiry received but determined as having
> > > ICMP-specific errors (bad ICMP checksums, bad length, etc.)."
> > > 
> > > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> > > CC: Vasiliy Kulikov <segoon@openwall.com>
> > 
> > Yeah, they aren't ICMP specific errors and the callers already
> > checked for checksum, lengths, and etc.. increasing that counter
> > when necessary.
> > 
> > Acked-by: Flavio Leitner <fbl@redhat.com>
> 
> Thanks
> 
> By the way, redirects dont work at all in net-next

Could you be more specific? It seems to be working here.

> Probably coming from your commit 7cc9150ebe8ec0
> (route: fix ICMP redirect validation)
> 
> Since calling __ip_route_output_key() will create the route with s =
> 0, l = 0  (forcing saddr and dev->ifindex) selectors...
> 
> We have to add a 'create' parameter to __ip_route_output_key() so that
> ip_rt_redirect() doesnt create a route, only find the existing one in
> cache ?

It should receive redirect after sending a packet, so the route
should be ready at this point. 

fbl

^ permalink raw reply

* Re: root_lock vs. device's TX lock
From: Eric Dumazet @ 2011-11-17 16:34 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Linux Netdev List
In-Reply-To: <CA+mtBx8sHDprEct3zgGXXKEESK6i0FYnFS-_z35FOyVQjS9R4Q@mail.gmail.com>

Le jeudi 17 novembre 2011 à 08:10 -0800, Tom Herbert a écrit :
> From sch_direct_xmit:
> 
>         /* And release qdisc */
>         spin_unlock(root_lock);
> 
>         HARD_TX_LOCK(dev, txq, smp_processor_id());
>         if (!netif_tx_queue_frozen_or_stopped(txq))
>                 ret = dev_hard_start_xmit(skb, dev, txq);
> 
>         HARD_TX_UNLOCK(dev, txq);
> 
>         spin_lock(root_lock);
> 
> This is a lot of lock manipulation to basically switch from one lock
> to another and possibly thrashing just to send a packet.  I am
> thinking that if the there is a 1-1 correspondence between qdisc and
> device queue then we could actually use the device's lock as the root
> lock for the qdisc.  So in that case, we would need to touch any locks
> from sch_direct_xmit (just hold root lock which is already device lock
> for the duration).
> 
> Is there any reason why this couldn't work?

But we have to dirty part of Qdisc anyway ?
(state, bstats, q, ...)

^ permalink raw reply

* [patch net-next] team: replace kmalloc+memcpy by kmemdup
From: Jiri Pirko @ 2011-11-17 16:32 UTC (permalink / raw)
  To: netdev
  Cc: davem, eric.dumazet, bhutchings, shemminger, andy, fbl, jzupka,
	ivecera

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
---
 drivers/net/team/team.c |    8 ++------
 1 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index c48ef19..064155d 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -104,19 +104,15 @@ int team_options_register(struct team *team,
 	if (!dst_opts)
 		return -ENOMEM;
 	for (i = 0; i < option_count; i++, option++) {
-		struct team_option *dst_opt;
-
 		if (__team_find_option(team, option->name)) {
 			err = -EEXIST;
 			goto rollback;
 		}
-		dst_opt = kmalloc(sizeof(*option), GFP_KERNEL);
-		if (!dst_opt) {
+		dst_opts[i] = kmemdup(option, sizeof(*option), GFP_KERNEL);
+		if (!dst_opts[i]) {
 			err = -ENOMEM;
 			goto rollback;
 		}
-		memcpy(dst_opt, option, sizeof(*option));
-		dst_opts[i] = dst_opt;
 	}
 
 	for (i = 0; i < option_count; i++)
-- 
1.7.6

^ permalink raw reply related

* Re: Unable to flush ICMP redirect routes in kernel 3.0+
From: Eric Dumazet @ 2011-11-17 16:31 UTC (permalink / raw)
  To: Flavio Leitner; +Cc: David Miller, Ivan Zahariev, netdev, Vasiliy Kulikov
In-Reply-To: <20111117133724.59684d28@asterix.rh>

Le jeudi 17 novembre 2011 à 13:37 -0200, Flavio Leitner a écrit :
> On Thu, 17 Nov 2011 15:40:20 +0100
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > [PATCH] ping: dont increment ICMP_MIB_INERRORS
> > 
> > ping module incorrectly increments ICMP_MIB_INERRORS if feeded with a
> > frame not belonging to its own sockets.
> > 
> > RFC 2011 states that ICMP_MIB_INERRORS should count "the number of
> > ICMP messages which the entiry received but determined as having
> > ICMP-specific errors (bad ICMP checksums, bad length, etc.)."
> > 
> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> > CC: Vasiliy Kulikov <segoon@openwall.com>
> 
> Yeah, they aren't ICMP specific errors and the callers already
> checked for checksum, lengths, and etc.. increasing that counter
> when necessary.
> 
> Acked-by: Flavio Leitner <fbl@redhat.com>

Thanks

By the way, redirects dont work at all in net-next

Probably coming from your commit 7cc9150ebe8ec0
(route: fix ICMP redirect validation)

Since calling __ip_route_output_key() will create the route with s = 0,
l = 0  (forcing saddr and dev->ifindex) selectors...

We have to add a 'create' parameter to __ip_route_output_key() so that
ip_rt_redirect() doesnt create a route, only find the existing one in
cache ?

^ permalink raw reply

* root_lock vs. device's TX lock
From: Tom Herbert @ 2011-11-17 16:10 UTC (permalink / raw)
  To: Linux Netdev List

>From sch_direct_xmit:

        /* And release qdisc */
        spin_unlock(root_lock);

        HARD_TX_LOCK(dev, txq, smp_processor_id());
        if (!netif_tx_queue_frozen_or_stopped(txq))
                ret = dev_hard_start_xmit(skb, dev, txq);

        HARD_TX_UNLOCK(dev, txq);

        spin_lock(root_lock);

This is a lot of lock manipulation to basically switch from one lock
to another and possibly thrashing just to send a packet.  I am
thinking that if the there is a 1-1 correspondence between qdisc and
device queue then we could actually use the device's lock as the root
lock for the qdisc.  So in that case, we would need to touch any locks
from sch_direct_xmit (just hold root lock which is already device lock
for the duration).

Is there any reason why this couldn't work?

^ permalink raw reply

* Re: Unable to flush ICMP redirect routes in kernel 3.0+
From: Flavio Leitner @ 2011-11-17 15:37 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, Ivan Zahariev, netdev, Vasiliy Kulikov
In-Reply-To: <1321540820.2751.47.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

On Thu, 17 Nov 2011 15:40:20 +0100
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> [PATCH] ping: dont increment ICMP_MIB_INERRORS
> 
> ping module incorrectly increments ICMP_MIB_INERRORS if feeded with a
> frame not belonging to its own sockets.
> 
> RFC 2011 states that ICMP_MIB_INERRORS should count "the number of
> ICMP messages which the entiry received but determined as having
> ICMP-specific errors (bad ICMP checksums, bad length, etc.)."
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> CC: Vasiliy Kulikov <segoon@openwall.com>

Yeah, they aren't ICMP specific errors and the callers already
checked for checksum, lengths, and etc.. increasing that counter
when necessary.

Acked-by: Flavio Leitner <fbl@redhat.com>

^ permalink raw reply

* Re: [PATCH] Don't allow sharing of tx skbs on xen-netfront
From: Ian Campbell @ 2011-11-17 15:20 UTC (permalink / raw)
  To: Neil Horman
  Cc: netdev, David S. Miller, Jeremy Fitzhardinge,
	Konrad Rzeszutek Wilk, xen-devel
In-Reply-To: <1321298544-16434-1-git-send-email-nhorman@tuxdriver.com>

On Mon, 2011-11-14 at 14:22 -0500, Neil Horman wrote:
> It was pointed out to me recently that the xen-netfront driver can't safely
> support shared skbs on transmit, since, while it doesn't maintain skb state
> directly, it does pass a pointer to the skb to the hypervisor via a list, and
> the hypervisor may expect the contents of the skb to remain stable.  Clearing
> the IFF_TX_SKB_SHARING flag after the call to alloc_etherdev to make it safe.

What are the actual constraints here? The skb is used as a handle to the
skb->data and shinfo (frags) and to complete at the end. It's actually
those which are passed to the hypervisor (effectively the same as
passing those addresses to the h/w for DMA).

Which parts of the skb are expected/allowed to not remain stable?

(Appologies if the above seems naive, I seem to have missed the
introduction of shared tx skbs and IFF_TX_SKB_SHARING)

Ian.

> 
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> CC: "David S. Miller" <davem@davemloft.net>
> CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
> CC: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> CC: xen-devel@lists.xensource.com
> ---
>  drivers/net/xen-netfront.c |    6 ++++++
>  1 files changed, 6 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
> index 226faab..fb1077b 100644
> --- a/drivers/net/xen-netfront.c
> +++ b/drivers/net/xen-netfront.c
> @@ -1252,6 +1252,12 @@ static struct net_device * __devinit xennet_create_dev(struct xenbus_device *dev
>  		return ERR_PTR(-ENOMEM);
>  	}
>  
> +	/*
> +	 * Since frames remain on a queue after a return from xennet_start_xmit,
> +	 * we can't support tx shared skbs
> +	 */
> +	netdev->priv_flags &= ~IFF_TX_SKB_SHARING;
> +
>  	np                   = netdev_priv(netdev);
>  	np->xbdev            = dev;
>  

^ permalink raw reply

* Re: [patch net-next 2/2] team: avoid using variable-length array
From: Jiri Pirko @ 2011-11-17 15:10 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, davem, bhutchings, shemminger, andy, fbl, jzupka, ivecera,
	mirqus
In-Reply-To: <1321540278.2751.40.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

Thu, Nov 17, 2011 at 03:31:18PM CET, eric.dumazet@gmail.com wrote:
>Le jeudi 17 novembre 2011 à 15:16 +0100, Jiri Pirko a écrit :
>> Apparently using variable-length array is not correct
>> (https://lkml.org/lkml/2011/10/23/25). So remove it.
>> 
>> Signed-off-by: Jiri Pirko <jpirko@redhat.com>
>> ---
>>  drivers/net/team/team.c |    9 +++++++--
>>  1 files changed, 7 insertions(+), 2 deletions(-)
>> 
>> diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
>> index 5b169c1..c48ef19 100644
>> --- a/drivers/net/team/team.c
>> +++ b/drivers/net/team/team.c
>> @@ -96,10 +96,13 @@ int team_options_register(struct team *team,
>>  			  size_t option_count)
>>  {
>>  	int i;
>> -	struct team_option *dst_opts[option_count];
>> +	struct team_option **dst_opts;
>>  	int err;
>>  
>> -	memset(dst_opts, 0, sizeof(dst_opts));
>> +	dst_opts = kzalloc(sizeof(struct team_option *) * option_count,
>> +			   GFP_KERNEL);
>> +	if (!dst_opts)
>> +		return -ENOMEM;
>>  	for (i = 0; i < option_count; i++, option++) {
>>  		struct team_option *dst_opt;
>>  
>> @@ -119,12 +122,14 @@ int team_options_register(struct team *team,
>>  	for (i = 0; i < option_count; i++)
>>  		list_add_tail(&dst_opts[i]->list, &team->option_list);
>>  
>> +	kfree(dst_opts);
>>  	return 0;
>>  
>>  rollback:
>>  	for (i = 0; i < option_count; i++)
>>  		kfree(dst_opts[i]);
>>  
>> +	kfree(dst_opts);
>>  	return err;
>>  }
>>  
>
>Please use kmemdup() as well, or someone else will do it ;)
>
>dst_opt = kmalloc(sizeof(*option), GFP_KERNEL);
>...
>memcpy(dst_opt, option, sizeof(*option));
>
>-> dst_opt = kmemdup(...);
>
Sure, I'll do this in separate patch.

Thanks!

Jirka

^ permalink raw reply

* Re: [PATCH 0/4] skb paged fragment destructors
From: Eric Dumazet @ 2011-11-17 14:51 UTC (permalink / raw)
  To: Ian Campbell; +Cc: David Miller, Jesse Brandeburg, netdev@vger.kernel.org
In-Reply-To: <1321541121.3664.270.camel@zakaz.uk.xensource.com>

Le jeudi 17 novembre 2011 à 14:45 +0000, Ian Campbell a écrit :

> Am I right that it's not so much a case of fitting into half a page but
> more like not wasting nearly half a page if you end up requiring a 4096
> byte allocation? (This is probably splitting hairs, I'm really just
> curious)

Performance implications are quite high :

Here is what I did to bnx2x driver to reduce the space from 4096 to
2048, and you can see performance results :

http://git.kernel.org/?p=linux/kernel/git/davem/net-next.git;a=commit;h=e52fcb2462ac484e6dd6e68869536609f0216938

^ permalink raw reply

* Re: [PATCH 0/4] skb paged fragment destructors
From: Ian Campbell @ 2011-11-17 14:45 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, Jesse Brandeburg, netdev@vger.kernel.org
In-Reply-To: <1320860984.3916.33.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

On Wed, 2011-11-09 at 17:49 +0000, Eric Dumazet wrote:
> Le mercredi 09 novembre 2011 à 15:01 +0000, Ian Campbell a écrit :
> > The following series makes use of the skb fragment API (which is in 3.2)
> > to add a per-paged-fragment destructor callback. This can be used by
> > creators of skbs who are interested in the lifecycle of the pages
> > included in that skb after they have handed it off to the network stack.
> > I think these have all been posted before, but have been backed up
> > behind the skb fragment API.
> > 
> > The mail at [0] contains some more background and rationale but
> > basically the completed series will allow entities which inject pages
> > into the networking stack to receive a notification when the stack has
> > really finished with those pages (i.e. including retransmissions,
> > clones, pull-ups etc) and not just when the original skb is finished
> > with, which is beneficial to many subsystems which wish to inject pages
> > into the network stack without giving up full ownership of those page's
> > lifecycle. It implements something broadly along the lines of what was
> > described in [1].
> > 
> > I have also included a patch to the RPC subsystem which uses this API to
> > fix the bug which I describe at [2].
> > 
> > I presented this work at LPC in September and there was a
> > question/concern raised (by Jesse Brandenburg IIRC) regarding the
> > overhead of adding this extra field per fragment. If I understand
> > correctly it seems that in the there have been performance regressions
> > in the past with allocations outgrowing one allocation size bucket and
> > therefore using the next. The change in datastructure size resulting
> > from this series is:
> > 					  BEFORE	AFTER
> > AMD64:	sizeof(struct skb_frag_struct)	= 16		24
> > 	sizeof(struct skb_shared_info)	= 344		488
> 
> Thats a real problem, because 488 is soo big. (its even rounded to 512
> bytes)
> 
> Now, on x86, a half page (2048 bytes) wont be big enough to contain a
> typical frame (MTU=1500)
> 
> NET_SKB_PAD (64) + 1500 + 14 + 512 > 2048

Am I right that it's not so much a case of fitting into half a page but
more like not wasting nearly half a page if you end up requiring a 4096
byte allocation? (This is probably splitting hairs, I'm really just
curious)

> Even if we dont round 488 to 512, (no cache align skb_shared_info) we
> have a problem.
> 
> NET_SKB_PAD (64) + 1500 + 14 + 488 > 2048
> 
> Why not using a low order bit to mark 'page' being a pointer to 

I've given this a go but it makes the patches to the user of this
facility quite nasty (or at least it does for the sunrpc/NFS case). I'm
going to plug at it a bit more and see if I can't make it look cleaner
but I was starting to wonder about alternative approaches.

One idea was split the allocation of the data and the shinfo. Since
shinfo is a fixed size it would be easy to have a specific cache for
them. Does this sound even vaguely plausible?

> struct skb_frag_page_desc {
> 	struct page *p;
> 	atomic_t ref;
> 	int (*destroy)(void *data);
> /*	void *data; */ /* no need, see container_of() */

It turns out that container_of is not so useful here as the users
typically has a list of pages not a single page and hence has a list of
destructors too. What you actually need is the container of the pointer
to that list, IYSWIM, which you can't get at given only a pointer to an
element of the list. So you end up doing

	struct subsys_page_desc {
		struct subsys_container *container;
		struct sbk_frag_page_desc;
	}

*container here is basically the same as the void * so you might as well
include it in the base datastructure.

Ian.

> };

> 
> struct skb_frag_struct {
>         struct {
>                 union {
> 			struct page *p; /* low order bit not set */
> 			struct skb_frag_page_desc *skbpage; /* low order bit set */
> 		};
>         } page;
> ...
> 
> 

^ permalink raw reply

* Re: Unable to flush ICMP redirect routes in kernel 3.0+
From: Eric Dumazet @ 2011-11-17 14:40 UTC (permalink / raw)
  To: Flavio Leitner, David Miller; +Cc: Ivan Zahariev, netdev, Vasiliy Kulikov
In-Reply-To: <1321535747.2751.36.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

Le jeudi 17 novembre 2011 à 14:15 +0100, Eric Dumazet a écrit :

> I'll take a look :)
> 


While looking at it, I found that ICMP_MIB_INERRORS was incremented in
my (successful) ping tests.

netstat -s 
...
Icmp:
...
    13 input ICMP message failed.
...

[PATCH] ping: dont increment ICMP_MIB_INERRORS

ping module incorrectly increments ICMP_MIB_INERRORS if feeded with a
frame not belonging to its own sockets.

RFC 2011 states that ICMP_MIB_INERRORS should count "the number of ICMP
messages which the entiry received but determined as having
ICMP-specific errors (bad ICMP checksums, bad length, etc.)."

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Vasiliy Kulikov <segoon@openwall.com>
---
 net/ipv4/ping.c |    2 --
 1 file changed, 2 deletions(-)

diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c
index a06f73f..43d4c3b 100644
--- a/net/ipv4/ping.c
+++ b/net/ipv4/ping.c
@@ -339,7 +339,6 @@ void ping_err(struct sk_buff *skb, u32 info)
 	sk = ping_v4_lookup(net, iph->daddr, iph->saddr,
 			    ntohs(icmph->un.echo.id), skb->dev->ifindex);
 	if (sk == NULL) {
-		ICMP_INC_STATS_BH(net, ICMP_MIB_INERRORS);
 		pr_debug("no socket, dropping\n");
 		return;	/* No socket for error */
 	}
@@ -679,7 +678,6 @@ static int ping_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 	pr_debug("ping_queue_rcv_skb(sk=%p,sk->num=%d,skb=%p)\n",
 		inet_sk(sk), inet_sk(sk)->inet_num, skb);
 	if (sock_queue_rcv_skb(sk, skb) < 0) {
-		ICMP_INC_STATS_BH(sock_net(sk), ICMP_MIB_INERRORS);
 		kfree_skb(skb);
 		pr_debug("ping_queue_rcv_skb -> failed\n");
 		return -1;

^ permalink raw reply related

* Re: [patch net-next 2/2] team: avoid using variable-length array
From: Eric Dumazet @ 2011-11-17 14:31 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, bhutchings, shemminger, andy, fbl, jzupka, ivecera,
	mirqus
In-Reply-To: <1321539365-1125-2-git-send-email-jpirko@redhat.com>

Le jeudi 17 novembre 2011 à 15:16 +0100, Jiri Pirko a écrit :
> Apparently using variable-length array is not correct
> (https://lkml.org/lkml/2011/10/23/25). So remove it.
> 
> Signed-off-by: Jiri Pirko <jpirko@redhat.com>
> ---
>  drivers/net/team/team.c |    9 +++++++--
>  1 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
> index 5b169c1..c48ef19 100644
> --- a/drivers/net/team/team.c
> +++ b/drivers/net/team/team.c
> @@ -96,10 +96,13 @@ int team_options_register(struct team *team,
>  			  size_t option_count)
>  {
>  	int i;
> -	struct team_option *dst_opts[option_count];
> +	struct team_option **dst_opts;
>  	int err;
>  
> -	memset(dst_opts, 0, sizeof(dst_opts));
> +	dst_opts = kzalloc(sizeof(struct team_option *) * option_count,
> +			   GFP_KERNEL);
> +	if (!dst_opts)
> +		return -ENOMEM;
>  	for (i = 0; i < option_count; i++, option++) {
>  		struct team_option *dst_opt;
>  
> @@ -119,12 +122,14 @@ int team_options_register(struct team *team,
>  	for (i = 0; i < option_count; i++)
>  		list_add_tail(&dst_opts[i]->list, &team->option_list);
>  
> +	kfree(dst_opts);
>  	return 0;
>  
>  rollback:
>  	for (i = 0; i < option_count; i++)
>  		kfree(dst_opts[i]);
>  
> +	kfree(dst_opts);
>  	return err;
>  }
>  

Please use kmemdup() as well, or someone else will do it ;)

dst_opt = kmalloc(sizeof(*option), GFP_KERNEL);
...
memcpy(dst_opt, option, sizeof(*option));

-> dst_opt = kmemdup(...);

^ permalink raw reply

* RE: [PATCH net-next] r8169: Add 64bit statistics
From: Eric Dumazet @ 2011-11-17 14:27 UTC (permalink / raw)
  To: David Laight; +Cc: Junchang Wang, Stephen Hemminger, netdev, romieu, nic swsd
In-Reply-To: <AE90C24D6B3A694183C094C60CF0A2F6D8AECB@saturn3.aculab.com>

Le jeudi 17 novembre 2011 à 14:01 +0000, David Laight a écrit :

> Ah yes ...
> 
> Both u64_stats_update_begin & _end increment a numeric field
> with an appropriate memory barrier. So the 'update' path
> has two extra read-modify-write sequences (possibly the
> 2nd read can be optimised out), and two smp_wmb() that may
> introduce bus delays.
> 
> Would be fine if it were guaranteed to work!
> Consider the following sequence of events:
>        u64_stats_update_begin()
>        calculate 'count+1'
>                                 read_seqcount_begin()
>                                 read count_hi
>        write count_lo
>                                 read count_lo
>        write count_hi
>                                 read_seqcount_retry()
>        ... update other counters ...
>        u64_stats_update_end()
> The reader gets an invalid value since it reads the same
> 'sequence' both times.
> 
> Could be fixed by using '|= 1' in update_begin and
> looping on odd values in read_seqcount_begin().
> 

I am a bit tired of this discussion.

You obviously dont understand how the thing is working.

Spend some time on it before claiming there are bugs.

^ permalink raw reply

* [patch net-next 2/2] team: avoid using variable-length array
From: Jiri Pirko @ 2011-11-17 14:16 UTC (permalink / raw)
  To: netdev
  Cc: davem, eric.dumazet, bhutchings, shemminger, andy, fbl, jzupka,
	ivecera, mirqus
In-Reply-To: <1321539365-1125-1-git-send-email-jpirko@redhat.com>

Apparently using variable-length array is not correct
(https://lkml.org/lkml/2011/10/23/25). So remove it.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
---
 drivers/net/team/team.c |    9 +++++++--
 1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index 5b169c1..c48ef19 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -96,10 +96,13 @@ int team_options_register(struct team *team,
 			  size_t option_count)
 {
 	int i;
-	struct team_option *dst_opts[option_count];
+	struct team_option **dst_opts;
 	int err;
 
-	memset(dst_opts, 0, sizeof(dst_opts));
+	dst_opts = kzalloc(sizeof(struct team_option *) * option_count,
+			   GFP_KERNEL);
+	if (!dst_opts)
+		return -ENOMEM;
 	for (i = 0; i < option_count; i++, option++) {
 		struct team_option *dst_opt;
 
@@ -119,12 +122,14 @@ int team_options_register(struct team *team,
 	for (i = 0; i < option_count; i++)
 		list_add_tail(&dst_opts[i]->list, &team->option_list);
 
+	kfree(dst_opts);
 	return 0;
 
 rollback:
 	for (i = 0; i < option_count; i++)
 		kfree(dst_opts[i]);
 
+	kfree(dst_opts);
 	return err;
 }
 
-- 
1.7.6

^ permalink raw reply related

* [patch net-next 1/2] team: add fix_features
From: Jiri Pirko @ 2011-11-17 14:16 UTC (permalink / raw)
  To: netdev
  Cc: davem, eric.dumazet, bhutchings, shemminger, andy, fbl, jzupka,
	ivecera, mirqus

do fix features in similar way as bonding code does

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
---
 drivers/net/team/team.c |   22 ++++++++++++++++++++++
 1 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index f309274..5b169c1 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -953,6 +953,27 @@ static int team_del_slave(struct net_device *dev, struct net_device *port_dev)
 	return err;
 }
 
+static netdev_features_t team_fix_features(struct net_device *dev,
+					   netdev_features_t features)
+{
+	struct team_port *port;
+	struct team *team = netdev_priv(dev);
+	netdev_features_t mask;
+
+	mask = features;
+	features &= ~NETIF_F_ONE_FOR_ALL;
+	features |= NETIF_F_ALL_FOR_ALL;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(port, &team->port_list, list) {
+		features = netdev_increment_features(features,
+						     port->dev->features,
+						     mask);
+	}
+	rcu_read_unlock();
+	return features;
+}
+
 static const struct net_device_ops team_netdev_ops = {
 	.ndo_init		= team_init,
 	.ndo_uninit		= team_uninit,
@@ -968,6 +989,7 @@ static const struct net_device_ops team_netdev_ops = {
 	.ndo_vlan_rx_kill_vid	= team_vlan_rx_kill_vid,
 	.ndo_add_slave		= team_add_slave,
 	.ndo_del_slave		= team_del_slave,
+	.ndo_fix_features	= team_fix_features,
 };
 
 
-- 
1.7.6

^ permalink raw reply related

* RE: [PATCH net-next] r8169: Add 64bit statistics
From: David Laight @ 2011-11-17 14:01 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Junchang Wang, Stephen Hemminger, netdev, romieu, nic swsd
In-Reply-To: <1321534700.2751.27.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

> From: Eric Dumazet [mailto:eric.dumazet@gmail.com] 
> Le jeudi 17 novembre 2011 à 11:13 +0000, David Laight a écrit :
> > The 64bit stats update sequence used to get valid
> > counts on 32bit systems (that can't do locked 64bit
> > memory access) seems to be:
> > 
> > >     u64_stats_update_begin(&sky2->tx_stats.syncp);
> > >     ++sky2->tx_stats.packets;
> > >     sky2->tx_stats.bytes += skb->len;
> > >     u64_stats_update_end(&sky2->tx_stats.syncp);
> > 
> > I'm not sure what the begin/end markers do, but
> > they need to hold off the readers during updates
> > and the writers during reads - this is probably
> > expensive on the update path.
> > 
> > A thought that might work is for the writer to
> > write the middle bits of the 64 bit walue to
> > another location, eg:
> >        count = sky2->tx_stats.bytes + skb->len;
> >        sky2->tx_stats.bytes = count;
> >        sky2->tx_stats.bytes_check = count >> 16;
> > The reader then loops until the two value are
> > consistent.
> > 
> > I think this doesn't even require a memory barrier
> > in the ISR since the order of the reads an writes
> > doesn't matter at all.
> > 
> > 	David
> > 
> > 
> 
> Oh well...
> 
> Before claiming all this, you really should read 
> include/linux/u64_stats_sync.h
> 
> This should answer all your questions.

Ah yes ...

Both u64_stats_update_begin & _end increment a numeric field
with an appropriate memory barrier. So the 'update' path
has two extra read-modify-write sequences (possibly the
2nd read can be optimised out), and two smp_wmb() that may
introduce bus delays.

Would be fine if it were guaranteed to work!
Consider the following sequence of events:
       u64_stats_update_begin()
       calculate 'count+1'
                                read_seqcount_begin()
                                read count_hi
       write count_lo
                                read count_lo
       write count_hi
                                read_seqcount_retry()
       ... update other counters ...
       u64_stats_update_end()
The reader gets an invalid value since it reads the same
'sequence' both times.

Could be fixed by using '|= 1' in update_begin and
looping on odd values in read_seqcount_begin().

	David

                      

^ permalink raw reply

* Re: Unable to flush ICMP redirect routes in kernel 3.0+
From: Eric Dumazet @ 2011-11-17 13:15 UTC (permalink / raw)
  To: Flavio Leitner; +Cc: Ivan Zahariev, netdev
In-Reply-To: <20111117111145.252924f5@asterix.rh>

Le jeudi 17 novembre 2011 à 11:11 -0200, Flavio Leitner a écrit :

> I am going to be on vacations in the next couple weeks, so I won't be
> able to help fixing this any time soon. However, I am pretty sure
> someone else will help though :)
> 

I'll take a look :)

^ permalink raw reply

* [PATCH v2 net-next] net: use jump_label to shortcut RPS if not setup
From: Eric Dumazet @ 2011-11-17 13:13 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Tom Herbert

Most machines dont use RPS/RFS, and pay a fair amount of instructions in
netif_receive_skb() / netif_rx() / get_rps_cpu() just to discover
RPS/RFS is not setup.

Add a jump_label named rps_needed.

If no device rps_map or global rps_sock_flow_table is setup,
netif_receive_skb() / netif_rx() do a single instruction instead of many
ones, including conditional jumps.

jmp +0    (if CONFIG_JUMP_LABEL=y)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Tom Herbert <therbert@google.com>
---
V2: add netif_rx() optim as well

 include/linux/netdevice.h  |    5 +++++
 net/core/dev.c             |   21 +++++++++------------
 net/core/net-sysfs.c       |    7 +++++--
 net/core/sysctl_net_core.c |    9 +++++++--
 4 files changed, 26 insertions(+), 16 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 4d5698a..0bbe030 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -214,6 +214,11 @@ enum {
 #include <linux/cache.h>
 #include <linux/skbuff.h>
 
+#ifdef CONFIG_RPS
+#include <linux/jump_label.h>
+extern struct jump_label_key rps_needed;
+#endif
+
 struct neighbour;
 struct neigh_parms;
 struct sk_buff;
diff --git a/net/core/dev.c b/net/core/dev.c
index 26c49d5..f789599 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2711,6 +2711,8 @@ EXPORT_SYMBOL(__skb_get_rxhash);
 struct rps_sock_flow_table __rcu *rps_sock_flow_table __read_mostly;
 EXPORT_SYMBOL(rps_sock_flow_table);
 
+struct jump_label_key rps_needed __read_mostly;
+
 static struct rps_dev_flow *
 set_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 	    struct rps_dev_flow *rflow, u16 next_cpu)
@@ -2994,7 +2996,7 @@ int netif_rx(struct sk_buff *skb)
 
 	trace_netif_rx(skb);
 #ifdef CONFIG_RPS
-	{
+	if (static_branch(&rps_needed))	{
 		struct rps_dev_flow voidflow, *rflow = &voidflow;
 		int cpu;
 
@@ -3009,14 +3011,13 @@ int netif_rx(struct sk_buff *skb)
 
 		rcu_read_unlock();
 		preempt_enable();
-	}
-#else
+	} else
+#endif
 	{
 		unsigned int qtail;
 		ret = enqueue_to_backlog(skb, get_cpu(), &qtail);
 		put_cpu();
 	}
-#endif
 	return ret;
 }
 EXPORT_SYMBOL(netif_rx);
@@ -3359,7 +3360,7 @@ int netif_receive_skb(struct sk_buff *skb)
 		return NET_RX_SUCCESS;
 
 #ifdef CONFIG_RPS
-	{
+	if (static_branch(&rps_needed)) {
 		struct rps_dev_flow voidflow, *rflow = &voidflow;
 		int cpu, ret;
 
@@ -3370,16 +3371,12 @@ int netif_receive_skb(struct sk_buff *skb)
 		if (cpu >= 0) {
 			ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
 			rcu_read_unlock();
-		} else {
-			rcu_read_unlock();
-			ret = __netif_receive_skb(skb);
+			return ret;
 		}
-
-		return ret;
+		rcu_read_unlock();
 	}
-#else
-	return __netif_receive_skb(skb);
 #endif
+	return __netif_receive_skb(skb);
 }
 EXPORT_SYMBOL(netif_receive_skb);
 
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 602b141..db6c2f8 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -606,9 +606,12 @@ static ssize_t store_rps_map(struct netdev_rx_queue *queue,
 	rcu_assign_pointer(queue->rps_map, map);
 	spin_unlock(&rps_map_lock);
 
-	if (old_map)
+	if (map)
+		jump_label_inc(&rps_needed);
+	if (old_map) {
 		kfree_rcu(old_map, rcu);
-
+		jump_label_dec(&rps_needed);
+	}
 	free_cpumask_var(mask);
 	return len;
 }
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index 77a65f0..d05559d 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -68,8 +68,13 @@ static int rps_sock_flow_sysctl(ctl_table *table, int write,
 
 		if (sock_table != orig_sock_table) {
 			rcu_assign_pointer(rps_sock_flow_table, sock_table);
-			synchronize_rcu();
-			vfree(orig_sock_table);
+			if (sock_table)
+				jump_label_inc(&rps_needed);
+			if (orig_sock_table) {
+				jump_label_dec(&rps_needed);
+				synchronize_rcu();
+				vfree(orig_sock_table);
+			}
 		}
 	}
 

^ permalink raw reply related

* Re: Unable to flush ICMP redirect routes in kernel 3.0+
From: Flavio Leitner @ 2011-11-17 13:11 UTC (permalink / raw)
  To: Ivan Zahariev; +Cc: netdev
In-Reply-To: <4EC4C160.6010804@icdsoft.com>

On Thu, 17 Nov 2011 10:10:08 +0200
Ivan Zahariev <famzah@icdsoft.com> wrote:

> On 17.11.2011 г. 02:33 ч., Flavio Leitner wrote:
> > On Thu, 17 Nov 2011 00:32:18 +0200
> > Ivan Zahariev<famzah@icdsoft.com>  wrote:
> >
> >> On 11/15/2011 11:09 PM, Eric Dumazet wrote:
> >>> Le mardi 15 novembre 2011 à 22:23 +0200, Ivan Zahariev a écrit :
> >>>> Hello,
> >>>>
> >>>> We have changed nothing in our network infrastructure but only
> >>>> upgraded from Linux kernel 2.6.36.2 to 3.0.3. Here is the problem
> >>>> we are experiencing:
> >>>>
> >>>> ICMP redirected routes are cached forever, and they can be
> >>>> cleared only by a reboot.
> >>>>
> >> ### (bug #1) even though we flushed the route cache,
> >> the<redirected> route resurrects from somewhere; even without
> >> making any TCP requests ### this time what "ip" returns is
> >> consistent with the real (incorrect) routing behavior of machine5
> >> root@machine5:~# ip route flush cache
> >> root@machine5:~# ip route list cache match 8.8.4.4
> >> root@machine5:~# ip route get 8.8.4.4
> >> 8.8.4.4 via 192.168.0.120 dev eth0  src 192.168.0.244
> >>       cache<redirected>   ipid 0x303a
> >>
> >> ### only a reboot clears the cached<redirected>  routes
> > IIRC, the cache flush doesn't affect the inetpeer where the
> > redirected gateway is now stored, so even after flushing the
> > route cache, the inetpeer will restore the old info later.
> >
> > fbl
> OK, I guess my questions now are:
> * How to flush the inetpeer (redirected cache info) without having to 
> reboot the machine?

It will expire after 10min if you don't use that specific host.

> * Why "ip route" returns an incorrect route; example:

I am sorry for not being clear before. It is a bug, indeed.
 
> ### (bug #2) what "ip route" returns is inconsistent, because we are 
> using the <redirected> route 192.168.0.120 in reality
> ### note that the count of the route lines increased with one
> root@machine5:~# ip route list cache match 8.8.4.4
> 8.8.4.4 from 192.168.0.244 tos lowdelay via 192.168.0.8 dev eth0
>      cache  ipid 0x303a
> 8.8.4.4 tos lowdelay via 192.168.0.8 dev eth0  src 192.168.0.244
>      cache  ipid 0x303a
> 8.8.4.4 via 192.168.0.8 dev eth0  src 192.168.0.244
>      cache
> 8.8.4.4 from 192.168.0.244 tos lowdelay via 192.168.0.8 dev eth0
>      cache  ipid 0x303a
> 
> ### After "ip route flush cache", the output of "ip route" gets 
> consistent with the real routing behavior of machine5
> root@machine5:~# ip route flush cache
> root@machine5:~# ip route list cache match 8.8.4.4
> root@machine5:~# ip route get 8.8.4.4
> 8.8.4.4 via 192.168.0.120 dev eth0  src 192.168.0.244
>      cache <redirected>  ipid 0x303a
> 

Now the redirected gateway is stored in inetpeer which represents
an specific peer.  In your case, you have one for 8.8.4.4.

When you flush the routing cache everything is flushed, except for
the inetpeer as far as I can tell.  Later, when you try to access
the host 8.8.4.4 again, the lookup will create a fresh route but
also find the previous 8.8.4.4 inetpeer, so it will re-use the
previous redirected gateway.

Therefore, the routing is fine, but it is missing a way to
invalidade or expire all related inetpeer entries when the flush
happens.

The inetpeer will expire eventually, so waiting before trying again
would help to work around:

1) flush
2) wait to expire (10min)
3) try again

If you know how to compile a kernel, try to change these thresholds
below to expire faster, then you have to wait less for it to expire
instead of rebooting.

net/ipv4/inetpeer.c:
int inet_peer_minttl __read_mostly = 120 * HZ;  /* TTL under high load:
120 sec */ int inet_peer_maxttl __read_mostly = 10 * 60 * HZ;      /*
usual time to live: 10 min */                           

That above is just a workaround, indeed.

I am going to be on vacations in the next couple weeks, so I won't be
able to help fixing this any time soon. However, I am pretty sure
someone else will help though :)

fbl

^ permalink raw reply

* [PATCH net-next] ibm/emac: fix improper cleanup when device is removed to allow re-bind
From: Wolfgang Grandegger @ 2011-11-17 13:06 UTC (permalink / raw)
  To: Netdev; +Cc: Linuxppc-dev

The re-binding (unbind..bind) of an EMAC device fails because the
static variable "busy_phy_map" is not updated when the device is
removed.

Signed-off-by: Wolfgang Grandegger <wg@denx.de>
---
 drivers/net/ethernet/ibm/emac/core.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/ibm/emac/core.c b/drivers/net/ethernet/ibm/emac/core.c
index ed79b2d..2abce96 100644
--- a/drivers/net/ethernet/ibm/emac/core.c
+++ b/drivers/net/ethernet/ibm/emac/core.c
@@ -2924,6 +2924,9 @@ static int __devexit emac_remove(struct platform_device *ofdev)
 	if (emac_has_feature(dev, EMAC_FTR_HAS_ZMII))
 		zmii_detach(dev->zmii_dev, dev->zmii_port);
 
+	busy_phy_map &= ~(1 << dev->phy.address);
+	DBG(dev, "busy_phy_map now %#x" NL, busy_phy_map);
+
 	mal_unregister_commac(dev->mal, &dev->commac);
 	emac_put_deps(dev);
 
-- 
1.7.4.1

^ permalink raw reply related

* [PATCH] f_phonet: fix page offset of first received fragment
From: Rémi Denis-Courmont @ 2011-11-17 12:58 UTC (permalink / raw)
  To: netdev; +Cc: Rémi Denis-Courmont

From: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>

We pull one byte (the MAC header) from the first fragment before the
fragment is actually appended. So the socket buffer length is 1, not 0.

Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
---
 drivers/usb/gadget/f_phonet.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/usb/gadget/f_phonet.c b/drivers/usb/gadget/f_phonet.c
index 3490770..16a509a 100644
--- a/drivers/usb/gadget/f_phonet.c
+++ b/drivers/usb/gadget/f_phonet.c
@@ -346,7 +346,7 @@ static void pn_rx_complete(struct usb_ep *ep, struct usb_request *req)
 		}
 
 		skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, page,
-				skb->len == 0, req->actual);
+				skb->len <= 1, req->actual);
 		page = NULL;
 
 		if (req->actual < req->length) { /* Last fragment */
-- 
1.7.5.4

^ permalink raw reply related

* Re: [PATCH net-next] r8169: Add 64bit statistics
From: Eric Dumazet @ 2011-11-17 13:02 UTC (permalink / raw)
  To: Junchang Wang; +Cc: Stephen Hemminger, netdev, romieu, nic swsd
In-Reply-To: <CABoNC83U2k=-6-BkJEyWBXfSnEbb-+K8CD1ffqhjA613Tk04tQ@mail.gmail.com>

Le jeudi 17 novembre 2011 à 18:59 +0800, Junchang Wang a écrit :
> > Yes, look at sky2.c for a template
> >
> > drivers/net/ethernet/marvell/sky2.c contains code like that
> > (different syncp for rx/tx)
> >
> > TX path:
> >                        u64_stats_update_begin(&sky2->tx_stats.syncp);
> >                        ++sky2->tx_stats.packets;
> >                        sky2->tx_stats.bytes += skb->len;
> >                        u64_stats_update_end(&sky2->tx_stats.syncp);
> >
> >
> > RX path:
> >
> >        u64_stats_update_begin(&sky2->rx_stats.syncp);
> >        sky2->rx_stats.packets += packets;
> >        sky2->rx_stats.bytes += bytes;
> >        u64_stats_update_end(&sky2->rx_stats.syncp);
> >
> 
> Thanks, Eric.
> 
> I'm still confused about why we need two sync entries. Please correct
> me if I'm wrong.
> 
> Take r8169 for example, All statistic entries are updated in
> rtl8169_rx_interrupt() or rtl8169_tx_interrupt(). Those two functions
> are called in rtl8169_poll().
> As far as I know, rtl8169_poll() is protected by NAPI_STATE_SCHED bit
> to run on a single core at one moment. So there is not compulsory to
> use two sync entries.
> 
> One benefit from two sync is that readers can avoid many retries.
> 

Oh well...

Dont try to optimize all this stuff, I spent some time on it already,
just use the infrastructure and forget about races and bugs.

The short answer is :

CPU1                    CPU2                  
TX path                 RX path               


since CPU1 & CPU2 will do the update_begin(), and the memory increment
is not atomic, we might lose one increment, and block a stat reader
forever.

^ permalink raw reply

* RE: [PATCH net-next] r8169: Add 64bit statistics
From: Eric Dumazet @ 2011-11-17 12:58 UTC (permalink / raw)
  To: David Laight; +Cc: Junchang Wang, Stephen Hemminger, netdev, romieu, nic swsd
In-Reply-To: <AE90C24D6B3A694183C094C60CF0A2F6D8AEC9@saturn3.aculab.com>

Le jeudi 17 novembre 2011 à 11:13 +0000, David Laight a écrit :
> The 64bit stats update sequence used to get valid
> counts on 32bit systems (that can't do locked 64bit
> memory access) seems to be:
> 
> >     u64_stats_update_begin(&sky2->tx_stats.syncp);
> >     ++sky2->tx_stats.packets;
> >     sky2->tx_stats.bytes += skb->len;
> >     u64_stats_update_end(&sky2->tx_stats.syncp);
> 
> I'm not sure what the begin/end markers do, but
> they need to hold off the readers during updates
> and the writers during reads - this is probably
> expensive on the update path.
> 
> A thought that might work is for the writer to
> write the middle bits of the 64 bit walue to
> another location, eg:
>        count = sky2->tx_stats.bytes + skb->len;
>        sky2->tx_stats.bytes = count;
>        sky2->tx_stats.bytes_check = count >> 16;
> The reader then loops until the two value are
> consistent.
> 
> I think this doesn't even require a memory barrier
> in the ISR since the order of the reads an writes
> doesn't matter at all.
> 
> 	David
> 
> 

Oh well...

Before claiming all this, you really should read 
include/linux/u64_stats_sync.h

This should answer all your questions.

^ permalink raw reply

* RE: 3.1.0 rtl8169_open oops.
From: hayeswang @ 2011-11-17 12:26 UTC (permalink / raw)
  To: 'Dave Jones', 'Francois Romieu'
  Cc: netdev, kernel-team, dwmw2
In-Reply-To: <20111117005355.GA8466@redhat.com>

 

> -----Original Message-----
> From: Dave Jones [mailto:davej@redhat.com] 
> Sent: Thursday, November 17, 2011 8:54 AM
> To: Francois Romieu
> Cc: netdev@vger.kernel.org; kernel-team@fedoraproject.org; 
> Hayeswang; dwmw2@infradead.org
> Subject: Re: 3.1.0 rtl8169_open oops.
> 
> On Thu, Nov 17, 2011 at 01:34:36AM +0100, Francois Romieu wrote:
>  > Dave Jones <davej@redhat.com> :
>  > [...]
>  > > (gdb) list *(rtl8169_open)+0x34e
>  > > 0x4c0f is in rtl8169_open (drivers/net/r8169.c:1881).
>  > > 1876		bool rc = false;
>  > > 1877	
>  > > 1878		if (fw->size < FW_OPCODE_SIZE)
>  > > 1879			goto out;
>  > > 1880	
>  > > 1881		if (!fw_info->magic) {
>  > > 1882			size_t i, size, start;
>  > > 1883			u8 checksum = 0;
>  > > 1884	
>  > > 1885			if (fw->size < sizeof(*fw_info))
>  > 

My colleague tries Fedora 16 and it works fine with several different chips. We
couldn't reproduce the issue.

>  > Give me 24h. I hope it is not as trivial as it seems :o(
>  > 
>  > Btw the firmware included in F16 is not up-to-date (see
>  > http://marc.info/?l=linux-kernel&m=131295492507686&w=2). 
> The driver should
>  > not mind though.
> 
> That reminds me. Since the kernel.org compromise, where does
> linux-firmware live now ?
> 

I submitted the new firmwares to

http://git.kernel.org/pub/scm/linux/kernel/git/dwmw2/linux-firmware.git

However, the owner doesn't update them yet. You could reference

http://www.spinics.net/lists/netdev/msg178015.html

 
Best Regards,
Hayes

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox