Netdev List
 help / color / mirror / Atom feed
* root_lock vs. device's TX lock
From: Tom Herbert @ 2011-11-17 16:10 UTC (permalink / raw)
  To: Linux Netdev List

>From sch_direct_xmit:

        /* And release qdisc */
        spin_unlock(root_lock);

        HARD_TX_LOCK(dev, txq, smp_processor_id());
        if (!netif_tx_queue_frozen_or_stopped(txq))
                ret = dev_hard_start_xmit(skb, dev, txq);

        HARD_TX_UNLOCK(dev, txq);

        spin_lock(root_lock);

This is a lot of lock manipulation to basically switch from one lock
to another and possibly thrashing just to send a packet.  I am
thinking that if the there is a 1-1 correspondence between qdisc and
device queue then we could actually use the device's lock as the root
lock for the qdisc.  So in that case, we would need to touch any locks
from sch_direct_xmit (just hold root lock which is already device lock
for the duration).

Is there any reason why this couldn't work?

^ permalink raw reply

* Re: Unable to flush ICMP redirect routes in kernel 3.0+
From: Flavio Leitner @ 2011-11-17 15:37 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, Ivan Zahariev, netdev, Vasiliy Kulikov
In-Reply-To: <1321540820.2751.47.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

On Thu, 17 Nov 2011 15:40:20 +0100
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> [PATCH] ping: dont increment ICMP_MIB_INERRORS
> 
> ping module incorrectly increments ICMP_MIB_INERRORS if feeded with a
> frame not belonging to its own sockets.
> 
> RFC 2011 states that ICMP_MIB_INERRORS should count "the number of
> ICMP messages which the entiry received but determined as having
> ICMP-specific errors (bad ICMP checksums, bad length, etc.)."
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> CC: Vasiliy Kulikov <segoon@openwall.com>

Yeah, they aren't ICMP specific errors and the callers already
checked for checksum, lengths, and etc.. increasing that counter
when necessary.

Acked-by: Flavio Leitner <fbl@redhat.com>

^ permalink raw reply

* Re: [PATCH] Don't allow sharing of tx skbs on xen-netfront
From: Ian Campbell @ 2011-11-17 15:20 UTC (permalink / raw)
  To: Neil Horman
  Cc: netdev, David S. Miller, Jeremy Fitzhardinge,
	Konrad Rzeszutek Wilk, xen-devel
In-Reply-To: <1321298544-16434-1-git-send-email-nhorman@tuxdriver.com>

On Mon, 2011-11-14 at 14:22 -0500, Neil Horman wrote:
> It was pointed out to me recently that the xen-netfront driver can't safely
> support shared skbs on transmit, since, while it doesn't maintain skb state
> directly, it does pass a pointer to the skb to the hypervisor via a list, and
> the hypervisor may expect the contents of the skb to remain stable.  Clearing
> the IFF_TX_SKB_SHARING flag after the call to alloc_etherdev to make it safe.

What are the actual constraints here? The skb is used as a handle to the
skb->data and shinfo (frags) and to complete at the end. It's actually
those which are passed to the hypervisor (effectively the same as
passing those addresses to the h/w for DMA).

Which parts of the skb are expected/allowed to not remain stable?

(Appologies if the above seems naive, I seem to have missed the
introduction of shared tx skbs and IFF_TX_SKB_SHARING)

Ian.

> 
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> CC: "David S. Miller" <davem@davemloft.net>
> CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
> CC: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> CC: xen-devel@lists.xensource.com
> ---
>  drivers/net/xen-netfront.c |    6 ++++++
>  1 files changed, 6 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
> index 226faab..fb1077b 100644
> --- a/drivers/net/xen-netfront.c
> +++ b/drivers/net/xen-netfront.c
> @@ -1252,6 +1252,12 @@ static struct net_device * __devinit xennet_create_dev(struct xenbus_device *dev
>  		return ERR_PTR(-ENOMEM);
>  	}
>  
> +	/*
> +	 * Since frames remain on a queue after a return from xennet_start_xmit,
> +	 * we can't support tx shared skbs
> +	 */
> +	netdev->priv_flags &= ~IFF_TX_SKB_SHARING;
> +
>  	np                   = netdev_priv(netdev);
>  	np->xbdev            = dev;
>  

^ permalink raw reply

* Re: [patch net-next 2/2] team: avoid using variable-length array
From: Jiri Pirko @ 2011-11-17 15:10 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, davem, bhutchings, shemminger, andy, fbl, jzupka, ivecera,
	mirqus
In-Reply-To: <1321540278.2751.40.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

Thu, Nov 17, 2011 at 03:31:18PM CET, eric.dumazet@gmail.com wrote:
>Le jeudi 17 novembre 2011 à 15:16 +0100, Jiri Pirko a écrit :
>> Apparently using variable-length array is not correct
>> (https://lkml.org/lkml/2011/10/23/25). So remove it.
>> 
>> Signed-off-by: Jiri Pirko <jpirko@redhat.com>
>> ---
>>  drivers/net/team/team.c |    9 +++++++--
>>  1 files changed, 7 insertions(+), 2 deletions(-)
>> 
>> diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
>> index 5b169c1..c48ef19 100644
>> --- a/drivers/net/team/team.c
>> +++ b/drivers/net/team/team.c
>> @@ -96,10 +96,13 @@ int team_options_register(struct team *team,
>>  			  size_t option_count)
>>  {
>>  	int i;
>> -	struct team_option *dst_opts[option_count];
>> +	struct team_option **dst_opts;
>>  	int err;
>>  
>> -	memset(dst_opts, 0, sizeof(dst_opts));
>> +	dst_opts = kzalloc(sizeof(struct team_option *) * option_count,
>> +			   GFP_KERNEL);
>> +	if (!dst_opts)
>> +		return -ENOMEM;
>>  	for (i = 0; i < option_count; i++, option++) {
>>  		struct team_option *dst_opt;
>>  
>> @@ -119,12 +122,14 @@ int team_options_register(struct team *team,
>>  	for (i = 0; i < option_count; i++)
>>  		list_add_tail(&dst_opts[i]->list, &team->option_list);
>>  
>> +	kfree(dst_opts);
>>  	return 0;
>>  
>>  rollback:
>>  	for (i = 0; i < option_count; i++)
>>  		kfree(dst_opts[i]);
>>  
>> +	kfree(dst_opts);
>>  	return err;
>>  }
>>  
>
>Please use kmemdup() as well, or someone else will do it ;)
>
>dst_opt = kmalloc(sizeof(*option), GFP_KERNEL);
>...
>memcpy(dst_opt, option, sizeof(*option));
>
>-> dst_opt = kmemdup(...);
>
Sure, I'll do this in separate patch.

Thanks!

Jirka

^ permalink raw reply

* Re: [PATCH 0/4] skb paged fragment destructors
From: Eric Dumazet @ 2011-11-17 14:51 UTC (permalink / raw)
  To: Ian Campbell; +Cc: David Miller, Jesse Brandeburg, netdev@vger.kernel.org
In-Reply-To: <1321541121.3664.270.camel@zakaz.uk.xensource.com>

Le jeudi 17 novembre 2011 à 14:45 +0000, Ian Campbell a écrit :

> Am I right that it's not so much a case of fitting into half a page but
> more like not wasting nearly half a page if you end up requiring a 4096
> byte allocation? (This is probably splitting hairs, I'm really just
> curious)

Performance implications are quite high :

Here is what I did to bnx2x driver to reduce the space from 4096 to
2048, and you can see performance results :

http://git.kernel.org/?p=linux/kernel/git/davem/net-next.git;a=commit;h=e52fcb2462ac484e6dd6e68869536609f0216938

^ permalink raw reply

* Re: [PATCH 0/4] skb paged fragment destructors
From: Ian Campbell @ 2011-11-17 14:45 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, Jesse Brandeburg, netdev@vger.kernel.org
In-Reply-To: <1320860984.3916.33.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

On Wed, 2011-11-09 at 17:49 +0000, Eric Dumazet wrote:
> Le mercredi 09 novembre 2011 à 15:01 +0000, Ian Campbell a écrit :
> > The following series makes use of the skb fragment API (which is in 3.2)
> > to add a per-paged-fragment destructor callback. This can be used by
> > creators of skbs who are interested in the lifecycle of the pages
> > included in that skb after they have handed it off to the network stack.
> > I think these have all been posted before, but have been backed up
> > behind the skb fragment API.
> > 
> > The mail at [0] contains some more background and rationale but
> > basically the completed series will allow entities which inject pages
> > into the networking stack to receive a notification when the stack has
> > really finished with those pages (i.e. including retransmissions,
> > clones, pull-ups etc) and not just when the original skb is finished
> > with, which is beneficial to many subsystems which wish to inject pages
> > into the network stack without giving up full ownership of those page's
> > lifecycle. It implements something broadly along the lines of what was
> > described in [1].
> > 
> > I have also included a patch to the RPC subsystem which uses this API to
> > fix the bug which I describe at [2].
> > 
> > I presented this work at LPC in September and there was a
> > question/concern raised (by Jesse Brandenburg IIRC) regarding the
> > overhead of adding this extra field per fragment. If I understand
> > correctly it seems that in the there have been performance regressions
> > in the past with allocations outgrowing one allocation size bucket and
> > therefore using the next. The change in datastructure size resulting
> > from this series is:
> > 					  BEFORE	AFTER
> > AMD64:	sizeof(struct skb_frag_struct)	= 16		24
> > 	sizeof(struct skb_shared_info)	= 344		488
> 
> Thats a real problem, because 488 is soo big. (its even rounded to 512
> bytes)
> 
> Now, on x86, a half page (2048 bytes) wont be big enough to contain a
> typical frame (MTU=1500)
> 
> NET_SKB_PAD (64) + 1500 + 14 + 512 > 2048

Am I right that it's not so much a case of fitting into half a page but
more like not wasting nearly half a page if you end up requiring a 4096
byte allocation? (This is probably splitting hairs, I'm really just
curious)

> Even if we dont round 488 to 512, (no cache align skb_shared_info) we
> have a problem.
> 
> NET_SKB_PAD (64) + 1500 + 14 + 488 > 2048
> 
> Why not using a low order bit to mark 'page' being a pointer to 

I've given this a go but it makes the patches to the user of this
facility quite nasty (or at least it does for the sunrpc/NFS case). I'm
going to plug at it a bit more and see if I can't make it look cleaner
but I was starting to wonder about alternative approaches.

One idea was split the allocation of the data and the shinfo. Since
shinfo is a fixed size it would be easy to have a specific cache for
them. Does this sound even vaguely plausible?

> struct skb_frag_page_desc {
> 	struct page *p;
> 	atomic_t ref;
> 	int (*destroy)(void *data);
> /*	void *data; */ /* no need, see container_of() */

It turns out that container_of is not so useful here as the users
typically has a list of pages not a single page and hence has a list of
destructors too. What you actually need is the container of the pointer
to that list, IYSWIM, which you can't get at given only a pointer to an
element of the list. So you end up doing

	struct subsys_page_desc {
		struct subsys_container *container;
		struct sbk_frag_page_desc;
	}

*container here is basically the same as the void * so you might as well
include it in the base datastructure.

Ian.

> };

> 
> struct skb_frag_struct {
>         struct {
>                 union {
> 			struct page *p; /* low order bit not set */
> 			struct skb_frag_page_desc *skbpage; /* low order bit set */
> 		};
>         } page;
> ...
> 
> 

^ permalink raw reply

* Re: Unable to flush ICMP redirect routes in kernel 3.0+
From: Eric Dumazet @ 2011-11-17 14:40 UTC (permalink / raw)
  To: Flavio Leitner, David Miller; +Cc: Ivan Zahariev, netdev, Vasiliy Kulikov
In-Reply-To: <1321535747.2751.36.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

Le jeudi 17 novembre 2011 à 14:15 +0100, Eric Dumazet a écrit :

> I'll take a look :)
> 


While looking at it, I found that ICMP_MIB_INERRORS was incremented in
my (successful) ping tests.

netstat -s 
...
Icmp:
...
    13 input ICMP message failed.
...

[PATCH] ping: dont increment ICMP_MIB_INERRORS

ping module incorrectly increments ICMP_MIB_INERRORS if feeded with a
frame not belonging to its own sockets.

RFC 2011 states that ICMP_MIB_INERRORS should count "the number of ICMP
messages which the entiry received but determined as having
ICMP-specific errors (bad ICMP checksums, bad length, etc.)."

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Vasiliy Kulikov <segoon@openwall.com>
---
 net/ipv4/ping.c |    2 --
 1 file changed, 2 deletions(-)

diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c
index a06f73f..43d4c3b 100644
--- a/net/ipv4/ping.c
+++ b/net/ipv4/ping.c
@@ -339,7 +339,6 @@ void ping_err(struct sk_buff *skb, u32 info)
 	sk = ping_v4_lookup(net, iph->daddr, iph->saddr,
 			    ntohs(icmph->un.echo.id), skb->dev->ifindex);
 	if (sk == NULL) {
-		ICMP_INC_STATS_BH(net, ICMP_MIB_INERRORS);
 		pr_debug("no socket, dropping\n");
 		return;	/* No socket for error */
 	}
@@ -679,7 +678,6 @@ static int ping_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 	pr_debug("ping_queue_rcv_skb(sk=%p,sk->num=%d,skb=%p)\n",
 		inet_sk(sk), inet_sk(sk)->inet_num, skb);
 	if (sock_queue_rcv_skb(sk, skb) < 0) {
-		ICMP_INC_STATS_BH(sock_net(sk), ICMP_MIB_INERRORS);
 		kfree_skb(skb);
 		pr_debug("ping_queue_rcv_skb -> failed\n");
 		return -1;

^ permalink raw reply related

* Re: [patch net-next 2/2] team: avoid using variable-length array
From: Eric Dumazet @ 2011-11-17 14:31 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, bhutchings, shemminger, andy, fbl, jzupka, ivecera,
	mirqus
In-Reply-To: <1321539365-1125-2-git-send-email-jpirko@redhat.com>

Le jeudi 17 novembre 2011 à 15:16 +0100, Jiri Pirko a écrit :
> Apparently using variable-length array is not correct
> (https://lkml.org/lkml/2011/10/23/25). So remove it.
> 
> Signed-off-by: Jiri Pirko <jpirko@redhat.com>
> ---
>  drivers/net/team/team.c |    9 +++++++--
>  1 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
> index 5b169c1..c48ef19 100644
> --- a/drivers/net/team/team.c
> +++ b/drivers/net/team/team.c
> @@ -96,10 +96,13 @@ int team_options_register(struct team *team,
>  			  size_t option_count)
>  {
>  	int i;
> -	struct team_option *dst_opts[option_count];
> +	struct team_option **dst_opts;
>  	int err;
>  
> -	memset(dst_opts, 0, sizeof(dst_opts));
> +	dst_opts = kzalloc(sizeof(struct team_option *) * option_count,
> +			   GFP_KERNEL);
> +	if (!dst_opts)
> +		return -ENOMEM;
>  	for (i = 0; i < option_count; i++, option++) {
>  		struct team_option *dst_opt;
>  
> @@ -119,12 +122,14 @@ int team_options_register(struct team *team,
>  	for (i = 0; i < option_count; i++)
>  		list_add_tail(&dst_opts[i]->list, &team->option_list);
>  
> +	kfree(dst_opts);
>  	return 0;
>  
>  rollback:
>  	for (i = 0; i < option_count; i++)
>  		kfree(dst_opts[i]);
>  
> +	kfree(dst_opts);
>  	return err;
>  }
>  

Please use kmemdup() as well, or someone else will do it ;)

dst_opt = kmalloc(sizeof(*option), GFP_KERNEL);
...
memcpy(dst_opt, option, sizeof(*option));

-> dst_opt = kmemdup(...);

^ permalink raw reply

* RE: [PATCH net-next] r8169: Add 64bit statistics
From: Eric Dumazet @ 2011-11-17 14:27 UTC (permalink / raw)
  To: David Laight; +Cc: Junchang Wang, Stephen Hemminger, netdev, romieu, nic swsd
In-Reply-To: <AE90C24D6B3A694183C094C60CF0A2F6D8AECB@saturn3.aculab.com>

Le jeudi 17 novembre 2011 à 14:01 +0000, David Laight a écrit :

> Ah yes ...
> 
> Both u64_stats_update_begin & _end increment a numeric field
> with an appropriate memory barrier. So the 'update' path
> has two extra read-modify-write sequences (possibly the
> 2nd read can be optimised out), and two smp_wmb() that may
> introduce bus delays.
> 
> Would be fine if it were guaranteed to work!
> Consider the following sequence of events:
>        u64_stats_update_begin()
>        calculate 'count+1'
>                                 read_seqcount_begin()
>                                 read count_hi
>        write count_lo
>                                 read count_lo
>        write count_hi
>                                 read_seqcount_retry()
>        ... update other counters ...
>        u64_stats_update_end()
> The reader gets an invalid value since it reads the same
> 'sequence' both times.
> 
> Could be fixed by using '|= 1' in update_begin and
> looping on odd values in read_seqcount_begin().
> 

I am a bit tired of this discussion.

You obviously dont understand how the thing is working.

Spend some time on it before claiming there are bugs.

^ permalink raw reply

* [patch net-next 2/2] team: avoid using variable-length array
From: Jiri Pirko @ 2011-11-17 14:16 UTC (permalink / raw)
  To: netdev
  Cc: davem, eric.dumazet, bhutchings, shemminger, andy, fbl, jzupka,
	ivecera, mirqus
In-Reply-To: <1321539365-1125-1-git-send-email-jpirko@redhat.com>

Apparently using variable-length array is not correct
(https://lkml.org/lkml/2011/10/23/25). So remove it.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
---
 drivers/net/team/team.c |    9 +++++++--
 1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index 5b169c1..c48ef19 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -96,10 +96,13 @@ int team_options_register(struct team *team,
 			  size_t option_count)
 {
 	int i;
-	struct team_option *dst_opts[option_count];
+	struct team_option **dst_opts;
 	int err;
 
-	memset(dst_opts, 0, sizeof(dst_opts));
+	dst_opts = kzalloc(sizeof(struct team_option *) * option_count,
+			   GFP_KERNEL);
+	if (!dst_opts)
+		return -ENOMEM;
 	for (i = 0; i < option_count; i++, option++) {
 		struct team_option *dst_opt;
 
@@ -119,12 +122,14 @@ int team_options_register(struct team *team,
 	for (i = 0; i < option_count; i++)
 		list_add_tail(&dst_opts[i]->list, &team->option_list);
 
+	kfree(dst_opts);
 	return 0;
 
 rollback:
 	for (i = 0; i < option_count; i++)
 		kfree(dst_opts[i]);
 
+	kfree(dst_opts);
 	return err;
 }
 
-- 
1.7.6

^ permalink raw reply related

* [patch net-next 1/2] team: add fix_features
From: Jiri Pirko @ 2011-11-17 14:16 UTC (permalink / raw)
  To: netdev
  Cc: davem, eric.dumazet, bhutchings, shemminger, andy, fbl, jzupka,
	ivecera, mirqus

do fix features in similar way as bonding code does

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
---
 drivers/net/team/team.c |   22 ++++++++++++++++++++++
 1 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index f309274..5b169c1 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -953,6 +953,27 @@ static int team_del_slave(struct net_device *dev, struct net_device *port_dev)
 	return err;
 }
 
+static netdev_features_t team_fix_features(struct net_device *dev,
+					   netdev_features_t features)
+{
+	struct team_port *port;
+	struct team *team = netdev_priv(dev);
+	netdev_features_t mask;
+
+	mask = features;
+	features &= ~NETIF_F_ONE_FOR_ALL;
+	features |= NETIF_F_ALL_FOR_ALL;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(port, &team->port_list, list) {
+		features = netdev_increment_features(features,
+						     port->dev->features,
+						     mask);
+	}
+	rcu_read_unlock();
+	return features;
+}
+
 static const struct net_device_ops team_netdev_ops = {
 	.ndo_init		= team_init,
 	.ndo_uninit		= team_uninit,
@@ -968,6 +989,7 @@ static const struct net_device_ops team_netdev_ops = {
 	.ndo_vlan_rx_kill_vid	= team_vlan_rx_kill_vid,
 	.ndo_add_slave		= team_add_slave,
 	.ndo_del_slave		= team_del_slave,
+	.ndo_fix_features	= team_fix_features,
 };
 
 
-- 
1.7.6

^ permalink raw reply related

* RE: [PATCH net-next] r8169: Add 64bit statistics
From: David Laight @ 2011-11-17 14:01 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Junchang Wang, Stephen Hemminger, netdev, romieu, nic swsd
In-Reply-To: <1321534700.2751.27.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

> From: Eric Dumazet [mailto:eric.dumazet@gmail.com] 
> Le jeudi 17 novembre 2011 à 11:13 +0000, David Laight a écrit :
> > The 64bit stats update sequence used to get valid
> > counts on 32bit systems (that can't do locked 64bit
> > memory access) seems to be:
> > 
> > >     u64_stats_update_begin(&sky2->tx_stats.syncp);
> > >     ++sky2->tx_stats.packets;
> > >     sky2->tx_stats.bytes += skb->len;
> > >     u64_stats_update_end(&sky2->tx_stats.syncp);
> > 
> > I'm not sure what the begin/end markers do, but
> > they need to hold off the readers during updates
> > and the writers during reads - this is probably
> > expensive on the update path.
> > 
> > A thought that might work is for the writer to
> > write the middle bits of the 64 bit walue to
> > another location, eg:
> >        count = sky2->tx_stats.bytes + skb->len;
> >        sky2->tx_stats.bytes = count;
> >        sky2->tx_stats.bytes_check = count >> 16;
> > The reader then loops until the two value are
> > consistent.
> > 
> > I think this doesn't even require a memory barrier
> > in the ISR since the order of the reads an writes
> > doesn't matter at all.
> > 
> > 	David
> > 
> > 
> 
> Oh well...
> 
> Before claiming all this, you really should read 
> include/linux/u64_stats_sync.h
> 
> This should answer all your questions.

Ah yes ...

Both u64_stats_update_begin & _end increment a numeric field
with an appropriate memory barrier. So the 'update' path
has two extra read-modify-write sequences (possibly the
2nd read can be optimised out), and two smp_wmb() that may
introduce bus delays.

Would be fine if it were guaranteed to work!
Consider the following sequence of events:
       u64_stats_update_begin()
       calculate 'count+1'
                                read_seqcount_begin()
                                read count_hi
       write count_lo
                                read count_lo
       write count_hi
                                read_seqcount_retry()
       ... update other counters ...
       u64_stats_update_end()
The reader gets an invalid value since it reads the same
'sequence' both times.

Could be fixed by using '|= 1' in update_begin and
looping on odd values in read_seqcount_begin().

	David

                      

^ permalink raw reply

* Re: Unable to flush ICMP redirect routes in kernel 3.0+
From: Eric Dumazet @ 2011-11-17 13:15 UTC (permalink / raw)
  To: Flavio Leitner; +Cc: Ivan Zahariev, netdev
In-Reply-To: <20111117111145.252924f5@asterix.rh>

Le jeudi 17 novembre 2011 à 11:11 -0200, Flavio Leitner a écrit :

> I am going to be on vacations in the next couple weeks, so I won't be
> able to help fixing this any time soon. However, I am pretty sure
> someone else will help though :)
> 

I'll take a look :)

^ permalink raw reply

* [PATCH v2 net-next] net: use jump_label to shortcut RPS if not setup
From: Eric Dumazet @ 2011-11-17 13:13 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Tom Herbert

Most machines dont use RPS/RFS, and pay a fair amount of instructions in
netif_receive_skb() / netif_rx() / get_rps_cpu() just to discover
RPS/RFS is not setup.

Add a jump_label named rps_needed.

If no device rps_map or global rps_sock_flow_table is setup,
netif_receive_skb() / netif_rx() do a single instruction instead of many
ones, including conditional jumps.

jmp +0    (if CONFIG_JUMP_LABEL=y)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Tom Herbert <therbert@google.com>
---
V2: add netif_rx() optim as well

 include/linux/netdevice.h  |    5 +++++
 net/core/dev.c             |   21 +++++++++------------
 net/core/net-sysfs.c       |    7 +++++--
 net/core/sysctl_net_core.c |    9 +++++++--
 4 files changed, 26 insertions(+), 16 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 4d5698a..0bbe030 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -214,6 +214,11 @@ enum {
 #include <linux/cache.h>
 #include <linux/skbuff.h>
 
+#ifdef CONFIG_RPS
+#include <linux/jump_label.h>
+extern struct jump_label_key rps_needed;
+#endif
+
 struct neighbour;
 struct neigh_parms;
 struct sk_buff;
diff --git a/net/core/dev.c b/net/core/dev.c
index 26c49d5..f789599 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2711,6 +2711,8 @@ EXPORT_SYMBOL(__skb_get_rxhash);
 struct rps_sock_flow_table __rcu *rps_sock_flow_table __read_mostly;
 EXPORT_SYMBOL(rps_sock_flow_table);
 
+struct jump_label_key rps_needed __read_mostly;
+
 static struct rps_dev_flow *
 set_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 	    struct rps_dev_flow *rflow, u16 next_cpu)
@@ -2994,7 +2996,7 @@ int netif_rx(struct sk_buff *skb)
 
 	trace_netif_rx(skb);
 #ifdef CONFIG_RPS
-	{
+	if (static_branch(&rps_needed))	{
 		struct rps_dev_flow voidflow, *rflow = &voidflow;
 		int cpu;
 
@@ -3009,14 +3011,13 @@ int netif_rx(struct sk_buff *skb)
 
 		rcu_read_unlock();
 		preempt_enable();
-	}
-#else
+	} else
+#endif
 	{
 		unsigned int qtail;
 		ret = enqueue_to_backlog(skb, get_cpu(), &qtail);
 		put_cpu();
 	}
-#endif
 	return ret;
 }
 EXPORT_SYMBOL(netif_rx);
@@ -3359,7 +3360,7 @@ int netif_receive_skb(struct sk_buff *skb)
 		return NET_RX_SUCCESS;
 
 #ifdef CONFIG_RPS
-	{
+	if (static_branch(&rps_needed)) {
 		struct rps_dev_flow voidflow, *rflow = &voidflow;
 		int cpu, ret;
 
@@ -3370,16 +3371,12 @@ int netif_receive_skb(struct sk_buff *skb)
 		if (cpu >= 0) {
 			ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
 			rcu_read_unlock();
-		} else {
-			rcu_read_unlock();
-			ret = __netif_receive_skb(skb);
+			return ret;
 		}
-
-		return ret;
+		rcu_read_unlock();
 	}
-#else
-	return __netif_receive_skb(skb);
 #endif
+	return __netif_receive_skb(skb);
 }
 EXPORT_SYMBOL(netif_receive_skb);
 
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 602b141..db6c2f8 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -606,9 +606,12 @@ static ssize_t store_rps_map(struct netdev_rx_queue *queue,
 	rcu_assign_pointer(queue->rps_map, map);
 	spin_unlock(&rps_map_lock);
 
-	if (old_map)
+	if (map)
+		jump_label_inc(&rps_needed);
+	if (old_map) {
 		kfree_rcu(old_map, rcu);
-
+		jump_label_dec(&rps_needed);
+	}
 	free_cpumask_var(mask);
 	return len;
 }
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index 77a65f0..d05559d 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -68,8 +68,13 @@ static int rps_sock_flow_sysctl(ctl_table *table, int write,
 
 		if (sock_table != orig_sock_table) {
 			rcu_assign_pointer(rps_sock_flow_table, sock_table);
-			synchronize_rcu();
-			vfree(orig_sock_table);
+			if (sock_table)
+				jump_label_inc(&rps_needed);
+			if (orig_sock_table) {
+				jump_label_dec(&rps_needed);
+				synchronize_rcu();
+				vfree(orig_sock_table);
+			}
 		}
 	}
 

^ permalink raw reply related

* Re: Unable to flush ICMP redirect routes in kernel 3.0+
From: Flavio Leitner @ 2011-11-17 13:11 UTC (permalink / raw)
  To: Ivan Zahariev; +Cc: netdev
In-Reply-To: <4EC4C160.6010804@icdsoft.com>

On Thu, 17 Nov 2011 10:10:08 +0200
Ivan Zahariev <famzah@icdsoft.com> wrote:

> On 17.11.2011 г. 02:33 ч., Flavio Leitner wrote:
> > On Thu, 17 Nov 2011 00:32:18 +0200
> > Ivan Zahariev<famzah@icdsoft.com>  wrote:
> >
> >> On 11/15/2011 11:09 PM, Eric Dumazet wrote:
> >>> Le mardi 15 novembre 2011 à 22:23 +0200, Ivan Zahariev a écrit :
> >>>> Hello,
> >>>>
> >>>> We have changed nothing in our network infrastructure but only
> >>>> upgraded from Linux kernel 2.6.36.2 to 3.0.3. Here is the problem
> >>>> we are experiencing:
> >>>>
> >>>> ICMP redirected routes are cached forever, and they can be
> >>>> cleared only by a reboot.
> >>>>
> >> ### (bug #1) even though we flushed the route cache,
> >> the<redirected> route resurrects from somewhere; even without
> >> making any TCP requests ### this time what "ip" returns is
> >> consistent with the real (incorrect) routing behavior of machine5
> >> root@machine5:~# ip route flush cache
> >> root@machine5:~# ip route list cache match 8.8.4.4
> >> root@machine5:~# ip route get 8.8.4.4
> >> 8.8.4.4 via 192.168.0.120 dev eth0  src 192.168.0.244
> >>       cache<redirected>   ipid 0x303a
> >>
> >> ### only a reboot clears the cached<redirected>  routes
> > IIRC, the cache flush doesn't affect the inetpeer where the
> > redirected gateway is now stored, so even after flushing the
> > route cache, the inetpeer will restore the old info later.
> >
> > fbl
> OK, I guess my questions now are:
> * How to flush the inetpeer (redirected cache info) without having to 
> reboot the machine?

It will expire after 10min if you don't use that specific host.

> * Why "ip route" returns an incorrect route; example:

I am sorry for not being clear before. It is a bug, indeed.
 
> ### (bug #2) what "ip route" returns is inconsistent, because we are 
> using the <redirected> route 192.168.0.120 in reality
> ### note that the count of the route lines increased with one
> root@machine5:~# ip route list cache match 8.8.4.4
> 8.8.4.4 from 192.168.0.244 tos lowdelay via 192.168.0.8 dev eth0
>      cache  ipid 0x303a
> 8.8.4.4 tos lowdelay via 192.168.0.8 dev eth0  src 192.168.0.244
>      cache  ipid 0x303a
> 8.8.4.4 via 192.168.0.8 dev eth0  src 192.168.0.244
>      cache
> 8.8.4.4 from 192.168.0.244 tos lowdelay via 192.168.0.8 dev eth0
>      cache  ipid 0x303a
> 
> ### After "ip route flush cache", the output of "ip route" gets 
> consistent with the real routing behavior of machine5
> root@machine5:~# ip route flush cache
> root@machine5:~# ip route list cache match 8.8.4.4
> root@machine5:~# ip route get 8.8.4.4
> 8.8.4.4 via 192.168.0.120 dev eth0  src 192.168.0.244
>      cache <redirected>  ipid 0x303a
> 

Now the redirected gateway is stored in inetpeer which represents
an specific peer.  In your case, you have one for 8.8.4.4.

When you flush the routing cache everything is flushed, except for
the inetpeer as far as I can tell.  Later, when you try to access
the host 8.8.4.4 again, the lookup will create a fresh route but
also find the previous 8.8.4.4 inetpeer, so it will re-use the
previous redirected gateway.

Therefore, the routing is fine, but it is missing a way to
invalidade or expire all related inetpeer entries when the flush
happens.

The inetpeer will expire eventually, so waiting before trying again
would help to work around:

1) flush
2) wait to expire (10min)
3) try again

If you know how to compile a kernel, try to change these thresholds
below to expire faster, then you have to wait less for it to expire
instead of rebooting.

net/ipv4/inetpeer.c:
int inet_peer_minttl __read_mostly = 120 * HZ;  /* TTL under high load:
120 sec */ int inet_peer_maxttl __read_mostly = 10 * 60 * HZ;      /*
usual time to live: 10 min */                           

That above is just a workaround, indeed.

I am going to be on vacations in the next couple weeks, so I won't be
able to help fixing this any time soon. However, I am pretty sure
someone else will help though :)

fbl

^ permalink raw reply

* [PATCH net-next] ibm/emac: fix improper cleanup when device is removed to allow re-bind
From: Wolfgang Grandegger @ 2011-11-17 13:06 UTC (permalink / raw)
  To: Netdev; +Cc: Linuxppc-dev

The re-binding (unbind..bind) of an EMAC device fails because the
static variable "busy_phy_map" is not updated when the device is
removed.

Signed-off-by: Wolfgang Grandegger <wg@denx.de>
---
 drivers/net/ethernet/ibm/emac/core.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/ibm/emac/core.c b/drivers/net/ethernet/ibm/emac/core.c
index ed79b2d..2abce96 100644
--- a/drivers/net/ethernet/ibm/emac/core.c
+++ b/drivers/net/ethernet/ibm/emac/core.c
@@ -2924,6 +2924,9 @@ static int __devexit emac_remove(struct platform_device *ofdev)
 	if (emac_has_feature(dev, EMAC_FTR_HAS_ZMII))
 		zmii_detach(dev->zmii_dev, dev->zmii_port);
 
+	busy_phy_map &= ~(1 << dev->phy.address);
+	DBG(dev, "busy_phy_map now %#x" NL, busy_phy_map);
+
 	mal_unregister_commac(dev->mal, &dev->commac);
 	emac_put_deps(dev);
 
-- 
1.7.4.1

^ permalink raw reply related

* [PATCH] f_phonet: fix page offset of first received fragment
From: Rémi Denis-Courmont @ 2011-11-17 12:58 UTC (permalink / raw)
  To: netdev; +Cc: Rémi Denis-Courmont

From: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>

We pull one byte (the MAC header) from the first fragment before the
fragment is actually appended. So the socket buffer length is 1, not 0.

Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
---
 drivers/usb/gadget/f_phonet.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/usb/gadget/f_phonet.c b/drivers/usb/gadget/f_phonet.c
index 3490770..16a509a 100644
--- a/drivers/usb/gadget/f_phonet.c
+++ b/drivers/usb/gadget/f_phonet.c
@@ -346,7 +346,7 @@ static void pn_rx_complete(struct usb_ep *ep, struct usb_request *req)
 		}
 
 		skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, page,
-				skb->len == 0, req->actual);
+				skb->len <= 1, req->actual);
 		page = NULL;
 
 		if (req->actual < req->length) { /* Last fragment */
-- 
1.7.5.4

^ permalink raw reply related

* Re: [PATCH net-next] r8169: Add 64bit statistics
From: Eric Dumazet @ 2011-11-17 13:02 UTC (permalink / raw)
  To: Junchang Wang; +Cc: Stephen Hemminger, netdev, romieu, nic swsd
In-Reply-To: <CABoNC83U2k=-6-BkJEyWBXfSnEbb-+K8CD1ffqhjA613Tk04tQ@mail.gmail.com>

Le jeudi 17 novembre 2011 à 18:59 +0800, Junchang Wang a écrit :
> > Yes, look at sky2.c for a template
> >
> > drivers/net/ethernet/marvell/sky2.c contains code like that
> > (different syncp for rx/tx)
> >
> > TX path:
> >                        u64_stats_update_begin(&sky2->tx_stats.syncp);
> >                        ++sky2->tx_stats.packets;
> >                        sky2->tx_stats.bytes += skb->len;
> >                        u64_stats_update_end(&sky2->tx_stats.syncp);
> >
> >
> > RX path:
> >
> >        u64_stats_update_begin(&sky2->rx_stats.syncp);
> >        sky2->rx_stats.packets += packets;
> >        sky2->rx_stats.bytes += bytes;
> >        u64_stats_update_end(&sky2->rx_stats.syncp);
> >
> 
> Thanks, Eric.
> 
> I'm still confused about why we need two sync entries. Please correct
> me if I'm wrong.
> 
> Take r8169 for example, All statistic entries are updated in
> rtl8169_rx_interrupt() or rtl8169_tx_interrupt(). Those two functions
> are called in rtl8169_poll().
> As far as I know, rtl8169_poll() is protected by NAPI_STATE_SCHED bit
> to run on a single core at one moment. So there is not compulsory to
> use two sync entries.
> 
> One benefit from two sync is that readers can avoid many retries.
> 

Oh well...

Dont try to optimize all this stuff, I spent some time on it already,
just use the infrastructure and forget about races and bugs.

The short answer is :

CPU1                    CPU2                  
TX path                 RX path               


since CPU1 & CPU2 will do the update_begin(), and the memory increment
is not atomic, we might lose one increment, and block a stat reader
forever.

^ permalink raw reply

* RE: [PATCH net-next] r8169: Add 64bit statistics
From: Eric Dumazet @ 2011-11-17 12:58 UTC (permalink / raw)
  To: David Laight; +Cc: Junchang Wang, Stephen Hemminger, netdev, romieu, nic swsd
In-Reply-To: <AE90C24D6B3A694183C094C60CF0A2F6D8AEC9@saturn3.aculab.com>

Le jeudi 17 novembre 2011 à 11:13 +0000, David Laight a écrit :
> The 64bit stats update sequence used to get valid
> counts on 32bit systems (that can't do locked 64bit
> memory access) seems to be:
> 
> >     u64_stats_update_begin(&sky2->tx_stats.syncp);
> >     ++sky2->tx_stats.packets;
> >     sky2->tx_stats.bytes += skb->len;
> >     u64_stats_update_end(&sky2->tx_stats.syncp);
> 
> I'm not sure what the begin/end markers do, but
> they need to hold off the readers during updates
> and the writers during reads - this is probably
> expensive on the update path.
> 
> A thought that might work is for the writer to
> write the middle bits of the 64 bit walue to
> another location, eg:
>        count = sky2->tx_stats.bytes + skb->len;
>        sky2->tx_stats.bytes = count;
>        sky2->tx_stats.bytes_check = count >> 16;
> The reader then loops until the two value are
> consistent.
> 
> I think this doesn't even require a memory barrier
> in the ISR since the order of the reads an writes
> doesn't matter at all.
> 
> 	David
> 
> 

Oh well...

Before claiming all this, you really should read 
include/linux/u64_stats_sync.h

This should answer all your questions.

^ permalink raw reply

* RE: 3.1.0 rtl8169_open oops.
From: hayeswang @ 2011-11-17 12:26 UTC (permalink / raw)
  To: 'Dave Jones', 'Francois Romieu'
  Cc: netdev, kernel-team, dwmw2
In-Reply-To: <20111117005355.GA8466@redhat.com>

 

> -----Original Message-----
> From: Dave Jones [mailto:davej@redhat.com] 
> Sent: Thursday, November 17, 2011 8:54 AM
> To: Francois Romieu
> Cc: netdev@vger.kernel.org; kernel-team@fedoraproject.org; 
> Hayeswang; dwmw2@infradead.org
> Subject: Re: 3.1.0 rtl8169_open oops.
> 
> On Thu, Nov 17, 2011 at 01:34:36AM +0100, Francois Romieu wrote:
>  > Dave Jones <davej@redhat.com> :
>  > [...]
>  > > (gdb) list *(rtl8169_open)+0x34e
>  > > 0x4c0f is in rtl8169_open (drivers/net/r8169.c:1881).
>  > > 1876		bool rc = false;
>  > > 1877	
>  > > 1878		if (fw->size < FW_OPCODE_SIZE)
>  > > 1879			goto out;
>  > > 1880	
>  > > 1881		if (!fw_info->magic) {
>  > > 1882			size_t i, size, start;
>  > > 1883			u8 checksum = 0;
>  > > 1884	
>  > > 1885			if (fw->size < sizeof(*fw_info))
>  > 

My colleague tries Fedora 16 and it works fine with several different chips. We
couldn't reproduce the issue.

>  > Give me 24h. I hope it is not as trivial as it seems :o(
>  > 
>  > Btw the firmware included in F16 is not up-to-date (see
>  > http://marc.info/?l=linux-kernel&m=131295492507686&w=2). 
> The driver should
>  > not mind though.
> 
> That reminds me. Since the kernel.org compromise, where does
> linux-firmware live now ?
> 

I submitted the new firmwares to

http://git.kernel.org/pub/scm/linux/kernel/git/dwmw2/linux-firmware.git

However, the owner doesn't update them yet. You could reference

http://www.spinics.net/lists/netdev/msg178015.html

 
Best Regards,
Hayes

^ permalink raw reply

* RE: [linux-firmware v6 2/4] rtl_nic: add new firmware for RTL8111F
From: hayeswang @ 2011-11-17 12:16 UTC (permalink / raw)
  To: dwmw2, ben; +Cc: romieu, netdev
In-Reply-To: <1319634425-1529-2-git-send-email-hayeswang@realtek.com>

 

> -----Original Message-----
> From: Hayeswang [mailto:hayeswang@realtek.com] 
> Sent: Wednesday, October 26, 2011 9:07 PM
> To: dwmw2@infradead.org; ben@decadent.org.uk
> Cc: romieu@fr.zoreil.com; netdev@vger.kernel.org; Hayeswang
> Subject: [linux-firmware v6 2/4] rtl_nic: add new firmware 
> for RTL8111F
> 
> Add new firmware:
> 1. rtl_nic/rtl8168f-1.fw
>    version: 0.0.3
> 2. rtl_nic/rtl8168f-2.fw
>    version: 0.0.3
> 
> Signed-off-by: Hayes Wang <hayeswang@realtek.com>
> ---
>  WHENCE                |    6 ++++++
>  rtl_nic/rtl8168f-1.fw |  Bin 0 -> 3136 bytes
>  rtl_nic/rtl8168f-2.fw |  Bin 0 -> 992 bytes
>  3 files changed, 6 insertions(+), 0 deletions(-)
>  create mode 100644 rtl_nic/rtl8168f-1.fw
>  create mode 100644 rtl_nic/rtl8168f-2.fw
> 

Hi sirs,

Excuse me. Someone asks me these firmwares. Are these pending? Any response?
 
Best Regards,
Hayes

^ permalink raw reply

* Re: [PATCH 1/2] net: add network priority cgroup infrastructure
From: Neil Horman @ 2011-11-17 11:56 UTC (permalink / raw)
  To: John Fastabend; +Cc: netdev@vger.kernel.org, Love, Robert W, David S. Miller
In-Reply-To: <4EC4EF46.7010009@intel.com>

On Thu, Nov 17, 2011 at 03:25:58AM -0800, John Fastabend wrote:
> On 11/16/2011 12:51 PM, Neil Horman wrote:
> > This patch adds in the infrastructure code to create the network priority
> > cgroup.  The cgroup, in addition to the standard processes file creates two
> > control files:
> > 
> > 1) prioidx - This is a read-only file that exports the index of this cgroup.
> > This is a value that is both arbitrary and unique to a cgroup in this subsystem,
> > and is used to index the per-device priority map
> > 
> > 2) priomap - This is a writeable file.  On read it reports a table of 2-tuples
> > <name:priority> where name is the name of a network interface and priority is
> > indicates the priority assigned to frames egresessing on the named interface and
> > originating from a pid in this cgroup
> > 
> > This cgroup allows for skb priority to be set prior to a root qdisc getting
> > selected. This is benenficial for DCB enabled systems, in that it allows for any
> > application to use dcb configured priorities so without application modification
> > 
> > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> > CC: Robert Love <robert.w.love@intel.com>
> > CC: "David S. Miller" <davem@davemloft.net>
> > ---
> >  include/linux/cgroup_subsys.h |    8 +
> >  include/linux/netdevice.h     |    4 +
> >  include/net/netprio_cgroup.h  |   66 ++++++++
> >  include/net/sock.h            |    3 +
> >  net/Kconfig                   |    7 +
> >  net/core/Makefile             |    1 +
> >  net/core/dev.c                |   13 ++
> >  net/core/netprio_cgroup.c     |  340 +++++++++++++++++++++++++++++++++++++++++
> >  net/core/sock.c               |   22 +++-
> >  net/socket.c                  |    2 +
> >  10 files changed, 465 insertions(+), 1 deletions(-)
> >  create mode 100644 include/net/netprio_cgroup.h
> >  create mode 100644 net/core/netprio_cgroup.c
> > 
> 
> Hi Neil,
> 
> couple more comments.
> 
Thanks John, I'll fix these points up and post a v2 later today.
Neil

> 

^ permalink raw reply

* Re: [PATCH 1/2] net: add network priority cgroup infrastructure
From: John Fastabend @ 2011-11-17 11:25 UTC (permalink / raw)
  To: Neil Horman; +Cc: netdev@vger.kernel.org, Love, Robert W, David S. Miller
In-Reply-To: <1321476666-8225-2-git-send-email-nhorman@tuxdriver.com>

On 11/16/2011 12:51 PM, Neil Horman wrote:
> This patch adds in the infrastructure code to create the network priority
> cgroup.  The cgroup, in addition to the standard processes file creates two
> control files:
> 
> 1) prioidx - This is a read-only file that exports the index of this cgroup.
> This is a value that is both arbitrary and unique to a cgroup in this subsystem,
> and is used to index the per-device priority map
> 
> 2) priomap - This is a writeable file.  On read it reports a table of 2-tuples
> <name:priority> where name is the name of a network interface and priority is
> indicates the priority assigned to frames egresessing on the named interface and
> originating from a pid in this cgroup
> 
> This cgroup allows for skb priority to be set prior to a root qdisc getting
> selected. This is benenficial for DCB enabled systems, in that it allows for any
> application to use dcb configured priorities so without application modification
> 
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> CC: Robert Love <robert.w.love@intel.com>
> CC: "David S. Miller" <davem@davemloft.net>
> ---
>  include/linux/cgroup_subsys.h |    8 +
>  include/linux/netdevice.h     |    4 +
>  include/net/netprio_cgroup.h  |   66 ++++++++
>  include/net/sock.h            |    3 +
>  net/Kconfig                   |    7 +
>  net/core/Makefile             |    1 +
>  net/core/dev.c                |   13 ++
>  net/core/netprio_cgroup.c     |  340 +++++++++++++++++++++++++++++++++++++++++
>  net/core/sock.c               |   22 +++-
>  net/socket.c                  |    2 +
>  10 files changed, 465 insertions(+), 1 deletions(-)
>  create mode 100644 include/net/netprio_cgroup.h
>  create mode 100644 net/core/netprio_cgroup.c
> 

Hi Neil,

couple more comments.

> diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
> index ac663c1..0bd390c 100644
> --- a/include/linux/cgroup_subsys.h
> +++ b/include/linux/cgroup_subsys.h
> @@ -59,8 +59,16 @@ SUBSYS(net_cls)
>  SUBSYS(blkio)
>  #endif
> 
> +/* */
> +
>  #ifdef CONFIG_CGROUP_PERF
>  SUBSYS(perf)
>  #endif
> 
>  /* */
> +
> +#ifdef CONFIG_NETPRIO_CGROUP
> +SUBSYS(net_prio)
> +#endif
> +
> +/* */
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 0db1f5f..86e8c3f 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -50,6 +50,7 @@
>  #ifdef CONFIG_DCB
>  #include <net/dcbnl.h>
>  #endif
> +#include <net/netprio_cgroup.h>
> 
>  struct vlan_group;
>  struct netpoll_info;
> @@ -1312,6 +1313,9 @@ struct net_device {
>         /* max exchange id for FCoE LRO by ddp */
>         unsigned int            fcoe_ddp_xid;
>  #endif
> +#if IS_ENABLED(CONFIG_NETPRIO_CGROUP)
> +       struct netprio_map *priomap;
> +#endif

Any reason not to annotate this? 'struct netprio_map __rcu *priomap;' and
then use rcu_dereference_protected() throughout.

>         /* phy device may attach itself for hardware timestamping */
>         struct phy_device *phydev;
> 
> diff --git a/include/net/netprio_cgroup.h b/include/net/netprio_cgroup.h
> new file mode 100644
> index 0000000..6b65936
> --- /dev/null
> +++ b/include/net/netprio_cgroup.h
> @@ -0,0 +1,66 @@
> +/*
> + * netprio_cgroup.h                    Control Group Priority set
> + *
> + *
> + * Authors:    Neil Horman <nhorman@tuxdriver.com>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License as published by the Free
> + * Software Foundation; either version 2 of the License, or (at your option)
> + * any later version.
> + *
> + */
> +
> +#ifndef _NETPRIO_CGROUP_H
> +#define _NETPRIO_CGROUP_H
> +#include <linux/module.h>
> +#include <linux/cgroup.h>
> +#include <linux/hardirq.h>
> +#include <linux/rcupdate.h>
> +
> +struct cgroup_netprio_state
> +{
> +       struct cgroup_subsys_state css;
> +       u32 prioidx;
> +};
> +
> +struct netprio_map {
> +       struct rcu_head rcu;
> +       u32 priomap_len;
> +       u32 priomap[];
> +};
> +
> +#ifdef CONFIG_CGROUPS
> +
> +#ifndef CONFIG_NETPRIO_CGROUP
> +extern int net_prio_subsys_id;
> +#endif
> +
> +extern void sock_update_netprioidx(struct sock *sk);
> +extern void skb_update_prio(struct sk_buff *skb);
> +
> +static inline struct cgroup_netprio_state
> +               *task_netprio_state(struct task_struct *p)
> +{
> +#if IS_ENABLED(CONFIG_NETPRIO_CGROUP)
> +       return container_of(task_subsys_state(p, net_prio_subsys_id),
> +                           struct cgroup_netprio_state, css);
> +#else
> +       return NULL;
> +#endif
> +}
> +
> +#else
> +
> +#define sock_update_netprioidx(sk)
> +#define skb_update_prio(skb)
> +
> +static inline struct cgroup_netprio_state
> +               *task_netprio_state(struct task_struct *p)
> +{
> +       return NULL;
> +}
> +
> +#endif
> +
> +#endif  /* _NET_CLS_CGROUP_H */
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 5ac682f..87b24aa 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -321,6 +321,9 @@ struct sock {
>         unsigned short          sk_ack_backlog;
>         unsigned short          sk_max_ack_backlog;
>         __u32                   sk_priority;
> +#ifdef CONFIG_CGROUPS
> +       __u32                   sk_cgrp_prioidx;
> +#endif
>         struct pid              *sk_peer_pid;
>         const struct cred       *sk_peer_cred;
>         long                    sk_rcvtimeo;
> diff --git a/net/Kconfig b/net/Kconfig
> index a073148..63d2c5d 100644
> --- a/net/Kconfig
> +++ b/net/Kconfig
> @@ -232,6 +232,13 @@ config XPS
>         depends on SMP && SYSFS && USE_GENERIC_SMP_HELPERS
>         default y
> 
> +config NETPRIO_CGROUP
> +       tristate "Network priority cgroup"
> +       depends on CGROUPS
> +       ---help---
> +         Cgroup subsystem for use in assigning processes to network priorities on
> +         a per-interface basis
> +
>  config HAVE_BPF_JIT
>         bool
> 
> diff --git a/net/core/Makefile b/net/core/Makefile
> index 0d357b1..3606d40 100644
> --- a/net/core/Makefile
> +++ b/net/core/Makefile
> @@ -19,3 +19,4 @@ obj-$(CONFIG_FIB_RULES) += fib_rules.o
>  obj-$(CONFIG_TRACEPOINTS) += net-traces.o
>  obj-$(CONFIG_NET_DROP_MONITOR) += drop_monitor.o
>  obj-$(CONFIG_NETWORK_PHY_TIMESTAMPING) += timestamping.o
> +obj-$(CONFIG_NETPRIO_CGROUP) += netprio_cgroup.o
> diff --git a/net/core/dev.c b/net/core/dev.c
> index b7ba81a..a1dca83 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2456,6 +2456,17 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>         return rc;
>  }
> 
> +#ifdef CONFIG_CGROUPS
> +void skb_update_prio(struct sk_buff *skb)
> +{
> +       struct netprio_map *map = rcu_dereference(skb->dev->priomap);
> +
> +       if ((!skb->priority) && (skb->sk) && map)
> +               skb->priority = map->priomap[skb->sk->sk_cgrp_prioidx];
> +}
> +EXPORT_SYMBOL_GPL(skb_update_prio);
> +#endif
> +
>  static DEFINE_PER_CPU(int, xmit_recursion);
>  #define RECURSION_LIMIT 10
> 
> @@ -2496,6 +2507,8 @@ int dev_queue_xmit(struct sk_buff *skb)
>          */
>         rcu_read_lock_bh();
> 
> +       skb_update_prio(skb);
> +
>         txq = dev_pick_tx(dev, skb);
>         q = rcu_dereference_bh(txq->qdisc);
> 
> diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
> new file mode 100644
> index 0000000..14e896c
> --- /dev/null
> +++ b/net/core/netprio_cgroup.c
> @@ -0,0 +1,340 @@
> +/*
> + * net/sched/cls_cgroup.c      Control Group Classifier
          ^^^^^^^^^^^^^^^^^^
should be netprio_cgroup.c presumably.

> + *
> + *             This program is free software; you can redistribute it and/or
> + *             modify it under the terms of the GNU General Public License
> + *             as published by the Free Software Foundation; either version
> + *             2 of the License, or (at your option) any later version.
> + *
> + * Authors:    Thomas Graf <tgraf@suug.ch>
> + */
> +
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/types.h>
> +#include <linux/string.h>
> +#include <linux/errno.h>
> +#include <linux/skbuff.h>
> +#include <linux/cgroup.h>
> +#include <linux/rcupdate.h>
> +#include <linux/atomic.h>
> +#include <net/rtnetlink.h>
> +#include <net/pkt_cls.h>
> +#include <net/sock.h>
> +#include <net/netprio_cgroup.h>
> +
> +static struct cgroup_subsys_state *cgrp_create(struct cgroup_subsys *ss,
> +                                              struct cgroup *cgrp);
> +static void cgrp_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp);
> +static int cgrp_populate(struct cgroup_subsys *ss, struct cgroup *cgrp);
> +
> +struct cgroup_subsys net_prio_subsys = {
> +       .name           = "net_prio",
> +       .create         = cgrp_create,
> +       .destroy        = cgrp_destroy,
> +       .populate       = cgrp_populate,
> +#ifdef CONFIG_NETPRIO_CGROUP
> +       .subsys_id      = net_prio_subsys_id,
> +#endif
> +       .module         = THIS_MODULE
> +};
> +
> +#define PRIOIDX_SZ 128
> +
> +static unsigned long prioidx_map[PRIOIDX_SZ];
> +static DEFINE_SPINLOCK(prioidx_map_lock);
> +static atomic_t max_prioidx = ATOMIC_INIT(0);
> +
> +static inline struct cgroup_netprio_state *cgrp_netprio_state(struct cgroup *cgrp)
> +{
> +       return container_of(cgroup_subsys_state(cgrp, net_prio_subsys_id),
> +                           struct cgroup_netprio_state, css);
> +}
> +
> +static int get_prioidx(u32 *prio)
> +{
> +       unsigned long flags;
> +       u32 prioidx;
> +
> +       spin_lock_irqsave(&prioidx_map_lock, flags);
> +       prioidx = find_first_zero_bit(prioidx_map, sizeof(unsigned long) * PRIOIDX_SZ);
> +       set_bit(prioidx, prioidx_map);
> +       spin_unlock_irqrestore(&prioidx_map_lock, flags);
> +       if (prioidx == sizeof(unsigned long) * PRIOIDX_SZ)
> +               return -ENOSPC;
> +
> +       atomic_set(&max_prioidx, prioidx);
> +       *prio = prioidx;
> +       return 0;
> +}
> +
> +static void put_prioidx(u32 idx)
> +{
> +       unsigned long flags;
> +       spin_lock_irqsave(&prioidx_map_lock, flags);
> +       clear_bit(idx, prioidx_map);
> +       spin_unlock_irqrestore(&prioidx_map_lock, flags);
> +}
> +
> +static void extend_netdev_table(struct net_device *dev, u32 new_len)
> +{
> +       size_t new_size = sizeof(struct netprio_map) +
> +                          ((sizeof(u32) * new_len));
> +       struct netprio_map *new_priomap = kzalloc(new_size, GFP_KERNEL);
> +       struct netprio_map *old_priomap;
> +       int i;
> +
> +       old_priomap  = rcu_dereference_protected(dev->priomap, 1);

This always has rtnl_lock hence the '1' so use rtnl_dereference().

> +

also extra white line.

> +
> +       if (!new_priomap) {
> +               printk(KERN_WARNING "Unable to alloc new priomap!\n");
> +               return;
> +       }
> +
> +       for (i = 0;
> +            dev->priomap && (i < dev->priomap->priomap_len);
               ^^^^^^^^^^^^         ^^^^^^^^^^^^^ 
probably should use old_priomap here.

> +            i++)
> +               new_priomap->priomap[i] = dev->priomap->priomap[i];
> +
> +       new_priomap->priomap_len = new_len;
> +
> +       rcu_assign_pointer(dev->priomap, new_priomap);
> +       if (old_priomap)
> +               kfree_rcu(old_priomap, rcu);
> +

remove extra line.

> +}
> +
> +static void update_netdev_tables(void)
> +{
> +       struct net_device *dev;
> +       u32 max_len = atomic_read(&max_prioidx);
> +
> +       rtnl_lock();
> +
> +       for_each_netdev(&init_net, dev) {
> +               if ((!dev->priomap) ||
> +                   (dev->priomap->priomap_len < max_len))
> +                       extend_netdev_table(dev, max_len);

With __rcu change add rtnl_dereference()

> +       }
> +
> +       rtnl_unlock();
> +}
> +
> +static struct cgroup_subsys_state *cgrp_create(struct cgroup_subsys *ss,
> +                                                struct cgroup *cgrp)
> +{
> +       struct cgroup_netprio_state *cs;
> +       int ret;
> +
> +       cs = kzalloc(sizeof(*cs), GFP_KERNEL);
> +       if (!cs)
> +               return ERR_PTR(-ENOMEM);
> +
> +       if (cgrp->parent && cgrp_netprio_state(cgrp->parent)->prioidx)
> +               return ERR_PTR(-EINVAL);
> +
> +       ret = get_prioidx(&cs->prioidx);
> +       if (ret != 0) {
> +               printk(KERN_WARNING "No space in priority index array\n");
> +               return ERR_PTR(ret);
> +       }
> +
> +       return &cs->css;
> +}
> +
> +static void cgrp_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
> +{
> +       struct cgroup_netprio_state *cs;
> +       struct net_device *dev;
> +
> +       cs = cgrp_netprio_state(cgrp);
> +       rtnl_lock();
> +       for_each_netdev(&init_net, dev) {
> +               if (dev->priomap)

ditto rtnl_dereference()

> +                       dev->priomap->priomap[cs->prioidx] = 0;
> +       }
> +       rtnl_unlock();
> +       put_prioidx(cs->prioidx);
> +out_free:
> +       kfree(cs);
> +}
> +
> +static u64 read_prioidx(struct cgroup *cgrp, struct cftype *cft)
> +{
> +       return (u64)cgrp_netprio_state(cgrp)->prioidx;
> +}
> +
> +static int read_priomap(struct cgroup *cont, struct cftype *cft,
> +                       struct cgroup_map_cb *cb)
> +{
> +       struct net_device *dev;
> +       u32 prioidx = cgrp_netprio_state(cont)->prioidx;
> +       u32 priority;
> +
> +       /*
> +        * Stub until I add the per-interface priority map
> +        */

we have a per interface map now... so drop the comments.

> +       rcu_read_lock();
> +       for_each_netdev_rcu(&init_net, dev) {
> +               priority = dev->priomap ? dev->priomap->priomap[prioidx] : 0;


This really needs a rcu_dereference() AFAICT.

> +               cb->fill(cb, dev->name, priority);
> +       }
> +       rcu_read_unlock();
> +       return 0;
> +}
> +
> +static int write_priomap(struct cgroup *cgrp, struct cftype *cft,
> +                        const char *buffer)
> +{
> +       char *devname = kstrdup(buffer, GFP_KERNEL);
> +       int ret = -EINVAL;
> +       u32 prioidx = cgrp_netprio_state(cgrp)->prioidx;
> +       unsigned long priority;
> +       char *priostr;
> +       struct net_device *dev;
> +
> +       devname = kstrdup(buffer, GFP_KERNEL);
> +       if (!devname)
> +               return -ENOMEM;
> +
> +       /*
> +        * Minimally sized valid priomap string
> +        */
> +       if (strlen(devname) < 3)
> +               goto out_free_devname;
> +
> +       priostr = strstr(devname, " ");
> +       if (!priostr)
> +               goto out_free_devname;
> +
> +       /*
> +        *Separate the devname from the associated priority
> +        *and advance the priostr poitner to the priority value
> +        */
> +       *priostr = '\0';
> +       priostr++;
> +
> +       /*
> +        * If the priostr points to NULL, we're at the end of the passed
> +        * in string, and its not a valid write
> +        */
> +       if (*priostr == '\0')
> +               goto out_free_devname;
> +
> +       ret = kstrtoul(priostr, 10, &priority);
> +       if (ret < 0)
> +               goto out_free_devname;
> +
> +       ret = -ENODEV;
> +
> +       dev = dev_get_by_name(&init_net, devname);
> +       if (!dev)
> +               goto out_free_devname;
> +
> +       update_netdev_tables();
> +       ret = 0;
> +       if (dev->priomap)
> +               dev->priomap->priomap[prioidx] = priority;
> +

Whats protecting this? This needs a dereference and rcu_read_lock() I think.

> +       dev_put(dev);
> +
> +out_free_devname:
> +       kfree(devname);
> +       return ret;
> +}
> +
> +static struct cftype ss_files[] = {
> +       {
> +               .name = "prioidx",
> +               .read_u64 = read_prioidx,
> +       },
> +       {
> +               .name = "ifpriomap",
> +               .read_map = read_priomap,
> +               .write_string = write_priomap,
> +       },
> +};
> +
> +static int cgrp_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
> +{
> +       return cgroup_add_files(cgrp, ss, ss_files, ARRAY_SIZE(ss_files));
> +}
> +
> +static int netprio_device_event(struct notifier_block *unused,
> +                               unsigned long event, void *ptr)
> +{
> +       struct net_device *dev = ptr;
> +       struct netprio_map *old;
> +       u32 max_len = atomic_read(&max_prioidx);
> +
> +       old = rcu_dereference_protected(dev->priomap, 1);
> +       /*
> +        * Note this is called with rtnl_lock held so we have update side
> +        * protection on our rcu assignments
> +        */
> +
> +       switch (event) {
> +
> +       case NETDEV_REGISTER:
> +               if (max_len)
> +                       extend_netdev_table(dev, max_len);
> +               break;
> +       case NETDEV_UNREGISTER:
> +               rcu_assign_pointer(dev->priomap, NULL);
> +               if (old)
> +                       kfree_rcu(old, rcu);
> +               break;
> +       }
> +       return NOTIFY_DONE;
> +}
> +
> +static struct notifier_block netprio_device_notifier = {
> +       .notifier_call = netprio_device_event
> +};
> +
> +static int __init init_cgroup_netprio(void)
> +{
> +       int ret;
> +
> +       ret = cgroup_load_subsys(&net_prio_subsys);
> +       if (ret)
> +               goto out;
> +#ifndef CONFIG_NETPRIO_CGROUP
> +       smp_wmb();
> +       net_prio_subsys_id = net_prio_subsys.subsys_id;
> +#endif
> +
> +       register_netdevice_notifier(&netprio_device_notifier);
> +
> +out:
> +       return ret;
> +}
> +
> +static void __exit exit_cgroup_netprio(void)
> +{
> +       struct netprio_map *old;
> +       struct net_device *dev;
> +
> +       unregister_netdevice_notifier(&netprio_device_notifier);
> +
> +       cgroup_unload_subsys(&net_prio_subsys);
> +
> +#ifndef CONFIG_NETPRIO_CGROUP
> +       net_prio_subsys_id = -1;
> +       synchronize_rcu();
> +#endif
> +
> +       rtnl_lock();
> +       for_each_netdev(&init_net, dev) {
> +               old = dev->priomap;

rtnl_dereference() here also if adding __rcu

> +               rcu_assign_pointer(dev->priomap, NULL);
> +               if (old)
> +                       kfree_rcu(old, rcu);
> +       }
> +       rtnl_unlock();
> +}
> +
> +module_init(init_cgroup_netprio);
> +module_exit(exit_cgroup_netprio);
> +MODULE_LICENSE("GPL v2");
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 5a08762..77a4888 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -125,6 +125,7 @@
>  #include <net/xfrm.h>
>  #include <linux/ipsec.h>
>  #include <net/cls_cgroup.h>
> +#include <net/netprio_cgroup.h>
> 
>  #include <linux/filter.h>
> 
> @@ -221,10 +222,16 @@ __u32 sysctl_rmem_default __read_mostly = SK_RMEM_MAX;
>  int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
>  EXPORT_SYMBOL(sysctl_optmem_max);
> 
> -#if defined(CONFIG_CGROUPS) && !defined(CONFIG_NET_CLS_CGROUP)
> +#if defined(CONFIG_CGROUPS)
> +#if !defined(CONFIG_NET_CLS_CGROUP)
>  int net_cls_subsys_id = -1;
>  EXPORT_SYMBOL_GPL(net_cls_subsys_id);
>  #endif
> +#if !defined(CONFIG_NETPRIO_CGROUP)
> +int net_prio_subsys_id = -1;
> +EXPORT_SYMBOL_GPL(net_prio_subsys_id);
> +#endif
> +#endif
> 
>  static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)
>  {
> @@ -1111,6 +1118,18 @@ void sock_update_classid(struct sock *sk)
>                 sk->sk_classid = classid;
>  }
>  EXPORT_SYMBOL(sock_update_classid);
> +
> +void sock_update_netprioidx(struct sock *sk)
> +{
> +       struct cgroup_netprio_state *state;
> +       if (in_interrupt())
> +               return;
> +       rcu_read_lock();
> +       state = task_netprio_state(current);
> +       sk->sk_cgrp_prioidx = state ? state->prioidx : 0;
> +       rcu_read_unlock();
> +}
> +EXPORT_SYMBOL_GPL(sock_update_netprioidx);
>  #endif
> 
>  /**
> @@ -1138,6 +1157,7 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
>                 atomic_set(&sk->sk_wmem_alloc, 1);
> 
>                 sock_update_classid(sk);
> +               sock_update_netprioidx(sk);
>         }
> 
>         return sk;
> diff --git a/net/socket.c b/net/socket.c
> index 2877647..108716f 100644
> --- a/net/socket.c
> +++ b/net/socket.c
> @@ -549,6 +549,8 @@ static inline int __sock_sendmsg_nosec(struct kiocb *iocb, struct socket *sock,
> 
>         sock_update_classid(sock->sk);
> 
> +       sock_update_netprioidx(sock->sk);
> +
>         si->sock = sock;
>         si->scm = NULL;
>         si->msg = msg;
> --
> 1.7.6.4
> 

^ permalink raw reply

* RE: [PATCH net-next] r8169: Add 64bit statistics
From: David Laight @ 2011-11-17 11:13 UTC (permalink / raw)
  To: Junchang Wang, Eric Dumazet; +Cc: Stephen Hemminger, netdev, romieu, nic swsd
In-Reply-To: <CABoNC83U2k=-6-BkJEyWBXfSnEbb-+K8CD1ffqhjA613Tk04tQ@mail.gmail.com>

The 64bit stats update sequence used to get valid
counts on 32bit systems (that can't do locked 64bit
memory access) seems to be:

>     u64_stats_update_begin(&sky2->tx_stats.syncp);
>     ++sky2->tx_stats.packets;
>     sky2->tx_stats.bytes += skb->len;
>     u64_stats_update_end(&sky2->tx_stats.syncp);

I'm not sure what the begin/end markers do, but
they need to hold off the readers during updates
and the writers during reads - this is probably
expensive on the update path.

A thought that might work is for the writer to
write the middle bits of the 64 bit walue to
another location, eg:
       count = sky2->tx_stats.bytes + skb->len;
       sky2->tx_stats.bytes = count;
       sky2->tx_stats.bytes_check = count >> 16;
The reader then loops until the two value are
consistent.

I think this doesn't even require a memory barrier
in the ISR since the order of the reads an writes
doesn't matter at all.

	David

^ permalink raw reply

* [PATCH v2 7/7] gianfar: add support for wake-on-packet
From: Zhao Chenhui @ 2011-11-17 11:10 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: leoli, netdev, afleming

On certain chip like MPC8536 and P1022, system can be waked up from
sleep by user-defined packet and Magic Patcket.(The eTSEC cannot
supports both types of wake-up event simultaneously.) This patch
implements wake-up on user-defined patcket including ARP request
packet and unicast patcket to this station.

When entering suspend state, the gianfar driver sets receive queue
filer table to filter all of packets except ARP request packet and
unicast patcket to this station. The driver temporarily uses the last
receive queue to receive the user defined packet.

In suspend state, the receive part of eTSEC keeps working. When
receiving a user defined packet, it generates an interrupt to
wake up the system.

The rule of the filer table is as below.
  if (arp request to local ip address)
      accept it to the last queue
  elif (unicast packet to local mac address)
      accept it to the last queue
  else
      reject it
  endif
Note: The local ip/mac address is the ethernet primary IP/MAC address of
the station. Do not support multiple IP/MAC addresses.

Here is an example of enabling and testing wake up on user defined packet.
  ifconfig eth0 10.193.20.169
  ethtool -s eth0 wol a
  echo standby > /sys/power/state or echo mem > /sys/power/state

Ping from PC host to wake up the station:
  ping 10.193.20.169

Signed-off-by: Dave Liu <daveliu@freescale.com>
Signed-off-by: Jin Qing <b24347@freescale.com>
Signed-off-by: Li Yang <leoli@freescale.com>
Signed-off-by: Zhao Chenhui <chenhui.zhao@freescale.com>
Acked-by: Andy Fleming <afleming@freescale.com>
---
Changes for v2:
 - rework gfar_get_ip(), gfar_config_filer_table(), gfar_suspend() and gfar_resume()
 - add support unicast wakeup
 - add supported in struct gfar_private

 .../devicetree/bindings/net/fsl-tsec-phy.txt       |    3 +
 drivers/net/ethernet/freescale/gianfar.c           |  362 +++++++++++++++++---
 drivers/net/ethernet/freescale/gianfar.h           |   33 ++-
 drivers/net/ethernet/freescale/gianfar_ethtool.c   |   52 ++-
 4 files changed, 372 insertions(+), 78 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/fsl-tsec-phy.txt b/Documentation/devicetree/bindings/net/fsl-tsec-phy.txt
index 2c6be03..543e36c 100644
--- a/Documentation/devicetree/bindings/net/fsl-tsec-phy.txt
+++ b/Documentation/devicetree/bindings/net/fsl-tsec-phy.txt
@@ -56,6 +56,9 @@ Properties:
     hardware.
   - fsl,magic-packet : If present, indicates that the hardware supports
     waking up via magic packet.
+  - fsl,wake-on-filer : If present, indicates that the hardware supports
+    waking up via arp request to local ip address or unicast packet to
+    local mac address.
   - bd-stash : If present, indicates that the hardware supports stashing
     buffer descriptors in the L2.
   - rx-stash-len : Denotes the number of bytes of a received buffer to stash
diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c
index 83199fd..991bd47 100644
--- a/drivers/net/ethernet/freescale/gianfar.c
+++ b/drivers/net/ethernet/freescale/gianfar.c
@@ -85,6 +85,8 @@
 #include <linux/tcp.h>
 #include <linux/udp.h>
 #include <linux/in.h>
+#include <linux/inetdevice.h>
+#include <sysdev/fsl_soc.h>
 #include <linux/net_tstamp.h>
 
 #include <asm/io.h>
@@ -147,6 +149,17 @@ static void gfar_clear_exact_match(struct net_device *dev);
 static void gfar_set_mac_for_addr(struct net_device *dev, int num,
 				  const u8 *addr);
 static int gfar_ioctl(struct net_device *dev, struct ifreq *rq, int cmd);
+static void gfar_schedule_cleanup(struct gfar_priv_grp *gfargrp);
+
+#ifdef CONFIG_PM
+static void gfar_halt_rx(struct net_device *dev);
+static void gfar_rx_start(struct net_device *dev);
+static void gfar_enable_filer(struct net_device *dev);
+static void gfar_disable_filer(struct net_device *dev);
+static void gfar_config_filer_table(struct net_device *dev);
+static void gfar_restore_filer_table(struct net_device *dev);
+static int gfar_get_ip(struct net_device *dev);
+#endif
 
 MODULE_AUTHOR("Freescale Semiconductor, Inc");
 MODULE_DESCRIPTION("Gianfar Ethernet Driver");
@@ -751,7 +764,6 @@ static int gfar_of_init(struct platform_device *ofdev, struct net_device **pdev)
 			FSL_GIANFAR_DEV_HAS_PADDING |
 			FSL_GIANFAR_DEV_HAS_CSUM |
 			FSL_GIANFAR_DEV_HAS_VLAN |
-			FSL_GIANFAR_DEV_HAS_MAGIC_PACKET |
 			FSL_GIANFAR_DEV_HAS_EXTENDED_HASH |
 			FSL_GIANFAR_DEV_HAS_TIMER;
 
@@ -763,8 +775,19 @@ static int gfar_of_init(struct platform_device *ofdev, struct net_device **pdev)
 	else
 		priv->interface = PHY_INTERFACE_MODE_MII;
 
-	if (of_get_property(np, "fsl,magic-packet", NULL))
+	/* Init Wake-on-LAN */
+	priv->wol_opts = 0;
+	priv->wol_supported = 0;
+	if (of_get_property(np, "fsl,magic-packet", NULL)) {
 		priv->device_flags |= FSL_GIANFAR_DEV_HAS_MAGIC_PACKET;
+		priv->wol_supported |= GIANFAR_WOL_MAGIC;
+	}
+
+	if (of_get_property(np, "fsl,wake-on-filer", NULL)) {
+		priv->device_flags |= FSL_GIANFAR_DEV_HAS_WAKE_ON_FILER;
+		priv->wol_supported |= GIANFAR_WOL_ARP;
+		priv->wol_supported |= GIANFAR_WOL_UCAST;
+	}
 
 	priv->phy_node = of_parse_phandle(np, "phy-handle", 0);
 
@@ -1168,8 +1191,10 @@ static int gfar_probe(struct platform_device *ofdev)
 		goto register_fail;
 	}
 
-	device_init_wakeup(&dev->dev,
-		priv->device_flags & FSL_GIANFAR_DEV_HAS_MAGIC_PACKET);
+	if (priv->wol_supported) {
+		device_set_wakeup_capable(&ofdev->dev, true);
+		device_set_wakeup_enable(&ofdev->dev, false);
+	}
 
 	/* fill out IRQ number and name fields */
 	len_devname = strlen(dev->name);
@@ -1260,6 +1285,143 @@ static int gfar_remove(struct platform_device *ofdev)
 }
 
 #ifdef CONFIG_PM
+static void gfar_enable_filer(struct net_device *dev)
+{
+	struct gfar_private *priv = netdev_priv(dev);
+	struct gfar __iomem *regs = priv->gfargrp[0].regs;
+	u32 temp;
+
+	lock_rx_qs(priv);
+
+	temp = gfar_read(&regs->rctrl);
+	temp &= ~(RCTRL_FSQEN | RCTRL_PRSDEP_MASK);
+	temp |= RCTRL_FILREN | RCTRL_PRSDEP_L2L3;
+	gfar_write(&regs->rctrl, temp);
+
+	unlock_rx_qs(priv);
+}
+
+static void gfar_disable_filer(struct net_device *dev)
+{
+	struct gfar_private *priv = netdev_priv(dev);
+	struct gfar __iomem *regs = priv->gfargrp[0].regs;
+	u32 temp;
+
+	lock_rx_qs(priv);
+
+	temp = gfar_read(&regs->rctrl);
+	temp &= ~RCTRL_FILREN;
+	gfar_write(&regs->rctrl, temp);
+
+	unlock_rx_qs(priv);
+}
+
+static int gfar_get_ip(struct net_device *dev)
+{
+	struct gfar_private *priv = netdev_priv(dev);
+	struct in_device *in_dev;
+	int ret = -ENOENT;
+
+	in_dev = in_dev_get(dev);
+	if (!in_dev)
+		return ret;
+
+	/* Get the primary IP address */
+	for_primary_ifa(in_dev) {
+		priv->ip_addr = ifa->ifa_address;
+		ret = 0;
+		break;
+	} endfor_ifa(in_dev);
+
+	in_dev_put(in_dev);
+	return ret;
+}
+
+static void gfar_restore_filer_table(struct net_device *dev)
+{
+	struct gfar_private *priv = netdev_priv(dev);
+	u32 rqfcr, rqfpr;
+	int i;
+
+	lock_rx_qs(priv);
+
+	for (i = 0; i <= MAX_FILER_IDX; i++) {
+		rqfcr = priv->ftp_rqfcr[i];
+		rqfpr = priv->ftp_rqfpr[i];
+		gfar_write_filer(priv, i, rqfcr, rqfpr);
+	}
+
+	unlock_rx_qs(priv);
+}
+
+static void gfar_config_filer_table(struct net_device *dev)
+{
+	struct gfar_private *priv = netdev_priv(dev);
+	u32 dest_mac_addr;
+	u32 rqfcr, rqfpr;
+	u8  rqfcr_queue = priv->num_rx_queues - 1;
+	unsigned int index;
+
+	if (gfar_get_ip(dev)) {
+		netif_err(priv, wol, dev, "WOL: get the ip address error\n");
+		return;
+	}
+
+	lock_rx_qs(priv);
+
+	/* init filer table */
+	rqfcr = RQFCR_RJE | RQFCR_CMP_MATCH;
+	rqfpr = 0x0;
+	for (index = 0; index <= MAX_FILER_IDX; index++)
+		gfar_write_filer(priv, index, rqfcr, rqfpr);
+
+	index = 0;
+	if (priv->wol_opts & GIANFAR_WOL_ARP) {
+		/* ARP request filer, filling the packet to the last queue */
+		rqfcr = (rqfcr_queue << 10) | RQFCR_AND |
+					RQFCR_CMP_EXACT | RQFCR_PID_MASK;
+		rqfpr = RQFPR_ARQ;
+		gfar_write_filer(priv, index++, rqfcr, rqfpr);
+
+		rqfcr = (rqfcr_queue << 10) | RQFCR_AND |
+					RQFCR_CMP_EXACT | RQFCR_PID_PARSE;
+		rqfpr = RQFPR_ARQ;
+		gfar_write_filer(priv, index++, rqfcr, rqfpr);
+
+		/*
+		 * DEST_IP address in ARP packet,
+		 * filling it to the last queue.
+		 */
+		rqfcr = (rqfcr_queue << 10) | RQFCR_AND |
+					RQFCR_CMP_EXACT | RQFCR_PID_MASK;
+		rqfpr = FPR_FILER_MASK;
+		gfar_write_filer(priv, index++, rqfcr, rqfpr);
+
+		rqfcr = (rqfcr_queue << 10) | RQFCR_GPI |
+					RQFCR_CMP_EXACT | RQFCR_PID_DIA;
+		rqfpr = priv->ip_addr;
+		gfar_write_filer(priv, index++, rqfcr, rqfpr);
+	}
+
+	if (priv->wol_opts & GIANFAR_WOL_UCAST) {
+		/* Unicast packet, filling it to the last queue */
+		dest_mac_addr = (dev->dev_addr[0] << 16) |
+				(dev->dev_addr[1] << 8) | dev->dev_addr[2];
+		rqfcr = (rqfcr_queue << 10) | RQFCR_AND |
+					RQFCR_CMP_EXACT | RQFCR_PID_DAH;
+		rqfpr = dest_mac_addr;
+		gfar_write_filer(priv, index++, rqfcr, rqfpr);
+
+		dest_mac_addr = (dev->dev_addr[3] << 16) |
+				(dev->dev_addr[4] << 8) | dev->dev_addr[5];
+		rqfcr = (rqfcr_queue << 10) | RQFCR_GPI |
+					RQFCR_CMP_EXACT | RQFCR_PID_DAL;
+		rqfpr = dest_mac_addr;
+		gfar_write_filer(priv, index++, rqfcr, rqfpr);
+	}
+
+	unlock_rx_qs(priv);
+}
 
 static int gfar_suspend(struct device *dev)
 {
@@ -1269,48 +1431,43 @@ static int gfar_suspend(struct device *dev)
 	unsigned long flags;
 	u32 tempval;
 
-	int magic_packet = priv->wol_en &&
-		(priv->device_flags & FSL_GIANFAR_DEV_HAS_MAGIC_PACKET);
-
 	netif_device_detach(ndev);
 
-	if (netif_running(ndev)) {
-
-		local_irq_save(flags);
-		lock_tx_qs(priv);
-		lock_rx_qs(priv);
-
-		gfar_halt_nodisable(ndev);
-
-		/* Disable Tx, and Rx if wake-on-LAN is disabled. */
-		tempval = gfar_read(&regs->maccfg1);
-
-		tempval &= ~MACCFG1_TX_EN;
+	if (!netif_running(ndev))
+		return 0;
 
-		if (!magic_packet)
-			tempval &= ~MACCFG1_RX_EN;
+	local_irq_save(flags);
+	lock_tx_qs(priv);
+	lock_rx_qs(priv);
 
-		gfar_write(&regs->maccfg1, tempval);
+	gfar_halt(ndev);
 
-		unlock_rx_qs(priv);
-		unlock_tx_qs(priv);
-		local_irq_restore(flags);
+	unlock_rx_qs(priv);
+	unlock_tx_qs(priv);
+	local_irq_restore(flags);
 
-		disable_napi(priv);
+	disable_napi(priv);
 
-		if (magic_packet) {
-			/* Enable interrupt on Magic Packet */
-			gfar_write(&regs->imask, IMASK_MAG);
+	if (!priv->wol_opts && priv->phydev) {
+		phy_stop(priv->phydev);
+		return 0;
+	}
 
-			/* Enable Magic Packet mode */
-			tempval = gfar_read(&regs->maccfg2);
-			tempval |= MACCFG2_MPEN;
-			gfar_write(&regs->maccfg2, tempval);
-		} else {
-			phy_stop(priv->phydev);
-		}
+	mpc85xx_pmc_set_wake(priv->ofdev, 1);
+	if (priv->wol_opts & GIANFAR_WOL_MAGIC) {
+		/* Enable Magic Packet mode */
+		tempval = gfar_read(&regs->maccfg2);
+		tempval |= MACCFG2_MPEN;
+		gfar_write(&regs->maccfg2, tempval);
 	}
 
+	if (priv->wol_opts & (GIANFAR_WOL_ARP | GIANFAR_WOL_UCAST)) {
+		mpc85xx_pmc_set_lossless_ethernet(1);
+		gfar_disable_filer(ndev);
+		gfar_config_filer_table(ndev);
+		gfar_enable_filer(ndev);
+	}
+	gfar_rx_start(ndev);
 	return 0;
 }
 
@@ -1320,39 +1477,49 @@ static int gfar_resume(struct device *dev)
 	struct net_device *ndev = priv->ndev;
 	struct gfar __iomem *regs = priv->gfargrp[0].regs;
 	unsigned long flags;
-	u32 tempval;
-	int magic_packet = priv->wol_en &&
-		(priv->device_flags & FSL_GIANFAR_DEV_HAS_MAGIC_PACKET);
+	u32 tempval, i;
 
 	if (!netif_running(ndev)) {
 		netif_device_attach(ndev);
 		return 0;
 	}
 
-	if (!magic_packet && priv->phydev)
+	if (!priv->wol_opts && priv->phydev) {
 		phy_start(priv->phydev);
+		goto out;
+	}
+
+	mpc85xx_pmc_set_wake(priv->ofdev, 0);
 
-	/* Disable Magic Packet mode, in case something
-	 * else woke us up.
-	 */
 	local_irq_save(flags);
-	lock_tx_qs(priv);
 	lock_rx_qs(priv);
-
-	tempval = gfar_read(&regs->maccfg2);
-	tempval &= ~MACCFG2_MPEN;
-	gfar_write(&regs->maccfg2, tempval);
-
-	gfar_start(ndev);
-
+	gfar_halt_rx(ndev);
 	unlock_rx_qs(priv);
-	unlock_tx_qs(priv);
 	local_irq_restore(flags);
 
-	netif_device_attach(ndev);
+	if (priv->wol_opts & (GIANFAR_WOL_ARP | GIANFAR_WOL_UCAST)) {
+		mpc85xx_pmc_set_lossless_ethernet(0);
+		gfar_disable_filer(ndev);
+		gfar_restore_filer_table(ndev);
+	}
+
+	if (priv->wol_opts & GIANFAR_WOL_MAGIC) {
+		/* Disable Magic Packet mode */
+		tempval = gfar_read(&regs->maccfg2);
+		tempval &= ~MACCFG2_MPEN;
+		gfar_write(&regs->maccfg2, tempval);
+	}
 
+out:
+	gfar_start(ndev);
+	netif_device_attach(ndev);
 	enable_napi(priv);
 
+	if (priv->wol_opts & (GIANFAR_WOL_ARP | GIANFAR_WOL_UCAST)) {
+		/* send requests to process the received packets */
+		for (i = 0; i < priv->num_grps; i++)
+			gfar_schedule_cleanup(&priv->gfargrp[i]);
+	}
 	return 0;
 }
 
@@ -1602,6 +1769,48 @@ static int __gfar_is_rx_idle(struct gfar_private *priv)
 	return 0;
 }
 
+#ifdef CONFIG_PM
+/* Halt the receive queues */
+static void gfar_halt_rx(struct net_device *dev)
+{
+	struct gfar_private *priv = netdev_priv(dev);
+	struct gfar __iomem *regs = priv->gfargrp[0].regs;
+	u32 tempval;
+	int i = 0;
+
+	for (i = 0; i < priv->num_grps; i++) {
+		regs = priv->gfargrp[i].regs;
+		/* Mask all interrupts */
+		gfar_write(&regs->imask, IMASK_INIT_CLEAR);
+
+		/* Clear all interrupts */
+		gfar_write(&regs->ievent, IEVENT_INIT_CLEAR);
+	}
+
+	regs = priv->gfargrp[0].regs;
+	/* Stop the DMA, and wait for it to stop */
+	tempval = gfar_read(&regs->dmactrl);
+	if ((tempval & DMACTRL_GRS) != DMACTRL_GRS) {
+		int ret;
+
+		tempval |= DMACTRL_GRS;
+		gfar_write(&regs->dmactrl, tempval);
+
+		do {
+			ret = spin_event_timeout(((gfar_read(&regs->ievent) &
+				IEVENT_GRSC) == IEVENT_GRSC), 1000000, 0);
+			if (!ret && !(gfar_read(&regs->ievent) & IEVENT_GRSC))
+				ret = __gfar_is_rx_idle(priv);
+		} while (!ret);
+	}
+
+	/* Disable Rx in MACCFG1  */
+	tempval = gfar_read(&regs->maccfg1);
+	tempval &= ~MACCFG1_RX_EN;
+	gfar_write(&regs->maccfg1, tempval);
+}
+#endif
+
 /* Halt the receive and transmit queues */
 static void gfar_halt_nodisable(struct net_device *dev)
 {
@@ -1808,6 +2017,40 @@ void gfar_start(struct net_device *dev)
 	dev->trans_start = jiffies; /* prevent tx timeout */
 }
 
+#ifdef CONFIG_PM
+void gfar_rx_start(struct net_device *dev)
+{
+	struct gfar_private *priv = netdev_priv(dev);
+	struct gfar __iomem *regs = priv->gfargrp[0].regs;
+	u32 tempval;
+	int i = 0;
+
+	/* Enable Rx in MACCFG1 */
+	tempval = gfar_read(&regs->maccfg1);
+	tempval |= MACCFG1_RX_EN;
+	gfar_write(&regs->maccfg1, tempval);
+
+	/* Initialize DMACTRL to have WWR and WOP */
+	tempval = gfar_read(&regs->dmactrl);
+	tempval |= DMACTRL_INIT_SETTINGS;
+	gfar_write(&regs->dmactrl, tempval);
+
+	/* Make sure we aren't stopped */
+	tempval = gfar_read(&regs->dmactrl);
+	tempval &= ~DMACTRL_GRS;
+	gfar_write(&regs->dmactrl, tempval);
+
+	for (i = 0; i < priv->num_grps; i++) {
+		regs = priv->gfargrp[i].regs;
+		/* Clear RHLT, so that the DMA starts polling now */
+		gfar_write(&regs->rstat, priv->gfargrp[i].rstat);
+
+		/* Unmask the interrupts we look for */
+		gfar_write(&regs->imask, IMASK_DEFAULT);
+	}
+}
+#endif
+
 void gfar_configure_coalescing(struct gfar_private *priv,
 	unsigned long tx_mask, unsigned long rx_mask)
 {
@@ -1970,7 +2213,7 @@ static int gfar_enet_open(struct net_device *dev)
 
 	netif_tx_start_all_queues(dev);
 
-	device_set_wakeup_enable(&dev->dev, priv->wol_en);
+	device_set_wakeup_enable(&priv->ofdev->dev, priv->wol_opts);
 
 	return err;
 }
@@ -2657,6 +2900,17 @@ static inline void count_errors(unsigned short status, struct net_device *dev)
 
 irqreturn_t gfar_receive(int irq, void *grp_id)
 {
+	struct gfar_priv_grp *gfargrp = grp_id;
+	struct gfar __iomem *regs = gfargrp->regs;
+	u32 ievent;
+
+	ievent = gfar_read(&regs->ievent);
+
+	if ((ievent & IEVENT_FGPI) == IEVENT_FGPI) {
+		gfar_write(&regs->ievent, ievent & IEVENT_RX_MASK);
+		return IRQ_HANDLED;
+	}
+
 	gfar_schedule_cleanup((struct gfar_priv_grp *)grp_id);
 	return IRQ_HANDLED;
 }
diff --git a/drivers/net/ethernet/freescale/gianfar.h b/drivers/net/ethernet/freescale/gianfar.h
index 9aa4377..1abebc4 100644
--- a/drivers/net/ethernet/freescale/gianfar.h
+++ b/drivers/net/ethernet/freescale/gianfar.h
@@ -232,6 +232,13 @@ extern const char gfar_driver_version[];
 #define RQUEUE_EN7		0x00000001
 #define RQUEUE_EN_ALL		0x000000FF
 
+/* Wake-On-Lan options */
+#define GIANFAR_WOL_UCAST	0x00000001	/* Unicast wakeup */
+#define GIANFAR_WOL_MCAST	0x00000002	/* Multicast wakeup */
+#define GIANFAR_WOL_BCAST	0x00000004	/* Broadcase wakeup */
+#define GIANFAR_WOL_ARP		0x00000008	/* ARP request wakeup */
+#define GIANFAR_WOL_MAGIC	0x00000010	/* Magic packet wakeup */
+
 /* Init to do tx snooping for buffers and descriptors */
 #define DMACTRL_INIT_SETTINGS   0x000000c3
 #define DMACTRL_GRS             0x00000010
@@ -277,11 +284,15 @@ extern const char gfar_driver_version[];
 #define RCTRL_PAL_MASK		0x001f0000
 #define RCTRL_VLEX		0x00002000
 #define RCTRL_FILREN		0x00001000
+#define RCTRL_FSQEN		0x00000800
 #define RCTRL_GHTX		0x00000400
 #define RCTRL_IPCSEN		0x00000200
 #define RCTRL_TUCSEN		0x00000100
 #define RCTRL_PRSDEP_MASK	0x000000c0
 #define RCTRL_PRSDEP_INIT	0x000000c0
+#define RCTRL_PRSDEP_L2		0x00000040
+#define RCTRL_PRSDEP_L2L3	0x00000080
+#define RCTRL_PRSDEP_L2L3L4	0x000000c0
 #define RCTRL_PRSFM		0x00000020
 #define RCTRL_PROM		0x00000008
 #define RCTRL_EMEN		0x00000002
@@ -327,18 +338,20 @@ extern const char gfar_driver_version[];
 #define IEVENT_MAG		0x00000800
 #define IEVENT_GRSC		0x00000100
 #define IEVENT_RXF0		0x00000080
+#define IEVENT_FGPI		0x00000010
 #define IEVENT_FIR		0x00000008
 #define IEVENT_FIQ		0x00000004
 #define IEVENT_DPE		0x00000002
 #define IEVENT_PERR		0x00000001
-#define IEVENT_RX_MASK          (IEVENT_RXB0 | IEVENT_RXF0 | IEVENT_BSY)
+#define IEVENT_RX_MASK          (IEVENT_RXB0 | IEVENT_RXF0 | \
+					IEVENT_FGPI | IEVENT_BSY)
 #define IEVENT_TX_MASK          (IEVENT_TXB | IEVENT_TXF)
 #define IEVENT_RTX_MASK         (IEVENT_RX_MASK | IEVENT_TX_MASK)
 #define IEVENT_ERR_MASK         \
-(IEVENT_RXC | IEVENT_BSY | IEVENT_EBERR | IEVENT_MSRO | \
- IEVENT_BABT | IEVENT_TXC | IEVENT_TXE | IEVENT_LC \
- | IEVENT_CRL | IEVENT_XFUN | IEVENT_DPE | IEVENT_PERR \
- | IEVENT_MAG | IEVENT_BABR)
+	(IEVENT_RXC | IEVENT_BSY | IEVENT_EBERR | IEVENT_MSRO | \
+	 IEVENT_BABT | IEVENT_TXC | IEVENT_TXE | IEVENT_LC | \
+	 IEVENT_CRL | IEVENT_XFUN | IEVENT_FIR | IEVENT_FIQ | \
+	 IEVENT_DPE | IEVENT_PERR | IEVENT_MAG | IEVENT_BABR)
 
 #define IMASK_INIT_CLEAR	0x00000000
 #define IMASK_BABR              0x80000000
@@ -359,14 +372,15 @@ extern const char gfar_driver_version[];
 #define IMASK_MAG		0x00000800
 #define IMASK_GRSC              0x00000100
 #define IMASK_RXFEN0		0x00000080
+#define IMASK_FGPI		0x00000010
 #define IMASK_FIR		0x00000008
 #define IMASK_FIQ		0x00000004
 #define IMASK_DPE		0x00000002
 #define IMASK_PERR		0x00000001
 #define IMASK_DEFAULT  (IMASK_TXEEN | IMASK_TXFEN | IMASK_TXBEN | \
 		IMASK_RXFEN0 | IMASK_BSY | IMASK_EBERR | IMASK_BABR | \
-		IMASK_XFUN | IMASK_RXC | IMASK_BABT | IMASK_DPE \
-		| IMASK_PERR)
+		IMASK_XFUN | IMASK_RXC | IMASK_BABT | IMASK_FGPI | \
+		IMASK_FIR | IMASK_FIQ | IMASK_DPE | IMASK_PERR | IMASK_MAG)
 #define IMASK_RTX_DISABLED ((~(IMASK_RXFEN0 | IMASK_TXFEN | IMASK_BSY)) \
 			   & IMASK_DEFAULT)
 
@@ -883,6 +897,7 @@ struct gfar {
 #define FSL_GIANFAR_DEV_HAS_BD_STASHING		0x00000200
 #define FSL_GIANFAR_DEV_HAS_BUF_STASHING	0x00000400
 #define FSL_GIANFAR_DEV_HAS_TIMER		0x00000800
+#define FSL_GIANFAR_DEV_HAS_WAKE_ON_FILER	0x00001000
 
 #if (MAXGROUPS == 2)
 #define DEFAULT_MAPPING 	0xAA
@@ -1115,6 +1130,10 @@ struct gfar_private {
 
 	struct work_struct reset_task;
 
+	u32 ip_addr;
+	u32 wol_opts;
+	u32 wol_supported;
+
 	/* Network Statistics */
 	struct gfar_extra_stats extra_stats;
 
diff --git a/drivers/net/ethernet/freescale/gianfar_ethtool.c b/drivers/net/ethernet/freescale/gianfar_ethtool.c
index 212736b..296e762 100644
--- a/drivers/net/ethernet/freescale/gianfar_ethtool.c
+++ b/drivers/net/ethernet/freescale/gianfar_ethtool.c
@@ -29,6 +29,7 @@
 #include <linux/skbuff.h>
 #include <linux/spinlock.h>
 #include <linux/mm.h>
+#include <linux/platform_device.h>
 
 #include <asm/io.h>
 #include <asm/irq.h>
@@ -577,32 +578,49 @@ static void gfar_get_wol(struct net_device *dev, struct ethtool_wolinfo *wol)
 {
 	struct gfar_private *priv = netdev_priv(dev);
 
-	if (priv->device_flags & FSL_GIANFAR_DEV_HAS_MAGIC_PACKET) {
-		wol->supported = WAKE_MAGIC;
-		wol->wolopts = priv->wol_en ? WAKE_MAGIC : 0;
-	} else {
-		wol->supported = wol->wolopts = 0;
-	}
+	wol->supported = 0;
+	wol->wolopts = 0;
+
+	if (!priv->wol_supported || !device_can_wakeup(&priv->ofdev->dev))
+		return;
+
+	if (priv->wol_supported & GIANFAR_WOL_MAGIC)
+		wol->supported |= WAKE_MAGIC;
+
+	if (priv->wol_supported & GIANFAR_WOL_ARP)
+		wol->supported |= WAKE_ARP;
+
+	if (priv->wol_supported & GIANFAR_WOL_UCAST)
+		wol->supported |= WAKE_UCAST;
+
+	if (priv->wol_opts & GIANFAR_WOL_MAGIC)
+		wol->wolopts |= WAKE_MAGIC;
+
+	if (priv->wol_opts & GIANFAR_WOL_ARP)
+		wol->wolopts |= WAKE_ARP;
+
+	if (priv->wol_opts & GIANFAR_WOL_UCAST)
+		wol->wolopts |= WAKE_UCAST;
 }
 
 static int gfar_set_wol(struct net_device *dev, struct ethtool_wolinfo *wol)
 {
 	struct gfar_private *priv = netdev_priv(dev);
-	unsigned long flags;
 
-	if (!(priv->device_flags & FSL_GIANFAR_DEV_HAS_MAGIC_PACKET) &&
-	    wol->wolopts != 0)
-		return -EINVAL;
-
-	if (wol->wolopts & ~WAKE_MAGIC)
-		return -EINVAL;
+	if (!priv->wol_supported || !device_can_wakeup(&priv->ofdev->dev) ||
+		(wol->wolopts & ~(WAKE_MAGIC | WAKE_ARP | WAKE_UCAST)))
+		return -EOPNOTSUPP;
 
-	device_set_wakeup_enable(&dev->dev, wol->wolopts & WAKE_MAGIC);
+	priv->wol_opts = 0;
 
-	spin_lock_irqsave(&priv->bflock, flags);
-	priv->wol_en =  !!device_may_wakeup(&dev->dev);
-	spin_unlock_irqrestore(&priv->bflock, flags);
+	if (wol->wolopts & WAKE_MAGIC)
+		priv->wol_opts |= GIANFAR_WOL_MAGIC;
+	if (wol->wolopts & WAKE_ARP)
+		priv->wol_opts |= GIANFAR_WOL_ARP;
+	if (wol->wolopts & WAKE_UCAST)
+		priv->wol_opts |= GIANFAR_WOL_UCAST;
 
+	device_set_wakeup_enable(&priv->ofdev->dev, (u32)priv->wol_opts);
 	return 0;
 }
 #endif
-- 
1.6.4.1

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox