Netdev List
 help / color / mirror / Atom feed
* [PATCH 2/3] netfilter: Add dependency on IPV6 for NF_TABLES_INET
From: Pablo Neira Ayuso @ 2014-01-16  9:04 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1389863093-4530-1-git-send-email-pablo@netfilter.org>

From: Paul Gortmaker <paul.gortmaker@windriver.com>

Commit 1d49144c0aa ("netfilter: nf_tables: add "inet" table for
IPv4/IPv6") allows creation of non-IPV6 enabled .config files that
will fail to configure/link as follows:

warning: (NF_TABLES_INET) selects NF_TABLES_IPV6 which has unmet direct dependencies (NET && INET && IPV6 && NETFILTER && NF_TABLES)
warning: (NF_TABLES_INET) selects NF_TABLES_IPV6 which has unmet direct dependencies (NET && INET && IPV6 && NETFILTER && NF_TABLES)
warning: (NF_TABLES_INET) selects NF_TABLES_IPV6 which has unmet direct dependencies (NET && INET && IPV6 && NETFILTER && NF_TABLES)
net/built-in.o: In function `nft_reject_eval':
nft_reject.c:(.text+0x651e8): undefined reference to `nf_ip6_checksum'
nft_reject.c:(.text+0x65270): undefined reference to `ip6_route_output'
nft_reject.c:(.text+0x656c4): undefined reference to `ip6_dst_hoplimit'
make: *** [vmlinux] Error 1

Since the feature is to allow for a mixed IPV4 and IPV6 table, it
seems sensible to make it depend on IPV6.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/Kconfig |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index 37d2092..6941a4f 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -429,7 +429,7 @@ config NF_TABLES
 	  To compile it as a module, choose M here.
 
 config NF_TABLES_INET
-	depends on NF_TABLES
+	depends on NF_TABLES && IPV6
 	select NF_TABLES_IPV4
 	select NF_TABLES_IPV6
 	tristate "Netfilter nf_tables mixed IPv4/IPv6 tables support"
-- 
1.7.10.4


^ permalink raw reply related

* [PATCH 0/3] Netfilter updates for net-next
From: Pablo Neira Ayuso @ 2014-01-16  9:04 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

Hi David,

This small batch contains several Netfilter fixes for your net-next
tree, more specifically:

* Fix compilation warning in nft_ct in NF_CONNTRACK_MARK is not set,
  from Kristian Evensen.

* Add dependency to IPV6 for NF_TABLES_INET. This one has been reported
  by the several robots that are testing .config combinations, from Paul
  Gortmaker.

* Fix default base chain policy setting in nf_tables, from myself.

You can pull these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git

Thanks!

----------------------------------------------------------------

The following changes since commit 11b57f90257c1d6a91cee720151b69e0c2020cf6:

  xen-netback: stop vif thread spinning if frontend is unresponsive (2014-01-09 23:05:46 -0500)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git 

for you to fetch changes up to 847c8e2959f7f8f1462e33c0a720c6267b984ed8:

  netfilter: nft_ct: fix compilation warning if NF_CONNTRACK_MARK is not set (2014-01-15 11:00:14 +0100)

----------------------------------------------------------------
Kristian Evensen (1):
      netfilter: nft_ct: fix compilation warning if NF_CONNTRACK_MARK is not set

Pablo Neira Ayuso (1):
      netfilter: nf_tables: fix missing byteorder conversion in policy

Paul Gortmaker (1):
      netfilter: Add dependency on IPV6 for NF_TABLES_INET

 net/netfilter/Kconfig         |    2 +-
 net/netfilter/nf_tables_api.c |    2 +-
 net/netfilter/nft_ct.c        |    2 ++
 3 files changed, 4 insertions(+), 2 deletions(-)

^ permalink raw reply

* Re: Fwd: [RFC PATCH net-next 0/3] virtio_net: add aRFS support
From: Stefan Hajnoczi @ 2014-01-16  8:52 UTC (permalink / raw)
  To: Zhi Yong Wu
  Cc: Linux Netdev List, Tom Herbert, Eric Dumazet, David S. Miller,
	Zhi Yong Wu, Michael S. Tsirkin, Rusty Russell, Jason Wang
In-Reply-To: <CAEH94Lg7ygXgHqE8ubE+nuunicDqgmnCfO2AuDSS2nXgv5GikQ@mail.gmail.com>

On Thu, Jan 16, 2014 at 04:34:10PM +0800, Zhi Yong Wu wrote:
> CC: stefanha, MST, Rusty Russel
> 
> ---------- Forwarded message ----------
> From: Jason Wang <jasowang@redhat.com>
> Date: Thu, Jan 16, 2014 at 12:23 PM
> Subject: Re: [RFC PATCH net-next 0/3] virtio_net: add aRFS support
> To: Zhi Yong Wu <zwu.kernel@gmail.com>
> Cc: netdev@vger.kernel.org, therbert@google.com, edumazet@google.com,
> davem@davemloft.net, Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
> 
> 
> On 01/15/2014 10:20 PM, Zhi Yong Wu wrote:
> >
> > From: Zhi Yong Wu<wuzhy@linux.vnet.ibm.com>
> >
> > HI, folks
> >
> > The patchset is trying to integrate aRFS support to virtio_net. In this case,
> > aRFS will be used to select the RX queue. To make sure that it's going ahead
> > in the correct direction, although it is still one RFC and isn't tested, it's
> > post out ASAP. Any comment are appreciated, thanks.
> >
> > If anyone is interested in playing with it, you can get this patchset from my
> > dev git on github:
> >    git://github.com/wuzhy/kernel.git virtnet_rfs
> >
> > Zhi Yong Wu (3):
> >    virtio_pci: Introduce one new config api vp_get_vq_irq()
> >    virtio_net: Introduce one dummy function virtnet_filter_rfs()
> >    virtio-net: Add accelerated RFS support
> >
> >   drivers/net/virtio_net.c      |   67 ++++++++++++++++++++++++++++++++++++++++-
> >   drivers/virtio/virtio_pci.c   |   11 +++++++
> >   include/linux/virtio_config.h |   12 +++++++
> >   3 files changed, 89 insertions(+), 1 deletions(-)
> >
> 
> Please run get_maintainter.pl before sending the patch. You'd better
> at least cc virtio maintainer/list for this.
> 
> The core aRFS method is a noop in this RFC which make this series no
> much sense to discuss. You should at least mention the big picture
> here in the cover letter. I suggest you should post a RFC which can
> run and has expected result or you can just raise a thread for the
> design discussion.
> 
> And this method has been discussed before, you can search "[net-next
> RFC PATCH 5/5] virtio-net: flow director support" in netdev archive
> for a very old prototype implemented by me. It can work and looks like
> most of this RFC have already done there.
> 
> A basic question is whether or not we need this, not all the mq cards
> use aRFS (see ixgbe ATR). And whether or not it can bring extra
> overheads? For virtio, we want to reduce the vmexits as much as
> possible but this aRFS seems introduce a lot of more of this. Making a
> complex interfaces just for an virtual device may not be good, simple
> method may works for most of the cases.
> 
> We really should consider to offload this to real nic. VMDq and L2
> forwarding offload may help in this case.

Zhi Yong and I had an IRC chat.  I wanted to post my questions on the
list - it's still the same concern I had in the old email thread that
Jason mentioned.

In order for virtio-net aRFS to make sense there needs to be an overall
plan for pushing flow mapping information down to the physical NIC.
That's the only way to actually achieve the benefit of steering:
processing the packet on the CPU where the application is running.

If it's not possible or too hard to implement aRFS down the entire
stack, we won't be able to process the packet on the right CPU.
Then we might as well not bother with aRFS and just distribute uniformly
across the rx virtqueues.

Please post an outline of how rx packets will be steered up the stack so
we can discuss whether aRFS can bring any benefit.

Stefan

^ permalink raw reply

* Re: [RFC PATCH net-next 0/3] virtio_net: add aRFS support
From: Zhi Yong Wu @ 2014-01-16  8:48 UTC (permalink / raw)
  To: Jason Wang
  Cc: Linux Netdev List, Tom Herbert, Eric Dumazet, David S. Miller,
	Zhi Yong Wu, Stefan Hajnoczi, Michael S. Tsirkin, Rusty Russell
In-Reply-To: <52D75EA5.1050000@redhat.com>

On Thu, Jan 16, 2014 at 12:23 PM, Jason Wang <jasowang@redhat.com> wrote:
> On 01/15/2014 10:20 PM, Zhi Yong Wu wrote:
>>
>> From: Zhi Yong Wu<wuzhy@linux.vnet.ibm.com>
>>
>> HI, folks
>>
>> The patchset is trying to integrate aRFS support to virtio_net. In this
>> case,
>> aRFS will be used to select the RX queue. To make sure that it's going
>> ahead
>> in the correct direction, although it is still one RFC and isn't tested,
>> it's
>> post out ASAP. Any comment are appreciated, thanks.
>>
>> If anyone is interested in playing with it, you can get this patchset from
>> my
>> dev git on github:
>>    git://github.com/wuzhy/kernel.git virtnet_rfs
>>
>> Zhi Yong Wu (3):
>>    virtio_pci: Introduce one new config api vp_get_vq_irq()
>>    virtio_net: Introduce one dummy function virtnet_filter_rfs()
>>    virtio-net: Add accelerated RFS support
>>
>>   drivers/net/virtio_net.c      |   67
>> ++++++++++++++++++++++++++++++++++++++++-
>>   drivers/virtio/virtio_pci.c   |   11 +++++++
>>   include/linux/virtio_config.h |   12 +++++++
>>   3 files changed, 89 insertions(+), 1 deletions(-)
>>
>
> Please run get_maintainter.pl before sending the patch. You'd better at
> least cc virtio maintainer/list for this.
Is this one must for virtio stuff?

>
> The core aRFS method is a noop in this RFC which make this series no much
> sense to discuss. You should at least mention the big picture here in the
> cover letter. I suggest you should post a RFC which can run and has expected
> result or you can just raise a thread for the design discussion.
Yes, it currently miss some important stuff as i said in another mail
of this series.

>
> And this method has been discussed before, you can search "[net-next RFC
> PATCH 5/5] virtio-net: flow director support" in netdev archive for a very
> old prototype implemented by me. It can work and looks like most of this RFC
> have already done there.
ah? Can you let me know the result of your discussion? Will checked
it, thanks for your pointer.
>
> A basic question is whether or not we need this, not all the mq cards use
> aRFS (see ixgbe ATR). And whether or not it can bring extra overheads? For
> virtio, we want to reduce the vmexits as much as possible but this aRFS
Good question, i also have concern about this, and don't know if Tom
has good explanation.
> seems introduce a lot of more of this. Making a complex interfaces just for
> an virtual device may not be good, simple method may works for most of the
> cases.
>
> We really should consider to offload this to real nic. VMDq and L2
> forwarding offload may help in this case.

By the way, Stefan, can you let us know your concerns here? as we
talked in irc channel. :)


-- 
Regards,

Zhi Yong Wu

^ permalink raw reply

* Re: [PATCH net-next 0/6] bonding: only rely on arp packets if arp monitor is used
From: Veaceslav Falico @ 2014-01-16  8:41 UTC (permalink / raw)
  To: Jay Vosburgh; +Cc: netdev, Andy Gospodarek, David S. Miller
In-Reply-To: <15764.1389848997@death.nxdomain>

On Wed, Jan 15, 2014 at 09:09:57PM -0800, Jay Vosburgh wrote:
>Veaceslav Falico <vfalico@redhat.com> wrote:
>
>>Currently, if arp_validate is off (0), slave_last_rx() returns the
>>slave->dev->last_rx, which is always updated on *any* packet received by
>>slave, and not only arps. This means that, if the validation of arps is
>>off, we're treating *any* incoming packet as a proof of slave being up, and
>>not only arps.
>
>	The "any incoming packet" part is intentional.
>
>>This might seem logical at the first glance, however it can cause a lot of
>>troubles and false-positives, one example would be:
>>
>>The arp_ip_target is NOT accessible, however someone in the broadcast domain
>>spams with any broadcast traffic. This way bonding will be tricked that the
>>slave is still up (as in - can access arp_ip_target), while it's not.
>
>	This type of situation is why arp_validate was added.
>
>	The specific situation was when multiple hosts using bonding
>with the ARP monitor were set up behind a common gateway (in the same
>Ethernet broadcast domain).  The arp_ip_target is unreachable for
>whatever reason.  In that case, the various bonding instances on the
>different hosts will each issue broadcast ARP requests, and (in the
>absence of arp_validate) those requests would trick the other bonds into
>believing that they are up.
>
>	I don't think this patch set will resolve that problem, since
>you explicitly permit any incoming ARP to count.

I've said it was tricky at first glance :).

For this situation the arp_validate is *indeed* the cure. I'm not disabling
(or even working with) how arp_validate=1/2 works, I'm working with
arp_validate == 0.

Before the patchset (with arp_validate == 0 ):

*Any* packet (arp and non-arp) will signal us that the slave is up -
because we use slave->dev->last_rx (updated on *every* incoming packet in
bond_handle_frame).

After the patchset (with arp_validate == 0 ):

*ONLY ARP* packets signal us that the slave is up - because we use
slave->last_arp_rx that is updated every time we see an ARP packet.


The way the modes work with arp_validate > 0 don't change :), as we're
updating slave->last_arp_rx the old way in this case - after validation.

So that the scenario you've described still works flawlessly, and now
already we won't be tricked by some weird broadcast traffic even with
arp_validate == 0.

>
>>The documentation for arp_validate also states that *ARPs* will (not) be
>>validated if it's on/off, and that the arp monitoring works on arps as
>>traffic generators.
>
>	I wrote most of that text in the documentation, and the intent
>was not to imply that only ARPs should count for "up-ness" even without
>arp_validate enabled.  The intent was to distinguish it from
>"non-validate," in which any incoming traffic counted for "up-ness."
>
>	The main reason for preserving the non-validate behavior (any
>traffic counts) is for the loadbalance (xor and rr) modes.  In those
>modes, the switch decides which slave receives the incoming traffic, and
>so it's to our advantage to permit any incoming traffic to count for
>"up-ness."  The arp_validate option is not allowed in these modes
>because it won't work.
>
>	With these changes, I suspect that the loadbalance ARP monitor
>will be less reliable with these changes (granted that it's already a
>bit dodgy in its dependence on the switch to hit all slaves with
>incoming packets regularly).  Particularly if the switch ports are
>configured into an Etherchannel ("static link aggregation") group, in
>which case only one slave will receive any given frame (broadcast /
>multicast traffic will not be duplicated across all slaves).

The non-AB modes also gave me a headache, however after thinking a bit I've
decided to change them also (mainly, it's the change of
arp_loadbalance_mon function).

The usual usage, however, is to generate traffic via arps. If we don't see
arp replies - this means that arp_ip_target is down, and thus the slave is
down.

>
>	I'm not sure that this change (the "only count ARPs even without
>arp_validate" bit) won't break existing configurations.  Did you test
>the -rr and -xor modes with ARP monitor after your changes (with and
>without configuring a channel group on the switch ports)?

Sure, all works fine, afaics. Obviously, these were basic tests, and bugs
might exist.

The only possible scenario of breakage for someone, from my POV, is:

1) arp monitor is used with loadbalance mode
2) arp_ip_targets are set but _any_ arp replies are never received
3) the user relies on that every slave will receive at least one packet per
arp_interval

This use case:

1) contradicts with documentation
2) contradicts with logic (arp monitor, arp ip targets etc. are used
without, actually, meaning something)
3) is really unstable

In this case, indeed, it won't work. Two points, though:

1) It shouldn't work in the first place, is unstable etc.
2) Can be easily fixed by the following oneliner (though I *really* woudn't
like to do it, as it's useless and dangerous). Basically, for non-ab mode
we set last_arp_rx for every packet. Again, I wouldn't like to do it, but
if you know any use case scenario for this usage (no working arps but
continuous receive traffic) - I can send it as v2 or 7/6 patch (whatever
suits David better, as he already applied it but didn't push).

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 0f613ae..87358e5 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2286,6 +2286,11 @@ int bond_arp_rcv(const struct sk_buff *skb, struct bonding *bond,
  	__be32 sip, tip;
  	int alen;
  
+	if (!USES_PRIMARY(bond->params.mode)) {
+		slave->last_arp_rx = jiffies;
+		return RX_HANDLER_ANOTHER;
+	}
+
  	if (skb->protocol != __cpu_to_be16(ETH_P_ARP))
  		return RX_HANDLER_ANOTHER;
  

>
>>Also, the net_device->last_rx is already used in a lot of drivers (even
>>though the comment states to NOT do it :)), and it's also ugly to modify it
>>from bonding.
>
>	I didn't check, but I suspect those are mostly leftovers from
>the distant past, when the drivers were expected to update last_rx, or
>perhaps drivers using it for their own purposes.

It's really a mix. Somebody just updates them, somebody uses it for their
own purposes etc.

>
>	I don't really see an issue in decoupling bonding from the
>net_device->last_rx; it's pretty much the same thing that was done for
>trans_start some time ago.

trans_start removal is also in queue :). Though still needs some
polishing...

>
>	-J
>
>---
>	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
>
>
>>So, to fix this, remove the last_rx from bonding, *always* call
>>bond_arp_rcv() in slave's rx_handler (bond_handle_frame), and if we spot an
>>arp there - update the slave->last_arp_rx - and use it instead of
>>net_device->last_rx. Finally, rename slave_last_rx() to slave_last_arp_rx()
>>to reflect the changes.
>>
>>As the changes touch really sensitive parts, I've tried to split them as
>>much as possible, for easier debugging/bisecting.
>>
>>CC: Jay Vosburgh <fubar@us.ibm.com>
>>CC: Andy Gospodarek <andy@greyhouse.net>
>>CC: "David S. Miller" <davem@davemloft.net>
>>CC: netdev@vger.kernel.org
>>Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
>>
>>---
>> drivers/net/bonding/bond_main.c    | 18 ++++++++----------
>> drivers/net/bonding/bond_options.c | 12 ++----------
>> drivers/net/bonding/bonding.h      | 16 ++++++----------
>> include/linux/netdevice.h          |  8 +-------
>> 4 files changed, 17 insertions(+), 37 deletions(-)
>>
>

^ permalink raw reply related

* Fwd: [RFC PATCH net-next 0/3] virtio_net: add aRFS support
From: Zhi Yong Wu @ 2014-01-16  8:34 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Linux Netdev List, Tom Herbert, Eric Dumazet, David S. Miller,
	Zhi Yong Wu, Michael S. Tsirkin, Rusty Russell, Jason Wang
In-Reply-To: <52D75EA5.1050000@redhat.com>

CC: stefanha, MST, Rusty Russel

---------- Forwarded message ----------
From: Jason Wang <jasowang@redhat.com>
Date: Thu, Jan 16, 2014 at 12:23 PM
Subject: Re: [RFC PATCH net-next 0/3] virtio_net: add aRFS support
To: Zhi Yong Wu <zwu.kernel@gmail.com>
Cc: netdev@vger.kernel.org, therbert@google.com, edumazet@google.com,
davem@davemloft.net, Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>


On 01/15/2014 10:20 PM, Zhi Yong Wu wrote:
>
> From: Zhi Yong Wu<wuzhy@linux.vnet.ibm.com>
>
> HI, folks
>
> The patchset is trying to integrate aRFS support to virtio_net. In this case,
> aRFS will be used to select the RX queue. To make sure that it's going ahead
> in the correct direction, although it is still one RFC and isn't tested, it's
> post out ASAP. Any comment are appreciated, thanks.
>
> If anyone is interested in playing with it, you can get this patchset from my
> dev git on github:
>    git://github.com/wuzhy/kernel.git virtnet_rfs
>
> Zhi Yong Wu (3):
>    virtio_pci: Introduce one new config api vp_get_vq_irq()
>    virtio_net: Introduce one dummy function virtnet_filter_rfs()
>    virtio-net: Add accelerated RFS support
>
>   drivers/net/virtio_net.c      |   67 ++++++++++++++++++++++++++++++++++++++++-
>   drivers/virtio/virtio_pci.c   |   11 +++++++
>   include/linux/virtio_config.h |   12 +++++++
>   3 files changed, 89 insertions(+), 1 deletions(-)
>

Please run get_maintainter.pl before sending the patch. You'd better
at least cc virtio maintainer/list for this.

The core aRFS method is a noop in this RFC which make this series no
much sense to discuss. You should at least mention the big picture
here in the cover letter. I suggest you should post a RFC which can
run and has expected result or you can just raise a thread for the
design discussion.

And this method has been discussed before, you can search "[net-next
RFC PATCH 5/5] virtio-net: flow director support" in netdev archive
for a very old prototype implemented by me. It can work and looks like
most of this RFC have already done there.

A basic question is whether or not we need this, not all the mq cards
use aRFS (see ixgbe ATR). And whether or not it can bring extra
overheads? For virtio, we want to reduce the vmexits as much as
possible but this aRFS seems introduce a lot of more of this. Making a
complex interfaces just for an virtual device may not be good, simple
method may works for most of the cases.

We really should consider to offload this to real nic. VMDq and L2
forwarding offload may help in this case.


-- 
Regards,

Zhi Yong Wu

^ permalink raw reply

* [PATCH net-next] sctp: remove the unnecessary assignment
From: Wang Weidong @ 2014-01-16  8:25 UTC (permalink / raw)
  To: Neil Horman, David Miller, Vlad Yasevich; +Cc: netdev, linux-sctp

When go the right path, the status is 0, no need to assign it again.
So just remove the assignment.

Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
---
 net/sctp/protocol.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index 7c16108..d6934dc 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -1461,7 +1461,6 @@ static __init int sctp_init(void)
 	if (status)
 		goto err_v6_add_protocol;
 
-	status = 0;
 out:
 	return status;
 err_v6_add_protocol:
-- 
1.7.12

^ permalink raw reply related

* Re: [PATCH net-next 1/2] bonding: add sysfs /slave dir for bond slave devices.
From: Ding Tianhong @ 2014-01-16  8:04 UTC (permalink / raw)
  To: Scott Feldman, vfalico, fubar, andy; +Cc: netdev, roopa, shm
In-Reply-To: <20140116055434.32220.89883.stgit@monster-03.cumulusnetworks.com>

On 2014/1/16 13:54, Scott Feldman wrote:
> Add sub-directory under /sys/class/net/<interface>/slave with
> read-only attributes for slave.  Directory only appears when
> <interface> is a slave.
> 
> $ tree /sys/class/net/eth2/slave/
> /sys/class/net/eth2/slave/
> ├── ad_aggregator_id
> ├── link_failure_count
> ├── mii_status
> ├── perm_hwaddr
> ├── queue_id
> └── state
> 
> $ cat /sys/class/net/eth2/slave/*
> 2
> 0
> up
> 40:02:10:ef:06:01
> 0
> active
> 
> Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
> ---
>  drivers/net/bonding/Makefile           |    2 
>  drivers/net/bonding/bond_main.c        |   27 ++++++
>  drivers/net/bonding/bond_procfs.c      |   12 ---
>  drivers/net/bonding/bond_sysfs_slave.c |  142 ++++++++++++++++++++++++++++++++
>  drivers/net/bonding/bonding.h          |    4 +
>  5 files changed, 174 insertions(+), 13 deletions(-)
>  create mode 100644 drivers/net/bonding/bond_sysfs_slave.c
> 
> diff --git a/drivers/net/bonding/Makefile b/drivers/net/bonding/Makefile
> index 5a5d720..6f4e808 100644
> --- a/drivers/net/bonding/Makefile
> +++ b/drivers/net/bonding/Makefile
> @@ -4,7 +4,7 @@
>  
>  obj-$(CONFIG_BONDING) += bonding.o
>  
> -bonding-objs := bond_main.o bond_3ad.o bond_alb.o bond_sysfs.o bond_debugfs.o bond_netlink.o bond_options.o
> +bonding-objs := bond_main.o bond_3ad.o bond_alb.o bond_sysfs.o bond_sysfs_slave.o bond_debugfs.o bond_netlink.o bond_options.o
>  
>  proc-$(CONFIG_PROC_FS) += bond_procfs.o
>  bonding-objs += $(proc-y)
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index f2fe6cb..4f1adae 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -466,6 +466,22 @@ static void bond_update_speed_duplex(struct slave *slave)
>  	return;
>  }
>  
> +const char *bond_slave_link_status(s8 link)
> +{
> +	switch (link) {
> +	case BOND_LINK_UP:
> +		return "up";
> +	case BOND_LINK_FAIL:
> +		return "going down";
> +	case BOND_LINK_DOWN:
> +		return "down";
> +	case BOND_LINK_BACK:
> +		return "going back";
> +	default:
> +		return "unknown";
> +	}
> +}
> +
>  /*
>   * if <dev> supports MII link status reporting, check its link status.
>   *
> @@ -1576,6 +1592,12 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
>  		goto err_unregister;
>  	}
>  
> +	res = bond_sysfs_slave_add(new_slave);
> +	if (res) {
> +		pr_debug("Error %d calling bond_sysfs_slave_add\n", res);
> +		goto err_upper_unlink;
> +	}
> +
>  	bond->slave_cnt++;
>  	bond_compute_features(bond);
>  	bond_set_carrier(bond);
> @@ -1595,6 +1617,9 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
>  	return 0;
>  
>  /* Undo stages on error */
> +err_upper_unlink:
> +	bond_upper_dev_unlink(bond_dev, slave_dev);
> +
>  err_unregister:
>  	netdev_rx_handler_unregister(slave_dev);
>  
> @@ -1687,6 +1712,8 @@ static int __bond_release_one(struct net_device *bond_dev,
>  	/* release the slave from its bond */
>  	bond->slave_cnt--;
>  
> +	bond_sysfs_slave_del(slave);
> +
>  	bond_upper_dev_unlink(bond_dev, slave_dev);
>  	/* unregister rx_handler early so bond_handle_frame wouldn't be called
>  	 * for this slave anymore.
> diff --git a/drivers/net/bonding/bond_procfs.c b/drivers/net/bonding/bond_procfs.c
> index fb868d6..8515b344 100644
> --- a/drivers/net/bonding/bond_procfs.c
> +++ b/drivers/net/bonding/bond_procfs.c
> @@ -159,18 +159,6 @@ static void bond_info_show_master(struct seq_file *seq)
>  	}
>  }
>  
> -static const char *bond_slave_link_status(s8 link)
> -{
> -	static const char * const status[] = {
> -		[BOND_LINK_UP] = "up",
> -		[BOND_LINK_FAIL] = "going down",
> -		[BOND_LINK_DOWN] = "down",
> -		[BOND_LINK_BACK] = "going back",
> -	};
> -
> -	return status[link];
> -}
> -
>  static void bond_info_show_slave(struct seq_file *seq,
>  				 const struct slave *slave)
>  {
> diff --git a/drivers/net/bonding/bond_sysfs_slave.c b/drivers/net/bonding/bond_sysfs_slave.c
> new file mode 100644
> index 0000000..28390af
> --- /dev/null
> +++ b/drivers/net/bonding/bond_sysfs_slave.c
> @@ -0,0 +1,142 @@
> +/*	Sysfs attributes of bond slaves
> + *
> + *      Copyright (c) 2014 Scott Feldman <sfeldma@cumulusnetworks.com>
> + *
> + *	This program is free software; you can redistribute it and/or
> + *	modify it under the terms of the GNU General Public License
> + *	as published by the Free Software Foundation; either version
> + *	2 of the License, or (at your option) any later version.
> + */
> +
> +#include <linux/capability.h>
> +#include <linux/kernel.h>
> +#include <linux/netdevice.h>
> +
> +#include "bonding.h"
> +
> +struct slave_attribute {
> +	struct attribute attr;
> +	ssize_t (*show)(struct slave *, char *);
> +};
> +
> +#define SLAVE_ATTR(_name, _mode, _show)				\
> +const struct slave_attribute slave_attr_##_name = {		\
> +	.attr = {.name = __stringify(_name),			\
> +		 .mode = _mode },				\
> +	.show	= _show,					\
> +};
> +#define SLAVE_ATTR_RO(_name) \
> +	SLAVE_ATTR(_name, S_IRUGO, _name##_show)
> +
> +static ssize_t state_show(struct slave *slave, char *buf)
> +{
> +	switch (bond_slave_state(slave)) {
> +	case BOND_STATE_ACTIVE:
> +		return sprintf(buf, "active\n");
> +	case BOND_STATE_BACKUP:
> +		return sprintf(buf, "backup\n");
> +	default:
> +		return sprintf(buf, "UNKONWN\n");
> +	}
> +}
> +static SLAVE_ATTR_RO(state);
> +
> +static ssize_t mii_status_show(struct slave *slave, char *buf)
> +{
> +	return sprintf(buf, "%s\n", bond_slave_link_status(slave->link));
> +}
> +static SLAVE_ATTR_RO(mii_status);
> +
> +static ssize_t link_failure_count_show(struct slave *slave, char *buf)
> +{
> +	return sprintf(buf, "%d\n", slave->link_failure_count);
> +}
> +static SLAVE_ATTR_RO(link_failure_count);
> +
> +static ssize_t perm_hwaddr_show(struct slave *slave, char *buf)
> +{
> +	return sprintf(buf, "%pM\n", slave->perm_hwaddr);
> +}
> +static SLAVE_ATTR_RO(perm_hwaddr);
> +
> +static ssize_t queue_id_show(struct slave *slave, char *buf)
> +{
> +	return sprintf(buf, "%d\n", slave->queue_id);
> +}
> +static SLAVE_ATTR_RO(queue_id);
> +
> +static ssize_t ad_aggregator_id_show(struct slave *slave, char *buf)
> +{
> +	const struct aggregator *agg;
> +
> +	if (slave->bond->params.mode == BOND_MODE_8023AD) {
> +		agg = SLAVE_AD_INFO(slave).port.aggregator;
> +		if (agg)
> +			return sprintf(buf, "%d\n",
> +				       agg->aggregator_identifier);
> +	}
> +
> +	return sprintf(buf, "N/A\n");
> +}
> +static SLAVE_ATTR_RO(ad_aggregator_id);
> +
> +static const struct slave_attribute *slave_attrs[] = {
> +	&slave_attr_state,
> +	&slave_attr_mii_status,
> +	&slave_attr_link_failure_count,
> +	&slave_attr_perm_hwaddr,
> +	&slave_attr_queue_id,
> +	&slave_attr_ad_aggregator_id,
> +	NULL
> +};
> +
> +#define to_slave_attr(_at) container_of(_at, struct slave_attribute, attr)
> +#define to_slave(obj)	container_of(obj, struct slave, kobj)
> +
> +static ssize_t slave_show(struct kobject *kobj,
> +			  struct attribute *attr, char *buf)
> +{
> +	struct slave_attribute *slave_attr = to_slave_attr(attr);
> +	struct slave *slave = to_slave(kobj);
> +
> +	return slave_attr->show(slave, buf);
> +}
> +
> +const struct sysfs_ops slave_sysfs_ops = {
> +	.show = slave_show,
> +};
> +
> +static struct kobj_type slave_ktype = {
> +#ifdef CONFIG_SYSFS
> +	.sysfs_ops = &slave_sysfs_ops,
> +#endif
> +};
> +
> +int bond_sysfs_slave_add(struct slave *slave)
> +{
> +	const struct slave_attribute **a;
> +	int err;
> +
> +	err = kobject_init_and_add(&slave->kobj, &slave_ktype,
> +				   &(slave->dev->dev.kobj), "slave");
> +	if (err)
> +		return err;
> +
> +	for (a = slave_attrs; *a; ++a) {
> +		err = sysfs_create_file(&slave->kobj, &((*a)->attr));
> +		if (err)
> +			return err;
> +	}
> +
I think if add rollback path if create failed, it will be better.

Regards
Ding 


> +	return 0;
> +}
> +
> +void bond_sysfs_slave_del(struct slave *slave)
> +{
> +	const struct slave_attribute **a;
> +
> +	for (a = slave_attrs; *a; ++a)
> +		sysfs_remove_file(&slave->kobj, &((*a)->attr));
> +
> +	kobject_del(&slave->kobj);
> +}
> diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
> index 955dc48..309757d 100644
> --- a/drivers/net/bonding/bonding.h
> +++ b/drivers/net/bonding/bonding.h
> @@ -203,6 +203,7 @@ struct slave {
>  #ifdef CONFIG_NET_POLL_CONTROLLER
>  	struct netpoll *np;
>  #endif
> +	struct kobject kobj;
>  };
>  
>  /*
> @@ -421,6 +422,8 @@ int bond_create(struct net *net, const char *name);
>  int bond_create_sysfs(struct bond_net *net);
>  void bond_destroy_sysfs(struct bond_net *net);
>  void bond_prepare_sysfs_group(struct bonding *bond);
> +int bond_sysfs_slave_add(struct slave *slave);
> +void bond_sysfs_slave_del(struct slave *slave);
>  int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev);
>  int bond_release(struct net_device *bond_dev, struct net_device *slave_dev);
>  int bond_xmit_hash(struct bonding *bond, struct sk_buff *skb, int count);
> @@ -469,6 +472,7 @@ int bond_option_lacp_rate_set(struct bonding *bond, int lacp_rate);
>  int bond_option_ad_select_set(struct bonding *bond, int ad_select);
>  struct net_device *bond_option_active_slave_get_rcu(struct bonding *bond);
>  struct net_device *bond_option_active_slave_get(struct bonding *bond);
> +const char *bond_slave_link_status(s8 link);
>  
>  struct bond_net {
>  	struct net *		net;	/* Associated network namespace */
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> .
> 

^ permalink raw reply

* [Xen-devel][PATCH net-next v2] xen-netfront: clean up code in xennet_release_rx_bufs
From: Annie Li @ 2014-01-15 23:57 UTC (permalink / raw)
  To: xen-devel, netdev
  Cc: davem, konrad.wilk, ian.campbell, wei.liu2, david.vrabel,
	andrew.bennieston, annie.li, Annie Li

This patch implements two things:

* release grant reference and skb for rx path, this fixex resource leaking.
* clean up grant transfer code kept from old netfront(2.6.18) which grants
pages for access/map and transfer. But grant transfer is deprecated in current
netfront, so remove corresponding release code for transfer.

gnttab_end_foreign_access_ref may fail when the grant entry is currently used
for reading or writing. But this patch does not cover this and improvement for
this failure may be implemented in a separate patch.

Test has been run with this patch.

V2: improve patch comments.

Signed-off-by: Annie Li <Annie.li@oracle.com>
---
 drivers/net/xen-netfront.c |   60 ++-----------------------------------------
 1 files changed, 3 insertions(+), 57 deletions(-)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index e59acb1..692589e 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -1134,78 +1134,24 @@ static void xennet_release_tx_bufs(struct netfront_info *np)
 
 static void xennet_release_rx_bufs(struct netfront_info *np)
 {
-	struct mmu_update      *mmu = np->rx_mmu;
-	struct multicall_entry *mcl = np->rx_mcl;
-	struct sk_buff_head free_list;
 	struct sk_buff *skb;
-	unsigned long mfn;
-	int xfer = 0, noxfer = 0, unused = 0;
 	int id, ref;
 
-	dev_warn(&np->netdev->dev, "%s: fix me for copying receiver.\n",
-			 __func__);
-	return;
-
-	skb_queue_head_init(&free_list);
-
 	spin_lock_bh(&np->rx_lock);
 
 	for (id = 0; id < NET_RX_RING_SIZE; id++) {
 		ref = np->grant_rx_ref[id];
-		if (ref == GRANT_INVALID_REF) {
-			unused++;
+		if (ref == GRANT_INVALID_REF)
 			continue;
-		}
 
 		skb = np->rx_skbs[id];
-		mfn = gnttab_end_foreign_transfer_ref(ref);
+		gnttab_end_foreign_access_ref(ref, 0);
 		gnttab_release_grant_reference(&np->gref_rx_head, ref);
 		np->grant_rx_ref[id] = GRANT_INVALID_REF;
 
-		if (0 == mfn) {
-			skb_shinfo(skb)->nr_frags = 0;
-			dev_kfree_skb(skb);
-			noxfer++;
-			continue;
-		}
-
-		if (!xen_feature(XENFEAT_auto_translated_physmap)) {
-			/* Remap the page. */
-			const struct page *page =
-				skb_frag_page(&skb_shinfo(skb)->frags[0]);
-			unsigned long pfn = page_to_pfn(page);
-			void *vaddr = page_address(page);
-
-			MULTI_update_va_mapping(mcl, (unsigned long)vaddr,
-						mfn_pte(mfn, PAGE_KERNEL),
-						0);
-			mcl++;
-			mmu->ptr = ((u64)mfn << PAGE_SHIFT)
-				| MMU_MACHPHYS_UPDATE;
-			mmu->val = pfn;
-			mmu++;
-
-			set_phys_to_machine(pfn, mfn);
-		}
-		__skb_queue_tail(&free_list, skb);
-		xfer++;
-	}
-
-	dev_info(&np->netdev->dev, "%s: %d xfer, %d noxfer, %d unused\n",
-		 __func__, xfer, noxfer, unused);
-
-	if (xfer) {
-		if (!xen_feature(XENFEAT_auto_translated_physmap)) {
-			/* Do all the remapping work and M2P updates. */
-			MULTI_mmu_update(mcl, np->rx_mmu, mmu - np->rx_mmu,
-					 NULL, DOMID_SELF);
-			mcl++;
-			HYPERVISOR_multicall(np->rx_mcl, mcl - np->rx_mcl);
-		}
+		kfree_skb(skb);
 	}
 
-	__skb_queue_purge(&free_list);
-
 	spin_unlock_bh(&np->rx_lock);
 }
 
-- 
1.7.6.5

^ permalink raw reply related

* Re: [PATCH v2 2/2] net/mlx4_core: clean up srq_res_start_move_to()
From: Jack Morgenstein @ 2014-01-16  7:47 UTC (permalink / raw)
  To: Paul Bolle
  Cc: Or Gerlitz, Rony Efraim, Hadar Hen Zion, David S. Miller, netdev,
	linux-kernel
In-Reply-To: <1389728812.28068.9.camel@x220>

ACK.  OK.

-Jack

On Tue, 14 Jan 2014 20:46:52 +0100
Paul Bolle <pebolle@tiscali.nl> wrote:

> Building resource_tracker.o triggers a GCC warning:
>     drivers/net/ethernet/mellanox/mlx4/resource_tracker.c: In
> function 'mlx4_HW2SW_SRQ_wrapper':
> drivers/net/ethernet/mellanox/mlx4/resource_tracker.c:3202:17:
> warning: 'srq' may be used uninitialized in this function
> [-Wmaybe-uninitialized] atomic_dec(&srq->mtt->ref_count); ^
> 
> This is a false positive. But a cleanup of srq_res_start_move_to() can
> help GCC here. The code currently uses a switch statement where a
> plain if/else would do, since only two of the switch's four cases can
> ever occur. Dropping that switch makes the warning go away.
> 
> While we're at it, add some missing braces, and convert state to the
> correct type.
> 
> Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
> ---
> v2: adjust to Jack's review.
> 
>  .../net/ethernet/mellanox/mlx4/resource_tracker.c  | 46
> ++++++++-------------- 1 file changed, 16 insertions(+), 30
> deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
> b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c index
> 15cd659..4acd84c 100644 ---
> a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c +++
> b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c @@ -1371,7
> +1371,7 @@ static int cq_res_start_move_to(struct mlx4_dev *dev, int
> slave, int cqn, } 
>  static int srq_res_start_move_to(struct mlx4_dev *dev, int slave,
> int index,
> -				 enum res_cq_states state, struct
> res_srq **srq)
> +				 enum res_srq_states state, struct
> res_srq **srq) {
>  	struct mlx4_priv *priv = mlx4_priv(dev);
>  	struct mlx4_resource_tracker *tracker =
> &priv->mfunc.master.res_tracker; @@ -1380,39 +1380,25 @@ static int
> srq_res_start_move_to(struct mlx4_dev *dev, int slave, int index, 
>  	spin_lock_irq(mlx4_tlock(dev));
>  	r = res_tracker_lookup(&tracker->res_tree[RES_SRQ], index);
> -	if (!r)
> +	if (!r) {
>  		err = -ENOENT;
> -	else if (r->com.owner != slave)
> +	} else if (r->com.owner != slave) {
>  		err = -EPERM;
> -	else {
> -		switch (state) {
> -		case RES_SRQ_BUSY:
> -			err = -EINVAL;
> -			break;
> -
> -		case RES_SRQ_ALLOCATED:
> -			if (r->com.state != RES_SRQ_HW)
> -				err = -EINVAL;
> -			else if (atomic_read(&r->ref_count))
> -				err = -EBUSY;
> -			break;
> -
> -		case RES_SRQ_HW:
> -			if (r->com.state != RES_SRQ_ALLOCATED)
> -				err = -EINVAL;
> -			break;
> -
> -		default:
> +	} else if (state == RES_SRQ_ALLOCATED) {
> +		if (r->com.state != RES_SRQ_HW)
>  			err = -EINVAL;
> -		}
> +		else if (atomic_read(&r->ref_count))
> +			err = -EBUSY;
> +	} else if (state != RES_SRQ_HW || r->com.state !=
> RES_SRQ_ALLOCATED) {
> +		err = -EINVAL;
> +	}
>  
> -		if (!err) {
> -			r->com.from_state = r->com.state;
> -			r->com.to_state = state;
> -			r->com.state = RES_SRQ_BUSY;
> -			if (srq)
> -				*srq = r;
> -		}
> +	if (!err) {
> +		r->com.from_state = r->com.state;
> +		r->com.to_state = state;
> +		r->com.state = RES_SRQ_BUSY;
> +		if (srq)
> +			*srq = r;
>  	}
>  
>  	spin_unlock_irq(mlx4_tlock(dev));

^ permalink raw reply

* Re: [PATCH v2 1/2] net/mlx4_core: clean up cq_res_start_move_to()
From: Jack Morgenstein @ 2014-01-16  7:46 UTC (permalink / raw)
  To: Paul Bolle
  Cc: Or Gerlitz, Rony Efraim, Hadar Hen Zion, David S. Miller, netdev,
	linux-kernel
In-Reply-To: <1389728736.28068.8.camel@x220>

ACK.  OK.

-Jack

On Tue, 14 Jan 2014 20:45:36 +0100
Paul Bolle <pebolle@tiscali.nl> wrote:

> Building resource_tracker.o triggers a GCC warning:
>     drivers/net/ethernet/mellanox/mlx4/resource_tracker.c: In
> function 'mlx4_HW2SW_CQ_wrapper':
> drivers/net/ethernet/mellanox/mlx4/resource_tracker.c:3019:16:
> warning: 'cq' may be used uninitialized in this function
> [-Wmaybe-uninitialized] atomic_dec(&cq->mtt->ref_count); ^
> 
> This is a false positive. But a cleanup of cq_res_start_move_to() can
> help GCC here. The code currently uses a switch statement where an
> if/else construct would do too, since only two of the switch's four
> cases can ever occur. Dropping that switch makes the warning go away.
> 
> While we're at it, add some missing braces.
> 
> Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
> ---
> v2: adjust to Jack's review.
> 
>  .../net/ethernet/mellanox/mlx4/resource_tracker.c  | 52
> ++++++++-------------- 1 file changed, 19 insertions(+), 33
> deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
> b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c index
> 2f3f2bc..15cd659 100644 ---
> a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c +++
> b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c @@ -1340,43
> +1340,29 @@ static int cq_res_start_move_to(struct mlx4_dev *dev, int
> slave, int cqn, spin_lock_irq(mlx4_tlock(dev));
>  	r = res_tracker_lookup(&tracker->res_tree[RES_CQ], cqn);
> -	if (!r)
> +	if (!r) {
>  		err = -ENOENT;
> -	else if (r->com.owner != slave)
> +	} else if (r->com.owner != slave) {
>  		err = -EPERM;
> -	else {
> -		switch (state) {
> -		case RES_CQ_BUSY:
> -			err = -EBUSY;
> -			break;
> -
> -		case RES_CQ_ALLOCATED:
> -			if (r->com.state != RES_CQ_HW)
> -				err = -EINVAL;
> -			else if (atomic_read(&r->ref_count))
> -				err = -EBUSY;
> -			else
> -				err = 0;
> -			break;
> -
> -		case RES_CQ_HW:
> -			if (r->com.state != RES_CQ_ALLOCATED)
> -				err = -EINVAL;
> -			else
> -				err = 0;
> -			break;
> -
> -		default:
> +	} else if (state == RES_CQ_ALLOCATED) {
> +		if (r->com.state != RES_CQ_HW)
>  			err = -EINVAL;
> -		}
> +		else if (atomic_read(&r->ref_count))
> +			err = -EBUSY;
> +		else
> +			err = 0;
> +	} else if (state != RES_CQ_HW || r->com.state !=
> RES_CQ_ALLOCATED) {
> +		err = -EINVAL;
> +	} else {
> +		err = 0;
> +	}
>  
> -		if (!err) {
> -			r->com.from_state = r->com.state;
> -			r->com.to_state = state;
> -			r->com.state = RES_CQ_BUSY;
> -			if (cq)
> -				*cq = r;
> -		}
> +	if (!err) {
> +		r->com.from_state = r->com.state;
> +		r->com.to_state = state;
> +		r->com.state = RES_CQ_BUSY;
> +		if (cq)
> +			*cq = r;
>  	}
>  
>  	spin_unlock_irq(mlx4_tlock(dev));

^ permalink raw reply

* Re: throughput problems with realtek
From: Dmitry Kasatkin @ 2014-01-16  7:45 UTC (permalink / raw)
  To: Francois Romieu; +Cc: Rick Jones, nic_swsd, netdev, l.moiseichuk
In-Reply-To: <20140115235123.GA6773@electric-eye.fr.zoreil.com>

On Thu, Jan 16, 2014 at 1:51 AM, Francois Romieu <romieu@fr.zoreil.com> wrote:
> Dmitry Kasatkin <dmitry.kasatkin@gmail.com> :
> [...]
>> I do not see any link speed changes... it stays the same...
>> The same problem is visible on absolutely different computers.
>
> No netdev watchdog message either ?
>
> A serving 8168c does not seem to fluctuate (3.12.5, Intel 82578 client).
> My hardware is a bit old though. Please send XID message from the r8169
> driver. It should be seen in dmesg.
>
> Thanks.
>
> --
> Ueimor

Hi,

No watchdog messages. I also do not see any changes in the led on the
switch which indicates changes in the link speed.

Also as you say.. speed drop first to 140MBs so it is still higher than 100MBs

This what I have in dmesg...

[    2.005555] r8169 0000:03:00.0: irq 45 for MSI/MSI-X
[    2.005690] r8169 0000:03:00.0 eth0: RTL8168evl/8111evl at
0xffffc90000040000, 18:67:b0:2c:0f:ef, XID 0c900800 IRQ 45
[    2.005691] r8169 0000:03:00.0 eth0: jumbo features [frames: 9200
bytes, tx checksumming: ko]

Is it that XID or something else..

I have a problem on Gygabyte Celleron 1007U integrated board and also
on Samsung Series 7 laptop.
My colleague has the same on his Series 7 laptop.

One thing.. I had similar problem with Intel network card. It was
fixed using module parameter...

modprobe e1000e InterruptThrottleRate=0

Are there any network speed or irq throttling in kernel/driver?

-- 
Thanks,
Dmitry

^ permalink raw reply

* Re: [PATCH net-next] virtio-net: drop rq->max and rq->num
From: Michael S. Tsirkin @ 2014-01-16  7:42 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, davem, linux-kernel, virtualization
In-Reply-To: <1389854724-48411-1-git-send-email-jasowang@redhat.com>

On Thu, Jan 16, 2014 at 02:45:24PM +0800, Jason Wang wrote:
> It looks like there's no need for those two fields:
> 
> - Unless there's a failure for the first refill try, rq->max should be always
>   equal to the vring size.
> - rq->num is only used to determine the condition that we need to do the refill,
>   we could check vq->num_free instead.
> - rq->num was required to be increased or decreased explicitly after each
>   get/put which results a bad API.
> 
> So this patch removes them both to make the code simpler.
> 
> Cc: Rusty Russell <rusty@rustcorp.com.au>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> Acked-by: Rusty Russell <rusty@rustcorp.com.au>


Acked-by: Michael S. Tsirkin <mst@redhat.com>

> ---
>  drivers/net/virtio_net.c |   16 +++-------------
>  1 files changed, 3 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 7b17240..9bd70aa 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -72,9 +72,6 @@ struct receive_queue {
>  
>  	struct napi_struct napi;
>  
> -	/* Number of input buffers, and max we've ever had. */
> -	unsigned int num, max;
> -
>  	/* Chain pages by the private ptr. */
>  	struct page *pages;
>  
> @@ -360,7 +357,6 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>  		}
>  
>  		page = virt_to_head_page(buf);
> -		--rq->num;
>  
>  		num_skb_frags = skb_shinfo(curr_skb)->nr_frags;
>  		if (unlikely(num_skb_frags == MAX_SKB_FRAGS)) {
> @@ -406,7 +402,6 @@ err_skb:
>  		}
>  		page = virt_to_head_page(buf);
>  		put_page(page);
> -		--rq->num;
>  	}
>  err_buf:
>  	dev->stats.rx_dropped++;
> @@ -628,10 +623,7 @@ static bool try_fill_recv(struct receive_queue *rq, gfp_t gfp)
>  		oom = err == -ENOMEM;
>  		if (err)
>  			break;
> -		++rq->num;
>  	} while (rq->vq->num_free);
> -	if (unlikely(rq->num > rq->max))
> -		rq->max = rq->num;
>  	if (unlikely(!virtqueue_kick(rq->vq)))
>  		return false;
>  	return !oom;
> @@ -699,11 +691,10 @@ again:
>  	while (received < budget &&
>  	       (buf = virtqueue_get_buf(rq->vq, &len)) != NULL) {
>  		receive_buf(rq, buf, len);
> -		--rq->num;
>  		received++;
>  	}
>  
> -	if (rq->num < rq->max / 2) {
> +	if (rq->vq->num_free > virtqueue_get_vring_size(rq->vq) / 2) {
>  		if (!try_fill_recv(rq, GFP_ATOMIC))
>  			schedule_delayed_work(&vi->refill, 0);
>  	}
> @@ -1398,9 +1389,7 @@ static void free_unused_bufs(struct virtnet_info *vi)
>  				give_pages(&vi->rq[i], buf);
>  			else
>  				dev_kfree_skb(buf);
> -			--vi->rq[i].num;
>  		}
> -		BUG_ON(vi->rq[i].num != 0);
>  	}
>  }
>  
> @@ -1671,7 +1660,8 @@ static int virtnet_probe(struct virtio_device *vdev)
>  		try_fill_recv(&vi->rq[i], GFP_KERNEL);
>  
>  		/* If we didn't even get one input buffer, we're useless. */
> -		if (vi->rq[i].num == 0) {
> +		if (vi->rq[i].vq->num_free ==
> +		    virtqueue_get_vring_size(vi->rq[i].vq)) {
>  			free_unused_bufs(vi);
>  			err = -ENOMEM;
>  			goto free_recv_bufs;
> -- 
> 1.7.1

^ permalink raw reply

* Re: [PATCH net-next 2/2] r6040: use ETH_ZLEN instead of MISR for SKB length checking
From: Ben Hutchings @ 2014-01-16  7:23 UTC (permalink / raw)
  To: Florian Fainelli; +Cc: netdev, davem
In-Reply-To: <1389819866-32142-3-git-send-email-florian@openwrt.org>

[-- Attachment #1: Type: text/plain, Size: 3766 bytes --]

On Wed, 2014-01-15 at 13:04 -0800, Florian Fainelli wrote:
> Ever since this driver was merged the following code was included:
> 
> if (skb->len < MISR)
> 	skb->len = MISR;

The second 'skb->len' is currently 'descptr->len'.

> MISR is defined to 0x3C which is also equivalent to ETH_ZLEN, but use
> ETH_ZLEN directly which is exactly what we want to be checking for.
> 
> Reported-by: Marc Volovic <marcv@ezchip.com>
> Signed-off-by: Florian Fainelli <florian@openwrt.org>
> ---
>  drivers/net/ethernet/rdc/r6040.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/rdc/r6040.c b/drivers/net/ethernet/rdc/r6040.c
> index ff4683a..eb15ebf 100644
> --- a/drivers/net/ethernet/rdc/r6040.c
> +++ b/drivers/net/ethernet/rdc/r6040.c
> @@ -836,8 +836,8 @@ static netdev_tx_t r6040_start_xmit(struct sk_buff *skb,
>  	/* Set TX descriptor & Transmit it */
>  	lp->tx_free_desc--;
>  	descptr = lp->tx_insert_ptr;
> -	if (skb->len < MISR)
> -		descptr->len = MISR;
> +	if (skb->len < ETH_ZLEN)
> +		descptr->len = ETH_ZLEN;

It looks like this is just increasing the TX descriptor length so the
packet is tail-padded with whatever happens to come next in the skb.
This is absolutely incorrect behaviour and may leak sensitive
information.

Presumably it is necessary to pad the frame because the MAC is too lame
to do it, and the correct way to do that is using skb_padto(skb,
ETH_ZLEN).  But this may fail as it might have to allocate memory

Additionally, the driver DMA-maps a length of skb->len but needs to map
a length of descptr->len.

Finally, it doesn't check for failure of DMA mapping.

Tidying up naming should be way down the list of priorities for
fixing this code.

I think this will fix those problems, though I need to split it up into
multiple patches:

--- a/drivers/net/ethernet/rdc/r6040.c
+++ b/drivers/net/ethernet/rdc/r6040.c
@@ -816,6 +816,7 @@ static netdev_tx_t r6040_start_xmit(struct sk_buff *skb,
 	struct r6040_descriptor *descptr;
 	void __iomem *ioaddr = lp->base;
 	unsigned long flags;
+	dma_addr_t dma_addr;
 
 	/* Critical Section */
 	spin_lock_irqsave(&lp->lock, flags);
@@ -828,24 +829,35 @@ static netdev_tx_t r6040_start_xmit(struct sk_buff *skb,
 		return NETDEV_TX_BUSY;
 	}
 
-	/* Statistic Counter */
-	dev->stats.tx_packets++;
-	dev->stats.tx_bytes += skb->len;
 	/* Set TX descriptor & Transmit it */
-	lp->tx_free_desc--;
 	descptr = lp->tx_insert_ptr;
-	if (skb->len < MISR)
-		descptr->len = MISR;
-	else
-		descptr->len = skb->len;
+
+	/* MAC doesn't pad frames so we have to */
+	if (skb_padto(skb, ETH_ZLEN)) {
+		kfree_skb(skb);
+		goto out;
+	}
+	descptr->len = max_t(int, skb->len, ETH_ZLEN);
+
+	dma_addr = pci_map_single(lp->pdev, skb->data, descptr->len,
+				  PCI_DMA_TODEVICE);
+	if (pci_dma_mapping_error(lp->pdev, dma_addr)) {
+		kfree_skb(skb);
+		goto out;
+	}
+
+	lp->tx_free_desc--;
 
 	descptr->skb_ptr = skb;
-	descptr->buf = cpu_to_le32(pci_map_single(lp->pdev,
-		skb->data, skb->len, PCI_DMA_TODEVICE));
+	descptr->buf = cpu_to_le32(dma_addr);
 	descptr->status = DSC_OWNER_MAC;
 
 	skb_tx_timestamp(skb);
 
+	/* Statistic Counter */
+	dev->stats.tx_packets++;
+	dev->stats.tx_bytes += skb->len;
+
 	/* Trigger the MAC to check the TX descriptor */
 	iowrite16(TM2TX, ioaddr + MTPR);
 	lp->tx_insert_ptr = descptr->vndescp;
@@ -854,6 +866,7 @@ static netdev_tx_t r6040_start_xmit(struct sk_buff *skb,
 	if (!lp->tx_free_desc)
 		netif_stop_queue(dev);
 
+out:
 	spin_unlock_irqrestore(&lp->lock, flags);
 
 	return NETDEV_TX_OK;
---

Ben.

-- 
Ben Hutchings
Quantity is no substitute for quality, but it's the only one we've got.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply

* Re: [PATCH net-next 00/13] Assorted mvneta fixes and improvements
From: Willy Tarreau @ 2014-01-16  7:22 UTC (permalink / raw)
  To: davem
  Cc: netdev, Thomas Petazzoni, Gregory CLEMENT, Arnaud Ebalard,
	Eric Dumazet, Ben Hutchings
In-Reply-To: <1389856819-6503-1-git-send-email-w@1wt.eu>

On Thu, Jan 16, 2014 at 08:20:06AM +0100, Willy Tarreau wrote:
> Hi,
> 
> this series provides some fixes for a number of issues met with the
> mvneta driver, then adds some improvements. Patches 1-5 are fixes
> and would be needed in 3.13 and likely -stable. The next ones are
> performance improvements and cleanups :

I just wanted to make it clear that the whole series is desired for net-next.

Thanks,
Willy

^ permalink raw reply

* [PATCH 11/13] net: mvneta: implement rx_copybreak
From: Willy Tarreau @ 2014-01-16  7:20 UTC (permalink / raw)
  To: davem; +Cc: netdev, Willy Tarreau, Thomas Petazzoni, Gregory CLEMENT
In-Reply-To: <1389856819-6503-1-git-send-email-w@1wt.eu>

calling dma_map_single()/dma_unmap_single() is quite expensive compared
to copying a small packet. So let's copy short frames and keep the buffers
mapped. We set the limit to 256 bytes which seems to give good results both
on the XP-GP board and on the AX3/4.

The Rx small packet rate increased by 16.4% doing this, from 486kpps to
573kpps. It is worth noting that even the call to the function
dma_sync_single_range_for_cpu() is expensive (300 ns) although less
than dma_unmap_single(). Without it, the packet rate raises to 711kpps
(+24% more). Thus on systems where coherency from device to CPU is
guaranteed by a snoop control unit, this patch should provide even more
gains, and probably rx_copybreak could be increased.

Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
Tested-by: Arnaud Ebalard <arno@natisbad.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
---
 drivers/net/ethernet/marvell/mvneta.c | 44 ++++++++++++++++++++++++++++++-----
 1 file changed, 38 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 726a8d2..f5fc7a2 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -444,6 +444,8 @@ static int txq_number = 8;
 
 static int rxq_def;
 
+static int rx_copybreak __read_mostly = 256;
+
 #define MVNETA_DRIVER_NAME "mvneta"
 #define MVNETA_DRIVER_VERSION "1.0"
 
@@ -1463,22 +1465,51 @@ static int mvneta_rx(struct mvneta_port *pp, int rx_todo,
 		rx_done++;
 		rx_filled++;
 		rx_status = rx_desc->status;
+		rx_bytes = rx_desc->data_size - (ETH_FCS_LEN + MVNETA_MH_SIZE);
 		data = (unsigned char *)rx_desc->buf_cookie;
 
 		if (!mvneta_rxq_desc_is_first_last(rx_status) ||
-		    (rx_status & MVNETA_RXD_ERR_SUMMARY) ||
-		    !(skb = build_skb(data, pp->frag_size > PAGE_SIZE ? 0 : pp->frag_size))) {
+		    (rx_status & MVNETA_RXD_ERR_SUMMARY)) {
+		err_drop_frame:
 			dev->stats.rx_errors++;
 			mvneta_rx_error(pp, rx_desc);
 			/* leave the descriptor untouched */
 			continue;
 		}
 
-		dma_unmap_single(pp->dev->dev.parent, rx_desc->buf_phys_addr,
+		if (rx_bytes <= rx_copybreak) {
+			/* better copy a small frame and not unmap the DMA region */
+			skb = netdev_alloc_skb_ip_align(dev, rx_bytes);
+			if (unlikely(!skb))
+				goto err_drop_frame;
+
+			dma_sync_single_range_for_cpu(dev->dev.parent,
+			                              rx_desc->buf_phys_addr,
+			                              MVNETA_MH_SIZE + NET_SKB_PAD,
+			                              rx_bytes,
+			                              DMA_FROM_DEVICE);
+			memcpy(skb_put(skb, rx_bytes),
+			       data + MVNETA_MH_SIZE + NET_SKB_PAD,
+			       rx_bytes);
+
+			skb->protocol = eth_type_trans(skb, dev);
+			mvneta_rx_csum(pp, rx_status, skb);
+			napi_gro_receive(&pp->napi, skb);
+
+			rcvd_pkts++;
+			rcvd_bytes += rx_bytes;
+
+			/* leave the descriptor and buffer untouched */
+			continue;
+		}
+
+		skb = build_skb(data, pp->frag_size > PAGE_SIZE ? 0 : pp->frag_size);
+		if (!skb)
+			goto err_drop_frame;
+
+		dma_unmap_single(dev->dev.parent, rx_desc->buf_phys_addr,
 				 MVNETA_RX_BUF_SIZE(pp->pkt_size), DMA_FROM_DEVICE);
 
-		rx_bytes = rx_desc->data_size -
-			(ETH_FCS_LEN + MVNETA_MH_SIZE);
 		rcvd_pkts++;
 		rcvd_bytes += rx_bytes;
 
@@ -1495,7 +1526,7 @@ static int mvneta_rx(struct mvneta_port *pp, int rx_todo,
 		/* Refill processing */
 		err = mvneta_rx_refill(pp, rx_desc);
 		if (err) {
-			netdev_err(pp->dev, "Linux processing - Can't refill\n");
+			netdev_err(dev, "Linux processing - Can't refill\n");
 			rxq->missed++;
 			rx_filled--;
 		}
@@ -2945,3 +2976,4 @@ module_param(rxq_number, int, S_IRUGO);
 module_param(txq_number, int, S_IRUGO);
 
 module_param(rxq_def, int, S_IRUGO);
+module_param(rx_copybreak, int, S_IRUGO | S_IWUSR);
-- 
1.7.12.2.21.g234cd45.dirty

^ permalink raw reply related

* [PATCH 00/13] Assorted mvneta fixes and improvements
From: Willy Tarreau @ 2014-01-16  7:20 UTC (permalink / raw)
  To: davem
  Cc: netdev, Willy Tarreau, Thomas Petazzoni, Gregory CLEMENT,
	Arnaud Ebalard, Eric Dumazet, Ben Hutchings

Hi,

this series provides some fixes for a number of issues met with the
mvneta driver, then adds some improvements. Patches 1-5 are fixes
and would be needed in 3.13 and likely -stable. The next ones are
performance improvements and cleanups :

  - driver lockup when reading stats while sending traffic from multiple
    CPUs : this obviously only happens on SMP and is the result of missing
    locking on the driver. The problem was present since the introduction
    of the driver in 3.8. The first patch performs some changes that are
    needed for the second one which actually fixes the issue by using
    per-cpu counters. It could make sense to backport this to the relevant
    stable versions.

  - mvneta_tx_timeout calls various functions to reset the NIC, and these
    functions sleep, which is not allowed here, resulting in a panic.
    Better completely disable this Tx timeout handler for now since it is
    never called. The problem was encountered while developing some new
    features, it's uncertain whether it's possible to reproduce it with
    regular usage, so maybe a backport to stable is not needed.

  - replace the Tx timer with a real Tx IRQ. As first reported by Arnaud
    Ebalard and explained by Eric Dumazet, there is no way this driver
    can work correctly if it uses a driver to recycle the Tx descriptors.
    If too many packets are sent at once, the driver quickly ends up with
    no descriptors (which happens twice as easily in GSO) and has to wait
    10ms for recycling its descriptors and being able to send again. Eric
    has worked around this in the core GSO code. But still when routing
    traffic or sending UDP packets, the limitation is very visible. Using
    Tx IRQs allows Tx descriptors to be recycled when sent. The coalesce
    value is still configurable using ethtool. This fix turns the UDP
    send bitrate from 134 Mbps to 987 Mbps (ie: line rate). It's made of
    two patches, one to add the relevant bits from the original Marvell's
    driver, and another one to implement the change. I don't know if it
    should be backported to stable, as the bug only causes poor performance.

  - Patches 6..8 are essentially cleanups, code deduplication and minor
    optimizations for not re-fetching a value we already have (status).

  - patch 9 changes the prefetch of Rx descriptor from current one to
    next one. In benchmarks, it results in about 1% general performance
    increase on HTTP traffic, probably because prefetching the current
    descriptor does not leave enough time between the start of prefetch
    and its usage.

  - patch 10 implements support for build_skb() on Rx path. The driver
    now preallocates frags instead of skbs and builds an skb just before
    delivering it. This results in a 2% performance increase on HTTP
    traffic, and up to 5% on small packet Rx rate.

  - patch 11 implements rx_copybreak for small packets (256 bytes). It
    avoids a dma_map_single()/dma_unmap_single() and increases the Rx
    rate by 16.4%, from 486kpps to 573kpps. Further improvements up to
    711kpps are possible depending how the DMA is used.

  - patches 12 and 13 are extra cleanups made possible by some of the
    simplifications above.

Thanks,
Willy

---
Arnaud Ebalard (2):
  net: mvneta: mvneta_tx_done_gbe() cleanups
  net: mvneta: make mvneta_txq_done() return void

Willy Tarreau (11):
  net: mvneta: increase the 64-bit rx/tx stats out of the hot path
  net: mvneta: use per_cpu stats to fix an SMP lock up
  net: mvneta: do not schedule in mvneta_tx_timeout
  net: mvneta: add missing bit descriptions for interrupt masks and
    causes
  net: mvneta: replace Tx timer with a real interrupt
  net: mvneta: remove tests for impossible cases in the tx_done path
  net: mvneta: factor rx refilling code
  net: mvneta: simplify access to the rx descriptor status
  net: mvneta: prefetch next rx descriptor instead of current one
  net: mvneta: convert to build_skb()
  net: mvneta: implement rx_copybreak

 drivers/net/ethernet/marvell/mvneta.c | 387 +++++++++++++++++++---------------
 1 file changed, 215 insertions(+), 172 deletions(-)

--
1.7.12.2.21.g234cd45.dirty

^ permalink raw reply

* [PATCH 10/13] net: mvneta: convert to build_skb()
From: Willy Tarreau @ 2014-01-16  7:20 UTC (permalink / raw)
  To: davem; +Cc: netdev, Willy Tarreau, Thomas Petazzoni, Gregory CLEMENT
In-Reply-To: <1389856819-6503-1-git-send-email-w@1wt.eu>

Make use of build_skb() to allocate frags on the RX path. When frag size
is lower than a page size, we can use netdev_alloc_frag(), and we fall back
to kmalloc() for larger sizes. The frag size is stored into the mvneta_port
struct. The alloc/free functions check the frag size to decide what alloc/
free method to use. MTU changes are safe because the MTU change function
stops the device and clears the queues before applying the change.

With this patch, I observed a reproducible 2% performance improvement on
HTTP-based benchmarks, and 5% on small packet RX rate.

Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
Tested-by: Arnaud Ebalard <arno@natisbad.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
---
 drivers/net/ethernet/marvell/mvneta.c | 49 +++++++++++++++++++++++++----------
 1 file changed, 35 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index c7b37e0..726a8d2 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -268,6 +268,7 @@ struct mvneta_pcpu_stats {
 
 struct mvneta_port {
 	int pkt_size;
+	unsigned int frag_size;
 	void __iomem *base;
 	struct mvneta_rx_queue *rxqs;
 	struct mvneta_tx_queue *txqs;
@@ -1332,28 +1333,43 @@ static int mvneta_txq_done(struct mvneta_port *pp,
 	return tx_done;
 }
 
+static void *mvneta_frag_alloc(const struct mvneta_port *pp)
+{
+	if (likely(pp->frag_size <= PAGE_SIZE))
+		return netdev_alloc_frag(pp->frag_size);
+	else
+		return kmalloc(pp->frag_size, GFP_ATOMIC);
+}
+
+static void mvneta_frag_free(const struct mvneta_port *pp, void *data)
+{
+	if (likely(pp->frag_size <= PAGE_SIZE))
+		put_page(virt_to_head_page(data));
+	else
+		kfree(data);
+}
+
 /* Refill processing */
 static int mvneta_rx_refill(struct mvneta_port *pp,
 			    struct mvneta_rx_desc *rx_desc)
 
 {
 	dma_addr_t phys_addr;
-	struct sk_buff *skb;
+	void *data;
 
-	skb = netdev_alloc_skb(pp->dev, pp->pkt_size);
-	if (!skb)
+	data = mvneta_frag_alloc(pp);
+	if (!data)
 		return -ENOMEM;
 
-	phys_addr = dma_map_single(pp->dev->dev.parent, skb->head,
+	phys_addr = dma_map_single(pp->dev->dev.parent, data,
 				   MVNETA_RX_BUF_SIZE(pp->pkt_size),
 				   DMA_FROM_DEVICE);
 	if (unlikely(dma_mapping_error(pp->dev->dev.parent, phys_addr))) {
-		dev_kfree_skb(skb);
+		mvneta_frag_free(pp, data);
 		return -ENOMEM;
 	}
 
-	mvneta_rx_desc_fill(rx_desc, phys_addr, (u32)skb);
-
+	mvneta_rx_desc_fill(rx_desc, phys_addr, (u32)data);
 	return 0;
 }
 
@@ -1407,9 +1423,9 @@ static void mvneta_rxq_drop_pkts(struct mvneta_port *pp,
 	rx_done = mvneta_rxq_busy_desc_num_get(pp, rxq);
 	for (i = 0; i < rxq->size; i++) {
 		struct mvneta_rx_desc *rx_desc = rxq->descs + i;
-		struct sk_buff *skb = (struct sk_buff *)rx_desc->buf_cookie;
+		void *data = (void *)rx_desc->buf_cookie;
 
-		dev_kfree_skb_any(skb);
+		mvneta_frag_free(pp, data);
 		dma_unmap_single(pp->dev->dev.parent, rx_desc->buf_phys_addr,
 				 MVNETA_RX_BUF_SIZE(pp->pkt_size), DMA_FROM_DEVICE);
 	}
@@ -1440,20 +1456,21 @@ static int mvneta_rx(struct mvneta_port *pp, int rx_todo,
 	while (rx_done < rx_todo) {
 		struct mvneta_rx_desc *rx_desc = mvneta_rxq_next_desc_get(rxq);
 		struct sk_buff *skb;
+		unsigned char *data;
 		u32 rx_status;
 		int rx_bytes, err;
 
 		rx_done++;
 		rx_filled++;
 		rx_status = rx_desc->status;
-		skb = (struct sk_buff *)rx_desc->buf_cookie;
+		data = (unsigned char *)rx_desc->buf_cookie;
 
 		if (!mvneta_rxq_desc_is_first_last(rx_status) ||
-		    (rx_status & MVNETA_RXD_ERR_SUMMARY)) {
+		    (rx_status & MVNETA_RXD_ERR_SUMMARY) ||
+		    !(skb = build_skb(data, pp->frag_size > PAGE_SIZE ? 0 : pp->frag_size))) {
 			dev->stats.rx_errors++;
 			mvneta_rx_error(pp, rx_desc);
-			mvneta_rx_desc_fill(rx_desc, rx_desc->buf_phys_addr,
-					    (u32)skb);
+			/* leave the descriptor untouched */
 			continue;
 		}
 
@@ -1466,7 +1483,7 @@ static int mvneta_rx(struct mvneta_port *pp, int rx_todo,
 		rcvd_bytes += rx_bytes;
 
 		/* Linux processing */
-		skb_reserve(skb, MVNETA_MH_SIZE);
+		skb_reserve(skb, MVNETA_MH_SIZE + NET_SKB_PAD);
 		skb_put(skb, rx_bytes);
 
 		skb->protocol = eth_type_trans(skb, dev);
@@ -2276,6 +2293,8 @@ static int mvneta_change_mtu(struct net_device *dev, int mtu)
 	mvneta_cleanup_rxqs(pp);
 
 	pp->pkt_size = MVNETA_RX_PKT_SIZE(pp->dev->mtu);
+	pp->frag_size = SKB_DATA_ALIGN(MVNETA_RX_BUF_SIZE(pp->pkt_size)) +
+	                SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 
 	ret = mvneta_setup_rxqs(pp);
 	if (ret) {
@@ -2423,6 +2442,8 @@ static int mvneta_open(struct net_device *dev)
 	mvneta_mac_addr_set(pp, dev->dev_addr, rxq_def);
 
 	pp->pkt_size = MVNETA_RX_PKT_SIZE(pp->dev->mtu);
+	pp->frag_size = SKB_DATA_ALIGN(MVNETA_RX_BUF_SIZE(pp->pkt_size)) +
+	                SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 
 	ret = mvneta_setup_rxqs(pp);
 	if (ret)
-- 
1.7.12.2.21.g234cd45.dirty

^ permalink raw reply related

* [PATCH 08/13] net: mvneta: simplify access to the rx descriptor status
From: Willy Tarreau @ 2014-01-16  7:20 UTC (permalink / raw)
  To: davem; +Cc: netdev, Willy Tarreau, Thomas Petazzoni, Gregory CLEMENT
In-Reply-To: <1389856819-6503-1-git-send-email-w@1wt.eu>

At several places, we already know the value of the rx status but
we call functions which dereference the pointer again to get it
and don't need the descriptor for anything else. Simplify this
task by replacing the rx desc pointer by the status word itself.

Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
Tested-by: Arnaud Ebalard <arno@natisbad.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
---
 drivers/net/ethernet/marvell/mvneta.c | 25 ++++++++++++-------------
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index eccafd1..aa3a4f7 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -528,14 +528,14 @@ struct rtnl_link_stats64 *mvneta_get_stats64(struct net_device *dev,
 
 /* Rx descriptors helper methods */
 
-/* Checks whether the given RX descriptor is both the first and the
- * last descriptor for the RX packet. Each RX packet is currently
+/* Checks whether the RX descriptor having this status is both the first
+ * and the last descriptor for the RX packet. Each RX packet is currently
  * received through a single RX descriptor, so not having each RX
  * descriptor with its first and last bits set is an error
  */
-static int mvneta_rxq_desc_is_first_last(struct mvneta_rx_desc *desc)
+static int mvneta_rxq_desc_is_first_last(u32 status)
 {
-	return (desc->status & MVNETA_RXD_FIRST_LAST_DESC) ==
+	return (status & MVNETA_RXD_FIRST_LAST_DESC) ==
 		MVNETA_RXD_FIRST_LAST_DESC;
 }
 
@@ -1234,10 +1234,10 @@ static void mvneta_rx_error(struct mvneta_port *pp,
 {
 	u32 status = rx_desc->status;
 
-	if (!mvneta_rxq_desc_is_first_last(rx_desc)) {
+	if (!mvneta_rxq_desc_is_first_last(status)) {
 		netdev_err(pp->dev,
 			   "bad rx status %08x (buffer oversize), size=%d\n",
-			   rx_desc->status, rx_desc->data_size);
+			   status, rx_desc->data_size);
 		return;
 	}
 
@@ -1261,13 +1261,12 @@ static void mvneta_rx_error(struct mvneta_port *pp,
 	}
 }
 
-/* Handle RX checksum offload */
-static void mvneta_rx_csum(struct mvneta_port *pp,
-			   struct mvneta_rx_desc *rx_desc,
+/* Handle RX checksum offload based on the descriptor's status */
+static void mvneta_rx_csum(struct mvneta_port *pp, u32 status,
 			   struct sk_buff *skb)
 {
-	if ((rx_desc->status & MVNETA_RXD_L3_IP4) &&
-	    (rx_desc->status & MVNETA_RXD_L4_CSUM_OK)) {
+	if ((status & MVNETA_RXD_L3_IP4) &&
+	    (status & MVNETA_RXD_L4_CSUM_OK)) {
 		skb->csum = 0;
 		skb->ip_summed = CHECKSUM_UNNECESSARY;
 		return;
@@ -1449,7 +1448,7 @@ static int mvneta_rx(struct mvneta_port *pp, int rx_todo,
 		rx_status = rx_desc->status;
 		skb = (struct sk_buff *)rx_desc->buf_cookie;
 
-		if (!mvneta_rxq_desc_is_first_last(rx_desc) ||
+		if (!mvneta_rxq_desc_is_first_last(rx_status) ||
 		    (rx_status & MVNETA_RXD_ERR_SUMMARY)) {
 			dev->stats.rx_errors++;
 			mvneta_rx_error(pp, rx_desc);
@@ -1472,7 +1471,7 @@ static int mvneta_rx(struct mvneta_port *pp, int rx_todo,
 
 		skb->protocol = eth_type_trans(skb, dev);
 
-		mvneta_rx_csum(pp, rx_desc, skb);
+		mvneta_rx_csum(pp, rx_status, skb);
 
 		napi_gro_receive(&pp->napi, skb);
 
-- 
1.7.12.2.21.g234cd45.dirty

^ permalink raw reply related

* [PATCH 06/13] net: mvneta: remove tests for impossible cases in the tx_done path
From: Willy Tarreau @ 2014-01-16  7:20 UTC (permalink / raw)
  To: davem; +Cc: netdev, Willy Tarreau, Thomas Petazzoni, Gregory CLEMENT
In-Reply-To: <1389856819-6503-1-git-send-email-w@1wt.eu>

Currently, mvneta_txq_bufs_free() calls mvneta_tx_done_policy() with
a non-null cause to retrieve the pointer to the next queue to process.
There are useless tests on the return queue number and on the pointer,
all of which are well defined within a known limited set. This code
path is fast, although not critical. Removing 3 tests here that the
compiler could not optimize (verified) is always desirable.

Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
Tested-by: Arnaud Ebalard <arno@natisbad.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
---
 drivers/net/ethernet/marvell/mvneta.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index df75a23..2fb9559 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -1276,13 +1276,16 @@ static void mvneta_rx_csum(struct mvneta_port *pp,
 	skb->ip_summed = CHECKSUM_NONE;
 }
 
-/* Return tx queue pointer (find last set bit) according to causeTxDone reg */
+/* Return tx queue pointer (find last set bit) according to <cause> returned
+ * form tx_done reg. <cause> must not be null. The return value is always a
+ * valid queue for matching the first one found in <cause>.
+ */
 static struct mvneta_tx_queue *mvneta_tx_done_policy(struct mvneta_port *pp,
 						     u32 cause)
 {
 	int queue = fls(cause) - 1;
 
-	return (queue < 0 || queue >= txq_number) ? NULL : &pp->txqs[queue];
+	return &pp->txqs[queue];
 }
 
 /* Free tx queue skbuffs */
@@ -1651,7 +1654,9 @@ static void mvneta_txq_done_force(struct mvneta_port *pp,
 	txq->txq_get_index = 0;
 }
 
-/* handle tx done - called from tx done timer callback */
+/* Handle tx done - called in softirq context. The <cause_tx_done> argument
+ * must be a valid cause according to MVNETA_TXQ_INTR_MASK_ALL.
+ */
 static u32 mvneta_tx_done_gbe(struct mvneta_port *pp, u32 cause_tx_done,
 			      int *tx_todo)
 {
@@ -1660,10 +1665,8 @@ static u32 mvneta_tx_done_gbe(struct mvneta_port *pp, u32 cause_tx_done,
 	struct netdev_queue *nq;
 
 	*tx_todo = 0;
-	while (cause_tx_done != 0) {
+	while (cause_tx_done) {
 		txq = mvneta_tx_done_policy(pp, cause_tx_done);
-		if (!txq)
-			break;
 
 		nq = netdev_get_tx_queue(pp->dev, txq->id);
 		__netif_tx_lock(nq, smp_processor_id());
-- 
1.7.12.2.21.g234cd45.dirty

^ permalink raw reply related

* [PATCH 04/13] net: mvneta: add missing bit descriptions for interrupt masks and causes
From: Willy Tarreau @ 2014-01-16  7:20 UTC (permalink / raw)
  To: davem; +Cc: netdev, Willy Tarreau, Thomas Petazzoni, Gregory CLEMENT
In-Reply-To: <1389856819-6503-1-git-send-email-w@1wt.eu>

Marvell has not published the chip's datasheet yet, so it's very hard
to find the relevant bits to manipulate to change the IRQ behaviour.
Fortunately, these bits are described in the proprietary LSP patch set
which is publicly available here :

    http://www.plugcomputer.org/downloads/mirabox/

So let's put them back in the driver in order to reduce the burden of
current and future maintenance.

Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
Tested-by: Arnaud Ebalard <arno@natisbad.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
---
 drivers/net/ethernet/marvell/mvneta.c | 44 +++++++++++++++++++++++++++++++++--
 1 file changed, 42 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 84220c1..defda6f 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -101,16 +101,56 @@
 #define      MVNETA_CPU_RXQ_ACCESS_ALL_MASK      0x000000ff
 #define      MVNETA_CPU_TXQ_ACCESS_ALL_MASK      0x0000ff00
 #define MVNETA_RXQ_TIME_COAL_REG(q)              (0x2580 + ((q) << 2))
+
+/* Exception Interrupt Port/Queue Cause register */
+
 #define MVNETA_INTR_NEW_CAUSE                    0x25a0
-#define      MVNETA_RX_INTR_MASK(nr_rxqs)        (((1 << nr_rxqs) - 1) << 8)
 #define MVNETA_INTR_NEW_MASK                     0x25a4
+
+/* bits  0..7  = TXQ SENT, one bit per queue.
+ * bits  8..15 = RXQ OCCUP, one bit per queue.
+ * bits 16..23 = RXQ FREE, one bit per queue.
+ * bit  29 = OLD_REG_SUM, see old reg ?
+ * bit  30 = TX_ERR_SUM, one bit for 4 ports
+ * bit  31 = MISC_SUM,   one bit for 4 ports
+ */
+#define      MVNETA_TX_INTR_MASK(nr_txqs)        (((1 << nr_txqs) - 1) << 0)
+#define      MVNETA_TX_INTR_MASK_ALL             (0xff << 0)
+#define      MVNETA_RX_INTR_MASK(nr_rxqs)        (((1 << nr_rxqs) - 1) << 8)
+#define      MVNETA_RX_INTR_MASK_ALL             (0xff << 8)
+
 #define MVNETA_INTR_OLD_CAUSE                    0x25a8
 #define MVNETA_INTR_OLD_MASK                     0x25ac
+
+/* Data Path Port/Queue Cause Register */
 #define MVNETA_INTR_MISC_CAUSE                   0x25b0
 #define MVNETA_INTR_MISC_MASK                    0x25b4
+
+#define      MVNETA_CAUSE_PHY_STATUS_CHANGE      BIT(0)
+#define      MVNETA_CAUSE_LINK_CHANGE            BIT(1)
+#define      MVNETA_CAUSE_PTP                    BIT(4)
+
+#define      MVNETA_CAUSE_INTERNAL_ADDR_ERR      BIT(7)
+#define      MVNETA_CAUSE_RX_OVERRUN             BIT(8)
+#define      MVNETA_CAUSE_RX_CRC_ERROR           BIT(9)
+#define      MVNETA_CAUSE_RX_LARGE_PKT           BIT(10)
+#define      MVNETA_CAUSE_TX_UNDERUN             BIT(11)
+#define      MVNETA_CAUSE_PRBS_ERR               BIT(12)
+#define      MVNETA_CAUSE_PSC_SYNC_CHANGE        BIT(13)
+#define      MVNETA_CAUSE_SERDES_SYNC_ERR        BIT(14)
+
+#define      MVNETA_CAUSE_BMU_ALLOC_ERR_SHIFT    16
+#define      MVNETA_CAUSE_BMU_ALLOC_ERR_ALL_MASK   (0xF << MVNETA_CAUSE_BMU_ALLOC_ERR_SHIFT)
+#define      MVNETA_CAUSE_BMU_ALLOC_ERR_MASK(pool) (1 << (MVNETA_CAUSE_BMU_ALLOC_ERR_SHIFT + (pool)))
+
+#define      MVNETA_CAUSE_TXQ_ERROR_SHIFT        24
+#define      MVNETA_CAUSE_TXQ_ERROR_ALL_MASK     (0xFF << MVNETA_CAUSE_TXQ_ERROR_SHIFT)
+#define      MVNETA_CAUSE_TXQ_ERROR_MASK(q)      (1 << (MVNETA_CAUSE_TXQ_ERROR_SHIFT + (q)))
+
 #define MVNETA_INTR_ENABLE                       0x25b8
 #define      MVNETA_TXQ_INTR_ENABLE_ALL_MASK     0x0000ff00
-#define      MVNETA_RXQ_INTR_ENABLE_ALL_MASK     0xff000000
+#define      MVNETA_RXQ_INTR_ENABLE_ALL_MASK     0xff000000  // note: neta says it's 0x000000FF
+
 #define MVNETA_RXQ_CMD                           0x2680
 #define      MVNETA_RXQ_DISABLE_SHIFT            8
 #define      MVNETA_RXQ_ENABLE_MASK              0x000000ff
-- 
1.7.12.2.21.g234cd45.dirty

^ permalink raw reply related

* [PATCH 02/13] net: mvneta: use per_cpu stats to fix an SMP lock up
From: Willy Tarreau @ 2014-01-16  7:20 UTC (permalink / raw)
  To: davem; +Cc: netdev, Willy Tarreau, Thomas Petazzoni, Gregory CLEMENT
In-Reply-To: <1389856819-6503-1-git-send-email-w@1wt.eu>

Stats writers are mvneta_rx() and mvneta_tx(). They don't lock anything
when they update the stats, and as a result, it randomly happens that
the stats freeze on SMP if two updates happen during stats retrieval.
This is very easily reproducible by starting two HTTP servers and binding
each of them to a different CPU, then consulting /proc/net/dev in loops
during transfers, the interface should immediately lock up. This issue
also randomly happens upon link state changes during transfers, because
the stats are collected in this situation, but it takes more attempts to
reproduce it.

The comments in netdevice.h suggest using per_cpu stats instead to get
rid of this issue.

This patch implements this. It merges both rx_stats and tx_stats into
a single "stats" member with a single syncp. Both mvneta_rx() and
mvneta_rx() now only update the a single CPU's counters.

In turn, mvneta_get_stats64() does the summing by iterating over all CPUs
to get their respective stats.

With this change, stats are still correct and no more lockup is encountered.

Note that this bug was present since the first import of the mvneta
driver.  It might make sense to backport it to some stable trees. If
so, it depends on "d33dc73 net: mvneta: increase the 64-bit rx/tx stats
out of the hot path".

Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Tested-by: Arnaud Ebalard <arno@natisbad.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
---
 drivers/net/ethernet/marvell/mvneta.c | 84 +++++++++++++++++++++++------------
 1 file changed, 55 insertions(+), 29 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index baa85af..40d3e8b 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -221,10 +221,12 @@
 
 #define MVNETA_RX_BUF_SIZE(pkt_size)   ((pkt_size) + NET_SKB_PAD)
 
-struct mvneta_stats {
+struct mvneta_pcpu_stats {
 	struct	u64_stats_sync syncp;
-	u64	packets;
-	u64	bytes;
+	u64	rx_packets;
+	u64	rx_bytes;
+	u64	tx_packets;
+	u64	tx_bytes;
 };
 
 struct mvneta_port {
@@ -250,8 +252,7 @@ struct mvneta_port {
 	u8 mcast_count[256];
 	u16 tx_ring_size;
 	u16 rx_ring_size;
-	struct mvneta_stats tx_stats;
-	struct mvneta_stats rx_stats;
+	struct mvneta_pcpu_stats *stats;
 
 	struct mii_bus *mii_bus;
 	struct phy_device *phy_dev;
@@ -461,21 +462,29 @@ struct rtnl_link_stats64 *mvneta_get_stats64(struct net_device *dev,
 {
 	struct mvneta_port *pp = netdev_priv(dev);
 	unsigned int start;
+	int cpu;
 
-	memset(stats, 0, sizeof(struct rtnl_link_stats64));
-
-	do {
-		start = u64_stats_fetch_begin_bh(&pp->rx_stats.syncp);
-		stats->rx_packets = pp->rx_stats.packets;
-		stats->rx_bytes	= pp->rx_stats.bytes;
-	} while (u64_stats_fetch_retry_bh(&pp->rx_stats.syncp, start));
+	for_each_possible_cpu(cpu) {
+		struct mvneta_pcpu_stats *cpu_stats;
+		u64 rx_packets;
+		u64 rx_bytes;
+		u64 tx_packets;
+		u64 tx_bytes;
 
+		cpu_stats = per_cpu_ptr(pp->stats, cpu);
+		do {
+			start = u64_stats_fetch_begin_bh(&cpu_stats->syncp);
+			rx_packets = cpu_stats->rx_packets;
+			rx_bytes   = cpu_stats->rx_bytes;
+			tx_packets = cpu_stats->tx_packets;
+			tx_bytes   = cpu_stats->tx_bytes;
+		} while (u64_stats_fetch_retry_bh(&cpu_stats->syncp, start));
 
-	do {
-		start = u64_stats_fetch_begin_bh(&pp->tx_stats.syncp);
-		stats->tx_packets = pp->tx_stats.packets;
-		stats->tx_bytes	= pp->tx_stats.bytes;
-	} while (u64_stats_fetch_retry_bh(&pp->tx_stats.syncp, start));
+		stats->rx_packets += rx_packets;
+		stats->rx_bytes   += rx_bytes;
+		stats->tx_packets += tx_packets;
+		stats->tx_bytes   += tx_bytes;
+	}
 
 	stats->rx_errors	= dev->stats.rx_errors;
 	stats->rx_dropped	= dev->stats.rx_dropped;
@@ -1453,10 +1462,12 @@ static int mvneta_rx(struct mvneta_port *pp, int rx_todo,
 	}
 
 	if (rcvd_pkts) {
-		u64_stats_update_begin(&pp->rx_stats.syncp);
-		pp->rx_stats.packets += rcvd_pkts;
-		pp->rx_stats.bytes   += rcvd_bytes;
-		u64_stats_update_end(&pp->rx_stats.syncp);
+		struct mvneta_pcpu_stats *stats = this_cpu_ptr(pp->stats);
+
+		u64_stats_update_begin(&stats->syncp);
+		stats->rx_packets += rcvd_pkts;
+		stats->rx_bytes   += rcvd_bytes;
+		u64_stats_update_end(&stats->syncp);
 	}
 
 	/* Update rxq management counters */
@@ -1589,11 +1600,12 @@ static int mvneta_tx(struct sk_buff *skb, struct net_device *dev)
 
 out:
 	if (frags > 0) {
-		u64_stats_update_begin(&pp->tx_stats.syncp);
-		pp->tx_stats.packets++;
-		pp->tx_stats.bytes += skb->len;
-		u64_stats_update_end(&pp->tx_stats.syncp);
+		struct mvneta_pcpu_stats *stats = this_cpu_ptr(pp->stats);
 
+		u64_stats_update_begin(&stats->syncp);
+		stats->tx_packets++;
+		stats->tx_bytes  += skb->len;
+		u64_stats_update_end(&stats->syncp);
 	} else {
 		dev->stats.tx_dropped++;
 		dev_kfree_skb_any(skb);
@@ -2758,6 +2770,7 @@ static int mvneta_probe(struct platform_device *pdev)
 	const char *mac_from;
 	int phy_mode;
 	int err;
+	int cpu;
 
 	/* Our multiqueue support is not complete, so for now, only
 	 * allow the usage of the first RX queue
@@ -2799,9 +2812,6 @@ static int mvneta_probe(struct platform_device *pdev)
 
 	pp = netdev_priv(dev);
 
-	u64_stats_init(&pp->tx_stats.syncp);
-	u64_stats_init(&pp->rx_stats.syncp);
-
 	pp->weight = MVNETA_RX_POLL_WEIGHT;
 	pp->phy_node = phy_node;
 	pp->phy_interface = phy_mode;
@@ -2820,6 +2830,19 @@ static int mvneta_probe(struct platform_device *pdev)
 		goto err_clk;
 	}
 
+	/* Alloc per-cpu stats */
+	pp->stats = alloc_percpu(struct mvneta_pcpu_stats);
+	if (!pp->stats) {
+		err = -ENOMEM;
+		goto err_unmap;
+	}
+
+	for_each_possible_cpu(cpu) {
+		struct mvneta_pcpu_stats *stats;
+		stats = per_cpu_ptr(pp->stats, cpu);
+		u64_stats_init(&stats->syncp);
+	}
+
 	dt_mac_addr = of_get_mac_address(dn);
 	if (dt_mac_addr) {
 		mac_from = "device tree";
@@ -2849,7 +2872,7 @@ static int mvneta_probe(struct platform_device *pdev)
 	err = mvneta_init(pp, phy_addr);
 	if (err < 0) {
 		dev_err(&pdev->dev, "can't init eth hal\n");
-		goto err_unmap;
+		goto err_free_stats;
 	}
 	mvneta_port_power_up(pp, phy_mode);
 
@@ -2879,6 +2902,8 @@ static int mvneta_probe(struct platform_device *pdev)
 
 err_deinit:
 	mvneta_deinit(pp);
+err_free_stats:
+	free_percpu(pp->stats);
 err_unmap:
 	iounmap(pp->base);
 err_clk:
@@ -2899,6 +2924,7 @@ static int mvneta_remove(struct platform_device *pdev)
 	unregister_netdev(dev);
 	mvneta_deinit(pp);
 	clk_disable_unprepare(pp->clk);
+	free_percpu(pp->stats);
 	iounmap(pp->base);
 	irq_dispose_mapping(dev->irq);
 	free_netdev(dev);
-- 
1.7.12.2.21.g234cd45.dirty

^ permalink raw reply related

* [PATCH 05/13] net: mvneta: replace Tx timer with a real interrupt
From: Willy Tarreau @ 2014-01-16  7:20 UTC (permalink / raw)
  To: davem
  Cc: netdev, Willy Tarreau, Thomas Petazzoni, Gregory CLEMENT,
	Arnaud Ebalard, Eric Dumazet
In-Reply-To: <1389856819-6503-1-git-send-email-w@1wt.eu>

Right now the mvneta driver doesn't handle Tx IRQ, and relies on two
mechanisms to flush Tx descriptors : a flush at the end of mvneta_tx()
and a timer. If a burst of packets is emitted faster than the device
can send them, then the queue is stopped until next wake-up of the
timer 10ms later. This causes jerky output traffic with bursts and
pauses, making it difficult to reach line rate with very few streams.

A test on UDP traffic shows that it's not possible to go beyond 134
Mbps / 12 kpps of outgoing traffic with 1500-bytes IP packets. Routed
traffic tends to observe pauses as well if the traffic is bursty,
making it even burstier after the wake-up.

It seems that this feature was inherited from the original driver but
nothing there mentions any reason for not using the interrupt instead,
which the chip supports.

Thus, this patch enables Tx interrupts and removes the timer. It does
the two at once because it's not really possible to make the two
mechanisms coexist, so a split patch doesn't make sense.

First tests performed on a Mirabox (Armada 370) show that less CPU
seems to be used when sending traffic. One reason might be that we now
call the mvneta_tx_done_gbe() with a mask indicating which queues have
been done instead of looping over all of them.

The same UDP test above now happily reaches 987 Mbps / 87.7 kpps.
Single-stream TCP traffic can now more easily reach line rate. HTTP
transfers of 1 MB objects over a single connection went from 730 to
840 Mbps. It is even possible to go significantly higher (>900 Mbps)
by tweaking tcp_tso_win_divisor.

Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
Cc: Arnaud Ebalard <arno@natisbad.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Arnaud Ebalard <arno@natisbad.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
---
 drivers/net/ethernet/marvell/mvneta.c | 71 ++++++-----------------------------
 1 file changed, 12 insertions(+), 59 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index defda6f..df75a23 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -216,9 +216,6 @@
 #define MVNETA_RX_COAL_PKTS		32
 #define MVNETA_RX_COAL_USEC		100
 
-/* Timer */
-#define MVNETA_TX_DONE_TIMER_PERIOD	10
-
 /* Napi polling weight */
 #define MVNETA_RX_POLL_WEIGHT		64
 
@@ -274,16 +271,11 @@ struct mvneta_port {
 	void __iomem *base;
 	struct mvneta_rx_queue *rxqs;
 	struct mvneta_tx_queue *txqs;
-	struct timer_list tx_done_timer;
 	struct net_device *dev;
 
 	u32 cause_rx_tx;
 	struct napi_struct napi;
 
-	/* Flags */
-	unsigned long flags;
-#define MVNETA_F_TX_DONE_TIMER_BIT  0
-
 	/* Napi weight */
 	int weight;
 
@@ -1149,17 +1141,6 @@ static void mvneta_tx_done_pkts_coal_set(struct mvneta_port *pp,
 	txq->done_pkts_coal = value;
 }
 
-/* Trigger tx done timer in MVNETA_TX_DONE_TIMER_PERIOD msecs */
-static void mvneta_add_tx_done_timer(struct mvneta_port *pp)
-{
-	if (test_and_set_bit(MVNETA_F_TX_DONE_TIMER_BIT, &pp->flags) == 0) {
-		pp->tx_done_timer.expires = jiffies +
-			msecs_to_jiffies(MVNETA_TX_DONE_TIMER_PERIOD);
-		add_timer(&pp->tx_done_timer);
-	}
-}
-
-
 /* Handle rx descriptor fill by setting buf_cookie and buf_phys_addr */
 static void mvneta_rx_desc_fill(struct mvneta_rx_desc *rx_desc,
 				u32 phys_addr, u32 cookie)
@@ -1651,15 +1632,6 @@ out:
 		dev_kfree_skb_any(skb);
 	}
 
-	if (txq->count >= MVNETA_TXDONE_COAL_PKTS)
-		mvneta_txq_done(pp, txq);
-
-	/* If after calling mvneta_txq_done, count equals
-	 * frags, we need to set the timer
-	 */
-	if (txq->count == frags && frags > 0)
-		mvneta_add_tx_done_timer(pp);
-
 	return NETDEV_TX_OK;
 }
 
@@ -1935,14 +1907,22 @@ static int mvneta_poll(struct napi_struct *napi, int budget)
 
 	/* Read cause register */
 	cause_rx_tx = mvreg_read(pp, MVNETA_INTR_NEW_CAUSE) &
-		MVNETA_RX_INTR_MASK(rxq_number);
+		(MVNETA_RX_INTR_MASK(rxq_number) | MVNETA_TX_INTR_MASK(txq_number));
+
+	/* Release Tx descriptors */
+	if (cause_rx_tx & MVNETA_TX_INTR_MASK_ALL) {
+		int tx_todo = 0;
+
+		mvneta_tx_done_gbe(pp, (cause_rx_tx & MVNETA_TX_INTR_MASK_ALL), &tx_todo);
+		cause_rx_tx &= ~MVNETA_TX_INTR_MASK_ALL;
+	}
 
 	/* For the case where the last mvneta_poll did not process all
 	 * RX packets
 	 */
 	cause_rx_tx |= pp->cause_rx_tx;
 	if (rxq_number > 1) {
-		while ((cause_rx_tx != 0) && (budget > 0)) {
+		while ((cause_rx_tx & MVNETA_RX_INTR_MASK_ALL) && (budget > 0)) {
 			int count;
 			struct mvneta_rx_queue *rxq;
 			/* get rx queue number from cause_rx_tx */
@@ -1974,7 +1954,7 @@ static int mvneta_poll(struct napi_struct *napi, int budget)
 		napi_complete(napi);
 		local_irq_save(flags);
 		mvreg_write(pp, MVNETA_INTR_NEW_MASK,
-			    MVNETA_RX_INTR_MASK(rxq_number));
+			    MVNETA_RX_INTR_MASK(rxq_number) | MVNETA_TX_INTR_MASK(txq_number));
 		local_irq_restore(flags);
 	}
 
@@ -1982,26 +1962,6 @@ static int mvneta_poll(struct napi_struct *napi, int budget)
 	return rx_done;
 }
 
-/* tx done timer callback */
-static void mvneta_tx_done_timer_callback(unsigned long data)
-{
-	struct net_device *dev = (struct net_device *)data;
-	struct mvneta_port *pp = netdev_priv(dev);
-	int tx_done = 0, tx_todo = 0;
-
-	if (!netif_running(dev))
-		return ;
-
-	clear_bit(MVNETA_F_TX_DONE_TIMER_BIT, &pp->flags);
-
-	tx_done = mvneta_tx_done_gbe(pp,
-				     (((1 << txq_number) - 1) &
-				      MVNETA_CAUSE_TXQ_SENT_DESC_ALL_MASK),
-				     &tx_todo);
-	if (tx_todo > 0)
-		mvneta_add_tx_done_timer(pp);
-}
-
 /* Handle rxq fill: allocates rxq skbs; called when initializing a port */
 static int mvneta_rxq_fill(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
 			   int num)
@@ -2251,7 +2211,7 @@ static void mvneta_start_dev(struct mvneta_port *pp)
 
 	/* Unmask interrupts */
 	mvreg_write(pp, MVNETA_INTR_NEW_MASK,
-		    MVNETA_RX_INTR_MASK(rxq_number));
+		    MVNETA_RX_INTR_MASK(rxq_number) | MVNETA_TX_INTR_MASK(txq_number));
 
 	phy_start(pp->phy_dev);
 	netif_tx_start_all_queues(pp->dev);
@@ -2527,8 +2487,6 @@ static int mvneta_stop(struct net_device *dev)
 	free_irq(dev->irq, pp);
 	mvneta_cleanup_rxqs(pp);
 	mvneta_cleanup_txqs(pp);
-	del_timer(&pp->tx_done_timer);
-	clear_bit(MVNETA_F_TX_DONE_TIMER_BIT, &pp->flags);
 
 	return 0;
 }
@@ -2887,11 +2845,6 @@ static int mvneta_probe(struct platform_device *pdev)
 		}
 	}
 
-	pp->tx_done_timer.data = (unsigned long)dev;
-	pp->tx_done_timer.function = mvneta_tx_done_timer_callback;
-	init_timer(&pp->tx_done_timer);
-	clear_bit(MVNETA_F_TX_DONE_TIMER_BIT, &pp->flags);
-
 	pp->tx_ring_size = MVNETA_MAX_TXD;
 	pp->rx_ring_size = MVNETA_MAX_RXD;
 
-- 
1.7.12.2.21.g234cd45.dirty

^ permalink raw reply related

* [PATCH 13/13] net: mvneta: make mvneta_txq_done() return void
From: Willy Tarreau @ 2014-01-16  7:20 UTC (permalink / raw)
  To: davem; +Cc: netdev, Arnaud Ebalard
In-Reply-To: <1389856819-6503-1-git-send-email-w@1wt.eu>

From: Arnaud Ebalard <arno@natisbad.org>

The function return parameter is not used in mvneta_tx_done_gbe(),
where the function is called. This patch makes the function return
void.

Reviewed-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Arnaud Ebalard <arno@natisbad.org>
---
 drivers/net/ethernet/marvell/mvneta.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 8c51501..f418f4f 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -1314,15 +1314,16 @@ static void mvneta_txq_bufs_free(struct mvneta_port *pp,
 }
 
 /* Handle end of transmission */
-static int mvneta_txq_done(struct mvneta_port *pp,
+static void mvneta_txq_done(struct mvneta_port *pp,
 			   struct mvneta_tx_queue *txq)
 {
 	struct netdev_queue *nq = netdev_get_tx_queue(pp->dev, txq->id);
 	int tx_done;
 
 	tx_done = mvneta_txq_sent_desc_proc(pp, txq);
-	if (tx_done == 0)
-		return tx_done;
+	if (!tx_done)
+		return;
+
 	mvneta_txq_bufs_free(pp, txq, tx_done);
 
 	txq->count -= tx_done;
@@ -1331,8 +1332,6 @@ static int mvneta_txq_done(struct mvneta_port *pp,
 		if (txq->size - txq->count >= MAX_SKB_FRAGS + 1)
 			netif_tx_wake_queue(nq);
 	}
-
-	return tx_done;
 }
 
 static void *mvneta_frag_alloc(const struct mvneta_port *pp)
-- 
1.7.12.2.21.g234cd45.dirty

^ permalink raw reply related

* [PATCH 12/13] net: mvneta: mvneta_tx_done_gbe() cleanups
From: Willy Tarreau @ 2014-01-16  7:20 UTC (permalink / raw)
  To: davem; +Cc: netdev, Arnaud Ebalard
In-Reply-To: <1389856819-6503-1-git-send-email-w@1wt.eu>

From: Arnaud Ebalard <arno@natisbad.org>

mvneta_tx_done_gbe() return value and third parameter are no more
used. This patch changes the function prototype and removes a useless
variable where the function is called.

Reviewed-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Arnaud Ebalard <arno@natisbad.org>
---
 drivers/net/ethernet/marvell/mvneta.c | 17 ++++-------------
 1 file changed, 4 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index f5fc7a2..8c51501 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -1704,30 +1704,23 @@ static void mvneta_txq_done_force(struct mvneta_port *pp,
 /* Handle tx done - called in softirq context. The <cause_tx_done> argument
  * must be a valid cause according to MVNETA_TXQ_INTR_MASK_ALL.
  */
-static u32 mvneta_tx_done_gbe(struct mvneta_port *pp, u32 cause_tx_done,
-			      int *tx_todo)
+static void mvneta_tx_done_gbe(struct mvneta_port *pp, u32 cause_tx_done)
 {
 	struct mvneta_tx_queue *txq;
-	u32 tx_done = 0;
 	struct netdev_queue *nq;
 
-	*tx_todo = 0;
 	while (cause_tx_done) {
 		txq = mvneta_tx_done_policy(pp, cause_tx_done);
 
 		nq = netdev_get_tx_queue(pp->dev, txq->id);
 		__netif_tx_lock(nq, smp_processor_id());
 
-		if (txq->count) {
-			tx_done += mvneta_txq_done(pp, txq);
-			*tx_todo += txq->count;
-		}
+		if (txq->count)
+			mvneta_txq_done(pp, txq);
 
 		__netif_tx_unlock(nq);
 		cause_tx_done &= ~((1 << txq->id));
 	}
-
-	return tx_done;
 }
 
 /* Compute crc8 of the specified address, using a unique algorithm ,
@@ -1961,9 +1954,7 @@ static int mvneta_poll(struct napi_struct *napi, int budget)
 
 	/* Release Tx descriptors */
 	if (cause_rx_tx & MVNETA_TX_INTR_MASK_ALL) {
-		int tx_todo = 0;
-
-		mvneta_tx_done_gbe(pp, (cause_rx_tx & MVNETA_TX_INTR_MASK_ALL), &tx_todo);
+		mvneta_tx_done_gbe(pp, (cause_rx_tx & MVNETA_TX_INTR_MASK_ALL));
 		cause_rx_tx &= ~MVNETA_TX_INTR_MASK_ALL;
 	}
 
-- 
1.7.12.2.21.g234cd45.dirty

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox