Netdev List
 help / color / mirror / Atom feed
* Re: [Devel] Re: [PATCH v5 00/10] per-cgroup tcp memory pressure
From: James Bottomley @ 2011-11-15 18:27 UTC (permalink / raw)
  To: davem@davemloft.net, eric.dumazet@gmail.com
  Cc: Glauber Costa, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org, paul@paulmenage.org, lizf@cn.fujitsu.com,
	linux-mm@kvack.org, devel@openvz.org, kirill@shutemov.name,
	gthelen@google.com, kamezawa.hiroyu@jp.fujitsu.com
In-Reply-To: <4EBAC04F.1010901@parallels.com>

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 2029 bytes --]

On Wed, 2011-11-09 at 16:02 -0200, Glauber Costa wrote:
> On 11/07/2011 01:26 PM, Glauber Costa wrote:
> > Hi all,
> >
> > This is my new attempt at implementing per-cgroup tcp memory pressure.
> > I am particularly interested in what the network folks have to comment on
> > it: my main goal is to achieve the least impact possible in the network code.
> >
> > Here's a brief description of my approach:
> >
> > When only the root cgroup is present, the code should behave the same way as
> > before - with the exception of the inclusion of an extra field in struct sock,
> > and one in struct proto. All tests are patched out with static branch, and we
> > still access addresses directly - the same as we did before.
> >
> > When a cgroup other than root is created, we patch in the branches, and account
> > resources for that cgroup. The variables in the root cgroup are still updated.
> > If we were to try to be 100 % coherent with the memcg code, that should depend
> > on use_hierarchy. However, I feel that this is a good compromise in terms of
> > leaving the network code untouched, and still having a global vision of its
> > resources. I also do not compute max_usage for the root cgroup, for a similar
> > reason.
> >
> > Please let me know what you think of it.
> 
> Dave, Eric,
> 
> Can you let me know what you think of the general approach I've followed 
> in this series? The impact on the common case should be minimal, or at 
> least as expensive as a static branch (0 in most arches, I believe).
> 
> I am mostly interested in knowing if this a valid pursue path. I'll be 
> happy to address any specific concerns you have once you're ok with the 
> general approach.

Ping on this, please.  We're blocked on this patch set until we can get
an ack that the approach is acceptable to network people.

Thanks,

James

N‹§²æìr¸›zǧu©ž²Æ {\b­†éì¹»\x1c®&Þ–)îÆi¢žØ^n‡r¶‰šŽŠÝ¢j$½§$¢¸\x05¢¹¨­è§~Š'.)îÄÃ,yèm¶ŸÿÃ\f%Š{±šj+ƒñb‚^[nö¢®×¥yÊ&Š{^®w­r\x16«ë"œ&§iÖ¬Š	á¶Ú\x7fþËh¦Ø^™ë^­Æ¿\x0e‰ízf¢•¨ky

^ permalink raw reply

* Re: [patch net-next V8] net: introduce ethernet teaming device
From: Eric Dumazet @ 2011-11-15 18:35 UTC (permalink / raw)
  To: Rick Jones
  Cc: Andy Gospodarek, Jiri Pirko, netdev, davem, bhutchings,
	shemminger, fubar, tgraf, ebiederm, mirqus, kaber, greearb, jesse,
	fbl, benjamin.poirier, jzupka, ivecera
In-Reply-To: <4EC29FE7.70904@hp.com>

Le mardi 15 novembre 2011 à 09:22 -0800, Rick Jones a écrit :
> > On most modern systems I suspect there will be little to no difference
> > between bonding RX peformance and team performance.
> >
> > If there is any now, I suspect team and bond performance to be similar
> > by the time team has to account for the corner-cases bonding has already
> > resolved.  :-)
> >
> > Benchmarks may prove otherwise, but I've yet to see Jiri produce
> > anything.  My initial testing doesn't demonstrate any measureable
> > differences with 1Gbps interfaces on a multi-core, multi-socket system.
> 
> I wouldn't expect much difference in terms of bandwidth, I was thinking 
> the demonstration would be made in the area of service demand (CPU 
> consumed per unit work) and perhaps aggregate packets per second.

Well,

bonding is a NETIF_F_LLTX driver, but uses following rwlock in xmit
path :

read_lock(&bond->curr_slave_lock);
...
read_unlock(&bond->curr_slave_lock);

Two atomic operations on a contended cache line.

On a 16 cpu machine, here is some "perf stat" data of such workload :
(each thread doing 10.000.000 atomic_inc(&somesharedvar) )

# perf stat ./atomic 16

 Performance counter stats for './atomic 16':

      48016,104204 task-clock                #   15,566 CPUs utilized          
               555 context-switches          #    0,000 M/sec                  
                15 CPU-migrations            #    0,000 M/sec                  
               175 page-faults               #    0,000 M/sec                  
   121 669 943 013 cycles                    #    2,534 GHz                    
   121 321 455 748 stalled-cycles-frontend   #   99,71% frontend cycles idle   
   103 375 494 290 stalled-cycles-backend    #   84,96% backend  cycles idle   
       611 624 619 instructions              #    0,01  insns per cycle        
                                             #   198,36  stalled cycles per insn
       184 530 032 branches                  #    3,843 M/sec                  
           581 513 branch-misses             #    0,32% of all branches        

       3,084672937 seconds time elapsed

Cost per 'read_lock/read_unlock pair' : at least 616 ns

While on one cpu only :

# perf stat ./atomic 1

 Performance counter stats for './atomic 1':

         83,475050 task-clock                #    0,998 CPUs utilized          
                 3 context-switches          #    0,000 M/sec                  
                 1 CPU-migrations            #    0,000 M/sec                  
               144 page-faults               #    0,002 M/sec                  
       211 508 600 cycles                    #    2,534 GHz                    
       193 502 947 stalled-cycles-frontend   #   91,49% frontend cycles idle   
       124 428 400 stalled-cycles-backend    #   58,83% backend  cycles idle   
        30 870 434 instructions              #    0,15  insns per cycle        
                                             #    6,27  stalled cycles per insn
        10 163 364 branches                  #  121,753 M/sec                  
             9 633 branch-misses             #    0,09% of all branches        

       0,083679928 seconds time elapsed

Cost per 'read_lock/read_unlock pair' : 16 ns


Of course, bonding could be changed to use RCU as well,
if someone feels the need.

But teaming was designed to be RCU ready from the beginning.

exi

^ permalink raw reply

* Re: [PATCH V2] vlan:return error when real dev is enslaved
From: Nicolas de Pesloüan @ 2011-11-15 19:19 UTC (permalink / raw)
  To: Weiping Pan
  Cc: Patrick McHardy (maintainer:VLAN (802.1Q)),
	"David S. Miller" (maintainer:NETWORKING [GENERAL]),
	open list:VLAN (802.1Q), open list
In-Reply-To: <d7ea491a500c99c0b4839ddcedab027a3c865c59.1321360959.git.wpan@redhat.com>

Le 15/11/2011 13:44, Weiping Pan a écrit :
> Qinhuibin reported a kernel panic when he do some operation about vlan.
> https://lkml.org/lkml/2011/11/6/218
>
> The operation is as below:
> ifconfig eth2 up
> modprobe bonding
> modprobe 8021q
> ifconfig bond0 up
> ifenslave bond0 eth2
> vconfig add eth2 3300
> vconfig add bond0 33
> vconfig rem eth2.3300
>
> the panic stack is as below:
> [<ffffffffa002f1c9>] panic_event+0x49/0x70 [ipmi_msghandler]
> [<ffffffff80378917>] notifier_call_chain+0x37/0x70
> [<ffffffff80372122>] panic+0xa2/0x195
> [<ffffffff80376ed8>] oops_end+0xd8/0x140
> [<ffffffff8001bea7>] no_context+0xf7/0x280
> [<ffffffff8001c1a5>] __bad_area_nosemaphore+0x175/0x250
> [<ffffffff80376318>] page_fault+0x28/0x30
> [<ffffffffa039dabd>] igb_vlan_rx_kill_vid+0x4d/0x100 [igb]
> [<ffffffffa044045f>] bond_vlan_rx_kill_vid+0x9f/0x290 [bonding]
> [<ffffffffa047e636>] unregister_vlan_dev+0x136/0x180 [8021q]
> [<ffffffffa047ed20>] vlan_ioctl_handler+0x170/0x3f0 [8021q]
> [<ffffffff802c1d3f>] sock_ioctl+0x21f/0x280
> [<ffffffff800e6d7f>] vfs_ioctl+0x2f/0xb0
> [<ffffffff800e726b>] do_vfs_ioctl+0x3cb/0x5a0
> [<ffffffff800e74e1>] sys_ioctl+0xa1/0xb0
> [<ffffffff80007388>] system_call_fastpath+0x16/0x1b
> [<00007f108a2b8bd7>] 0x7f108a2b8bd7
> And the nic is as below:
> [root@localhost ~]# ethtool -i eth2
> driver: igb
> version: 3.0.6-k2
> firmware-version: 1.2-1
> bus-info: 0000:04:00.0
> kernel version:
> 2.6.32.12-0.7 also happen in 2.6.32-131
>
> For kernel 2.6.32, the reason of this bug is that when we do "vconfig add bond0 33",
> adapter->vlgrp is overwritten in igb_vlan_rx_register. So when we do "vconfig rem
> eth2.3300", it can't find the correct vlgrp.
>
> And this bug is avoided by vlan cleanup patchset from Jiri Pirko
> <jpirko@redhat.com>, especially commit b2cb09b1a772(igb: do vlan cleanup).
>
> But it is not a correct operation to creat a vlan interface on eth2
> when it have been enslaved by bond0, so this patch is to return error
> when the real dev is already enslaved.

Why isn't this setup correct?

Compare to bridge, where ebtables allow for some sort of sharing of the physical interface between 
bridge and vlan.

I think bonding should behave the same way instead of denying this setup.

	Nicolas.

^ permalink raw reply

* System Administrator.
From: Webmail Technical Upgrade Team @ 2011-11-15 13:06 UTC (permalink / raw)


mailbox has exceeded the storage limit which is 20GB as set by your
administrator,you are currently running on 20.9GB,you may not be able to
send or receive new mail until you re-validate your mailbox.Tore-validate
your mailbox please click this:
https://docs.google.com/spreadsheet/viewform?formkey=dFA0TTdCOWJQUnRRUFRCWmFoMVFoMFE6MQ

Warning!!! All Webmail. Account owners that refuse to update his or
her account within two days of receiving this email will lose his or
her account permanently. AGB © upc cablecom GmbH 2011. We apologize
for any inconvenience this may have cause you. Thank you for using
Webmail


System Administrator.
Customer Care Unit.

^ permalink raw reply

* Re: [PATCH] bonding: Don't allow mode change via sysfs with slaves present
From: Nicolas de Pesloüan @ 2011-11-15 19:24 UTC (permalink / raw)
  To: Andy Gospodarek; +Cc: Veaceslav Falico, netdev, Jay Vosburgh
In-Reply-To: <20111115170018.GB25132@gospo.rdu.redhat.com>

Le 15/11/2011 18:00, Andy Gospodarek a écrit :
> On Tue, Nov 15, 2011 at 05:44:42PM +0100, Veaceslav Falico wrote:
>> When changing mode via bonding's sysfs, the slaves are not initialized
>> correctly. Forbid to change modes with slaves present to ensure that every
>> slave is initialized correctly via bond_enslave().
>>
>> Signed-off-by: Veaceslav Falico<vfalico@redhat.com>
>
> Looks good.  This behavior forces someone who wants to change to mode to
> go through steps that are almost as destructive as when module options
> are used to configure the mode.  I do not see a problem with this.

Except the fact that is enforce one more constraint on the exact order one should write into sysfs 
to setup a bonding interface. We already have many such constraints and probably don't need more.

Currently, it is possible to enslave slaves before selecting the mode. The ifenslave-2.6 package 
from Debian currently enslave slaves before setting the mode and would break with this change.

NAK.

	Nicolas.

^ permalink raw reply

* [PATCH net-next v4 0/8] forcedeth: stats & debug enhancements
From: David Decotigny @ 2011-11-15 19:25 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: David S. Miller, Ian Campbell, Eric Dumazet, Jeff Kirsher,
	Ben Hutchings, Jiri Pirko, Joe Perches, Szymon Janc,
	Richard Jones, Ayaz Abdulla, David Decotigny

These changes implement the ndo_get_stats64 API and add a few more
stats and debugging features for forcedeth. They also ensure that
stats updates are correct in SMP systems, 32 or 64-bits.

Changes since v3:
  - updated get_stats64 + rx_dropped patches to use u64_stats_sync.h
  - dropped indentation "whitespace/indentation fixes" (included in
    get_stats64 api patch)

Changes since v2:
  - patch 1/9 is the cherry-pick of 898bdf2cb43e ("forcedeth: fix
    stats on hardware without extended stats support")
  - removed patch 5/10 "stats for rx_packets based on hardware
    registers" because packets&bytes stats are updated in software
    only (898bdf2cb43e)

Changes since v1:
  - patch 1/10 is the same as
    http://patchwork.ozlabs.org/patch/125017/ (targetting net)
  - other patches updated to take patch 1/10 into account
  - various commit message updates


Tested:
  ~150Mbps incoming TCP, ethtool -S in a loop, x86_64 16-way:
     tx_bytes: 5441989419
     rx_packets: 5439224
     tx_timeout: 0
     tx_packets: 5456705
     rx_bytes: 5566763850

Tested:
  pktgen + loopback report same RX/TX packets and bytes stats

Tested:
  tests above with Kconfig DEBUG_PAGEALLOC DEBUG_MUTEXES
  DEBUG_SPINLOCK LOCKUP_DETECTOR DEBUG_RT_MUTEXES DEBUG_LOCK_ALLOC
  PROVE_LOCKING DEBUG_ATOMIC_SLEEP DEBUG_STACK_USAGE DEBUG_KOBJECT
  DEBUG_VM DEBUG_LIST DEBUG_SG DEBUG_NOTIFIERS TEST_KSTRTOX
  STRICT_DEVMEM DEBUG_STACKOVERFLOW


############################################
# Patch Set Summary:

David Decotigny (5):
  forcedeth: expose module parameters in /sys/module
  forcedeth: implement ndo_get_stats64() API
  forcedeth: account for dropped RX frames
  forcedeth: new ethtool stat counter for TX timeouts
  forcedeth: stats updated with a deferrable timer

Mike Ditto (1):
  forcedeth: Add messages to indicate using MSI or MSI-X

Sameer Nanda (1):
  forcedeth: allow to silence "TX timeout" debug messages

david decotigny (1):
  forcedeth: fix stats on hardware without extended stats support

 drivers/net/ethernet/nvidia/forcedeth.c |  344 ++++++++++++++++++++++---------
 1 files changed, 246 insertions(+), 98 deletions(-)

-- 
1.7.3.1

^ permalink raw reply

* [PATCH net-next v4 1/8] forcedeth: fix stats on hardware without extended stats support
From: David Decotigny @ 2011-11-15 19:25 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: David S. Miller, Ian Campbell, Eric Dumazet, Jeff Kirsher,
	Ben Hutchings, Jiri Pirko, Joe Perches, Szymon Janc,
	Richard Jones, Ayaz Abdulla, david decotigny
In-Reply-To: <cover.1321384662.git.david.decotigny@google.com>

From: david decotigny <david.decotigny@google.com>

This change makes sure that tx_packets/rx_bytes ifconfig counters are
updated even on NICs that don't provide hardware support for these
stats: they are now updated in software. For the sake of consistency,
we also now have tx_bytes updated in software (hardware counters
include ethernet CRC, and software doesn't account for it).

This reverts parts of:
 - "forcedeth: statistics optimization" (21828163b2)
 - "forcedeth: Improve stats counters" (0bdfea8ba8)
 - "forcedeth: remove unneeded stats updates" (4687f3f364)

Tested:
  pktgen + loopback (http://patchwork.ozlabs.org/patch/124698/)
  reports identical tx_packets/rx_packets and tx_bytes/rx_bytes.

Signed-off-by: David Decotigny <david.decotigny@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 898bdf2cb43eb0a962c397eb4dd1aec2c7211be2)


---
 drivers/net/ethernet/nvidia/forcedeth.c |   36 +++++++++++++++++++++++-------
 1 files changed, 27 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/nvidia/forcedeth.c b/drivers/net/ethernet/nvidia/forcedeth.c
index e8a5ae3..374625b 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -609,7 +609,7 @@ struct nv_ethtool_str {
 };
 
 static const struct nv_ethtool_str nv_estats_str[] = {
-	{ "tx_bytes" },
+	{ "tx_bytes" }, /* includes Ethernet FCS CRC */
 	{ "tx_zero_rexmt" },
 	{ "tx_one_rexmt" },
 	{ "tx_many_rexmt" },
@@ -637,7 +637,7 @@ static const struct nv_ethtool_str nv_estats_str[] = {
 	/* version 2 stats */
 	{ "tx_deferral" },
 	{ "tx_packets" },
-	{ "rx_bytes" },
+	{ "rx_bytes" }, /* includes Ethernet FCS CRC */
 	{ "tx_pause" },
 	{ "rx_pause" },
 	{ "rx_drop_frame" },
@@ -649,7 +649,7 @@ static const struct nv_ethtool_str nv_estats_str[] = {
 };
 
 struct nv_ethtool_stats {
-	u64 tx_bytes;
+	u64 tx_bytes; /* should be ifconfig->tx_bytes + 4*tx_packets */
 	u64 tx_zero_rexmt;
 	u64 tx_one_rexmt;
 	u64 tx_many_rexmt;
@@ -670,14 +670,14 @@ struct nv_ethtool_stats {
 	u64 rx_unicast;
 	u64 rx_multicast;
 	u64 rx_broadcast;
-	u64 rx_packets;
+	u64 rx_packets; /* should be ifconfig->rx_packets */
 	u64 rx_errors_total;
 	u64 tx_errors_total;
 
 	/* version 2 stats */
 	u64 tx_deferral;
-	u64 tx_packets;
-	u64 rx_bytes;
+	u64 tx_packets; /* should be ifconfig->tx_packets */
+	u64 rx_bytes;   /* should be ifconfig->rx_bytes + 4*rx_packets */
 	u64 tx_pause;
 	u64 rx_pause;
 	u64 rx_drop_frame;
@@ -1706,10 +1706,17 @@ static struct net_device_stats *nv_get_stats(struct net_device *dev)
 	if (np->driver_data & (DEV_HAS_STATISTICS_V1|DEV_HAS_STATISTICS_V2|DEV_HAS_STATISTICS_V3)) {
 		nv_get_hw_stats(dev);
 
+		/*
+		 * Note: because HW stats are not always available and
+		 * for consistency reasons, the following ifconfig
+		 * stats are managed by software: rx_bytes, tx_bytes,
+		 * rx_packets and tx_packets. The related hardware
+		 * stats reported by ethtool should be equivalent to
+		 * these ifconfig stats, with 4 additional bytes per
+		 * packet (Ethernet FCS CRC).
+		 */
+
 		/* copy to net_device stats */
-		dev->stats.tx_packets = np->estats.tx_packets;
-		dev->stats.rx_bytes = np->estats.rx_bytes;
-		dev->stats.tx_bytes = np->estats.tx_bytes;
 		dev->stats.tx_fifo_errors = np->estats.tx_fifo_errors;
 		dev->stats.tx_carrier_errors = np->estats.tx_carrier_errors;
 		dev->stats.rx_crc_errors = np->estats.rx_crc_errors;
@@ -2380,6 +2387,9 @@ static int nv_tx_done(struct net_device *dev, int limit)
 				if (flags & NV_TX_ERROR) {
 					if ((flags & NV_TX_RETRYERROR) && !(flags & NV_TX_RETRYCOUNT_MASK))
 						nv_legacybackoff_reseed(dev);
+				} else {
+					dev->stats.tx_packets++;
+					dev->stats.tx_bytes += np->get_tx_ctx->skb->len;
 				}
 				dev_kfree_skb_any(np->get_tx_ctx->skb);
 				np->get_tx_ctx->skb = NULL;
@@ -2390,6 +2400,9 @@ static int nv_tx_done(struct net_device *dev, int limit)
 				if (flags & NV_TX2_ERROR) {
 					if ((flags & NV_TX2_RETRYERROR) && !(flags & NV_TX2_RETRYCOUNT_MASK))
 						nv_legacybackoff_reseed(dev);
+				} else {
+					dev->stats.tx_packets++;
+					dev->stats.tx_bytes += np->get_tx_ctx->skb->len;
 				}
 				dev_kfree_skb_any(np->get_tx_ctx->skb);
 				np->get_tx_ctx->skb = NULL;
@@ -2429,6 +2442,9 @@ static int nv_tx_done_optimized(struct net_device *dev, int limit)
 					else
 						nv_legacybackoff_reseed(dev);
 				}
+			} else {
+				dev->stats.tx_packets++;
+				dev->stats.tx_bytes += np->get_tx_ctx->skb->len;
 			}
 
 			dev_kfree_skb_any(np->get_tx_ctx->skb);
@@ -2678,6 +2694,7 @@ static int nv_rx_process(struct net_device *dev, int limit)
 		skb->protocol = eth_type_trans(skb, dev);
 		napi_gro_receive(&np->napi, skb);
 		dev->stats.rx_packets++;
+		dev->stats.rx_bytes += len;
 next_pkt:
 		if (unlikely(np->get_rx.orig++ == np->last_rx.orig))
 			np->get_rx.orig = np->first_rx.orig;
@@ -2761,6 +2778,7 @@ static int nv_rx_process_optimized(struct net_device *dev, int limit)
 			}
 			napi_gro_receive(&np->napi, skb);
 			dev->stats.rx_packets++;
+			dev->stats.rx_bytes += len;
 		} else {
 			dev_kfree_skb(skb);
 		}
-- 
1.7.3.1

^ permalink raw reply related

* [PATCH net-next v4 5/8] forcedeth: implement ndo_get_stats64() API
From: David Decotigny @ 2011-11-15 19:25 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: David S. Miller, Ian Campbell, Eric Dumazet, Jeff Kirsher,
	Ben Hutchings, Jiri Pirko, Joe Perches, Szymon Janc,
	Richard Jones, Ayaz Abdulla, David Decotigny
In-Reply-To: <cover.1321384662.git.david.decotigny@google.com>

This commit implements the ndo_get_stats64() API for forcedeth. Since
hardware stats are being updated from different contexts (process and
timer), this commit adds protection (locking + atomic variables). For
software stats, it relies on the u64_stats_sync.h API.

Tested:
  - 16-way SMP x86_64 ->
    RX bytes:7244556582 (7.2 GB)  TX bytes:181904254 (181.9 MB)
  - pktgen + loopback: identical rx_bytes/tx_bytes and rx_packets/tx_packets



Signed-off-by: David Decotigny <david.decotigny@google.com>
---
 drivers/net/ethernet/nvidia/forcedeth.c |  195 +++++++++++++++++++++++--------
 1 files changed, 144 insertions(+), 51 deletions(-)

diff --git a/drivers/net/ethernet/nvidia/forcedeth.c b/drivers/net/ethernet/nvidia/forcedeth.c
index ee8cce5..ff01d5e 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -65,7 +65,8 @@
 #include <linux/slab.h>
 #include <linux/uaccess.h>
 #include <linux/prefetch.h>
-#include  <linux/io.h>
+#include <linux/u64_stats_sync.h>
+#include <linux/io.h>
 
 #include <asm/irq.h>
 #include <asm/system.h>
@@ -736,6 +737,18 @@ struct nv_skb_map {
  * - tx setup is lockless: it relies on netif_tx_lock. Actual submission
  *	needs netdev_priv(dev)->lock :-(
  * - set_multicast_list: preparation lockless, relies on netif_tx_lock.
+ *
+ * Hardware stats updates are protected by hwstats_lock:
+ * - updated by nv_do_stats_poll (timer). This is meant to avoid
+ *   integer wraparound in the NIC stats registers, at low frequency
+ *   (0.1 Hz)
+ * - updated by nv_get_ethtool_stats + nv_get_stats64
+ *
+ * Software stats are accessed only through a 64b synchronization
+ * point and are not subject to other synchronization techniques (one
+ * unique updating thread for each stat [single queue RX/TX fast
+ * paths], or callers already synchronized [for tx_dropped, except from
+ * nv_open/nv_close]).
  */
 
 /* in dev: base, irq */
@@ -745,9 +758,13 @@ struct fe_priv {
 	struct net_device *dev;
 	struct napi_struct napi;
 
-	/* General data:
-	 * Locking: spin_lock(&np->lock); */
+	/* hardware stats are updated in syscall and timer */
+	spinlock_t hwstats_lock;
 	struct nv_ethtool_stats estats;
+
+	/* software stats are accessed through a 64b synchronization point */
+	struct u64_stats_sync swstats_syncp;
+
 	int in_shutdown;
 	u32 linkspeed;
 	int duplex;
@@ -798,6 +815,11 @@ struct fe_priv {
 	u32 nic_poll_irq;
 	int rx_ring_size;
 
+	/* RX software stats */
+	u64 stat_rx_packets;
+	u64 stat_rx_bytes; /* not always available in HW */
+	u64 stat_rx_missed_errors;
+
 	/* media detection workaround.
 	 * Locking: Within irq hander or disable_irq+spin_lock(&np->lock);
 	 */
@@ -820,6 +842,11 @@ struct fe_priv {
 	struct nv_skb_map *tx_end_flip;
 	int tx_stop;
 
+	/* TX software stats */
+	u64 stat_tx_packets; /* not always available in HW */
+	u64 stat_tx_bytes;
+	u64 stat_tx_dropped;
+
 	/* msi/msi-x fields */
 	u32 msi_flags;
 	struct msix_entry msi_x_entry[NV_MSI_X_MAX_VECTORS];
@@ -1635,11 +1662,19 @@ static void nv_mac_reset(struct net_device *dev)
 	pci_push(base);
 }
 
-static void nv_get_hw_stats(struct net_device *dev)
+/* Caller must appropriately lock netdev_priv(dev)->hwstats_lock */
+static void nv_update_stats(struct net_device *dev)
 {
 	struct fe_priv *np = netdev_priv(dev);
 	u8 __iomem *base = get_hwbase(dev);
 
+	/* If it happens that this is run in top-half context, then
+	 * replace the spin_lock of hwstats_lock with
+	 * spin_lock_irqsave() in calling functions. */
+	WARN_ONCE(in_irq(), "forcedeth: estats spin_lock(_bh) from top-half");
+	assert_spin_locked(&np->hwstats_lock);
+
+	/* query hardware */
 	np->estats.tx_bytes += readl(base + NvRegTxCnt);
 	np->estats.tx_zero_rexmt += readl(base + NvRegTxZeroReXmt);
 	np->estats.tx_one_rexmt += readl(base + NvRegTxOneReXmt);
@@ -1698,40 +1733,67 @@ static void nv_get_hw_stats(struct net_device *dev)
 }
 
 /*
- * nv_get_stats: dev->get_stats function
+ * nv_get_stats64: dev->ndo_get_stats64 function
  * Get latest stats value from the nic.
  * Called with read_lock(&dev_base_lock) held for read -
  * only synchronized against unregister_netdevice.
  */
-static struct net_device_stats *nv_get_stats(struct net_device *dev)
+static struct rtnl_link_stats64*
+nv_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *storage)
+	__acquires(&netdev_priv(dev)->hwstats_lock)
+	__releases(&netdev_priv(dev)->hwstats_lock)
 {
 	struct fe_priv *np = netdev_priv(dev);
+	unsigned int syncp_start;
+
+	/*
+	 * Note: because HW stats are not always available and for
+	 * consistency reasons, the following ifconfig stats are
+	 * managed by software: rx_bytes, tx_bytes, rx_packets and
+	 * tx_packets. The related hardware stats reported by ethtool
+	 * should be equivalent to these ifconfig stats, with 4
+	 * additional bytes per packet (Ethernet FCS CRC).
+	 */
+
+	/* software stats */
+	do {
+		syncp_start = u64_stats_fetch_begin(&np->swstats_syncp);
+		storage->rx_packets       = np->stat_rx_packets;
+		storage->tx_packets       = np->stat_tx_packets;
+		storage->rx_bytes         = np->stat_rx_bytes;
+		storage->tx_bytes         = np->stat_tx_bytes;
+		storage->tx_dropped       = np->stat_tx_dropped;
+		storage->rx_missed_errors = np->stat_rx_missed_errors;
+	} while (u64_stats_fetch_retry(&np->swstats_syncp, syncp_start));
 
 	/* If the nic supports hw counters then retrieve latest values */
-	if (np->driver_data & (DEV_HAS_STATISTICS_V1|DEV_HAS_STATISTICS_V2|DEV_HAS_STATISTICS_V3)) {
-		nv_get_hw_stats(dev);
+	if (np->driver_data & DEV_HAS_STATISTICS_V123) {
+		spin_lock_bh(&np->hwstats_lock);
 
-		/*
-		 * Note: because HW stats are not always available and
-		 * for consistency reasons, the following ifconfig
-		 * stats are managed by software: rx_bytes, tx_bytes,
-		 * rx_packets and tx_packets. The related hardware
-		 * stats reported by ethtool should be equivalent to
-		 * these ifconfig stats, with 4 additional bytes per
-		 * packet (Ethernet FCS CRC).
-		 */
+		nv_update_stats(dev);
+
+		/* generic stats */
+		storage->rx_errors = np->estats.rx_errors_total;
+		storage->tx_errors = np->estats.tx_errors_total;
+
+		/* meaningful only when NIC supports stats v3 */
+		storage->multicast = np->estats.rx_multicast;
+
+		/* detailed rx_errors */
+		storage->rx_length_errors = np->estats.rx_length_error;
+		storage->rx_over_errors   = np->estats.rx_over_errors;
+		storage->rx_crc_errors    = np->estats.rx_crc_errors;
+		storage->rx_frame_errors  = np->estats.rx_frame_align_error;
+		storage->rx_fifo_errors   = np->estats.rx_drop_frame;
 
-		/* copy to net_device stats */
-		dev->stats.tx_fifo_errors = np->estats.tx_fifo_errors;
-		dev->stats.tx_carrier_errors = np->estats.tx_carrier_errors;
-		dev->stats.rx_crc_errors = np->estats.rx_crc_errors;
-		dev->stats.rx_over_errors = np->estats.rx_over_errors;
-		dev->stats.rx_fifo_errors = np->estats.rx_drop_frame;
-		dev->stats.rx_errors = np->estats.rx_errors_total;
-		dev->stats.tx_errors = np->estats.tx_errors_total;
+		/* detailed tx_errors */
+		storage->tx_carrier_errors = np->estats.tx_carrier_errors;
+		storage->tx_fifo_errors    = np->estats.tx_fifo_errors;
+
+		spin_unlock_bh(&np->hwstats_lock);
 	}
 
-	return &dev->stats;
+	return storage;
 }
 
 /*
@@ -1932,8 +1994,11 @@ static void nv_drain_tx(struct net_device *dev)
 			np->tx_ring.ex[i].bufhigh = 0;
 			np->tx_ring.ex[i].buflow = 0;
 		}
-		if (nv_release_txskb(np, &np->tx_skb[i]))
-			dev->stats.tx_dropped++;
+		if (nv_release_txskb(np, &np->tx_skb[i])) {
+			u64_stats_update_begin(&np->swstats_syncp);
+			np->stat_tx_dropped++;
+			u64_stats_update_end(&np->swstats_syncp);
+		}
 		np->tx_skb[i].dma = 0;
 		np->tx_skb[i].dma_len = 0;
 		np->tx_skb[i].dma_single = 0;
@@ -2390,11 +2455,14 @@ static int nv_tx_done(struct net_device *dev, int limit)
 		if (np->desc_ver == DESC_VER_1) {
 			if (flags & NV_TX_LASTPACKET) {
 				if (flags & NV_TX_ERROR) {
-					if ((flags & NV_TX_RETRYERROR) && !(flags & NV_TX_RETRYCOUNT_MASK))
+					if ((flags & NV_TX_RETRYERROR)
+					    && !(flags & NV_TX_RETRYCOUNT_MASK))
 						nv_legacybackoff_reseed(dev);
 				} else {
-					dev->stats.tx_packets++;
-					dev->stats.tx_bytes += np->get_tx_ctx->skb->len;
+					u64_stats_update_begin(&np->swstats_syncp);
+					np->stat_tx_packets++;
+					np->stat_tx_bytes += np->get_tx_ctx->skb->len;
+					u64_stats_update_end(&np->swstats_syncp);
 				}
 				dev_kfree_skb_any(np->get_tx_ctx->skb);
 				np->get_tx_ctx->skb = NULL;
@@ -2403,11 +2471,14 @@ static int nv_tx_done(struct net_device *dev, int limit)
 		} else {
 			if (flags & NV_TX2_LASTPACKET) {
 				if (flags & NV_TX2_ERROR) {
-					if ((flags & NV_TX2_RETRYERROR) && !(flags & NV_TX2_RETRYCOUNT_MASK))
+					if ((flags & NV_TX2_RETRYERROR)
+					    && !(flags & NV_TX2_RETRYCOUNT_MASK))
 						nv_legacybackoff_reseed(dev);
 				} else {
-					dev->stats.tx_packets++;
-					dev->stats.tx_bytes += np->get_tx_ctx->skb->len;
+					u64_stats_update_begin(&np->swstats_syncp);
+					np->stat_tx_packets++;
+					np->stat_tx_bytes += np->get_tx_ctx->skb->len;
+					u64_stats_update_end(&np->swstats_syncp);
 				}
 				dev_kfree_skb_any(np->get_tx_ctx->skb);
 				np->get_tx_ctx->skb = NULL;
@@ -2441,15 +2512,18 @@ static int nv_tx_done_optimized(struct net_device *dev, int limit)
 
 		if (flags & NV_TX2_LASTPACKET) {
 			if (flags & NV_TX2_ERROR) {
-				if ((flags & NV_TX2_RETRYERROR) && !(flags & NV_TX2_RETRYCOUNT_MASK)) {
+				if ((flags & NV_TX2_RETRYERROR)
+				    && !(flags & NV_TX2_RETRYCOUNT_MASK)) {
 					if (np->driver_data & DEV_HAS_GEAR_MODE)
 						nv_gear_backoff_reseed(dev);
 					else
 						nv_legacybackoff_reseed(dev);
 				}
 			} else {
-				dev->stats.tx_packets++;
-				dev->stats.tx_bytes += np->get_tx_ctx->skb->len;
+					u64_stats_update_begin(&np->swstats_syncp);
+					np->stat_tx_packets++;
+					np->stat_tx_bytes += np->get_tx_ctx->skb->len;
+					u64_stats_update_end(&np->swstats_syncp);
 			}
 
 			dev_kfree_skb_any(np->get_tx_ctx->skb);
@@ -2662,8 +2736,11 @@ static int nv_rx_process(struct net_device *dev, int limit)
 					}
 					/* the rest are hard errors */
 					else {
-						if (flags & NV_RX_MISSEDFRAME)
-							dev->stats.rx_missed_errors++;
+						if (flags & NV_RX_MISSEDFRAME) {
+							u64_stats_update_begin(&np->swstats_syncp);
+							np->stat_rx_missed_errors++;
+							u64_stats_update_end(&np->swstats_syncp);
+						}
 						dev_kfree_skb(skb);
 						goto next_pkt;
 					}
@@ -2706,8 +2783,10 @@ static int nv_rx_process(struct net_device *dev, int limit)
 		skb_put(skb, len);
 		skb->protocol = eth_type_trans(skb, dev);
 		napi_gro_receive(&np->napi, skb);
-		dev->stats.rx_packets++;
-		dev->stats.rx_bytes += len;
+		u64_stats_update_begin(&np->swstats_syncp);
+		np->stat_rx_packets++;
+		np->stat_rx_bytes += len;
+		u64_stats_update_end(&np->swstats_syncp);
 next_pkt:
 		if (unlikely(np->get_rx.orig++ == np->last_rx.orig))
 			np->get_rx.orig = np->first_rx.orig;
@@ -2790,8 +2869,10 @@ static int nv_rx_process_optimized(struct net_device *dev, int limit)
 				__vlan_hwaccel_put_tag(skb, vid);
 			}
 			napi_gro_receive(&np->napi, skb);
-			dev->stats.rx_packets++;
-			dev->stats.rx_bytes += len;
+			u64_stats_update_begin(&np->swstats_syncp);
+			np->stat_rx_packets++;
+			np->stat_rx_bytes += len;
+			u64_stats_update_end(&np->swstats_syncp);
 		} else {
 			dev_kfree_skb(skb);
 		}
@@ -4000,11 +4081,18 @@ static void nv_poll_controller(struct net_device *dev)
 #endif
 
 static void nv_do_stats_poll(unsigned long data)
+	__acquires(&netdev_priv(dev)->hwstats_lock)
+	__releases(&netdev_priv(dev)->hwstats_lock)
 {
 	struct net_device *dev = (struct net_device *) data;
 	struct fe_priv *np = netdev_priv(dev);
 
-	nv_get_hw_stats(dev);
+	/* If lock is currently taken, the stats are being refreshed
+	 * and hence fresh enough */
+	if (spin_trylock(&np->hwstats_lock)) {
+		nv_update_stats(dev);
+		spin_unlock(&np->hwstats_lock);
+	}
 
 	if (!np->in_shutdown)
 		mod_timer(&np->stats_poll,
@@ -4711,14 +4799,18 @@ static int nv_get_sset_count(struct net_device *dev, int sset)
 	}
 }
 
-static void nv_get_ethtool_stats(struct net_device *dev, struct ethtool_stats *estats, u64 *buffer)
+static void nv_get_ethtool_stats(struct net_device *dev,
+				 struct ethtool_stats *estats, u64 *buffer)
+	__acquires(&netdev_priv(dev)->hwstats_lock)
+	__releases(&netdev_priv(dev)->hwstats_lock)
 {
 	struct fe_priv *np = netdev_priv(dev);
 
-	/* update stats */
-	nv_get_hw_stats(dev);
-
-	memcpy(buffer, &np->estats, nv_get_sset_count(dev, ETH_SS_STATS)*sizeof(u64));
+	spin_lock_bh(&np->hwstats_lock);
+	nv_update_stats(dev);
+	memcpy(buffer, &np->estats,
+	       nv_get_sset_count(dev, ETH_SS_STATS)*sizeof(u64));
+	spin_unlock_bh(&np->hwstats_lock);
 }
 
 static int nv_link_test(struct net_device *dev)
@@ -5362,7 +5454,7 @@ static int nv_close(struct net_device *dev)
 static const struct net_device_ops nv_netdev_ops = {
 	.ndo_open		= nv_open,
 	.ndo_stop		= nv_close,
-	.ndo_get_stats		= nv_get_stats,
+	.ndo_get_stats64	= nv_get_stats64,
 	.ndo_start_xmit		= nv_start_xmit,
 	.ndo_tx_timeout		= nv_tx_timeout,
 	.ndo_change_mtu		= nv_change_mtu,
@@ -5379,7 +5471,7 @@ static const struct net_device_ops nv_netdev_ops = {
 static const struct net_device_ops nv_netdev_ops_optimized = {
 	.ndo_open		= nv_open,
 	.ndo_stop		= nv_close,
-	.ndo_get_stats		= nv_get_stats,
+	.ndo_get_stats64	= nv_get_stats64,
 	.ndo_start_xmit		= nv_start_xmit_optimized,
 	.ndo_tx_timeout		= nv_tx_timeout,
 	.ndo_change_mtu		= nv_change_mtu,
@@ -5418,6 +5510,7 @@ static int __devinit nv_probe(struct pci_dev *pci_dev, const struct pci_device_i
 	np->dev = dev;
 	np->pci_dev = pci_dev;
 	spin_lock_init(&np->lock);
+	spin_lock_init(&np->hwstats_lock);
 	SET_NETDEV_DEV(dev, &pci_dev->dev);
 
 	init_timer(&np->oom_kick);
-- 
1.7.3.1

^ permalink raw reply related

* [PATCH net-next v4 7/8] forcedeth: new ethtool stat counter for TX timeouts
From: David Decotigny @ 2011-11-15 19:25 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: David S. Miller, Ian Campbell, Eric Dumazet, Jeff Kirsher,
	Ben Hutchings, Jiri Pirko, Joe Perches, Szymon Janc,
	Richard Jones, Ayaz Abdulla, David Decotigny
In-Reply-To: <cover.1321384662.git.david.decotigny@google.com>

This change publishes a new ethtool stats: tx_timeout that counts the
number of times the tx_timeout callback was triggered.



Signed-off-by: David Decotigny <david.decotigny@google.com>
---
 drivers/net/ethernet/nvidia/forcedeth.c |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/nvidia/forcedeth.c b/drivers/net/ethernet/nvidia/forcedeth.c
index a50c839..7284f40 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -634,6 +634,7 @@ static const struct nv_ethtool_str nv_estats_str[] = {
 	{ "rx_packets" },
 	{ "rx_errors_total" },
 	{ "tx_errors_total" },
+	{ "tx_timeout" },
 
 	/* version 2 stats */
 	{ "tx_deferral" },
@@ -674,6 +675,7 @@ struct nv_ethtool_stats {
 	u64 rx_packets; /* should be ifconfig->rx_packets */
 	u64 rx_errors_total;
 	u64 tx_errors_total;
+	u64 tx_timeout;
 
 	/* version 2 stats */
 	u64 tx_deferral;
@@ -848,6 +850,9 @@ struct fe_priv {
 	u64 stat_tx_bytes;
 	u64 stat_tx_dropped;
 
+	/* TX software stats exported by ethtool */
+	atomic_t stat_tx_timeout;  /* TX timeouts since last nv_update_stats */
+
 	/* msi/msi-x fields */
 	u32 msi_flags;
 	struct msix_entry msi_x_entry[NV_MSI_X_MAX_VECTORS];
@@ -1731,6 +1736,8 @@ static void nv_update_stats(struct net_device *dev)
 		np->estats.tx_multicast += readl(base + NvRegTxMulticast);
 		np->estats.tx_broadcast += readl(base + NvRegTxBroadcast);
 	}
+
+	np->estats.tx_timeout += atomic_xchg(&np->stat_tx_timeout, 0);
 }
 
 /*
@@ -2627,6 +2634,8 @@ static void nv_tx_timeout(struct net_device *dev)
 		}
 	}
 
+	atomic_inc(&np->stat_tx_timeout);
+
 	spin_lock_irq(&np->lock);
 
 	/* 1) stop tx engine */
-- 
1.7.3.1

^ permalink raw reply related

* [PATCH net-next v4 2/8] forcedeth: Add messages to indicate using MSI or MSI-X
From: David Decotigny @ 2011-11-15 19:25 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: David S. Miller, Ian Campbell, Eric Dumazet, Jeff Kirsher,
	Ben Hutchings, Jiri Pirko, Joe Perches, Szymon Janc,
	Richard Jones, Ayaz Abdulla, Mike Ditto, David Decotigny
In-Reply-To: <cover.1321384662.git.david.decotigny@google.com>

From: Mike Ditto <mditto@google.com>

This adds a few debug messages to indicate whether PCIe interrupts are
signaled with MSI or MSI-X.



Signed-off-by: David Decotigny <david.decotigny@google.com>
---
 drivers/net/ethernet/nvidia/forcedeth.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/nvidia/forcedeth.c b/drivers/net/ethernet/nvidia/forcedeth.c
index 374625b..fe17e42 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -3810,6 +3810,7 @@ static int nv_request_irq(struct net_device *dev, int intr_test)
 				writel(0, base + NvRegMSIXMap0);
 				writel(0, base + NvRegMSIXMap1);
 			}
+			netdev_info(dev, "MSI-X enabled\n");
 		}
 	}
 	if (ret != 0 && np->msi_flags & NV_MSI_CAPABLE) {
@@ -3831,6 +3832,7 @@ static int nv_request_irq(struct net_device *dev, int intr_test)
 			writel(0, base + NvRegMSIMap1);
 			/* enable msi vector 0 */
 			writel(NVREG_MSI_VECTOR_0_ENABLED, base + NvRegMSIIrqMask);
+			netdev_info(dev, "MSI enabled\n");
 		}
 	}
 	if (ret != 0) {
-- 
1.7.3.1

^ permalink raw reply related

* [PATCH net-next v4 3/8] forcedeth: allow to silence "TX timeout" debug messages
From: David Decotigny @ 2011-11-15 19:25 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: David S. Miller, Ian Campbell, Eric Dumazet, Jeff Kirsher,
	Ben Hutchings, Jiri Pirko, Joe Perches, Szymon Janc,
	Richard Jones, Ayaz Abdulla, Sameer Nanda, David Decotigny
In-Reply-To: <cover.1321384662.git.david.decotigny@google.com>

From: Sameer Nanda <snanda@google.com>

This adds a new module parameter "debug_tx_timeout" to silence most
debug messages in case of TX timeout. These messages don't provide a
signal/noise ratio high enough for production systems and, with ~30kB
logged each time, they tend to add to a cascade effect if the system
is already under stress (memory pressure, disk, etc.).

By default, the parameter is clear, meaning that only a single warning
will be reported.



Signed-off-by: David Decotigny <david.decotigny@google.com>
---
 drivers/net/ethernet/nvidia/forcedeth.c |   98 ++++++++++++++++++-------------
 1 files changed, 57 insertions(+), 41 deletions(-)

diff --git a/drivers/net/ethernet/nvidia/forcedeth.c b/drivers/net/ethernet/nvidia/forcedeth.c
index fe17e42..9b917ff 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -892,6 +892,11 @@ enum {
 static int dma_64bit = NV_DMA_64BIT_ENABLED;
 
 /*
+ * Debug output control for tx_timeout
+ */
+static bool debug_tx_timeout = false;
+
+/*
  * Crossover Detection
  * Realtek 8201 phy + some OEM boards do not work properly.
  */
@@ -2477,56 +2482,64 @@ static void nv_tx_timeout(struct net_device *dev)
 	u32 status;
 	union ring_type put_tx;
 	int saved_tx_limit;
-	int i;
 
 	if (np->msi_flags & NV_MSI_X_ENABLED)
 		status = readl(base + NvRegMSIXIrqStatus) & NVREG_IRQSTAT_MASK;
 	else
 		status = readl(base + NvRegIrqStatus) & NVREG_IRQSTAT_MASK;
 
-	netdev_info(dev, "Got tx_timeout. irq: %08x\n", status);
+	netdev_warn(dev, "Got tx_timeout. irq status: %08x\n", status);
 
-	netdev_info(dev, "Ring at %lx\n", (unsigned long)np->ring_addr);
-	netdev_info(dev, "Dumping tx registers\n");
-	for (i = 0; i <= np->register_size; i += 32) {
-		netdev_info(dev,
-			    "%3x: %08x %08x %08x %08x %08x %08x %08x %08x\n",
-			    i,
-			    readl(base + i + 0), readl(base + i + 4),
-			    readl(base + i + 8), readl(base + i + 12),
-			    readl(base + i + 16), readl(base + i + 20),
-			    readl(base + i + 24), readl(base + i + 28));
-	}
-	netdev_info(dev, "Dumping tx ring\n");
-	for (i = 0; i < np->tx_ring_size; i += 4) {
-		if (!nv_optimized(np)) {
-			netdev_info(dev,
-				    "%03x: %08x %08x // %08x %08x // %08x %08x // %08x %08x\n",
-				    i,
-				    le32_to_cpu(np->tx_ring.orig[i].buf),
-				    le32_to_cpu(np->tx_ring.orig[i].flaglen),
-				    le32_to_cpu(np->tx_ring.orig[i+1].buf),
-				    le32_to_cpu(np->tx_ring.orig[i+1].flaglen),
-				    le32_to_cpu(np->tx_ring.orig[i+2].buf),
-				    le32_to_cpu(np->tx_ring.orig[i+2].flaglen),
-				    le32_to_cpu(np->tx_ring.orig[i+3].buf),
-				    le32_to_cpu(np->tx_ring.orig[i+3].flaglen));
-		} else {
+	if (unlikely(debug_tx_timeout)) {
+		int i;
+
+		netdev_info(dev, "Ring at %lx\n", (unsigned long)np->ring_addr);
+		netdev_info(dev, "Dumping tx registers\n");
+		for (i = 0; i <= np->register_size; i += 32) {
 			netdev_info(dev,
-				    "%03x: %08x %08x %08x // %08x %08x %08x // %08x %08x %08x // %08x %08x %08x\n",
+				    "%3x: %08x %08x %08x %08x "
+				    "%08x %08x %08x %08x\n",
 				    i,
-				    le32_to_cpu(np->tx_ring.ex[i].bufhigh),
-				    le32_to_cpu(np->tx_ring.ex[i].buflow),
-				    le32_to_cpu(np->tx_ring.ex[i].flaglen),
-				    le32_to_cpu(np->tx_ring.ex[i+1].bufhigh),
-				    le32_to_cpu(np->tx_ring.ex[i+1].buflow),
-				    le32_to_cpu(np->tx_ring.ex[i+1].flaglen),
-				    le32_to_cpu(np->tx_ring.ex[i+2].bufhigh),
-				    le32_to_cpu(np->tx_ring.ex[i+2].buflow),
-				    le32_to_cpu(np->tx_ring.ex[i+2].flaglen),
-				    le32_to_cpu(np->tx_ring.ex[i+3].bufhigh),
-				    le32_to_cpu(np->tx_ring.ex[i+3].buflow),
-				    le32_to_cpu(np->tx_ring.ex[i+3].flaglen));
+				    readl(base + i + 0), readl(base + i + 4),
+				    readl(base + i + 8), readl(base + i + 12),
+				    readl(base + i + 16), readl(base + i + 20),
+				    readl(base + i + 24), readl(base + i + 28));
+		}
+		netdev_info(dev, "Dumping tx ring\n");
+		for (i = 0; i < np->tx_ring_size; i += 4) {
+			if (!nv_optimized(np)) {
+				netdev_info(dev,
+					    "%03x: %08x %08x // %08x %08x "
+					    "// %08x %08x // %08x %08x\n",
+					    i,
+					    le32_to_cpu(np->tx_ring.orig[i].buf),
+					    le32_to_cpu(np->tx_ring.orig[i].flaglen),
+					    le32_to_cpu(np->tx_ring.orig[i+1].buf),
+					    le32_to_cpu(np->tx_ring.orig[i+1].flaglen),
+					    le32_to_cpu(np->tx_ring.orig[i+2].buf),
+					    le32_to_cpu(np->tx_ring.orig[i+2].flaglen),
+					    le32_to_cpu(np->tx_ring.orig[i+3].buf),
+					    le32_to_cpu(np->tx_ring.orig[i+3].flaglen));
+			} else {
+				netdev_info(dev,
+					    "%03x: %08x %08x %08x "
+					    "// %08x %08x %08x "
+					    "// %08x %08x %08x "
+					    "// %08x %08x %08x\n",
+					    i,
+					    le32_to_cpu(np->tx_ring.ex[i].bufhigh),
+					    le32_to_cpu(np->tx_ring.ex[i].buflow),
+					    le32_to_cpu(np->tx_ring.ex[i].flaglen),
+					    le32_to_cpu(np->tx_ring.ex[i+1].bufhigh),
+					    le32_to_cpu(np->tx_ring.ex[i+1].buflow),
+					    le32_to_cpu(np->tx_ring.ex[i+1].flaglen),
+					    le32_to_cpu(np->tx_ring.ex[i+2].bufhigh),
+					    le32_to_cpu(np->tx_ring.ex[i+2].buflow),
+					    le32_to_cpu(np->tx_ring.ex[i+2].flaglen),
+					    le32_to_cpu(np->tx_ring.ex[i+3].bufhigh),
+					    le32_to_cpu(np->tx_ring.ex[i+3].buflow),
+					    le32_to_cpu(np->tx_ring.ex[i+3].flaglen));
+			}
 		}
 	}
 
@@ -6156,6 +6169,9 @@ module_param(phy_cross, int, 0);
 MODULE_PARM_DESC(phy_cross, "Phy crossover detection for Realtek 8201 phy is enabled by setting to 1 and disabled by setting to 0.");
 module_param(phy_power_down, int, 0);
 MODULE_PARM_DESC(phy_power_down, "Power down phy and disable link when interface is down (1), or leave phy powered up (0).");
+module_param(debug_tx_timeout, bool, 0);
+MODULE_PARM_DESC(debug_tx_timeout,
+		 "Dump tx related registers and ring when tx_timeout happens");
 
 MODULE_AUTHOR("Manfred Spraul <manfred@colorfullife.com>");
 MODULE_DESCRIPTION("Reverse Engineered nForce ethernet driver");
-- 
1.7.3.1

^ permalink raw reply related

* [PATCH net-next v4 4/8] forcedeth: expose module parameters in /sys/module
From: David Decotigny @ 2011-11-15 19:25 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: David S. Miller, Ian Campbell, Eric Dumazet, Jeff Kirsher,
	Ben Hutchings, Jiri Pirko, Joe Perches, Szymon Janc,
	Richard Jones, Ayaz Abdulla, David Decotigny
In-Reply-To: <cover.1321384662.git.david.decotigny@google.com>

In particular, debug_tx_timeout can be updated at runtime.



Signed-off-by: David Decotigny <david.decotigny@google.com>
---
 drivers/net/ethernet/nvidia/forcedeth.c |   18 +++++++++---------
 1 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/nvidia/forcedeth.c b/drivers/net/ethernet/nvidia/forcedeth.c
index 9b917ff..ee8cce5 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -6153,23 +6153,23 @@ static void __exit exit_nic(void)
 	pci_unregister_driver(&driver);
 }
 
-module_param(max_interrupt_work, int, 0);
+module_param(max_interrupt_work, int, S_IRUGO);
 MODULE_PARM_DESC(max_interrupt_work, "forcedeth maximum events handled per interrupt");
-module_param(optimization_mode, int, 0);
+module_param(optimization_mode, int, S_IRUGO);
 MODULE_PARM_DESC(optimization_mode, "In throughput mode (0), every tx & rx packet will generate an interrupt. In CPU mode (1), interrupts are controlled by a timer. In dynamic mode (2), the mode toggles between throughput and CPU mode based on network load.");
-module_param(poll_interval, int, 0);
+module_param(poll_interval, int, S_IRUGO);
 MODULE_PARM_DESC(poll_interval, "Interval determines how frequent timer interrupt is generated by [(time_in_micro_secs * 100) / (2^10)]. Min is 0 and Max is 65535.");
-module_param(msi, int, 0);
+module_param(msi, int, S_IRUGO);
 MODULE_PARM_DESC(msi, "MSI interrupts are enabled by setting to 1 and disabled by setting to 0.");
-module_param(msix, int, 0);
+module_param(msix, int, S_IRUGO);
 MODULE_PARM_DESC(msix, "MSIX interrupts are enabled by setting to 1 and disabled by setting to 0.");
-module_param(dma_64bit, int, 0);
+module_param(dma_64bit, int, S_IRUGO);
 MODULE_PARM_DESC(dma_64bit, "High DMA is enabled by setting to 1 and disabled by setting to 0.");
-module_param(phy_cross, int, 0);
+module_param(phy_cross, int, S_IRUGO);
 MODULE_PARM_DESC(phy_cross, "Phy crossover detection for Realtek 8201 phy is enabled by setting to 1 and disabled by setting to 0.");
-module_param(phy_power_down, int, 0);
+module_param(phy_power_down, int, S_IRUGO);
 MODULE_PARM_DESC(phy_power_down, "Power down phy and disable link when interface is down (1), or leave phy powered up (0).");
-module_param(debug_tx_timeout, bool, 0);
+module_param(debug_tx_timeout, bool, S_IRUGO|S_IWUSR);
 MODULE_PARM_DESC(debug_tx_timeout,
 		 "Dump tx related registers and ring when tx_timeout happens");
 
-- 
1.7.3.1

^ permalink raw reply related

* [PATCH net-next v4 6/8] forcedeth: account for dropped RX frames
From: David Decotigny @ 2011-11-15 19:25 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: David S. Miller, Ian Campbell, Eric Dumazet, Jeff Kirsher,
	Ben Hutchings, Jiri Pirko, Joe Perches, Szymon Janc,
	Richard Jones, Ayaz Abdulla, David Decotigny
In-Reply-To: <cover.1321384662.git.david.decotigny@google.com>

This adds the stats counter for dropped RX frames.



Signed-off-by: David Decotigny <david.decotigny@google.com>
---
 drivers/net/ethernet/nvidia/forcedeth.c |   14 ++++++++++++--
 1 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/nvidia/forcedeth.c b/drivers/net/ethernet/nvidia/forcedeth.c
index ff01d5e..a50c839 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -819,6 +819,7 @@ struct fe_priv {
 	u64 stat_rx_packets;
 	u64 stat_rx_bytes; /* not always available in HW */
 	u64 stat_rx_missed_errors;
+	u64 stat_rx_dropped;
 
 	/* media detection workaround.
 	 * Locking: Within irq hander or disable_irq+spin_lock(&np->lock);
@@ -1762,6 +1763,7 @@ nv_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *storage)
 		storage->tx_packets       = np->stat_tx_packets;
 		storage->rx_bytes         = np->stat_rx_bytes;
 		storage->tx_bytes         = np->stat_tx_bytes;
+		storage->rx_dropped       = np->stat_rx_dropped;
 		storage->tx_dropped       = np->stat_tx_dropped;
 		storage->rx_missed_errors = np->stat_rx_missed_errors;
 	} while (u64_stats_fetch_retry(&np->swstats_syncp, syncp_start));
@@ -1826,8 +1828,12 @@ static int nv_alloc_rx(struct net_device *dev)
 				np->put_rx.orig = np->first_rx.orig;
 			if (unlikely(np->put_rx_ctx++ == np->last_rx_ctx))
 				np->put_rx_ctx = np->first_rx_ctx;
-		} else
+		} else {
+			u64_stats_update_begin(&np->swstats_syncp);
+			np->stat_rx_dropped++;
+			u64_stats_update_end(&np->swstats_syncp);
 			return 1;
+		}
 	}
 	return 0;
 }
@@ -1858,8 +1864,12 @@ static int nv_alloc_rx_optimized(struct net_device *dev)
 				np->put_rx.ex = np->first_rx.ex;
 			if (unlikely(np->put_rx_ctx++ == np->last_rx_ctx))
 				np->put_rx_ctx = np->first_rx_ctx;
-		} else
+		} else {
+			u64_stats_update_begin(&np->swstats_syncp);
+			np->stat_rx_dropped++;
+			u64_stats_update_end(&np->swstats_syncp);
 			return 1;
+		}
 	}
 	return 0;
 }
-- 
1.7.3.1

^ permalink raw reply related

* [PATCH net-next v4 8/8] forcedeth: stats updated with a deferrable timer
From: David Decotigny @ 2011-11-15 19:25 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: David S. Miller, Ian Campbell, Eric Dumazet, Jeff Kirsher,
	Ben Hutchings, Jiri Pirko, Joe Perches, Szymon Janc,
	Richard Jones, Ayaz Abdulla, David Decotigny
In-Reply-To: <cover.1321384662.git.david.decotigny@google.com>

Mark stats timer as deferrable: punctuality in waking the stats timer
callback doesn't matter much, as it is responsible only to avoid
integer wraparound.

We need at least 1 other timer to fire within 17s (fully loaded 1Gbps)
to avoid wrap-arounds. Desired period is still 10s.



Signed-off-by: David Decotigny <david.decotigny@google.com>
---
 drivers/net/ethernet/nvidia/forcedeth.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/nvidia/forcedeth.c b/drivers/net/ethernet/nvidia/forcedeth.c
index 7284f40..b5c85de 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -5538,7 +5538,7 @@ static int __devinit nv_probe(struct pci_dev *pci_dev, const struct pci_device_i
 	init_timer(&np->nic_poll);
 	np->nic_poll.data = (unsigned long) dev;
 	np->nic_poll.function = nv_do_nic_poll;	/* timer handler */
-	init_timer(&np->stats_poll);
+	init_timer_deferrable(&np->stats_poll);
 	np->stats_poll.data = (unsigned long) dev;
 	np->stats_poll.function = nv_do_stats_poll;	/* timer handler */
 
-- 
1.7.3.1

^ permalink raw reply related

* Re: [PATCH net-next v3 5/9] forcedeth: implement ndo_get_stats64() API
From: David Decotigny @ 2011-11-15 19:30 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: netdev, linux-kernel, David S. Miller, Ian Campbell, Eric Dumazet,
	Jeff Kirsher, Ben Hutchings, Jiri Pirko, Joe Perches, Szymon Janc,
	Richard Jones, Ayaz Abdulla
In-Reply-To: <20111114171039.34a04369@nehalam.linuxnetplumber.net>

Hi Stephen,

On Mon, Nov 14, 2011 at 5:10 PM, Stephen Hemminger
<shemminger@vyatta.com> wrote:
> Please existing u64_stats_sync rather than inventing your own
> method. The u64_stats_sync is faster and does require locking.

Thanks for your review.
This should be fixed in the v4 series I just sent:
http://patchwork.ozlabs.org/patch/125861/

Regards,

--
David Decotigny



On Mon, Nov 14, 2011 at 5:10 PM, Stephen Hemminger
<shemminger@vyatta.com> wrote:
> On Mon, 14 Nov 2011 16:11:15 -0800
> David Decotigny <david.decotigny@google.com> wrote:
>
>> This commit implements the ndo_get_stats64() API for forcedeth. Since
>> these stats are being updated from different contexts (process and
>> timer), this commit adds protection (locking + atomic variables).
>>
>> Tested:
>>   - 16-way SMP x86_64 ->
>>     RX bytes:7244556582 (7.2 GB)  TX bytes:181904254 (181.9 MB)
>>   - pktgen + loopback: identical rx_bytes/tx_bytes and rx_packets/tx_packets
>>
>>
>>
>> Signed-off-by: David Decotigny <david.decotigny@google.com>
>> ---
>>  drivers/net/ethernet/nvidia/forcedeth.c |  182 ++++++++++++++++++++++++-------
>>  1 files changed, 141 insertions(+), 41 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/nvidia/forcedeth.c b/drivers/net/ethernet/nvidia/forcedeth.c
>> index ee8cce5..a67c1f4 100644
>> --- a/drivers/net/ethernet/nvidia/forcedeth.c
>> +++ b/drivers/net/ethernet/nvidia/forcedeth.c
>> @@ -692,6 +692,21 @@ struct nv_ethtool_stats {
>>  #define NV_DEV_STATISTICS_V2_COUNT (NV_DEV_STATISTICS_V3_COUNT - 3)
>>  #define NV_DEV_STATISTICS_V1_COUNT (NV_DEV_STATISTICS_V2_COUNT - 6)
>>
>> +/* driver statistics */
>> +struct nv_driver_stat {
>> +     atomic_t delta;  /* increase since last nv_update_stats() */
>> +     u64 total;  /* cumulative, requires netdev_priv(dev)->stats_lock */
>> +};
>
> Please existing u64_stats_sync rather than inventing your own
> method. The u64_stats_sync is faster and does require locking.
>

^ permalink raw reply

* Re: [PATCH net-next v4 2/8] forcedeth: Add messages to indicate using MSI or MSI-X
From: Joe Perches @ 2011-11-15 19:32 UTC (permalink / raw)
  To: David Decotigny
  Cc: netdev, linux-kernel, David S. Miller, Ian Campbell, Eric Dumazet,
	Jeff Kirsher, Ben Hutchings, Jiri Pirko, Szymon Janc,
	Richard Jones, Ayaz Abdulla, Mike Ditto
In-Reply-To: <97a5a50baaca7b272dfb7a1a06e88df4888f12b9.1321384662.git.david.decotigny@google.com>

On Tue, 2011-11-15 at 11:25 -0800, David Decotigny wrote:
> From: Mike Ditto <mditto@google.com>
> 
> This adds a few debug messages to indicate whether PCIe interrupts are
> signaled with MSI or MSI-X.

Perhaps the changelog would be better without "debug".
These added messages aren't really debug messages.
They are emitted as netdev_info vs netdev_dbg.

> diff --git a/drivers/net/ethernet/nvidia/forcedeth.c b/drivers/net/ethernet/nvidia/forcedeth.c
[]
> @@ -3810,6 +3810,7 @@ static int nv_request_irq(struct net_device *dev, int intr_test)
> +			netdev_info(dev, "MSI-X enabled\n");
> @@ -3831,6 +3832,7 @@ static int nv_request_irq(struct net_device *dev, int intr_test)
> +			netdev_info(dev, "MSI enabled\n");

^ permalink raw reply

* Re: [PATCH] bonding: Don't allow mode change via sysfs with slaves present
From: Ben Hutchings @ 2011-11-15 19:33 UTC (permalink / raw)
  To: Nicolas de Pesloüan
  Cc: Andy Gospodarek, Veaceslav Falico, netdev, Jay Vosburgh
In-Reply-To: <4EC2BC6D.9000304@gmail.com>

On Tue, 2011-11-15 at 20:24 +0100, Nicolas de Pesloüan wrote:
> Le 15/11/2011 18:00, Andy Gospodarek a écrit :
> > On Tue, Nov 15, 2011 at 05:44:42PM +0100, Veaceslav Falico wrote:
> >> When changing mode via bonding's sysfs, the slaves are not initialized
> >> correctly. Forbid to change modes with slaves present to ensure that every
> >> slave is initialized correctly via bond_enslave().
> >>
> >> Signed-off-by: Veaceslav Falico<vfalico@redhat.com>
> >
> > Looks good.  This behavior forces someone who wants to change to mode to
> > go through steps that are almost as destructive as when module options
> > are used to configure the mode.  I do not see a problem with this.
> 
> Except the fact that is enforce one more constraint on the exact order one should write into sysfs 
> to setup a bonding interface. We already have many such constraints and probably don't need more.

From the administrator perspective, perhaps.  From the developer
perspective, the current flexibility of bonding makes it very difficult
to test and maintain.

> Currently, it is possible to enslave slaves before selecting the mode. The ifenslave-2.6 package 
> from Debian currently enslave slaves before setting the mode and would break with this change.
> 
> NAK.

It sounds like this feature has to be kept and fixed, then.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH] bonding: Don't allow mode change via sysfs with slaves present
From: Andy Gospodarek @ 2011-11-15 19:35 UTC (permalink / raw)
  To: Nicolas de Pesloüan
  Cc: Andy Gospodarek, Veaceslav Falico, netdev, Jay Vosburgh
In-Reply-To: <4EC2BC6D.9000304@gmail.com>

On Tue, Nov 15, 2011 at 08:24:29PM +0100, Nicolas de Pesloüan wrote:
> Le 15/11/2011 18:00, Andy Gospodarek a écrit :
>> On Tue, Nov 15, 2011 at 05:44:42PM +0100, Veaceslav Falico wrote:
>>> When changing mode via bonding's sysfs, the slaves are not initialized
>>> correctly. Forbid to change modes with slaves present to ensure that every
>>> slave is initialized correctly via bond_enslave().
>>>
>>> Signed-off-by: Veaceslav Falico<vfalico@redhat.com>
>>
>> Looks good.  This behavior forces someone who wants to change to mode to
>> go through steps that are almost as destructive as when module options
>> are used to configure the mode.  I do not see a problem with this.
>
> Except the fact that is enforce one more constraint on the exact order 
> one should write into sysfs to setup a bonding interface. We already have 
> many such constraints and probably don't need more.
>
> Currently, it is possible to enslave slaves before selecting the mode. 
> The ifenslave-2.6 package from Debian currently enslave slaves before 
> setting the mode and would break with this change.
>

Our testing indicates that 802.3ad mode bonding will not work unless the
devices are enslaved after the mode is set.  Does this mean that no one
using Debian is using 802.2ad mode or are they just not reporting it?

^ permalink raw reply

* [PATCH 1/1] net/cadence: enable by default NET_ATMEL
From: Jean-Christophe PLAGNIOL-VILLARD @ 2011-11-15 19:36 UTC (permalink / raw)
  To: linux-kernel; +Cc: netdev, Jean-Christophe PLAGNIOL-VILLARD, Nicolas Ferre

so the defconfig of the atmel continue to have the support of the network
as before

Signed-off-by: Jean-Christophe PLAGNIOL-VILLARD <plagnioj@jcrosoft.com>
Cc: Nicolas Ferre <nicolas.ferre@atmel.com>
---
Hi David,

	can we have this for the 3.2 so the atmel continue to work as before

Best Regards,
J.
 drivers/net/ethernet/cadence/Kconfig |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/cadence/Kconfig b/drivers/net/ethernet/cadence/Kconfig
index 98849a1..b48378a 100644
--- a/drivers/net/ethernet/cadence/Kconfig
+++ b/drivers/net/ethernet/cadence/Kconfig
@@ -7,6 +7,7 @@ config HAVE_NET_MACB
 
 config NET_ATMEL
 	bool "Atmel devices"
+	default y
 	depends on HAVE_NET_MACB || (ARM && ARCH_AT91RM9200)
 	---help---
 	  If you have a network (Ethernet) card belonging to this class, say Y.
-- 
1.7.7

^ permalink raw reply related

* [PATCH net-next v4 2/8] forcedeth: Add messages to indicate using MSI or MSI-X
From: David Decotigny @ 2011-11-15 19:51 UTC (permalink / raw)
  To: Joe Perches, netdev, linux-kernel
  Cc: David S. Miller, Mike Ditto, David Decotigny
In-Reply-To: <cover.1321386214.git.david.decotigny@google.com>

From: Mike Ditto <mditto@google.com>

This adds a few kernel messages to indicate whether PCIe interrupts
are signaled with MSI or MSI-X.



Signed-off-by: David Decotigny <david.decotigny@google.com>
---
 drivers/net/ethernet/nvidia/forcedeth.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/nvidia/forcedeth.c b/drivers/net/ethernet/nvidia/forcedeth.c
index 374625b..fe17e42 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -3810,6 +3810,7 @@ static int nv_request_irq(struct net_device *dev, int intr_test)
 				writel(0, base + NvRegMSIXMap0);
 				writel(0, base + NvRegMSIXMap1);
 			}
+			netdev_info(dev, "MSI-X enabled\n");
 		}
 	}
 	if (ret != 0 && np->msi_flags & NV_MSI_CAPABLE) {
@@ -3831,6 +3832,7 @@ static int nv_request_irq(struct net_device *dev, int intr_test)
 			writel(0, base + NvRegMSIMap1);
 			/* enable msi vector 0 */
 			writel(NVREG_MSI_VECTOR_0_ENABLED, base + NvRegMSIIrqMask);
+			netdev_info(dev, "MSI enabled\n");
 		}
 	}
 	if (ret != 0) {
-- 
1.7.3.1

^ permalink raw reply related

* Re: [RFT] bridge: checksum not updated after pull
From: Martin Volf @ 2011-11-15 19:51 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Yan, Zheng, bridge@lists.linux-foundation.org,
	netdev@vger.kernel.org, davem@davemloft.net, wcang@sfc.wide.ad.jp
In-Reply-To: <20111115100914.0c9540be@nehalam.linuxnetplumber.net>

On 15 November 2011 19:09, Stephen Hemminger <shemminger@vyatta.com> wrote:
> I think this is what is necessary, please test.
...

Hello,

with the patch on top of 3.1.1 I no longer get the "hw csum failure"
message. I don't know how to test IPv6 multicast, but IPv6 ND and
SLAAC is correctly passing through the bridge.

Thanks!

Martin Volf

^ permalink raw reply

* Re: [PATCH] bonding: Don't allow mode change via sysfs with slaves present
From: Nicolas de Pesloüan @ 2011-11-15 20:02 UTC (permalink / raw)
  To: Andy Gospodarek; +Cc: Veaceslav Falico, netdev, Jay Vosburgh
In-Reply-To: <20111115193535.GC25132@gospo.rdu.redhat.com>

Le 15/11/2011 20:35, Andy Gospodarek a écrit :
> On Tue, Nov 15, 2011 at 08:24:29PM +0100, Nicolas de Pesloüan wrote:
>> Le 15/11/2011 18:00, Andy Gospodarek a écrit :
>>> On Tue, Nov 15, 2011 at 05:44:42PM +0100, Veaceslav Falico wrote:
>>>> When changing mode via bonding's sysfs, the slaves are not initialized
>>>> correctly. Forbid to change modes with slaves present to ensure that every
>>>> slave is initialized correctly via bond_enslave().
>>>>
>>>> Signed-off-by: Veaceslav Falico<vfalico@redhat.com>
>>>
>>> Looks good.  This behavior forces someone who wants to change to mode to
>>> go through steps that are almost as destructive as when module options
>>> are used to configure the mode.  I do not see a problem with this.
>>
>> Except the fact that is enforce one more constraint on the exact order
>> one should write into sysfs to setup a bonding interface. We already have
>> many such constraints and probably don't need more.
>>
>> Currently, it is possible to enslave slaves before selecting the mode.
>> The ifenslave-2.6 package from Debian currently enslave slaves before
>> setting the mode and would break with this change.
>>
>
> Our testing indicates that 802.3ad mode bonding will not work unless the
> devices are enslaved after the mode is set.  Does this mean that no one
> using Debian is using 802.2ad mode or are they just not reporting it?

I don't know. Possibly, they setup the bonding by hand, instead of relying on the bonding extensions 
of /etc/network/interfaces provided by the ifenslave-2.6 package.

Having a look at popularity for the package (http://qa.debian.org/popcon.php?package=ifenslave-2.6), 
it is obviously not the most popular one, but...

To try and fix the problem you noticed, if might be desirable to assert carrier off then on for all 
slaves when entering 802.3ad mode.

	Nicolas

^ permalink raw reply

* Re: Finding a hidden bound TCP socket
From: richard -rw- weinberger @ 2011-11-15 20:23 UTC (permalink / raw)
  To: G. D. Fuego; +Cc: linux-kernel, netdev
In-Reply-To: <CALmbKGV0g2nfNAghF5dGrt=TCyrms6RznOL8jSkg+gkgyfekLg@mail.gmail.com>

On Tue, Nov 15, 2011 at 8:21 PM, G. D. Fuego <gdfuego@gmail.com> wrote:
> Hello,
>
> I have a question about an odd linux networking behavior, that could
> potentially be a local networking DoS.  I'm curious if anyone is
> familiar with this behavior.
>
> Essentially I was assisting someone with tracking down a hidden tcp
> connection.  Attempts to bind to a particular port were failing,
> saying the socket was in use, even though netstat was not reporting
> any sort of connection.  They were initially suspecting a root kit,
> but after a bit of digging, I came across this page:
>
> http://dcid.me/2007/06/hidden-ports-on-linux/
>
> From the page:
>
> "Here is the idea. If you get this simple C program, it will attempt
> to bind every TCP port from 1025 to 1050, but it will not listen on
> them. After it is done, if you do a netstat (or fuser or lsof) nothing
> will be shown. However, if you try to use the port, you will get an
> error saying that it is already in use."
>
> I tested it out and confirmed that connections opened by their test
> program do in fact cause the port to be unavailable for use, and they
> are not reported in netstat, lsof, ss, or any other networking tools
> that I tried.  I'm unable to to find any references to the ports being
> in use anywhere I've looked within /proc.  Are you aware of another
> way to figure out which process may be bound to the port?  In our
> case, we figured out via trial and error which software was likely
> triggering this behavior.
>
> It seems to me that this could be a potentially interesting local
> networking DoS.  By binding to all ephemeral ports in this way, you'd
> prevent the local system from being able to establish any tcp
> connections, and it would be a pain to figure out which process was
> causing the problem.
>
> My lame attempts to exploit this failed due to a max file descriptor
> limit, but I'm told this could be doable by forking more processes for
> doing the binding.
>
> Is this behavior known/expected?
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>

CC'ing netdev

-- 
Thanks,
//richard

^ permalink raw reply

* Unable to flush ICMP redirect routes in kernel 3.0+
From: Ivan Zahariev @ 2011-11-15 20:23 UTC (permalink / raw)
  To: netdev

Hello,

We have changed nothing in our network infrastructure but only upgraded 
from Linux kernel 2.6.36.2 to 3.0.3. Here is the problem we are 
experiencing:

ICMP redirected routes are cached forever, and they can be cleared only 
by a reboot.

Here is an example:

root@machine5:~# ip route get 1.1.1.1
1.1.1.1 via 9.0.0.1 dev eth0  src 5.5.5.5
     cache <redirected>  ipid 0xfb5d rtt 1475ms rttvar 450ms cwnd 10

root@machine5:~# ip route list cache match 1.1.1.1
1.1.1.1 tos lowdelay via 9.0.0.1 dev eth0  src 5.5.5.5
     cache <redirected>  ipid 0xfb5d rtt 1475ms rttvar 450ms cwnd 10
1.1.1.1 via 9.0.0.1 dev eth0  src 5.5.5.5
     cache <redirected>  ipid 0xfb5d rtt 1475ms rttvar 450ms cwnd 10
...(two more entries, all go via 9.0.0.1)...

1.1.1.1 is the test destination address
5.5.5.5 is the source IP address of "machine5" via dev eth0, the only 
interface besides "lo"
9.0.0.1 is the incorrect gateway which we were redirected to; we want to 
change the route to 9.0.0.8

I found no way to clear this route. What I tried:

root@machine5:~# ip route flush cache ### CACHE FLUSH ###
root@machine5:~# ip route list cache match 1.1.1.1 # empty

root@machine5:~# ip route flush cache ### CACHE FLUSH ###
root@machine5:~# echo 1 > /proc/sys/net/ipv4/route/flush
root@machine5:~# ip route list cache match 1.1.1.1 # empty

root@machine5:~# ip route get 1.1.1.1 # magically re-inserts the 
<redirected> route, tcpdump sees NO ICMP traffic
1.1.1.1 via 9.0.0.1 dev eth0  src 5.5.5.5
     cache <redirected>  ipid 0xfb5d rtt 1475ms rttvar 450ms cwnd 10

I also tried to force a scheduled route flush:

root@machine5:~# echo 1 > /proc/sys/net/ipv4/route/gc_timeout
root@machine5:~# echo 1 > /proc/sys/net/ipv4/route/gc_interval

A reboot fixed it all.

This may be related to the "Several major changes to our routing 
infrastructure" (https://lkml.org/lkml/2011/3/16/384).
Other users are reporting the same problem:
* https://plus.google.com/u/0/117161704068825702652/posts/1UK1Rp4KA4J
* http://lists.debian.org/debian-kernel/2011/10/msg00633.html
Other similar issues:
* http://www.spinics.net/lists/netdev/msg176966.html
* http://forums.gentoo.org/viewtopic-t-901024-start-0.html

This has been occurring on a few KVM guest machines and also on a 
regular Linux machine, so it's not KVM related.

Is this a bug, or it's me who's missing something?

Thanks.
--Ivan

^ permalink raw reply

* Re: [PATCH] bonding: Don't allow mode change via sysfs with slaves present
From: Andy Gospodarek @ 2011-11-15 20:47 UTC (permalink / raw)
  To: Nicolas de Pesloüan
  Cc: Andy Gospodarek, Veaceslav Falico, netdev, Jay Vosburgh
In-Reply-To: <4EC2C550.6050805@gmail.com>

On Tue, Nov 15, 2011 at 09:02:24PM +0100, Nicolas de Pesloüan wrote:
> Le 15/11/2011 20:35, Andy Gospodarek a écrit :
>> On Tue, Nov 15, 2011 at 08:24:29PM +0100, Nicolas de Pesloüan wrote:
>>> Le 15/11/2011 18:00, Andy Gospodarek a écrit :
>>>> On Tue, Nov 15, 2011 at 05:44:42PM +0100, Veaceslav Falico wrote:
>>>>> When changing mode via bonding's sysfs, the slaves are not initialized
>>>>> correctly. Forbid to change modes with slaves present to ensure that every
>>>>> slave is initialized correctly via bond_enslave().
>>>>>
>>>>> Signed-off-by: Veaceslav Falico<vfalico@redhat.com>
>>>>
>>>> Looks good.  This behavior forces someone who wants to change to mode to
>>>> go through steps that are almost as destructive as when module options
>>>> are used to configure the mode.  I do not see a problem with this.
>>>
>>> Except the fact that is enforce one more constraint on the exact order
>>> one should write into sysfs to setup a bonding interface. We already have
>>> many such constraints and probably don't need more.
>>>
>>> Currently, it is possible to enslave slaves before selecting the mode.
>>> The ifenslave-2.6 package from Debian currently enslave slaves before
>>> setting the mode and would break with this change.
>>>
>>
>> Our testing indicates that 802.3ad mode bonding will not work unless the
>> devices are enslaved after the mode is set.  Does this mean that no one
>> using Debian is using 802.2ad mode or are they just not reporting it?
>
> I don't know. Possibly, they setup the bonding by hand, instead of 
> relying on the bonding extensions of /etc/network/interfaces provided by 
> the ifenslave-2.6 package.
>
> Having a look at popularity for the package 
> (http://qa.debian.org/popcon.php?package=ifenslave-2.6), it is obviously 
> not the most popular one, but...
>

Nicolas,

I took a look at the ifenslave package for debian more closely and it
actually looks like devices are enslaved last, after mode is set.  Can
you please take a look at this package and confirm what I'm seeing in
the 'pre-up' script.

It appears to me that setup_master sets the mode and enslave_slaves is
called after and enslaves the devices:

# Option slaves deprecated, replaced by bond-slaves, but still supported
# for backward compatibility.
IF_BOND_SLAVES=${IF_BOND_SLAVES:-$IF_SLAVES}

if [ "$IF_BOND_MASTER" ] ; then
        BOND_MASTER="$IF_BOND_MASTER"
        BOND_SLAVES="$IFACE"
else
        if [ "$IF_BOND_SLAVES" ] ; then
                BOND_MASTER="$IFACE"
                BOND_SLAVES="$IF_BOND_SLAVES"
        fi
fi

# Exit if nothing to do...
[ -z "$BOND_MASTER$BOND_SLAVES" ] && exit

add_master
early_setup_master
setup_master
enslave_slaves
exit 0

-andy

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox