Netdev List
* Re: Raise initial congestion window size / speedup slow start?
From: H.K. Jerry Chu @ 2010-07-19 22:51 UTC (permalink / raw)
  To: Rick Jones
  Cc: Patrick McManus, David Miller, davidsen, lists, linux-kernel,
	netdev
In-Reply-To: <4C448688.1070507@hp.com>

On Mon, Jul 19, 2010 at 10:08 AM, Rick Jones <rick.jones2@hp.com> wrote:
> H.K. Jerry Chu wrote:
>>
>> On Fri, Jul 16, 2010 at 10:01 AM, Patrick McManus <mcmanus@ducksong.com>
>> wrote:
>>>
>>> can you tell us more about the impl concerns of initcwnd stored on the
>>> route?
>>
>>
>> We have found two issues when altering initcwnd through the ip route cmd:
>> 1. initcwnd is actually capped by sndbuf (i.e., tcp_wmem[1], which is
>> defaulted to a small value of 16KB). This problem has been obscured
>> by the TSO code, which fudges the flow control limit (and could be a bug
>> in itself).
>
> I'll ask my Emily Litella question of the day and inquire as to why that
> would be unique to altering initcwnd via the route?
>
> The slightly less Emily Litella-esque question is why an application with a
> desire to know it could send more than 16K at one time wouldn't have either
> asked via its install docs to have the minimum tweaked (certainly if one is
> already tweaking routes...), or "gone all the way" and made an explicit
> setsockopt(SO_SNDBUF) call?  We are in a realm of applications for which
> there was a proposal to allow them to pick their own initcwnd right?  Having

Per-app setting of initcwnd is just one case. Another is setting initcwnd on a
per-route basis through the ip route cmd. For the latter, the initcwnd change is
more or less supposed to be transparent to apps.

This wasn't a big issue and can probably be fixed easily by initializing
sk_sndbuf to max(tcp_wmem[1], initcwnd), as you alluded to below. It is just
that our experiments were hindered by this little bug, and we weren't aware of
it sooner because TSO was fudging sndbuf.
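
A minimal sketch of the sizing rule mentioned above, written as a standalone helper rather than kernel code. The message gives max(tcp_wmem[1], initcwnd) without units; this sketch assumes initcwnd is counted in segments and converts it to bytes with an assumed MSS. The function name and parameters are hypothetical, and per-skb overhead is ignored.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not kernel code: pick an initial send-buffer size
 * large enough to hold a full initial congestion window.
 *
 * wmem_default  - default send buffer in bytes (tcp_wmem[1])
 * initcwnd_segs - initial congestion window, assumed to be in segments
 * mss           - assumed payload bytes per segment
 */
static size_t initial_sndbuf(size_t wmem_default, size_t initcwnd_segs,
			     size_t mss)
{
	size_t cwnd_bytes;

	assert(mss > 0);
	cwnd_bytes = initcwnd_segs * mss;

	/* Never shrink below the configured default. */
	return cwnd_bytes > wmem_default ? cwnd_bytes : wmem_default;
}
```

With a 16KB default and a 1460-byte MSS, a window of 10 segments still fits in the default, while a window of 16 segments would raise the initial buffer above it.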

Jerry

> them pick an SO_SNDBUF size would seem to be no more to ask.
>
> rick jones
>
> sendbuf_init = max(tcp_mem,initcwnd)?
>


* Re: Very low latency TCP for clusters
From: Tom Herbert @ 2010-07-19 23:37 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1279576980.2458.56.camel@edumazet-laptop>

On Mon, Jul 19, 2010 at 3:03 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le lundi 19 juillet 2010 à 11:44 -0700, Tom Herbert a écrit :
>
>> I see about 7 usecs as best number on loopback, so I believe this is
>> in the ballpark.  As I mentioned above, this is about "best case" latency
>> of a single thread, so we assume any amount of pinning or other
>> customized configuration to that purpose.
>
> Well, given I get 29 us on a ping between two machines (Gb link, no
> process involved on receiver, only softirq), I really doubt we can reach
> 5 us on a tcp test involving a user process on both side ;)
>
That's pretty pokey ;-) I see numbers around 25 usecs between two
machines; this is with TCP_NBRR.  With TCP_RR it's more like 35 usecs,
so eliminating the scheduler is already a big reduction.  That leaves
18 usecs in device time, interrupt processing, network, and cache
misses; 7 usecs in TCP processing, user space.  While 5 usecs is an
aggressive goal, I am not ready to concede that there's an
architectural limit in either NICs, TCP, or sockets that can't be
overcome.


* Re: Raise initial congestion window size / speedup slow start?
From: Hagen Paul Pfeifer @ 2010-07-19 23:42 UTC (permalink / raw)
  To: H.K. Jerry Chu
  Cc: Rick Jones, Patrick McManus, David Miller, davidsen, lists,
	linux-kernel, netdev, Stephen Hemminger, Alan Cox
In-Reply-To: <AANLkTilI3rXF9ikiQOIqCOXpq4s3cfqOULBn_P8jQVrp@mail.gmail.com>

Maybe someone is interested: on the Transport Modeling Research Group (TMRG)
mailing list, a new thread named "Proposal to increase TCP initial CWND"
started one day ago.

Cheers, Hagen



* [net-next-2.6 PATCH] e1000: allow option to limit number of descriptors down to 48 per ring
From: Jeff Kirsher @ 2010-07-19 23:43 UTC (permalink / raw)
  To: davem; +Cc: netdev, gospo, bphilips, Alexander Duyck, Jeff Kirsher

From: Alexander Duyck <alexander.h.duyck@intel.com>

This change makes it possible to limit the number of descriptors down to 48
per ring.  The reason for this change is to address a variation on hardware
errata 10 for 82546GB in which descriptors will be lost if more than 32
descriptors are fetched and the PCI-X MRBC is 512.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Emil Tantilov <emil.s.tantilov@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---

 drivers/net/e1000/e1000.h |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/e1000/e1000.h b/drivers/net/e1000/e1000.h
index 40b62b4..65298a6 100644
--- a/drivers/net/e1000/e1000.h
+++ b/drivers/net/e1000/e1000.h
@@ -86,12 +86,12 @@ struct e1000_adapter;
 /* TX/RX descriptor defines */
 #define E1000_DEFAULT_TXD                  256
 #define E1000_MAX_TXD                      256
-#define E1000_MIN_TXD                       80
+#define E1000_MIN_TXD                       48
 #define E1000_MAX_82544_TXD               4096
 
 #define E1000_DEFAULT_RXD                  256
 #define E1000_MAX_RXD                      256
-#define E1000_MIN_RXD                       80
+#define E1000_MIN_RXD                       48
 #define E1000_MAX_82544_RXD               4096
 
 #define E1000_MIN_ITR_USECS		10 /* 100000 irq/sec */



* [net-next-2.6 PATCH 1/5] ixgbe: dcb, set DPF bit when PFC is enabled
From: Jeff Kirsher @ 2010-07-19 23:59 UTC (permalink / raw)
  To: davem; +Cc: netdev, gospo, bphilips, John Fastabend, Don Skidmore,
	Jeff Kirsher

From: John Fastabend <john.r.fastabend@intel.com>

Set the DPF bit when PFC is enabled.  This will discard
PFC frames so they do not get passed up the stack.

The DPF bit is already set for flow control but not for priority
flow control; this brings PFC in line with FC.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
Tested-by: Ross Brattain <ross.b.brattain@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---

 drivers/net/ixgbe/ixgbe_dcb_82599.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_dcb_82599.c b/drivers/net/ixgbe/ixgbe_dcb_82599.c
index 4f7a26a..25b02fb 100644
--- a/drivers/net/ixgbe/ixgbe_dcb_82599.c
+++ b/drivers/net/ixgbe/ixgbe_dcb_82599.c
@@ -346,7 +346,7 @@ s32 ixgbe_dcb_config_pfc_82599(struct ixgbe_hw *hw,
 	 */
 	reg = IXGBE_READ_REG(hw, IXGBE_MFLCN);
 	reg &= ~IXGBE_MFLCN_RFCE;
-	reg |= IXGBE_MFLCN_RPFCE;
+	reg |= IXGBE_MFLCN_RPFCE | IXGBE_MFLCN_DPF;
 	IXGBE_WRITE_REG(hw, IXGBE_MFLCN, reg);
 out:
 	return 0;



* [net-next-2.6 PATCH 2/5] ixgbe: drop support for UDP in RSS hash generation
From: Jeff Kirsher @ 2010-07-19 23:59 UTC (permalink / raw)
  To: davem; +Cc: netdev, gospo, bphilips, Alexander Duyck, Don Skidmore,
	Jeff Kirsher
In-Reply-To: <20100719235831.14112.14175.stgit@localhost.localdomain>

From: Alexander Duyck <alexander.h.duyck@intel.com>

This change removes UDP from the supported protocols for RSS hashing.  IP
fragmentation was causing a single network flow to be split into two streams,
one for fragmented and one for non-fragmented packets, which in turn caused
out-of-order delivery.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Don Skidmore <donald.c.skidmore@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---

 drivers/net/ixgbe/ixgbe_main.c |    4 +---
 1 files changed, 1 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index b235aa1..813d2cb 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -2800,10 +2800,8 @@ static void ixgbe_configure_rx(struct ixgbe_adapter *adapter)
 		    /* Perform hash on these packet types */
 		mrqc |= IXGBE_MRQC_RSS_FIELD_IPV4
 		      | IXGBE_MRQC_RSS_FIELD_IPV4_TCP
-		      | IXGBE_MRQC_RSS_FIELD_IPV4_UDP
 		      | IXGBE_MRQC_RSS_FIELD_IPV6
-		      | IXGBE_MRQC_RSS_FIELD_IPV6_TCP
-		      | IXGBE_MRQC_RSS_FIELD_IPV6_UDP;
+		      | IXGBE_MRQC_RSS_FIELD_IPV6_TCP;
 	}
 	IXGBE_WRITE_REG(hw, IXGBE_MRQC, mrqc);
 



* [net-next-2.6 PATCH 3/5] ixgbe: properly toggling netdev feature flags when disabling FCoE
From: Jeff Kirsher @ 2010-07-19 23:59 UTC (permalink / raw)
  To: davem; +Cc: netdev, gospo, bphilips, Yi Zou, Jeff Kirsher
In-Reply-To: <20100719235831.14112.14175.stgit@localhost.localdomain>

From: Yi Zou <yi.zou@intel.com>

When FCoE is disabled, there is a race condition in which FCoE offload is
turned off but the FCoE protocol driver is still queuing I/O, thinking
offload support still exists. This patch turns off the corresponding FCoE
netdev feature flags and notifies the FCoE stack first, allowing the FCoE
protocol stack driver to update its flags upon NETDEV_FEAT_CHANGE so that no
I/O will be using offload.

Also, indicate FCoE offload flags in vlan_features in ixgbe_probe once
and do not toggle them in ixgbe_fcoe_enable/disable, so that when FCoE is
created on the VLAN interface, vlan_transfer_features() properly
updates the VLAN netdev features flags and notifies the FCoE protocol driver
of NETDEV_FEAT_CHANGE.

Signed-off-by: Yi Zou <yi.zou@intel.com>
Tested-by: Ross Brattain <ross.b.brattain@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---

 drivers/net/ixgbe/ixgbe_fcoe.c |   19 ++++++-------------
 drivers/net/ixgbe/ixgbe_main.c |    5 +++++
 2 files changed, 11 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_fcoe.c b/drivers/net/ixgbe/ixgbe_fcoe.c
index f6ef4cd..1737d2b 100644
--- a/drivers/net/ixgbe/ixgbe_fcoe.c
+++ b/drivers/net/ixgbe/ixgbe_fcoe.c
@@ -622,9 +622,6 @@ int ixgbe_fcoe_enable(struct net_device *netdev)
 	netdev->features |= NETIF_F_FCOE_CRC;
 	netdev->features |= NETIF_F_FSO;
 	netdev->features |= NETIF_F_FCOE_MTU;
-	netdev->vlan_features |= NETIF_F_FCOE_CRC;
-	netdev->vlan_features |= NETIF_F_FSO;
-	netdev->vlan_features |= NETIF_F_FCOE_MTU;
 	netdev->fcoe_ddp_xid = IXGBE_FCOE_DDP_MAX - 1;
 
 	ixgbe_init_interrupt_scheme(adapter);
@@ -658,24 +655,20 @@ int ixgbe_fcoe_disable(struct net_device *netdev)
 		goto out_disable;
 
 	e_info(drv, "Disabling FCoE offload features.\n");
+	netdev->features &= ~NETIF_F_FCOE_CRC;
+	netdev->features &= ~NETIF_F_FSO;
+	netdev->features &= ~NETIF_F_FCOE_MTU;
+	netdev->fcoe_ddp_xid = 0;
+	netdev_features_change(netdev);
+
 	if (netif_running(netdev))
 		netdev->netdev_ops->ndo_stop(netdev);
 
 	ixgbe_clear_interrupt_scheme(adapter);
-
 	adapter->flags &= ~IXGBE_FLAG_FCOE_ENABLED;
 	adapter->ring_feature[RING_F_FCOE].indices = 0;
-	netdev->features &= ~NETIF_F_FCOE_CRC;
-	netdev->features &= ~NETIF_F_FSO;
-	netdev->features &= ~NETIF_F_FCOE_MTU;
-	netdev->vlan_features &= ~NETIF_F_FCOE_CRC;
-	netdev->vlan_features &= ~NETIF_F_FSO;
-	netdev->vlan_features &= ~NETIF_F_FCOE_MTU;
-	netdev->fcoe_ddp_xid = 0;
-
 	ixgbe_cleanup_fcoe(adapter);
 	ixgbe_init_interrupt_scheme(adapter);
-	netdev_features_change(netdev);
 
 	if (netif_running(netdev))
 		netdev->netdev_ops->ndo_open(netdev);
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index 813d2cb..7d619d6 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -6740,6 +6740,11 @@ static int __devinit ixgbe_probe(struct pci_dev *pdev,
 				adapter->flags &= ~IXGBE_FLAG_FCOE_CAPABLE;
 		}
 	}
+	if (adapter->flags & IXGBE_FLAG_FCOE_CAPABLE) {
+		netdev->vlan_features |= NETIF_F_FCOE_CRC;
+		netdev->vlan_features |= NETIF_F_FSO;
+		netdev->vlan_features |= NETIF_F_FCOE_MTU;
+	}
 #endif /* IXGBE_FCOE */
 	if (pci_using_dac)
 		netdev->features |= NETIF_F_HIGHDMA;



* [net-next-2.6 PATCH 4/5] ixgbe: use GFP_ATOMIC when allocating FCoE DDP context from the dma pool
From: Jeff Kirsher @ 2010-07-20  0:00 UTC (permalink / raw)
  To: davem; +Cc: netdev, gospo, bphilips, Yi Zou, Jeff Kirsher
In-Reply-To: <20100719235831.14112.14175.stgit@localhost.localdomain>

From: Yi Zou <yi.zou@intel.com>

The FCoE protocol stack may hold a lock when this gets called.

Signed-off-by: Yi Zou <yi.zou@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---

 drivers/net/ixgbe/ixgbe_fcoe.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_fcoe.c b/drivers/net/ixgbe/ixgbe_fcoe.c
index 1737d2b..072327c 100644
--- a/drivers/net/ixgbe/ixgbe_fcoe.c
+++ b/drivers/net/ixgbe/ixgbe_fcoe.c
@@ -190,7 +190,7 @@ int ixgbe_fcoe_ddp_get(struct net_device *netdev, u16 xid,
 	}
 
 	/* alloc the udl from our ddp pool */
-	ddp->udl = pci_pool_alloc(fcoe->pool, GFP_KERNEL, &ddp->udp);
+	ddp->udl = pci_pool_alloc(fcoe->pool, GFP_ATOMIC, &ddp->udp);
 	if (!ddp->udl) {
 		e_err(drv, "failed allocated ddp context\n");
 		goto out_noddp_unmap;



* [net-next-2.6 PATCH 5/5] ixgbe: fix version string for ixgbe
From: Jeff Kirsher @ 2010-07-20  0:00 UTC (permalink / raw)
  To: davem; +Cc: netdev, gospo, bphilips, Don Skidmore, Jeff Kirsher
In-Reply-To: <20100719235831.14112.14175.stgit@localhost.localdomain>

From: Don Skidmore <donald.c.skidmore@intel.com>

Bump the version string to better reflect what is in the driver.

Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---

 drivers/net/ixgbe/ixgbe_main.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index 7d619d6..9203759 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -52,7 +52,7 @@ char ixgbe_driver_name[] = "ixgbe";
 static const char ixgbe_driver_string[] =
                               "Intel(R) 10 Gigabit PCI Express Network Driver";
 
-#define DRV_VERSION "2.0.62-k2"
+#define DRV_VERSION "2.0.84-k2"
 const char ixgbe_driver_version[] = DRV_VERSION;
 static char ixgbe_copyright[] = "Copyright (c) 1999-2010 Intel Corporation.";
 



* [PATCH net-next 1/4] bnx2: Use proper counter for net_device_stats->multicast.
From: Michael Chan @ 2010-07-20  0:15 UTC (permalink / raw)
  To: davem; +Cc: netdev

We were using the wrong tx multicast counter instead of the rx multicast
counter.

Reported-by: Peter Snellman <peter.snellman@cinnober.com>
Reviewed-by: Benjamin Li <benli@broadcom.com>
Reviewed-by: Matt Carlson <mcarlson@broadcom.com>
Signed-off-by: Michael Chan <mchan@broadcom.com>
---
 drivers/net/bnx2.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index ce3217b..deb7f83 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -6631,7 +6631,7 @@ bnx2_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *net_stats)
 		GET_64BIT_NET_STATS(stat_IfHCOutOctets);
 
 	net_stats->multicast =
-		GET_64BIT_NET_STATS(stat_IfHCOutMulticastPkts);
+		GET_64BIT_NET_STATS(stat_IfHCInMulticastPkts);
 
 	net_stats->collisions =
 		GET_32BIT_NET_STATS(stat_EtherStatsCollisions);
-- 
1.6.4.GIT




* [PATCH net-next 3/4] bnx2: Remove some unnecessary smp_mb() in tx fast path.
From: Michael Chan @ 2010-07-20  0:15 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <1279584905-15084-2-git-send-email-mchan@broadcom.com>

smp_mb() inside bnx2_tx_avail() is used twice in the normal
bnx2_start_xmit() path (see illustration below).  The full memory
barrier is only necessary during race conditions with tx completion.
We can speed up the tx path by replacing smp_mb() in bnx2_tx_avail()
with a compiler barrier.  The compiler barrier is to force the
compiler to fetch the tx_prod and tx_cons from memory.

In the race condition between bnx2_start_xmit() and bnx2_tx_int(),
we have the following situation:

bnx2_start_xmit()                       bnx2_tx_int()
    if (!bnx2_tx_avail())
            BUG();

    ...

    if (!bnx2_tx_avail())
            netif_tx_stop_queue();          update_tx_index();
            smp_mb();                       smp_mb();
            if (bnx2_tx_avail())            if (netif_tx_queue_stopped() &&
                    netif_tx_wake_queue();      bnx2_tx_avail())

With smp_mb() removed from bnx2_tx_avail(), we need to add smp_mb() to
bnx2_start_xmit() as shown above to properly order netif_tx_stop_queue()
and bnx2_tx_avail() to check the ring index.  If it is not strictly
ordered, the tx queue can be stopped forever.

This improves performance by about 5% with 2 ports running bi-directional
64-byte packets.

Reviewed-by: Benjamin Li <benli@broadcom.com>
Reviewed-by: Matt Carlson <mcarlson@broadcom.com>
Signed-off-by: Michael Chan <mchan@broadcom.com>
---
 drivers/net/bnx2.c |   10 +++++++++-
 1 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index d44ecc3..2af570d 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -253,7 +253,8 @@ static inline u32 bnx2_tx_avail(struct bnx2 *bp, struct bnx2_tx_ring_info *txr)
 {
 	u32 diff;
 
-	smp_mb();
+	/* Tell compiler to fetch tx_prod and tx_cons from memory. */
+	barrier();
 
 	/* The ring uses 256 indices for 255 entries, one of them
 	 * needs to be skipped.
@@ -6534,6 +6535,13 @@ bnx2_start_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	if (unlikely(bnx2_tx_avail(bp, txr) <= MAX_SKB_FRAGS)) {
 		netif_tx_stop_queue(txq);
+
+		/* netif_tx_stop_queue() must be done before checking
+		 * tx index in bnx2_tx_avail() below, because in
+		 * bnx2_tx_int(), we update tx index before checking for
+		 * netif_tx_queue_stopped().
+		 */
+		smp_mb();
 		if (bnx2_tx_avail(bp, txr) > bp->tx_wake_thresh)
 			netif_tx_wake_queue(txq);
 	}
-- 
1.6.4.GIT




* [PATCH net-next 2/4] bnx2: Call pci_enable_msix() with actual number of vectors.
From: Michael Chan @ 2010-07-20  0:15 UTC (permalink / raw)
  To: davem; +Cc: netdev, Breno Leitão
In-Reply-To: <1279584905-15084-1-git-send-email-mchan@broadcom.com>

Based on original patch by Breno Leitão <leitao@linux.vnet.ibm.com>.

Allocate the actual number of vectors and make use of fewer vectors
if pci_enable_msix() returns > 0.  We must allocate one additional
vector for the cnic driver.

Cc: Breno Leitão <leitao@linux.vnet.ibm.com>
Reviewed-by: Benjamin Li <benli@broadcom.com>
Reviewed-by: Matt Carlson <mcarlson@broadcom.com>
Signed-off-by: Michael Chan <mchan@broadcom.com>
---
 drivers/net/bnx2.c |   24 ++++++++++++++++++++----
 drivers/net/bnx2.h |    9 ++++++---
 2 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index deb7f83..d44ecc3 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -864,7 +864,7 @@ bnx2_alloc_mem(struct bnx2 *bp)
 	bnapi->hw_rx_cons_ptr =
 		&bnapi->status_blk.msi->status_rx_quick_consumer_index0;
 	if (bp->flags & BNX2_FLAG_MSIX_CAP) {
-		for (i = 1; i < BNX2_MAX_MSIX_VEC; i++) {
+		for (i = 1; i < bp->irq_nvecs; i++) {
 			struct status_block_msix *sblk;
 
 			bnapi = &bp->bnx2_napi[i];
@@ -6152,7 +6152,7 @@ bnx2_free_irq(struct bnx2 *bp)
 static void
 bnx2_enable_msix(struct bnx2 *bp, int msix_vecs)
 {
-	int i, rc;
+	int i, total_vecs, rc;
 	struct msix_entry msix_ent[BNX2_MAX_MSIX_VEC];
 	struct net_device *dev = bp->dev;
 	const int len = sizeof(bp->irq_tbl[0].name);
@@ -6171,13 +6171,29 @@ bnx2_enable_msix(struct bnx2 *bp, int msix_vecs)
 		msix_ent[i].vector = 0;
 	}
 
-	rc = pci_enable_msix(bp->pdev, msix_ent, BNX2_MAX_MSIX_VEC);
+	total_vecs = msix_vecs;
+#ifdef BCM_CNIC
+	total_vecs++;
+#endif
+	rc = -ENOSPC;
+	while (total_vecs >= BNX2_MIN_MSIX_VEC) {
+		rc = pci_enable_msix(bp->pdev, msix_ent, total_vecs);
+		if (rc <= 0)
+			break;
+		if (rc > 0)
+			total_vecs = rc;
+	}
+
 	if (rc != 0)
 		return;
 
+	msix_vecs = total_vecs;
+#ifdef BCM_CNIC
+	msix_vecs--;
+#endif
 	bp->irq_nvecs = msix_vecs;
 	bp->flags |= BNX2_FLAG_USING_MSIX | BNX2_FLAG_ONE_SHOT_MSI;
-	for (i = 0; i < BNX2_MAX_MSIX_VEC; i++) {
+	for (i = 0; i < total_vecs; i++) {
 		bp->irq_tbl[i].vector = msix_ent[i].vector;
 		snprintf(bp->irq_tbl[i].name, len, "%s-%d", dev->name, i);
 		bp->irq_tbl[i].handler = bnx2_msi_1shot;
diff --git a/drivers/net/bnx2.h b/drivers/net/bnx2.h
index b9af6bc..2104c10 100644
--- a/drivers/net/bnx2.h
+++ b/drivers/net/bnx2.h
@@ -6637,9 +6637,12 @@ struct flash_spec {
 
 #define BNX2_MAX_MSIX_HW_VEC	9
 #define BNX2_MAX_MSIX_VEC	9
-#define BNX2_BASE_VEC		0
-#define BNX2_TX_VEC		1
-#define BNX2_TX_INT_NUM	(BNX2_TX_VEC << BNX2_PCICFG_INT_ACK_CMD_INT_NUM_SHIFT)
+#ifdef BCM_CNIC
+#define BNX2_MIN_MSIX_VEC	2
+#else
+#define BNX2_MIN_MSIX_VEC	1
+#endif
+
 
 struct bnx2_irq {
 	irq_handler_t	handler;
-- 
1.6.4.GIT
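
The retry loop in this patch can be sketched outside the driver as follows. This is a sketch under stated assumptions: the callback type, negotiate_vectors() and fake_enable() are hypothetical names, and the callback only models the return convention the patch relies on (0 on success, negative on error, a positive count when fewer vectors are available). The extra vector reserved for cnic is left out, and a guard against a non-decreasing offer is added that the patch itself does not have.

```c
#include <assert.h>

/* Hypothetical callback: 0 on success, < 0 on error, or a positive
 * count of vectors that could be allocated instead. */
typedef int (*enable_fn)(void *ctx, int nvec);

/* Returns the number of vectors enabled, or -1 if fewer than min_vecs fit. */
static int negotiate_vectors(int want, int min_vecs, enable_fn try_enable,
			     void *ctx)
{
	int total = want;
	int rc = -1;

	assert(min_vecs >= 1);
	while (total >= min_vecs) {
		rc = try_enable(ctx, total);
		if (rc <= 0)
			break;		/* success (0) or hard error (< 0) */
		if (rc >= total) {	/* guard: offer must shrink the request */
			rc = -1;
			break;
		}
		total = rc;		/* retry with the offered count */
	}
	return rc == 0 ? total : -1;
}

/* Stand-in for a device that can supply at most *ctx vectors. */
static int fake_enable(void *ctx, int nvec)
{
	int limit = *(int *)ctx;

	if (limit <= 0)
		return -1;
	return nvec <= limit ? 0 : limit;
}
```

For example, asking for 9 vectors from a device that can supply 4 takes two calls: the first returns 4, and the second succeeds with 4.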




* [PATCH net-next 4/4] bnx2: Update version to 2.0.17.
From: Michael Chan @ 2010-07-20  0:15 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <1279584905-15084-3-git-send-email-mchan@broadcom.com>

Signed-off-by: Michael Chan <mchan@broadcom.com>
---
 drivers/net/bnx2.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index 2af570d..e6a803f 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -58,8 +58,8 @@
 #include "bnx2_fw.h"
 
 #define DRV_MODULE_NAME		"bnx2"
-#define DRV_MODULE_VERSION	"2.0.16"
-#define DRV_MODULE_RELDATE	"July 2, 2010"
+#define DRV_MODULE_VERSION	"2.0.17"
+#define DRV_MODULE_RELDATE	"July 18, 2010"
 #define FW_MIPS_FILE_06		"bnx2/bnx2-mips-06-5.0.0.j6.fw"
 #define FW_RV2P_FILE_06		"bnx2/bnx2-rv2p-06-5.0.0.j3.fw"
 #define FW_MIPS_FILE_09		"bnx2/bnx2-mips-09-5.0.0.j15.fw"
-- 
1.6.4.GIT




* [RFC PATCH v3 0/5] netdev: show a process of packets
From: Koki Sanagi @ 2010-07-20  0:43 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, davem, kaneshige.kenji, izumi.taku, kosaki.motohiro,
	nhorman, laijs, scott.a.mcmillan, rostedt, eric.dumazet, fweisbec,
	mathieu.desnoyers

CHANGE-LOG since v2:
    1) let all tracepoints of softirq use DECLARE_EVENT_CLASS
    2) let tracepoint of netdev_queue and netdev_receive use DECLARE_EVENT_CLASS
    3) add tracepoint to skb_free_datagram_locked
    4) show function and time when received packet is freed

This patch set adds tracepoints that show how packets are processed.
Using these tracepoints and existing ones, we can get the time at which a
packet passes through certain points in the transmit or receive sequence.
For example, this is the output of the perf script added by patch 5/5.

79074.756672832sec cpu=1
irq_entry(+0.000000msec,irq=77:eth3)
         |------------softirq_raise(+0.001277msec)
irq_exit (+0.002278msec)     |
                             |
                      softirq_entry(+0.003562msec)
                             |
                             |---netif_receive_skb(+0.006279msec,len=100)
                             |            |
                             |   skb_copy_datagram_iovec(+0.038778msec, 2285:sshd)
                             |
                      napi_poll_exit(+0.017160msec, eth3)
                             |
                      softirq_exit(+0.018248msec)

The above is the receive side. It shows the receive sequence from the
interrupt (irq_entry) to the application (skb_copy_datagram_iovec). There are
eight points on this side. All events except skb_copy_datagram_iovec and the
skb-freeing events can be associated with each other by CPU number.
skb_copy_datagram_iovec and the skb-freeing events can be associated with
netif_receive_skb by skbaddr.
The script shows one NET_RX softirq and the events related to it. All relative
times are based on the first irq_entry that raises the NET_RX softirq.

   dev    len      Qdisc               netdevice             free
   eth3   114  79044.417123332sec     0.005242msec          0.103843msec
   eth3   114  79044.580090422sec     0.002306msec          0.103632msec
   eth3   114  79044.719078251sec     0.002288msec          0.104093msec

The above is the transmit side. There are three tracepoints on this side.
Point1 is before a packet is put on the Qdisc. Point2 is after ndo_start_xmit in
dev_hard_start_xmit; it indicates that the packet has been handed to the driver.
Point3 is in consume_skb and dev_kfree_skb_irq; it indicates that the
transmitted packet has been freed.
The columns are, from left: device name, packet length, the time of
point1, the interval between point1 and point2, and the interval between
point2 and point3.
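
As a rough illustration of how the two interval columns follow from the three transmit points, here is a small standalone sketch. The struct and function names are hypothetical and are not taken from the perf script; timestamps are assumed to be in nanoseconds.

```c
#include <assert.h>
#include <stdint.h>

/* Timestamps (nanoseconds) at the three transmit points. */
struct tx_points {
	uint64_t queue_ns; /* point1: before the packet is put on the Qdisc */
	uint64_t xmit_ns;  /* point2: after ndo_start_xmit returns */
	uint64_t free_ns;  /* point3: when the transmitted skb is freed */
};

/* Interval between point1 and point2 (the "netdevice" column). */
static uint64_t netdevice_interval_ns(const struct tx_points *p)
{
	assert(p->xmit_ns >= p->queue_ns);
	return p->xmit_ns - p->queue_ns;
}

/* Interval between point2 and point3 (the "free" column). */
static uint64_t free_interval_ns(const struct tx_points *p)
{
	assert(p->free_ns >= p->xmit_ns);
	return p->free_ns - p->xmit_ns;
}

/* Convert nanoseconds to the milliseconds used in the output. */
static double ns_to_msec(uint64_t ns)
{
	return (double)ns / 1e6;
}
```

Taking the first row of the table, a point1 time of 79044.417123332 sec with point2 5242 ns later and point3 a further 103843 ns later gives the 0.005242 msec and 0.103843 msec columns.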

These times are useful for analyzing performance or for finding the point
where a packet is delayed. For example:
- The NET_RX softirq is called late.
- The application is late in taking a packet.
- It takes a long time to hand a transmitted packet to the driver
  (this may be caused by a packed queue).

These tracepoints also help us investigate network driver trouble from a
memory dump, because ftrace records events to memory. ftrace is also light
enough to leave tracing always on, so it is useful when investigating a
problem that cannot be reproduced.

Thanks,
Koki Sanagi.


* [RFC PATCH v3 1/5] irq: add tracepoint to softirq_raise
From: Koki Sanagi @ 2010-07-20  0:45 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, davem, kaneshige.kenji, izumi.taku, kosaki.motohiro,
	nhorman, laijs, scott.a.mcmillan, rostedt, eric.dumazet, fweisbec,
	mathieu.desnoyers
In-Reply-To: <4C44F12F.5090908@jp.fujitsu.com>

From: Lai Jiangshan <laijs@cn.fujitsu.com>

Add a tracepoint for tracing when a softirq action is raised.

Together with the existing tracepoints, it completes the set of softirq
tracepoints: softirq_raise, softirq_entry and softirq_exit.

When this tracepoint is used in combination with
the softirq_entry tracepoint, we can determine
the softirq raise latency.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>

[ factorize softirq events with DECLARE_EVENT_CLASS ]
Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
---
 include/linux/interrupt.h  |    8 +++++-
 include/trace/events/irq.h |   57 ++++++++++++++++++++++++++-----------------
 kernel/softirq.c           |    4 +-
 3 files changed, 43 insertions(+), 26 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index c233113..1cb5726 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -18,6 +18,7 @@
 #include <asm/atomic.h>
 #include <asm/ptrace.h>
 #include <asm/system.h>
+#include <trace/events/irq.h>
 
 /*
  * These correspond to the IORESOURCE_IRQ_* defines in
@@ -402,7 +403,12 @@ asmlinkage void do_softirq(void);
 asmlinkage void __do_softirq(void);
 extern void open_softirq(int nr, void (*action)(struct softirq_action *));
 extern void softirq_init(void);
-#define __raise_softirq_irqoff(nr) do { or_softirq_pending(1UL << (nr)); } while (0)
+static inline void __raise_softirq_irqoff(unsigned int nr)
+{
+	trace_softirq_raise(nr);
+	or_softirq_pending(1UL << nr);
+}
+
 extern void raise_softirq_irqoff(unsigned int nr);
 extern void raise_softirq(unsigned int nr);
 extern void wakeup_softirqd(void);
diff --git a/include/trace/events/irq.h b/include/trace/events/irq.h
index 0e4cfb6..717744c 100644
--- a/include/trace/events/irq.h
+++ b/include/trace/events/irq.h
@@ -5,7 +5,9 @@
 #define _TRACE_IRQ_H
 
 #include <linux/tracepoint.h>
-#include <linux/interrupt.h>
+
+struct irqaction;
+struct softirq_action;
 
 #define softirq_name(sirq) { sirq##_SOFTIRQ, #sirq }
 #define show_softirq_name(val)				\
@@ -84,56 +86,65 @@ TRACE_EVENT(irq_handler_exit,
 
 DECLARE_EVENT_CLASS(softirq,
 
-	TP_PROTO(struct softirq_action *h, struct softirq_action *vec),
+	TP_PROTO(unsigned int nr),
 
-	TP_ARGS(h, vec),
+	TP_ARGS(nr),
 
 	TP_STRUCT__entry(
-		__field(	int,	vec			)
+		__field(	unsigned int,	vec	)
 	),
 
 	TP_fast_assign(
-		__entry->vec = (int)(h - vec);
+		__entry->vec	= nr;
 	),
 
 	TP_printk("vec=%d [action=%s]", __entry->vec,
-		  show_softirq_name(__entry->vec))
+		show_softirq_name(__entry->vec))
+);
+
+/**
+ * softirq_raise - called immediately when a softirq is raised
+ * @nr: softirq vector number
+ *
+ * Tracepoint for tracing when softirq action is raised.
+ * Also, when used in combination with the softirq_entry tracepoint
+ * we can determine the softirq raise latency.
+ */
+DEFINE_EVENT(softirq, softirq_raise,
+
+	TP_PROTO(unsigned int nr),
+
+	TP_ARGS(nr)
 );
 
 /**
  * softirq_entry - called immediately before the softirq handler
- * @h: pointer to struct softirq_action
- * @vec: pointer to first struct softirq_action in softirq_vec array
+ * @nr: softirq vector number
  *
- * The @h parameter, contains a pointer to the struct softirq_action
- * which has a pointer to the action handler that is called. By subtracting
- * the @vec pointer from the @h pointer, we can determine the softirq
- * number. Also, when used in combination with the softirq_exit tracepoint
+ * Tracepoint for tracing when softirq action starts.
+ * Also, when used in combination with the softirq_exit tracepoint
  * we can determine the softirq latency.
  */
 DEFINE_EVENT(softirq, softirq_entry,
 
-	TP_PROTO(struct softirq_action *h, struct softirq_action *vec),
+	TP_PROTO(unsigned int nr),
 
-	TP_ARGS(h, vec)
+	TP_ARGS(nr)
 );
 
 /**
  * softirq_exit - called immediately after the softirq handler returns
- * @h: pointer to struct softirq_action
- * @vec: pointer to first struct softirq_action in softirq_vec array
+ * @nr: softirq vector number
  *
- * The @h parameter contains a pointer to the struct softirq_action
- * that has handled the softirq. By subtracting the @vec pointer from
- * the @h pointer, we can determine the softirq number. Also, when used in
- * combination with the softirq_entry tracepoint we can determine the softirq
- * latency.
+ * Tracepoint for tracing when softirq action ends.
+ * Also, when used in combination with the softirq_entry tracepoint
+ * we can determine the softirq latency.
  */
 DEFINE_EVENT(softirq, softirq_exit,
 
-	TP_PROTO(struct softirq_action *h, struct softirq_action *vec),
+	TP_PROTO(unsigned int nr),
 
-	TP_ARGS(h, vec)
+	TP_ARGS(nr)
 );
 
 #endif /*  _TRACE_IRQ_H */
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 825e112..6790599 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -215,9 +215,9 @@ restart:
 			int prev_count = preempt_count();
 			kstat_incr_softirqs_this_cpu(h - softirq_vec);
 
-			trace_softirq_entry(h, softirq_vec);
+			trace_softirq_entry(h - softirq_vec);
 			h->action(h);
-			trace_softirq_exit(h, softirq_vec);
+			trace_softirq_exit(h - softirq_vec);
 			if (unlikely(prev_count != preempt_count())) {
 				printk(KERN_ERR "huh, entered softirq %td %s %p"
 				       "with preempt_count %08x,"


* [RFC PATCH v3 2/5] napi: convert trace_napi_poll to TRACE_EVENT
From: Koki Sanagi @ 2010-07-20  0:46 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, davem, kaneshige.kenji, izumi.taku, kosaki.motohiro,
	nhorman, laijs, scott.a.mcmillan, rostedt, eric.dumazet, fweisbec,
	mathieu.desnoyers
In-Reply-To: <4C44F12F.5090908@jp.fujitsu.com>

From: Neil Horman <nhorman@tuxdriver.com>

This patch converts trace_napi_poll from DECLARE_TRACE to TRACE_EVENT to improve
the usability of the napi_poll tracepoint.

          <idle>-0     [001] 241302.750777: napi_poll: napi poll on napi struct f6acc480 for device eth3
          <idle>-0     [000] 241302.852389: napi_poll: napi poll on napi struct f5d0d70c for device eth1

The original patch is at:
http://marc.info/?l=linux-kernel&m=126021713809450&w=2
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>

It also includes a fix by Steven Rostedt:
http://marc.info/?l=linux-kernel&m=126150506519173&w=2

Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
---
 include/trace/events/napi.h |   25 +++++++++++++++++++++++--
 1 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/napi.h b/include/trace/events/napi.h
index 188deca..8fe1e93 100644
--- a/include/trace/events/napi.h
+++ b/include/trace/events/napi.h
@@ -6,10 +6,31 @@
 
 #include <linux/netdevice.h>
 #include <linux/tracepoint.h>
+#include <linux/ftrace.h>
+
+#define NO_DEV "(no_device)"
+
+TRACE_EVENT(napi_poll,
 
-DECLARE_TRACE(napi_poll,
 	TP_PROTO(struct napi_struct *napi),
-	TP_ARGS(napi));
+
+	TP_ARGS(napi),
+
+	TP_STRUCT__entry(
+		__field(	struct napi_struct *,	napi)
+		__string(	dev_name, napi->dev ? napi->dev->name : NO_DEV)
+	),
+
+	TP_fast_assign(
+		__entry->napi = napi;
+		__assign_str(dev_name, napi->dev ? napi->dev->name : NO_DEV);
+	),
+
+	TP_printk("napi poll on napi struct %p for device %s",
+		__entry->napi, __get_str(dev_name))
+);
+
+#undef NO_DEV
 
 #endif /* _TRACE_NAPI_H_ */
 


^ permalink raw reply related

* [RFC PATCH v3 3/5] netdev: add tracepoints to netdev layer
From: Koki Sanagi @ 2010-07-20  0:47 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, davem, kaneshige.kenji, izumi.taku, kosaki.motohiro,
	nhorman, laijs, scott.a.mcmillan, rostedt, eric.dumazet, fweisbec,
	mathieu.desnoyers
In-Reply-To: <4C44F12F.5090908@jp.fujitsu.com>

This patch adds tracepoints to dev_queue_xmit, dev_hard_start_xmit and
netif_receive_skb. These tracepoints help you monitor a network driver's
input/output.
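
As a rough illustration of how these events can be correlated (not part of
the patch), the sketch below parses trace lines in the format of the sample
output in this message and pairs net_dev_queue with net_dev_xmit by skbaddr.
The regular expression and the helper names parse() and queue_to_xmit_usec()
are assumptions made for this example, and it is written for Python 3.

```python
import re

# Assumed line layout: "<comm>-<pid> [cpu] sec.usec: event: key=val ..."
LINE_RE = re.compile(
    r"\[(?P<cpu>\d+)\]\s+(?P<sec>\d+)\.(?P<usec>\d{6}):\s+"
    r"(?P<event>\w+):\s+(?P<fields>.*)$"
)

def parse(line):
    """Return (event, timestamp in usec, dict of key=value fields) or None."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    fields = dict(kv.split("=", 1)
                  for kv in m.group("fields").split() if "=" in kv)
    # Keep the timestamp as an integer to avoid floating point rounding.
    t_usec = int(m.group("sec")) * 1000000 + int(m.group("usec"))
    return m.group("event"), t_usec, fields

def queue_to_xmit_usec(lines):
    """Pair net_dev_queue and net_dev_xmit by skbaddr; return {skbaddr: usec}."""
    queued = {}   # skbaddr -> time of net_dev_queue
    delays = {}   # skbaddr -> usec between queue and xmit
    for line in lines:
        rec = parse(line)
        if rec is None:
            continue
        event, t_usec, fields = rec
        addr = fields.get("skbaddr")
        if event == "net_dev_queue":
            queued[addr] = t_usec
        elif event == "net_dev_xmit" and addr in queued:
            delays[addr] = t_usec - queued.pop(addr)
    return delays

# The two transmit-side sample lines from this message.
sample = [
    "sshd-4445  [001] 241367.066046: net_dev_queue: dev=eth3 skbaddr=dd6b2538 len=114",
    "sshd-4445  [001] 241367.066047: net_dev_xmit: dev=eth3 skbaddr=dd6b2538 len=114 rc=0",
]
print(queue_to_xmit_usec(sample))
```

With the two sample lines, the packet at skbaddr dd6b2538 spends 1 usec
between dev_queue_xmit and dev_hard_start_xmit.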

            sshd-4445  [001] 241367.066046: net_dev_queue: dev=eth3 skbaddr=dd6b2538 len=114
            sshd-4445  [001] 241367.066047: net_dev_xmit: dev=eth3 skbaddr=dd6b2538 len=114 rc=0
          <idle>-0     [001] 241367.067472: net_dev_receive: dev=eth3 skbaddr=f5e59000 len=52

Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
---
 include/trace/events/net.h |   75 ++++++++++++++++++++++++++++++++++++++++++++
 net/core/dev.c             |    5 +++
 net/core/net-traces.c      |    1 +
 3 files changed, 81 insertions(+), 0 deletions(-)

diff --git a/include/trace/events/net.h b/include/trace/events/net.h
new file mode 100644
index 0000000..8a21361
--- /dev/null
+++ b/include/trace/events/net.h
@@ -0,0 +1,75 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM net
+
+#if !defined(_TRACE_NET_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_NET_H
+
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/ip.h>
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(net_dev_xmit,
+
+	TP_PROTO(struct sk_buff *skb,
+		 int rc),
+
+	TP_ARGS(skb, rc),
+
+	TP_STRUCT__entry(
+		__field(	void *,		skbaddr		)
+		__field(	unsigned int,	len		)
+		__field(	int,		rc		)
+		__string(	name,		skb->dev->name	)
+	),
+
+	TP_fast_assign(
+		__entry->skbaddr = skb;
+		__entry->len = skb->len;
+		__entry->rc = rc;
+		__assign_str(name, skb->dev->name);
+	),
+
+	TP_printk("dev=%s skbaddr=%p len=%u rc=%d",
+		__get_str(name), __entry->skbaddr, __entry->len, __entry->rc)
+);
+
+DECLARE_EVENT_CLASS(net_dev_template,
+
+	TP_PROTO(struct sk_buff *skb),
+
+	TP_ARGS(skb),
+
+	TP_STRUCT__entry(
+		__field(	void *,		skbaddr		)
+		__field(	unsigned int,	len		)
+		__string(	name,		skb->dev->name	)
+	),
+
+	TP_fast_assign(
+		__entry->skbaddr = skb;
+		__entry->len = skb->len;
+		__assign_str(name, skb->dev->name);
+	),
+
+	TP_printk("dev=%s skbaddr=%p len=%u",
+		__get_str(name), __entry->skbaddr, __entry->len)
+)
+
+DEFINE_EVENT(net_dev_template, net_dev_queue,
+
+	TP_PROTO(struct sk_buff *skb),
+
+	TP_ARGS(skb)
+);
+
+DEFINE_EVENT(net_dev_template, net_dev_receive,
+
+	TP_PROTO(struct sk_buff *skb),
+
+	TP_ARGS(skb)
+);
+#endif /* _TRACE_NET_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/net/core/dev.c b/net/core/dev.c
index 93b8929..4acfec6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -130,6 +130,7 @@
 #include <linux/jhash.h>
 #include <linux/random.h>
 #include <trace/events/napi.h>
+#include <trace/events/net.h>
 #include <linux/pci.h>
 
 #include "net-sysfs.h"
@@ -1955,6 +1956,7 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 		}
 
 		rc = ops->ndo_start_xmit(skb, dev);
+		trace_net_dev_xmit(skb, rc);
 		if (rc == NETDEV_TX_OK)
 			txq_trans_update(txq);
 		return rc;
@@ -1975,6 +1977,7 @@ gso:
 			skb_dst_drop(nskb);
 
 		rc = ops->ndo_start_xmit(nskb, dev);
+		trace_net_dev_xmit(nskb, rc);
 		if (unlikely(rc != NETDEV_TX_OK)) {
 			if (rc & ~NETDEV_TX_MASK)
 				goto out_kfree_gso_skb;
@@ -2165,6 +2168,7 @@ int dev_queue_xmit(struct sk_buff *skb)
 #ifdef CONFIG_NET_CLS_ACT
 	skb->tc_verd = SET_TC_AT(skb->tc_verd, AT_EGRESS);
 #endif
+	trace_net_dev_queue(skb);
 	if (q->enqueue) {
 		rc = __dev_xmit_skb(skb, q, dev, txq);
 		goto out;
@@ -2939,6 +2943,7 @@ int netif_receive_skb(struct sk_buff *skb)
 	if (netdev_tstamp_prequeue)
 		net_timestamp_check(skb);
 
+	trace_net_dev_receive(skb);
 #ifdef CONFIG_RPS
 	{
 		struct rps_dev_flow voidflow, *rflow = &voidflow;
diff --git a/net/core/net-traces.c b/net/core/net-traces.c
index afa6380..7f1bb2a 100644
--- a/net/core/net-traces.c
+++ b/net/core/net-traces.c
@@ -26,6 +26,7 @@
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/skb.h>
+#include <trace/events/net.h>
 #include <trace/events/napi.h>
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(kfree_skb);


^ permalink raw reply related

* [RFC PATCH v3 4/5] skb: add tracepoints to freeing skb
From: Koki Sanagi @ 2010-07-20  0:49 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, davem, kaneshige.kenji, izumi.taku, kosaki.motohiro,
	nhorman, laijs, scott.a.mcmillan, rostedt, eric.dumazet, fweisbec,
	mathieu.desnoyers
In-Reply-To: <4C44F12F.5090908@jp.fujitsu.com>

This patch adds tracepoints to consume_skb, dev_kfree_skb_irq and
skb_free_datagram_locked. Combined with the tracepoint on dev_hard_start_xmit,
they let us check how long it takes to free transmitted packets, and from that
calculate how many packets the driver held at a given time. This is useful when
dropped transmit packets are a problem.

          <idle>-0     [001] 241409.218333: consume_skb: skbaddr=dd6b2fb8
          <idle>-0     [001] 241409.490555: dev_kfree_skb_irq: skbaddr=f5e29840

        udp-recv-302   [001] 515031.206008: skb_free_datagram_locked: skbaddr=f5b1d900
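
The calculation described above could look roughly like the following
sketch (Python 3, not part of the patch). The function name
tx_free_report() and the event tuples are hypothetical. It assumes the
caller passes only successful transmits (rc == 0), sorted by time.

```python
# Events that mean a transmitted skb was released.
FREE_EVENTS = {"consume_skb", "dev_kfree_skb_irq"}

def tx_free_report(events):
    """events: iterable of (t_usec, event_name, skbaddr), sorted by time.

    Returns (latencies, max_in_flight): the xmit-to-free time of each skb in
    usec, and the largest number of packets handed to the driver but not yet
    freed at any one time.
    """
    in_driver = {}      # skbaddr -> time of net_dev_xmit
    latencies = []
    max_in_flight = 0
    for t_usec, name, addr in events:
        if name == "net_dev_xmit":
            in_driver[addr] = t_usec
            max_in_flight = max(max_in_flight, len(in_driver))
        elif name in FREE_EVENTS and addr in in_driver:
            latencies.append((addr, t_usec - in_driver.pop(addr)))
    return latencies, max_in_flight

# Hypothetical event sequence, for illustration only.
events = [
    (100, "net_dev_xmit", "a1"),
    (105, "net_dev_xmit", "b2"),
    (180, "consume_skb", "a1"),
    (250, "dev_kfree_skb_irq", "b2"),
]
latencies, max_in_flight = tx_free_report(events)
print(latencies, max_in_flight)
```

An skb that is transmitted but never shows up in a free event stays in
in_driver, which is one way to spot packets stuck in the driver.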


Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
---
 include/trace/events/skb.h |   42 ++++++++++++++++++++++++++++++++++++++++++
 net/core/datagram.c        |    1 +
 net/core/dev.c             |    2 ++
 net/core/skbuff.c          |    1 +
 4 files changed, 46 insertions(+), 0 deletions(-)

diff --git a/include/trace/events/skb.h b/include/trace/events/skb.h
index 4b2be6d..84c9041 100644
--- a/include/trace/events/skb.h
+++ b/include/trace/events/skb.h
@@ -35,6 +35,48 @@ TRACE_EVENT(kfree_skb,
 		__entry->skbaddr, __entry->protocol, __entry->location)
 );
 
+DECLARE_EVENT_CLASS(free_skb,
+
+	TP_PROTO(struct sk_buff *skb),
+
+	TP_ARGS(skb),
+
+	TP_STRUCT__entry(
+		__field(	void *,	skbaddr	)
+	),
+
+	TP_fast_assign(
+		__entry->skbaddr = skb;
+	),
+
+	TP_printk("skbaddr=%p", __entry->skbaddr)
+
+);
+
+DEFINE_EVENT(free_skb, consume_skb,
+
+	TP_PROTO(struct sk_buff *skb),
+
+	TP_ARGS(skb)
+
+);
+
+DEFINE_EVENT(free_skb, dev_kfree_skb_irq,
+
+	TP_PROTO(struct sk_buff *skb),
+
+	TP_ARGS(skb)
+
+);
+
+DEFINE_EVENT(free_skb, skb_free_datagram_locked,
+
+	TP_PROTO(struct sk_buff *skb),
+
+	TP_ARGS(skb)
+
+);
+
 TRACE_EVENT(skb_copy_datagram_iovec,
 
 	TP_PROTO(const struct sk_buff *skb, int len),
diff --git a/net/core/datagram.c b/net/core/datagram.c
index f5b6f43..1ea32a0 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -231,6 +231,7 @@ void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb)
 {
 	bool slow;
 
+	trace_skb_free_datagram_locked(skb);
 	if (likely(atomic_read(&skb->users) == 1))
 		smp_rmb();
 	else if (likely(!atomic_dec_and_test(&skb->users)))
diff --git a/net/core/dev.c b/net/core/dev.c
index 4acfec6..d979847 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -131,6 +131,7 @@
 #include <linux/random.h>
 #include <trace/events/napi.h>
 #include <trace/events/net.h>
+#include <trace/events/skb.h>
 #include <linux/pci.h>
 
 #include "net-sysfs.h"
@@ -1581,6 +1582,7 @@ void dev_kfree_skb_irq(struct sk_buff *skb)
 		struct softnet_data *sd;
 		unsigned long flags;
 
+		trace_dev_kfree_skb_irq(skb);
 		local_irq_save(flags);
 		sd = &__get_cpu_var(softnet_data);
 		skb->next = sd->completion_queue;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 34432b4..a7b4036 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -466,6 +466,7 @@ void consume_skb(struct sk_buff *skb)
 		smp_rmb();
 	else if (likely(!atomic_dec_and_test(&skb->users)))
 		return;
+	trace_consume_skb(skb);
 	__kfree_skb(skb);
 }
 EXPORT_SYMBOL(consume_skb);


^ permalink raw reply related

* [RFC PATCH v3 5/5] perf:add a script shows a process of packet
From: Koki Sanagi @ 2010-07-20  0:50 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, davem, kaneshige.kenji, izumi.taku, kosaki.motohiro,
	nhorman, laijs, scott.a.mcmillan, rostedt, eric.dumazet, fweisbec,
	mathieu.desnoyers
In-Reply-To: <4C44F12F.5090908@jp.fujitsu.com>

Add a perf script which shows how packets are processed and how long each
step takes. It helps us investigate networking or network device problems.

To use it, install perf and record perf.data as follows.

#perf trace record netdev-times [script]

If you specify a script, perf gathers records until it ends.
If not, you must press Ctrl-C to stop recording.

To get a report from the recorded data,

#perf trace report netdev-times [options]

You can limit the output with the following options.

tx: show only the processing of tx packets
rx: show only the processing of rx packets
dev=: show only the processing related to the specified device
debug: work in debug mode. It shows buffer status.

For example, to show the processing of received packets associated
with eth3,

#perf trace report netdev-times rx dev=eth3
79074.756672832sec cpu=1
irq_entry(+0.000000msec,irq=77:eth3)
         |------------softirq_raise(+0.001277msec)
irq_exit (+0.002278msec)     |
                      softirq_entry(+0.003562msec)
                             |
                             |---netif_receive_skb(+0.006279msec,len=100)
                             |            |
                             |   skb_copy_datagram_iovec(+0.038778msec, 2285:sshd)
                             |
                      napi_poll_exit(+0.017160msec, eth3)
                             |
                      softirq_exit(+0.018248msec)


This perf script helps us analyze the processing time of the transmit/receive
sequence.

Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
---
 tools/perf/scripts/python/bin/netdev-times-record |    8 +
 tools/perf/scripts/python/bin/netdev-times-report |    5 +
 tools/perf/scripts/python/netdev-times.py         |  478 +++++++++++++++++++++
 3 files changed, 491 insertions(+), 0 deletions(-)

diff --git a/tools/perf/scripts/python/bin/netdev-times-record b/tools/perf/scripts/python/bin/netdev-times-record
new file mode 100644
index 0000000..12da07e
--- /dev/null
+++ b/tools/perf/scripts/python/bin/netdev-times-record
@@ -0,0 +1,8 @@
+#!/bin/bash
+perf record -c 1 -f -R -a -e net:net_dev_xmit -e net:net_dev_queue	\
+		-e net:net_dev_receive -e skb:consume_skb		\
+		-e skb:kfree_skb -e skb:skb_free_datagram_locked	\
+		-e skb:dev_kfree_skb_irq -e napi:napi_poll		\
+		-e irq:irq_handler_entry -e irq:irq_handler_exit	\
+		-e irq:softirq_entry -e irq:softirq_exit		\
+		-e irq:softirq_raise -e skb:skb_copy_datagram_iovec $@
diff --git a/tools/perf/scripts/python/bin/netdev-times-report b/tools/perf/scripts/python/bin/netdev-times-report
new file mode 100644
index 0000000..c3d0a63
--- /dev/null
+++ b/tools/perf/scripts/python/bin/netdev-times-report
@@ -0,0 +1,5 @@
+#!/bin/bash
+# description: display a process of packet and processing time
+# args: [tx] [rx] [dev=] [debug]
+
+perf trace -s ~/libexec/perf-core/scripts/python/netdev-times.py $@
diff --git a/tools/perf/scripts/python/less b/tools/perf/scripts/python/less
new file mode 100644
index 0000000..e69de29
diff --git a/tools/perf/scripts/python/netdev-times.py b/tools/perf/scripts/python/netdev-times.py
new file mode 100644
index 0000000..486f16e
--- /dev/null
+++ b/tools/perf/scripts/python/netdev-times.py
@@ -0,0 +1,478 @@
+# Display a process of packets and processed time.
+# It helps us to investigate networking or network device.
+#
+# options
+# tx: show only tx chart
+# rx: show only rx chart
+# dev=: show only thing related to specified device
+# debug: work with debug mode. It shows buffer status.
+
+import os
+import sys
+
+sys.path.append(os.environ['PERF_EXEC_PATH'] + \
+	'/scripts/python/Perf-Trace-Util/lib/Perf/Trace')
+
+from perf_trace_context import *
+from Core import *
+from Util import *
+
+all_event_list = []; # insert all tracepoint event related with this script
+irq_dic = {}; # key is cpu and value is a list which stacks irqs
+              # which raise NET_RX softirq
+net_rx_dic = {}; # key is cpu and value include time of NET_RX softirq-entry
+		 # and a list which stacks receive
+receive_hunk_list = []; # a list which include a sequence of receive events
+rx_skb_list = []; # received packet list for matching
+		       # skb_copy_datagram_iovec
+
+buffer_budget = 65536; # the budget of rx_skb_list, tx_queue_list and
+		       # tx_xmit_list
+of_count_rx_skb_list = 0; # overflow count
+
+tx_queue_list = []; # list of packets which pass through dev_queue_xmit
+of_count_tx_queue_list = 0; # overflow count
+
+tx_xmit_list = [];  # list of packets which pass through dev_hard_start_xmit
+of_count_tx_xmit_list = 0; # overflow count
+
+tx_free_list = [];  # list of packets which are freed
+
+# options
+show_tx = 0;
+show_rx = 0;
+dev = 0; # store a name of device specified by option "dev="
+debug = 0;
+
+# indices of event_info tuple
+EINFO_IDX_NAME=   0
+EINFO_IDX_CONTEXT=1
+EINFO_IDX_CPU=    2
+EINFO_IDX_TIME=   3
+EINFO_IDX_PID=    4
+EINFO_IDX_COMM=   5
+
+# Calculate a time interval(msec) from src(nsec) to dst(nsec)
+def diff_msec(src, dst):
+	return (dst - src) / 1000000.0
+
+# Display a process of transmitting a packet
+def print_transmit(hunk):
+	if dev != 0 and hunk['dev'].find(dev) < 0:
+		return
+	print "%7s %5d %6d.%09dsec %12.6fmsec      %12.6fmsec" % \
+		(hunk['dev'], hunk['len'],
+		nsecs_secs(hunk['queue_t']),
+		nsecs_nsecs(hunk['queue_t']),
+		diff_msec(hunk['queue_t'], hunk['xmit_t']),
+		diff_msec(hunk['xmit_t'], hunk['free_t']))
+
+PF_IRQ_ENTRY= "irq_entry(+%fmsec,irq=%d:%s)"
+PF_IRQ_EXIT=  "irq_exit (+%fmsec)     |"
+PF_SOFT_RAISE="         |------------softirq_raise(+%fmsec)"
+PF_SOFT_ENTRY="                      softirq_entry(+%fmsec)"
+PF_SOFT_EXIT= "                      softirq_exit (+%fmsec)\n"
+PF_NAPI_POLL= "                      napi_poll_exit(+%fmsec, %s)"
+PF_JOINT=     "                             |"
+PF_WJOINT=    "                             |            |"
+PF_NET_RECV=  "                             |---netif_receive_skb" \
+				"(+%fmsec,len=%d)"
+PF_CPY_DGRAM= "                             |   skb_copy_datagram_iovec" \
+				"(+%fmsec, %d:%s)"
+PF_FREE_DGRAM="                             |   skb_free_datagram_locked" \
+				"(+%fmsec)"
+PF_KFREE_SKB= "                             |   kfree_skb" \
+				"(+%fmsec)"
+PF_CONS_SKB=  "                             |   consume_skb" \
+				"(+%fmsec)"
+
+# Display the processing of received packets and interrupts associated with
+# a NET_RX softirq
+def print_receive(hunk):
+	show_hunk = 0
+	irq_list = hunk['irq_list']
+	cpu = irq_list[0]['cpu']
+	base_t = irq_list[0]['irq_ent_t']
+	# check if this hunk should be shown
+	if dev != 0:
+		for i in range(len(irq_list)):
+			if irq_list[i]['name'].find(dev) >= 0:
+				show_hunk = 1
+				break
+	else:
+		show_hunk = 1
+	if show_hunk == 0:
+		return
+
+	print "%d.%09dsec cpu=%d" % \
+		(nsecs_secs(base_t), nsecs_nsecs(base_t), cpu)
+	for i in range(len(irq_list)):
+		print PF_IRQ_ENTRY % \
+			(diff_msec(base_t, irq_list[i]['irq_ent_t']),
+			irq_list[i]['irq'], irq_list[i]['name'])
+
+		if 'sirq_raise_t' in irq_list[i].keys():
+			print PF_SOFT_RAISE % \
+				diff_msec(base_t, irq_list[i]['sirq_raise_t'])
+
+		if 'irq_ext_t' in irq_list[i].keys():
+			print PF_IRQ_EXIT % \
+				diff_msec(base_t, irq_list[i]['irq_ext_t'])
+	if 'sirq_ent_t' not in hunk.keys():
+		print 'maybe softirq_entry is dropped'
+		return
+	print PF_SOFT_ENTRY % \
+		diff_msec(base_t, hunk['sirq_ent_t'])
+	print PF_JOINT
+	event_list = hunk['event_list']
+	for i in range(len(event_list)):
+		event = event_list[i]
+		if event['event_name'] == 'napi_poll':
+			print PF_NAPI_POLL % \
+			    (diff_msec(base_t, event['event_t']), event['dev'])
+		else:
+			print PF_NET_RECV % \
+			    (diff_msec(base_t, event['event_t']), event['len'])
+			if 'comm' in event.keys():
+				print PF_WJOINT
+				print PF_CPY_DGRAM % \
+					(diff_msec(base_t, event['comm_t']),
+					event['pid'], event['comm'])
+			elif 'handle' in event.keys():
+				print PF_WJOINT
+				if event['handle'] == \
+				    "skb_free_datagram_locked":
+					print PF_FREE_DGRAM % \
+						diff_msec(base_t,
+							event['comm_t'])
+				elif event['handle'] == "kfree_skb":
+					print PF_KFREE_SKB % \
+						diff_msec(base_t,
+							event['comm_t'])
+				elif event['handle'] == "consume_skb":
+					print PF_CONS_SKB % \
+						diff_msec(base_t,
+							event['comm_t'])
+		print PF_JOINT
+	print PF_SOFT_EXIT % diff_msec(base_t, hunk['sirq_ext_t'])
+
+def trace_begin():
+	global show_tx
+	global show_rx
+	global dev
+	global debug
+
+	for i in range(len(sys.argv)):
+		if i == 0:
+			continue
+		arg = sys.argv[i]
+		if arg == 'tx':
+			show_tx = 1
+		elif arg =='rx':
+			show_rx = 1
+		elif arg.find('dev=',0, 4) >= 0:
+			dev = arg[4:]
+		elif arg == 'debug':
+			debug = 1
+	if show_tx == 0  and show_rx == 0:
+		show_tx = 1
+		show_rx = 1
+
+def trace_end():
+	# order all events in time
+	all_event_list.sort(lambda a,b :cmp(a[EINFO_IDX_TIME],
+					    b[EINFO_IDX_TIME]))
+	# process all events
+	for i in range(len(all_event_list)):
+		event_info = all_event_list[i]
+		name = event_info[EINFO_IDX_NAME]
+		if name == 'irq__softirq_exit':
+			handle_irq_softirq_exit(event_info)
+		elif name == 'irq__softirq_entry':
+			handle_irq_softirq_entry(event_info)
+		elif name == 'irq__softirq_raise':
+			handle_irq_softirq_raise(event_info)
+		elif name == 'irq__irq_handler_entry':
+			handle_irq_handler_entry(event_info)
+		elif name == 'irq__irq_handler_exit':
+			handle_irq_handler_exit(event_info)
+		elif name == 'napi__napi_poll':
+			handle_napi_poll(event_info)
+		elif name == 'net__net_dev_receive':
+			handle_net_dev_receive(event_info)
+		elif name == 'skb__skb_copy_datagram_iovec':
+			handle_skb_copy_datagram_iovec(event_info)
+		elif name == 'net__net_dev_queue':
+			handle_net_dev_queue(event_info)
+		elif name == 'net__net_dev_xmit':
+			handle_net_dev_xmit(event_info)
+		elif name == 'skb__kfree_skb':
+			handle_kfree_skb(event_info)
+		elif name == 'skb__dev_kfree_skb_irq':
+			handle_dev_kfree_skb_irq(event_info)
+		elif name == 'skb__consume_skb':
+			handle_consume_skb(event_info)
+		elif name == 'skb__skb_free_datagram_locked':
+			handle_skb_free_datagram_locked(event_info)
+	# display receive hunks
+	if show_rx:
+		for i in range(len(receive_hunk_list)):
+			print_receive(receive_hunk_list[i])
+	# display transmit hunks
+	if show_tx:
+		print "   dev    len      Qdisc        " \
+			"       netdevice             free"
+		for i in range(len(tx_free_list)):
+			print_transmit(tx_free_list[i])
+	if debug:
+		print "debug buffer status"
+		print "----------------------------"
+		print "xmit Qdisc:remain:%d overflow:%d" % \
+			(len(tx_queue_list), of_count_tx_queue_list)
+		print "xmit netdevice:remain:%d overflow:%d" % \
+			(len(tx_xmit_list), of_count_tx_xmit_list)
+		print "receive:remain:%d overflow:%d" % \
+			(len(rx_skb_list), of_count_rx_skb_list)
+
+# called from perf, when it finds a corresponding event
+def irq__softirq_entry(name, context, cpu, sec, nsec, pid, comm, vec):
+	if symbol_str("irq__softirq_entry", "vec", vec) != "NET_RX":
+		return
+	event_info = (name, context, cpu, nsecs(sec, nsec), pid, comm, vec)
+	all_event_list.append(event_info)
+
+def irq__softirq_exit(name, context, cpu, sec, nsec, pid, comm, vec):
+	if symbol_str("irq__softirq_entry", "vec", vec) != "NET_RX":
+		return
+	event_info = (name, context, cpu, nsecs(sec, nsec), pid, comm, vec)
+	all_event_list.append(event_info)
+
+def irq__softirq_raise(name, context, cpu, sec, nsec, pid, comm, vec):
+	if symbol_str("irq__softirq_entry", "vec", vec) != "NET_RX":
+		return
+	event_info = (name, context, cpu, nsecs(sec, nsec), pid, comm, vec)
+	all_event_list.append(event_info)
+
+def irq__irq_handler_entry(name, context, cpu, sec, nsec, pid, comm,
+			irq, irq_name):
+	event_info = (name, context, cpu, nsecs(sec, nsec), pid, comm,
+			irq, irq_name)
+	all_event_list.append(event_info)
+
+def irq__irq_handler_exit(name, context, cpu, sec, nsec, pid, comm, irq, ret):
+	event_info = (name, context, cpu, nsecs(sec, nsec), pid, comm, irq, ret)
+	all_event_list.append(event_info)
+
+def napi__napi_poll(name, context, cpu, sec, nsec, pid, comm, napi, dev_name):
+	event_info = (name, context, cpu, nsecs(sec, nsec), pid, comm,
+			napi, dev_name)
+	all_event_list.append(event_info)
+
+def net__net_dev_receive(name, context, cpu, sec, nsec, pid, comm, skbaddr,
+			skblen, dev_name):
+	event_info = (name, context, cpu, nsecs(sec, nsec), pid, comm,
+			skbaddr, skblen, dev_name)
+	all_event_list.append(event_info)
+
+def net__net_dev_queue(name, context, cpu, sec, nsec, pid, comm,
+			skbaddr, skblen, dev_name):
+	event_info = (name, context, cpu, nsecs(sec, nsec), pid, comm,
+			skbaddr, skblen, dev_name)
+	all_event_list.append(event_info)
+
+def net__net_dev_xmit(name, context, cpu, sec, nsec, pid, comm,
+			skbaddr, skblen, rc, dev_name):
+	event_info = (name, context, cpu, nsecs(sec, nsec), pid, comm,
+			skbaddr, skblen, rc ,dev_name)
+	all_event_list.append(event_info)
+
+def skb__kfree_skb(name, context, cpu, sec, nsec, pid, comm,
+			skbaddr, protocol, location):
+	event_info = (name, context, cpu, nsecs(sec, nsec), pid, comm,
+			skbaddr, protocol, location)
+	all_event_list.append(event_info)
+
+def skb__skb_free_datagram_locked(name, context, cpu, sec, nsec, pid, comm,
+			skbaddr):
+	event_info = (name, context, cpu, nsecs(sec, nsec), pid, comm,
+			skbaddr)
+	all_event_list.append(event_info)
+
+def skb__dev_kfree_skb_irq(name, context, cpu, sec, nsec, pid, comm, skbaddr):
+	event_info = (name, context, cpu, nsecs(sec, nsec), pid, comm,
+			skbaddr)
+	all_event_list.append(event_info)
+
+def skb__consume_skb(name, context, cpu, sec, nsec, pid, comm, skbaddr):
+	event_info = (name, context, cpu, nsecs(sec, nsec), pid, comm,
+			skbaddr)
+	all_event_list.append(event_info)
+
+def skb__skb_copy_datagram_iovec(name, context, cpu, sec, nsec, pid, comm,
+	skbaddr, skblen):
+	event_info = (name, context, cpu, nsecs(sec, nsec), pid, comm,
+			skbaddr, skblen)
+	all_event_list.append(event_info)
+
+def handle_irq_softirq_exit(event_info):
+	(name, context, cpu, time, pid, comm, vec) = event_info
+	irq_list = []
+	event_list = 0
+	if cpu in irq_dic.keys():
+		irq_list = irq_dic[cpu]
+		del irq_dic[cpu]
+	if cpu in net_rx_dic.keys():
+		sirq_ent_t = net_rx_dic[cpu]['sirq_ent_t']
+		event_list = net_rx_dic[cpu]['event_list']
+		del net_rx_dic[cpu]
+	if irq_list == [] or event_list == 0:
+		return
+	rec_data = {'sirq_ent_t':sirq_ent_t, 'sirq_ext_t':time,
+		    'irq_list':irq_list, 'event_list':event_list}
+	# merge information related to a NET_RX softirq
+	receive_hunk_list.append(rec_data)
+
+def handle_irq_softirq_entry(event_info):
+	(name, context, cpu, time, pid, comm, vec) = event_info
+	net_rx_dic[cpu] = {'sirq_ent_t':time, 'event_list':[]}
+
+def handle_irq_softirq_raise(event_info):
+	(name, context, cpu, time, pid, comm, vec) = event_info
+	if cpu not in irq_dic.keys() \
+	or len(irq_dic[cpu]) == 0:
+		return
+	irq = irq_dic[cpu].pop()
+	# put a time to prev irq on the same cpu
+	irq.update({'sirq_raise_t':time})
+	irq_dic[cpu].append(irq)
+
+def handle_irq_handler_entry(event_info):
+	(name, context, cpu, time, pid, comm, irq, irq_name) = event_info
+	if cpu not in irq_dic.keys():
+		irq_dic[cpu] = []
+	irq_record = {'irq':irq, 'name':irq_name, 'cpu':cpu, 'irq_ent_t':time}
+	irq_dic[cpu].append(irq_record)
+
+def handle_irq_handler_exit(event_info):
+	(name, context, cpu, time, pid, comm, irq, ret) = event_info
+	if cpu not in irq_dic.keys():
+		return
+	irq_record = irq_dic[cpu].pop()
+	if irq != irq_record['irq']:
+		return
+	irq_record.update({'irq_ext_t':time})
+	# if an irq doesn't include NET_RX softirq, drop.
+	if 'sirq_raise_t' in irq_record.keys():
+		irq_dic[cpu].append(irq_record)
+
+def handle_napi_poll(event_info):
+	(name, context, cpu, time, pid, comm, napi, dev_name) = event_info
+	if cpu in net_rx_dic.keys():
+		event_list = net_rx_dic[cpu]['event_list']
+		rec_data = {'event_name':'napi_poll',
+				'dev':dev_name, 'event_t':time}
+		event_list.append(rec_data)
+
+def handle_net_dev_receive(event_info):
+	global of_count_rx_skb_list
+
+	(name, context, cpu, time, pid, comm,
+		skbaddr, skblen, dev_name) = event_info
+	if cpu in net_rx_dic.keys():
+		rec_data = {'event_name':'netif_receive_skb',
+			    'event_t':time, 'skbaddr':skbaddr, 'len':skblen}
+		event_list = net_rx_dic[cpu]['event_list']
+		event_list.append(rec_data)
+		rx_skb_list.insert(0, rec_data)
+		if len(rx_skb_list) > buffer_budget:
+			rx_skb_list.pop()
+			of_count_rx_skb_list += 1
+
+def handle_net_dev_queue(event_info):
+	global of_count_tx_xmit_list
+
+	(name, context, cpu, time, pid, comm,
+		skbaddr, skblen, dev_name) = event_info
+	skb = {'dev':dev_name, 'skbaddr':skbaddr, 'len':skblen, 'queue_t':time}
+	tx_xmit_list.insert(0, skb)
+	if len(tx_xmit_list) > buffer_budget:
+		tx_xmit_list.pop()
+		of_count_tx_xmit_list += 1
+
+def handle_net_dev_xmit(event_info):
+	global of_count_tx_queue_list
+
+	(name, context, cpu, time, pid, comm,
+		skbaddr, skblen, rc, dev_name) = event_info
+	if rc == 0: # NETDEV_TX_OK
+		for i in range(len(tx_xmit_list)):
+			skb = tx_xmit_list[i]
+			if skb['skbaddr'] == skbaddr:
+				skb['xmit_t'] = time
+				tx_queue_list.insert(0, skb)
+				del tx_xmit_list[i]
+				if len(tx_queue_list) > buffer_budget:
+					tx_queue_list.pop()
+					of_count_tx_queue_list += 1
+				return
+
+def handle_kfree_skb(event_info):
+	(name, context, cpu, time, pid, comm,
+		skbaddr, protocol, location) = event_info
+	for i in range(len(tx_queue_list)):
+		skb = tx_queue_list[i]
+		if skb['skbaddr'] == skbaddr:
+			del tx_queue_list[i]
+			return
+	for i in range(len(tx_xmit_list)):
+		skb = tx_xmit_list[i]
+		if skb['skbaddr'] == skbaddr:
+			del tx_xmit_list[i]
+			return
+	for i in range(len(rx_skb_list)):
+		rec_data = rx_skb_list[i]
+		if rec_data['skbaddr'] == skbaddr:
+			rec_data.update({'handle':"kfree_skb",
+					'comm':comm, 'pid':pid, 'comm_t':time})
+			del rx_skb_list[i]
+			return
+
+def handle_dev_kfree_skb_irq(event_info):
+	(name, context, cpu, time, pid, comm, skbaddr) = event_info
+	for i in range(len(tx_queue_list)):
+		skb = tx_queue_list[i]
+		if skb['skbaddr'] == skbaddr:
+			skb['free_t'] = time
+			tx_free_list.append(skb)
+			del tx_queue_list[i]
+			return
+
+def handle_consume_skb(event_info):
+	(name, context, cpu, time, pid, comm, skbaddr) = event_info
+	for i in range(len(tx_queue_list)):
+		skb = tx_queue_list[i]
+		if skb['skbaddr'] == skbaddr:
+			skb['free_t'] = time
+			tx_free_list.append(skb)
+			del tx_queue_list[i]
+			return
+
+def handle_skb_free_datagram_locked(event_info):
+	(name, context, cpu, time, pid, comm, skbaddr) = event_info
+	for i in range(len(rx_skb_list)):
+		rec_data = rx_skb_list[i]
+		if skbaddr == rec_data['skbaddr']:
+			rec_data.update({'handle':"skb_free_datagram_locked",
+					'comm':comm, 'pid':pid, 'comm_t':time})
+			del rx_skb_list[i]
+			return
+
+def handle_skb_copy_datagram_iovec(event_info):
+	(name, context, cpu, time, pid, comm, skbaddr, skblen) = event_info
+	for i in range(len(rx_skb_list)):
+		rec_data = rx_skb_list[i]
+		if skbaddr == rec_data['skbaddr']:
+			rec_data.update({'handle':"skb_copy_datagram_iovec",
+					'comm':comm, 'pid':pid, 'comm_t':time})
+			del rx_skb_list[i]
+			return

^ permalink raw reply related

* [PATCH net-next 0/3] cxgb4vf: fixes for several small issues discovered by QA
From: Casey Leedom @ 2010-07-20  1:12 UTC (permalink / raw)
  To: netdev

  A couple of small (but important) fixes discovered by our QA people.  I've also 
included a patch to add myself as the maintainer of cxgb4vf in the MAINTAINERS 
file which I think is the protocol but please correct me if changes to that file 
are usually performed by someone else.

Casey

^ permalink raw reply

* Re: Question about way that NICs deliver packets to the kernel
From: Junchang Wang @ 2010-07-20  1:15 UTC (permalink / raw)
  To: Rick Jones; +Cc: Ben Hutchings, romieu, netdev
In-Reply-To: <4C409DD6.7060903@hp.com>

On Fri, Jul 16, 2010 at 10:58:46AM -0700, Rick Jones wrote:
>>Hi Ben,
>>I added options -c -C to netperf's command line. Result is as follows:
>>                    scheme 1    scheme 2    Imp.
>>Throughput:     683M        718M       5%
>>CPU usage:     47.8%       45.6%
>>
>>That really surprised me because "top" command showed the CPU usage
>>was fluctuating between 0.5% and 1.5% rather that between 45% and 50%.
>

Hi Rick,
Very sorry for my late reply. I just recovered from my final exams. :)

>Can you tell us a bit more about the system, and which version of
>netperf you are using?  

The target machine is a Pentium Dual-core E2200 desktop with an r8169
gigabit NIC. (I couldn't find a better server with an old PCI slot.)

The other machine is a Nehalem-based system with an Intel 82576 NIC.

The target machine runs netserver and the Nehalem machine runs netperf.
The netperf version is 2.4.5.

>Any chance that the CPU utilization you were
>looking at in top was just that being charged to netperf the process?

What I see on the target machine is as follows:

top - 21:37:12 up 21 min,  6 users,  load average: 0.43, 0.28, 0.19
Tasks: 152 total,   2 running, 149 sleeping,   0 stopped,   1 zombie
Cpu(s):  2.3%us,  1.5%sy,  0.1%ni, 89.5%id,  2.7%wa,  0.0%hi,  3.9%si,  0.0%
Mem:   2074064k total,   690200k used,  1383864k free,    39372k buffers
Swap:  2096476k total,        0k used,  2096476k free,   435044k cached

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND    
3916 root      20   0  2228  584  296 R 84.6  0.0   0:07.12 netserver    

It shows that the CPU usage of the target machine is around 10%,

while the Nehalem machine's report is as follows:

TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.2.1 (192.168.2.1) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

87380  16384  16384    10.05       679.79   1.63     48.27    1.571   11.634 

It shows that the CPU usage of the target machine is 48.27%.
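
As a sanity check that the 48.27% is a system-wide figure, the remote
service demand can be recomputed from the other columns. This is only a
sketch under two assumptions that are not confirmed in this thread: that
service demand means CPU microseconds consumed per KB (1024 bytes)
transferred, and that the dual-core E2200 counts as 2 CPUs. The function
name is made up for the example.

```python
def service_demand_us_per_kb(cpu_util_percent, num_cpus, throughput_mbit):
    """CPU microseconds consumed per KB transferred (assumed definition)."""
    # CPU time burned per second of wall time, across all CPUs, in usec.
    cpu_us_per_sec = cpu_util_percent / 100.0 * num_cpus * 1e6
    # Throughput is reported in 10^6 bits/s; convert to KB (1024 bytes) per second.
    kb_per_sec = throughput_mbit * 1e6 / 8 / 1024
    return cpu_us_per_sec / kb_per_sec

# Values from the netperf report above; num_cpus=2 is an assumption.
remote = service_demand_us_per_kb(48.27, 2, 679.79)
print("%.3f us/KB" % remote)
```

Under these assumptions the result agrees with the 11.634 us/KB in the
"Recv remote" column, which is what one would expect if 48.27% covers both
cores of the target rather than the netserver process alone.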

>"Network processing" does not often get charged to the responsible
>process, so netperf reports system-wide CPU utilization on the
>assumption it is the only thing causing the CPUs to be utilized.

My understanding of your comments is:
1) Except when it runs in ksoftirqd, network processing cannot be counted
   correctly because it runs in interrupt context, which does not get charged
   to the correct process. So "top" misses a lot of CPU usage when the network
   interrupt rate is high.
2) As you mention in netperf's manual, netperf uses /proc/stat on Linux
   to retrieve the time spent in idle mode. In other words, it accumulates the
   CPU time spent in all other modes, including hardware interrupt, software
   interrupt, etc., making the CPU usage more accurate in high interrupt
   situations.
3) Since most processes on the target machine are sleeping, the CPU usage
   of network processing is actually very close to 48.27%. Right?

Correct me if any of them are incorrect. Thanks.
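To make point 2 concrete, here is a minimal sketch of deriving system-wide
CPU utilization from two samples of the aggregate "cpu" line in /proc/stat.
It is not netperf's actual code. The struct and function names are
hypothetical, and treating only the idle column as idle time (so iowait,
irq and softirq all count as busy) is an assumption.

```c
#include <stdio.h>
#include <string.h>

/* Tick counters from the aggregate "cpu" line of /proc/stat. */
struct cpu_sample {
    unsigned long long user, nice, system, idle, iowait, irq, softirq, steal;
};

/* Parse a line such as "cpu  100 0 100 800 0 0 0 0"; 0 on success. */
int parse_cpu_line(const char *line, struct cpu_sample *s)
{
    int n;

    /* Accept only the aggregate line, not per-CPU lines like "cpu0". */
    if (strncmp(line, "cpu ", 4) != 0)
        return -1;

    memset(s, 0, sizeof(*s));
    n = sscanf(line + 4, "%llu %llu %llu %llu %llu %llu %llu %llu",
               &s->user, &s->nice, &s->system, &s->idle,
               &s->iowait, &s->irq, &s->softirq, &s->steal);
    /* Older kernels report fewer columns; the first four are required. */
    return n >= 4 ? 0 : -1;
}

static unsigned long long total_ticks(const struct cpu_sample *s)
{
    return s->user + s->nice + s->system + s->idle +
           s->iowait + s->irq + s->softirq + s->steal;
}

/* Percentage of non-idle time between two samples, all CPUs combined. */
double cpu_util_percent(const struct cpu_sample *before,
                        const struct cpu_sample *after)
{
    unsigned long long dt = total_ticks(after) - total_ticks(before);
    unsigned long long di = after->idle - before->idle;

    if (dt == 0)
        return 0.0;
    return 100.0 * (double)(dt - di) / (double)dt;
}
```

Because softirq and hardirq ticks are part of the total but not of the idle
count, interrupt-context network processing shows up here even when no
process is charged for it.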

--Junchang

* Re: [PATCH] LSM: Add post accept() hook.
From: Tetsuo Handa @ 2010-07-20  1:36 UTC (permalink / raw)
  To: paul.moore
  Cc: davem, eric.dumazet, jmorris, sam, serge, netdev,
	linux-security-module
In-Reply-To: <201007191815.52124.paul.moore@hp.com>

Paul Moore wrote:
> I think you need to show how you plan to use this hook in an LSM before we can 
> consider merging it with mainline.  What you are proposing here is giving an 
> LSM the ability to drop a connection _after_ allowing it to be established in 
> the first place; this seems very wrong to me and I want to make sure everyone 
> else is aware of that before accepting this code into the kernel.  I 
> understand that TOMOYO's security model does not allow it to reject incoming 
> connections at the beginning of the connection request like some of the LSMs 
> currently in use, but I'm just not very happy with the idea of finishing a 
> connection handshake only to later drop the connection on the floor.

Yes. I'm planning to use security_socket_post_accept() for two purposes.

One is for dropping connections from unwanted hosts. Administrators define
policy before enabling enforcing mode (the mode in which connections are
dropped if the operation was not granted by policy). Administrators specify
acceptable hosts (i.e. hosts which this host needs to communicate with) and
unacceptable hosts (i.e. hosts which this host need not communicate with).
Connections would be dropped if some process was hijacked and that process
attempted to communicate with other processes using TCP connections. But
dropping connections should not happen under normal circumstances.

The other is for updating a process's state variable upon the accept()
operation. The LKM version of TOMOYO has a per-task_struct variable that is
used for implementing stateful permissions. (As of now, this is not
implemented in the LSM version of TOMOYO.) For example,

  allow_network TCP accept 10.0.0.0-10.255.255.255 1024-65535 ; set task.state[0]=1
  allow_network TCP accept 192.168.0.1-192.168.255.255 1024-65535 ; set task.state[0]=2

will change the current thread's task state variable to 1 if the current
thread accepted a TCP connection from 10.0.0.0-10.255.255.255, and to 2 if
it came from 192.168.0.1-192.168.255.255. This variable is used for giving
different permissions to subsequent operations. For example,

  allow_execute /bin/bash if task.state[0]=1
  allow_execute /bin/tcsh if task.state[0]=2

will allow execution of /bin/bash if the current thread is dealing with
connections from 10.0.0.0-10.255.255.255, and execution of /bin/tcsh if it
is dealing with connections from 192.168.0.1-192.168.255.255. Another
example,

  allow_network TCP accept 0.0.0.0-255.255.255.255 1024-65535 ; set task.state[0]=3
  allow_network TCP accept 0.0.0.0-255.255.255.255 1-1023 ; set task.state[0]=4

will change it to 3 if the connection comes from an unprivileged port and to
4 if it comes from a privileged port.

  allow_execute /bin/rbash if task.state[0]=3
  allow_execute /bin/bash if task.state[0]=4

will allow execution of /bin/rbash when dealing with connections from
unprivileged ports and execution of /bin/bash when dealing with connections
from privileged ports.
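
The stateful rules above can be modelled in a few lines of userspace C. This
is only a sketch of the idea, not TOMOYO code: the struct and function names
and the IP4() macro are hypothetical, the tables hard-code the first pair of
example rules, and first-match semantics is an assumption.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define IP4(a, b, c, d) \
    (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
     ((uint32_t)(c) << 8) | (uint32_t)(d))

/* "allow_network TCP accept <ip range> <port range> ; set task.state[0]=N" */
struct accept_rule {
    uint32_t ip_lo, ip_hi;
    uint16_t port_lo, port_hi;
    int new_state;
};

/* "allow_execute <path> if task.state[0]=N" */
struct exec_rule {
    const char *path;
    int required_state;
};

static const struct accept_rule accept_rules[] = {
    { IP4(10, 0, 0, 0),    IP4(10, 255, 255, 255),  1024, 65535, 1 },
    { IP4(192, 168, 0, 1), IP4(192, 168, 255, 255), 1024, 65535, 2 },
};

static const struct exec_rule exec_rules[] = {
    { "/bin/bash", 1 },
    { "/bin/tcsh", 2 },
};

/*
 * Model of the post-accept step: if a rule matches the peer, update the
 * state and return 0. Return -1 if no rule matches, which in enforcing
 * mode would mean dropping the connection; the state is left untouched.
 */
int model_post_accept(uint32_t peer_ip, uint16_t peer_port, int *state)
{
    size_t i;

    for (i = 0; i < sizeof(accept_rules) / sizeof(accept_rules[0]); i++) {
        const struct accept_rule *r = &accept_rules[i];

        if (peer_ip >= r->ip_lo && peer_ip <= r->ip_hi &&
            peer_port >= r->port_lo && peer_port <= r->port_hi) {
            *state = r->new_state;
            return 0;
        }
    }
    return -1;
}

/* Returns 1 if executing path is allowed in the given state, else 0. */
int model_may_execute(const char *path, int state)
{
    size_t i;

    for (i = 0; i < sizeof(exec_rules) / sizeof(exec_rules[0]); i++) {
        if (exec_rules[i].required_state == state &&
            strcmp(exec_rules[i].path, path) == 0)
            return 1;
    }
    return 0;
}
```

In this model a thread that accepted a connection from 10.1.2.3 port 40000
ends up in state 1 and may execute /bin/bash but not /bin/tcsh.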

LSM hooks called before sock->ops->accept() cannot change the current
thread's task state variable because doing so is racy, and an LSM hook
called after sock->ops->accept() is missing.

Strictly speaking, it would be possible to update the current thread's task
state variable in LSM hooks called by subsequent operations (e.g.
security_dentry_open(), security_bprm_set_creds()) using an approach similar
to that of tomoyo_dead_sock(), but the update can fail (e.g. with -ENOMEM)
since credentials are COW. If the update fails, I want to drop the accept()ed
connection, but that is impossible from LSM hooks called by subsequent
operations. Killing the current thread when the update fails is possible, but
that looks worse to me than dropping the connection at accept() time (because
such an action resembles the OOM killer and likely does more damage to the
caller).
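
The failure handling described above can be sketched as a small userspace
model. None of the names below are kernel APIs; they are hypothetical
stand-ins meant only to show the control flow: the hook runs after the
connection is established, and if it fails (for example with -ENOMEM) the
connection is released and the error goes back to the caller of accept().

```c
#include <errno.h>

/* Hypothetical stand-in for a post-accept hook; 0 means success. */
typedef int (*post_accept_hook_t)(int conn_id);

static int released_count; /* connections dropped by the model so far */

void model_release(int conn_id)
{
    (void)conn_id;
    released_count++;
}

int model_released_count(void)
{
    return released_count;
}

/* Illustrative stub hooks: one succeeds, one fails like a COW allocation. */
int hook_ok(int conn_id)
{
    (void)conn_id;
    return 0;
}

int hook_enomem(int conn_id)
{
    (void)conn_id;
    return -ENOMEM;
}

/*
 * conn_id stands for a connection whose handshake has already completed.
 * Returns conn_id on success, or the hook's negative error after the
 * connection has been released.
 */
int model_accept(int conn_id, post_accept_hook_t hook)
{
    int err = hook(conn_id);

    if (err) {
        model_release(conn_id);
        return err;
    }
    return conn_id;
}
```

The point of the model is that the caller only sees a failed accept() and
keeps running, instead of being killed later when a subsequent hook cannot
update the state.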

* linux-next: manual merge of the net tree with the net-current tree
From: Stephen Rothwell @ 2010-07-20  2:20 UTC (permalink / raw)
  To: David Miller, netdev
  Cc: linux-next, linux-kernel, Jeff Dike, Michael S. Tsirkin

Hi all,

Today's linux-next merge of the net tree got a conflict in
drivers/vhost/net.c between commit
1680e9063ea28099a1efa8ca11cee069cc7a9bc3 ("vhost-net: avoid flush under
lock") from the net-current tree and commit
dd1f4078f0d2de74a308f00a2dffbd550cfba59f ("vhost-net: minor cleanup")
from the net tree.

I fixed it up (see below) and can carry the fix as necessary.
-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au

diff --cc drivers/vhost/net.c
index d219070,107af9e..0000000
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@@ -527,15 -527,12 +527,14 @@@ static long vhost_net_set_backend(struc
  
  	/* start polling new socket */
  	oldsock = vq->private_data;
- 	if (sock == oldsock)
- 		goto done;
+ 	if (sock != oldsock){
+                 vhost_net_disable_vq(n, vq);
+                 rcu_assign_pointer(vq->private_data, sock);
+                 vhost_net_enable_vq(n, vq);
+ 	}
  
- 	vhost_net_disable_vq(n, vq);
- 	rcu_assign_pointer(vq->private_data, sock);
- 	vhost_net_enable_vq(n, vq);
- done:
 +	mutex_unlock(&vq->mutex);
 +
  	if (oldsock) {
  		vhost_net_flush_vq(n, index);
  		fput(oldsock->file);

* Re: linux-next: manual merge of the net tree with the net-current tree
From: Joe Perches @ 2010-07-20  2:34 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: David Miller, netdev, linux-next, linux-kernel, Jeff Dike,
	Michael S. Tsirkin
In-Reply-To: <20100720122032.88e0fcd9.sfr@canb.auug.org.au>

On Tue, 2010-07-20 at 12:20 +1000, Stephen Rothwell wrote:
> I fixed it up (see below) and can carry the fix as necessary.
@@@ -527,15 -527,12 +527,14 @@@ static long vhost_net_set_backend(struc
  
        /* start polling new socket */
        oldsock = vq->private_data;
-       if (sock == oldsock)
-               goto done;
+       if (sock != oldsock){

Trivial: missing space before open brace in commit
dd1f4078f0d2de74a308f00a2dffbd550cfba59f

* Re: [net-next-2.6 PATCH] e1000: allow option to limit number of descriptors down to 48 per ring
From: David Miller @ 2010-07-20  3:24 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, gospo, bphilips, alexander.h.duyck
In-Reply-To: <20100719234219.13875.90302.stgit@localhost.localdomain>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Mon, 19 Jul 2010 16:43:47 -0700

> From: Alexander Duyck <alexander.h.duyck@intel.com>
> 
> This change makes it possible to limit the number of descriptors down to 48
> per ring.  The reason for this change is to address a variation on hardware
> errata 10 for 82546GB in which descriptors will be lost if more than 32
> descriptors are fetched and the PCI-X MRBC is 512.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> Tested-by: Emil Tantilov <emil.s.tantilov@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Applied.

