Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [net-next-2.6 PATCH 2/5] ixgbe: drop support for UDP in RSS hash generation
From: Eric Dumazet @ 2010-07-20  6:30 UTC (permalink / raw)
  To: Bill Fink
  Cc: David Miller, jeffrey.t.kirsher, netdev, gospo, bphilips,
	alexander.h.duyck, donald.c.skidmore
In-Reply-To: <20100720020754.135b5ff7.billfink@mindspring.com>

Le mardi 20 juillet 2010 à 02:07 -0400, Bill Fink a écrit :
> On Mon, 19 Jul 2010, David Miller wrote:
> 

> Should there be a /proc or ethtool setting for whether or not to
> use RSS hashing for UDP flows?  I would think that for many common
> UDP applications, IP fragmentation would not be an issue because
> they often tend to use sub-MTU sized datagrams.  And of course
> UDP does not guarantee in-order delivery in any event.  Then a
> remaining issue is what the default setting of such an option
> should be.  I would lean to having it enabled by default, but
> I can also see the safety argument for having it off by default.
> 

Their are several issues here.

1) Ability for the NIC to spread UDP loads on several queues.

2) Ability for the NIC to provide the hash to our stack, to speedup a
bit RPS.

If the patch is about 1), ie disables NIC ability to split UDP flows on
several RX queues, then yes : its probably _not_ wanted.

Commit message is not very clear on this topic.

By nature, UDP flows are subject to out of order issues, so what is this
patch tries to avoid ?

^ permalink raw reply

* Re: [0/8] netpoll/bridge fixes
From: David Miller @ 2010-07-20  6:28 UTC (permalink / raw)
  To: herbert; +Cc: mst, shemminger, frzhang, netdev, amwang, mpm
In-Reply-To: <20100720052645.GA5214@gondor.apana.org.au>

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Tue, 20 Jul 2010 13:26:45 +0800

> Note that this patch needs to be reverted in net-next-2.6 as
> bridge netpoll works correctly there.

Ok, will do.

> bridge: Partially disable netpoll support
> 
> The new netpoll code in bridging contains use-after-free bugs
> that are non-trivial to fix.
> 
> This patch fixes this by removing the code that uses skbs after
> they're freed.
> 
> As a consequence, this means that we can no longer call bridge
> from the netpoll path, so this patch also removes the controller
> function in order to disable netpoll.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Applied, thanks Herbert.

^ permalink raw reply

* Re: [net-next-2.6 PATCH 2/5] ixgbe: drop support for UDP in RSS hash generation
From: Bill Fink @ 2010-07-20  6:07 UTC (permalink / raw)
  To: David Miller
  Cc: jeffrey.t.kirsher, netdev, gospo, bphilips, alexander.h.duyck,
	donald.c.skidmore
In-Reply-To: <20100719.202417.218061462.davem@davemloft.net>

On Mon, 19 Jul 2010, David Miller wrote:

> From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> Date: Mon, 19 Jul 2010 16:59:27 -0700
> 
> > From: Alexander Duyck <alexander.h.duyck@intel.com>
> > 
> > This change removes UDP from the supported protocols for RSS hashing.  The
> > reason for removing this protocol is because IP fragmentation was causing a
> > network flow to be broken into two streams, one for fragmented, and one for
> > non-fragmented and this in turn was causing out-of-order issues.
> > 
> > Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> > Acked-by: Don Skidmore <donald.c.skidmore@intel.com>
> > Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> 
> Applied.

Should there be a /proc or ethtool setting for whether or not to
use RSS hashing for UDP flows?  I would think that for many common
UDP applications, IP fragmentation would not be an issue because
they often tend to use sub-MTU sized datagrams.  And of course
UDP does not guarantee in-order delivery in any event.  Then a
remaining issue is what the default setting of such an option
should be.  I would lean to having it enabled by default, but
I can also see the safety argument for having it off by default.

						-Bill

^ permalink raw reply

* Re: [PATCH net-next 3/4] bnx2: Remove some unnecessary smp_mb() in tx fast path.
From: Michael Chan @ 2010-07-20  5:48 UTC (permalink / raw)
  To: 'Eric Dumazet'; +Cc: davem@davemloft.net, netdev@vger.kernel.org
In-Reply-To: <1279604283.2458.68.camel@edumazet-laptop>

Eric Dumazet wrote:

> Excellent,
>
> Is similar patch for tg3 planned ?
>

Yes, this performance issue was actually first discovered when
profiling a new tg3 device.  So Matt should be sending out a
very similar patch as part of his next patch-set.

^ permalink raw reply

* Re: [PATCH net-next 3/4] bnx2: Remove some unnecessary smp_mb() in tx fast path.
From: Eric Dumazet @ 2010-07-20  5:38 UTC (permalink / raw)
  To: Michael Chan; +Cc: davem, netdev
In-Reply-To: <1279584905-15084-3-git-send-email-mchan@broadcom.com>

Le lundi 19 juillet 2010 à 17:15 -0700, Michael Chan a écrit :
> smp_mb() inside bnx2_tx_avail() is used twice in the normal
> bnx2_start_xmit() path (see illustration below).  The full memory
> barrier is only necessary during race conditions with tx completion.
> We can speed up the tx path by replacing smp_mb() in bnx2_tx_avail()
> with a compiler barrier.  The compiler barrier is to force the
> compiler to fetch the tx_prod and tx_cons from memory.
> 
> In the race condition between bnx2_start_xmit() and bnx2_tx_int(),
> we have the following situation:
> 
> bnx2_start_xmit()                       bnx2_tx_int()
>     if (!bnx2_tx_avail())
>             BUG();
> 
>     ...
> 
>     if (!bnx2_tx_avail())
>             netif_tx_stop_queue();          update_tx_index();
>             smp_mb();                       smp_mb();
>             if (bnx2_tx_avail())            if (netif_tx_queue_stopped() &&
>                     netif_tx_wake_queue();      bnx2_tx_avail())
> 
> With smp_mb() removed from bnx2_tx_avail(), we need to add smp_mb() to
> bnx2_start_xmit() as shown above to properly order netif_tx_stop_queue()
> and bnx2_tx_avail() to check the ring index.  If it is not strictly
> ordered, the tx queue can be stopped forever.
> 
> This improves performance by about 5% with 2 ports running bi-directional
> 64-byte packets.
> 
> Reviewed-by: Benjamin Li <benli@broadcom.com>
> Reviewed-by: Matt Carlson <mcarlson@broadcom.com>
> Signed-off-by: Michael Chan <mchan@broadcom.com>
> ---
>  drivers/net/bnx2.c |   10 +++++++++-
>  1 files changed, 9 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
> index d44ecc3..2af570d 100644
> --- a/drivers/net/bnx2.c
> +++ b/drivers/net/bnx2.c
> @@ -253,7 +253,8 @@ static inline u32 bnx2_tx_avail(struct bnx2 *bp, struct bnx2_tx_ring_info *txr)
>  {
>  	u32 diff;
>  
> -	smp_mb();
> +	/* Tell compiler to fetch tx_prod and tx_cons from memory. */
> +	barrier();
>  
>  	/* The ring uses 256 indices for 255 entries, one of them
>  	 * needs to be skipped.
> @@ -6534,6 +6535,13 @@ bnx2_start_xmit(struct sk_buff *skb, struct net_device *dev)
>  
>  	if (unlikely(bnx2_tx_avail(bp, txr) <= MAX_SKB_FRAGS)) {
>  		netif_tx_stop_queue(txq);
> +
> +		/* netif_tx_stop_queue() must be done before checking
> +		 * tx index in bnx2_tx_avail() below, because in
> +		 * bnx2_tx_int(), we update tx index before checking for
> +		 * netif_tx_queue_stopped().
> +		 */
> +		smp_mb();
>  		if (bnx2_tx_avail(bp, txr) > bp->tx_wake_thresh)
>  			netif_tx_wake_queue(txq);
>  	}

Excellent,

Is similar patch for tg3 planned ?

Thanks




^ permalink raw reply

* Re: [0/8] netpoll/bridge fixes
From: Herbert Xu @ 2010-07-20  5:26 UTC (permalink / raw)
  To: David Miller; +Cc: mst, shemminger, frzhang, netdev, amwang, mpm
In-Reply-To: <20100719.090503.73693858.davem@davemloft.net>

On Mon, Jul 19, 2010 at 09:05:03AM -0700, David Miller wrote:
>
> I thought we did that already.... oh I see, we did it for bonding:
> 
> commit c22d7ac844f1cb9c6a5fd20f89ebadc2feef891b
> Author: Andy Gospodarek <andy@greyhouse.net>
> Date:   Fri Jun 25 09:50:44 2010 +0000
> 
>     bonding: prevent netpoll over bonded interfaces
> 
> I'm fine with disabling it for bridging too, just send me a patch
> similar to the bonding one.

OK, here is a minimal patch to remove the offending bits of the
bridge netpoll support.

Note that this patch needs to be reverted in net-next-2.6 as
bridge netpoll works correctly there.

bridge: Partially disable netpoll support

The new netpoll code in bridging contains use-after-free bugs
that are non-trivial to fix.

This patch fixes this by removing the code that uses skbs after
they're freed.

As a consequence, this means that we can no longer call bridge
from the netpoll path, so this patch also removes the controller
function in order to disable netpoll.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index eedf2c9..753fc42 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -217,14 +217,6 @@ static bool br_devices_support_netpoll(struct net_bridge *br)
 	return count != 0 && ret;
 }
 
-static void br_poll_controller(struct net_device *br_dev)
-{
-	struct netpoll *np = br_dev->npinfo->netpoll;
-
-	if (np->real_dev != br_dev)
-		netpoll_poll_dev(np->real_dev);
-}
-
 void br_netpoll_cleanup(struct net_device *dev)
 {
 	struct net_bridge *br = netdev_priv(dev);
@@ -295,7 +287,6 @@ static const struct net_device_ops br_netdev_ops = {
 	.ndo_do_ioctl		 = br_dev_ioctl,
 #ifdef CONFIG_NET_POLL_CONTROLLER
 	.ndo_netpoll_cleanup	 = br_netpoll_cleanup,
-	.ndo_poll_controller	 = br_poll_controller,
 #endif
 };
 
diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index a4e72a8..595da45 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -50,14 +50,7 @@ int br_dev_queue_push_xmit(struct sk_buff *skb)
 			kfree_skb(skb);
 		else {
 			skb_push(skb, ETH_HLEN);
-
-#ifdef CONFIG_NET_POLL_CONTROLLER
-			if (unlikely(skb->dev->priv_flags & IFF_IN_NETPOLL)) {
-				netpoll_send_skb(skb->dev->npinfo->netpoll, skb);
-				skb->dev->priv_flags &= ~IFF_IN_NETPOLL;
-			} else
-#endif
-				dev_queue_xmit(skb);
+			dev_queue_xmit(skb);
 		}
 	}
 
@@ -73,23 +66,9 @@ int br_forward_finish(struct sk_buff *skb)
 
 static void __br_deliver(const struct net_bridge_port *to, struct sk_buff *skb)
 {
-#ifdef CONFIG_NET_POLL_CONTROLLER
-	struct net_bridge *br = to->br;
-	if (unlikely(br->dev->priv_flags & IFF_IN_NETPOLL)) {
-		struct netpoll *np;
-		to->dev->npinfo = skb->dev->npinfo;
-		np = skb->dev->npinfo->netpoll;
-		np->real_dev = np->dev = to->dev;
-		to->dev->priv_flags |= IFF_IN_NETPOLL;
-	}
-#endif
 	skb->dev = to->dev;
 	NF_HOOK(NFPROTO_BRIDGE, NF_BR_LOCAL_OUT, skb, NULL, skb->dev,
 		br_forward_finish);
-#ifdef CONFIG_NET_POLL_CONTROLLER
-	if (skb->dev->npinfo)
-		skb->dev->npinfo->netpoll->dev = br->dev;
-#endif
 }
 
 static void __br_forward(const struct net_bridge_port *to, struct sk_buff *skb)

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related

* Re: Very low latency TCP for clusters
From: Eric Dumazet @ 2010-07-20  5:26 UTC (permalink / raw)
  To: Tom Herbert; +Cc: netdev, Rick Jones
In-Reply-To: <AANLkTillCVWMHHBImGEeRYy2MJYqtGIvSPvacLKJnpP2@mail.gmail.com>

Le lundi 19 juillet 2010 à 16:37 -0700, Tom Herbert a écrit :

> That's pretty pokey ;-) I see numbers around 25 usecs between to
> machines, this is with TCP_NBRR.  With TCP_RR it's more like 35 usecs,
> so eliminating the scheduler is already a big reduction.  That leaves
> 18 usecs in device time, interrupt processing, network, and cache
> misses; 7 usecs in TCP processing, user space.  While 5 usecs is an
> aggressive goal, I am not ready to concede that there's an
> architectural limit in either NICs, TCP, or sockets that can't be
> overcome.

Last time I tried TCP_NBRR, it was not working (not even compiled in), I
guess I should submit a bug report to Rick ;)




^ permalink raw reply

* Re: [PATCH net-next] drivers/acpi/acpica/utmisc.c: Use printk extension %pV
From: Joe Perches @ 2010-07-20  5:21 UTC (permalink / raw)
  To: David Miller; +Cc: lenb, linux-acpi, linux-kernel, netdev
In-Reply-To: <20100719.221630.146073821.davem@davemloft.net>

On Mon, 2010-07-19 at 22:16 -0700, David Miller wrote:
> Joe, I really can't keep merging stuff like this into the networking
> tree.

Oh no worries David.

The patch is more like a preliminary notification that at
some point after the vsprintf stuff gets merged into Linus'
tree after 2.6.36-rc1 that this change to acpi could be useful.

> But you're going to have to find another way to merge uses outside of
> networking that you want integrated.

Hey, I'm a patient guy, there's no hurry.

cheers, Joe

^ permalink raw reply

* Re: [PATCH net-next] drivers/acpi/acpica/utmisc.c: Use printk extension %pV
From: David Miller @ 2010-07-20  5:16 UTC (permalink / raw)
  To: joe; +Cc: lenb, linux-acpi, linux-kernel, netdev
In-Reply-To: <1279602367.19374.20.camel@Joe-Laptop.home>

From: Joe Perches <joe@perches.com>
Date: Mon, 19 Jul 2010 22:06:07 -0700

> Consolidates the printk messages to a single
> call so the messages can not be interleaved.
> 
> Reduces text a bit.
> 
> $ size drivers/acpi/acpica/utmisc.o.*
>    text	   data	    bss	    dec	    hex	filename
>    7822	     56	   1832	   9710	   25ee	drivers/acpi/acpica/utmisc.o.old
>    7748	     56	   1736	   9540	   2544	drivers/acpi/acpica/utmisc.o.new
> 
> Depends on net-next commit 7db6f5fb65a82af03229eef104dc9899c5eecf33
> (vsprintf: Recursive vsnprintf: Add "%pV", struct va_format)
> 
> Compile tested only
> 
> Signed-off-by: Joe Perches <joe@perches.com>

Joe, I really can't keep merging stuff like this into the networking
tree.  The driver generic bits, since the netdev print macros needed
this indirectly and you used %pV primarily for networking stuff, that's
fine.

But you're going to have to find another way to merge uses outside of
networking that you want integrated.

Thanks.

^ permalink raw reply

* [PATCH net-next] drivers/acpi/acpica/utmisc.c: Use printk extension %pV
From: Joe Perches @ 2010-07-20  5:06 UTC (permalink / raw)
  To: Len Brown; +Cc: linux-acpi, linux-kernel, netdev, David Miller

Consolidates the printk messages to a single
call so the messages can not be interleaved.

Reduces text a bit.

$ size drivers/acpi/acpica/utmisc.o.*
   text	   data	    bss	    dec	    hex	filename
   7822	     56	   1832	   9710	   25ee	drivers/acpi/acpica/utmisc.o.old
   7748	     56	   1736	   9540	   2544	drivers/acpi/acpica/utmisc.o.new

Depends on net-next commit 7db6f5fb65a82af03229eef104dc9899c5eecf33
(vsprintf: Recursive vsnprintf: Add "%pV", struct va_format)

Compile tested only

Signed-off-by: Joe Perches <joe@perches.com>
---
 drivers/acpi/acpica/utmisc.c |   78 ++++++++++++++++++++++++++++--------------
 1 files changed, 52 insertions(+), 26 deletions(-)

diff --git a/drivers/acpi/acpica/utmisc.c b/drivers/acpi/acpica/utmisc.c
index e8d0724..d9a4a64 100644
--- a/drivers/acpi/acpica/utmisc.c
+++ b/drivers/acpi/acpica/utmisc.c
@@ -53,8 +53,8 @@ ACPI_MODULE_NAME("utmisc")
 /*
  * Common suffix for messages
  */
-#define ACPI_COMMON_MSG_SUFFIX \
-	acpi_os_printf(" (%8.8X/%s-%u)\n", ACPI_CA_VERSION, module_name, line_number)
+#define ACPI_COMMON_MSG_SUFFIX_FMT  " (%8.8X/%s-%u)"
+#define ACPI_COMMON_MSG_SUFFIX_ARGS ACPI_CA_VERSION, module_name, line_number
 /*******************************************************************************
  *
  * FUNCTION:    acpi_ut_validate_exception
@@ -1063,12 +1063,16 @@ void ACPI_INTERNAL_VAR_XFACE
 acpi_error(const char *module_name, u32 line_number, const char *format, ...)
 {
 	va_list args;
-
-	acpi_os_printf("ACPI Error: ");
+	struct va_format vaf;
 
 	va_start(args, format);
-	acpi_os_vprintf(format, args);
-	ACPI_COMMON_MSG_SUFFIX;
+
+	vaf.fmt = format;
+	vaf.va = &args;
+
+	acpi_os_printf("ACPI Error: %pV" ACPI_COMMON_MSG_SUFFIX_FMT "\n",
+		       &vaf, ACPI_COMMON_MSG_SUFFIX_ARGS);
+
 	va_end(args);
 }
 
@@ -1077,12 +1081,18 @@ acpi_exception(const char *module_name,
 	       u32 line_number, acpi_status status, const char *format, ...)
 {
 	va_list args;
-
-	acpi_os_printf("ACPI Exception: %s, ", acpi_format_exception(status));
+	struct va_format vaf;
 
 	va_start(args, format);
-	acpi_os_vprintf(format, args);
-	ACPI_COMMON_MSG_SUFFIX;
+
+	vaf.fmt = format;
+	vaf.va = &args;
+
+	acpi_os_printf("ACPI Exception: %s, %pV"
+		       ACPI_COMMON_MSG_SUFFIX_FMT "\n",
+		       acpi_format_exception(status), &vaf,
+		       ACPI_COMMON_MSG_SUFFIX_ARGS);
+
 	va_end(args);
 }
 
@@ -1090,12 +1100,16 @@ void ACPI_INTERNAL_VAR_XFACE
 acpi_warning(const char *module_name, u32 line_number, const char *format, ...)
 {
 	va_list args;
-
-	acpi_os_printf("ACPI Warning: ");
+	struct va_format vaf;
 
 	va_start(args, format);
-	acpi_os_vprintf(format, args);
-	ACPI_COMMON_MSG_SUFFIX;
+
+	vaf.fmt = format;
+	vaf.va = &args;
+
+	acpi_os_printf("ACPI Warning: %pV" ACPI_COMMON_MSG_SUFFIX_FMT "\n",
+		       &vaf, ACPI_COMMON_MSG_SUFFIX_ARGS);
+
 	va_end(args);
 }
 
@@ -1103,12 +1117,15 @@ void ACPI_INTERNAL_VAR_XFACE
 acpi_info(const char *module_name, u32 line_number, const char *format, ...)
 {
 	va_list args;
-
-	acpi_os_printf("ACPI: ");
+	struct va_format vaf;
 
 	va_start(args, format);
-	acpi_os_vprintf(format, args);
-	acpi_os_printf("\n");
+
+	vaf.fmt = format;
+	vaf.va = &args;
+
+	acpi_os_printf("ACPI: %pV\n", &vaf);
+
 	va_end(args);
 }
 
@@ -1143,6 +1160,7 @@ acpi_ut_predefined_warning(const char *module_name,
 			   u8 node_flags, const char *format, ...)
 {
 	va_list args;
+	struct va_format vaf;
 
 	/*
 	 * Warning messages for this method/object will be disabled after the
@@ -1152,11 +1170,15 @@ acpi_ut_predefined_warning(const char *module_name,
 		return;
 	}
 
-	acpi_os_printf("ACPI Warning for %s: ", pathname);
-
 	va_start(args, format);
-	acpi_os_vprintf(format, args);
-	ACPI_COMMON_MSG_SUFFIX;
+
+	vaf.fmt = format;
+	vaf.va = &args;
+
+	acpi_os_printf("ACPI Warning for %s: %pV"
+		       ACPI_COMMON_MSG_SUFFIX_FMT "\n",
+		       pathname, &vaf, ACPI_COMMON_MSG_SUFFIX_ARGS);
+
 	va_end(args);
 }
 
@@ -1185,6 +1207,7 @@ acpi_ut_predefined_info(const char *module_name,
 			char *pathname, u8 node_flags, const char *format, ...)
 {
 	va_list args;
+	struct va_format vaf;
 
 	/*
 	 * Warning messages for this method/object will be disabled after the
@@ -1194,10 +1217,13 @@ acpi_ut_predefined_info(const char *module_name,
 		return;
 	}
 
-	acpi_os_printf("ACPI Info for %s: ", pathname);
-
 	va_start(args, format);
-	acpi_os_vprintf(format, args);
-	ACPI_COMMON_MSG_SUFFIX;
+
+	vaf.fmt = format;
+	vaf.va = &args;
+
+	acpi_os_printf("ACPI Info for %s: %pV" ACPI_COMMON_MSG_SUFFIX_FMT "\n",
+		       pathname, &vaf, ACPI_COMMON_MSG_SUFFIX_ARGS);
+
 	va_end(args);
 }



^ permalink raw reply related

* Re: [RFC PATCH] ipv6: Make IP6CB(skb)->nhoff 16-bit.
From: David Miller @ 2010-07-20  5:01 UTC (permalink / raw)
  To: netdev
In-Reply-To: <20100715.223011.58409183.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Thu, 15 Jul 2010 22:30:11 -0700 (PDT)

> 
> Even with jumbograms I cannot see any way in which we would need
> to records a larger than 65535 valued next-header offset.
> 
> The maximum extension header length is (256 << 3) == 2048.
> There are only a handful of extension headers specified which
> we'd even accept (say 5 or 6), therefore the largest next-header
> offset we'd ever have to contend with is something less than
> say 16k.
> 
> Therefore make it a u16 instead of a u32.
> 
> Signed-off-by: David S. Miller <davem@davemloft.net>
> ---
> Can anyone find any holes in this?  I want to do this so we
> can legitimately rip ndisc_nodetype out of struct sk_buff
> and stick it into the IP6CB where it belongs.  Currently
> there isn't enough space, but with this change there will
> be.

No holes claimed by anyone, so... applied :-)

^ permalink raw reply

* Re: [RFC PATCH v3 4/5] skb: add tracepoints to freeing skb
From: Eric Dumazet @ 2010-07-20  4:54 UTC (permalink / raw)
  To: Koki Sanagi
  Cc: netdev, linux-kernel, davem, kaneshige.kenji, izumi.taku,
	kosaki.motohiro, nhorman, laijs, scott.a.mcmillan, rostedt,
	fweisbec, mathieu.desnoyers
In-Reply-To: <4C44F286.1050907@jp.fujitsu.com>

Le mardi 20 juillet 2010 à 09:49 +0900, Koki Sanagi a écrit :
> [RFC PATCH v3 4/5] skb: add tracepoints to freeing skb
> This patch adds tracepoint to consume_skb, dev_kfree_skb_irq and
> skb_free_datagram_locked. Combinating with tracepoint on dev_hard_start_xmit,
> we can check how long it takes to free transmited packets. And using it, we can
> calculate how many packets driver had at that time. It is useful when a drop of
> transmited packet is a problem.
> 
>           <idle>-0     [001] 241409.218333: consume_skb: skbaddr=dd6b2fb8
>           <idle>-0     [001] 241409.490555: dev_kfree_skb_irq: skbaddr=f5e29840
> 
>         udp-recv-302   [001] 515031.206008: skb_free_datagram_locked: skbaddr=f5b1d900
> 
> 
> Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
> ---
>  include/trace/events/skb.h |   42 ++++++++++++++++++++++++++++++++++++++++++
>  net/core/datagram.c        |    1 +
>  net/core/dev.c             |    2 ++
>  net/core/skbuff.c          |    1 +
>  4 files changed, 46 insertions(+), 0 deletions(-)
> 
> diff --git a/include/trace/events/skb.h b/include/trace/events/skb.h
> index 4b2be6d..84c9041 100644
> --- a/include/trace/events/skb.h
> +++ b/include/trace/events/skb.h
> @@ -35,6 +35,48 @@ TRACE_EVENT(kfree_skb,
>  		__entry->skbaddr, __entry->protocol, __entry->location)
>  );
>  
> +DECLARE_EVENT_CLASS(free_skb,
> +
> +	TP_PROTO(struct sk_buff *skb),
> +
> +	TP_ARGS(skb),
> +
> +	TP_STRUCT__entry(
> +		__field(	void *,	skbaddr	)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->skbaddr = skb;
> +	),
> +
> +	TP_printk("skbaddr=%p", __entry->skbaddr)
> +
> +);
> +
> +DEFINE_EVENT(free_skb, consume_skb,
> +
> +	TP_PROTO(struct sk_buff *skb),
> +
> +	TP_ARGS(skb)
> +
> +);
> +
> +DEFINE_EVENT(free_skb, dev_kfree_skb_irq,
> +
> +	TP_PROTO(struct sk_buff *skb),
> +
> +	TP_ARGS(skb)
> +
> +);
> +
> +DEFINE_EVENT(free_skb, skb_free_datagram_locked,
> +
> +	TP_PROTO(struct sk_buff *skb),
> +
> +	TP_ARGS(skb)
> +
> +);
> +
>  TRACE_EVENT(skb_copy_datagram_iovec,
>  
>  	TP_PROTO(const struct sk_buff *skb, int len),
> diff --git a/net/core/datagram.c b/net/core/datagram.c
> index f5b6f43..1ea32a0 100644
> --- a/net/core/datagram.c
> +++ b/net/core/datagram.c
> @@ -231,6 +231,7 @@ void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb)
>  {
>  	bool slow;
>  
> +	trace_skb_free_datagram_locked(skb);

Here you unconditionally trace before the test on skb->users

>  	if (likely(atomic_read(&skb->users) == 1))
>  		smp_rmb();
>  	else if (likely(!atomic_dec_and_test(&skb->users)))
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 4acfec6..d979847 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -131,6 +131,7 @@
>  #include <linux/random.h>
>  #include <trace/events/napi.h>
>  #include <trace/events/net.h>
> +#include <trace/events/skb.h>
>  #include <linux/pci.h>
>  
>  #include "net-sysfs.h"
> @@ -1581,6 +1582,7 @@ void dev_kfree_skb_irq(struct sk_buff *skb)
>  		struct softnet_data *sd;
>  		unsigned long flags;
>  
> +		trace_dev_kfree_skb_irq(skb);
>  		local_irq_save(flags);
>  		sd = &__get_cpu_var(softnet_data);
>  		skb->next = sd->completion_queue;
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 34432b4..a7b4036 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -466,6 +466,7 @@ void consume_skb(struct sk_buff *skb)
>  		smp_rmb();
>  	else if (likely(!atomic_dec_and_test(&skb->users)))
>  		return;

While here you trace _after_ the test on skb->users - and a "return;" ,
so you miss some consume_skb() calls


> +	trace_consume_skb(skb);
>  	__kfree_skb(skb);
>  }
>  EXPORT_SYMBOL(consume_skb);
> 

^ permalink raw reply

* Re: [net-next-2.6 PATCH] e1000: allow option to limit number of descriptors down to 48 per ring
From: Eric Dumazet @ 2010-07-20  4:40 UTC (permalink / raw)
  To: Jeff Kirsher; +Cc: davem, netdev, gospo, bphilips, Alexander Duyck
In-Reply-To: <20100719234219.13875.90302.stgit@localhost.localdomain>

Le lundi 19 juillet 2010 à 16:43 -0700, Jeff Kirsher a écrit :
> From: Alexander Duyck <alexander.h.duyck@intel.com>
> 
> This change makes it possible to limit the number of descriptors down to 48
> per ring.  The reason for this change is to address a variation on hardware
> errata 10 for 82546GB in which descriptors will be lost if more than 32
> descriptors are fetched and the PCI-X MRBC is 512.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> Tested-by: Emil Tantilov <emil.s.tantilov@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> ---
> 
>  drivers/net/e1000/e1000.h |    4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/e1000/e1000.h b/drivers/net/e1000/e1000.h
> index 40b62b4..65298a6 100644
> --- a/drivers/net/e1000/e1000.h
> +++ b/drivers/net/e1000/e1000.h
> @@ -86,12 +86,12 @@ struct e1000_adapter;
>  /* TX/RX descriptor defines */
>  #define E1000_DEFAULT_TXD                  256
>  #define E1000_MAX_TXD                      256
> -#define E1000_MIN_TXD                       80
> +#define E1000_MIN_TXD                       48
>  #define E1000_MAX_82544_TXD               4096
>  
>  #define E1000_DEFAULT_RXD                  256
>  #define E1000_MAX_RXD                      256
> -#define E1000_MIN_RXD                       80
> +#define E1000_MIN_RXD                       48
>  #define E1000_MAX_82544_RXD               4096
>  
>  #define E1000_MIN_ITR_USECS		10 /* 100000 irq/sec */
> 

So this limit is a pure software one ?

Why not let an admin chose a lower limit if he wants to ?

I am asking because big ring sizes can be a latency source in some
workloads.

Thanks



^ permalink raw reply

* Re: [PATCH] net: Add batman-adv meshing protocol
From: David Miller @ 2010-07-20  4:26 UTC (permalink / raw)
  To: sven.eckelmann-Mmb7MZpHnFY
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	b.a.t.m.a.n-ZwoEplunGu2X36UT3dwlltHuzzzSOjJt
In-Reply-To: <1279291156-5297-2-git-send-email-sven.eckelmann-Mmb7MZpHnFY@public.gmane.org>

From: Sven Eckelmann <sven.eckelmann-Mmb7MZpHnFY@public.gmane.org>
Date: Fri, 16 Jul 2010 16:39:16 +0200

> +/* count the hamming weight, how many good packets did we receive? just count
> + * the 1's. The inner loop uses the Kernighan algorithm, see
> + * http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan
> + */
> +int bit_packet_count(TYPE_OF_WORD *seq_bits)
> +{
> +	int i, hamming = 0;
> +	TYPE_OF_WORD word;
> +
> +	for (i = 0; i < NUM_WORDS; i++) {
> +		word = seq_bits[i];
> +
> +		while (word) {
> +			word &= word-1;
> +			hamming++;
> +		}
> +	}
> +	return hamming;
> +}

The kernel has a hamming weight library function which takes advantage
of population count instructions on cpus that suport it, and also has
a sw version than is faster than what you're doing here, please use
it.

The interfaces are called "hweight{8,16,32,64}()" where the number in
the name indicates the bit-size of the word the interface operates on.

I also notice that this code uses it's own internal buffering scheme
with kmalloc()'d buffers, then seperately allocates actual SKB's and
copies the data there.

Just use the SKB facilities how they were designed to be used, instead
of needlessly inventing new things.  Allocate your initial SKB and put
the initial forwarding header in it, then when you want to send a copy
off, skb_clone() it, and push the other bits you want at the head
and/or the tail of the cloned SKB, then simply send it off.

^ permalink raw reply

* Re: [PATCH net-next 0/3] cxgb4vf: fixes for several small issues discovered by QA
From: David Miller @ 2010-07-20  3:40 UTC (permalink / raw)
  To: leedom; +Cc: netdev
In-Reply-To: <201007191812.10459.leedom@chelsio.com>

From: Casey Leedom <leedom@chelsio.com>
Date: Mon, 19 Jul 2010 18:12:10 -0700

>   A couple of small (but important) fixes discovered by our QA people.  I've also 
> included a patch to add myself as the maintainer of cxgb4vf in the MAINTAINERS 
> file which I think is the protocol but please correct me if changes to that file 
> are usually performed by someone else.

Well, where are they?

^ permalink raw reply

* Re: [PATCH net-next 4/4] bnx2: Update version to 2.0.17.
From: David Miller @ 2010-07-20  3:31 UTC (permalink / raw)
  To: mchan; +Cc: netdev
In-Reply-To: <1279584905-15084-4-git-send-email-mchan@broadcom.com>

From: "Michael Chan" <mchan@broadcom.com>
Date: Mon, 19 Jul 2010 17:15:05 -0700

> Signed-off-by: Michael Chan <mchan@broadcom.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next 3/4] bnx2: Remove some unnecessary smp_mb() in tx fast path.
From: David Miller @ 2010-07-20  3:31 UTC (permalink / raw)
  To: mchan; +Cc: netdev
In-Reply-To: <1279584905-15084-3-git-send-email-mchan@broadcom.com>

From: "Michael Chan" <mchan@broadcom.com>
Date: Mon, 19 Jul 2010 17:15:04 -0700

> smp_mb() inside bnx2_tx_avail() is used twice in the normal
> bnx2_start_xmit() path (see illustration below).  The full memory
> barrier is only necessary during race conditions with tx completion.
> We can speed up the tx path by replacing smp_mb() in bnx2_tx_avail()
> with a compiler barrier.  The compiler barrier is to force the
> compiler to fetch the tx_prod and tx_cons from memory.
> 
> In the race condition between bnx2_start_xmit() and bnx2_tx_int(),
> we have the following situation:
> 
> bnx2_start_xmit()                       bnx2_tx_int()
>     if (!bnx2_tx_avail())
>             BUG();
> 
>     ...
> 
>     if (!bnx2_tx_avail())
>             netif_tx_stop_queue();          update_tx_index();
>             smp_mb();                       smp_mb();
>             if (bnx2_tx_avail())            if (netif_tx_queue_stopped() &&
>                     netif_tx_wake_queue();      bnx2_tx_avail())
> 
> With smp_mb() removed from bnx2_tx_avail(), we need to add smp_mb() to
> bnx2_start_xmit() as shown above to properly order netif_tx_stop_queue()
> and bnx2_tx_avail() to check the ring index.  If it is not strictly
> ordered, the tx queue can be stopped forever.
> 
> This improves performance by about 5% with 2 ports running bi-directional
> 64-byte packets.
> 
> Reviewed-by: Benjamin Li <benli@broadcom.com>
> Reviewed-by: Matt Carlson <mcarlson@broadcom.com>
> Signed-off-by: Michael Chan <mchan@broadcom.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next 2/4] bnx2: Call pci_enable_msix() with actual number of vectors.
From: David Miller @ 2010-07-20  3:31 UTC (permalink / raw)
  To: mchan; +Cc: netdev, leitao
In-Reply-To: <1279584905-15084-2-git-send-email-mchan@broadcom.com>

From: "Michael Chan" <mchan@broadcom.com>
Date: Mon, 19 Jul 2010 17:15:03 -0700

> Based on original patch by Breno Leitão <leitao@linux.vnet.ibm.com>.
> 
> Allocate the actual number of vectors and make use of fewer vectors
> if pci_enable_msix() returns > 0.  We must allocate one additional
> vector for the cnic driver.
> 
> Cc: Breno Leitão <leitao@linux.vnet.ibm.com>
> Reviewed-by: Benjamin Li <benli@broadcom.com>
> Reviewed-by: Matt Carlson <mcarlson@broadcom.com>
> Signed-off-by: Michael Chan <mchan@broadcom.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next 1/4] bnx2: Use proper counter for net_device_stats->multicast.
From: David Miller @ 2010-07-20  3:31 UTC (permalink / raw)
  To: mchan; +Cc: netdev
In-Reply-To: <1279584905-15084-1-git-send-email-mchan@broadcom.com>

From: "Michael Chan" <mchan@broadcom.com>
Date: Mon, 19 Jul 2010 17:15:02 -0700

> We were using the wrong tx multicast counter instead of the rx multicast
> counter.
> 
> Reported-by: Peter Snellman <peter.snellman@cinnober.com>
> Reviewed-by: Benjamin Li <benli@broadcom.com>
> Reviewed-by: Matt Carlson <mcarlson@broadcom.com>
> Signed-off-by: Michael Chan <mchan@broadcom.com>

Applied.

^ permalink raw reply

* Re: [net-next-2.6 PATCH 5/5] ixgbe: fix version string for ixgbe
From: David Miller @ 2010-07-20  3:24 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, gospo, bphilips, donald.c.skidmore
In-Reply-To: <20100720000044.14112.65405.stgit@localhost.localdomain>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Mon, 19 Jul 2010 17:00:47 -0700

> From: Don Skidmore <donald.c.skidmore@intel.com>
> 
> Bump the version string to better reflect what is in the driver.
> 
> Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Applied.

^ permalink raw reply

* Re: [net-next-2.6 PATCH 4/5] ixgbe: use GFP_ATOMIC when allocating FCoE DDP context from the dma pool
From: David Miller @ 2010-07-20  3:24 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, gospo, bphilips, yi.zou
In-Reply-To: <20100720000021.14112.63604.stgit@localhost.localdomain>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Mon, 19 Jul 2010 17:00:24 -0700

> From: Yi Zou <yi.zou@intel.com>
> 
> The FCoE protocol stack may hold a lock when this gets called.
> 
> Signed-off-by: Yi Zou <yi.zou@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Applied.

^ permalink raw reply

* Re: [net-next-2.6 PATCH 3/5] ixgbe: properly toggling netdev feature flags when disabling FCoE
From: David Miller @ 2010-07-20  3:24 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, gospo, bphilips, yi.zou
In-Reply-To: <20100719235949.14112.48380.stgit@localhost.localdomain>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Mon, 19 Jul 2010 16:59:52 -0700

> From: Yi Zou <yi.zou@intel.com>
> 
> When FCoE is disabled, there is a race condition that FCoE offload is
> turned off but the FCoE protocol driver is still queuing I/O thinking
> offload support still exists. This patch toggles off corresponding FCoE
> netdev feature flags and notify the FCoE stack first, allowing FCoE
> protocol stack driver to update its flags upon NETDEV_FEAT_CHANGE so no
> I/O will be using offload.
> 
> Also, indicate FCoE offload flags in vlan_features in ixgbe_probe once
> and do not toggle them in ixgbe_fcoe_enable/disable so when FCoE is
> created on the VLAN interface, vlan_transfer_features() would properly
> update the VLAN netdev features flag and notify the FCoE protocol driver
> for NETDEV_FEAT_CHANGE.
> 
> Signed-off-by: Yi Zou <yi.zou@intel.com>
> Tested-by: Ross Brattain <ross.b.brattain@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Applied.

^ permalink raw reply

* Re: [net-next-2.6 PATCH 2/5] ixgbe: drop support for UDP in RSS hash generation
From: David Miller @ 2010-07-20  3:24 UTC (permalink / raw)
  To: jeffrey.t.kirsher
  Cc: netdev, gospo, bphilips, alexander.h.duyck, donald.c.skidmore
In-Reply-To: <20100719235925.14112.65890.stgit@localhost.localdomain>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Mon, 19 Jul 2010 16:59:27 -0700

> From: Alexander Duyck <alexander.h.duyck@intel.com>
> 
> This change removes UDP from the supported protocols for RSS hashing.  The
> reason for removing this protocol is because IP fragmentation was causing a
> network flow to be broken into two streams, one for fragmented, and one for
> non-fragmented and this in turn was causing out-of-order issues.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> Acked-by: Don Skidmore <donald.c.skidmore@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Applied.

^ permalink raw reply

* Re: [net-next-2.6 PATCH 1/5] ixgbe: dcb, set DPF bit when PFC is enabled
From: David Miller @ 2010-07-20  3:24 UTC (permalink / raw)
  To: jeffrey.t.kirsher
  Cc: netdev, gospo, bphilips, john.r.fastabend, donald.c.skidmore
In-Reply-To: <20100719235831.14112.14175.stgit@localhost.localdomain>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Mon, 19 Jul 2010 16:59:03 -0700

> From: John Fastabend <john.r.fastabend@intel.com>
> 
> Set the DPF bit when PFC is enabled.  This will discard
> PFC frames so they do not get passed up the stack.
> 
> The DPF bit is set for flow control, but not priority
> flow control this brings pfc inline with fc.
> 
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
> Tested-by: Ross Brattain <ross.b.brattain@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Applied.

^ permalink raw reply

* Re: [net-next-2.6 PATCH] e1000: allow option to limit number of descriptors down to 48 per ring
From: David Miller @ 2010-07-20  3:24 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, gospo, bphilips, alexander.h.duyck
In-Reply-To: <20100719234219.13875.90302.stgit@localhost.localdomain>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Mon, 19 Jul 2010 16:43:47 -0700

> From: Alexander Duyck <alexander.h.duyck@intel.com>
> 
> This change makes it possible to limit the number of descriptors down to 48
> per ring.  The reason for this change is to address a variation on hardware
> errata 10 for 82546GB in which descriptors will be lost if more than 32
> descriptors are fetched and the PCI-X MRBC is 512.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> Tested-by: Emil Tantilov <emil.s.tantilov@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Applied.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox