Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next 3/4] bnx2: Remove some unnecessary smp_mb() in tx fast path.
From: Michael Chan @ 2010-07-20  5:48 UTC (permalink / raw)
  To: 'Eric Dumazet'; +Cc: davem@davemloft.net, netdev@vger.kernel.org
In-Reply-To: <1279604283.2458.68.camel@edumazet-laptop>

Eric Dumazet wrote:

> Excellent,
>
> Is similar patch for tg3 planned ?
>

Yes, this performance issue was actually first discovered when
profiling a new tg3 device.  So Matt should be sending out a
very similar patch as part of his next patch-set.

^ permalink raw reply

* Re: [PATCH net-next 3/4] bnx2: Remove some unnecessary smp_mb() in tx fast path.
From: Eric Dumazet @ 2010-07-20  5:38 UTC (permalink / raw)
  To: Michael Chan; +Cc: davem, netdev
In-Reply-To: <1279584905-15084-3-git-send-email-mchan@broadcom.com>

Le lundi 19 juillet 2010 à 17:15 -0700, Michael Chan a écrit :
> smp_mb() inside bnx2_tx_avail() is used twice in the normal
> bnx2_start_xmit() path (see illustration below).  The full memory
> barrier is only necessary during race conditions with tx completion.
> We can speed up the tx path by replacing smp_mb() in bnx2_tx_avail()
> with a compiler barrier.  The compiler barrier is to force the
> compiler to fetch the tx_prod and tx_cons from memory.
> 
> In the race condition between bnx2_start_xmit() and bnx2_tx_int(),
> we have the following situation:
> 
> bnx2_start_xmit()                       bnx2_tx_int()
>     if (!bnx2_tx_avail())
>             BUG();
> 
>     ...
> 
>     if (!bnx2_tx_avail())
>             netif_tx_stop_queue();          update_tx_index();
>             smp_mb();                       smp_mb();
>             if (bnx2_tx_avail())            if (netif_tx_queue_stopped() &&
>                     netif_tx_wake_queue();      bnx2_tx_avail())
> 
> With smp_mb() removed from bnx2_tx_avail(), we need to add smp_mb() to
> bnx2_start_xmit() as shown above to properly order netif_tx_stop_queue()
> and bnx2_tx_avail() to check the ring index.  If it is not strictly
> ordered, the tx queue can be stopped forever.
> 
> This improves performance by about 5% with 2 ports running bi-directional
> 64-byte packets.
> 
> Reviewed-by: Benjamin Li <benli@broadcom.com>
> Reviewed-by: Matt Carlson <mcarlson@broadcom.com>
> Signed-off-by: Michael Chan <mchan@broadcom.com>
> ---
>  drivers/net/bnx2.c |   10 +++++++++-
>  1 files changed, 9 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
> index d44ecc3..2af570d 100644
> --- a/drivers/net/bnx2.c
> +++ b/drivers/net/bnx2.c
> @@ -253,7 +253,8 @@ static inline u32 bnx2_tx_avail(struct bnx2 *bp, struct bnx2_tx_ring_info *txr)
>  {
>  	u32 diff;
>  
> -	smp_mb();
> +	/* Tell compiler to fetch tx_prod and tx_cons from memory. */
> +	barrier();
>  
>  	/* The ring uses 256 indices for 255 entries, one of them
>  	 * needs to be skipped.
> @@ -6534,6 +6535,13 @@ bnx2_start_xmit(struct sk_buff *skb, struct net_device *dev)
>  
>  	if (unlikely(bnx2_tx_avail(bp, txr) <= MAX_SKB_FRAGS)) {
>  		netif_tx_stop_queue(txq);
> +
> +		/* netif_tx_stop_queue() must be done before checking
> +		 * tx index in bnx2_tx_avail() below, because in
> +		 * bnx2_tx_int(), we update tx index before checking for
> +		 * netif_tx_queue_stopped().
> +		 */
> +		smp_mb();
>  		if (bnx2_tx_avail(bp, txr) > bp->tx_wake_thresh)
>  			netif_tx_wake_queue(txq);
>  	}

Excellent,

Is similar patch for tg3 planned ?

Thanks




^ permalink raw reply

* Re: [0/8] netpoll/bridge fixes
From: Herbert Xu @ 2010-07-20  5:26 UTC (permalink / raw)
  To: David Miller; +Cc: mst, shemminger, frzhang, netdev, amwang, mpm
In-Reply-To: <20100719.090503.73693858.davem@davemloft.net>

On Mon, Jul 19, 2010 at 09:05:03AM -0700, David Miller wrote:
>
> I thought we did that already.... oh I see, we did it for bonding:
> 
> commit c22d7ac844f1cb9c6a5fd20f89ebadc2feef891b
> Author: Andy Gospodarek <andy@greyhouse.net>
> Date:   Fri Jun 25 09:50:44 2010 +0000
> 
>     bonding: prevent netpoll over bonded interfaces
> 
> I'm fine with disabling it for bridging too, just send me a patch
> similar to the bonding one.

OK, here is a minimal patch to remove the offending bits of the
bridge netpoll support.

Note that this patch needs to be reverted in net-next-2.6 as
bridge netpoll works correctly there.

bridge: Partially disable netpoll support

The new netpoll code in bridging contains use-after-free bugs
that are non-trivial to fix.

This patch fixes this by removing the code that uses skbs after
they're freed.

As a consequence, this means that we can no longer call bridge
from the netpoll path, so this patch also removes the controller
function in order to disable netpoll.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index eedf2c9..753fc42 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -217,14 +217,6 @@ static bool br_devices_support_netpoll(struct net_bridge *br)
 	return count != 0 && ret;
 }
 
-static void br_poll_controller(struct net_device *br_dev)
-{
-	struct netpoll *np = br_dev->npinfo->netpoll;
-
-	if (np->real_dev != br_dev)
-		netpoll_poll_dev(np->real_dev);
-}
-
 void br_netpoll_cleanup(struct net_device *dev)
 {
 	struct net_bridge *br = netdev_priv(dev);
@@ -295,7 +287,6 @@ static const struct net_device_ops br_netdev_ops = {
 	.ndo_do_ioctl		 = br_dev_ioctl,
 #ifdef CONFIG_NET_POLL_CONTROLLER
 	.ndo_netpoll_cleanup	 = br_netpoll_cleanup,
-	.ndo_poll_controller	 = br_poll_controller,
 #endif
 };
 
diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index a4e72a8..595da45 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -50,14 +50,7 @@ int br_dev_queue_push_xmit(struct sk_buff *skb)
 			kfree_skb(skb);
 		else {
 			skb_push(skb, ETH_HLEN);
-
-#ifdef CONFIG_NET_POLL_CONTROLLER
-			if (unlikely(skb->dev->priv_flags & IFF_IN_NETPOLL)) {
-				netpoll_send_skb(skb->dev->npinfo->netpoll, skb);
-				skb->dev->priv_flags &= ~IFF_IN_NETPOLL;
-			} else
-#endif
-				dev_queue_xmit(skb);
+			dev_queue_xmit(skb);
 		}
 	}
 
@@ -73,23 +66,9 @@ int br_forward_finish(struct sk_buff *skb)
 
 static void __br_deliver(const struct net_bridge_port *to, struct sk_buff *skb)
 {
-#ifdef CONFIG_NET_POLL_CONTROLLER
-	struct net_bridge *br = to->br;
-	if (unlikely(br->dev->priv_flags & IFF_IN_NETPOLL)) {
-		struct netpoll *np;
-		to->dev->npinfo = skb->dev->npinfo;
-		np = skb->dev->npinfo->netpoll;
-		np->real_dev = np->dev = to->dev;
-		to->dev->priv_flags |= IFF_IN_NETPOLL;
-	}
-#endif
 	skb->dev = to->dev;
 	NF_HOOK(NFPROTO_BRIDGE, NF_BR_LOCAL_OUT, skb, NULL, skb->dev,
 		br_forward_finish);
-#ifdef CONFIG_NET_POLL_CONTROLLER
-	if (skb->dev->npinfo)
-		skb->dev->npinfo->netpoll->dev = br->dev;
-#endif
 }
 
 static void __br_forward(const struct net_bridge_port *to, struct sk_buff *skb)

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related

* Re: Very low latency TCP for clusters
From: Eric Dumazet @ 2010-07-20  5:26 UTC (permalink / raw)
  To: Tom Herbert; +Cc: netdev, Rick Jones
In-Reply-To: <AANLkTillCVWMHHBImGEeRYy2MJYqtGIvSPvacLKJnpP2@mail.gmail.com>

Le lundi 19 juillet 2010 à 16:37 -0700, Tom Herbert a écrit :

> That's pretty pokey ;-) I see numbers around 25 usecs between to
> machines, this is with TCP_NBRR.  With TCP_RR it's more like 35 usecs,
> so eliminating the scheduler is already a big reduction.  That leaves
> 18 usecs in device time, interrupt processing, network, and cache
> misses; 7 usecs in TCP processing, user space.  While 5 usecs is an
> aggressive goal, I am not ready to concede that there's an
> architectural limit in either NICs, TCP, or sockets that can't be
> overcome.

Last time I tried TCP_NBRR, it was not working (not even compiled in), I
guess I should submit a bug report to Rick ;)




^ permalink raw reply

* Re: [PATCH net-next] drivers/acpi/acpica/utmisc.c: Use printk extension %pV
From: Joe Perches @ 2010-07-20  5:21 UTC (permalink / raw)
  To: David Miller; +Cc: lenb, linux-acpi, linux-kernel, netdev
In-Reply-To: <20100719.221630.146073821.davem@davemloft.net>

On Mon, 2010-07-19 at 22:16 -0700, David Miller wrote:
> Joe, I really can't keep merging stuff like this into the networking
> tree.

Oh no worries David.

The patch is more like a preliminary notification that at
some point after the vsprintf stuff gets merged into Linus'
tree after 2.6.36-rc1 that this change to acpi could be useful.

> But you're going to have to find another way to merge uses outside of
> networking that you want integrated.

Hey, I'm a patient guy, there's no hurry.

cheers, Joe

^ permalink raw reply

* Re: [PATCH net-next] drivers/acpi/acpica/utmisc.c: Use printk extension %pV
From: David Miller @ 2010-07-20  5:16 UTC (permalink / raw)
  To: joe; +Cc: lenb, linux-acpi, linux-kernel, netdev
In-Reply-To: <1279602367.19374.20.camel@Joe-Laptop.home>

From: Joe Perches <joe@perches.com>
Date: Mon, 19 Jul 2010 22:06:07 -0700

> Consolidates the printk messages to a single
> call so the messages can not be interleaved.
> 
> Reduces text a bit.
> 
> $ size drivers/acpi/acpica/utmisc.o.*
>    text	   data	    bss	    dec	    hex	filename
>    7822	     56	   1832	   9710	   25ee	drivers/acpi/acpica/utmisc.o.old
>    7748	     56	   1736	   9540	   2544	drivers/acpi/acpica/utmisc.o.new
> 
> Depends on net-next commit 7db6f5fb65a82af03229eef104dc9899c5eecf33
> (vsprintf: Recursive vsnprintf: Add "%pV", struct va_format)
> 
> Compile tested only
> 
> Signed-off-by: Joe Perches <joe@perches.com>

Joe, I really can't keep merging stuff like this into the networking
tree.  The driver generic bits, since the netdev print macros needed
this indirectly and you used %pV primarily for networking stuff, that's
fine.

But you're going to have to find another way to merge uses outside of
networking that you want integrated.

Thanks.

^ permalink raw reply

* [PATCH net-next] drivers/acpi/acpica/utmisc.c: Use printk extension %pV
From: Joe Perches @ 2010-07-20  5:06 UTC (permalink / raw)
  To: Len Brown; +Cc: linux-acpi, linux-kernel, netdev, David Miller

Consolidates the printk messages to a single
call so the messages can not be interleaved.

Reduces text a bit.

$ size drivers/acpi/acpica/utmisc.o.*
   text	   data	    bss	    dec	    hex	filename
   7822	     56	   1832	   9710	   25ee	drivers/acpi/acpica/utmisc.o.old
   7748	     56	   1736	   9540	   2544	drivers/acpi/acpica/utmisc.o.new

Depends on net-next commit 7db6f5fb65a82af03229eef104dc9899c5eecf33
(vsprintf: Recursive vsnprintf: Add "%pV", struct va_format)

Compile tested only

Signed-off-by: Joe Perches <joe@perches.com>
---
 drivers/acpi/acpica/utmisc.c |   78 ++++++++++++++++++++++++++++--------------
 1 files changed, 52 insertions(+), 26 deletions(-)

diff --git a/drivers/acpi/acpica/utmisc.c b/drivers/acpi/acpica/utmisc.c
index e8d0724..d9a4a64 100644
--- a/drivers/acpi/acpica/utmisc.c
+++ b/drivers/acpi/acpica/utmisc.c
@@ -53,8 +53,8 @@ ACPI_MODULE_NAME("utmisc")
 /*
  * Common suffix for messages
  */
-#define ACPI_COMMON_MSG_SUFFIX \
-	acpi_os_printf(" (%8.8X/%s-%u)\n", ACPI_CA_VERSION, module_name, line_number)
+#define ACPI_COMMON_MSG_SUFFIX_FMT  " (%8.8X/%s-%u)"
+#define ACPI_COMMON_MSG_SUFFIX_ARGS ACPI_CA_VERSION, module_name, line_number
 /*******************************************************************************
  *
  * FUNCTION:    acpi_ut_validate_exception
@@ -1063,12 +1063,16 @@ void ACPI_INTERNAL_VAR_XFACE
 acpi_error(const char *module_name, u32 line_number, const char *format, ...)
 {
 	va_list args;
-
-	acpi_os_printf("ACPI Error: ");
+	struct va_format vaf;
 
 	va_start(args, format);
-	acpi_os_vprintf(format, args);
-	ACPI_COMMON_MSG_SUFFIX;
+
+	vaf.fmt = format;
+	vaf.va = &args;
+
+	acpi_os_printf("ACPI Error: %pV" ACPI_COMMON_MSG_SUFFIX_FMT "\n",
+		       &vaf, ACPI_COMMON_MSG_SUFFIX_ARGS);
+
 	va_end(args);
 }
 
@@ -1077,12 +1081,18 @@ acpi_exception(const char *module_name,
 	       u32 line_number, acpi_status status, const char *format, ...)
 {
 	va_list args;
-
-	acpi_os_printf("ACPI Exception: %s, ", acpi_format_exception(status));
+	struct va_format vaf;
 
 	va_start(args, format);
-	acpi_os_vprintf(format, args);
-	ACPI_COMMON_MSG_SUFFIX;
+
+	vaf.fmt = format;
+	vaf.va = &args;
+
+	acpi_os_printf("ACPI Exception: %s, %pV"
+		       ACPI_COMMON_MSG_SUFFIX_FMT "\n",
+		       acpi_format_exception(status), &vaf,
+		       ACPI_COMMON_MSG_SUFFIX_ARGS);
+
 	va_end(args);
 }
 
@@ -1090,12 +1100,16 @@ void ACPI_INTERNAL_VAR_XFACE
 acpi_warning(const char *module_name, u32 line_number, const char *format, ...)
 {
 	va_list args;
-
-	acpi_os_printf("ACPI Warning: ");
+	struct va_format vaf;
 
 	va_start(args, format);
-	acpi_os_vprintf(format, args);
-	ACPI_COMMON_MSG_SUFFIX;
+
+	vaf.fmt = format;
+	vaf.va = &args;
+
+	acpi_os_printf("ACPI Warning: %pV" ACPI_COMMON_MSG_SUFFIX_FMT "\n",
+		       &vaf, ACPI_COMMON_MSG_SUFFIX_ARGS);
+
 	va_end(args);
 }
 
@@ -1103,12 +1117,15 @@ void ACPI_INTERNAL_VAR_XFACE
 acpi_info(const char *module_name, u32 line_number, const char *format, ...)
 {
 	va_list args;
-
-	acpi_os_printf("ACPI: ");
+	struct va_format vaf;
 
 	va_start(args, format);
-	acpi_os_vprintf(format, args);
-	acpi_os_printf("\n");
+
+	vaf.fmt = format;
+	vaf.va = &args;
+
+	acpi_os_printf("ACPI: %pV\n", &vaf);
+
 	va_end(args);
 }
 
@@ -1143,6 +1160,7 @@ acpi_ut_predefined_warning(const char *module_name,
 			   u8 node_flags, const char *format, ...)
 {
 	va_list args;
+	struct va_format vaf;
 
 	/*
 	 * Warning messages for this method/object will be disabled after the
@@ -1152,11 +1170,15 @@ acpi_ut_predefined_warning(const char *module_name,
 		return;
 	}
 
-	acpi_os_printf("ACPI Warning for %s: ", pathname);
-
 	va_start(args, format);
-	acpi_os_vprintf(format, args);
-	ACPI_COMMON_MSG_SUFFIX;
+
+	vaf.fmt = format;
+	vaf.va = &args;
+
+	acpi_os_printf("ACPI Warning for %s: %pV"
+		       ACPI_COMMON_MSG_SUFFIX_FMT "\n",
+		       pathname, &vaf, ACPI_COMMON_MSG_SUFFIX_ARGS);
+
 	va_end(args);
 }
 
@@ -1185,6 +1207,7 @@ acpi_ut_predefined_info(const char *module_name,
 			char *pathname, u8 node_flags, const char *format, ...)
 {
 	va_list args;
+	struct va_format vaf;
 
 	/*
 	 * Warning messages for this method/object will be disabled after the
@@ -1194,10 +1217,13 @@ acpi_ut_predefined_info(const char *module_name,
 		return;
 	}
 
-	acpi_os_printf("ACPI Info for %s: ", pathname);
-
 	va_start(args, format);
-	acpi_os_vprintf(format, args);
-	ACPI_COMMON_MSG_SUFFIX;
+
+	vaf.fmt = format;
+	vaf.va = &args;
+
+	acpi_os_printf("ACPI Info for %s: %pV" ACPI_COMMON_MSG_SUFFIX_FMT "\n",
+		       pathname, &vaf, ACPI_COMMON_MSG_SUFFIX_ARGS);
+
 	va_end(args);
 }



^ permalink raw reply related

* Re: [RFC PATCH] ipv6: Make IP6CB(skb)->nhoff 16-bit.
From: David Miller @ 2010-07-20  5:01 UTC (permalink / raw)
  To: netdev
In-Reply-To: <20100715.223011.58409183.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Thu, 15 Jul 2010 22:30:11 -0700 (PDT)

> 
> Even with jumbograms I cannot see any way in which we would need
> to records a larger than 65535 valued next-header offset.
> 
> The maximum extension header length is (256 << 3) == 2048.
> There are only a handful of extension headers specified which
> we'd even accept (say 5 or 6), therefore the largest next-header
> offset we'd ever have to contend with is something less than
> say 16k.
> 
> Therefore make it a u16 instead of a u32.
> 
> Signed-off-by: David S. Miller <davem@davemloft.net>
> ---
> Can anyone find any holes in this?  I want to do this so we
> can legitimately rip ndisc_nodetype out of struct sk_buff
> and stick it into the IP6CB where it belongs.  Currently
> there isn't enough space, but with this change there will
> be.

No holes claimed by anyone, so... applied :-)

^ permalink raw reply

* Re: [RFC PATCH v3 4/5] skb: add tracepoints to freeing skb
From: Eric Dumazet @ 2010-07-20  4:54 UTC (permalink / raw)
  To: Koki Sanagi
  Cc: netdev, linux-kernel, davem, kaneshige.kenji, izumi.taku,
	kosaki.motohiro, nhorman, laijs, scott.a.mcmillan, rostedt,
	fweisbec, mathieu.desnoyers
In-Reply-To: <4C44F286.1050907@jp.fujitsu.com>

Le mardi 20 juillet 2010 à 09:49 +0900, Koki Sanagi a écrit :
> [RFC PATCH v3 4/5] skb: add tracepoints to freeing skb
> This patch adds tracepoint to consume_skb, dev_kfree_skb_irq and
> skb_free_datagram_locked. Combinating with tracepoint on dev_hard_start_xmit,
> we can check how long it takes to free transmited packets. And using it, we can
> calculate how many packets driver had at that time. It is useful when a drop of
> transmited packet is a problem.
> 
>           <idle>-0     [001] 241409.218333: consume_skb: skbaddr=dd6b2fb8
>           <idle>-0     [001] 241409.490555: dev_kfree_skb_irq: skbaddr=f5e29840
> 
>         udp-recv-302   [001] 515031.206008: skb_free_datagram_locked: skbaddr=f5b1d900
> 
> 
> Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
> ---
>  include/trace/events/skb.h |   42 ++++++++++++++++++++++++++++++++++++++++++
>  net/core/datagram.c        |    1 +
>  net/core/dev.c             |    2 ++
>  net/core/skbuff.c          |    1 +
>  4 files changed, 46 insertions(+), 0 deletions(-)
> 
> diff --git a/include/trace/events/skb.h b/include/trace/events/skb.h
> index 4b2be6d..84c9041 100644
> --- a/include/trace/events/skb.h
> +++ b/include/trace/events/skb.h
> @@ -35,6 +35,48 @@ TRACE_EVENT(kfree_skb,
>  		__entry->skbaddr, __entry->protocol, __entry->location)
>  );
>  
> +DECLARE_EVENT_CLASS(free_skb,
> +
> +	TP_PROTO(struct sk_buff *skb),
> +
> +	TP_ARGS(skb),
> +
> +	TP_STRUCT__entry(
> +		__field(	void *,	skbaddr	)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->skbaddr = skb;
> +	),
> +
> +	TP_printk("skbaddr=%p", __entry->skbaddr)
> +
> +);
> +
> +DEFINE_EVENT(free_skb, consume_skb,
> +
> +	TP_PROTO(struct sk_buff *skb),
> +
> +	TP_ARGS(skb)
> +
> +);
> +
> +DEFINE_EVENT(free_skb, dev_kfree_skb_irq,
> +
> +	TP_PROTO(struct sk_buff *skb),
> +
> +	TP_ARGS(skb)
> +
> +);
> +
> +DEFINE_EVENT(free_skb, skb_free_datagram_locked,
> +
> +	TP_PROTO(struct sk_buff *skb),
> +
> +	TP_ARGS(skb)
> +
> +);
> +
>  TRACE_EVENT(skb_copy_datagram_iovec,
>  
>  	TP_PROTO(const struct sk_buff *skb, int len),
> diff --git a/net/core/datagram.c b/net/core/datagram.c
> index f5b6f43..1ea32a0 100644
> --- a/net/core/datagram.c
> +++ b/net/core/datagram.c
> @@ -231,6 +231,7 @@ void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb)
>  {
>  	bool slow;
>  
> +	trace_skb_free_datagram_locked(skb);

Here you unconditionally trace before the test on skb->users

>  	if (likely(atomic_read(&skb->users) == 1))
>  		smp_rmb();
>  	else if (likely(!atomic_dec_and_test(&skb->users)))
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 4acfec6..d979847 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -131,6 +131,7 @@
>  #include <linux/random.h>
>  #include <trace/events/napi.h>
>  #include <trace/events/net.h>
> +#include <trace/events/skb.h>
>  #include <linux/pci.h>
>  
>  #include "net-sysfs.h"
> @@ -1581,6 +1582,7 @@ void dev_kfree_skb_irq(struct sk_buff *skb)
>  		struct softnet_data *sd;
>  		unsigned long flags;
>  
> +		trace_dev_kfree_skb_irq(skb);
>  		local_irq_save(flags);
>  		sd = &__get_cpu_var(softnet_data);
>  		skb->next = sd->completion_queue;
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 34432b4..a7b4036 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -466,6 +466,7 @@ void consume_skb(struct sk_buff *skb)
>  		smp_rmb();
>  	else if (likely(!atomic_dec_and_test(&skb->users)))
>  		return;

While here you trace _after_ the test on skb->users - and a "return;" ,
so you miss some consume_skb() calls


> +	trace_consume_skb(skb);
>  	__kfree_skb(skb);
>  }
>  EXPORT_SYMBOL(consume_skb);
> 

^ permalink raw reply

* Re: [net-next-2.6 PATCH] e1000: allow option to limit number of descriptors down to 48 per ring
From: Eric Dumazet @ 2010-07-20  4:40 UTC (permalink / raw)
  To: Jeff Kirsher; +Cc: davem, netdev, gospo, bphilips, Alexander Duyck
In-Reply-To: <20100719234219.13875.90302.stgit@localhost.localdomain>

Le lundi 19 juillet 2010 à 16:43 -0700, Jeff Kirsher a écrit :
> From: Alexander Duyck <alexander.h.duyck@intel.com>
> 
> This change makes it possible to limit the number of descriptors down to 48
> per ring.  The reason for this change is to address a variation on hardware
> errata 10 for 82546GB in which descriptors will be lost if more than 32
> descriptors are fetched and the PCI-X MRBC is 512.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> Tested-by: Emil Tantilov <emil.s.tantilov@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> ---
> 
>  drivers/net/e1000/e1000.h |    4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/e1000/e1000.h b/drivers/net/e1000/e1000.h
> index 40b62b4..65298a6 100644
> --- a/drivers/net/e1000/e1000.h
> +++ b/drivers/net/e1000/e1000.h
> @@ -86,12 +86,12 @@ struct e1000_adapter;
>  /* TX/RX descriptor defines */
>  #define E1000_DEFAULT_TXD                  256
>  #define E1000_MAX_TXD                      256
> -#define E1000_MIN_TXD                       80
> +#define E1000_MIN_TXD                       48
>  #define E1000_MAX_82544_TXD               4096
>  
>  #define E1000_DEFAULT_RXD                  256
>  #define E1000_MAX_RXD                      256
> -#define E1000_MIN_RXD                       80
> +#define E1000_MIN_RXD                       48
>  #define E1000_MAX_82544_RXD               4096
>  
>  #define E1000_MIN_ITR_USECS		10 /* 100000 irq/sec */
> 

So this limit is a pure software one ?

Why not let an admin chose a lower limit if he wants to ?

I am asking because big ring sizes can be a latency source in some
workloads.

Thanks



^ permalink raw reply

* Re: [PATCH] net: Add batman-adv meshing protocol
From: David Miller @ 2010-07-20  4:26 UTC (permalink / raw)
  To: sven.eckelmann-Mmb7MZpHnFY
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	b.a.t.m.a.n-ZwoEplunGu2X36UT3dwlltHuzzzSOjJt
In-Reply-To: <1279291156-5297-2-git-send-email-sven.eckelmann-Mmb7MZpHnFY@public.gmane.org>

From: Sven Eckelmann <sven.eckelmann-Mmb7MZpHnFY@public.gmane.org>
Date: Fri, 16 Jul 2010 16:39:16 +0200

> +/* count the hamming weight, how many good packets did we receive? just count
> + * the 1's. The inner loop uses the Kernighan algorithm, see
> + * http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan
> + */
> +int bit_packet_count(TYPE_OF_WORD *seq_bits)
> +{
> +	int i, hamming = 0;
> +	TYPE_OF_WORD word;
> +
> +	for (i = 0; i < NUM_WORDS; i++) {
> +		word = seq_bits[i];
> +
> +		while (word) {
> +			word &= word-1;
> +			hamming++;
> +		}
> +	}
> +	return hamming;
> +}

The kernel has a hamming weight library function which takes advantage
of population count instructions on cpus that suport it, and also has
a sw version than is faster than what you're doing here, please use
it.

The interfaces are called "hweight{8,16,32,64}()" where the number in
the name indicates the bit-size of the word the interface operates on.

I also notice that this code uses it's own internal buffering scheme
with kmalloc()'d buffers, then seperately allocates actual SKB's and
copies the data there.

Just use the SKB facilities how they were designed to be used, instead
of needlessly inventing new things.  Allocate your initial SKB and put
the initial forwarding header in it, then when you want to send a copy
off, skb_clone() it, and push the other bits you want at the head
and/or the tail of the cloned SKB, then simply send it off.

^ permalink raw reply

* Re: [PATCH net-next 0/3] cxgb4vf: fixes for several small issues discovered by QA
From: David Miller @ 2010-07-20  3:40 UTC (permalink / raw)
  To: leedom; +Cc: netdev
In-Reply-To: <201007191812.10459.leedom@chelsio.com>

From: Casey Leedom <leedom@chelsio.com>
Date: Mon, 19 Jul 2010 18:12:10 -0700

>   A couple of small (but important) fixes discovered by our QA people.  I've also 
> included a patch to add myself as the maintainer of cxgb4vf in the MAINTAINERS 
> file which I think is the protocol but please correct me if changes to that file 
> are usually performed by someone else.

Well, where are they?

^ permalink raw reply

* Re: [PATCH net-next 4/4] bnx2: Update version to 2.0.17.
From: David Miller @ 2010-07-20  3:31 UTC (permalink / raw)
  To: mchan; +Cc: netdev
In-Reply-To: <1279584905-15084-4-git-send-email-mchan@broadcom.com>

From: "Michael Chan" <mchan@broadcom.com>
Date: Mon, 19 Jul 2010 17:15:05 -0700

> Signed-off-by: Michael Chan <mchan@broadcom.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next 3/4] bnx2: Remove some unnecessary smp_mb() in tx fast path.
From: David Miller @ 2010-07-20  3:31 UTC (permalink / raw)
  To: mchan; +Cc: netdev
In-Reply-To: <1279584905-15084-3-git-send-email-mchan@broadcom.com>

From: "Michael Chan" <mchan@broadcom.com>
Date: Mon, 19 Jul 2010 17:15:04 -0700

> smp_mb() inside bnx2_tx_avail() is used twice in the normal
> bnx2_start_xmit() path (see illustration below).  The full memory
> barrier is only necessary during race conditions with tx completion.
> We can speed up the tx path by replacing smp_mb() in bnx2_tx_avail()
> with a compiler barrier.  The compiler barrier is to force the
> compiler to fetch the tx_prod and tx_cons from memory.
> 
> In the race condition between bnx2_start_xmit() and bnx2_tx_int(),
> we have the following situation:
> 
> bnx2_start_xmit()                       bnx2_tx_int()
>     if (!bnx2_tx_avail())
>             BUG();
> 
>     ...
> 
>     if (!bnx2_tx_avail())
>             netif_tx_stop_queue();          update_tx_index();
>             smp_mb();                       smp_mb();
>             if (bnx2_tx_avail())            if (netif_tx_queue_stopped() &&
>                     netif_tx_wake_queue();      bnx2_tx_avail())
> 
> With smp_mb() removed from bnx2_tx_avail(), we need to add smp_mb() to
> bnx2_start_xmit() as shown above to properly order netif_tx_stop_queue()
> and bnx2_tx_avail() to check the ring index.  If it is not strictly
> ordered, the tx queue can be stopped forever.
> 
> This improves performance by about 5% with 2 ports running bi-directional
> 64-byte packets.
> 
> Reviewed-by: Benjamin Li <benli@broadcom.com>
> Reviewed-by: Matt Carlson <mcarlson@broadcom.com>
> Signed-off-by: Michael Chan <mchan@broadcom.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next 2/4] bnx2: Call pci_enable_msix() with actual number of vectors.
From: David Miller @ 2010-07-20  3:31 UTC (permalink / raw)
  To: mchan; +Cc: netdev, leitao
In-Reply-To: <1279584905-15084-2-git-send-email-mchan@broadcom.com>

From: "Michael Chan" <mchan@broadcom.com>
Date: Mon, 19 Jul 2010 17:15:03 -0700

> Based on original patch by Breno Leitão <leitao@linux.vnet.ibm.com>.
> 
> Allocate the actual number of vectors and make use of fewer vectors
> if pci_enable_msix() returns > 0.  We must allocate one additional
> vector for the cnic driver.
> 
> Cc: Breno Leitão <leitao@linux.vnet.ibm.com>
> Reviewed-by: Benjamin Li <benli@broadcom.com>
> Reviewed-by: Matt Carlson <mcarlson@broadcom.com>
> Signed-off-by: Michael Chan <mchan@broadcom.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next 1/4] bnx2: Use proper counter for net_device_stats->multicast.
From: David Miller @ 2010-07-20  3:31 UTC (permalink / raw)
  To: mchan; +Cc: netdev
In-Reply-To: <1279584905-15084-1-git-send-email-mchan@broadcom.com>

From: "Michael Chan" <mchan@broadcom.com>
Date: Mon, 19 Jul 2010 17:15:02 -0700

> We were using the wrong tx multicast counter instead of the rx multicast
> counter.
> 
> Reported-by: Peter Snellman <peter.snellman@cinnober.com>
> Reviewed-by: Benjamin Li <benli@broadcom.com>
> Reviewed-by: Matt Carlson <mcarlson@broadcom.com>
> Signed-off-by: Michael Chan <mchan@broadcom.com>

Applied.

^ permalink raw reply

* Re: [net-next-2.6 PATCH 5/5] ixgbe: fix version string for ixgbe
From: David Miller @ 2010-07-20  3:24 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, gospo, bphilips, donald.c.skidmore
In-Reply-To: <20100720000044.14112.65405.stgit@localhost.localdomain>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Mon, 19 Jul 2010 17:00:47 -0700

> From: Don Skidmore <donald.c.skidmore@intel.com>
> 
> Bump the version string to better reflect what is in the driver.
> 
> Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Applied.

^ permalink raw reply

* Re: [net-next-2.6 PATCH 4/5] ixgbe: use GFP_ATOMIC when allocating FCoE DDP context from the dma pool
From: David Miller @ 2010-07-20  3:24 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, gospo, bphilips, yi.zou
In-Reply-To: <20100720000021.14112.63604.stgit@localhost.localdomain>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Mon, 19 Jul 2010 17:00:24 -0700

> From: Yi Zou <yi.zou@intel.com>
> 
> The FCoE protocol stack may hold a lock when this gets called.
> 
> Signed-off-by: Yi Zou <yi.zou@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Applied.

^ permalink raw reply

* Re: [net-next-2.6 PATCH 3/5] ixgbe: properly toggling netdev feature flags when disabling FCoE
From: David Miller @ 2010-07-20  3:24 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, gospo, bphilips, yi.zou
In-Reply-To: <20100719235949.14112.48380.stgit@localhost.localdomain>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Mon, 19 Jul 2010 16:59:52 -0700

> From: Yi Zou <yi.zou@intel.com>
> 
> When FCoE is disabled, there is a race condition that FCoE offload is
> turned off but the FCoE protocol driver is still queuing I/O thinking
> offload support still exists. This patch toggles off corresponding FCoE
> netdev feature flags and notify the FCoE stack first, allowing FCoE
> protocol stack driver to update its flags upon NETDEV_FEAT_CHANGE so no
> I/O will be using offload.
> 
> Also, indicate FCoE offload flags in vlan_features in ixgbe_probe once
> and do not toggle them in ixgbe_fcoe_enable/disable so when FCoE is
> created on the VLAN interface, vlan_transfer_features() would properly
> update the VLAN netdev features flag and notify the FCoE protocol driver
> for NETDEV_FEAT_CHANGE.
> 
> Signed-off-by: Yi Zou <yi.zou@intel.com>
> Tested-by: Ross Brattain <ross.b.brattain@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Applied.

^ permalink raw reply

* Re: [net-next-2.6 PATCH 2/5] ixgbe: drop support for UDP in RSS hash generation
From: David Miller @ 2010-07-20  3:24 UTC (permalink / raw)
  To: jeffrey.t.kirsher
  Cc: netdev, gospo, bphilips, alexander.h.duyck, donald.c.skidmore
In-Reply-To: <20100719235925.14112.65890.stgit@localhost.localdomain>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Mon, 19 Jul 2010 16:59:27 -0700

> From: Alexander Duyck <alexander.h.duyck@intel.com>
> 
> This change removes UDP from the supported protocols for RSS hashing.  The
> reason for removing this protocol is because IP fragmentation was causing a
> network flow to be broken into two streams, one for fragmented, and one for
> non-fragmented and this in turn was causing out-of-order issues.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> Acked-by: Don Skidmore <donald.c.skidmore@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Applied.

^ permalink raw reply

* Re: [net-next-2.6 PATCH 1/5] ixgbe: dcb, set DPF bit when PFC is enabled
From: David Miller @ 2010-07-20  3:24 UTC (permalink / raw)
  To: jeffrey.t.kirsher
  Cc: netdev, gospo, bphilips, john.r.fastabend, donald.c.skidmore
In-Reply-To: <20100719235831.14112.14175.stgit@localhost.localdomain>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Mon, 19 Jul 2010 16:59:03 -0700

> From: John Fastabend <john.r.fastabend@intel.com>
> 
> Set the DPF bit when PFC is enabled.  This will discard
> PFC frames so they do not get passed up the stack.
> 
> The DPF bit is set for flow control, but not priority
> flow control this brings pfc inline with fc.
> 
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
> Tested-by: Ross Brattain <ross.b.brattain@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Applied.

^ permalink raw reply

* Re: [net-next-2.6 PATCH] e1000: allow option to limit number of descriptors down to 48 per ring
From: David Miller @ 2010-07-20  3:24 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, gospo, bphilips, alexander.h.duyck
In-Reply-To: <20100719234219.13875.90302.stgit@localhost.localdomain>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Mon, 19 Jul 2010 16:43:47 -0700

> From: Alexander Duyck <alexander.h.duyck@intel.com>
> 
> This change makes it possible to limit the number of descriptors down to 48
> per ring.  The reason for this change is to address a variation on hardware
> errata 10 for 82546GB in which descriptors will be lost if more than 32
> descriptors are fetched and the PCI-X MRBC is 512.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> Tested-by: Emil Tantilov <emil.s.tantilov@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Applied.

^ permalink raw reply

* Re: linux-next: manual merge of the net tree with the net-current tree
From: Joe Perches @ 2010-07-20  2:34 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: David Miller, netdev, linux-next, linux-kernel, Jeff Dike,
	Michael S. Tsirkin
In-Reply-To: <20100720122032.88e0fcd9.sfr@canb.auug.org.au>

On Tue, 2010-07-20 at 12:20 +1000, Stephen Rothwell wrote:
> I fixed it up (see below) and can carry the fix as necessary.
@@@ -527,15 -527,12 +527,14 @@@ static long vhost_net_set_backend(struc
  
        /* start polling new socket */
        oldsock = vq->private_data;
-       if (sock == oldsock)
-               goto done;
+       if (sock != oldsock){

Trivial: missing space before open brace in commit
dd1f4078f0d2de74a308f00a2dffbd550cfba59f

^ permalink raw reply

* linux-next: manual merge of the net tree with the net-current tree
From: Stephen Rothwell @ 2010-07-20  2:20 UTC (permalink / raw)
  To: David Miller, netdev
  Cc: linux-next, linux-kernel, Jeff Dike, Michael S. Tsirkin

Hi all,

Today's linux-next merge of the net tree got a conflict in
drivers/vhost/net.c between commit
1680e9063ea28099a1efa8ca11cee069cc7a9bc3 ("vhost-net: avoid flush under
lock") from the net-current tree and commit
dd1f4078f0d2de74a308f00a2dffbd550cfba59f ("vhost-net: minor cleanup")
from the net tree.

I fixed it up (see below) and can carry the fix as necessary.
-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au

diff --cc drivers/vhost/net.c
index d219070,107af9e..0000000
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@@ -527,15 -527,12 +527,14 @@@ static long vhost_net_set_backend(struc

  	/* start polling new socket */
  	oldsock = vq->private_data;
- 	if (sock == oldsock)
- 		goto done;
+ 	if (sock != oldsock){
+                 vhost_net_disable_vq(n, vq);
+                 rcu_assign_pointer(vq->private_data, sock);
+                 vhost_net_enable_vq(n, vq);
+ 	}

- 	vhost_net_disable_vq(n, vq);
- 	rcu_assign_pointer(vq->private_data, sock);
- 	vhost_net_enable_vq(n, vq);
- done:
 +	mutex_unlock(&vq->mutex);
 +
  	if (oldsock) {
  		vhost_net_flush_vq(n, index);
  		fput(oldsock->file);

^ permalink raw reply

* Re: [PATCH] LSM: Add post accept() hook.
From: Tetsuo Handa @ 2010-07-20  1:36 UTC (permalink / raw)
  To: paul.moore
  Cc: davem, eric.dumazet, jmorris, sam, serge, netdev,
	linux-security-module
In-Reply-To: <201007191815.52124.paul.moore@hp.com>

Paul Moore wrote:
> I think you need to show how you plan to use this hook in an LSM before we can 
> consider merging it with mainline.  What you are proposing here is giving an 
> LSM the ability to drop a connection _after_ allowing it to be established in 
> the first place; this seems very wrong to me and I want to make sure everyone 
> else is aware of that before accepting this code into the kernel.  I 
> understand that TOMOYO's security model does not allow it to reject incoming 
> connections at the beginning of the connection request like some of the LSMs 
> currently in use, but I'm just not very happy with the idea of finishing a 
> connection handshake only to later drop the connection on the floor.

Yes. I'm planning to use security_socket_post_accept() for two purposes.

One is for dropping connections from unwanted hosts. Administrators define
policy before enabling enforcing mode (the mode which connections are dropped
if operation was not granted by policy). Administrators specify acceptable
hosts (i.e. hosts which this host needs to communicate with) and unacceptable
hosts (i.e. hosts which this host needn't to communicate with).
Dropping connections would happen if some process was hijacked and the process
attempted to communicate with other processes using TCP connections. But
dropping connections should not happen in normal circumstance.

The other is for updating process's state variable upon accept() operation.
LKM version of TOMOYO has per a task_struct variable that is used for
implementing stateful permissions. (As of now, not implemented for LSM version
of TOMOYO.) For example,

  allow_network TCP accept 10.0.0.0-10.255.255.255 1024-65535 ; set task.state[0]=1
  allow_network TCP accept 192.168.0.1-192.168.255.255 1024-65535 ; set task.state[0]=2

will change current thread's task state variable to 1 if current thread
accepted TCP connection from 10.0.0.0-10.255.255.255 and change it to 2 if from
192.168.0.1-192.168.255.255 . This variable is used for giving different
permissions for subsequent operations. For example,

  allow_execute /bin/bash if task.state[0]=1
  allow_execute /bin/tcsh if task.state[0]=2

will allow execution of /bin/bash if current thread is dealing connections from
10.0.0.0-10.255.255.255 and allow execution of /bin/tcsh if current thread is
dealing connections from 192.168.0.1-192.168.255.255 . Another example,

  allow_network TCP accept 0.0.0.0-255.255.255.255 1024-65535 ; set task.state[0]=3
  allow_network TCP accept 0.0.0.0-255.255.255.255 1-1023 ; set task.state[0]=4

will change it to 3 if from unprivileged port and change it to 4 if from
privileged port.

  allow_execute /bin/rbash if task.state[0]=3
  allow_execute /bin/bash if task.state[0]=4

will allow execution of /bin/rbash if dealing connections from unprivileged
ports and allow execution of /bin/bash if dealing connections from privileged
ports.

LSM hooks called before sock->ops->accept() cannot change current thread's task
state variable because it is racy, and LSM hook called after sock->ops->accept()
is missing.

Strictly speaking, it could be possible to update current thread's task state
variable in LSM hooks called by subsequent operations (e.g.
security_dentry_open(), security_bprm_set_creds()) by doing similar approach
done by tomoyo_dead_sock(), but updating it can fail (e.g. -ENOMEM) since
credentials are COW. If updating it failed, I want to drop the accept()ed
connection, but that is impossible from LSM hooks called by subsequent
operations. Killing current thread when updating it failed is possible, but
that looks worse for me than dropping connections upon accept() time (because
such action resembles OOM killer and likely gives larger damage to the caller).

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox