Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: oops in tcp_xmit_retransmit_queue() w/ v2.6.32.15
From: Ilpo Järvinen @ 2010-07-20  8:41 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Lennart Schulte, Eric Dumazet, David S. Miller, lkml,
	netdev@vger.kernel.org, Fehrmann, Henning, Carsten Aulbert
In-Reply-To: <4C4467E0.9080907@kernel.org>

[-- Attachment #1: Type: TEXT/PLAIN, Size: 6095 bytes --]

On Mon, 19 Jul 2010, Tejun Heo wrote:

> Hello,
> 
> On 07/16/2010 02:02 PM, Ilpo Järvinen wrote:
> > Besides, Tejun has also found that it's hint->next ptr which is NULL in 
> > his case so this won't solve his case anyway. Tejun, can you confirm 
> > whether it was retransmit_skb_hint->next being NULL on _entry time_ to 
> > tcp_xmit_retransmit_queue() or later on in the loop after the updates done 
> > by the loop itself to the hint (or that your testing didn't conclude 
> > either)?
> 
> Sorry about the delay.  I was traveling last week.  Unfortunately, I
> don't know whether ->next was NULL on entry or not.  I hacked up the
> following ugly patch for the next test run.  It should have everything
> which has come up till now + list and hint sanity checking before
> starting processing them.  I'm planning on deploying it w/ crashdump
> enabled in several days.  If I've missed something, please let me
> know.

One thing that complicates things further is the fact that the write queue 
can change beneath us (ie., in tcp_retrans_try_collapse and tcp_fragment).
I've read them multiple times through and always found them innocent so 
this might be just for-the-record type of a note (at least I hope so).

-- 
 i.

> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index b4ed957..1c8b1e0 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2190,6 +2190,53 @@ static int tcp_can_forward_retransmit(struct sock *sk)
>  	return 1;
>  }
> 
> +static void print_queue(struct sock *sk, struct sk_buff *old, struct sk_buff *hole)
> +{
> +	struct tcp_sock *tp = tcp_sk(sk);
> +	struct sk_buff *skb, *prev;
> +	bool do_panic = false;
> +
> +	skb = tcp_write_queue_head(sk);
> +	prev = (struct sk_buff *)(&sk->sk_write_queue);
> +
> +	if (skb == NULL) {
> +		printk("XXX NULL head, pkts %u\n", tp->packets_out);
> +		do_panic = true;
> +	}
> +
> +	printk("XXX head %p tail %p sendhead %p oldhint %p now %p hole %p high %u\n",
> +	       tcp_write_queue_head(sk), tcp_write_queue_tail(sk),
> +	       tcp_send_head(sk), old, tp->retransmit_skb_hint, hole,
> +	       tp->retransmit_high);
> +
> +	while (skb) {
> +		printk("XXX skb %p (%u-%u) next %p prev %p sacked %u\n",
> +		       skb, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq,
> +		       skb->next, skb->prev, TCP_SKB_CB(skb)->sacked);
> +		if (prev != skb->prev) {
> +			printk("XXX Inconsistent prev\n");
> +			do_panic = true;
> +		}
> +
> +		if (skb == tcp_write_queue_tail(sk)) {
> +			if (skb->next != (struct sk_buff *)(&sk->sk_write_queue)) {
> +				printk("XXX Improper next at tail\n");
> +				do_panic = true;
> +			}
> +			break;
> +		}
> +
> +		prev = skb;
> +		skb = skb->next;
> +	}
> +	if (!skb) {
> +		printk("XXX Encountered unexpected NULL\n");
> +		do_panic = true;
> +	}
> +	if (do_panic)
> +		panic("XXX panicking");
> +}
> +
>  /* This gets called after a retransmit timeout, and the initially
>   * retransmitted data is acknowledged.  It tries to continue
>   * resending the rest of the retransmit queue, until either
> @@ -2198,19 +2245,53 @@ static int tcp_can_forward_retransmit(struct sock *sk)
>   * based retransmit packet might feed us FACK information again.
>   * If so, we use it to avoid unnecessarily retransmissions.
>   */
> +static unsigned int caught_it;
> +
>  void tcp_xmit_retransmit_queue(struct sock *sk)
>  {
>  	const struct inet_connection_sock *icsk = inet_csk(sk);
>  	struct tcp_sock *tp = tcp_sk(sk);
> -	struct sk_buff *skb;
> +	struct sk_buff *skb, *prev;
>  	struct sk_buff *hole = NULL;
> +	struct sk_buff *old = tp->retransmit_skb_hint;
>  	u32 last_lost;
>  	int mib_idx;
>  	int fwd_rexmitting = 0;
> +	bool saw_hint = false;
> +
> +	if (!tp->packets_out) {
> +		if (net_ratelimit())
> +			printk("XXX !tp->packets_out, retransmit_skb_hint=%p, write_queue_head=%p\n",
> +			       tp->retransmit_skb_hint, tcp_write_queue_head(sk));
> +		return;
> +	}
> 
>  	if (!tp->lost_out)
>  		tp->retransmit_high = tp->snd_una;
> 
> +	for (skb = tcp_write_queue_head(sk),
> +	     prev = (struct sk_buff *)&sk->sk_write_queue;
> +	     skb != (struct sk_buff *)&sk->sk_write_queue;
> +	     prev = skb, skb = skb->next) {
> +		if (prev != skb->prev) {
> +			printk("XXX sanity check: prev corrupt\n");
> +			print_queue(sk, old, hole);
> +		}
> +		if (skb == tp->retransmit_skb_hint)
> +			saw_hint = true;
> +		if (skb == tcp_write_queue_tail(sk) &&
> +		    skb->next != (struct sk_buff *)(&sk->sk_write_queue)) {
> +			printk("XXX sanity check: end corrupt\n");
> +			print_queue(sk, old, hole);
> +		}
> +	}
> +	if (tp->retransmit_skb_hint && !saw_hint) {
> +		printk("XXX sanity check: retransmit_skb_hint=%p is not on list, claring hint\n",
> +		       tp->retransmit_skb_hint);
> +		print_queue(sk, old, hole);
> +		tp->retransmit_skb_hint = NULL;
> +	}
> +
>  	if (tp->retransmit_skb_hint) {
>  		skb = tp->retransmit_skb_hint;
>  		last_lost = TCP_SKB_CB(skb)->end_seq;
> @@ -2218,7 +2299,17 @@ void tcp_xmit_retransmit_queue(struct sock *sk)
>  			last_lost = tp->retransmit_high;
>  	} else {
>  		skb = tcp_write_queue_head(sk);
> -		last_lost = tp->snd_una;
> +		if (skb)
> +			last_lost = tp->snd_una;
> +	}
> +
> +checknull:
> +	if (skb == NULL) {
> +		print_queue(sk, old, hole);
> +		caught_it++;
> +		if (net_ratelimit())
> +			printk("XXX Errors caught so far %u\n", caught_it);
> +		return;
>  	}
> 
>  	tcp_for_write_queue_from(skb, sk) {
> @@ -2261,7 +2352,7 @@ begin_fwd:
>  		} else if (!(sacked & TCPCB_LOST)) {
>  			if (hole == NULL && !(sacked & (TCPCB_SACKED_RETRANS|TCPCB_SACKED_ACKED)))
>  				hole = skb;
> -			continue;
> +			goto cont;
> 
>  		} else {
>  			last_lost = TCP_SKB_CB(skb)->end_seq;
> @@ -2272,7 +2363,7 @@ begin_fwd:
>  		}
> 
>  		if (sacked & (TCPCB_SACKED_ACKED|TCPCB_SACKED_RETRANS))
> -			continue;
> +			goto cont;
> 
>  		if (tcp_retransmit_skb(sk, skb))
>  			return;
> @@ -2282,6 +2373,9 @@ begin_fwd:
>  			inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
>  						  inet_csk(sk)->icsk_rto,
>  						  TCP_RTO_MAX);
> +cont:
> +		skb = skb->next;
> +		goto checknull;
>  	}
>  }
> 
> 

^ permalink raw reply

* recv(2), MSG_TRUNK and kernels older than 2.6.22
From: Roy Marples @ 2010-07-20  8:26 UTC (permalink / raw)
  To: netdev

Hi

I would like to support all possible kernels I can and previously used a 
fixed buffer of size 256 to read from netlink sockets. This is now 
proving too small for some 64-bit kernels so I would like to use recv(2) 
with MSG_TRUNK to wor out the size. However, the man page says that this 
only works for 2.6.22 kernels or newer.

My question is, what is the behaviour of recv on older kernels where 
MSG_TRUNC is not supported? I would rather not use some arbitary size if 
at all possible.

Reply directly please as I'm not subscribed here.

Thanks

Roy

^ permalink raw reply

* Re: [PATCHv2] tcp: fix crash in tcp_xmit_retransmit_queue
From: Ilpo Järvinen @ 2010-07-20  8:33 UTC (permalink / raw)
  To: David Miller
  Cc: eric.dumazet, lennart.schulte, tj, LKML, Netdev, henning.fehrmann,
	carsten.aulbert
In-Reply-To: <20100719.125500.257479409.davem@davemloft.net>

On Mon, 19 Jul 2010, David Miller wrote:

> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 19 Jul 2010 19:39:08 +0200
> 
> > Do you know in what exact circumstance the bug triggers ?
> > 
> > It's hard to believe thousand of machines on the Internet never hit
> > it :(
> > 
> > Maybe another problem in congestion control ?
> 
> This is something to investigate, but the conditions under which
> tcp_fastretrans_alert() (the main invoker of tcp_xmit_retransmit_queue())
> does it's thing are complicated enough that I'm going to add this fix
> for the time being and push it out to stable too.

This is so true. ...So far I've managed to twice rule out of the 
possibility of this being really triggerable (ie., it would mean
Lennart's out of tree changes broke it), and once in the middle came
into opposite conclusion. Thus by majority voting we can deduce that it 
won't happen - how reassuring :-/. It seems that tcp_try_undo_recovery 
causes return if TCP remained in CA_Loss/CA_Recovery and that 
tcp_time_to_recover won't really let past return either under normal 
circumstances (more details below), and tcp_simple_retransmit 
requires lost_out to change; seems safe in mainline to me.

Hmm... It seems that I've just solved another report too. ...Somebody a 
while back found out that setting reordering sysctl to zero (ie. to a 
value which does not make too much sense) crashed the kernel. It seems 
that at least then tcp_time_to_recover() would return true and trigger 
this bug (though I'm not sure if that's the only breakage to happen).

Also worth to keep in mind is the bugzilla entry ("New freez in 
TCP" or something like that) so I'm not really sure I could say for sure 
nobody never hit it. The bugzilla one goes away by disable SACK (at least 
for some) but it might mix two different issues. It seems that there 
really are two different issues, the other may have something to do with 
SACK though there are other variables then involved, e.g., the changes in 
retransmission logic/timing, so it's impossible to say if the SACK disable 
really "fixed" the bugzilla one or not. Also Tejun's ->next == NULL 
finding points out to a different bug than this Lennart's one.

-- 
 i.

^ permalink raw reply

* Re: [PATCH] net: Add batman-adv meshing protocol
From: Sven Eckelmann @ 2010-07-20  8:28 UTC (permalink / raw)
  To: b.a.t.m.a.n-ZwoEplunGu2X36UT3dwllkB+6BGkLq7r, Simon Wunderlich
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, David Miller
In-Reply-To: <20100719.212625.255369607.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>

[-- Attachment #1: Type: Text/Plain, Size: 1679 bytes --]

Thanks a lot for your review and for your hints.

David Miller wrote:
> From: Sven Eckelmann <sven.eckelmann-Mmb7MZpHnFY@public.gmane.org>
> Date: Fri, 16 Jul 2010 16:39:16 +0200
[...]
> 
> The kernel has a hamming weight library function which takes advantage
> of population count instructions on cpus that suport it, and also has
> a sw version than is faster than what you're doing here, please use
> it.
> 
> The interfaces are called "hweight{8,16,32,64}()" where the number in
> the name indicates the bit-size of the word the interface operates on.

Correct, the inner loop is a quite straight forward implementation without any 
kind of optimization. I will change that.

> I also notice that this code uses it's own internal buffering scheme
> with kmalloc()'d buffers, then seperately allocates actual SKB's and
> copies the data there.
> 
> Just use the SKB facilities how they were designed to be used, instead
> of needlessly inventing new things.  Allocate your initial SKB and put
> the initial forwarding header in it, then when you want to send a copy
> off, skb_clone() it, and push the other bits you want at the head
> and/or the tail of the cloned SKB, then simply send it off.

Good catch. That comes from a time when batman-adv was a minimalistic 
conversation of the userspace proof of concept implementation. This happens 
for example in vis.c, icmp_socket.c and send.c (just grepping for 
send_raw_packet is a good way to find those places). But is also happening 
with batman_if->packet_buff in schedule_own_packet and similar places.

I would leave that to the original author of those functions.

thanks,
	Sven

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* [PATCH] phy: add suspend/resume in the ic+
From: Giuseppe CAVALLARO @ 2010-07-20  7:12 UTC (permalink / raw)
  To: netdev; +Cc: Giuseppe Cavallaro

Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
---
 drivers/net/phy/icplus.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/net/phy/icplus.c b/drivers/net/phy/icplus.c
index 439adaf..3f2583f 100644
--- a/drivers/net/phy/icplus.c
+++ b/drivers/net/phy/icplus.c
@@ -116,6 +116,8 @@ static struct phy_driver ip175c_driver = {
 	.config_init	= &ip175c_config_init,
 	.config_aneg	= &ip175c_config_aneg,
 	.read_status	= &ip175c_read_status,
+	.suspend	= genphy_suspend,
+	.resume		= genphy_resume,
 	.driver		= { .owner = THIS_MODULE,},
 };
 
-- 
1.5.5.6


^ permalink raw reply related

* Re: [RFC PATCH v3 4/5] skb: add tracepoints to freeing skb
From: Koki Sanagi @ 2010-07-20  6:47 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, linux-kernel, davem, kaneshige.kenji, izumi.taku,
	kosaki.motohiro, nhorman, laijs, scott.a.mcmillan, rostedt,
	fweisbec, mathieu.desnoyers
In-Reply-To: <1279601643.2458.64.camel@edumazet-laptop>

(2010/07/20 13:54), Eric Dumazet wrote:
> Le mardi 20 juillet 2010 à 09:49 +0900, Koki Sanagi a écrit :
>> [RFC PATCH v3 4/5] skb: add tracepoints to freeing skb
>> This patch adds tracepoint to consume_skb, dev_kfree_skb_irq and
>> skb_free_datagram_locked. Combinating with tracepoint on dev_hard_start_xmit,
>> we can check how long it takes to free transmited packets. And using it, we can
>> calculate how many packets driver had at that time. It is useful when a drop of
>> transmited packet is a problem.
>>
>>           <idle>-0     [001] 241409.218333: consume_skb: skbaddr=dd6b2fb8
>>           <idle>-0     [001] 241409.490555: dev_kfree_skb_irq: skbaddr=f5e29840
>>
>>         udp-recv-302   [001] 515031.206008: skb_free_datagram_locked: skbaddr=f5b1d900
>>
>>
>> Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
>> ---
>>  include/trace/events/skb.h |   42 ++++++++++++++++++++++++++++++++++++++++++
>>  net/core/datagram.c        |    1 +
>>  net/core/dev.c             |    2 ++
>>  net/core/skbuff.c          |    1 +
>>  4 files changed, 46 insertions(+), 0 deletions(-)
>>
>> diff --git a/include/trace/events/skb.h b/include/trace/events/skb.h
>> index 4b2be6d..84c9041 100644
>> --- a/include/trace/events/skb.h
>> +++ b/include/trace/events/skb.h
>> @@ -35,6 +35,48 @@ TRACE_EVENT(kfree_skb,
>>  		__entry->skbaddr, __entry->protocol, __entry->location)
>>  );
>>  
>> +DECLARE_EVENT_CLASS(free_skb,
>> +
>> +	TP_PROTO(struct sk_buff *skb),
>> +
>> +	TP_ARGS(skb),
>> +
>> +	TP_STRUCT__entry(
>> +		__field(	void *,	skbaddr	)
>> +	),
>> +
>> +	TP_fast_assign(
>> +		__entry->skbaddr = skb;
>> +	),
>> +
>> +	TP_printk("skbaddr=%p", __entry->skbaddr)
>> +
>> +);
>> +
>> +DEFINE_EVENT(free_skb, consume_skb,
>> +
>> +	TP_PROTO(struct sk_buff *skb),
>> +
>> +	TP_ARGS(skb)
>> +
>> +);
>> +
>> +DEFINE_EVENT(free_skb, dev_kfree_skb_irq,
>> +
>> +	TP_PROTO(struct sk_buff *skb),
>> +
>> +	TP_ARGS(skb)
>> +
>> +);
>> +
>> +DEFINE_EVENT(free_skb, skb_free_datagram_locked,
>> +
>> +	TP_PROTO(struct sk_buff *skb),
>> +
>> +	TP_ARGS(skb)
>> +
>> +);
>> +
>>  TRACE_EVENT(skb_copy_datagram_iovec,
>>  
>>  	TP_PROTO(const struct sk_buff *skb, int len),
>> diff --git a/net/core/datagram.c b/net/core/datagram.c
>> index f5b6f43..1ea32a0 100644
>> --- a/net/core/datagram.c
>> +++ b/net/core/datagram.c
>> @@ -231,6 +231,7 @@ void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb)
>>  {
>>  	bool slow;
>>  
>> +	trace_skb_free_datagram_locked(skb);
> 
> Here you unconditionally trace before the test on skb->users
> 
>>  	if (likely(atomic_read(&skb->users) == 1))
>>  		smp_rmb();
>>  	else if (likely(!atomic_dec_and_test(&skb->users)))
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 4acfec6..d979847 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -131,6 +131,7 @@
>>  #include <linux/random.h>
>>  #include <trace/events/napi.h>
>>  #include <trace/events/net.h>
>> +#include <trace/events/skb.h>
>>  #include <linux/pci.h>
>>  
>>  #include "net-sysfs.h"
>> @@ -1581,6 +1582,7 @@ void dev_kfree_skb_irq(struct sk_buff *skb)
>>  		struct softnet_data *sd;
>>  		unsigned long flags;
>>  
>> +		trace_dev_kfree_skb_irq(skb);
>>  		local_irq_save(flags);
>>  		sd = &__get_cpu_var(softnet_data);
>>  		skb->next = sd->completion_queue;
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index 34432b4..a7b4036 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -466,6 +466,7 @@ void consume_skb(struct sk_buff *skb)
>>  		smp_rmb();
>>  	else if (likely(!atomic_dec_and_test(&skb->users)))
>>  		return;
> 
> While here you trace _after_ the test on skb->users - and a "return;" ,
> so you miss some consume_skb() calls
> 
Yeah, I'll move trace_consume_skb() before the test.

Thanks,
Koki Sanagi.

> 
>> +	trace_consume_skb(skb);
>>  	__kfree_skb(skb);
>>  }
>>  EXPORT_SYMBOL(consume_skb);
>>
> 
> 
> 
> 

^ permalink raw reply

* Re: [net-next-2.6 PATCH 2/5] ixgbe: drop support for UDP in RSS hash generation
From: David Miller @ 2010-07-20  6:44 UTC (permalink / raw)
  To: eric.dumazet
  Cc: jeffrey.t.kirsher, netdev, gospo, bphilips, alexander.h.duyck,
	donald.c.skidmore
In-Reply-To: <1279607980.2458.82.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 20 Jul 2010 08:39:40 +0200

> This would be a serious regression.

The regression is in the hardware Eric.

> Many UDP applications try hard to not use fragments. 
> 
> They are going to pay the price because some application :
> - Use big segments, fragmented.
> - Is subject to OOO artifacts.

None of this matters.  If the hardware can't flow seperate properly
it's tough cookies.

We never may reorder packets in our stack by our own doing.  The
only safe default is to turn UDP flow seperation off on chips
like ixgbe.

If you want an ethtool knob to turn it back on for your machines,
fine.  But never can it be enabled by default.

^ permalink raw reply

* Re: [net-next-2.6 PATCH 2/5] ixgbe: drop support for UDP in RSS hash generation
From: Eric Dumazet @ 2010-07-20  6:39 UTC (permalink / raw)
  To: Jeff Kirsher
  Cc: davem, netdev, gospo, bphilips, Alexander Duyck, Don Skidmore
In-Reply-To: <20100719235925.14112.65890.stgit@localhost.localdomain>

Le lundi 19 juillet 2010 à 16:59 -0700, Jeff Kirsher a écrit :
> From: Alexander Duyck <alexander.h.duyck@intel.com>
> 
> This change removes UDP from the supported protocols for RSS hashing.  The
> reason for removing this protocol is because IP fragmentation was causing a
> network flow to be broken into two streams, one for fragmented, and one for
> non-fragmented and this in turn was causing out-of-order issues.
> 

Jeff, does it mean all UDP packets are going to be delivered to a single
queue ?

This would be a serious regression.

Many UDP applications try hard to not use fragments. 

They are going to pay the price because some application :
- Use big segments, fragmented.
- Is subject to OOO artifacts.

We would like some clarifications please :)




^ permalink raw reply

* Re: [net-next-2.6 PATCH 2/5] ixgbe: drop support for UDP in RSS hash generation
From: David Miller @ 2010-07-20  6:39 UTC (permalink / raw)
  To: eric.dumazet
  Cc: billfink, jeffrey.t.kirsher, netdev, gospo, bphilips,
	alexander.h.duyck, donald.c.skidmore
In-Reply-To: <1279607400.2458.76.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 20 Jul 2010 08:30:00 +0200

> By nature, UDP flows are subject to out of order issues, so what is this
> patch tries to avoid ?

UDP being subject to out-of-order issues really doesn't matter one
bit.

We should never, _knowingly_ create out-of-order packets in our
networking stack.  And this is regardless of protocol.

If there is no way to make ixgbe respect in-order packet delivery for
UDP frames vis-a-vis fragmented frames, we must disable RX flow
spreading for UDP.

^ permalink raw reply

* Re: [net-next-2.6 PATCH 2/5] ixgbe: drop support for UDP in RSS hash generation
From: Eric Dumazet @ 2010-07-20  6:30 UTC (permalink / raw)
  To: Bill Fink
  Cc: David Miller, jeffrey.t.kirsher, netdev, gospo, bphilips,
	alexander.h.duyck, donald.c.skidmore
In-Reply-To: <20100720020754.135b5ff7.billfink@mindspring.com>

Le mardi 20 juillet 2010 à 02:07 -0400, Bill Fink a écrit :
> On Mon, 19 Jul 2010, David Miller wrote:
> 

> Should there be a /proc or ethtool setting for whether or not to
> use RSS hashing for UDP flows?  I would think that for many common
> UDP applications, IP fragmentation would not be an issue because
> they often tend to use sub-MTU sized datagrams.  And of course
> UDP does not guarantee in-order delivery in any event.  Then a
> remaining issue is what the default setting of such an option
> should be.  I would lean to having it enabled by default, but
> I can also see the safety argument for having it off by default.
> 

Their are several issues here.

1) Ability for the NIC to spread UDP loads on several queues.

2) Ability for the NIC to provide the hash to our stack, to speedup a
bit RPS.

If the patch is about 1), ie disables NIC ability to split UDP flows on
several RX queues, then yes : its probably _not_ wanted.

Commit message is not very clear on this topic.

By nature, UDP flows are subject to out of order issues, so what is this
patch tries to avoid ?

^ permalink raw reply

* Re: [0/8] netpoll/bridge fixes
From: David Miller @ 2010-07-20  6:28 UTC (permalink / raw)
  To: herbert; +Cc: mst, shemminger, frzhang, netdev, amwang, mpm
In-Reply-To: <20100720052645.GA5214@gondor.apana.org.au>

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Tue, 20 Jul 2010 13:26:45 +0800

> Note that this patch needs to be reverted in net-next-2.6 as
> bridge netpoll works correctly there.

Ok, will do.

> bridge: Partially disable netpoll support
> 
> The new netpoll code in bridging contains use-after-free bugs
> that are non-trivial to fix.
> 
> This patch fixes this by removing the code that uses skbs after
> they're freed.
> 
> As a consequence, this means that we can no longer call bridge
> from the netpoll path, so this patch also removes the controller
> function in order to disable netpoll.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Applied, thanks Herbert.

^ permalink raw reply

* Re: [net-next-2.6 PATCH 2/5] ixgbe: drop support for UDP in RSS hash generation
From: Bill Fink @ 2010-07-20  6:07 UTC (permalink / raw)
  To: David Miller
  Cc: jeffrey.t.kirsher, netdev, gospo, bphilips, alexander.h.duyck,
	donald.c.skidmore
In-Reply-To: <20100719.202417.218061462.davem@davemloft.net>

On Mon, 19 Jul 2010, David Miller wrote:

> From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> Date: Mon, 19 Jul 2010 16:59:27 -0700
> 
> > From: Alexander Duyck <alexander.h.duyck@intel.com>
> > 
> > This change removes UDP from the supported protocols for RSS hashing.  The
> > reason for removing this protocol is because IP fragmentation was causing a
> > network flow to be broken into two streams, one for fragmented, and one for
> > non-fragmented and this in turn was causing out-of-order issues.
> > 
> > Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> > Acked-by: Don Skidmore <donald.c.skidmore@intel.com>
> > Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> 
> Applied.

Should there be a /proc or ethtool setting for whether or not to
use RSS hashing for UDP flows?  I would think that for many common
UDP applications, IP fragmentation would not be an issue because
they often tend to use sub-MTU sized datagrams.  And of course
UDP does not guarantee in-order delivery in any event.  Then a
remaining issue is what the default setting of such an option
should be.  I would lean to having it enabled by default, but
I can also see the safety argument for having it off by default.

						-Bill

^ permalink raw reply

* Re: [PATCH net-next 3/4] bnx2: Remove some unnecessary smp_mb() in tx fast path.
From: Michael Chan @ 2010-07-20  5:48 UTC (permalink / raw)
  To: 'Eric Dumazet'; +Cc: davem@davemloft.net, netdev@vger.kernel.org
In-Reply-To: <1279604283.2458.68.camel@edumazet-laptop>

Eric Dumazet wrote:

> Excellent,
>
> Is similar patch for tg3 planned ?
>

Yes, this performance issue was actually first discovered when
profiling a new tg3 device.  So Matt should be sending out a
very similar patch as part of his next patch-set.

^ permalink raw reply

* Re: [PATCH net-next 3/4] bnx2: Remove some unnecessary smp_mb() in tx fast path.
From: Eric Dumazet @ 2010-07-20  5:38 UTC (permalink / raw)
  To: Michael Chan; +Cc: davem, netdev
In-Reply-To: <1279584905-15084-3-git-send-email-mchan@broadcom.com>

Le lundi 19 juillet 2010 à 17:15 -0700, Michael Chan a écrit :
> smp_mb() inside bnx2_tx_avail() is used twice in the normal
> bnx2_start_xmit() path (see illustration below).  The full memory
> barrier is only necessary during race conditions with tx completion.
> We can speed up the tx path by replacing smp_mb() in bnx2_tx_avail()
> with a compiler barrier.  The compiler barrier is to force the
> compiler to fetch the tx_prod and tx_cons from memory.
> 
> In the race condition between bnx2_start_xmit() and bnx2_tx_int(),
> we have the following situation:
> 
> bnx2_start_xmit()                       bnx2_tx_int()
>     if (!bnx2_tx_avail())
>             BUG();
> 
>     ...
> 
>     if (!bnx2_tx_avail())
>             netif_tx_stop_queue();          update_tx_index();
>             smp_mb();                       smp_mb();
>             if (bnx2_tx_avail())            if (netif_tx_queue_stopped() &&
>                     netif_tx_wake_queue();      bnx2_tx_avail())
> 
> With smp_mb() removed from bnx2_tx_avail(), we need to add smp_mb() to
> bnx2_start_xmit() as shown above to properly order netif_tx_stop_queue()
> and bnx2_tx_avail() to check the ring index.  If it is not strictly
> ordered, the tx queue can be stopped forever.
> 
> This improves performance by about 5% with 2 ports running bi-directional
> 64-byte packets.
> 
> Reviewed-by: Benjamin Li <benli@broadcom.com>
> Reviewed-by: Matt Carlson <mcarlson@broadcom.com>
> Signed-off-by: Michael Chan <mchan@broadcom.com>
> ---
>  drivers/net/bnx2.c |   10 +++++++++-
>  1 files changed, 9 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
> index d44ecc3..2af570d 100644
> --- a/drivers/net/bnx2.c
> +++ b/drivers/net/bnx2.c
> @@ -253,7 +253,8 @@ static inline u32 bnx2_tx_avail(struct bnx2 *bp, struct bnx2_tx_ring_info *txr)
>  {
>  	u32 diff;
>  
> -	smp_mb();
> +	/* Tell compiler to fetch tx_prod and tx_cons from memory. */
> +	barrier();
>  
>  	/* The ring uses 256 indices for 255 entries, one of them
>  	 * needs to be skipped.
> @@ -6534,6 +6535,13 @@ bnx2_start_xmit(struct sk_buff *skb, struct net_device *dev)
>  
>  	if (unlikely(bnx2_tx_avail(bp, txr) <= MAX_SKB_FRAGS)) {
>  		netif_tx_stop_queue(txq);
> +
> +		/* netif_tx_stop_queue() must be done before checking
> +		 * tx index in bnx2_tx_avail() below, because in
> +		 * bnx2_tx_int(), we update tx index before checking for
> +		 * netif_tx_queue_stopped().
> +		 */
> +		smp_mb();
>  		if (bnx2_tx_avail(bp, txr) > bp->tx_wake_thresh)
>  			netif_tx_wake_queue(txq);
>  	}

Excellent,

Is similar patch for tg3 planned ?

Thanks




^ permalink raw reply

* Re: [0/8] netpoll/bridge fixes
From: Herbert Xu @ 2010-07-20  5:26 UTC (permalink / raw)
  To: David Miller; +Cc: mst, shemminger, frzhang, netdev, amwang, mpm
In-Reply-To: <20100719.090503.73693858.davem@davemloft.net>

On Mon, Jul 19, 2010 at 09:05:03AM -0700, David Miller wrote:
>
> I thought we did that already.... oh I see, we did it for bonding:
> 
> commit c22d7ac844f1cb9c6a5fd20f89ebadc2feef891b
> Author: Andy Gospodarek <andy@greyhouse.net>
> Date:   Fri Jun 25 09:50:44 2010 +0000
> 
>     bonding: prevent netpoll over bonded interfaces
> 
> I'm fine with disabling it for bridging too, just send me a patch
> similar to the bonding one.

OK, here is a minimal patch to remove the offending bits of the
bridge netpoll support.

Note that this patch needs to be reverted in net-next-2.6 as
bridge netpoll works correctly there.

bridge: Partially disable netpoll support

The new netpoll code in bridging contains use-after-free bugs
that are non-trivial to fix.

This patch fixes this by removing the code that uses skbs after
they're freed.

As a consequence, this means that we can no longer call bridge
from the netpoll path, so this patch also removes the controller
function in order to disable netpoll.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index eedf2c9..753fc42 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -217,14 +217,6 @@ static bool br_devices_support_netpoll(struct net_bridge *br)
 	return count != 0 && ret;
 }
 
-static void br_poll_controller(struct net_device *br_dev)
-{
-	struct netpoll *np = br_dev->npinfo->netpoll;
-
-	if (np->real_dev != br_dev)
-		netpoll_poll_dev(np->real_dev);
-}
-
 void br_netpoll_cleanup(struct net_device *dev)
 {
 	struct net_bridge *br = netdev_priv(dev);
@@ -295,7 +287,6 @@ static const struct net_device_ops br_netdev_ops = {
 	.ndo_do_ioctl		 = br_dev_ioctl,
 #ifdef CONFIG_NET_POLL_CONTROLLER
 	.ndo_netpoll_cleanup	 = br_netpoll_cleanup,
-	.ndo_poll_controller	 = br_poll_controller,
 #endif
 };
 
diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index a4e72a8..595da45 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -50,14 +50,7 @@ int br_dev_queue_push_xmit(struct sk_buff *skb)
 			kfree_skb(skb);
 		else {
 			skb_push(skb, ETH_HLEN);
-
-#ifdef CONFIG_NET_POLL_CONTROLLER
-			if (unlikely(skb->dev->priv_flags & IFF_IN_NETPOLL)) {
-				netpoll_send_skb(skb->dev->npinfo->netpoll, skb);
-				skb->dev->priv_flags &= ~IFF_IN_NETPOLL;
-			} else
-#endif
-				dev_queue_xmit(skb);
+			dev_queue_xmit(skb);
 		}
 	}
 
@@ -73,23 +66,9 @@ int br_forward_finish(struct sk_buff *skb)
 
 static void __br_deliver(const struct net_bridge_port *to, struct sk_buff *skb)
 {
-#ifdef CONFIG_NET_POLL_CONTROLLER
-	struct net_bridge *br = to->br;
-	if (unlikely(br->dev->priv_flags & IFF_IN_NETPOLL)) {
-		struct netpoll *np;
-		to->dev->npinfo = skb->dev->npinfo;
-		np = skb->dev->npinfo->netpoll;
-		np->real_dev = np->dev = to->dev;
-		to->dev->priv_flags |= IFF_IN_NETPOLL;
-	}
-#endif
 	skb->dev = to->dev;
 	NF_HOOK(NFPROTO_BRIDGE, NF_BR_LOCAL_OUT, skb, NULL, skb->dev,
 		br_forward_finish);
-#ifdef CONFIG_NET_POLL_CONTROLLER
-	if (skb->dev->npinfo)
-		skb->dev->npinfo->netpoll->dev = br->dev;
-#endif
 }
 
 static void __br_forward(const struct net_bridge_port *to, struct sk_buff *skb)

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related

* Re: Very low latency TCP for clusters
From: Eric Dumazet @ 2010-07-20  5:26 UTC (permalink / raw)
  To: Tom Herbert; +Cc: netdev, Rick Jones
In-Reply-To: <AANLkTillCVWMHHBImGEeRYy2MJYqtGIvSPvacLKJnpP2@mail.gmail.com>

Le lundi 19 juillet 2010 à 16:37 -0700, Tom Herbert a écrit :

> That's pretty pokey ;-) I see numbers around 25 usecs between to
> machines, this is with TCP_NBRR.  With TCP_RR it's more like 35 usecs,
> so eliminating the scheduler is already a big reduction.  That leaves
> 18 usecs in device time, interrupt processing, network, and cache
> misses; 7 usecs in TCP processing, user space.  While 5 usecs is an
> aggressive goal, I am not ready to concede that there's an
> architectural limit in either NICs, TCP, or sockets that can't be
> overcome.

Last time I tried TCP_NBRR, it was not working (not even compiled in), I
guess I should submit a bug report to Rick ;)




^ permalink raw reply

* Re: [PATCH net-next] drivers/acpi/acpica/utmisc.c: Use printk extension %pV
From: Joe Perches @ 2010-07-20  5:21 UTC (permalink / raw)
  To: David Miller; +Cc: lenb, linux-acpi, linux-kernel, netdev
In-Reply-To: <20100719.221630.146073821.davem@davemloft.net>

On Mon, 2010-07-19 at 22:16 -0700, David Miller wrote:
> Joe, I really can't keep merging stuff like this into the networking
> tree.

Oh no worries David.

The patch is more like a preliminary notification that at
some point after the vsprintf stuff gets merged into Linus'
tree after 2.6.36-rc1 that this change to acpi could be useful.

> But you're going to have to find another way to merge uses outside of
> networking that you want integrated.

Hey, I'm a patient guy, there's no hurry.

cheers, Joe

^ permalink raw reply

* Re: [PATCH net-next] drivers/acpi/acpica/utmisc.c: Use printk extension %pV
From: David Miller @ 2010-07-20  5:16 UTC (permalink / raw)
  To: joe; +Cc: lenb, linux-acpi, linux-kernel, netdev
In-Reply-To: <1279602367.19374.20.camel@Joe-Laptop.home>

From: Joe Perches <joe@perches.com>
Date: Mon, 19 Jul 2010 22:06:07 -0700

> Consolidates the printk messages to a single
> call so the messages can not be interleaved.
> 
> Reduces text a bit.
> 
> $ size drivers/acpi/acpica/utmisc.o.*
>    text	   data	    bss	    dec	    hex	filename
>    7822	     56	   1832	   9710	   25ee	drivers/acpi/acpica/utmisc.o.old
>    7748	     56	   1736	   9540	   2544	drivers/acpi/acpica/utmisc.o.new
> 
> Depends on net-next commit 7db6f5fb65a82af03229eef104dc9899c5eecf33
> (vsprintf: Recursive vsnprintf: Add "%pV", struct va_format)
> 
> Compile tested only
> 
> Signed-off-by: Joe Perches <joe@perches.com>

Joe, I really can't keep merging stuff like this into the networking
tree.  The driver generic bits, since the netdev print macros needed
this indirectly and you used %pV primarily for networking stuff, that's
fine.

But you're going to have to find another way to merge uses outside of
networking that you want integrated.

Thanks.

^ permalink raw reply

* [PATCH net-next] drivers/acpi/acpica/utmisc.c: Use printk extension %pV
From: Joe Perches @ 2010-07-20  5:06 UTC (permalink / raw)
  To: Len Brown; +Cc: linux-acpi, linux-kernel, netdev, David Miller

Consolidates the printk messages to a single
call so the messages can not be interleaved.

Reduces text a bit.

$ size drivers/acpi/acpica/utmisc.o.*
   text	   data	    bss	    dec	    hex	filename
   7822	     56	   1832	   9710	   25ee	drivers/acpi/acpica/utmisc.o.old
   7748	     56	   1736	   9540	   2544	drivers/acpi/acpica/utmisc.o.new

Depends on net-next commit 7db6f5fb65a82af03229eef104dc9899c5eecf33
(vsprintf: Recursive vsnprintf: Add "%pV", struct va_format)

Compile tested only

Signed-off-by: Joe Perches <joe@perches.com>
---
 drivers/acpi/acpica/utmisc.c |   78 ++++++++++++++++++++++++++++--------------
 1 files changed, 52 insertions(+), 26 deletions(-)

diff --git a/drivers/acpi/acpica/utmisc.c b/drivers/acpi/acpica/utmisc.c
index e8d0724..d9a4a64 100644
--- a/drivers/acpi/acpica/utmisc.c
+++ b/drivers/acpi/acpica/utmisc.c
@@ -53,8 +53,8 @@ ACPI_MODULE_NAME("utmisc")
 /*
  * Common suffix for messages
  */
-#define ACPI_COMMON_MSG_SUFFIX \
-	acpi_os_printf(" (%8.8X/%s-%u)\n", ACPI_CA_VERSION, module_name, line_number)
+#define ACPI_COMMON_MSG_SUFFIX_FMT  " (%8.8X/%s-%u)"
+#define ACPI_COMMON_MSG_SUFFIX_ARGS ACPI_CA_VERSION, module_name, line_number
 /*******************************************************************************
  *
  * FUNCTION:    acpi_ut_validate_exception
@@ -1063,12 +1063,16 @@ void ACPI_INTERNAL_VAR_XFACE
 acpi_error(const char *module_name, u32 line_number, const char *format, ...)
 {
 	va_list args;
-
-	acpi_os_printf("ACPI Error: ");
+	struct va_format vaf;
 
 	va_start(args, format);
-	acpi_os_vprintf(format, args);
-	ACPI_COMMON_MSG_SUFFIX;
+
+	vaf.fmt = format;
+	vaf.va = &args;
+
+	acpi_os_printf("ACPI Error: %pV" ACPI_COMMON_MSG_SUFFIX_FMT "\n",
+		       &vaf, ACPI_COMMON_MSG_SUFFIX_ARGS);
+
 	va_end(args);
 }
 
@@ -1077,12 +1081,18 @@ acpi_exception(const char *module_name,
 	       u32 line_number, acpi_status status, const char *format, ...)
 {
 	va_list args;
-
-	acpi_os_printf("ACPI Exception: %s, ", acpi_format_exception(status));
+	struct va_format vaf;
 
 	va_start(args, format);
-	acpi_os_vprintf(format, args);
-	ACPI_COMMON_MSG_SUFFIX;
+
+	vaf.fmt = format;
+	vaf.va = &args;
+
+	acpi_os_printf("ACPI Exception: %s, %pV"
+		       ACPI_COMMON_MSG_SUFFIX_FMT "\n",
+		       acpi_format_exception(status), &vaf,
+		       ACPI_COMMON_MSG_SUFFIX_ARGS);
+
 	va_end(args);
 }
 
@@ -1090,12 +1100,16 @@ void ACPI_INTERNAL_VAR_XFACE
 acpi_warning(const char *module_name, u32 line_number, const char *format, ...)
 {
 	va_list args;
-
-	acpi_os_printf("ACPI Warning: ");
+	struct va_format vaf;
 
 	va_start(args, format);
-	acpi_os_vprintf(format, args);
-	ACPI_COMMON_MSG_SUFFIX;
+
+	vaf.fmt = format;
+	vaf.va = &args;
+
+	acpi_os_printf("ACPI Warning: %pV" ACPI_COMMON_MSG_SUFFIX_FMT "\n",
+		       &vaf, ACPI_COMMON_MSG_SUFFIX_ARGS);
+
 	va_end(args);
 }
 
@@ -1103,12 +1117,15 @@ void ACPI_INTERNAL_VAR_XFACE
 acpi_info(const char *module_name, u32 line_number, const char *format, ...)
 {
 	va_list args;
-
-	acpi_os_printf("ACPI: ");
+	struct va_format vaf;
 
 	va_start(args, format);
-	acpi_os_vprintf(format, args);
-	acpi_os_printf("\n");
+
+	vaf.fmt = format;
+	vaf.va = &args;
+
+	acpi_os_printf("ACPI: %pV\n", &vaf);
+
 	va_end(args);
 }
 
@@ -1143,6 +1160,7 @@ acpi_ut_predefined_warning(const char *module_name,
 			   u8 node_flags, const char *format, ...)
 {
 	va_list args;
+	struct va_format vaf;
 
 	/*
 	 * Warning messages for this method/object will be disabled after the
@@ -1152,11 +1170,15 @@ acpi_ut_predefined_warning(const char *module_name,
 		return;
 	}
 
-	acpi_os_printf("ACPI Warning for %s: ", pathname);
-
 	va_start(args, format);
-	acpi_os_vprintf(format, args);
-	ACPI_COMMON_MSG_SUFFIX;
+
+	vaf.fmt = format;
+	vaf.va = &args;
+
+	acpi_os_printf("ACPI Warning for %s: %pV"
+		       ACPI_COMMON_MSG_SUFFIX_FMT "\n",
+		       pathname, &vaf, ACPI_COMMON_MSG_SUFFIX_ARGS);
+
 	va_end(args);
 }
 
@@ -1185,6 +1207,7 @@ acpi_ut_predefined_info(const char *module_name,
 			char *pathname, u8 node_flags, const char *format, ...)
 {
 	va_list args;
+	struct va_format vaf;
 
 	/*
 	 * Warning messages for this method/object will be disabled after the
@@ -1194,10 +1217,13 @@ acpi_ut_predefined_info(const char *module_name,
 		return;
 	}
 
-	acpi_os_printf("ACPI Info for %s: ", pathname);
-
 	va_start(args, format);
-	acpi_os_vprintf(format, args);
-	ACPI_COMMON_MSG_SUFFIX;
+
+	vaf.fmt = format;
+	vaf.va = &args;
+
+	acpi_os_printf("ACPI Info for %s: %pV" ACPI_COMMON_MSG_SUFFIX_FMT "\n",
+		       pathname, &vaf, ACPI_COMMON_MSG_SUFFIX_ARGS);
+
 	va_end(args);
 }



^ permalink raw reply related

* Re: [RFC PATCH] ipv6: Make IP6CB(skb)->nhoff 16-bit.
From: David Miller @ 2010-07-20  5:01 UTC (permalink / raw)
  To: netdev
In-Reply-To: <20100715.223011.58409183.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Thu, 15 Jul 2010 22:30:11 -0700 (PDT)

> 
> Even with jumbograms I cannot see any way in which we would need
> to records a larger than 65535 valued next-header offset.
> 
> The maximum extension header length is (256 << 3) == 2048.
> There are only a handful of extension headers specified which
> we'd even accept (say 5 or 6), therefore the largest next-header
> offset we'd ever have to contend with is something less than
> say 16k.
> 
> Therefore make it a u16 instead of a u32.
> 
> Signed-off-by: David S. Miller <davem@davemloft.net>
> ---
> Can anyone find any holes in this?  I want to do this so we
> can legitimately rip ndisc_nodetype out of struct sk_buff
> and stick it into the IP6CB where it belongs.  Currently
> there isn't enough space, but with this change there will
> be.

No holes claimed by anyone, so... applied :-)

^ permalink raw reply

* Re: [RFC PATCH v3 4/5] skb: add tracepoints to freeing skb
From: Eric Dumazet @ 2010-07-20  4:54 UTC (permalink / raw)
  To: Koki Sanagi
  Cc: netdev, linux-kernel, davem, kaneshige.kenji, izumi.taku,
	kosaki.motohiro, nhorman, laijs, scott.a.mcmillan, rostedt,
	fweisbec, mathieu.desnoyers
In-Reply-To: <4C44F286.1050907@jp.fujitsu.com>

Le mardi 20 juillet 2010 à 09:49 +0900, Koki Sanagi a écrit :
> [RFC PATCH v3 4/5] skb: add tracepoints to freeing skb
> This patch adds tracepoint to consume_skb, dev_kfree_skb_irq and
> skb_free_datagram_locked. Combinating with tracepoint on dev_hard_start_xmit,
> we can check how long it takes to free transmited packets. And using it, we can
> calculate how many packets driver had at that time. It is useful when a drop of
> transmited packet is a problem.
> 
>           <idle>-0     [001] 241409.218333: consume_skb: skbaddr=dd6b2fb8
>           <idle>-0     [001] 241409.490555: dev_kfree_skb_irq: skbaddr=f5e29840
> 
>         udp-recv-302   [001] 515031.206008: skb_free_datagram_locked: skbaddr=f5b1d900
> 
> 
> Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
> ---
>  include/trace/events/skb.h |   42 ++++++++++++++++++++++++++++++++++++++++++
>  net/core/datagram.c        |    1 +
>  net/core/dev.c             |    2 ++
>  net/core/skbuff.c          |    1 +
>  4 files changed, 46 insertions(+), 0 deletions(-)
> 
> diff --git a/include/trace/events/skb.h b/include/trace/events/skb.h
> index 4b2be6d..84c9041 100644
> --- a/include/trace/events/skb.h
> +++ b/include/trace/events/skb.h
> @@ -35,6 +35,48 @@ TRACE_EVENT(kfree_skb,
>  		__entry->skbaddr, __entry->protocol, __entry->location)
>  );
>  
> +DECLARE_EVENT_CLASS(free_skb,
> +
> +	TP_PROTO(struct sk_buff *skb),
> +
> +	TP_ARGS(skb),
> +
> +	TP_STRUCT__entry(
> +		__field(	void *,	skbaddr	)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->skbaddr = skb;
> +	),
> +
> +	TP_printk("skbaddr=%p", __entry->skbaddr)
> +
> +);
> +
> +DEFINE_EVENT(free_skb, consume_skb,
> +
> +	TP_PROTO(struct sk_buff *skb),
> +
> +	TP_ARGS(skb)
> +
> +);
> +
> +DEFINE_EVENT(free_skb, dev_kfree_skb_irq,
> +
> +	TP_PROTO(struct sk_buff *skb),
> +
> +	TP_ARGS(skb)
> +
> +);
> +
> +DEFINE_EVENT(free_skb, skb_free_datagram_locked,
> +
> +	TP_PROTO(struct sk_buff *skb),
> +
> +	TP_ARGS(skb)
> +
> +);
> +
>  TRACE_EVENT(skb_copy_datagram_iovec,
>  
>  	TP_PROTO(const struct sk_buff *skb, int len),
> diff --git a/net/core/datagram.c b/net/core/datagram.c
> index f5b6f43..1ea32a0 100644
> --- a/net/core/datagram.c
> +++ b/net/core/datagram.c
> @@ -231,6 +231,7 @@ void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb)
>  {
>  	bool slow;
>  
> +	trace_skb_free_datagram_locked(skb);

Here you unconditionally trace before the test on skb->users

>  	if (likely(atomic_read(&skb->users) == 1))
>  		smp_rmb();
>  	else if (likely(!atomic_dec_and_test(&skb->users)))
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 4acfec6..d979847 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -131,6 +131,7 @@
>  #include <linux/random.h>
>  #include <trace/events/napi.h>
>  #include <trace/events/net.h>
> +#include <trace/events/skb.h>
>  #include <linux/pci.h>
>  
>  #include "net-sysfs.h"
> @@ -1581,6 +1582,7 @@ void dev_kfree_skb_irq(struct sk_buff *skb)
>  		struct softnet_data *sd;
>  		unsigned long flags;
>  
> +		trace_dev_kfree_skb_irq(skb);
>  		local_irq_save(flags);
>  		sd = &__get_cpu_var(softnet_data);
>  		skb->next = sd->completion_queue;
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 34432b4..a7b4036 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -466,6 +466,7 @@ void consume_skb(struct sk_buff *skb)
>  		smp_rmb();
>  	else if (likely(!atomic_dec_and_test(&skb->users)))
>  		return;

While here you trace _after_ the test on skb->users - and a "return;" ,
so you miss some consume_skb() calls


> +	trace_consume_skb(skb);
>  	__kfree_skb(skb);
>  }
>  EXPORT_SYMBOL(consume_skb);
> 

^ permalink raw reply

* Re: [net-next-2.6 PATCH] e1000: allow option to limit number of descriptors down to 48 per ring
From: Eric Dumazet @ 2010-07-20  4:40 UTC (permalink / raw)
  To: Jeff Kirsher; +Cc: davem, netdev, gospo, bphilips, Alexander Duyck
In-Reply-To: <20100719234219.13875.90302.stgit@localhost.localdomain>

Le lundi 19 juillet 2010 à 16:43 -0700, Jeff Kirsher a écrit :
> From: Alexander Duyck <alexander.h.duyck@intel.com>
> 
> This change makes it possible to limit the number of descriptors down to 48
> per ring.  The reason for this change is to address a variation on hardware
> errata 10 for 82546GB in which descriptors will be lost if more than 32
> descriptors are fetched and the PCI-X MRBC is 512.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> Tested-by: Emil Tantilov <emil.s.tantilov@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> ---
> 
>  drivers/net/e1000/e1000.h |    4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/e1000/e1000.h b/drivers/net/e1000/e1000.h
> index 40b62b4..65298a6 100644
> --- a/drivers/net/e1000/e1000.h
> +++ b/drivers/net/e1000/e1000.h
> @@ -86,12 +86,12 @@ struct e1000_adapter;
>  /* TX/RX descriptor defines */
>  #define E1000_DEFAULT_TXD                  256
>  #define E1000_MAX_TXD                      256
> -#define E1000_MIN_TXD                       80
> +#define E1000_MIN_TXD                       48
>  #define E1000_MAX_82544_TXD               4096
>  
>  #define E1000_DEFAULT_RXD                  256
>  #define E1000_MAX_RXD                      256
> -#define E1000_MIN_RXD                       80
> +#define E1000_MIN_RXD                       48
>  #define E1000_MAX_82544_RXD               4096
>  
>  #define E1000_MIN_ITR_USECS		10 /* 100000 irq/sec */
> 

So this limit is a pure software one ?

Why not let an admin chose a lower limit if he wants to ?

I am asking because big ring sizes can be a latency source in some
workloads.

Thanks



^ permalink raw reply

* Re: [PATCH] net: Add batman-adv meshing protocol
From: David Miller @ 2010-07-20  4:26 UTC (permalink / raw)
  To: sven.eckelmann-Mmb7MZpHnFY
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	b.a.t.m.a.n-ZwoEplunGu2X36UT3dwlltHuzzzSOjJt
In-Reply-To: <1279291156-5297-2-git-send-email-sven.eckelmann-Mmb7MZpHnFY@public.gmane.org>

From: Sven Eckelmann <sven.eckelmann-Mmb7MZpHnFY@public.gmane.org>
Date: Fri, 16 Jul 2010 16:39:16 +0200

> +/* count the hamming weight, how many good packets did we receive? just count
> + * the 1's. The inner loop uses the Kernighan algorithm, see
> + * http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan
> + */
> +int bit_packet_count(TYPE_OF_WORD *seq_bits)
> +{
> +	int i, hamming = 0;
> +	TYPE_OF_WORD word;
> +
> +	for (i = 0; i < NUM_WORDS; i++) {
> +		word = seq_bits[i];
> +
> +		while (word) {
> +			word &= word-1;
> +			hamming++;
> +		}
> +	}
> +	return hamming;
> +}

The kernel has a hamming weight library function which takes advantage
of population count instructions on cpus that suport it, and also has
a sw version than is faster than what you're doing here, please use
it.

The interfaces are called "hweight{8,16,32,64}()" where the number in
the name indicates the bit-size of the word the interface operates on.

I also notice that this code uses it's own internal buffering scheme
with kmalloc()'d buffers, then seperately allocates actual SKB's and
copies the data there.

Just use the SKB facilities how they were designed to be used, instead
of needlessly inventing new things.  Allocate your initial SKB and put
the initial forwarding header in it, then when you want to send a copy
off, skb_clone() it, and push the other bits you want at the head
and/or the tail of the cloned SKB, then simply send it off.

^ permalink raw reply

* Re: [PATCH net-next 0/3] cxgb4vf: fixes for several small issues discovered by QA
From: David Miller @ 2010-07-20  3:40 UTC (permalink / raw)
  To: leedom; +Cc: netdev
In-Reply-To: <201007191812.10459.leedom@chelsio.com>

From: Casey Leedom <leedom@chelsio.com>
Date: Mon, 19 Jul 2010 18:12:10 -0700

>   A couple of small (but important) fixes discovered by our QA people.  I've also 
> included a patch to add myself as the maintainer of cxgb4vf in the MAINTAINERS 
> file which I think is the protocol but please correct me if changes to that file 
> are usually performed by someone else.

Well, where are they?

^ permalink raw reply

* Re: [PATCH net-next 4/4] bnx2: Update version to 2.0.17.
From: David Miller @ 2010-07-20  3:31 UTC (permalink / raw)
  To: mchan; +Cc: netdev
In-Reply-To: <1279584905-15084-4-git-send-email-mchan@broadcom.com>

From: "Michael Chan" <mchan@broadcom.com>
Date: Mon, 19 Jul 2010 17:15:05 -0700

> Signed-off-by: Michael Chan <mchan@broadcom.com>

Applied.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox