Netdev List
* Re: [net-next-2.6 PATCH 2/5] ixgbe: drop support for UDP in RSS hash generation
From: Eric Dumazet @ 2010-07-20  6:30 UTC (permalink / raw)
  To: Bill Fink
  Cc: David Miller, jeffrey.t.kirsher, netdev, gospo, bphilips,
	alexander.h.duyck, donald.c.skidmore
In-Reply-To: <20100720020754.135b5ff7.billfink@mindspring.com>

On Tuesday, 20 July 2010 at 02:07 -0400, Bill Fink wrote:
> On Mon, 19 Jul 2010, David Miller wrote:
> 

> Should there be a /proc or ethtool setting for whether or not to
> use RSS hashing for UDP flows?  I would think that for many common
> UDP applications, IP fragmentation would not be an issue because
> they often tend to use sub-MTU sized datagrams.  And of course
> UDP does not guarantee in-order delivery in any event.  Then a
> remaining issue is what the default setting of such an option
> should be.  I would lean to having it enabled by default, but
> I can also see the safety argument for having it off by default.
> 

There are several issues here.

1) Ability for the NIC to spread UDP loads across several queues.

2) Ability for the NIC to provide the hash to our stack, to speed up
RPS a bit.


If the patch is about 1), i.e. it disables the NIC's ability to split UDP
flows across several RX queues, then yes: it's probably _not_ wanted.



The commit message is not very clear on this topic.

By nature, UDP flows are subject to out-of-order issues, so what is this
patch trying to avoid?




^ permalink raw reply

* Re: [net-next-2.6 PATCH 2/5] ixgbe: drop support for UDP in RSS hash generation
From: David Miller @ 2010-07-20  6:39 UTC (permalink / raw)
  To: eric.dumazet
  Cc: billfink, jeffrey.t.kirsher, netdev, gospo, bphilips,
	alexander.h.duyck, donald.c.skidmore
In-Reply-To: <1279607400.2458.76.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 20 Jul 2010 08:30:00 +0200

> By nature, UDP flows are subject to out-of-order issues, so what is this
> patch trying to avoid?

UDP being subject to out-of-order issues really doesn't matter one
bit.

We should never, _knowingly_ create out-of-order packets in our
networking stack.  And this is regardless of protocol.

If there is no way to make ixgbe respect in-order packet delivery for
UDP frames vis-a-vis fragmented frames, we must disable RX flow
spreading for UDP.
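
The ordering problem with fragments can be sketched in a few lines of
userspace C. This is an illustration only, not ixgbe code: struct pkt,
toy_hash() and pick_queue() are hypothetical names, and the mixing
function is a stand-in for whatever hash the hardware really computes.
The sketch assumes the NIC hashes addresses plus ports for unfragmented
UDP but can only hash addresses for fragmented datagrams, since
fragments after the first carry no UDP header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical packet summary; not a kernel or driver structure. */
struct pkt {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
    bool     fragmented;   /* true if the IP datagram was fragmented */
};

/* Toy mixing function standing in for the hardware hash. */
static uint32_t toy_hash(uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t h = a * 2654435761u;

    h ^= b + 0x9e3779b9u + (h << 6) + (h >> 2);
    h ^= c + 0x9e3779b9u + (h << 6) + (h >> 2);
    return h;
}

/*
 * Choose an RX queue. With udp_rss enabled, ports feed the hash only
 * when they are visible, i.e. for unfragmented datagrams. With it
 * disabled, every packet of the address pair hashes the same way.
 */
static unsigned pick_queue(const struct pkt *p, unsigned nqueues, bool udp_rss)
{
    uint32_t ports = 0;

    assert(nqueues > 0);
    if (udp_rss && !p->fragmented)
        ports = ((uint32_t)p->sport << 16) | p->dport;
    return toy_hash(p->saddr, p->daddr, ports) % nqueues;
}
```

With udp_rss set, a small datagram and a fragmented datagram of the same
flow can be steered to different queues and so be processed out of
order. With it cleared they always share a queue, at the cost of losing
per-port spreading for every UDP flow.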

^ permalink raw reply

* Re: [net-next-2.6 PATCH 2/5] ixgbe: drop support for UDP in RSS hash generation
From: Eric Dumazet @ 2010-07-20  6:39 UTC (permalink / raw)
  To: Jeff Kirsher
  Cc: davem, netdev, gospo, bphilips, Alexander Duyck, Don Skidmore
In-Reply-To: <20100719235925.14112.65890.stgit@localhost.localdomain>

On Monday, 19 July 2010 at 16:59 -0700, Jeff Kirsher wrote:
> From: Alexander Duyck <alexander.h.duyck@intel.com>
> 
> This change removes UDP from the supported protocols for RSS hashing.  The
> reason for removing this protocol is because IP fragmentation was causing a
> network flow to be broken into two streams, one for fragmented, and one for
> non-fragmented and this in turn was causing out-of-order issues.
> 

Jeff, does this mean all UDP packets are going to be delivered to a single
queue?

This would be a serious regression.

Many UDP applications try hard not to use fragments.

They are going to pay the price because some other application:
- Uses big, fragmented segments.
- Is subject to OOO artifacts.

We would like some clarifications please :)




^ permalink raw reply

* Re: [net-next-2.6 PATCH 2/5] ixgbe: drop support for UDP in RSS hash generation
From: David Miller @ 2010-07-20  6:44 UTC (permalink / raw)
  To: eric.dumazet
  Cc: jeffrey.t.kirsher, netdev, gospo, bphilips, alexander.h.duyck,
	donald.c.skidmore
In-Reply-To: <1279607980.2458.82.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 20 Jul 2010 08:39:40 +0200

> This would be a serious regression.

The regression is in the hardware Eric.

> Many UDP applications try hard not to use fragments.
> 
> They are going to pay the price because some other application:
> - Uses big, fragmented segments.
> - Is subject to OOO artifacts.

None of this matters.  If the hardware can't separate flows properly,
it's tough cookies.

We may never reorder packets in our stack by our own doing.  The
only safe default is to turn UDP flow separation off on chips
like ixgbe.

If you want an ethtool knob to turn it back on for your machines,
fine.  But it can never be enabled by default.

^ permalink raw reply

* Re: [RFC PATCH v3 4/5] skb: add tracepoints to freeing skb
From: Koki Sanagi @ 2010-07-20  6:47 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, linux-kernel, davem, kaneshige.kenji, izumi.taku,
	kosaki.motohiro, nhorman, laijs, scott.a.mcmillan, rostedt,
	fweisbec, mathieu.desnoyers
In-Reply-To: <1279601643.2458.64.camel@edumazet-laptop>

(2010/07/20 13:54), Eric Dumazet wrote:
> On Tuesday, 20 July 2010 at 09:49 +0900, Koki Sanagi wrote:
>> [RFC PATCH v3 4/5] skb: add tracepoints to freeing skb
>> This patch adds tracepoints to consume_skb, dev_kfree_skb_irq and
>> skb_free_datagram_locked. Combined with the tracepoint on dev_hard_start_xmit,
>> we can check how long it takes to free transmitted packets. Using that, we can
>> calculate how many packets the driver held at that time. It is useful when a drop of
>> transmitted packets is a problem.
>>
>>           <idle>-0     [001] 241409.218333: consume_skb: skbaddr=dd6b2fb8
>>           <idle>-0     [001] 241409.490555: dev_kfree_skb_irq: skbaddr=f5e29840
>>
>>         udp-recv-302   [001] 515031.206008: skb_free_datagram_locked: skbaddr=f5b1d900
>>
>>
>> Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
>> ---
>>  include/trace/events/skb.h |   42 ++++++++++++++++++++++++++++++++++++++++++
>>  net/core/datagram.c        |    1 +
>>  net/core/dev.c             |    2 ++
>>  net/core/skbuff.c          |    1 +
>>  4 files changed, 46 insertions(+), 0 deletions(-)
>>
>> diff --git a/include/trace/events/skb.h b/include/trace/events/skb.h
>> index 4b2be6d..84c9041 100644
>> --- a/include/trace/events/skb.h
>> +++ b/include/trace/events/skb.h
>> @@ -35,6 +35,48 @@ TRACE_EVENT(kfree_skb,
>>  		__entry->skbaddr, __entry->protocol, __entry->location)
>>  );
>>  
>> +DECLARE_EVENT_CLASS(free_skb,
>> +
>> +	TP_PROTO(struct sk_buff *skb),
>> +
>> +	TP_ARGS(skb),
>> +
>> +	TP_STRUCT__entry(
>> +		__field(	void *,	skbaddr	)
>> +	),
>> +
>> +	TP_fast_assign(
>> +		__entry->skbaddr = skb;
>> +	),
>> +
>> +	TP_printk("skbaddr=%p", __entry->skbaddr)
>> +
>> +);
>> +
>> +DEFINE_EVENT(free_skb, consume_skb,
>> +
>> +	TP_PROTO(struct sk_buff *skb),
>> +
>> +	TP_ARGS(skb)
>> +
>> +);
>> +
>> +DEFINE_EVENT(free_skb, dev_kfree_skb_irq,
>> +
>> +	TP_PROTO(struct sk_buff *skb),
>> +
>> +	TP_ARGS(skb)
>> +
>> +);
>> +
>> +DEFINE_EVENT(free_skb, skb_free_datagram_locked,
>> +
>> +	TP_PROTO(struct sk_buff *skb),
>> +
>> +	TP_ARGS(skb)
>> +
>> +);
>> +
>>  TRACE_EVENT(skb_copy_datagram_iovec,
>>  
>>  	TP_PROTO(const struct sk_buff *skb, int len),
>> diff --git a/net/core/datagram.c b/net/core/datagram.c
>> index f5b6f43..1ea32a0 100644
>> --- a/net/core/datagram.c
>> +++ b/net/core/datagram.c
>> @@ -231,6 +231,7 @@ void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb)
>>  {
>>  	bool slow;
>>  
>> +	trace_skb_free_datagram_locked(skb);
> 
> Here you unconditionally trace before the test on skb->users
> 
>>  	if (likely(atomic_read(&skb->users) == 1))
>>  		smp_rmb();
>>  	else if (likely(!atomic_dec_and_test(&skb->users)))
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 4acfec6..d979847 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -131,6 +131,7 @@
>>  #include <linux/random.h>
>>  #include <trace/events/napi.h>
>>  #include <trace/events/net.h>
>> +#include <trace/events/skb.h>
>>  #include <linux/pci.h>
>>  
>>  #include "net-sysfs.h"
>> @@ -1581,6 +1582,7 @@ void dev_kfree_skb_irq(struct sk_buff *skb)
>>  		struct softnet_data *sd;
>>  		unsigned long flags;
>>  
>> +		trace_dev_kfree_skb_irq(skb);
>>  		local_irq_save(flags);
>>  		sd = &__get_cpu_var(softnet_data);
>>  		skb->next = sd->completion_queue;
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index 34432b4..a7b4036 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -466,6 +466,7 @@ void consume_skb(struct sk_buff *skb)
>>  		smp_rmb();
>>  	else if (likely(!atomic_dec_and_test(&skb->users)))
>>  		return;
> 
> While here you trace _after_ the test on skb->users (and a "return;"),
> so you miss some consume_skb() calls
> 
Yeah, I'll move trace_consume_skb() before the test.
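
The effect of the trace placement can be sketched in userspace C as
follows. This is a simplified model, not kernel code: struct buf,
trace() and consume() are hypothetical names, and a plain int replaces
the atomic skb->users counter, so the sketch is not thread-safe.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an skb: only the reference count matters. */
struct buf {
    int users;
};

static int trace_hits;              /* how often the trace point fired */

static void trace(void)
{
    trace_hits++;
}

/*
 * Drop one reference. Returns true when the last reference went away.
 * trace_first selects where the trace point sits relative to the
 * reference-count test.
 */
static bool consume(struct buf *b, bool trace_first)
{
    assert(b->users > 0);

    if (trace_first)
        trace();                    /* fires on every call */
    if (--b->users > 0)
        return false;               /* early return: still referenced */
    if (!trace_first)
        trace();                    /* fires only on the final call */
    return true;
}
```

With two references held, tracing after the test records one event for
two calls, while tracing before the test records both calls.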

Thanks,
Koki Sanagi.

> 
>> +	trace_consume_skb(skb);
>>  	__kfree_skb(skb);
>>  }
>>  EXPORT_SYMBOL(consume_skb);
>>
> 
> 
> 
> 

^ permalink raw reply

* [PATCH] phy: add suspend/resume in the ic+
From: Giuseppe CAVALLARO @ 2010-07-20  7:12 UTC (permalink / raw)
  To: netdev; +Cc: Giuseppe Cavallaro

Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
---
 drivers/net/phy/icplus.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/net/phy/icplus.c b/drivers/net/phy/icplus.c
index 439adaf..3f2583f 100644
--- a/drivers/net/phy/icplus.c
+++ b/drivers/net/phy/icplus.c
@@ -116,6 +116,8 @@ static struct phy_driver ip175c_driver = {
 	.config_init	= &ip175c_config_init,
 	.config_aneg	= &ip175c_config_aneg,
 	.read_status	= &ip175c_read_status,
+	.suspend	= genphy_suspend,
+	.resume		= genphy_resume,
 	.driver		= { .owner = THIS_MODULE,},
 };
 
-- 
1.5.5.6


^ permalink raw reply related

* Re: [PATCH] net: Add batman-adv meshing protocol
From: Sven Eckelmann @ 2010-07-20  8:28 UTC (permalink / raw)
  To: b.a.t.m.a.n-ZwoEplunGu2X36UT3dwllkB+6BGkLq7r, Simon Wunderlich
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, David Miller
In-Reply-To: <20100719.212625.255369607.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>

[-- Attachment #1: Type: Text/Plain, Size: 1679 bytes --]

Thanks a lot for your review and for your hints.

David Miller wrote:
> From: Sven Eckelmann <sven.eckelmann-Mmb7MZpHnFY@public.gmane.org>
> Date: Fri, 16 Jul 2010 16:39:16 +0200
[...]
> 
> The kernel has a hamming weight library function which takes advantage
> of population count instructions on cpus that support it, and also has
> a sw version that is faster than what you're doing here, please use
> it.
> 
> The interfaces are called "hweight{8,16,32,64}()" where the number in
> the name indicates the bit-size of the word the interface operates on.

Correct, the inner loop is a quite straightforward implementation without any 
kind of optimization. I will change that.
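
For comparison, here is a userspace sketch of a bit-at-a-time count next
to a branch-free parallel count of the kind a software hamming weight
routine can use. The names popcount_naive(), popcount_parallel() and
popcount_selfcheck() are hypothetical; in kernel code the hweight32()
interface mentioned above would be called instead of either function.

```c
#include <assert.h>
#include <stdint.h>

/* One loop iteration per bit position: simple, but up to 32 steps. */
static unsigned popcount_naive(uint32_t w)
{
    unsigned n = 0;

    while (w) {
        n += w & 1u;
        w >>= 1;
    }
    return n;
}

/* Sum bits in parallel: 2-bit, 4-bit, then 8-bit fields, then add bytes. */
static unsigned popcount_parallel(uint32_t w)
{
    w = w - ((w >> 1) & 0x55555555u);
    w = (w & 0x33333333u) + ((w >> 2) & 0x33333333u);
    w = (w + (w >> 4)) & 0x0f0f0f0fu;
    return (w * 0x01010101u) >> 24;
}

/* Check that both versions agree on a few sample words. */
static void popcount_selfcheck(void)
{
    static const uint32_t samples[] = { 0u, 1u, 0x80000001u, 0xffffffffu };
    unsigned i;

    for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
        assert(popcount_naive(samples[i]) == popcount_parallel(samples[i]));
}
```

The parallel version takes a fixed number of operations regardless of
how many bits are set, which is why it beats the loop for dense words.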

> I also notice that this code uses its own internal buffering scheme
> with kmalloc()'d buffers, then separately allocates actual SKB's and
> copies the data there.
> 
> Just use the SKB facilities how they were designed to be used, instead
> of needlessly inventing new things.  Allocate your initial SKB and put
> the initial forwarding header in it, then when you want to send a copy
> off, skb_clone() it, and push the other bits you want at the head
> and/or the tail of the cloned SKB, then simply send it off.

Good catch. That comes from a time when batman-adv was a minimalistic 
conversion of the userspace proof of concept implementation. This happens 
for example in vis.c, icmp_socket.c and send.c (just grepping for 
send_raw_packet is a good way to find those places). But it is also happening 
with batman_if->packet_buff in schedule_own_packet and similar places.

I would leave that to the original author of those functions.

thanks,
	Sven

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* Re: [PATCHv2] tcp: fix crash in tcp_xmit_retransmit_queue
From: Ilpo Järvinen @ 2010-07-20  8:33 UTC (permalink / raw)
  To: David Miller
  Cc: eric.dumazet, lennart.schulte, tj, LKML, Netdev, henning.fehrmann,
	carsten.aulbert
In-Reply-To: <20100719.125500.257479409.davem@davemloft.net>

On Mon, 19 Jul 2010, David Miller wrote:

> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 19 Jul 2010 19:39:08 +0200
> 
> > Do you know in what exact circumstance the bug triggers ?
> > 
> > It's hard to believe thousands of machines on the Internet never hit
> > it :(
> > 
> > Maybe another problem in congestion control ?
> 
> This is something to investigate, but the conditions under which
> tcp_fastretrans_alert() (the main invoker of tcp_xmit_retransmit_queue())
> does its thing are complicated enough that I'm going to add this fix
> for the time being and push it out to stable too.

This is so true. ...So far I've twice managed to rule out the
possibility of this being really triggerable (i.e., it would mean
Lennart's out-of-tree changes broke it), and once in between came
to the opposite conclusion. Thus by majority voting we can deduce that it
won't happen - how reassuring :-/. It seems that tcp_try_undo_recovery
causes a return if TCP remained in CA_Loss/CA_Recovery, that
tcp_time_to_recover won't really let us past the return either under normal
circumstances (more details below), and that tcp_simple_retransmit
requires lost_out to change; this seems safe in mainline to me.

Hmm... It seems that I've just solved another report too. ...Somebody a 
while back found out that setting reordering sysctl to zero (ie. to a 
value which does not make too much sense) crashed the kernel. It seems 
that at least then tcp_time_to_recover() would return true and trigger 
this bug (though I'm not sure if that's the only breakage to happen).

Also worth keeping in mind is the bugzilla entry ("New freez in
TCP" or something like that), so I can't really say for sure that
nobody ever hit it. The bugzilla one goes away by disabling SACK (at least
for some) but it might mix two different issues. It seems that there
really are two different issues; the other may have something to do with
SACK, though other variables are then involved, e.g., the changes in
retransmission logic/timing, so it's impossible to say whether disabling SACK
really "fixed" the bugzilla one or not. Also, Tejun's ->next == NULL
finding points to a different bug than Lennart's.


-- 
 i.

^ permalink raw reply

* recv(2), MSG_TRUNK and kernels older than 2.6.22
From: Roy Marples @ 2010-07-20  8:26 UTC (permalink / raw)
  To: netdev

Hi

I would like to support all possible kernels I can and previously used a 
fixed buffer of size 256 to read from netlink sockets. This is now 
proving too small for some 64-bit kernels so I would like to use recv(2) 
with MSG_TRUNC to work out the size. However, the man page says that this 
only works for 2.6.22 kernels or newer.

My question is, what is the behaviour of recv on older kernels where 
MSG_TRUNC is not supported? I would rather not use some arbitrary size if 
at all possible.

Reply directly please as I'm not subscribed here.

Thanks

Roy

^ permalink raw reply

* Re: oops in tcp_xmit_retransmit_queue() w/ v2.6.32.15
From: Ilpo Järvinen @ 2010-07-20  8:41 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Lennart Schulte, Eric Dumazet, David S. Miller, lkml,
	netdev@vger.kernel.org, Fehrmann, Henning, Carsten Aulbert
In-Reply-To: <4C4467E0.9080907@kernel.org>

[-- Attachment #1: Type: TEXT/PLAIN, Size: 6095 bytes --]

On Mon, 19 Jul 2010, Tejun Heo wrote:

> Hello,
> 
> On 07/16/2010 02:02 PM, Ilpo Järvinen wrote:
> > Besides, Tejun has also found that it's hint->next ptr which is NULL in 
> > his case so this won't solve his case anyway. Tejun, can you confirm 
> > whether it was retransmit_skb_hint->next being NULL on _entry time_ to 
> > tcp_xmit_retransmit_queue() or later on in the loop after the updates done 
> > by the loop itself to the hint (or that your testing didn't conclude 
> > either)?
> 
> Sorry about the delay.  I was traveling last week.  Unfortunately, I
> don't know whether ->next was NULL on entry or not.  I hacked up the
> following ugly patch for the next test run.  It should have everything
> which has come up till now + list and hint sanity checking before
> starting processing them.  I'm planning on deploying it w/ crashdump
> enabled in several days.  If I've missed something, please let me
> know.

One thing that complicates things further is the fact that the write queue 
can change beneath us (ie., in tcp_retrans_try_collapse and tcp_fragment).
I've read them multiple times through and always found them innocent so 
this might be just for-the-record type of a note (at least I hope so).
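
The kind of list sanity checking done by the debugging patch quoted
below can be sketched in userspace C. This is an illustration, not the
kernel's sk_buff list code: struct node, list_init(), list_add_tail()
and list_is_consistent() are hypothetical names, and the sketch assumes
a circular doubly linked list whose sentinel head is itself a node.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical list node; the sentinel head uses the same layout. */
struct node {
    struct node *next, *prev;
};

static void list_init(struct node *head)
{
    head->next = head->prev = head;
}

static void list_add_tail(struct node *head, struct node *n)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/*
 * Walk forward from the head. Every node's prev must point at the node
 * we just came from, no link may be NULL, and the last node reached
 * must be the one the head records as its tail.
 */
static bool list_is_consistent(const struct node *head)
{
    const struct node *prev = head;
    const struct node *cur;

    assert(head != NULL);
    cur = head->next;
    while (cur != head) {
        if (cur == NULL || cur->prev != prev)
            return false;
        prev = cur;
        cur = cur->next;
    }
    return head->prev == prev;
}
```

Running such a check on entry, before any hint is dereferenced, helps
tell a list that was already corrupt apart from one damaged later
inside the loop.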

-- 
 i.

> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index b4ed957..1c8b1e0 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2190,6 +2190,53 @@ static int tcp_can_forward_retransmit(struct sock *sk)
>  	return 1;
>  }
> 
> +static void print_queue(struct sock *sk, struct sk_buff *old, struct sk_buff *hole)
> +{
> +	struct tcp_sock *tp = tcp_sk(sk);
> +	struct sk_buff *skb, *prev;
> +	bool do_panic = false;
> +
> +	skb = tcp_write_queue_head(sk);
> +	prev = (struct sk_buff *)(&sk->sk_write_queue);
> +
> +	if (skb == NULL) {
> +		printk("XXX NULL head, pkts %u\n", tp->packets_out);
> +		do_panic = true;
> +	}
> +
> +	printk("XXX head %p tail %p sendhead %p oldhint %p now %p hole %p high %u\n",
> +	       tcp_write_queue_head(sk), tcp_write_queue_tail(sk),
> +	       tcp_send_head(sk), old, tp->retransmit_skb_hint, hole,
> +	       tp->retransmit_high);
> +
> +	while (skb) {
> +		printk("XXX skb %p (%u-%u) next %p prev %p sacked %u\n",
> +		       skb, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq,
> +		       skb->next, skb->prev, TCP_SKB_CB(skb)->sacked);
> +		if (prev != skb->prev) {
> +			printk("XXX Inconsistent prev\n");
> +			do_panic = true;
> +		}
> +
> +		if (skb == tcp_write_queue_tail(sk)) {
> +			if (skb->next != (struct sk_buff *)(&sk->sk_write_queue)) {
> +				printk("XXX Improper next at tail\n");
> +				do_panic = true;
> +			}
> +			break;
> +		}
> +
> +		prev = skb;
> +		skb = skb->next;
> +	}
> +	if (!skb) {
> +		printk("XXX Encountered unexpected NULL\n");
> +		do_panic = true;
> +	}
> +	if (do_panic)
> +		panic("XXX panicking");
> +}
> +
>  /* This gets called after a retransmit timeout, and the initially
>   * retransmitted data is acknowledged.  It tries to continue
>   * resending the rest of the retransmit queue, until either
> @@ -2198,19 +2245,53 @@ static int tcp_can_forward_retransmit(struct sock *sk)
>   * based retransmit packet might feed us FACK information again.
>   * If so, we use it to avoid unnecessarily retransmissions.
>   */
> +static unsigned int caught_it;
> +
>  void tcp_xmit_retransmit_queue(struct sock *sk)
>  {
>  	const struct inet_connection_sock *icsk = inet_csk(sk);
>  	struct tcp_sock *tp = tcp_sk(sk);
> -	struct sk_buff *skb;
> +	struct sk_buff *skb, *prev;
>  	struct sk_buff *hole = NULL;
> +	struct sk_buff *old = tp->retransmit_skb_hint;
>  	u32 last_lost;
>  	int mib_idx;
>  	int fwd_rexmitting = 0;
> +	bool saw_hint = false;
> +
> +	if (!tp->packets_out) {
> +		if (net_ratelimit())
> +			printk("XXX !tp->packets_out, retransmit_skb_hint=%p, write_queue_head=%p\n",
> +			       tp->retransmit_skb_hint, tcp_write_queue_head(sk));
> +		return;
> +	}
> 
>  	if (!tp->lost_out)
>  		tp->retransmit_high = tp->snd_una;
> 
> +	for (skb = tcp_write_queue_head(sk),
> +	     prev = (struct sk_buff *)&sk->sk_write_queue;
> +	     skb != (struct sk_buff *)&sk->sk_write_queue;
> +	     prev = skb, skb = skb->next) {
> +		if (prev != skb->prev) {
> +			printk("XXX sanity check: prev corrupt\n");
> +			print_queue(sk, old, hole);
> +		}
> +		if (skb == tp->retransmit_skb_hint)
> +			saw_hint = true;
> +		if (skb == tcp_write_queue_tail(sk) &&
> +		    skb->next != (struct sk_buff *)(&sk->sk_write_queue)) {
> +			printk("XXX sanity check: end corrupt\n");
> +			print_queue(sk, old, hole);
> +		}
> +	}
> +	if (tp->retransmit_skb_hint && !saw_hint) {
> +		printk("XXX sanity check: retransmit_skb_hint=%p is not on list, claring hint\n",
> +		       tp->retransmit_skb_hint);
> +		print_queue(sk, old, hole);
> +		tp->retransmit_skb_hint = NULL;
> +	}
> +
>  	if (tp->retransmit_skb_hint) {
>  		skb = tp->retransmit_skb_hint;
>  		last_lost = TCP_SKB_CB(skb)->end_seq;
> @@ -2218,7 +2299,17 @@ void tcp_xmit_retransmit_queue(struct sock *sk)
>  			last_lost = tp->retransmit_high;
>  	} else {
>  		skb = tcp_write_queue_head(sk);
> -		last_lost = tp->snd_una;
> +		if (skb)
> +			last_lost = tp->snd_una;
> +	}
> +
> +checknull:
> +	if (skb == NULL) {
> +		print_queue(sk, old, hole);
> +		caught_it++;
> +		if (net_ratelimit())
> +			printk("XXX Errors caught so far %u\n", caught_it);
> +		return;
>  	}
> 
>  	tcp_for_write_queue_from(skb, sk) {
> @@ -2261,7 +2352,7 @@ begin_fwd:
>  		} else if (!(sacked & TCPCB_LOST)) {
>  			if (hole == NULL && !(sacked & (TCPCB_SACKED_RETRANS|TCPCB_SACKED_ACKED)))
>  				hole = skb;
> -			continue;
> +			goto cont;
> 
>  		} else {
>  			last_lost = TCP_SKB_CB(skb)->end_seq;
> @@ -2272,7 +2363,7 @@ begin_fwd:
>  		}
> 
>  		if (sacked & (TCPCB_SACKED_ACKED|TCPCB_SACKED_RETRANS))
> -			continue;
> +			goto cont;
> 
>  		if (tcp_retransmit_skb(sk, skb))
>  			return;
> @@ -2282,6 +2373,9 @@ begin_fwd:
>  			inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
>  						  inet_csk(sk)->icsk_rto,
>  						  TCP_RTO_MAX);
> +cont:
> +		skb = skb->next;
> +		goto checknull;
>  	}
>  }
> 
> 

^ permalink raw reply

* Re: recv(2), MSG_TRUNK and kernels older than 2.6.22
From: Eric Dumazet @ 2010-07-20  8:54 UTC (permalink / raw)
  To: Roy Marples; +Cc: netdev
In-Reply-To: <4C455DC3.1050304@marples.name>

On Tuesday, 20 July 2010 at 09:26 +0100, Roy Marples wrote:
> Hi
> 
> I would like to support all possible kernels I can and previously used a 
> fixed buffer of size 256 to read from netlink sockets. This is now 
> proving too small for some 64-bit kernels so I would like to use recv(2) 
> with MSG_TRUNC to work out the size. However, the man page says that this 
> only works for 2.6.22 kernels or newer.
> 
> My question is, what is the behaviour of recv on older kernels where 
> MSG_TRUNC is not supported? I would rather not use some arbitrary size if 
> at all possible.
> 

Is this for the dhcpcd problem we talked about a few weeks ago, caused by
the new 64-bit stats?

Why do you want a fixed size of 256 bytes?

Using 8192 bytes on the stack would avoid the MSG_TRUNC mess.

static int
get_netlink(int fd, int flags,
    int (*callback)(struct nlmsghdr *))
{
        char buffer[8192];
        ssize_t bytes;
        struct nlmsghdr *nlm;
        int r = -1;

        for (;;) {
                bytes = recv(fd, buffer, sizeof(buffer), flags);
                if (bytes == -1) {
                        if (errno == EAGAIN) {
                                r = 0;
                                goto eexit;
                        }
                        if (errno == EINTR)
                                continue;
                        goto eexit;
                }





^ permalink raw reply

* Re: Badness with the kernel version 2.6.35-rc1-git1 running on P6 box
From: divya @ 2010-07-20  9:05 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: LKML, linuxppc-dev, sachinp, benh, netdev, David Miller,
	Jan-Bernd Themann
In-Reply-To: <1279274185.2549.14.camel@edumazet-laptop>

On Friday 16 July 2010 03:26 PM, Eric Dumazet wrote:
> On Friday, 16 July 2010 at 14:20 +0530, divya wrote:
>    
>> Hi ,
>>
>> With the latest kernel version 2.6.35-rc5-git1 (2f7989efd4398) running on a Power (P6) box, I came across the following
>> call trace
>>
>> Call Trace:
>> [c000000006a0e800] [c000000000011c30] .show_stack+0x6c/0x16c (unreliable)
>> [c000000006a0e8b0] [c00000000012129c] .__alloc_pages_nodemask+0x6a0/0x75c
>> [c000000006a0ea30] [c0000000001527cc] .alloc_pages_current+0xc4/0x104
>> [c000000006a0ead0] [c00000000015b1a0] .new_slab+0xe0/0x314
>> [c000000006a0eb70] [c00000000015b6fc] .__slab_alloc+0x328/0x644
>> [c000000006a0ec50] [c00000000015cc34] .__kmalloc_node_track_caller+0x114/0x194
>> [c000000006a0ed00] [c000000000599f6c] .__alloc_skb+0x94/0x180
>> [c000000006a0edb0] [c00000000059af5c] .__netdev_alloc_skb+0x3c/0x74
>> [c000000006a0ee30] [c0000000004f9480] .ehea_refill_rq_def+0xf8/0x2d0
>> [c000000006a0ef30] [c0000000004fab8c] .ehea_up+0x5b8/0x69c
>> [c000000006a0f040] [c0000000004facd4] .ehea_open+0x64/0x118
>> [c000000006a0f0e0] [c0000000005a6e9c] .__dev_open+0x100/0x168
>> [c000000006a0f170] [c0000000005a3ac0] .__dev_change_flags+0x10c/0x1ac
>> [c000000006a0f210] [c0000000005a6d44] .dev_change_flags+0x24/0x7c
>> [c000000006a0f2a0] [c0000000005b50b4] .do_setlink+0x31c/0x750
>> [c000000006a0f3b0] [c0000000005b6724] .rtnl_newlink+0x388/0x618
>> [c000000006a0f5f0] [c0000000005b6350] .rtnetlink_rcv_msg+0x268/0x2b4
>> [c000000006a0f6a0] [c0000000005cfdc0] .netlink_rcv_skb+0x74/0x108
>> [c000000006a0f730] [c0000000005b60c4] .rtnetlink_rcv+0x38/0x5c
>> [c000000006a0f7c0] [c0000000005cf8c8] .netlink_unicast+0x318/0x3f4
>> [c000000006a0f890] [c0000000005d05b4] .netlink_sendmsg+0x2d0/0x310
>> [c000000006a0f970] [c00000000058e1e8] .sock_sendmsg+0xd4/0x110
>> [c000000006a0fb50] [c00000000058e514] .SyS_sendmsg+0x1f4/0x288
>> [c000000006a0fd70] [c00000000058c2b8] .SyS_socketcall+0x214/0x280
>> [c000000006a0fe30] [c0000000000085b4] syscall_exit+0x0/0x40
>> Mem-Info:
>> Node 0 DMA per-cpu:
>> CPU    0: hi:    0, btch:   1 usd:   0
>> CPU    1: hi:    0, btch:   1 usd:   0
>> CPU    2: hi:    0, btch:   1 usd:   0
>> CPU    3: hi:    0, btch:   1 usd:   0
>> active_anon:50 inactive_anon:260 isolated_anon:0
>>    active_file:159 inactive_file:139 isolated_file:0
>>    unevictable:0 dirty:2 writeback:1 unstable:0
>>    free:16 slab_reclaimable:66 slab_unreclaimable:502
>>    mapped:120 shmem:2 pagetables:37 bounce:0
>> Node 0 DMA free:1024kB min:1408kB low:1728kB high:2112kB active_anon:3200kB inactive_anon:16640kB active_file:10176kB inactive_file:8896kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:130944kB mlocked:0kB dirty:128kB writeback:64kB mapped:7680kB shmem:128kB slab_reclaimable:4224kB slab_unreclaimable:32128kB kernel_stack:2528kB pagetables:2368kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
>> lowmem_reserve[]: 0 0 0
>> Node 0 DMA: 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB = 0kB
>> 496 total pagecache pages
>> 178 pages in swap cache
>> Swap cache stats: add 780, delete 602, find 467/551
>> Free swap  = 1027904kB
>> Total swap = 1044160kB
>> 2048 pages RAM
>> 683 pages reserved
>> 582 pages shared
>> 1075 pages non-shared
>> SLUB: Unable to allocate memory on node -1 (gfp=0x20)
>>     cache: kmalloc-16384, object size: 16384, buffer size: 16384, default order: 2, min order: 0
>>     node 0: slabs: 28, objs: 292, free: 0
>> ip: page allocation failure. order:0, mode:0x8020
>> Call Trace:
>> [c000000006a0eb40] [c000000000011c30] .show_stack+0x6c/0x16c (unreliable)
>> [c000000006a0ebf0] [c00000000012129c] .__alloc_pages_nodemask+0x6a0/0x75c
>> [c000000006a0ed70] [c0000000001527cc] .alloc_pages_current+0xc4/0x104
>> [c000000006a0ee10] [c00000000011fca4] .__get_free_pages+0x18/0x90
>> [c000000006a0ee90] [c0000000004f7058] .ehea_get_stats+0x4c/0x1bc
>> [c000000006a0ef30] [c0000000005a0a04] .dev_get_stats+0x38/0x64
>> [c000000006a0efc0] [c0000000005b456c] .rtnl_fill_ifinfo+0x35c/0x85c
>> [c000000006a0f150] [c0000000005b5920] .rtmsg_ifinfo+0x164/0x204
>> [c000000006a0f210] [c0000000005a6d6c] .dev_change_flags+0x4c/0x7c
>> [c000000006a0f2a0] [c0000000005b50b4] .do_setlink+0x31c/0x750
>> [c000000006a0f3b0] [c0000000005b6724] .rtnl_newlink+0x388/0x618
>> [c000000006a0f5f0] [c0000000005b6350] .rtnetlink_rcv_msg+0x268/0x2b4
>> [c000000006a0f6a0] [c0000000005cfdc0] .netlink_rcv_skb+0x74/0x108
>> [c000000006a0f730] [c0000000005b60c4] .rtnetlink_rcv+0x38/0x5c
>> [c000000006a0f7c0] [c0000000005cf8c8] .netlink_unicast+0x318/0x3f4
>> [c000000006a0f890] [c0000000005d05b4] .netlink_sendmsg+0x2d0/0x310
>> [c000000006a0f970] [c00000000058e1e8] .sock_sendmsg+0xd4/0x110
>> [c000000006a0fb50] [c00000000058e514] .SyS_sendmsg+0x1f4/0x288
>> [c000000006a0fd70] [c00000000058c2b8] .SyS_socketcall+0x214/0x280
>> [c000000006a0fe30] [c0000000000085b4] syscall_exit+0x0/0x40
>> Mem-Info:
>> Node 0 DMA per-cpu:
>> CPU    0: hi:    0, btch:   1 usd:   0
>> CPU    1: hi:    0, btch:   1 usd:   0
>> CPU    2: hi:    0, btch:   1 usd:   0
>> CPU    3: hi:    0, btch:   1 usd:   0
>>
>> The mainline 2.6.35-rc5 worked fine.
>>      
> Maybe you were lucky with 2.6.35-rc5
>
> Anyway ehea should not use GFP_ATOMIC in its ehea_get_stats() method,
> called in process context, but GFP_KERNEL.
>
> Another patch is needed for ehea_refill_rq_def() as well.
>
>
>
> [PATCH] ehea: ehea_get_stats() should use GFP_KERNEL
>
> ehea_get_stats() is called in process context and should use GFP_KERNEL
> allocation instead of GFP_ATOMIC.
>
> Clearing stats at beginning of ehea_get_stats() is racy in case of
> concurrent stat readers.
>
> get_stats() can also use netdev net_device_stats, instead of a private
> copy.
>
> Reported-by: divya<dipraksh@linux.vnet.ibm.com>
> Signed-off-by: Eric Dumazet<eric.dumazet@gmail.com>
> ---
>   drivers/net/ehea/ehea.h      |    1 -
>   drivers/net/ehea/ehea_main.c |    6 ++----
>   2 files changed, 2 insertions(+), 5 deletions(-)
>    
Hi,

The call trace mentioned above still appears on the upstream kernel and on the linux-next tree too.
The patch above has not been merged upstream yet, hence I am getting call traces for both the ehea_get_stats()
and ehea_refill_rq_def() methods.
However, on linux-next I get a call trace only for the ehea_refill_rq_def() method.

Thanks
Divya

^ permalink raw reply

* Re: recv(2), MSG_TRUNK and kernels older than 2.6.22
From: Roy Marples @ 2010-07-20  9:08 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1279616099.2498.9.camel@edumazet-laptop>

On 20/07/2010 09:54, Eric Dumazet wrote:
> Is this for the dhcpcd problem we talked about a few weeks ago, caused by
> the new 64-bit stats?

Yes

>
> Why do you want a fixed size of 256 bytes?
>
> Using 8192 bytes on the stack would avoid the MSG_TRUNC mess.

Yes it would, but that doesn't answer my question :)
I would like to use a buffer big enough, but not a whole 8k in size.
dhcpcd has quite a small runtime and I'd like to keep it that way.

Thanks

Roy

^ permalink raw reply

* Re: recv(2), MSG_TRUNK and kernels older than 2.6.22
From: Eric Dumazet @ 2010-07-20  9:24 UTC (permalink / raw)
  To: Roy Marples; +Cc: netdev
In-Reply-To: <4C45678A.4080408@marples.name>

On Tuesday, 20 July 2010 at 10:08 +0100, Roy Marples wrote:
> On 20/07/2010 09:54, Eric Dumazet wrote:
> > Is this for the dhcpcd problem we talked about a few weeks ago, caused by
> > the new 64-bit stats?
> 
> Yes
> 
> >
> > Why do you want a fixed size of 256 bytes?
> >
> > Using 8192 bytes on the stack would avoid the MSG_TRUNC mess.
> 
> Yes it would, but that doesn't answer my question :)

Your question might be wrong ? :=)

> I would like to use a buffer big enough, but not a whole 8k in size.
> dhcpcd has quite a small runtime and I'd like to keep it that way.

Is 8192 bytes on the stack too much for you?

Then you should automatically resize your buffer, and not use
MSG_TRUNC at all (there is no guarantee the information you need will be
in the part that fits in your buffer)
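
One way to resize automatically can be sketched as follows. This is a
minimal sketch under stated assumptions, not dhcpcd code: recv_whole()
is a hypothetical helper, and the starting size of 256 bytes and the
doubling policy are arbitrary choices. It does not pass MSG_TRUNC as an
input flag to recv(2). Instead it peeks at the datagram with recvmsg(2)
and looks at the MSG_TRUNC bit that the kernel reports in msg_flags
when the datagram did not fit.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

/*
 * Receive one whole datagram from fd into a heap buffer that grows as
 * needed. *bufp may start as NULL with *lenp == 0. Returns the datagram
 * length, or -1 with errno set by recvmsg(), recv() or realloc().
 * The caller frees *bufp. Blocks until a datagram is available.
 */
static ssize_t recv_whole(int fd, char **bufp, size_t *lenp)
{
    assert(bufp != NULL && lenp != NULL);

    for (;;) {
        size_t newlen;
        char *p;

        if (*bufp != NULL && *lenp > 0) {
            struct iovec iov = { .iov_base = *bufp, .iov_len = *lenp };
            struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };

            /* Peek so the datagram stays queued if it did not fit. */
            if (recvmsg(fd, &msg, MSG_PEEK) == -1)
                return -1;
            if (!(msg.msg_flags & MSG_TRUNC))
                return recv(fd, *bufp, *lenp, 0);   /* now consume it */
        }

        /* No buffer yet, or the datagram was truncated: grow and retry. */
        newlen = *lenp ? *lenp * 2 : 256;
        p = realloc(*bufp, newlen);
        if (p == NULL)
            return -1;
        *bufp = p;
        *lenp = newlen;
    }
}
```

The buffer only ever grows to the size the largest datagram seen so far
requires, so a process that normally gets small messages keeps a small
footprint. A real caller would also want to handle EINTR and EAGAIN as
in the loop quoted earlier in this thread.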




^ permalink raw reply

* [RFC PATCH] dst: check if dst is freed in dst_check()
From: Nicolas Dichtel @ 2010-07-20  9:49 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 1216 bytes --]

Hi,

I probably missed something, but I cannot find where the obsolete field is checked 
when dst_check() is called. If dst->obsolete is > 1, the dst cannot be used!

Attached is a proposal to fix this issue.


Regards,

-- 
Nicolas DICHTEL
6WIND
R&D Engineer

Tel: +33 1 39 30 92 10
Fax: +33 1 39 30 92 11
nicolas.dichtel@6wind.com
www.6wind.com
Join the Multicore Packet Processing Forum: www.multicorepacketprocessing.com

This e-mail message, including any attachments, is for the sole use of the 
intended recipient(s) and contains information that is confidential and 
proprietary to 6WIND. All unauthorized review, use, disclosure or distribution 
is prohibited. If you are not the intended recipient, please contact the sender 
by reply e-mail and destroy all copies of the original message.

[-- Attachment #2: 0001-dst-check-if-dst-is-freed-in-dst_check.patch --]
[-- Type: text/x-diff, Size: 772 bytes --]

From 69990a516f4b5b48608b0ea283dfac6f1fa110b3 Mon Sep 17 00:00:00 2001
From: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Date: Tue, 20 Jul 2010 11:35:53 +0200
Subject: [PATCH] dst: check if dst is freed in dst_check()

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 include/net/dst.h |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/include/net/dst.h b/include/net/dst.h
index 81d1413..7bf4f9a 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -319,6 +319,8 @@ static inline int dst_input(struct sk_buff *skb)
 
 static inline struct dst_entry *dst_check(struct dst_entry *dst, u32 cookie)
 {
+	if (dst->obsolete > 1)
+		return NULL;
 	if (dst->obsolete)
 		dst = dst->ops->check(dst, cookie);
 	return dst;
-- 
1.5.4.5


^ permalink raw reply related

* [RFC v2] netfilter: xt_condition: add condition target
From: Luciano Coelho @ 2010-07-20  9:50 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, kaber, jengelh, sameo

This patch adds a condition target to the xt_condition module,
which lets the user write netfilter rules that enable/disable the
variables used by the condition match.  Originally, the condition
match only allowed the variable to be changed via procfs.  This new
target makes it easy to enable or disable the condition depending on
the rules set.
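
As a rough illustration of how the match and the target share one variable, here is a userspace model. It is a sketch only: all names (`cond_var`, `cond_get`, `cond_put`, `cond_target`, `cond_match`) are made up for this example, and it leaves out the procfs entry and the `proc_lock` mutex that the kernel code uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define COND_NAME_MAX 31	/* same size as XT_CONDITION_MAX_NAME_SIZE */

/* Userspace stand-in for the kernel's condition variable (hypothetical). */
struct cond_var {
	struct cond_var *next;
	char name[COND_NAME_MAX];
	unsigned int refcount;
	bool enabled;
};

static struct cond_var *cond_list;

/* Find a variable by name and take a reference, or create it. */
static struct cond_var *cond_get(const char *name)
{
	struct cond_var *var;

	for (var = cond_list; var != NULL; var = var->next) {
		if (strcmp(var->name, name) == 0) {
			var->refcount++;
			return var;
		}
	}
	var = calloc(1, sizeof(*var));
	if (var == NULL)
		return NULL;
	/* calloc() zeroed the buffer, so the copy stays NUL-terminated. */
	strncpy(var->name, name, sizeof(var->name) - 1);
	var->refcount = 1;
	var->next = cond_list;
	cond_list = var;
	return var;
}

/* Drop a reference; unlink and free the variable when the last user goes. */
static void cond_put(struct cond_var *var)
{
	struct cond_var **pp;

	if (--var->refcount != 0)
		return;
	for (pp = &cond_list; *pp != NULL; pp = &(*pp)->next) {
		if (*pp == var) {
			*pp = var->next;
			break;
		}
	}
	free(var);
}

/* What the target does per packet: store the configured value. */
static void cond_target(struct cond_var *var, bool enabled)
{
	var->enabled = enabled;
}

/* What the match does per packet: read the value, optionally inverted. */
static bool cond_match(const struct cond_var *var, bool invert)
{
	return var->enabled ^ invert;
}
```

A match rule and a target rule that name the same variable end up holding the same object, so a packet hitting the target changes what later packets see in the match.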

Signed-off-by: Luciano Coelho <luciano.coelho@nokia.com>
---
 include/linux/netfilter/xt_condition.h |   12 ++-
 net/netfilter/Kconfig                  |   19 ++--
 net/netfilter/Makefile                 |    2 +-
 net/netfilter/xt_condition.c           |  179 +++++++++++++++++++++++---------
 4 files changed, 153 insertions(+), 59 deletions(-)

diff --git a/include/linux/netfilter/xt_condition.h b/include/linux/netfilter/xt_condition.h
index 4faf3ca..c9e72c2 100644
--- a/include/linux/netfilter/xt_condition.h
+++ b/include/linux/netfilter/xt_condition.h
@@ -3,12 +3,22 @@
 
 #include <linux/types.h>
 
+#define XT_CONDITION_MAX_NAME_SIZE 31
+
 struct xt_condition_mtinfo {
-	char name[31];
+	char name[XT_CONDITION_MAX_NAME_SIZE];
 	__u8 invert;
 
 	/* Used internally by the kernel */
 	void *condvar __attribute__((aligned(8)));
 };
 
+struct condition_tg_info {
+	char name[XT_CONDITION_MAX_NAME_SIZE];
+	__u8 enabled;
+
+	/* Used internally by the kernel */
+	void *condvar __attribute__((aligned(8)));
+};
+
 #endif /* _XT_CONDITION_H */
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index e54e216..adaa3b4 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -310,6 +310,17 @@ config NETFILTER_XT_MARK
 	"Use netfilter MARK value as routing key") and can also be used by
 	other subsystems to change their behavior.
 
+config NETFILTER_XT_CONDITION
+	tristate '"condition" match and target support'
+	depends on NETFILTER_ADVANCED
+	depends on PROC_FS
+	---help---
+	This option adds the "CONDITION" target and "condition" match.
+
+	It allows you to match rules against condition variables
+	stored in the /proc/net/nf_condition directory. It also allows
+	you to set the variables using the target.
+
 config NETFILTER_XT_CONNMARK
 	tristate 'ctmark target and match support'
 	depends on NF_CONNTRACK
@@ -621,14 +632,6 @@ config NETFILTER_XT_MATCH_COMMENT
 	  If you want to compile it as a module, say M here and read
 	  <file:Documentation/kbuild/modules.txt>.  If unsure, say `N'.
 
-config NETFILTER_XT_MATCH_CONDITION
-	tristate '"condition" match support'
-	depends on NETFILTER_ADVANCED
-	depends on PROC_FS
-	---help---
-	This option allows you to match firewall rules against condition
-	variables stored in the /proc/net/nf_condition directory.
-
 config NETFILTER_XT_MATCH_CONNBYTES
 	tristate  '"connbytes" per-connection counter match support'
 	depends on NF_CONNTRACK
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index 474dd06..ee34f6c 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -43,6 +43,7 @@ obj-$(CONFIG_NETFILTER_XTABLES) += x_tables.o xt_tcpudp.o
 # combos
 obj-$(CONFIG_NETFILTER_XT_MARK) += xt_mark.o
 obj-$(CONFIG_NETFILTER_XT_CONNMARK) += xt_connmark.o
+obj-$(CONFIG_NETFILTER_XT_CONDITION) += xt_condition.o
 
 # targets
 obj-$(CONFIG_NETFILTER_XT_TARGET_CLASSIFY) += xt_CLASSIFY.o
@@ -66,7 +67,6 @@ obj-$(CONFIG_NETFILTER_XT_TARGET_IDLETIMER) += xt_IDLETIMER.o
 # matches
 obj-$(CONFIG_NETFILTER_XT_MATCH_CLUSTER) += xt_cluster.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_COMMENT) += xt_comment.o
-obj-$(CONFIG_NETFILTER_XT_MATCH_CONDITION) += xt_condition.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_CONNBYTES) += xt_connbytes.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_CONNLIMIT) += xt_connlimit.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_CONNTRACK) += xt_conntrack.o
diff --git a/net/netfilter/xt_condition.c b/net/netfilter/xt_condition.c
index 9af0257..dbd762c 100644
--- a/net/netfilter/xt_condition.c
+++ b/net/netfilter/xt_condition.c
@@ -2,11 +2,13 @@
  *	"condition" match extension for Xtables
  *
  *	Description: This module allows firewall rules to match using
- *	condition variables available through procfs.
+ *	condition variables available through procfs.  It also allows
+ *	target rules to set the condition variable.
  *
  *	Authors:
  *	Stephane Ouellette <ouellettes@videotron.ca>, 2002-10-22
  *	Massimiliano Hofer <max@nucleus.it>, 2006-05-15
+ *	Luciano Coelho <luciano.coelho@nokia.com>, 2010-07-20
  *
  *	This program is free software; you can redistribute it and/or modify it
  *	under the terms of the GNU General Public License; either version 2
@@ -32,7 +34,8 @@ static unsigned int condition_gid_perms;
 MODULE_AUTHOR("Stephane Ouellette <ouellettes@videotron.ca>");
 MODULE_AUTHOR("Massimiliano Hofer <max@nucleus.it>");
 MODULE_AUTHOR("Jan Engelhardt <jengelh@medozas.de>");
-MODULE_DESCRIPTION("Allows rules to match against condition variables");
+MODULE_AUTHOR("Luciano Coelho <luciano.coelho@nokia.com>");
+MODULE_DESCRIPTION("Allows rules to set and match condition variables");
 MODULE_LICENSE("GPL");
 module_param(condition_list_perms, uint, S_IRUSR | S_IWUSR);
 MODULE_PARM_DESC(condition_list_perms, "default permissions on /proc/net/nf_condition/* files");
@@ -91,56 +94,34 @@ static int condition_proc_write(struct file *file, const char __user *buffer,
 	return length;
 }
 
-static bool
-condition_mt(const struct sk_buff *skb, struct xt_action_param *par)
+static struct condition_variable *xt_condition_insert(const char *name)
 {
-	const struct xt_condition_mtinfo *info = par->matchinfo;
-	const struct condition_variable *var   = info->condvar;
-
-	return var->enabled ^ info->invert;
-}
-
-static int condition_mt_check(const struct xt_mtchk_param *par)
-{
-	struct xt_condition_mtinfo *info = par->matchinfo;
 	struct condition_variable *var;
 
-	/* Forbid certain names */
-	if (*info->name == '\0' || *info->name == '.' ||
-	    info->name[sizeof(info->name)-1] != '\0' ||
-	    memchr(info->name, '/', sizeof(info->name)) != NULL) {
-		pr_info("name not allowed or too long: \"%.*s\"\n",
-			(unsigned int)sizeof(info->name), info->name);
-		return -EINVAL;
-	}
 	/*
 	 * Let's acquire the lock, check for the condition and add it
 	 * or increase the reference counter.
 	 */
 	mutex_lock(&proc_lock);
 	list_for_each_entry(var, &conditions_list, list) {
-		if (strcmp(info->name, var->status_proc->name) == 0) {
+		if (strcmp(name, var->status_proc->name) == 0) {
 			++var->refcount;
-			mutex_unlock(&proc_lock);
-			info->condvar = var;
-			return 0;
+			goto out;
 		}
 	}
 
 	/* At this point, we need to allocate a new condition variable. */
 	var = kmalloc(sizeof(struct condition_variable), GFP_KERNEL);
-	if (var == NULL) {
-		mutex_unlock(&proc_lock);
-		return -ENOMEM;
-	}
+	if (var == NULL)
+		goto out;
 
 	/* Create the condition variable's proc file entry. */
-	var->status_proc = create_proc_entry(info->name, condition_list_perms,
+	var->status_proc = create_proc_entry(name, condition_list_perms,
 			   proc_net_condition);
 	if (var->status_proc == NULL) {
 		kfree(var);
-		mutex_unlock(&proc_lock);
-		return -ENOMEM;
+		var = NULL;
+		goto out;
 	}
 
 	var->refcount = 1;
@@ -151,16 +132,13 @@ static int condition_mt_check(const struct xt_mtchk_param *par)
 	var->status_proc->uid        = condition_uid_perms;
 	var->status_proc->gid        = condition_gid_perms;
 	list_add(&var->list, &conditions_list);
+out:
 	mutex_unlock(&proc_lock);
-	info->condvar = var;
-	return 0;
+	return var;
 }
 
-static void condition_mt_destroy(const struct xt_mtdtor_param *par)
+static void xt_condition_put(struct condition_variable *var)
 {
-	const struct xt_condition_mtinfo *info = par->matchinfo;
-	struct condition_variable *var = info->condvar;
-
 	mutex_lock(&proc_lock);
 	if (--var->refcount == 0) {
 		list_del(&var->list);
@@ -172,6 +150,101 @@ static void condition_mt_destroy(const struct xt_mtdtor_param *par)
 	mutex_unlock(&proc_lock);
 }
 
+static bool
+condition_mt(const struct sk_buff *skb, struct xt_action_param *par)
+{
+	const struct xt_condition_mtinfo *info = par->matchinfo;
+	const struct condition_variable *var   = info->condvar;
+
+	return var->enabled ^ info->invert;
+}
+
+static int condition_mt_check(const struct xt_mtchk_param *par)
+{
+	struct xt_condition_mtinfo *info = par->matchinfo;
+	struct condition_variable *var;
+
+	/* Forbid certain names */
+	if (*info->name == '\0' || *info->name == '.' ||
+	    info->name[sizeof(info->name)-1] != '\0' ||
+	    memchr(info->name, '/', sizeof(info->name)) != NULL) {
+		pr_info("name not allowed or too long: \"%.*s\"\n",
+			(unsigned int)sizeof(info->name), info->name);
+		return -EINVAL;
+	}
+
+	var = xt_condition_insert(info->name);
+	if (var == NULL)
+		return -ENOMEM;
+
+	info->condvar = var;
+	return 0;
+}
+
+static void condition_mt_destroy(const struct xt_mtdtor_param *par)
+{
+	const struct xt_condition_mtinfo *info = par->matchinfo;
+
+	xt_condition_put(info->condvar);
+}
+
+static unsigned int condition_tg_target(struct sk_buff *skb,
+					 const struct xt_action_param *par)
+{
+	const struct condition_tg_info *info = par->targinfo;
+	struct condition_variable *var = info->condvar;
+
+	pr_debug("setting condition %s, enabled %d\n",
+		 info->name, info->enabled);
+
+	var->enabled = info->enabled;
+
+	return XT_CONTINUE;
+}
+
+static int condition_tg_checkentry(const struct xt_tgchk_param *par)
+{
+	struct condition_tg_info *info = par->targinfo;
+	struct condition_variable *var;
+
+	pr_debug("checkentry %s\n", info->name);
+
+	/* Forbid certain names */
+	if (*info->name == '\0' || *info->name == '.' ||
+	    info->name[sizeof(info->name)-1] != '\0' ||
+	    memchr(info->name, '/', sizeof(info->name)) != NULL) {
+		pr_info("name not allowed or too long: \"%.*s\"\n",
+			(unsigned int)sizeof(info->name), info->name);
+		return -EINVAL;
+	}
+
+	var = xt_condition_insert(info->name);
+	if (var == NULL)
+		return -ENOMEM;
+
+	info->condvar = var;
+	return 0;
+}
+
+static void condition_tg_destroy(const struct xt_tgdtor_param *par)
+{
+	const struct condition_tg_info *info = par->targinfo;
+
+	pr_debug("destroy %s\n", info->name);
+
+	xt_condition_put(info->condvar);
+}
+
+static struct xt_target condition_tg_reg __read_mostly = {
+	.name		= "CONDITION",
+	.family		= NFPROTO_UNSPEC,
+	.target		= condition_tg_target,
+	.targetsize     = sizeof(struct condition_tg_info),
+	.checkentry	= condition_tg_checkentry,
+	.destroy        = condition_tg_destroy,
+	.me		= THIS_MODULE,
+};
+
 static struct xt_match condition_mt_reg __read_mostly = {
 	.name       = "condition",
 	.revision   = 1,
@@ -185,24 +258,24 @@ static struct xt_match condition_mt_reg __read_mostly = {
 
 static const char *const dir_name = "nf_condition";
 
-static int __net_init condnet_mt_init(struct net *net)
+static int __net_init condnet_init(struct net *net)
 {
 	proc_net_condition = proc_mkdir(dir_name, net->proc_net);
 
 	return (proc_net_condition == NULL) ? -EACCES : 0;
 }
 
-static void __net_exit condnet_mt_exit(struct net *net)
+static void __net_exit condnet_exit(struct net *net)
 {
 	remove_proc_entry(dir_name, net->proc_net);
 }
 
-static struct pernet_operations condition_mt_netops = {
-	.init = condnet_mt_init,
-	.exit = condnet_mt_exit,
+static struct pernet_operations condition_netops = {
+	.init = condnet_init,
+	.exit = condnet_exit,
 };
 
-static int __init condition_mt_init(void)
+static int __init condition_init(void)
 {
 	int ret;
 
@@ -211,8 +284,15 @@ static int __init condition_mt_init(void)
 	if (ret < 0)
 		return ret;
 
-	ret = register_pernet_subsys(&condition_mt_netops);
+	ret =  xt_register_target(&condition_tg_reg);
+	if (ret < 0) {
+		xt_unregister_match(&condition_mt_reg);
+		return ret;
+	}
+
+	ret = register_pernet_subsys(&condition_netops);
 	if (ret < 0) {
+		xt_unregister_target(&condition_tg_reg);
 		xt_unregister_match(&condition_mt_reg);
 		return ret;
 	}
@@ -220,11 +300,12 @@ static int __init condition_mt_init(void)
 	return 0;
 }
 
-static void __exit condition_mt_exit(void)
+static void __exit condition_exit(void)
 {
-	unregister_pernet_subsys(&condition_mt_netops);
+	unregister_pernet_subsys(&condition_netops);
+	xt_unregister_target(&condition_tg_reg);
 	xt_unregister_match(&condition_mt_reg);
 }
 
-module_init(condition_mt_init);
-module_exit(condition_mt_exit);
+module_init(condition_init);
+module_exit(condition_exit);
-- 
1.7.0.4


^ permalink raw reply related

* [PATCH] __dst_free(): put EXPORT_SYMBOLS after the fct
From: Nicolas Dichtel @ 2010-07-20  9:51 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 1144 bytes --]

Hi,

this patch is purely cosmetic: EXPORT_SYMBOL(__dst_free) was placed after the
___dst_free() function instead of after __dst_free(), so move it.


Regards,

-- 
Nicolas DICHTEL
6WIND
R&D Engineer

Tel: +33 1 39 30 92 10
Fax: +33 1 39 30 92 11
nicolas.dichtel@6wind.com
www.6wind.com
Join the Multicore Packet Processing Forum: www.multicorepacketprocessing.com



[-- Attachment #2: 0002-__dst_free-put-EXPORT_SYMBOLS-after-the-fct.patch --]
[-- Type: text/x-diff, Size: 893 bytes --]

From f96851b4d7e6125d8c0e5f4da7b6fffffd21b642 Mon Sep 17 00:00:00 2001
From: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Date: Tue, 20 Jul 2010 11:38:20 +0200
Subject: [PATCH] __dst_free(): put EXPORT_SYMBOLS after the fct

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 net/core/dst.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/core/dst.c b/net/core/dst.c
index 9920722..6c41b1f 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -197,7 +197,6 @@ static void ___dst_free(struct dst_entry *dst)
 		dst->input = dst->output = dst_discard;
 	dst->obsolete = 2;
 }
-EXPORT_SYMBOL(__dst_free);
 
 void __dst_free(struct dst_entry *dst)
 {
@@ -213,6 +212,7 @@ void __dst_free(struct dst_entry *dst)
 	}
 	spin_unlock_bh(&dst_garbage.lock);
 }
+EXPORT_SYMBOL(__dst_free);
 
 struct dst_entry *dst_destroy(struct dst_entry * dst)
 {
-- 
1.5.4.5


^ permalink raw reply related

* Re: recv(2), MSG_TRUNK and kernels older than 2.6.22
From: Eric Dumazet @ 2010-07-20 10:02 UTC (permalink / raw)
  To: Roy Marples; +Cc: netdev
In-Reply-To: <1279617873.2498.13.camel@edumazet-laptop>

On Tuesday 20 July 2010 at 11:24 +0200, Eric Dumazet wrote:
> On Tuesday 20 July 2010 at 10:08 +0100, Roy Marples wrote:
> > On 20/07/2010 09:54, Eric Dumazet wrote:
> > > Is it for the dhcpcd problem we talk about few week ago, disturbed by
> > > new 64bit stats ?
> > 
> > Yes
> > 
> > >
> > > Why do you want to have a fixed size of 256 bytes ?
> > >
> > > Using 8192 bytes on stack would avoid MSG_TRUNK mess.
> > 
> > Yes it would, but that doesn't answer my question :)
> 
> Your question might be wrong ? :=)
> 
> > I would like to use a buffer big enough, but not a whole 8k in size.
> > dhcpcd has quite a small runtime and I'd like to keep it that way.
> 
> 8192 bytes on stack is too much for you ?
> 
> Then you should automatically resize your buffer, and not using
> MSG_TRUNK at all (there is no guarantee the information you need will be
> part of the truncated part)
> 
> 

On kernels older than 2.6.22, recv() returns the length of your buffer
even with MSG_TRUNC, not the size of the netlink frame.

You'll need something like :

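size_t sz = 256;
ssize_t len;
char *buf = malloc(sz), *nbuf;

while (1) {
	if (!buf) error();
	len = recv(fd, buf, sz, MSG_PEEK | MSG_TRUNC);
	if (len < 0) error(); // recv() failed, do not treat -1 as a size
	if ((size_t)len < sz)
		break; // the whole frame fits
	if ((size_t)len == sz)
		sz *= 2; // old kernel, try to double size
	else
		sz = len; // recent kernel is nice with us
	nbuf = realloc(buf, sz);
	if (!nbuf)
		free(buf); // realloc() failure leaves the old block allocated
	buf = nbuf;
}
len = recv(fd, buf, sz, 0);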
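
For reference, the same peek-and-grow idea wrapped up as a self-contained helper might look like the sketch below. `recv_whole()` is a hypothetical name, the 256-byte starting size is just the value from this thread, and error handling is reduced to returning NULL. Unlike the loop above, it sizes the buffer to `len + 1` when the kernel reports the real length, which saves one extra round trip.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Read one whole datagram from fd into a heap buffer that grows as needed.
 * Returns the buffer (caller frees) and stores the length in *lenp,
 * or returns NULL on error. recv_whole() is a hypothetical helper name.
 */
static char *recv_whole(int fd, size_t *lenp)
{
	size_t sz = 256;
	char *buf = malloc(sz);
	char *nbuf;
	ssize_t len;

	while (buf != NULL) {
		/* Peek so the datagram stays queued while we size the buffer. */
		len = recv(fd, buf, sz, MSG_PEEK | MSG_TRUNC);
		if (len < 0)
			goto fail;
		if ((size_t)len < sz)
			break;			/* the whole datagram fits */
		if ((size_t)len == sz)
			sz *= 2;		/* may be truncated: grow and retry */
		else
			sz = (size_t)len + 1;	/* kernel reported the real size */
		nbuf = realloc(buf, sz);
		if (nbuf == NULL)
			goto fail;
		buf = nbuf;
	}
	if (buf == NULL)
		return NULL;
	len = recv(fd, buf, sz, 0);		/* now consume it for real */
	if (len < 0)
		goto fail;
	*lenp = (size_t)len;
	return buf;
fail:
	free(buf);
	return NULL;
}
```

The doubling branch is what keeps this working on kernels that only ever return the buffer length; on newer kernels the loop usually finishes after a single resize.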





^ permalink raw reply

* Re: recv(2), MSG_TRUNK and kernels older than 2.6.22
From: Roy Marples @ 2010-07-20 10:04 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1279620171.2498.30.camel@edumazet-laptop>

On 20/07/2010 11:02, Eric Dumazet wrote:
> On Tuesday 20 July 2010 at 11:24 +0200, Eric Dumazet wrote:
>> On Tuesday 20 July 2010 at 10:08 +0100, Roy Marples wrote:
>>> On 20/07/2010 09:54, Eric Dumazet wrote:
>>>> Is it for the dhcpcd problem we talk about few week ago, disturbed by
>>>> new 64bit stats ?
>>>
>>> Yes
>>>
>>>>
>>>> Why do you want to have a fixed size of 256 bytes ?
>>>>
>>>> Using 8192 bytes on stack would avoid MSG_TRUNK mess.
>>>
>>> Yes it would, but that doesn't answer my question :)
>>
>> Your question might be wrong ? :=)
>>
>>> I would like to use a buffer big enough, but not a whole 8k in size.
>>> dhcpcd has quite a small runtime and I'd like to keep it that way.
>>
>> 8192 bytes on stack is too much for you ?
>>
>> Then you should automatically resize your buffer, and not using
>> MSG_TRUNK at all (there is no guarantee the information you need will be
>> part of the truncated part)
>>
>>
>
> On < 2.6.22 kernels, recv() returns the length of your buffer, not size
> of netlink frame.
>
> You'll need something like :
>
> size_t sz = 256;
> char *buf = malloc(sz);
> while (1) {
> 	if (!buf) error();
> 	len = recv(fd, buf, sz, MSG_PEEK | MSG_TRUNC);
> 	if (len < sz)
> 		break;
> 	if (len == sz)
> 		sz *= 2; // old kernel, try to double size
> 	else
> 		sz = len; // recent kernel is nice with us
> 	buf = realloc(buf, sz);
> }
> len = recv(fd, buf, sz, 0);

Thank you

If buf is NULL and sz is 0, would 0 still be returned? I'm guessing so.

Thanks

Roy

^ permalink raw reply

* Re: [RFC v2] netfilter: xt_condition: add condition target
From: Jan Engelhardt @ 2010-07-20 10:45 UTC (permalink / raw)
  To: Luciano Coelho; +Cc: netfilter-devel, netdev, kaber, sameo
In-Reply-To: <1279619434-11849-1-git-send-email-luciano.coelho@nokia.com>


On Tuesday 2010-07-20 11:50, Luciano Coelho wrote:
> struct xt_condition_mtinfo {
>-	char name[31];
>+	char name[XT_CONDITION_MAX_NAME_SIZE];
> 	__u8 invert;
> 
> 	/* Used internally by the kernel */
> 	void *condvar __attribute__((aligned(8)));
> };
> 
>+struct condition_tg_info {

In line with standardized naming, this should be xt_condition_tginfo.

>+	char name[XT_CONDITION_MAX_NAME_SIZE];
>+	__u8 enabled;

No u32 yet?

>+static struct xt_target condition_tg_reg __read_mostly = {
>+       .name           = "CONDITION",
>+       .family         = NFPROTO_UNSPEC,
>+       .target         = condition_tg_target,
>+       .targetsize     = sizeof(struct condition_tg_info),
>+       .checkentry     = condition_tg_checkentry,
>+       .destroy        = condition_tg_destroy,
>+       .me             = THIS_MODULE,
>+};
>+
> static struct xt_match condition_mt_reg __read_mostly = {
>        .name       = "condition",
>        .revision   = 1,

(I see that you just sent a diff from the previous submission. That
in itself is ok.) Since xt_condition is a new module to upstream but
already exists in Xtables-addons, it makes sense to use a
.revision number of 2 for the initial Linux kernel submission,
also because the struct contents are different from those currently
in Xt-a.

From an overall quick glance, it looks good!

^ permalink raw reply

* [PATCH net-next] sysfs: add entry to indicate network interfaces with random MAC address
From: Stefan Assmann @ 2010-07-20 10:50 UTC (permalink / raw)
  To: netdev
  Cc: Linux Kernel Mailing List, davem, Andy Gospodarek,
	Rose, Gregory V, Duyck, Alexander H, Ben Hutchings, Casey Leedom,
	Harald Hoyer

From: Stefan Assmann <sassmann@redhat.com>

Reserve a bit in struct net_device to indicate whether an interface
generates its MAC address randomly, and expose the information via
sysfs.
The entry may look like this:
/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/net/eth0/ifrndmac

By default the value of ifrndmac is 0. Any driver that generates the MAC
address randomly should set the value to 1.

This simplifies the handling of network devices with random MAC addresses
as user-space may just query sysfs to find out if the MAC is real or fake.
Udev may check sysfs for devices that generate their MAC address
randomly and create persistent net rules by using the unique device path
if the value returned is 1.
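
On the user-space side, the query could be as simple as the sketch below. `read_flag_attr()` is a hypothetical helper, and the `ifrndmac` attribute itself only exists on kernels carrying this patch, so a missing file is reported separately from a value of 0.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read a sysfs-style attribute file holding a single decimal integer.
 * Returns 1 or 0 for the flag value, or -1 if the file is missing or
 * malformed (for example on kernels without the proposed attribute).
 */
static int read_flag_attr(const char *path)
{
	FILE *f = fopen(path, "r");
	int val;

	if (f == NULL)
		return -1;
	if (fscanf(f, "%d", &val) != 1)
		val = -1;		/* not a number */
	else
		val = (val != 0);	/* normalize to 0 or 1 */
	fclose(f);
	return val;
}
```

A caller would pass the attribute path of the interface it cares about, for example the `ifrndmac` file in that interface's sysfs directory, and fall back to its existing behaviour when the result is -1.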

Also introduce a helper function to assist driver adoption.

Signed-off-by: Stefan Assmann <sassmann@redhat.com>
---
 include/linux/etherdevice.h |   14 ++++++++++++++
 include/linux/netdevice.h   |    1 +
 net/core/net-sysfs.c        |   12 ++++++++++++
 3 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h
index 3d7a668..ebb34ac 100644
--- a/include/linux/etherdevice.h
+++ b/include/linux/etherdevice.h
@@ -127,6 +127,20 @@ static inline void random_ether_addr(u8 *addr)
 }

 /**
+ * dev_hw_addr_random - Create random MAC and set device flag
+ * @dev: pointer to net_device structure
+ * @hwaddr: Pointer to a six-byte array containing the Ethernet address
+ *
+ * Generate random MAC to be used by a device and set NETIF_F_RNDMAC flag
+ * so the state can be read by sysfs and be used by udev.
+ */
+static inline void dev_hw_addr_random(struct net_device *dev, u8 *hwaddr)
+{
+	dev->features |= NETIF_F_RNDMAC;
+	random_ether_addr(hwaddr);
+}
+
+/**
  * compare_ether_addr - Compare two Ethernet addresses
  * @addr1: Pointer to a six-byte array containing the Ethernet address
  * @addr2: Pointer other six-byte array containing the Ethernet address
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index b626289..2ea0298 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -845,6 +845,7 @@ struct net_device {
 #define NETIF_F_FCOE_MTU	(1 << 26) /* Supports max FCoE MTU, 2158 bytes*/
 #define NETIF_F_NTUPLE		(1 << 27) /* N-tuple filters supported */
 #define NETIF_F_RXHASH		(1 << 28) /* Receive hashing offload */
+#define NETIF_F_RNDMAC		(1 << 29) /* Interface with random MAC address */

 	/* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT	16
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index d2b5965..91a9808 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -188,6 +188,17 @@ static ssize_t show_dormant(struct device *dev,
 	return -EINVAL;
 }

+static ssize_t show_ifrndmac(struct device *dev,
+			     struct device_attribute *attr, char *buf)
+{
+	struct net_device *net = to_net_dev(dev);
+
+	if (net->features & NETIF_F_RNDMAC)
+		return sprintf(buf, fmt_dec, 1);
+	else
+		return sprintf(buf, fmt_dec, 0);
+}
+
 static const char *const operstates[] = {
 	"unknown",
 	"notpresent", /* currently unused */
@@ -300,6 +311,7 @@ static struct device_attribute net_class_attributes[] = {
 	__ATTR(ifalias, S_IRUGO | S_IWUSR, show_ifalias, store_ifalias),
 	__ATTR(iflink, S_IRUGO, show_iflink, NULL),
 	__ATTR(ifindex, S_IRUGO, show_ifindex, NULL),
+	__ATTR(ifrndmac, S_IRUGO, show_ifrndmac, NULL),
 	__ATTR(features, S_IRUGO, show_features, NULL),
 	__ATTR(type, S_IRUGO, show_type, NULL),
 	__ATTR(link_mode, S_IRUGO, show_link_mode, NULL),
-- 
1.6.5.2


^ permalink raw reply related

* Re: [RFC v2] netfilter: xt_condition: add condition target
From: Luciano Coelho @ 2010-07-20 11:04 UTC (permalink / raw)
  To: ext Jan Engelhardt
  Cc: netfilter-devel@vger.kernel.org, netdev@vger.kernel.org,
	kaber@trash.net, sameo@linux.intel.com
In-Reply-To: <alpine.LSU.2.01.1007201235250.20447@obet.zrqbmnf.qr>

On Tue, 2010-07-20 at 12:45 +0200, ext Jan Engelhardt wrote:
> On Tuesday 2010-07-20 11:50, Luciano Coelho wrote:
> > struct xt_condition_mtinfo {
> >-	char name[31];
> >+	char name[XT_CONDITION_MAX_NAME_SIZE];
> > 	__u8 invert;
> > 
> > 	/* Used internally by the kernel */
> > 	void *condvar __attribute__((aligned(8)));
> > };
> > 
> >+struct condition_tg_info {
> 
> In the line of standardized naming, xt_condition_tginfo.

Ack.


> >+	char name[XT_CONDITION_MAX_NAME_SIZE];
> >+	__u8 enabled;
> 
> No u32 yet?

Yes, I decided to do this in separate steps.  I'll be submitting a
new patch with the u32 (and the binary operators support) pretty soon.


> >+static struct xt_target condition_tg_reg __read_mostly = {
> >+       .name           = "CONDITION",
> >+       .family         = NFPROTO_UNSPEC,
> >+       .target         = condition_tg_target,
> >+       .targetsize     = sizeof(struct condition_tg_info),
> >+       .checkentry     = condition_tg_checkentry,
> >+       .destroy        = condition_tg_destroy,
> >+       .me             = THIS_MODULE,
> >+};
> >+
> > static struct xt_match condition_mt_reg __read_mostly = {
> >        .name       = "condition",
> >        .revision   = 1,
> 
> (I see that you just sent a diff from the previous submission. That
> in itself is ok.) Since xt_condition is a new module to upstream but
> already exists in Xtables-addons, it makes sense to use a
> .revision number of 2 for the initial Linux kernel submission,
> also because the struct contents are different from those currently
> in Xt-a.

Yes, I made this patch on top of the one you have sent earlier for
upstream inclusion.  There were some comments from Patrick to that one
and, as I said in my email yesterday, I'll rebase the target patches
once the original one is included upstream.

Do you want me to take a look at Patrick's comments and resubmit the
patch you've sent with the changes Patrick asked for?

I'll change the revision to 2 as well.


> From an overall quick glance, looks good!

Thanks for your review and help on this!

-- 
Cheers,
Luca.


^ permalink raw reply

* Re: [RFC PATCH v3 1/5] irq: add tracepoint to softirq_raise
From: Neil Horman @ 2010-07-20 11:04 UTC (permalink / raw)
  To: Koki Sanagi
  Cc: netdev, linux-kernel, davem, kaneshige.kenji, izumi.taku,
	kosaki.motohiro, laijs, scott.a.mcmillan, rostedt, eric.dumazet,
	fweisbec, mathieu.desnoyers
In-Reply-To: <4C44F1AB.6020902@jp.fujitsu.com>

On Tue, Jul 20, 2010 at 09:45:31AM +0900, Koki Sanagi wrote:
> From: Lai Jiangshan <laijs@cn.fujitsu.com>
> 
> Add a tracepoint for tracing when softirq action is raised.
> 
> Together with the existing tracepoints, it completes the set of softirq
> tracepoints: softirq_raise, softirq_entry and softirq_exit.
> 
> And when this tracepoint is used in combination with
> the softirq_entry tracepoint we can determine
> the softirq raise latency.
> 
> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
> 
> [ factorize softirq events with DECLARE_EVENT_CLASS ]
> Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
> ---
>  include/linux/interrupt.h  |    8 +++++-
>  include/trace/events/irq.h |   57 ++++++++++++++++++++++++++-----------------
>  kernel/softirq.c           |    4 +-
>  3 files changed, 43 insertions(+), 26 deletions(-)
> 
> diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> index c233113..1cb5726 100644
> --- a/include/linux/interrupt.h
> +++ b/include/linux/interrupt.h
> @@ -18,6 +18,7 @@
>  #include <asm/atomic.h>
>  #include <asm/ptrace.h>
>  #include <asm/system.h>
> +#include <trace/events/irq.h>
>  
>  /*
>   * These correspond to the IORESOURCE_IRQ_* defines in
> @@ -402,7 +403,12 @@ asmlinkage void do_softirq(void);
>  asmlinkage void __do_softirq(void);
>  extern void open_softirq(int nr, void (*action)(struct softirq_action *));
>  extern void softirq_init(void);
> -#define __raise_softirq_irqoff(nr) do { or_softirq_pending(1UL << (nr)); } while (0)
> +static inline void __raise_softirq_irqoff(unsigned int nr)
> +{
> +	trace_softirq_raise(nr);
> +	or_softirq_pending(1UL << nr);
> +}
> +
We already have tracepoints in irq_enter and irq_exit.  If the goal here is to
detect latency during packet processing, can't the time delta between those
two points be used to determine interrupt handling latency?
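
To make the comparison concrete, this is roughly the kind of post-processing I have in mind for a delta between two tracepoints, shown here for the raise/entry pair from the patch description. It is only a sketch: the `struct ev` record layout and `max_raise_latency()` are invented for illustration and do not correspond to any existing trace format or tool.

```c
#include <assert.h>
#include <stdint.h>

enum ev_type { EV_RAISE, EV_ENTRY, EV_EXIT };

/* Hypothetical trace record: event type, softirq vector, timestamp in ns. */
struct ev {
	enum ev_type type;
	unsigned int vec;
	uint64_t ts;
};

/*
 * Walk a time-ordered event array and return the largest raise-to-entry
 * delay seen for vector vec, or 0 if no raise/entry pair was found.
 */
static uint64_t max_raise_latency(const struct ev *evs, int n, unsigned int vec)
{
	uint64_t raised_at = 0, worst = 0;
	int pending = 0;
	int i;

	for (i = 0; i < n; i++) {
		if (evs[i].vec != vec)
			continue;
		if (evs[i].type == EV_RAISE && !pending) {
			raised_at = evs[i].ts;	/* keep the first raise */
			pending = 1;
		} else if (evs[i].type == EV_ENTRY && pending) {
			if (evs[i].ts - raised_at > worst)
				worst = evs[i].ts - raised_at;
			pending = 0;
		}
	}
	return worst;
}
```

Repeated raises before the handler runs are collapsed onto the first one, since that is when the work first became pending.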


>  extern void raise_softirq_irqoff(unsigned int nr);
>  extern void raise_softirq(unsigned int nr);
>  extern void wakeup_softirqd(void);
> diff --git a/include/trace/events/irq.h b/include/trace/events/irq.h
> index 0e4cfb6..717744c 100644
> --- a/include/trace/events/irq.h
> +++ b/include/trace/events/irq.h
> @@ -5,7 +5,9 @@
>  #define _TRACE_IRQ_H
>  
>  #include <linux/tracepoint.h>
> -#include <linux/interrupt.h>
> +
> +struct irqaction;
> +struct softirq_action;
>  
>  #define softirq_name(sirq) { sirq##_SOFTIRQ, #sirq }
>  #define show_softirq_name(val)				\
> @@ -84,56 +86,65 @@ TRACE_EVENT(irq_handler_exit,
>  
>  DECLARE_EVENT_CLASS(softirq,
>  
> -	TP_PROTO(struct softirq_action *h, struct softirq_action *vec),
> +	TP_PROTO(unsigned int nr),
>  
> -	TP_ARGS(h, vec),
> +	TP_ARGS(nr),
>  
>  	TP_STRUCT__entry(
> -		__field(	int,	vec			)
> +		__field(	unsigned int,	vec	)
>  	),
>  
>  	TP_fast_assign(
> -		__entry->vec = (int)(h - vec);
> +		__entry->vec	= nr;
>  	),
>  
>  	TP_printk("vec=%d [action=%s]", __entry->vec,
> -		  show_softirq_name(__entry->vec))
> +		show_softirq_name(__entry->vec))
> +);
> +
> +/**
> + * softirq_raise - called immediately when a softirq is raised
> + * @nr: softirq vector number
> + *
> + * Tracepoint for tracing when softirq action is raised.
> + * Also, when used in combination with the softirq_entry tracepoint
> + * we can determine the softirq raise latency.
> + */
> +DEFINE_EVENT(softirq, softirq_raise,
> +
> +	TP_PROTO(unsigned int nr),
> +
> +	TP_ARGS(nr)
>  );
>  
>  /**
>   * softirq_entry - called immediately before the softirq handler
> - * @h: pointer to struct softirq_action
> - * @vec: pointer to first struct softirq_action in softirq_vec array
> + * @nr: softirq vector number
>   *
> - * The @h parameter, contains a pointer to the struct softirq_action
> - * which has a pointer to the action handler that is called. By subtracting
> - * the @vec pointer from the @h pointer, we can determine the softirq
> - * number. Also, when used in combination with the softirq_exit tracepoint
> + * Tracepoint for tracing when softirq action starts.
> + * Also, when used in combination with the softirq_exit tracepoint
>   * we can determine the softirq latency.
>   */
>  DEFINE_EVENT(softirq, softirq_entry,
>  
> -	TP_PROTO(struct softirq_action *h, struct softirq_action *vec),
> +	TP_PROTO(unsigned int nr),
>  
> -	TP_ARGS(h, vec)
> +	TP_ARGS(nr)
>  );
>  
>  /**
>   * softirq_exit - called immediately after the softirq handler returns
> - * @h: pointer to struct softirq_action
> - * @vec: pointer to first struct softirq_action in softirq_vec array
> + * @nr: softirq vector number
>   *
> - * The @h parameter contains a pointer to the struct softirq_action
> - * that has handled the softirq. By subtracting the @vec pointer from
> - * the @h pointer, we can determine the softirq number. Also, when used in
> - * combination with the softirq_entry tracepoint we can determine the softirq
> - * latency.
> + * Tracepoint for tracing when softirq action ends.
> + * Also, when used in combination with the softirq_entry tracepoint
> + * we can determine the softirq latency.
>   */
>  DEFINE_EVENT(softirq, softirq_exit,
>  
> -	TP_PROTO(struct softirq_action *h, struct softirq_action *vec),
> +	TP_PROTO(unsigned int nr),
>  
> -	TP_ARGS(h, vec)
> +	TP_ARGS(nr)
>  );
>  
>  #endif /*  _TRACE_IRQ_H */
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index 825e112..6790599 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -215,9 +215,9 @@ restart:
>  			int prev_count = preempt_count();
>  			kstat_incr_softirqs_this_cpu(h - softirq_vec);
>  
> -			trace_softirq_entry(h, softirq_vec);
> +			trace_softirq_entry(h - softirq_vec);
>  			h->action(h);
> -			trace_softirq_exit(h, softirq_vec);
> +			trace_softirq_exit(h - softirq_vec);

You're losing information here by reducing the number of parameters in this
tracepoint.  How many other tracepoint scripts rely on having both pointers
handy?  Why not just do the pointer math inside your tracehook instead?
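
To illustrate what "pointer math inside the tracehook" could look like, here is
a minimal userspace sketch, not kernel code.  The struct, the array size and
the helper name softirq_nr() are stand-ins I made up for illustration; only the
subtraction itself mirrors the (h - vec) expression from the old
TP_fast_assign.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the kernel's struct softirq_action. */
struct softirq_action {
	void (*action)(struct softirq_action *);
};

/* Assumed table size for this sketch; the kernel defines its own value. */
#define NR_SOFTIRQS 10

/* Stand-in for the kernel's softirq_vec[] table. */
struct softirq_action softirq_vec[NR_SOFTIRQS];

/*
 * Hypothetical probe helper: the tracepoint keeps passing both pointers,
 * and the probe recovers the vector number by pointer subtraction.
 */
unsigned int softirq_nr(const struct softirq_action *h,
			const struct softirq_action *vec)
{
	/* h must point into the table that starts at vec. */
	assert(h >= vec && h < vec + NR_SOFTIRQS);
	return (unsigned int)(h - vec);
}
```

With this shape, a probe that only wants the number calls something like
softirq_nr(h, vec), while probes that want the handler pointer still have
access to h.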

>  			if (unlikely(prev_count != preempt_count())) {
>  				printk(KERN_ERR "huh, entered softirq %td %s %p"
>  				       "with preempt_count %08x,"
> 
> 


* Re: [RFC PATCH v3 2/5] napi: convert trace_napi_poll to TRACE_EVENT
From: Neil Horman @ 2010-07-20 11:09 UTC (permalink / raw)
  To: Koki Sanagi
  Cc: netdev, linux-kernel, davem, kaneshige.kenji, izumi.taku,
	kosaki.motohiro, laijs, scott.a.mcmillan, rostedt, eric.dumazet,
	fweisbec, mathieu.desnoyers
In-Reply-To: <4C44F1FB.8080309@jp.fujitsu.com>

On Tue, Jul 20, 2010 at 09:46:51AM +0900, Koki Sanagi wrote:
> From: Neil Horman <nhorman@tuxdriver.com>
> 
> This patch converts trace_napi_poll from DECLARE_TRACE to TRACE_EVENT to improve
> the usability of the napi_poll tracepoint.
> 
>           <idle>-0     [001] 241302.750777: napi_poll: napi poll on napi struct f6acc480 for device eth3
>           <idle>-0     [000] 241302.852389: napi_poll: napi poll on napi struct f5d0d70c for device eth1
> 
> The original patch is below.
> http://marc.info/?l=linux-kernel&m=126021713809450&w=2
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> 
> This version also adds a fix by Steven Rostedt.
> http://marc.info/?l=linux-kernel&m=126150506519173&w=2
> 
> Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
> ---
>  include/trace/events/napi.h |   25 +++++++++++++++++++++++--
>  1 files changed, 23 insertions(+), 2 deletions(-)
> 
> diff --git a/include/trace/events/napi.h b/include/trace/events/napi.h
> index 188deca..8fe1e93 100644
> --- a/include/trace/events/napi.h
> +++ b/include/trace/events/napi.h
> @@ -6,10 +6,31 @@
>  
>  #include <linux/netdevice.h>
>  #include <linux/tracepoint.h>
> +#include <linux/ftrace.h>
> +
> +#define NO_DEV "(no_device)"
> +
> +TRACE_EVENT(napi_poll,
>  
> -DECLARE_TRACE(napi_poll,
>  	TP_PROTO(struct napi_struct *napi),
> -	TP_ARGS(napi));
> +
> +	TP_ARGS(napi),
> +
> +	TP_STRUCT__entry(
> +		__field(	struct napi_struct *,	napi)
> +		__string(	dev_name, napi->dev ? napi->dev->name : NO_DEV)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->napi = napi;
> +		__assign_str(dev_name, napi->dev ? napi->dev->name : NO_DEV);
> +	),
> +
> +	TP_printk("napi poll on napi struct %p for device %s",
> +		__entry->napi, __get_str(dev_name))
> +);
> +
> +#undef NO_DEV
>  
>  #endif /* _TRACE_NAPI_H_ */
>  
> 
NAK.  This change will break the build of the drop monitor code.  You'll
need to fix that up if you want to make this change.
Neil


* Re: [RFC v2] netfilter: xt_condition: add condition target
From: Jan Engelhardt @ 2010-07-20 11:11 UTC (permalink / raw)
  To: Luciano Coelho
  Cc: netfilter-devel@vger.kernel.org, netdev@vger.kernel.org,
	kaber@trash.net, sameo@linux.intel.com
In-Reply-To: <1279623855.16431.4.camel@powerslave>


On Tuesday 2010-07-20 13:04, Luciano Coelho wrote:
>
>Yes, I made this patch on top of the one you have sent earlier for
>upstream inclusion.  There were some comments from Patrick to that one
>and, as I said in my email yesterday, I'll rebase the target patches
>once the original one is included upstream.

The original one won't be included; that is, you will basically be making
the initial upstream submission.
However, you are right: splitting this into two patches is a good idea and
is in fact what I advocate too (see the xt_TEE discussion about specifying
oif). Avoiding a single huge patch is best practice.
Just be sure to have condition plus its 32-bit upgrade patch merged at
the same time.

>Do you want me to take a look at Patrick's comments and resubmit the
>patch you've sent with the changes Patrick asked for?

Yes. Not obeying His Highness's wishes is a death knell for a module ;-)


