Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next] sysfs: add entry to indicate network interfaces with random MAC address
From: Ben Hutchings @ 2010-07-20 12:07 UTC (permalink / raw)
  To: Stefan Assmann
  Cc: netdev, Linux Kernel Mailing List, davem, Andy Gospodarek,
	Rose, Gregory V, Duyck, Alexander H, Casey Leedom, Harald Hoyer
In-Reply-To: <4C458CB7.3030508@redhat.com>

On Tue, 2010-07-20 at 13:47 +0200, Stefan Assmann wrote:
> On 20.07.2010 13:20, Ben Hutchings wrote:
> > On Tue, 2010-07-20 at 12:50 +0200, Stefan Assmann wrote:
> >> From: Stefan Assmann <sassmann@redhat.com>
> >>
> >> Reserve a bit in struct net_device to indicate whether an interface
> >> generates its MAC address randomly, and expose the information via
> >> sysfs.
> >> May look like this:
> >> /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/net/eth0/ifrndmac
> >>
> >> By default the value of ifrndmac is 0. Any driver that generates the MAC
> >> address randomly should return a value to 1.
> > 
> > The name should incorporate 'address', not 'mac', for consistency with
> > the generic 'address' attribute.
> 
> We can call it "ifrndhwaddr" if that's more consistent.
> 
> > 
> > What about devices that 'steal' MAC addresses from slave devices?
> > Currently I believe udev has special cases for them but ideally these
> > drivers would indicate explicitly that their addresses are not stable
> > identifiers (even though they aren't random).
> 
> It's really up to the driver to decide whether it makes sense to set the
> flag or not. The question is what should udev do with these MAC address
> stealing devices apart from ignoring them? Sorry I have no higher
> insight into it.
> This flag has the purpose to allow udev to explicitly handle devices
> that generate their MAC address randomly and generate a persistent
> rule based on the device path instead of the MAC address.
> I'm open for suggestions but I'm not sure we can handle both cases with
> a single flag.

OK, then call it something like 'address_temporary'.

> JFYI this is an alternative approach to changing the kernel name of VFs
> to vfeth. The advantage of this way should be that we don't break any
> user-space applications that rely on network interfaces being names
> "ethX". Actually this goes in the direction of "fixing udev" which was
> what you asked for in your comment on my first patch concering vfeth. :)
> 
> > 
> > [...]
> >> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> >> index b626289..2ea0298 100644
> >> --- a/include/linux/netdevice.h
> >> +++ b/include/linux/netdevice.h
> >> @@ -845,6 +845,7 @@ struct net_device {
> >>  #define NETIF_F_FCOE_MTU	(1 << 26) /* Supports max FCoE MTU, 2158 bytes*/
> >>  #define NETIF_F_NTUPLE		(1 << 27) /* N-tuple filters supported */
> >>  #define NETIF_F_RXHASH		(1 << 28) /* Receive hashing offload */
> >> +#define NETIF_F_RNDMAC		(1 << 29) /* Interface with random MAC address */
> > [...]
> > 
> > This is not really a feature, and we are running out of real feature
> > bits.  Can you find somewhere else to put this flag?
> 
> Actually Dave Miller suggested to put it there. What other place is
> there to put it?

If Dave said that then I'm sure it's OK.

However, if you define this as an interface flag (net_device::flags;
<linux/if.h>) and add it to the set of changeable flags in
__dev_change_flags(), user-space will be able to clear the flag if it
later sets a stable address.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [PATCH net-next] sysfs: add entry to indicate network interfaces with random MAC address
From: Alex Badea @ 2010-07-20 11:58 UTC (permalink / raw)
  To: Stefan Assmann
  Cc: Ben Hutchings, netdev, Linux Kernel Mailing List, davem,
	Andy Gospodarek, Rose, Gregory V, Duyck, Alexander H,
	Casey Leedom, Harald Hoyer
In-Reply-To: <4C458CB7.3030508@redhat.com>

Hi,

On 07/20/2010 02:47 PM, Stefan Assmann wrote:
>> What about devices that 'steal' MAC addresses from slave devices?

Might I suggest an attribute such as "address_type", which could report
"permanent", "random", "stolen", or something else in the future?

My two cents,
Alex

^ permalink raw reply

* Re: [RFC v2] netfilter: xt_condition: add condition target
From: Luciano Coelho @ 2010-07-20 11:53 UTC (permalink / raw)
  To: ext Jan Engelhardt
  Cc: netfilter-devel@vger.kernel.org, netdev@vger.kernel.org,
	kaber@trash.net, sameo@linux.intel.com
In-Reply-To: <alpine.LSU.2.01.1007201343470.3843@obet.zrqbmnf.qr>

On Tue, 2010-07-20 at 13:46 +0200, ext Jan Engelhardt wrote:
> On Tuesday 2010-07-20 13:32, Luciano Coelho wrote:
> >it in any archives).  Then a few days ago I asked you if you were going
> >to resend and you answered:
> >
> >On Fri, 2010-07-16 at 13:20 +0200, ext Jan Engelhardt wrote:
> >> On Friday 2010-07-16 13:10, Luciano Coelho wrote:
> >> >Are you planning to resend this patch with the changes Patrick
> >> >suggested?
> >> 
> >> I can try. 
> >
> >So I assumed you would resubmit it for upstream inclusion.  I probably
> >misunderstood something in the way then ;)
> 
> It usually means "I'll do it .... unless someone else does first".
> Since you proposed the condition target, you nicely took all of
> xt_condition into your hands. Or as Linus once put it:
> "they are doing all the work, and I get the credit". :-)

:D

I'm fine with that, I just didn't want to jump in and steal somebody
else's work ;)


-- 
Cheers,
Luca.


^ permalink raw reply

* Re: [RFC PATCH v3 4/5] skb: add tracepoints to freeing skb
From: Neil Horman @ 2010-07-20 11:50 UTC (permalink / raw)
  To: Koki Sanagi
  Cc: netdev, linux-kernel, davem, kaneshige.kenji, izumi.taku,
	kosaki.motohiro, laijs, scott.a.mcmillan, rostedt, eric.dumazet,
	fweisbec, mathieu.desnoyers
In-Reply-To: <4C44F286.1050907@jp.fujitsu.com>

On Tue, Jul 20, 2010 at 09:49:10AM +0900, Koki Sanagi wrote:
> [RFC PATCH v3 4/5] skb: add tracepoints to freeing skb
> This patch adds tracepoint to consume_skb, dev_kfree_skb_irq and
> skb_free_datagram_locked. Combinating with tracepoint on dev_hard_start_xmit,
> we can check how long it takes to free transmited packets. And using it, we can
> calculate how many packets driver had at that time. It is useful when a drop of
> transmited packet is a problem.
> 
>           <idle>-0     [001] 241409.218333: consume_skb: skbaddr=dd6b2fb8
>           <idle>-0     [001] 241409.490555: dev_kfree_skb_irq: skbaddr=f5e29840
> 
>         udp-recv-302   [001] 515031.206008: skb_free_datagram_locked: skbaddr=f5b1d900
> 
> 
> Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
> ---
>  include/trace/events/skb.h |   42 ++++++++++++++++++++++++++++++++++++++++++
>  net/core/datagram.c        |    1 +
>  net/core/dev.c             |    2 ++
>  net/core/skbuff.c          |    1 +
>  4 files changed, 46 insertions(+), 0 deletions(-)
> 
> diff --git a/include/trace/events/skb.h b/include/trace/events/skb.h
> index 4b2be6d..84c9041 100644
> --- a/include/trace/events/skb.h
> +++ b/include/trace/events/skb.h
> @@ -35,6 +35,48 @@ TRACE_EVENT(kfree_skb,
>  		__entry->skbaddr, __entry->protocol, __entry->location)
>  );
>  
> +DECLARE_EVENT_CLASS(free_skb,
> +
> +	TP_PROTO(struct sk_buff *skb),
> +
> +	TP_ARGS(skb),
> +
> +	TP_STRUCT__entry(
> +		__field(	void *,	skbaddr	)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->skbaddr = skb;
> +	),
> +
> +	TP_printk("skbaddr=%p", __entry->skbaddr)
> +
> +);
> +
> +DEFINE_EVENT(free_skb, consume_skb,
> +
> +	TP_PROTO(struct sk_buff *skb),
> +
> +	TP_ARGS(skb)
> +
> +);
> +
> +DEFINE_EVENT(free_skb, dev_kfree_skb_irq,
> +
> +	TP_PROTO(struct sk_buff *skb),
> +
> +	TP_ARGS(skb)
> +
> +);
> +
> +DEFINE_EVENT(free_skb, skb_free_datagram_locked,
> +
> +	TP_PROTO(struct sk_buff *skb),
> +
> +	TP_ARGS(skb)
> +
> +);
> +

Why create these last two tracepoints at all?  dev_kfree_skb_irq will eventually
pass through kfree_skb anyway, getting picked up by the tracepoint there, the
while the latter won't (since it uses __kfree_skb instead), I think that could
be fixed up by add a call to trace_kfree_skb there directly, saving you two
tracepoints.

Neil

> 

^ permalink raw reply

* Re: [PATCH net-next] sysfs: add entry to indicate network interfaces with random MAC address
From: Stefan Assmann @ 2010-07-20 11:47 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: netdev, Linux Kernel Mailing List, davem, Andy Gospodarek,
	Rose, Gregory V, Duyck, Alexander H, Casey Leedom, Harald Hoyer
In-Reply-To: <1279624844.2110.3.camel@achroite.uk.solarflarecom.com>

On 20.07.2010 13:20, Ben Hutchings wrote:
> On Tue, 2010-07-20 at 12:50 +0200, Stefan Assmann wrote:
>> From: Stefan Assmann <sassmann@redhat.com>
>>
>> Reserve a bit in struct net_device to indicate whether an interface
>> generates its MAC address randomly, and expose the information via
>> sysfs.
>> May look like this:
>> /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/net/eth0/ifrndmac
>>
>> By default the value of ifrndmac is 0. Any driver that generates the MAC
>> address randomly should return a value to 1.
> 
> The name should incorporate 'address', not 'mac', for consistency with
> the generic 'address' attribute.

We can call it "ifrndhwaddr" if that's more consistent.

> 
> What about devices that 'steal' MAC addresses from slave devices?
> Currently I believe udev has special cases for them but ideally these
> drivers would indicate explicitly that their addresses are not stable
> identifiers (even though they aren't random).

It's really up to the driver to decide whether it makes sense to set the
flag or not. The question is what should udev do with these MAC address
stealing devices apart from ignoring them? Sorry I have no higher
insight into it.
This flag has the purpose to allow udev to explicitly handle devices
that generate their MAC address randomly and generate a persistent
rule based on the device path instead of the MAC address.
I'm open for suggestions but I'm not sure we can handle both cases with
a single flag.

JFYI this is an alternative approach to changing the kernel name of VFs
to vfeth. The advantage of this way should be that we don't break any
user-space applications that rely on network interfaces being names
"ethX". Actually this goes in the direction of "fixing udev" which was
what you asked for in your comment on my first patch concering vfeth. :)

> 
> [...]
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index b626289..2ea0298 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -845,6 +845,7 @@ struct net_device {
>>  #define NETIF_F_FCOE_MTU	(1 << 26) /* Supports max FCoE MTU, 2158 bytes*/
>>  #define NETIF_F_NTUPLE		(1 << 27) /* N-tuple filters supported */
>>  #define NETIF_F_RXHASH		(1 << 28) /* Receive hashing offload */
>> +#define NETIF_F_RNDMAC		(1 << 29) /* Interface with random MAC address */
> [...]
> 
> This is not really a feature, and we are running out of real feature
> bits.  Can you find somewhere else to put this flag?

Actually Dave Miller suggested to put it there. What other place is
there to put it?

  Stefan
--
Stefan Assmann         | Red Hat GmbH
Software Engineer      | Otto-Hahn-Strasse 20, 85609 Dornach
                       | HR: Amtsgericht Muenchen HRB 153243
                       | GF: Brendan Lane, Charlie Peters,
sassmann at redhat.com |     Michael Cunningham, Charles Cachera

^ permalink raw reply

* Re: [RFC v2] netfilter: xt_condition: add condition target
From: Jan Engelhardt @ 2010-07-20 11:46 UTC (permalink / raw)
  To: Luciano Coelho
  Cc: netfilter-devel@vger.kernel.org, netdev@vger.kernel.org,
	kaber@trash.net, sameo@linux.intel.com
In-Reply-To: <1279625526.16431.22.camel@powerslave>


On Tuesday 2010-07-20 13:32, Luciano Coelho wrote:
>it in any archives).  Then a few days ago I asked you if you were going
>to resend and you answered:
>
>On Fri, 2010-07-16 at 13:20 +0200, ext Jan Engelhardt wrote:
>> On Friday 2010-07-16 13:10, Luciano Coelho wrote:
>> >Are you planning to resend this patch with the changes Patrick
>> >suggested?
>> 
>> I can try. 
>
>So I assumed you would resubmit it for upstream inclusion.  I probably
>misunderstood something in the way then ;)

It usually means "I'll do it .... unless someone else does first".
Since you proposed the condition target, you nicely took all of
xt_condition into your hands. Or as Linus once put it:
"they are doing all the work, and I get the credit". :-)

>> Just be sure to have condition plus its 32-bit upgrade patch merged at
>> the same time.
>
>I think the best idea will be to send a patchset with the three patches
>at once (original xt_condition, plus the target patch, plus the 32-bit
>patch).

Fine by me.

^ permalink raw reply

* Re: [RFC PATCH v3 3/5] netdev: add tracepoints to netdev layer
From: Neil Horman @ 2010-07-20 11:41 UTC (permalink / raw)
  To: Koki Sanagi
  Cc: netdev, linux-kernel, davem, kaneshige.kenji, izumi.taku,
	kosaki.motohiro, laijs, scott.a.mcmillan, rostedt, eric.dumazet,
	fweisbec, mathieu.desnoyers
In-Reply-To: <4C44F23F.6050800@jp.fujitsu.com>

On Tue, Jul 20, 2010 at 09:47:59AM +0900, Koki Sanagi wrote:
> This patch adds tracepoint to dev_queue_xmit, dev_hard_start_xmit and
> netif_receive_skb. These tracepoints help you to monitor network driver's
> input/output.
> 
>             sshd-4445  [001] 241367.066046: net_dev_queue: dev=eth3 skbaddr=dd6b2538 len=114
>             sshd-4445  [001] 241367.066047: net_dev_xmit: dev=eth3 skbaddr=dd6b2538 len=114 rc=0
>           <idle>-0     [001] 241367.067472: net_dev_receive: dev=eth3 skbaddr=f5e59000 len=52
> 
> Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
> ---
>  include/trace/events/net.h |   75 ++++++++++++++++++++++++++++++++++++++++++++
>  net/core/dev.c             |    5 +++
>  net/core/net-traces.c      |    1 +
>  3 files changed, 81 insertions(+), 0 deletions(-)
> 
> diff --git a/include/trace/events/net.h b/include/trace/events/net.h
> new file mode 100644
> index 0000000..8a21361
> --- /dev/null
> +++ b/include/trace/events/net.h
> @@ -0,0 +1,75 @@
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM net
> +
> +#if !defined(_TRACE_NET_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_NET_H
> +
> +#include <linux/skbuff.h>
> +#include <linux/netdevice.h>
> +#include <linux/ip.h>
> +#include <linux/tracepoint.h>
> +
> +TRACE_EVENT(net_dev_xmit,
> +
> +	TP_PROTO(struct sk_buff *skb,
> +		 int rc),
> +
> +	TP_ARGS(skb, rc),
> +
> +	TP_STRUCT__entry(
> +		__field(	void *,		skbaddr		)
> +		__field(	unsigned int,	len		)
> +		__field(	int,		rc		)
> +		__string(	name,		skb->dev->name	)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->skbaddr = skb;
> +		__entry->len = skb->len;
> +		__entry->rc = rc;
> +		__assign_str(name, skb->dev->name);
> +	),
> +
> +	TP_printk("dev=%s skbaddr=%p len=%u rc=%d",
> +		__get_str(name), __entry->skbaddr, __entry->len, __entry->rc)
> +);
> +
> +DECLARE_EVENT_CLASS(net_dev_template,
> +
> +	TP_PROTO(struct sk_buff *skb),
> +
> +	TP_ARGS(skb),
> +
> +	TP_STRUCT__entry(
> +		__field(	void *,		skbaddr		)
> +		__field(	unsigned int,	len		)
> +		__string(	name,		skb->dev->name	)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->skbaddr = skb;
> +		__entry->len = skb->len;
> +		__assign_str(name, skb->dev->name);
> +	),
> +
> +	TP_printk("dev=%s skbaddr=%p len=%u",
> +		__get_str(name), __entry->skbaddr, __entry->len)
> +)
> +
> +DEFINE_EVENT(net_dev_template, net_dev_queue,
> +
> +	TP_PROTO(struct sk_buff *skb),
> +
> +	TP_ARGS(skb)
> +);
> +
> +DEFINE_EVENT(net_dev_template, net_dev_receive,
> +
> +	TP_PROTO(struct sk_buff *skb),
> +
> +	TP_ARGS(skb)
> +);
> +#endif /* _TRACE_NET_H */
> +
> +/* This part must be outside protection */
> +#include <trace/define_trace.h>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 93b8929..4acfec6 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -130,6 +130,7 @@
>  #include <linux/jhash.h>
>  #include <linux/random.h>
>  #include <trace/events/napi.h>
> +#include <trace/events/net.h>
>  #include <linux/pci.h>
>  
>  #include "net-sysfs.h"
> @@ -1955,6 +1956,7 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
>  		}
>  
>  		rc = ops->ndo_start_xmit(skb, dev);
> +		trace_net_dev_xmit(skb, rc);
>  		if (rc == NETDEV_TX_OK)
>  			txq_trans_update(txq);
>  		return rc;
> @@ -1975,6 +1977,7 @@ gso:
>  			skb_dst_drop(nskb);
>  
>  		rc = ops->ndo_start_xmit(nskb, dev);
> +		trace_net_dev_xmit(nskb, rc);
>  		if (unlikely(rc != NETDEV_TX_OK)) {
>  			if (rc & ~NETDEV_TX_MASK)
>  				goto out_kfree_gso_skb;
> @@ -2165,6 +2168,7 @@ int dev_queue_xmit(struct sk_buff *skb)
>  #ifdef CONFIG_NET_CLS_ACT
>  	skb->tc_verd = SET_TC_AT(skb->tc_verd, AT_EGRESS);
>  #endif
> +	trace_net_dev_queue(skb);
>  	if (q->enqueue) {
>  		rc = __dev_xmit_skb(skb, q, dev, txq);
>  		goto out;
> @@ -2939,6 +2943,7 @@ int netif_receive_skb(struct sk_buff *skb)
>  	if (netdev_tstamp_prequeue)
>  		net_timestamp_check(skb);
>  
> +	trace_net_dev_receive(skb);

I imagine for completeness you'll want to make a call to this in netif_rx and
netif_rx_ni as well.

^ permalink raw reply

* Re: [RFC v2] netfilter: xt_condition: add condition target
From: Luciano Coelho @ 2010-07-20 11:32 UTC (permalink / raw)
  To: ext Jan Engelhardt
  Cc: netfilter-devel@vger.kernel.org, netdev@vger.kernel.org,
	kaber@trash.net, sameo@linux.intel.com
In-Reply-To: <alpine.LSU.2.01.1007201308370.526@obet.zrqbmnf.qr>

On Tue, 2010-07-20 at 13:11 +0200, ext Jan Engelhardt wrote:
> On Tuesday 2010-07-20 13:04, Luciano Coelho wrote:
> >
> >Yes, I made this patch on top of the one you have sent earlier for
> >upstream inclusion.  There were some comments from Patrick to that one
> >and, as I said in my email yesterday, I'll rebase the target patches
> >once the original one is included upstream.
> 
> The original one won't be - that is, basically you will be making the
> initial upstream submission.

Oh, ok.  I thought you had already submitted it to upstream in Apr 21.
As I understood from that thread, you were going to make the necessary
changes and resubmit:

Jan Engelhardt <jengelh@medozas.de> writes:
> On Thursday 2010-04-22 13:14, Patrick McHardy wrote:
> 
> > This looks better, thanks. A few remaining questions about things
> > I missed previously:
> 
> Will deal with it shortly.

But I never saw a follow up to that email (or at least I couldn't find
it in any archives).  Then a few days ago I asked you if you were going
to resend and you answered:

On Fri, 2010-07-16 at 13:20 +0200, ext Jan Engelhardt wrote:
> On Friday 2010-07-16 13:10, Luciano Coelho wrote:
> >Are you planning to resend this patch with the changes Patrick
> >suggested?
> 
> I can try. 

So I assumed you would resubmit it for upstream inclusion.  I probably
misunderstood something in the way then ;)

> However, you are right; fabricating two patches is a good idea and
> is in fact what I advertise too (xt_TEE discussion about specifying
> oif..) - avoiding a singular huge patch is a best practice.

Sure, I also agree that patches should be small and incremental.
Especially since the xt_condition already exists elsewhere, I think it's
best to get it included upstream as is and then start improving it with
subsequent patches.

> Just be sure to have condition plus its 32-bit upgrade patch merged at
> the same time.

I think the best idea will be to send a patchset with the three patches
at once (original xt_condition, plus the target patch, plus the 32-bit
patch).

> >Do you want me to take a look at Patrick's comments and resubmit the
> >patch you've sent with the changes Patrick asked for?
> 
> Yes. Not obeying His Highness's wishes is a death nail for a module ;-)

Heh! And myself, as a newbie, certainly don't want to do that. ;)

-- 
Cheers,
Luca.

^ permalink raw reply

* Re: [PATCH net-next] sysfs: add entry to indicate network interfaces with random MAC address
From: Ben Hutchings @ 2010-07-20 11:20 UTC (permalink / raw)
  To: Stefan Assmann
  Cc: netdev, Linux Kernel Mailing List, davem, Andy Gospodarek,
	Rose, Gregory V, Duyck, Alexander H, Casey Leedom, Harald Hoyer
In-Reply-To: <4C457F72.8090708@redhat.com>

On Tue, 2010-07-20 at 12:50 +0200, Stefan Assmann wrote:
> From: Stefan Assmann <sassmann@redhat.com>
> 
> Reserve a bit in struct net_device to indicate whether an interface
> generates its MAC address randomly, and expose the information via
> sysfs.
> May look like this:
> /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/net/eth0/ifrndmac
> 
> By default the value of ifrndmac is 0. Any driver that generates the MAC
> address randomly should return a value to 1.

The name should incorporate 'address', not 'mac', for consistency with
the generic 'address' attribute.

What about devices that 'steal' MAC addresses from slave devices?
Currently I believe udev has special cases for them but ideally these
drivers would indicate explicitly that their addresses are not stable
identifiers (even though they aren't random).

[...]
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index b626289..2ea0298 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -845,6 +845,7 @@ struct net_device {
>  #define NETIF_F_FCOE_MTU	(1 << 26) /* Supports max FCoE MTU, 2158 bytes*/
>  #define NETIF_F_NTUPLE		(1 << 27) /* N-tuple filters supported */
>  #define NETIF_F_RXHASH		(1 << 28) /* Receive hashing offload */
> +#define NETIF_F_RNDMAC		(1 << 29) /* Interface with random MAC address */
[...]

This is not really a feature, and we are running out of real feature
bits.  Can you find somewhere else to put this flag?

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [RFC v2] netfilter: xt_condition: add condition target
From: Jan Engelhardt @ 2010-07-20 11:11 UTC (permalink / raw)
  To: Luciano Coelho
  Cc: netfilter-devel@vger.kernel.org, netdev@vger.kernel.org,
	kaber@trash.net, sameo@linux.intel.com
In-Reply-To: <1279623855.16431.4.camel@powerslave>

On Tuesday 2010-07-20 13:04, Luciano Coelho wrote:
>
>Yes, I made this patch on top of the one you have sent earlier for
>upstream inclusion.  There were some comments from Patrick to that one
>and, as I said in my email yesterday, I'll rebase the target patches
>once the original one is included upstream.

The original one won't be - that is, basically you will be making the
initial upstream submission.
However, you are right; fabricating two patches is a good idea and
is in fact what I advertise too (xt_TEE discussion about specifying
oif..) - avoiding a singular huge patch is a best practice.
Just be sure to have condition plus its 32-bit upgrade patch merged at
the same time.

>Do you want me to take a look at Patrick's comments and resubmit the
>patch you've sent with the changes Patrick asked for?

Yes. Not obeying His Highness's wishes is a death nail for a module ;-)

^ permalink raw reply

* Re: [RFC PATCH v3 2/5] napi: convert trace_napi_poll to TRACE_EVENT
From: Neil Horman @ 2010-07-20 11:09 UTC (permalink / raw)
  To: Koki Sanagi
  Cc: netdev, linux-kernel, davem, kaneshige.kenji, izumi.taku,
	kosaki.motohiro, laijs, scott.a.mcmillan, rostedt, eric.dumazet,
	fweisbec, mathieu.desnoyers
In-Reply-To: <4C44F1FB.8080309@jp.fujitsu.com>

On Tue, Jul 20, 2010 at 09:46:51AM +0900, Koki Sanagi wrote:
> From: Neil Horman <nhorman@tuxdriver.com>
> 
> This patch converts trace_napi_poll from DECLARE_EVENT to TRACE_EVENT to improve
> the usability of napi_poll tracepoint.
> 
>           <idle>-0     [001] 241302.750777: napi_poll: napi poll on napi struct f6acc480 for device eth3
>           <idle>-0     [000] 241302.852389: napi_poll: napi poll on napi struct f5d0d70c for device eth1
> 
> An original patch is below.
> http://marc.info/?l=linux-kernel&m=126021713809450&w=2
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> 
> And add a fix by Steven Rostedt.
> http://marc.info/?l=linux-kernel&m=126150506519173&w=2
> 
> Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
> ---
>  include/trace/events/napi.h |   25 +++++++++++++++++++++++--
>  1 files changed, 23 insertions(+), 2 deletions(-)
> 
> diff --git a/include/trace/events/napi.h b/include/trace/events/napi.h
> index 188deca..8fe1e93 100644
> --- a/include/trace/events/napi.h
> +++ b/include/trace/events/napi.h
> @@ -6,10 +6,31 @@
>  
>  #include <linux/netdevice.h>
>  #include <linux/tracepoint.h>
> +#include <linux/ftrace.h>
> +
> +#define NO_DEV "(no_device)"
> +
> +TRACE_EVENT(napi_poll,
>  
> -DECLARE_TRACE(napi_poll,
>  	TP_PROTO(struct napi_struct *napi),
> -	TP_ARGS(napi));
> +
> +	TP_ARGS(napi),
> +
> +	TP_STRUCT__entry(
> +		__field(	struct napi_struct *,	napi)
> +		__string(	dev_name, napi->dev ? napi->dev->name : NO_DEV)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->napi = napi;
> +		__assign_str(dev_name, napi->dev ? napi->dev->name : NO_DEV);
> +	),
> +
> +	TP_printk("napi poll on napi struct %p for device %s",
> +		__entry->napi, __get_str(dev_name))
> +);
> +
> +#undef NO_DEV
>  
>  #endif /* _TRACE_NAPI_H_ */
>  
> 
NAK, This change will create a build break in the drop monitor code.  You'll
need to fix that up if you want to make this change.
Neil

^ permalink raw reply

* Re: [RFC PATCH v3 1/5] irq: add tracepoint to softirq_raise
From: Neil Horman @ 2010-07-20 11:04 UTC (permalink / raw)
  To: Koki Sanagi
  Cc: netdev, linux-kernel, davem, kaneshige.kenji, izumi.taku,
	kosaki.motohiro, laijs, scott.a.mcmillan, rostedt, eric.dumazet,
	fweisbec, mathieu.desnoyers
In-Reply-To: <4C44F1AB.6020902@jp.fujitsu.com>

On Tue, Jul 20, 2010 at 09:45:31AM +0900, Koki Sanagi wrote:
> From: Lai Jiangshan <laijs@cn.fujitsu.com>
> 
> Add a tracepoint for tracing when softirq action is raised.
> 
> It and the existed tracepoints complete softirq's tracepoints:
> softirq_raise, softirq_entry and softirq_exit.
> 
> And when this tracepoint is used in combination with
> the softirq_entry tracepoint we can determine
> the softirq raise latency.
> 
> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
> 
> [ factorize softirq events with DECLARE_EVENT_CLASS ]
> Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
> ---
>  include/linux/interrupt.h  |    8 +++++-
>  include/trace/events/irq.h |   57 ++++++++++++++++++++++++++-----------------
>  kernel/softirq.c           |    4 +-
>  3 files changed, 43 insertions(+), 26 deletions(-)
> 
> diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> index c233113..1cb5726 100644
> --- a/include/linux/interrupt.h
> +++ b/include/linux/interrupt.h
> @@ -18,6 +18,7 @@
>  #include <asm/atomic.h>
>  #include <asm/ptrace.h>
>  #include <asm/system.h>
> +#include <trace/events/irq.h>
>  
>  /*
>   * These correspond to the IORESOURCE_IRQ_* defines in
> @@ -402,7 +403,12 @@ asmlinkage void do_softirq(void);
>  asmlinkage void __do_softirq(void);
>  extern void open_softirq(int nr, void (*action)(struct softirq_action *));
>  extern void softirq_init(void);
> -#define __raise_softirq_irqoff(nr) do { or_softirq_pending(1UL << (nr)); } while (0)
> +static inline void __raise_softirq_irqoff(unsigned int nr)
> +{
> +	trace_softirq_raise(nr);
> +	or_softirq_pending(1UL << nr);
> +}
> +
We already have tracepoints in irq_enter and irq_exit.  If the goal here is to
detect latency during packet processing, cant the delta in time between those
two points be used to determine interrupt handling latency?


>  extern void raise_softirq_irqoff(unsigned int nr);
>  extern void raise_softirq(unsigned int nr);
>  extern void wakeup_softirqd(void);
> diff --git a/include/trace/events/irq.h b/include/trace/events/irq.h
> index 0e4cfb6..717744c 100644
> --- a/include/trace/events/irq.h
> +++ b/include/trace/events/irq.h
> @@ -5,7 +5,9 @@
>  #define _TRACE_IRQ_H
>  
>  #include <linux/tracepoint.h>
> -#include <linux/interrupt.h>
> +
> +struct irqaction;
> +struct softirq_action;
>  
>  #define softirq_name(sirq) { sirq##_SOFTIRQ, #sirq }
>  #define show_softirq_name(val)				\
> @@ -84,56 +86,65 @@ TRACE_EVENT(irq_handler_exit,
>  
>  DECLARE_EVENT_CLASS(softirq,
>  
> -	TP_PROTO(struct softirq_action *h, struct softirq_action *vec),
> +	TP_PROTO(unsigned int nr),
>  
> -	TP_ARGS(h, vec),
> +	TP_ARGS(nr),
>  
>  	TP_STRUCT__entry(
> -		__field(	int,	vec			)
> +		__field(	unsigned int,	vec	)
>  	),
>  
>  	TP_fast_assign(
> -		__entry->vec = (int)(h - vec);
> +		__entry->vec	= nr;
>  	),
>  
>  	TP_printk("vec=%d [action=%s]", __entry->vec,
> -		  show_softirq_name(__entry->vec))
> +		show_softirq_name(__entry->vec))
> +);
> +
> +/**
> + * softirq_raise - called immediately when a softirq is raised
> + * @nr: softirq vector number
> + *
> + * Tracepoint for tracing when softirq action is raised.
> + * Also, when used in combination with the softirq_entry tracepoint
> + * we can determine the softirq raise latency.
> + */
> +DEFINE_EVENT(softirq, softirq_raise,
> +
> +	TP_PROTO(unsigned int nr),
> +
> +	TP_ARGS(nr)
>  );
>  
>  /**
>   * softirq_entry - called immediately before the softirq handler
> - * @h: pointer to struct softirq_action
> - * @vec: pointer to first struct softirq_action in softirq_vec array
> + * @nr: softirq vector number
>   *
> - * The @h parameter, contains a pointer to the struct softirq_action
> - * which has a pointer to the action handler that is called. By subtracting
> - * the @vec pointer from the @h pointer, we can determine the softirq
> - * number. Also, when used in combination with the softirq_exit tracepoint
> + * Tracepoint for tracing when softirq action starts.
> + * Also, when used in combination with the softirq_exit tracepoint
>   * we can determine the softirq latency.
>   */
>  DEFINE_EVENT(softirq, softirq_entry,
>  
> -	TP_PROTO(struct softirq_action *h, struct softirq_action *vec),
> +	TP_PROTO(unsigned int nr),
>  
> -	TP_ARGS(h, vec)
> +	TP_ARGS(nr)
>  );
>  
>  /**
>   * softirq_exit - called immediately after the softirq handler returns
> - * @h: pointer to struct softirq_action
> - * @vec: pointer to first struct softirq_action in softirq_vec array
> + * @nr: softirq vector number
>   *
> - * The @h parameter contains a pointer to the struct softirq_action
> - * that has handled the softirq. By subtracting the @vec pointer from
> - * the @h pointer, we can determine the softirq number. Also, when used in
> - * combination with the softirq_entry tracepoint we can determine the softirq
> - * latency.
> + * Tracepoint for tracing when softirq action ends.
> + * Also, when used in combination with the softirq_entry tracepoint
> + * we can determine the softirq latency.
>   */
>  DEFINE_EVENT(softirq, softirq_exit,
>  
> -	TP_PROTO(struct softirq_action *h, struct softirq_action *vec),
> +	TP_PROTO(unsigned int nr),
>  
> -	TP_ARGS(h, vec)
> +	TP_ARGS(nr)
>  );
>  
>  #endif /*  _TRACE_IRQ_H */
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index 825e112..6790599 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -215,9 +215,9 @@ restart:
>  			int prev_count = preempt_count();
>  			kstat_incr_softirqs_this_cpu(h - softirq_vec);
>  
> -			trace_softirq_entry(h, softirq_vec);
> +			trace_softirq_entry(h - softirq_vec);
>  			h->action(h);
> -			trace_softirq_exit(h, softirq_vec);
> +			trace_softirq_exit(h - softirq_vec);

You're loosing information here by reducing the numbers of parameters in this
tracepoint.  How many other tracepoint scripts rely on having both pointers
handy?  Why not just do the pointer math inside your tracehook instead?

>  			if (unlikely(prev_count != preempt_count())) {
>  				printk(KERN_ERR "huh, entered softirq %td %s %p"
>  				       "with preempt_count %08x,"
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply

* Re: [RFC v2] netfilter: xt_condition: add condition target
From: Luciano Coelho @ 2010-07-20 11:04 UTC (permalink / raw)
  To: ext Jan Engelhardt
  Cc: netfilter-devel@vger.kernel.org, netdev@vger.kernel.org,
	kaber@trash.net, sameo@linux.intel.com
In-Reply-To: <alpine.LSU.2.01.1007201235250.20447@obet.zrqbmnf.qr>

On Tue, 2010-07-20 at 12:45 +0200, ext Jan Engelhardt wrote:
> On Tuesday 2010-07-20 11:50, Luciano Coelho wrote:
> > struct xt_condition_mtinfo {
> >-	char name[31];
> >+	char name[XT_CONDITION_MAX_NAME_SIZE];
> > 	__u8 invert;
> > 
> > 	/* Used internally by the kernel */
> > 	void *condvar __attribute__((aligned(8)));
> > };
> > 
> >+struct condition_tg_info {
> 
> In the line of standardized naming, xt_condition_tginfo.

Ack.


> >+	char name[XT_CONDITION_MAX_NAME_SIZE];
> >+	__u8 enabled;
> 
> No u32 yet?

Yes, I decided to make this in different steps.  I'll be submitting a
new patch with the u32 (and the binary operators support) pretty soon.


> >+static struct xt_target condition_tg_reg __read_mostly = {
> >+       .name           = "CONDITION",
> >+       .family         = NFPROTO_UNSPEC,
> >+       .target         = condition_tg_target,
> >+       .targetsize     = sizeof(struct condition_tg_info),
> >+       .checkentry     = condition_tg_checkentry,
> >+       .destroy        = condition_tg_destroy,
> >+       .me             = THIS_MODULE,
> >+};
> >+
> > static struct xt_match condition_mt_reg __read_mostly = {
> >        .name       = "condition",
> >        .revision   = 1,
> 
> (I see that you just sent a diff from the previous submission. That
> in itself is ok.) Since xt_condition is a new module to upstream but
> already exists in Xtables-addons, it makes sense to use a
> .revision number of 2 for the initial Linux kernel submission,
> also because the struct contents are different from those currently
> in Xt-a.

Yes, I made this patch on top of the one you have sent earlier for
upstream inclusion.  There were some comments from Patrick to that one
and, as I said in my email yesterday, I'll rebase the target patches
once the original one is included upstream.

Do you want me to take a look at Patrick's comments and resubmit the
patch you've sent with the changes Patrick asked for?

I'll change the revision to 2 as well.


> From an overall quick glance, looks good!

Thanks for your review and help on this!

-- 
Cheers,
Luca.


^ permalink raw reply

* [PATCH net-next] sysfs: add entry to indicate network interfaces with random MAC address
From: Stefan Assmann @ 2010-07-20 10:50 UTC (permalink / raw)
  To: netdev
  Cc: Linux Kernel Mailing List, davem, Andy Gospodarek,
	Rose, Gregory V, Duyck, Alexander H, Ben Hutchings, Casey Leedom,
	Harald Hoyer

From: Stefan Assmann <sassmann@redhat.com>

Reserve a bit in struct net_device to indicate whether an interface
generates its MAC address randomly, and expose the information via
sysfs.
May look like this:
/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/net/eth0/ifrndmac

By default the value of ifrndmac is 0. Any driver that generates the MAC
address randomly should return a value to 1.

This simplifies the handling of network devices with random MAC addresses
as user-space may just query sysfs to find out if the MAC is real or fake.
Udev may check sysfs for devices that generate their MAC address
randomly and create persistent net rules by using the unique device path
if the value returned is 1.

Also introducing a helper function to assist driver adoption.

Signed-off-by: Stefan Assmann <sassmann@redhat.com>
---
 include/linux/etherdevice.h |   14 ++++++++++++++
 include/linux/netdevice.h   |    1 +
 net/core/net-sysfs.c        |   12 ++++++++++++
 3 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h
index 3d7a668..ebb34ac 100644
--- a/include/linux/etherdevice.h
+++ b/include/linux/etherdevice.h
@@ -127,6 +127,20 @@ static inline void random_ether_addr(u8 *addr)
 }

 /**
+ * dev_hw_addr_random - Create random MAC and set device flag
+ * @dev: pointer to net_device structure
+ * @addr: Pointer to a six-byte array containing the Ethernet address
+ *
+ * Generate random MAC to be used by a device and set NETIF_F_RNDMAC flag
+ * so the state can be read by sysfs and be used by udev.
+ */
+static inline void dev_hw_addr_random(struct net_device *dev, u8 *hwaddr)
+{
+	dev->features |= NETIF_F_RNDMAC;
+	random_ether_addr(hwaddr);
+}
+
+/**
  * compare_ether_addr - Compare two Ethernet addresses
  * @addr1: Pointer to a six-byte array containing the Ethernet address
  * @addr2: Pointer other six-byte array containing the Ethernet address
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index b626289..2ea0298 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -845,6 +845,7 @@ struct net_device {
 #define NETIF_F_FCOE_MTU	(1 << 26) /* Supports max FCoE MTU, 2158 bytes*/
 #define NETIF_F_NTUPLE		(1 << 27) /* N-tuple filters supported */
 #define NETIF_F_RXHASH		(1 << 28) /* Receive hashing offload */
+#define NETIF_F_RNDMAC		(1 << 29) /* Interface with random MAC address */

 	/* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT	16
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index d2b5965..91a9808 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -188,6 +188,17 @@ static ssize_t show_dormant(struct device *dev,
 	return -EINVAL;
 }

+static ssize_t show_ifrndmac(struct device *dev,
+			     struct device_attribute *attr, char *buf)
+{
+	struct net_device *net = to_net_dev(dev);
+
+	if (net->features & NETIF_F_RNDMAC)
+		return sprintf(buf, fmt_dec, 1);
+	else
+		return sprintf(buf, fmt_dec, 0);
+}
+
 static const char *const operstates[] = {
 	"unknown",
 	"notpresent", /* currently unused */
@@ -300,6 +311,7 @@ static struct device_attribute net_class_attributes[] = {
 	__ATTR(ifalias, S_IRUGO | S_IWUSR, show_ifalias, store_ifalias),
 	__ATTR(iflink, S_IRUGO, show_iflink, NULL),
 	__ATTR(ifindex, S_IRUGO, show_ifindex, NULL),
+	__ATTR(ifrndmac, S_IRUGO, show_ifrndmac, NULL),
 	__ATTR(features, S_IRUGO, show_features, NULL),
 	__ATTR(type, S_IRUGO, show_type, NULL),
 	__ATTR(link_mode, S_IRUGO, show_link_mode, NULL),
-- 
1.6.5.2


^ permalink raw reply related

* Re: [RFC v2] netfilter: xt_condition: add condition target
From: Jan Engelhardt @ 2010-07-20 10:45 UTC (permalink / raw)
  To: Luciano Coelho; +Cc: netfilter-devel, netdev, kaber, sameo
In-Reply-To: <1279619434-11849-1-git-send-email-luciano.coelho@nokia.com>


On Tuesday 2010-07-20 11:50, Luciano Coelho wrote:
> struct xt_condition_mtinfo {
>-	char name[31];
>+	char name[XT_CONDITION_MAX_NAME_SIZE];
> 	__u8 invert;
> 
> 	/* Used internally by the kernel */
> 	void *condvar __attribute__((aligned(8)));
> };
> 
>+struct condition_tg_info {

In the line of standardized naming, xt_condition_tginfo.

>+	char name[XT_CONDITION_MAX_NAME_SIZE];
>+	__u8 enabled;

No u32 yet?

>+static struct xt_target condition_tg_reg __read_mostly = {
>+       .name           = "CONDITION",
>+       .family         = NFPROTO_UNSPEC,
>+       .target         = condition_tg_target,
>+       .targetsize     = sizeof(struct condition_tg_info),
>+       .checkentry     = condition_tg_checkentry,
>+       .destroy        = condition_tg_destroy,
>+       .me             = THIS_MODULE,
>+};
>+
> static struct xt_match condition_mt_reg __read_mostly = {
>        .name       = "condition",
>        .revision   = 1,

(I see that you just sent a diff from the previous submission. That
in itself is ok.) Since xt_condition is a new module to upstream but
already exists in Xtables-addons, it makes sense to use a
.revision number of 2 for the initial Linux kernel submission,
also because the struct contents are different from those currently
in Xt-a.

>From an overall quick glance, looks good!

^ permalink raw reply

* Re: recv(2), MSG_TRUNK and kernels older than 2.6.22
From: Roy Marples @ 2010-07-20 10:04 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1279620171.2498.30.camel@edumazet-laptop>

On 20/07/2010 11:02, Eric Dumazet wrote:
> Le mardi 20 juillet 2010 à 11:24 +0200, Eric Dumazet a écrit :
>> Le mardi 20 juillet 2010 à 10:08 +0100, Roy Marples a écrit :
>>> On 20/07/2010 09:54, Eric Dumazet wrote:
>>>> Is it for the dhcpcd problem we talk about few week ago, disturbed by
>>>> new 64bit stats ?
>>>
>>> Yes
>>>
>>>>
>>>> Why do you want to have a fixed size of 256 bytes ?
>>>>
>>>> Using 8192 bytes on stack would avoid MSG_TRUNK mess.
>>>
>>> Yes it would, but that doesn't answer my question :)
>>
>> Your question might be wrong ? :=)
>>
>>> I would like to use a buffer big enough, but not a whole 8k in size.
>>> dhcpcd has quite a small runtime and I'd like to keep it that way.
>>
>> 8192 bytes on stack is too much for you ?
>>
>> Then you should automatically resize your buffer, and not using
>> MSG_TRUNK at all (there is no guarantee the information you need will be
>> part of the truncated part)
>>
>>
>
> On<  2.6.22 kernels, recv() returns the length of your buffer, not size
> of netlink frame.
>
> You'll need something like :
>
> size_t sz = 256;
> char *buf = malloc(sz);
> while (1) {
> 	if (!buf) error();
> 	len = recv(fd, buf, sz, MSG_PEEK | MSG_TRUNC);
> 	if (len<  sz)
> 		break;
> 	if (len == sz)
> 		sz *= 2; // old kernel, try to double size
> 	else
> 		sz = len; // recent kernel is nice with us
> 	buf = realloc(buf, sz);
> }
> len = recv(fd, buf, sz, 0);

Thankyou

If buf is NULL and sz is 0, would 0 still be returned? I'm guessing so.

Thanks

Roy

^ permalink raw reply

* Re: recv(2), MSG_TRUNK and kernels older than 2.6.22
From: Eric Dumazet @ 2010-07-20 10:02 UTC (permalink / raw)
  To: Roy Marples; +Cc: netdev
In-Reply-To: <1279617873.2498.13.camel@edumazet-laptop>

Le mardi 20 juillet 2010 à 11:24 +0200, Eric Dumazet a écrit :
> Le mardi 20 juillet 2010 à 10:08 +0100, Roy Marples a écrit :
> > On 20/07/2010 09:54, Eric Dumazet wrote:
> > > Is it for the dhcpcd problem we talk about few week ago, disturbed by
> > > new 64bit stats ?
> > 
> > Yes
> > 
> > >
> > > Why do you want to have a fixed size of 256 bytes ?
> > >
> > > Using 8192 bytes on stack would avoid MSG_TRUNK mess.
> > 
> > Yes it would, but that doesn't answer my question :)
> 
> Your question might be wrong ? :=)
> 
> > I would like to use a buffer big enough, but not a whole 8k in size.
> > dhcpcd has quite a small runtime and I'd like to keep it that way.
> 
> 8192 bytes on stack is too much for you ?
> 
> Then you should automatically resize your buffer, and not using
> MSG_TRUNK at all (there is no guarantee the information you need will be
> part of the truncated part)
> 
> 

On < 2.6.22 kernels, recv() returns the length of your buffer, not size
of netlink frame.

You'll need something like :

size_t sz = 256;
char *buf = malloc(sz);
while (1) {
	if (!buf) error();
	len = recv(fd, buf, sz, MSG_PEEK | MSG_TRUNC);
	if (len < sz)
		break;
	if (len == sz)
		sz *= 2; // old kernel, try to double size
	else
		sz = len; // recent kernel is nice with us
	buf = realloc(buf, sz);
}
len = recv(fd, buf, sz, 0);





^ permalink raw reply

* [PATCH] __dst_free(): put EXPORT_SYMBOLS after the fct
From: Nicolas Dichtel @ 2010-07-20  9:51 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 1144 bytes --]

Hi,

this patch is just cosmetic: EXPORT_SYMBOLS(__dst_free) has been put after 
___dst_free() function instead of __dst_free().

Regards,

-- 
Nicolas DICHTEL
6WIND
R&D Engineer

Tel: +33 1 39 30 92 10
Fax: +33 1 39 30 92 11
nicolas.dichtel@6wind.com
www.6wind.com
Join the Multicore Packet Processing Forum: www.multicorepacketprocessing.com

Ce courriel ainsi que toutes les pièces jointes, est uniquement destiné à son ou 
ses destinataires. Il contient des informations confidentielles qui sont la 
propriété de 6WIND. Toute révélation, distribution ou copie des informations 
qu'il contient est strictement interdite. Si vous avez reçu ce message par 
erreur, veuillez immédiatement le signaler à l'émetteur et détruire toutes les 
données reçues.

This e-mail message, including any attachments, is for the sole use of the 
intended recipient(s) and contains information that is confidential and 
proprietary to 6WIND. All unauthorized review, use, disclosure or distribution 
is prohibited. If you are not the intended recipient, please contact the sender 
by reply e-mail and destroy all copies of the original message.

[-- Attachment #2: 0002-__dst_free-put-EXPORT_SYMBOLS-after-the-fct.patch --]
[-- Type: text/x-diff, Size: 893 bytes --]

>From f96851b4d7e6125d8c0e5f4da7b6fffffd21b642 Mon Sep 17 00:00:00 2001
From: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Date: Tue, 20 Jul 2010 11:38:20 +0200
Subject: [PATCH] __dst_free(): put EXPORT_SYMBOLS after the fct

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 net/core/dst.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/core/dst.c b/net/core/dst.c
index 9920722..6c41b1f 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -197,7 +197,6 @@ static void ___dst_free(struct dst_entry *dst)
 		dst->input = dst->output = dst_discard;
 	dst->obsolete = 2;
 }
-EXPORT_SYMBOL(__dst_free);

 void __dst_free(struct dst_entry *dst)
 {
@@ -213,6 +212,7 @@ void __dst_free(struct dst_entry *dst)
 	}
 	spin_unlock_bh(&dst_garbage.lock);
 }
+EXPORT_SYMBOL(__dst_free);

 struct dst_entry *dst_destroy(struct dst_entry * dst)
 {
-- 
1.5.4.5

^ permalink raw reply related

* [RFC v2] netfilter: xt_condition: add condition target
From: Luciano Coelho @ 2010-07-20  9:50 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, kaber, jengelh, sameo

This patch implements a condition target to the xt_condition module,
which let's the user set netfilter rules that enable/disable the
variables used by the condition match.  Originally, the condition
match only allowed the variable to be changed via procfs.  This new
target makes it easy to enable or disable the condition depending on
the rules set.

Signed-off-by: Luciano Coelho <luciano.coelho@nokia.com>
---
 include/linux/netfilter/xt_condition.h |   12 ++-
 net/netfilter/Kconfig                  |   19 ++--
 net/netfilter/Makefile                 |    2 +-
 net/netfilter/xt_condition.c           |  179 +++++++++++++++++++++++---------
 4 files changed, 153 insertions(+), 59 deletions(-)

diff --git a/include/linux/netfilter/xt_condition.h b/include/linux/netfilter/xt_condition.h
index 4faf3ca..c9e72c2 100644
--- a/include/linux/netfilter/xt_condition.h
+++ b/include/linux/netfilter/xt_condition.h
@@ -3,12 +3,22 @@
 
 #include <linux/types.h>
 
+#define XT_CONDITION_MAX_NAME_SIZE 31
+
 struct xt_condition_mtinfo {
-	char name[31];
+	char name[XT_CONDITION_MAX_NAME_SIZE];
 	__u8 invert;
 
 	/* Used internally by the kernel */
 	void *condvar __attribute__((aligned(8)));
 };
 
+struct condition_tg_info {
+	char name[XT_CONDITION_MAX_NAME_SIZE];
+	__u8 enabled;
+
+	/* Used internally by the kernel */
+	void *condvar __attribute__((aligned(8)));
+};
+
 #endif /* _XT_CONDITION_H */
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index e54e216..adaa3b4 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -310,6 +310,17 @@ config NETFILTER_XT_MARK
 	"Use netfilter MARK value as routing key") and can also be used by
 	other subsystems to change their behavior.
 
+config NETFILTER_XT_CONDITION
+	tristate '"condition" match and target support'
+	depends on NETFILTER_ADVANCED
+	depends on PROC_FS
+	---help---
+	This option adds the "CONDITION" target and "condition" match.
+
+	It allows you to match rules against condition variables
+	stored in the /proc/net/nf_condition directory. It also allows
+	you to set the variables using the target.
+
 config NETFILTER_XT_CONNMARK
 	tristate 'ctmark target and match support'
 	depends on NF_CONNTRACK
@@ -621,14 +632,6 @@ config NETFILTER_XT_MATCH_COMMENT
 	  If you want to compile it as a module, say M here and read
 	  <file:Documentation/kbuild/modules.txt>.  If unsure, say `N'.
 
-config NETFILTER_XT_MATCH_CONDITION
-	tristate '"condition" match support'
-	depends on NETFILTER_ADVANCED
-	depends on PROC_FS
-	---help---
-	This option allows you to match firewall rules against condition
-	variables stored in the /proc/net/nf_condition directory.
-
 config NETFILTER_XT_MATCH_CONNBYTES
 	tristate  '"connbytes" per-connection counter match support'
 	depends on NF_CONNTRACK
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index 474dd06..ee34f6c 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -43,6 +43,7 @@ obj-$(CONFIG_NETFILTER_XTABLES) += x_tables.o xt_tcpudp.o
 # combos
 obj-$(CONFIG_NETFILTER_XT_MARK) += xt_mark.o
 obj-$(CONFIG_NETFILTER_XT_CONNMARK) += xt_connmark.o
+obj-$(CONFIG_NETFILTER_XT_CONDITION) += xt_condition.o
 
 # targets
 obj-$(CONFIG_NETFILTER_XT_TARGET_CLASSIFY) += xt_CLASSIFY.o
@@ -66,7 +67,6 @@ obj-$(CONFIG_NETFILTER_XT_TARGET_IDLETIMER) += xt_IDLETIMER.o
 # matches
 obj-$(CONFIG_NETFILTER_XT_MATCH_CLUSTER) += xt_cluster.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_COMMENT) += xt_comment.o
-obj-$(CONFIG_NETFILTER_XT_MATCH_CONDITION) += xt_condition.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_CONNBYTES) += xt_connbytes.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_CONNLIMIT) += xt_connlimit.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_CONNTRACK) += xt_conntrack.o
diff --git a/net/netfilter/xt_condition.c b/net/netfilter/xt_condition.c
index 9af0257..dbd762c 100644
--- a/net/netfilter/xt_condition.c
+++ b/net/netfilter/xt_condition.c
@@ -2,11 +2,13 @@
  *	"condition" match extension for Xtables
  *
  *	Description: This module allows firewall rules to match using
- *	condition variables available through procfs.
+ *	condition variables available through procfs.  It also allows
+ *	target rules to set the condition variable.
  *
  *	Authors:
  *	Stephane Ouellette <ouellettes@videotron.ca>, 2002-10-22
  *	Massimiliano Hofer <max@nucleus.it>, 2006-05-15
+ *	Luciano Coelho <luciano.coelho@nokia.com>, 2010-07-20
  *
  *	This program is free software; you can redistribute it and/or modify it
  *	under the terms of the GNU General Public License; either version 2
@@ -32,7 +34,8 @@ static unsigned int condition_gid_perms;
 MODULE_AUTHOR("Stephane Ouellette <ouellettes@videotron.ca>");
 MODULE_AUTHOR("Massimiliano Hofer <max@nucleus.it>");
 MODULE_AUTHOR("Jan Engelhardt <jengelh@medozas.de>");
-MODULE_DESCRIPTION("Allows rules to match against condition variables");
+MODULE_AUTHOR("Luciano Coelho <luciano.coelho@nokia.com>");
+MODULE_DESCRIPTION("Allows rules to set and match condition variables");
 MODULE_LICENSE("GPL");
 module_param(condition_list_perms, uint, S_IRUSR | S_IWUSR);
 MODULE_PARM_DESC(condition_list_perms, "default permissions on /proc/net/nf_condition/* files");
@@ -91,56 +94,34 @@ static int condition_proc_write(struct file *file, const char __user *buffer,
 	return length;
 }
 
-static bool
-condition_mt(const struct sk_buff *skb, struct xt_action_param *par)
+static struct condition_variable *xt_condition_insert(const char *name)
 {
-	const struct xt_condition_mtinfo *info = par->matchinfo;
-	const struct condition_variable *var   = info->condvar;
-
-	return var->enabled ^ info->invert;
-}
-
-static int condition_mt_check(const struct xt_mtchk_param *par)
-{
-	struct xt_condition_mtinfo *info = par->matchinfo;
 	struct condition_variable *var;
 
-	/* Forbid certain names */
-	if (*info->name == '\0' || *info->name == '.' ||
-	    info->name[sizeof(info->name)-1] != '\0' ||
-	    memchr(info->name, '/', sizeof(info->name)) != NULL) {
-		pr_info("name not allowed or too long: \"%.*s\"\n",
-			(unsigned int)sizeof(info->name), info->name);
-		return -EINVAL;
-	}
 	/*
 	 * Let's acquire the lock, check for the condition and add it
 	 * or increase the reference counter.
 	 */
 	mutex_lock(&proc_lock);
 	list_for_each_entry(var, &conditions_list, list) {
-		if (strcmp(info->name, var->status_proc->name) == 0) {
+		if (strcmp(name, var->status_proc->name) == 0) {
 			++var->refcount;
-			mutex_unlock(&proc_lock);
-			info->condvar = var;
-			return 0;
+			goto out;
 		}
 	}
 
 	/* At this point, we need to allocate a new condition variable. */
 	var = kmalloc(sizeof(struct condition_variable), GFP_KERNEL);
-	if (var == NULL) {
-		mutex_unlock(&proc_lock);
-		return -ENOMEM;
-	}
+	if (var == NULL)
+		goto out;
 
 	/* Create the condition variable's proc file entry. */
-	var->status_proc = create_proc_entry(info->name, condition_list_perms,
+	var->status_proc = create_proc_entry(name, condition_list_perms,
 			   proc_net_condition);
 	if (var->status_proc == NULL) {
 		kfree(var);
-		mutex_unlock(&proc_lock);
-		return -ENOMEM;
+		var = NULL;
+		goto out;
 	}
 
 	var->refcount = 1;
@@ -151,16 +132,13 @@ static int condition_mt_check(const struct xt_mtchk_param *par)
 	var->status_proc->uid        = condition_uid_perms;
 	var->status_proc->gid        = condition_gid_perms;
 	list_add(&var->list, &conditions_list);
+out:
 	mutex_unlock(&proc_lock);
-	info->condvar = var;
-	return 0;
+	return var;
 }
 
-static void condition_mt_destroy(const struct xt_mtdtor_param *par)
+static void xt_condition_put(struct condition_variable *var)
 {
-	const struct xt_condition_mtinfo *info = par->matchinfo;
-	struct condition_variable *var = info->condvar;
-
 	mutex_lock(&proc_lock);
 	if (--var->refcount == 0) {
 		list_del(&var->list);
@@ -172,6 +150,101 @@ static void condition_mt_destroy(const struct xt_mtdtor_param *par)
 	mutex_unlock(&proc_lock);
 }
 
+static bool
+condition_mt(const struct sk_buff *skb, struct xt_action_param *par)
+{
+	const struct xt_condition_mtinfo *info = par->matchinfo;
+	const struct condition_variable *var   = info->condvar;
+
+	return var->enabled ^ info->invert;
+}
+
+static int condition_mt_check(const struct xt_mtchk_param *par)
+{
+	struct xt_condition_mtinfo *info = par->matchinfo;
+	struct condition_variable *var;
+
+	/* Forbid certain names */
+	if (*info->name == '\0' || *info->name == '.' ||
+	    info->name[sizeof(info->name)-1] != '\0' ||
+	    memchr(info->name, '/', sizeof(info->name)) != NULL) {
+		pr_info("name not allowed or too long: \"%.*s\"\n",
+			(unsigned int)sizeof(info->name), info->name);
+		return -EINVAL;
+	}
+
+	var = xt_condition_insert(info->name);
+	if (var == NULL)
+		return -ENOMEM;
+
+	info->condvar = var;
+	return 0;
+}
+
+static void condition_mt_destroy(const struct xt_mtdtor_param *par)
+{
+	const struct xt_condition_mtinfo *info = par->matchinfo;
+
+	xt_condition_put(info->condvar);
+}
+
+static unsigned int condition_tg_target(struct sk_buff *skb,
+					 const struct xt_action_param *par)
+{
+	const struct condition_tg_info *info = par->targinfo;
+	struct condition_variable *var = info->condvar;
+
+	pr_debug("setting condition %s, enabled %d\n",
+		 info->name, info->enabled);
+
+	var->enabled = info->enabled;
+
+	return XT_CONTINUE;
+}
+
+static int condition_tg_checkentry(const struct xt_tgchk_param *par)
+{
+	struct condition_tg_info *info = par->targinfo;
+	struct condition_variable *var;
+
+	pr_debug("checkentry %s\n", info->name);
+
+	/* Forbid certain names */
+	if (*info->name == '\0' || *info->name == '.' ||
+	    info->name[sizeof(info->name)-1] != '\0' ||
+	    memchr(info->name, '/', sizeof(info->name)) != NULL) {
+		pr_info("name not allowed or too long: \"%.*s\"\n",
+			(unsigned int)sizeof(info->name), info->name);
+		return -EINVAL;
+	}
+
+	var = xt_condition_insert(info->name);
+	if (var == NULL)
+		return -ENOMEM;
+
+	info->condvar = var;
+	return 0;
+}
+
+static void condition_tg_destroy(const struct xt_tgdtor_param *par)
+{
+	const struct condition_tg_info *info = par->targinfo;
+
+	pr_debug("destroy %s\n", info->name);
+
+	xt_condition_put(info->condvar);
+}
+
+static struct xt_target condition_tg_reg __read_mostly = {
+	.name		= "CONDITION",
+	.family		= NFPROTO_UNSPEC,
+	.target		= condition_tg_target,
+	.targetsize     = sizeof(struct condition_tg_info),
+	.checkentry	= condition_tg_checkentry,
+	.destroy        = condition_tg_destroy,
+	.me		= THIS_MODULE,
+};
+
 static struct xt_match condition_mt_reg __read_mostly = {
 	.name       = "condition",
 	.revision   = 1,
@@ -185,24 +258,24 @@ static struct xt_match condition_mt_reg __read_mostly = {
 
 static const char *const dir_name = "nf_condition";
 
-static int __net_init condnet_mt_init(struct net *net)
+static int __net_init condnet_init(struct net *net)
 {
 	proc_net_condition = proc_mkdir(dir_name, net->proc_net);
 
 	return (proc_net_condition == NULL) ? -EACCES : 0;
 }
 
-static void __net_exit condnet_mt_exit(struct net *net)
+static void __net_exit condnet_exit(struct net *net)
 {
 	remove_proc_entry(dir_name, net->proc_net);
 }
 
-static struct pernet_operations condition_mt_netops = {
-	.init = condnet_mt_init,
-	.exit = condnet_mt_exit,
+static struct pernet_operations condition_netops = {
+	.init = condnet_init,
+	.exit = condnet_exit,
 };
 
-static int __init condition_mt_init(void)
+static int __init condition_init(void)
 {
 	int ret;
 
@@ -211,8 +284,15 @@ static int __init condition_mt_init(void)
 	if (ret < 0)
 		return ret;
 
-	ret = register_pernet_subsys(&condition_mt_netops);
+	ret =  xt_register_target(&condition_tg_reg);
+	if (ret < 0) {
+		xt_unregister_match(&condition_mt_reg);
+		return ret;
+	}
+
+	ret = register_pernet_subsys(&condition_netops);
 	if (ret < 0) {
+		xt_unregister_target(&condition_tg_reg);
 		xt_unregister_match(&condition_mt_reg);
 		return ret;
 	}
@@ -220,11 +300,12 @@ static int __init condition_mt_init(void)
 	return 0;
 }
 
-static void __exit condition_mt_exit(void)
+static void __exit condition_exit(void)
 {
-	unregister_pernet_subsys(&condition_mt_netops);
+	unregister_pernet_subsys(&condition_netops);
+	xt_unregister_target(&condition_tg_reg);
 	xt_unregister_match(&condition_mt_reg);
 }
 
-module_init(condition_mt_init);
-module_exit(condition_mt_exit);
+module_init(condition_init);
+module_exit(condition_exit);
-- 
1.7.0.4


^ permalink raw reply related

* [RFC PATCH] dst: check if dst is freed in dst_check()
From: Nicolas Dichtel @ 2010-07-20  9:49 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 1216 bytes --]

Hi,

I probably missed something, but I cannot find where obsolete field is checked 
when dst_check() is called. If dst->obsolete is > 1, dst cannot be used!

Attached is a proposal to fix this issue.

Regards,

-- 
Nicolas DICHTEL
6WIND
R&D Engineer

Tel: +33 1 39 30 92 10
Fax: +33 1 39 30 92 11
nicolas.dichtel@6wind.com
www.6wind.com
Join the Multicore Packet Processing Forum: www.multicorepacketprocessing.com

Ce courriel ainsi que toutes les pièces jointes, est uniquement destiné à son ou 
ses destinataires. Il contient des informations confidentielles qui sont la 
propriété de 6WIND. Toute révélation, distribution ou copie des informations 
qu'il contient est strictement interdite. Si vous avez reçu ce message par 
erreur, veuillez immédiatement le signaler à l'émetteur et détruire toutes les 
données reçues.

This e-mail message, including any attachments, is for the sole use of the 
intended recipient(s) and contains information that is confidential and 
proprietary to 6WIND. All unauthorized review, use, disclosure or distribution 
is prohibited. If you are not the intended recipient, please contact the sender 
by reply e-mail and destroy all copies of the original message.

[-- Attachment #2: 0001-dst-check-if-dst-is-freed-in-dst_check.patch --]
[-- Type: text/x-diff, Size: 772 bytes --]

>From 69990a516f4b5b48608b0ea283dfac6f1fa110b3 Mon Sep 17 00:00:00 2001
From: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Date: Tue, 20 Jul 2010 11:35:53 +0200
Subject: [PATCH] dst: check if dst is freed in dst_check()

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 include/net/dst.h |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/include/net/dst.h b/include/net/dst.h
index 81d1413..7bf4f9a 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -319,6 +319,8 @@ static inline int dst_input(struct sk_buff *skb)

 static inline struct dst_entry *dst_check(struct dst_entry *dst, u32 cookie)
 {
+	if (dst->obsolete > 1)
+		return NULL;
 	if (dst->obsolete)
 		dst = dst->ops->check(dst, cookie);
 	return dst;
-- 
1.5.4.5

^ permalink raw reply related

* Re: recv(2), MSG_TRUNK and kernels older than 2.6.22
From: Eric Dumazet @ 2010-07-20  9:24 UTC (permalink / raw)
  To: Roy Marples; +Cc: netdev
In-Reply-To: <4C45678A.4080408@marples.name>

Le mardi 20 juillet 2010 à 10:08 +0100, Roy Marples a écrit :
> On 20/07/2010 09:54, Eric Dumazet wrote:
> > Is it for the dhcpcd problem we talk about few week ago, disturbed by
> > new 64bit stats ?
> 
> Yes
> 
> >
> > Why do you want to have a fixed size of 256 bytes ?
> >
> > Using 8192 bytes on stack would avoid MSG_TRUNK mess.
> 
> Yes it would, but that doesn't answer my question :)

Your question might be wrong ? :=)

> I would like to use a buffer big enough, but not a whole 8k in size.
> dhcpcd has quite a small runtime and I'd like to keep it that way.

8192 bytes on stack is too much for you ?

Then you should automatically resize your buffer, and not using
MSG_TRUNK at all (there is no guarantee the information you need will be
part of the truncated part)




^ permalink raw reply

* Re: recv(2), MSG_TRUNK and kernels older than 2.6.22
From: Roy Marples @ 2010-07-20  9:08 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1279616099.2498.9.camel@edumazet-laptop>

On 20/07/2010 09:54, Eric Dumazet wrote:
> Is it for the dhcpcd problem we talk about few week ago, disturbed by
> new 64bit stats ?

Yes

>
> Why do you want to have a fixed size of 256 bytes ?
>
> Using 8192 bytes on stack would avoid MSG_TRUNK mess.

Yes it would, but that doesn't answer my question :)
I would like to use a buffer big enough, but not a whole 8k in size.
dhcpcd has quite a small runtime and I'd like to keep it that way.

Thanks

Roy

^ permalink raw reply

* Re: Badness with the kernel version 2.6.35-rc1-git1 running on P6 box
From: divya @ 2010-07-20  9:05 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: LKML, linuxppc-dev, sachinp, benh, netdev, David Miller,
	Jan-Bernd Themann
In-Reply-To: <1279274185.2549.14.camel@edumazet-laptop>

On Friday 16 July 2010 03:26 PM, Eric Dumazet wrote:
> Le vendredi 16 juillet 2010 à 14:20 +0530, divya a écrit :
>    
>> Hi ,
>>
>> With the latest kernel version 2.6.35-rc5-git1(2f7989efd4398) running on power(p6) box came across the following
>> call trace
>>
>> Call Trace:
>> [c000000006a0e800] [c000000000011c30] .show_stack+0x6c/0x16c (unreliable)
>> [c000000006a0e8b0] [c00000000012129c] .__alloc_pages_nodemask+0x6a0/0x75c
>> [c000000006a0ea30] [c0000000001527cc] .alloc_pages_current+0xc4/0x104
>> [c000000006a0ead0] [c00000000015b1a0] .new_slab+0xe0/0x314
>> [c000000006a0eb70] [c00000000015b6fc] .__slab_alloc+0x328/0x644
>> [c000000006a0ec50] [c00000000015cc34] .__kmalloc_node_track_caller+0x114/0x194
>> [c000000006a0ed00] [c000000000599f6c] .__alloc_skb+0x94/0x180
>> [c000000006a0edb0] [c00000000059af5c] .__netdev_alloc_skb+0x3c/0x74
>> [c000000006a0ee30] [c0000000004f9480] .ehea_refill_rq_def+0xf8/0x2d0
>> [c000000006a0ef30] [c0000000004fab8c] .ehea_up+0x5b8/0x69c
>> [c000000006a0f040] [c0000000004facd4] .ehea_open+0x64/0x118
>> [c000000006a0f0e0] [c0000000005a6e9c] .__dev_open+0x100/0x168
>> [c000000006a0f170] [c0000000005a3ac0] .__dev_change_flags+0x10c/0x1ac
>> [c000000006a0f210] [c0000000005a6d44] .dev_change_flags+0x24/0x7c
>> [c000000006a0f2a0] [c0000000005b50b4] .do_setlink+0x31c/0x750
>> [c000000006a0f3b0] [c0000000005b6724] .rtnl_newlink+0x388/0x618
>> [c000000006a0f5f0] [c0000000005b6350] .rtnetlink_rcv_msg+0x268/0x2b4
>> [c000000006a0f6a0] [c0000000005cfdc0] .netlink_rcv_skb+0x74/0x108
>> [c000000006a0f730] [c0000000005b60c4] .rtnetlink_rcv+0x38/0x5c
>> [c000000006a0f7c0] [c0000000005cf8c8] .netlink_unicast+0x318/0x3f4
>> [c000000006a0f890] [c0000000005d05b4] .netlink_sendmsg+0x2d0/0x310
>> [c000000006a0f970] [c00000000058e1e8] .sock_sendmsg+0xd4/0x110
>> [c000000006a0fb50] [c00000000058e514] .SyS_sendmsg+0x1f4/0x288
>> [c000000006a0fd70] [c00000000058c2b8] .SyS_socketcall+0x214/0x280
>> [c000000006a0fe30] [c0000000000085b4] syscall_exit+0x0/0x40
>> Mem-Info:
>> Node 0 DMA per-cpu:
>> CPU    0: hi:    0, btch:   1 usd:   0
>> CPU    1: hi:    0, btch:   1 usd:   0
>> CPU    2: hi:    0, btch:   1 usd:   0
>> CPU    3: hi:    0, btch:   1 usd:   0
>> active_anon:50 inactive_anon:260 isolated_anon:0
>>    active_file:159 inactive_file:139 isolated_file:0
>>    unevictable:0 dirty:2 writeback:1 unstable:0
>>    free:16 slab_reclaimable:66 slab_unreclaimable:502
>>    mapped:120 shmem:2 pagetables:37 bounce:0
>> Node 0 DMA free:1024kB min:1408kB low:1728kB high:2112kB active_anon:3200kB inactive_anon:16640kB active_file:10176kB inactive_file:8896kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:130944kB mlocked:0kB dirty:128kB writeback:64kB mapped:7680kB shmem:128kB slab_reclaimable:4224kB slab_unreclaimable:32128kB kernel_stack:2528kB pagetables:2368kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
>> lowmem_reserve[]: 0 0 0
>> Node 0 DMA: 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB = 0kB
>> 496 total pagecache pages
>> 178 pages in swap cache
>> Swap cache stats: add 780, delete 602, find 467/551
>> Free swap  = 1027904kB
>> Total swap = 1044160kB
>> 2048 pages RAM
>> 683 pages reserved
>> 582 pages shared
>> 1075 pages non-shared
>> SLUB: Unable to allocate memory on node -1 (gfp=0x20)
>>     cache: kmalloc-16384, object size: 16384, buffer size: 16384, default order: 2, min order: 0
>>     node 0: slabs: 28, objs: 292, free: 0
>> ip: page allocation failure. order:0, mode:0x8020
>> Call Trace:
>> [c000000006a0eb40] [c000000000011c30] .show_stack+0x6c/0x16c (unreliable)
>> [c000000006a0ebf0] [c00000000012129c] .__alloc_pages_nodemask+0x6a0/0x75c
>> [c000000006a0ed70] [c0000000001527cc] .alloc_pages_current+0xc4/0x104
>> [c000000006a0ee10] [c00000000011fca4] .__get_free_pages+0x18/0x90
>> [c000000006a0ee90] [c0000000004f7058] .ehea_get_stats+0x4c/0x1bc
>> [c000000006a0ef30] [c0000000005a0a04] .dev_get_stats+0x38/0x64
>> [c000000006a0efc0] [c0000000005b456c] .rtnl_fill_ifinfo+0x35c/0x85c
>> [c000000006a0f150] [c0000000005b5920] .rtmsg_ifinfo+0x164/0x204
>> [c000000006a0f210] [c0000000005a6d6c] .dev_change_flags+0x4c/0x7c
>> [c000000006a0f2a0] [c0000000005b50b4] .do_setlink+0x31c/0x750
>> [c000000006a0f3b0] [c0000000005b6724] .rtnl_newlink+0x388/0x618
>> [c000000006a0f5f0] [c0000000005b6350] .rtnetlink_rcv_msg+0x268/0x2b4
>> [c000000006a0f6a0] [c0000000005cfdc0] .netlink_rcv_skb+0x74/0x108
>> [c000000006a0f730] [c0000000005b60c4] .rtnetlink_rcv+0x38/0x5c
>> [c000000006a0f7c0] [c0000000005cf8c8] .netlink_unicast+0x318/0x3f4
>> [c000000006a0f890] [c0000000005d05b4] .netlink_sendmsg+0x2d0/0x310
>> [c000000006a0f970] [c00000000058e1e8] .sock_sendmsg+0xd4/0x110
>> [c000000006a0fb50] [c00000000058e514] .SyS_sendmsg+0x1f4/0x288
>> [c000000006a0fd70] [c00000000058c2b8] .SyS_socketcall+0x214/0x280
>> [c000000006a0fe30] [c0000000000085b4] syscall_exit+0x0/0x40
>> Mem-Info:
>> Node 0 DMA per-cpu:
>> CPU    0: hi:    0, btch:   1 usd:   0
>> CPU    1: hi:    0, btch:   1 usd:   0
>> CPU    2: hi:    0, btch:   1 usd:   0
>> CPU    3: hi:    0, btch:   1 usd:   0
>>
>> The mainline 2.6.35-rc5 worked fine.
>>      
> Maybe you were lucky with 2.6.35-rc5
>
> Anyway ehea should not use GFP_ATOMIC in its ehea_get_stats() method,
> called in process context, but GFP_KERNEL.
>
> Another patch is needed for ehea_refill_rq_def() as well.
>
>
>
> [PATCH] ehea: ehea_get_stats() should use GFP_KERNEL
>
> ehea_get_stats() is called in process context and should use GFP_KERNEL
> allocation instead of GFP_ATOMIC.
>
> Clearing stats at beginning of ehea_get_stats() is racy in case of
> concurrent stat readers.
>
> get_stats() can also use netdev net_device_stats, instead of a private
> copy.
>
> Reported-by: divya<dipraksh@linux.vnet.ibm.com>
> Signed-off-by: Eric Dumazet<eric.dumazet@gmail.com>
> ---
>   drivers/net/ehea/ehea.h      |    1 -
>   drivers/net/ehea/ehea_main.c |    6 ++----
>   2 files changed, 2 insertions(+), 5 deletions(-)
>    
Hi,

The call trace mentioned above still appears on upstream kernel and linux-next tree too.
The mentioned patch hasn't still been merged into upstream yet - hence getting call traces for both ehea_get_stats()
and ehea_refill_rq_def() methods.
However w.r.t to linux-next getting call trace only for ehea_refill_rq_def() method.

Thanks
Divya

^ permalink raw reply

* Re: recv(2), MSG_TRUNK and kernels older than 2.6.22
From: Eric Dumazet @ 2010-07-20  8:54 UTC (permalink / raw)
  To: Roy Marples; +Cc: netdev
In-Reply-To: <4C455DC3.1050304@marples.name>

Le mardi 20 juillet 2010 à 09:26 +0100, Roy Marples a écrit :
> Hi
> 
> I would like to support all possible kernels I can and previously used a 
> fixed buffer of size 256 to read from netlink sockets. This is now 
> proving too small for some 64-bit kernels so I would like to use recv(2) 
> with MSG_TRUNK to wor out the size. However, the man page says that this 
> only works for 2.6.22 kernels or newer.
> 
> My question is, what is the behaviour of recv on older kernels where 
> MSG_TRUNC is not supported? I would rather not use some arbitary size if 
> at all possible.
> 

Is it for the dhcpcd problem we talk about few week ago, disturbed by
new 64bit stats ?

Why do you want to have a fixed size of 256 bytes ?

Using 8192 bytes on stack would avoid MSG_TRUNK mess.

static int
get_netlink(int fd, int flags,
    int (*callback)(struct nlmsghdr *))
{
        char buffer[8192];
        ssize_t bytes;
        struct nlmsghdr *nlm;
        int r = -1;

        for (;;) {
                bytes = recv(fd, buffer, sizeof(buffer), flags);
                if (bytes == -1) {
                        if (errno == EAGAIN) {
                                r = 0;
                                goto eexit;
                        }
                        if (errno == EINTR)
                                continue;
                        goto eexit;
                }





^ permalink raw reply

* Re: oops in tcp_xmit_retransmit_queue() w/ v2.6.32.15
From: Ilpo Järvinen @ 2010-07-20  8:41 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Lennart Schulte, Eric Dumazet, David S. Miller, lkml,
	netdev@vger.kernel.org, Fehrmann, Henning, Carsten Aulbert
In-Reply-To: <4C4467E0.9080907@kernel.org>

[-- Attachment #1: Type: TEXT/PLAIN, Size: 6095 bytes --]

On Mon, 19 Jul 2010, Tejun Heo wrote:

> Hello,
> 
> On 07/16/2010 02:02 PM, Ilpo Järvinen wrote:
> > Besides, Tejun has also found that it's hint->next ptr which is NULL in 
> > his case so this won't solve his case anyway. Tejun, can you confirm 
> > whether it was retransmit_skb_hint->next being NULL on _entry time_ to 
> > tcp_xmit_retransmit_queue() or later on in the loop after the updates done 
> > by the loop itself to the hint (or that your testing didn't conclude 
> > either)?
> 
> Sorry about the delay.  I was traveling last week.  Unfortunately, I
> don't know whether ->next was NULL on entry or not.  I hacked up the
> following ugly patch for the next test run.  It should have everything
> which has come up till now + list and hint sanity checking before
> starting processing them.  I'm planning on deploying it w/ crashdump
> enabled in several days.  If I've missed something, please let me
> know.

One thing that complicates things further is the fact that the write queue 
can change beneath us (ie., in tcp_retrans_try_collapse and tcp_fragment).
I've read them multiple times through and always found them innocent so 
this might be just for-the-record type of a note (at least I hope so).

-- 
 i.

> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index b4ed957..1c8b1e0 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2190,6 +2190,53 @@ static int tcp_can_forward_retransmit(struct sock *sk)
>  	return 1;
>  }
> 
> +static void print_queue(struct sock *sk, struct sk_buff *old, struct sk_buff *hole)
> +{
> +	struct tcp_sock *tp = tcp_sk(sk);
> +	struct sk_buff *skb, *prev;
> +	bool do_panic = false;
> +
> +	skb = tcp_write_queue_head(sk);
> +	prev = (struct sk_buff *)(&sk->sk_write_queue);
> +
> +	if (skb == NULL) {
> +		printk("XXX NULL head, pkts %u\n", tp->packets_out);
> +		do_panic = true;
> +	}
> +
> +	printk("XXX head %p tail %p sendhead %p oldhint %p now %p hole %p high %u\n",
> +	       tcp_write_queue_head(sk), tcp_write_queue_tail(sk),
> +	       tcp_send_head(sk), old, tp->retransmit_skb_hint, hole,
> +	       tp->retransmit_high);
> +
> +	while (skb) {
> +		printk("XXX skb %p (%u-%u) next %p prev %p sacked %u\n",
> +		       skb, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq,
> +		       skb->next, skb->prev, TCP_SKB_CB(skb)->sacked);
> +		if (prev != skb->prev) {
> +			printk("XXX Inconsistent prev\n");
> +			do_panic = true;
> +		}
> +
> +		if (skb == tcp_write_queue_tail(sk)) {
> +			if (skb->next != (struct sk_buff *)(&sk->sk_write_queue)) {
> +				printk("XXX Improper next at tail\n");
> +				do_panic = true;
> +			}
> +			break;
> +		}
> +
> +		prev = skb;
> +		skb = skb->next;
> +	}
> +	if (!skb) {
> +		printk("XXX Encountered unexpected NULL\n");
> +		do_panic = true;
> +	}
> +	if (do_panic)
> +		panic("XXX panicking");
> +}
> +
>  /* This gets called after a retransmit timeout, and the initially
>   * retransmitted data is acknowledged.  It tries to continue
>   * resending the rest of the retransmit queue, until either
> @@ -2198,19 +2245,53 @@ static int tcp_can_forward_retransmit(struct sock *sk)
>   * based retransmit packet might feed us FACK information again.
>   * If so, we use it to avoid unnecessarily retransmissions.
>   */
> +static unsigned int caught_it;
> +
>  void tcp_xmit_retransmit_queue(struct sock *sk)
>  {
>  	const struct inet_connection_sock *icsk = inet_csk(sk);
>  	struct tcp_sock *tp = tcp_sk(sk);
> -	struct sk_buff *skb;
> +	struct sk_buff *skb, *prev;
>  	struct sk_buff *hole = NULL;
> +	struct sk_buff *old = tp->retransmit_skb_hint;
>  	u32 last_lost;
>  	int mib_idx;
>  	int fwd_rexmitting = 0;
> +	bool saw_hint = false;
> +
> +	if (!tp->packets_out) {
> +		if (net_ratelimit())
> +			printk("XXX !tp->packets_out, retransmit_skb_hint=%p, write_queue_head=%p\n",
> +			       tp->retransmit_skb_hint, tcp_write_queue_head(sk));
> +		return;
> +	}
> 
>  	if (!tp->lost_out)
>  		tp->retransmit_high = tp->snd_una;
> 
> +	for (skb = tcp_write_queue_head(sk),
> +	     prev = (struct sk_buff *)&sk->sk_write_queue;
> +	     skb != (struct sk_buff *)&sk->sk_write_queue;
> +	     prev = skb, skb = skb->next) {
> +		if (prev != skb->prev) {
> +			printk("XXX sanity check: prev corrupt\n");
> +			print_queue(sk, old, hole);
> +		}
> +		if (skb == tp->retransmit_skb_hint)
> +			saw_hint = true;
> +		if (skb == tcp_write_queue_tail(sk) &&
> +		    skb->next != (struct sk_buff *)(&sk->sk_write_queue)) {
> +			printk("XXX sanity check: end corrupt\n");
> +			print_queue(sk, old, hole);
> +		}
> +	}
> +	if (tp->retransmit_skb_hint && !saw_hint) {
> +		printk("XXX sanity check: retransmit_skb_hint=%p is not on list, claring hint\n",
> +		       tp->retransmit_skb_hint);
> +		print_queue(sk, old, hole);
> +		tp->retransmit_skb_hint = NULL;
> +	}
> +
>  	if (tp->retransmit_skb_hint) {
>  		skb = tp->retransmit_skb_hint;
>  		last_lost = TCP_SKB_CB(skb)->end_seq;
> @@ -2218,7 +2299,17 @@ void tcp_xmit_retransmit_queue(struct sock *sk)
>  			last_lost = tp->retransmit_high;
>  	} else {
>  		skb = tcp_write_queue_head(sk);
> -		last_lost = tp->snd_una;
> +		if (skb)
> +			last_lost = tp->snd_una;
> +	}
> +
> +checknull:
> +	if (skb == NULL) {
> +		print_queue(sk, old, hole);
> +		caught_it++;
> +		if (net_ratelimit())
> +			printk("XXX Errors caught so far %u\n", caught_it);
> +		return;
>  	}
> 
>  	tcp_for_write_queue_from(skb, sk) {
> @@ -2261,7 +2352,7 @@ begin_fwd:
>  		} else if (!(sacked & TCPCB_LOST)) {
>  			if (hole == NULL && !(sacked & (TCPCB_SACKED_RETRANS|TCPCB_SACKED_ACKED)))
>  				hole = skb;
> -			continue;
> +			goto cont;
> 
>  		} else {
>  			last_lost = TCP_SKB_CB(skb)->end_seq;
> @@ -2272,7 +2363,7 @@ begin_fwd:
>  		}
> 
>  		if (sacked & (TCPCB_SACKED_ACKED|TCPCB_SACKED_RETRANS))
> -			continue;
> +			goto cont;
> 
>  		if (tcp_retransmit_skb(sk, skb))
>  			return;
> @@ -2282,6 +2373,9 @@ begin_fwd:
>  			inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
>  						  inet_csk(sk)->icsk_rto,
>  						  TCP_RTO_MAX);
> +cont:
> +		skb = skb->next;
> +		goto checknull;
>  	}
>  }
> 
> 

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox