Netdev List

* Re: [PATCH net] inet: limit length of fragment queue hash table bucket lists
From: Jesper Dangaard Brouer @ 2013-03-19 14:20 UTC (permalink / raw)
  To: David Miller; +Cc: hannes, netdev, eric.dumazet
In-Reply-To: <20130319.100324.927922515830950770.davem@davemloft.net>

On Tue, 2013-03-19 at 10:03 -0400, David Miller wrote:
> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Date: Fri, 15 Mar 2013 22:32:30 +0100
> 
> > This patch introduces a constant limit of the fragment queue hash
> > table bucket list lengths. Currently the limit 128 is chosen somewhat
> > arbitrarily and just ensures that we can fill up the fragment cache with
> > empty packets up to the default ipfrag_high_thresh limits. It should
> > just protect from list iteration eating considerable amounts of cpu.
> > 
> > If we reach the maximum length in one hash bucket a warning is printed.
> > This is implemented on the caller side of inet_frag_find to distinguish
> > between the different users of inet_fragment.c.
> > 
> > I dropped the out of memory warning in the ipv4 fragment lookup path,
> > because we already get a warning by the slab allocator.
> > 
> > Cc: Eric Dumazet <eric.dumazet@gmail.com>
> > Cc: Jesper Dangaard Brouer <jbrouer@redhat.com>
> > Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> 
> This looks mostly fine to me, Eric could you give it a quick review?
> 
> Although one comment from me:
> 
> > +/* averaged:
> > + * max_depth = default ipfrag_high_thresh / INETFRAGS_HASHSZ /
> > + *	       rounded up (SKB_TRUELEN(0) + sizeof(struct ipq or
> > + *	       struct frag_queue))
> > + */
> > +#define INETFRAGS_MAXDEPTH		128
> 
> If we deem this to be the ideal formula, maybe we can maintain it
> accurately and very cheaply at run time.  We'd do this by adding a
> handler for the ipfrag_high_thresh sysctl, and use that to recalculate
> the maxdepth any time ipfrag_high_thresh is changed by the user.

I think it's overkill to implement this now.  I just want this patch in
as a safeguard.

The idea I discussed with Eric will remove the need for this patch.
The idea is to drop the LRU lists, increase the hash size a bit, and do
cleanup/eviction directly on the frag hash tables.  And e.g. only allow
5 frag queue elements in each hash bucket... but more work and testing
is needed before I have something ready.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

* Re: Re: [BUG][mvebu] mvneta: cannot request irq 25 on openblocks-ax3
From: Masami Hiramatsu @ 2013-03-19 14:19 UTC (permalink / raw)
  To: Ezequiel Garcia
  Cc: linux-arm-kernel, thomas.petazzoni, Jason Cooper, netdev,
	linux-kernel, yrl.pp-manager.tt@hitachi.com, Gregory Clement
In-Reply-To: <20130319133928.GE3137@localhost>

Hi Ezequiel,

(2013/03/19 22:39), Ezequiel Garcia wrote:
> Hi Masami,
> 
> On Tue, Mar 19, 2013 at 10:12:37PM +0900, Masami Hiramatsu wrote:
>>
>> Here I've hit a bug on the recent kernel. As far as I know, this bug
>> exists on 3.9-rc1 too.
>>
>> When I tried the latest mvebu for-next tree
>> (git://git.infradead.org/users/jcooper/linux.git mvebu/for-next),
>> I got the warning below at boot time and mvneta didn't work (link was never up).
>> I confirmed that "ifconfig ethX up" always triggers it.
>>
>> Has anyone booted openblocks-ax3 successfully recently, or hit the same
>> trouble?
> 
> This is a known bug. Gregory Clement already has a fix and he
> will submit it soon. In case you need this fixed ASAP, I'm attaching
> you a patch with a fix.

Thanks! I'll try that.

> Please note the attached patch is not ready for mainline inclusion,
> as I said Gregory will submit a cleaner version soon.

Yeah, I look forward to it :)

Thank you,

-- 
Masami HIRAMATSU
IT Management Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com

* Re: [PATCH v3 net-next 1/5] GRE: Refactor GRE tunneling code.
From: David Miller @ 2013-03-19 14:19 UTC (permalink / raw)
  To: pshelar; +Cc: netdev, jesse
In-Reply-To: <1363680615-1751-1-git-send-email-pshelar@nicira.com>

From: Pravin B Shelar <pshelar@nicira.com>
Date: Tue, 19 Mar 2013 01:10:15 -0700

> @@ -0,0 +1,169 @@
> +#ifndef __NET_IP_TUNNELS_H
> +#define __NET_IP_TUNNELS_H 1
> +
> +#include <linux/if_tunnel.h>
> +#include <net/dsfield.h>
> +#include <net/gro_cells.h>
> +#include <net/inet_ecn.h>
> +#include <net/ip.h>
> +
> +/* Keep error state on tunnel for 30 sec */
> +#define IPTUNNEL_ERR_TIMEO	(30*HZ)
> +
> +/* 6rd prefix/relay information */
> +struct ip_tunnel_6rd_parm {
> +	struct in6_addr		prefix;

Please include <linux/in6.h> explicitly rather than getting the in6_addr
definition indirectly via net/dsfield.h.

Please audit the data types and interfaces used in the rest of this
header file and make sure all the proper includes are being explicitly
made.

Finally, if you haven't already, please do some sanity builds with
different combinations of IPV6 being on and off.

Thanks.

* Re: Re: [BUG][mvebu] mvneta: cannot request irq 25 on openblocks-ax3
From: Masami Hiramatsu @ 2013-03-19 14:19 UTC (permalink / raw)
  To: Jason Cooper
  Cc: linux-arm-kernel, thomas.petazzoni, netdev, linux-kernel,
	yrl.pp-manager.tt@hitachi.com
In-Reply-To: <20130319133324.GS13280@titan.lakedaemon.net>

Hi Jason,

(2013/03/19 22:33), Jason Cooper wrote:
> On Tue, Mar 19, 2013 at 10:12:37PM +0900, Masami Hiramatsu wrote:
>> Hi,
>>
>> Here I've hit a bug on the recent kernel. As far as I know, this bug
>> exists on 3.9-rc1 too.
>>
>> When I tried the latest mvebu for-next tree
>> (git://git.infradead.org/users/jcooper/linux.git mvebu/for-next),
> 
> FYI: that branch isn't stable, it's used as a merge-test of
> arm-soc/for-next (also not stable) and any branches I am trying to push
> upstream that day.

Thanks! Could you tell me which branch is stable?
(However, I'd like to try new fixes/features on my device too :))

> Gregory has a patch in the works for this.  Hopefully he'll submit
> it by the end of the week.

Thank you,

-- 
Masami HIRAMATSU
IT Management Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com

* Re: [PATCH net] inet: limit length of fragment queue hash table bucket lists
From: Eric Dumazet @ 2013-03-19 14:15 UTC (permalink / raw)
  To: David Miller; +Cc: hannes, netdev, jbrouer
In-Reply-To: <20130319.100324.927922515830950770.davem@davemloft.net>

On Tue, 2013-03-19 at 10:03 -0400, David Miller wrote:
> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Date: Fri, 15 Mar 2013 22:32:30 +0100
> 
> > This patch introduces a constant limit of the fragment queue hash
> > table bucket list lengths. Currently the limit 128 is chosen somewhat
> > arbitrarily and just ensures that we can fill up the fragment cache with
> > empty packets up to the default ipfrag_high_thresh limits. It should
> > just protect from list iteration eating considerable amounts of cpu.
> > 
> > If we reach the maximum length in one hash bucket a warning is printed.
> > This is implemented on the caller side of inet_frag_find to distinguish
> > between the different users of inet_fragment.c.
> > 
> > I dropped the out of memory warning in the ipv4 fragment lookup path,
> > because we already get a warning by the slab allocator.
> > 
> > Cc: Eric Dumazet <eric.dumazet@gmail.com>
> > Cc: Jesper Dangaard Brouer <jbrouer@redhat.com>
> > Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> 
> This looks mostly fine to me, Eric could you give it a quick review?
> 

Sure, it looks OK to me.

Acked-by: Eric Dumazet <edumazet@google.com>

> Although one comment from me:
> 
> > +/* averaged:
> > + * max_depth = default ipfrag_high_thresh / INETFRAGS_HASHSZ /
> > + *	       rounded up (SKB_TRUELEN(0) + sizeof(struct ipq or
> > + *	       struct frag_queue))
> > + */
> > +#define INETFRAGS_MAXDEPTH		128
> 
> If we deem this to be the ideal formula, maybe we can maintain it
> accurately and very cheaply at run time.  We'd do this by adding a
> handler for the ipfrag_high_thresh sysctl, and use that to recalculate
> the maxdepth any time ipfrag_high_thresh is changed by the user.

This can probably be done in a second patch for net-next

* Re: [patch 0/3] s390: network bug fixes for net
From: David Miller @ 2013-03-19 14:10 UTC (permalink / raw)
  To: frank.blaschka; +Cc: netdev, linux-s390
In-Reply-To: <20130319060441.956271217@de.ibm.com>

From: frank.blaschka@de.ibm.com
Date: Tue, 19 Mar 2013 07:04:41 +0100

> here are 3 bug fixes for net.

All applied, thanks Frank.

* Re: [PATCH net] inet: limit length of fragment queue hash table bucket lists
From: Hannes Frederic Sowa @ 2013-03-19 14:08 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, eric.dumazet, jbrouer
In-Reply-To: <20130319.100324.927922515830950770.davem@davemloft.net>

On Tue, Mar 19, 2013 at 10:03:24AM -0400, David Miller wrote:
> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Date: Fri, 15 Mar 2013 22:32:30 +0100
> 
> > This patch introduces a constant limit of the fragment queue hash
> > table bucket list lengths. Currently the limit 128 is chosen somewhat
> > arbitrarily and just ensures that we can fill up the fragment cache with
> > empty packets up to the default ipfrag_high_thresh limits. It should
> > just protect from list iteration eating considerable amounts of cpu.
> > 
> > If we reach the maximum length in one hash bucket a warning is printed.
> > This is implemented on the caller side of inet_frag_find to distinguish
> > between the different users of inet_fragment.c.
> > 
> > I dropped the out of memory warning in the ipv4 fragment lookup path,
> > because we already get a warning by the slab allocator.
> > 
> > Cc: Eric Dumazet <eric.dumazet@gmail.com>
> > Cc: Jesper Dangaard Brouer <jbrouer@redhat.com>
> > Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> 
> This looks mostly fine to me, Eric could you give it a quick review?
> 
> Although one comment from me:
> 
> > +/* averaged:
> > + * max_depth = default ipfrag_high_thresh / INETFRAGS_HASHSZ /
> > + *	       rounded up (SKB_TRUELEN(0) + sizeof(struct ipq or
> > + *	       struct frag_queue))
> > + */
> > +#define INETFRAGS_MAXDEPTH		128
> 
> If we deem this to be the ideal formula, maybe we can maintain it
> accurately and very cheaply at run time.  We'd do this by adding a
> handler for the ipfrag_high_thresh sysctl, and use that to recalculate
> the maxdepth any time ipfrag_high_thresh is changed by the user.

I already did this, have a look at patch
<20130313012715.GE14801@order.stressinduktion.org>, here:
<http://patchwork.ozlabs.org/patch/227136/>

Other comments regarding this patch are in the thread:
"ipv6: use stronger hash for reassembly queue hash table"

Thanks,

  Hannes

* Re: [PATCH 0/4] Misc MRF24J40 Fixes
From: David Miller @ 2013-03-19 14:08 UTC (permalink / raw)
  To: alan
  Cc: linux-zigbee-devel, netdev, alex.bluesman.smirnov, dbaryshkov,
	linux-kernel
In-Reply-To: <1363644403-11003-1-git-send-email-alan@signal11.us>

From: Alan Ott <alan@signal11.us>
Date: Mon, 18 Mar 2013 18:06:39 -0400

> These are fairly straight-forward.

All applied to net-next, thanks.

* Re: [PATCH net] inet: limit length of fragment queue hash table bucket lists
From: David Miller @ 2013-03-19 14:03 UTC (permalink / raw)
  To: hannes; +Cc: netdev, eric.dumazet, jbrouer
In-Reply-To: <20130315213230.GB24041@order.stressinduktion.org>

From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Fri, 15 Mar 2013 22:32:30 +0100

> This patch introduces a constant limit of the fragment queue hash
> table bucket list lengths. Currently the limit 128 is chosen somewhat
> arbitrarily and just ensures that we can fill up the fragment cache with
> empty packets up to the default ipfrag_high_thresh limits. It should
> just protect from list iteration eating considerable amounts of cpu.
> 
> If we reach the maximum length in one hash bucket a warning is printed.
> This is implemented on the caller side of inet_frag_find to distinguish
> between the different users of inet_fragment.c.
> 
> I dropped the out of memory warning in the ipv4 fragment lookup path,
> because we already get a warning by the slab allocator.
> 
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: Jesper Dangaard Brouer <jbrouer@redhat.com>
> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>

This looks mostly fine to me, Eric could you give it a quick review?

Although one comment from me:

> +/* averaged:
> + * max_depth = default ipfrag_high_thresh / INETFRAGS_HASHSZ /
> + *	       rounded up (SKB_TRUELEN(0) + sizeof(struct ipq or
> + *	       struct frag_queue))
> + */
> +#define INETFRAGS_MAXDEPTH		128

If we deem this to be the ideal formula, maybe we can maintain it
accurately and very cheaply at run time.  We'd do this by adding a
handler for the ipfrag_high_thresh sysctl, and use that to recalculate
the maxdepth any time ipfrag_high_thresh is changed by the user.
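
As a rough illustration of that recalculation, here is a user-space sketch
in C. It is not code from the patch: the helper name frag_max_depth() and
its parameters are hypothetical, and per_queue_cost stands in for the
rounded-up SKB_TRUELEN(0) + sizeof(struct ipq or struct frag_queue) term
in the comment above.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from the patch): derive a per-bucket depth
 * limit from the memory threshold, the number of hash buckets and the
 * assumed memory cost of one empty fragment queue.
 */
static unsigned int frag_max_depth(size_t high_thresh, size_t hash_buckets,
				   size_t per_queue_cost)
{
	size_t depth;

	assert(hash_buckets > 0 && per_queue_cost > 0);

	/* Memory budget per bucket, divided by the cost of one queue. */
	depth = (high_thresh / hash_buckets) / per_queue_cost;

	/* Never return 0, or every lookup would hit the limit. */
	return depth ? (unsigned int)depth : 1;
}
```

With assumed inputs of a 4 MB threshold, 64 buckets and 512 bytes per
empty queue, this yields 128. A sysctl handler would only need to rerun
the division whenever the threshold changes.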

* Re: [PATCH net-next 1/4] flow_keys: include thoff into flow_keys for later usage
From: Daniel Borkmann @ 2013-03-19 14:01 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, davem, jasowang
In-Reply-To: <1363701339.2558.1.camel@edumazet-glaptop>

On 03/19/2013 02:55 PM, Eric Dumazet wrote:
> On Tue, 2013-03-19 at 14:28 +0100, Daniel Borkmann wrote:
>> In skb_flow_dissect(), we perform a dissection of a skbuff. Since we're
>> doing the work here anyway, also store thoff for a later usage, e.g. in
>> the BPF filter. We need to reorder choke_skb_cb a bit, though. Also, by
>> having thoff 16 Bit, we do not need to pack flow_keys.
>>
>> Suggested-by: Eric Dumazet <edumazet@google.com>
>> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
>> ---
>>   include/net/flow_keys.h   | 1 +
>>   net/core/flow_dissector.c | 5 ++++-
>>   net/sched/sch_choke.c     | 4 ++--
>>   3 files changed, 7 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/net/flow_keys.h b/include/net/flow_keys.h
>> index 80461c1..bb8271d 100644
>> --- a/include/net/flow_keys.h
>> +++ b/include/net/flow_keys.h
>> @@ -9,6 +9,7 @@ struct flow_keys {
>>   		__be32 ports;
>>   		__be16 port16[2];
>>   	};
>> +	u16 thoff;
>>   	u8 ip_proto;
>>   };
>>
>> diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
>> index f8d9e03..eb9dde1 100644
>> --- a/net/core/flow_dissector.c
>> +++ b/net/core/flow_dissector.c
>> @@ -23,7 +23,8 @@ static void iph_to_flow_copy_addrs(struct flow_keys *flow, const struct iphdr *i
>>
>>   bool skb_flow_dissect(const struct sk_buff *skb, struct flow_keys *flow)
>>   {
>> -	int poff, nhoff = skb_network_offset(skb);
>> +	int poff;
>> +	u16 nhoff = skb_network_offset(skb);
>>   	u8 ip_proto;
>>   	__be16 proto = skb->protocol;
>>
>> @@ -151,6 +152,8 @@ ipv6:
>>   			flow->ports = *ports;
>>   	}
>>
>> +	flow->thoff = nhoff;
>> +
>>   	return true;
>>   }
>>   EXPORT_SYMBOL(skb_flow_dissect);
>> diff --git a/net/sched/sch_choke.c b/net/sched/sch_choke.c
>> index cc37dd5..9854c2b 100644
>> --- a/net/sched/sch_choke.c
>> +++ b/net/sched/sch_choke.c
>> @@ -141,9 +141,9 @@ static void choke_drop_by_idx(struct Qdisc *sch, unsigned int idx)
>>   }
>>
>>   struct choke_skb_cb {
>> -	u16			classid;
>> -	u8			keys_valid;
>>   	struct flow_keys	keys;
>> +	u8			keys_valid;
>> +	u16			classid;
>>   };
>>
>>   static inline struct choke_skb_cb *choke_skb_cb(const struct sk_buff *skb)
>
> Hmm, we don't need to change choke_skb_cb, as the sizeof(keys) didn't
> change
>
> (it's aligned on 4 bytes)
>
> The first version of the patch needed this because we were using a u32
> field for thoff

You're right. I will send a version 2 of this set in a while.

* Re: [patch] RxRPC: use copy_to_user() instead of memcpy()
From: David Howells @ 2013-03-19 13:59 UTC (permalink / raw)
  To: David Miller, dan.carpenter; +Cc: dhowells, netdev, kernel-janitors
In-Reply-To: <20130319.094240.1315516663563952557.davem@davemloft.net>

David Miller <davem@davemloft.net> wrote:

> >  		/* copy the peer address and timestamp */
> >  		if (!continue_call) {
> > -			if (msg->msg_name && msg->msg_namelen > 0)
> > -				memcpy(msg->msg_name,
> > -				       &call->conn->trans->peer->srx,
> > -				       sizeof(call->conn->trans->peer->srx));
> 
> I bet the size is too large for a sockaddr_storage, and therefore we
> spam the kernel stack.  So I can only guess that changing this to a
> copy_to_user() fixes the hang because it simply faults on the kernel
> destination address.

Maybe, though I don't see how that would just fix the hang rather than
oopsing.  If Dan can printk the following:

	msg->msg_namelen
	sizeof(call->conn->trans->peer->srx)

before doing the memcpy, that could be handy.

David

* Re: [PATCH can-next] can: dump stack on protocol bugs
From: David Miller @ 2013-03-19 13:55 UTC (permalink / raw)
  To: socketcan; +Cc: mkl, netdev
In-Reply-To: <51475446.9000604@hartkopp.net>

From: Oliver Hartkopp <socketcan@hartkopp.net>
Date: Mon, 18 Mar 2013 18:52:06 +0100

> The rework of the kernel hlist implementation "hlist: drop the node parameter
> from iterators" (b67bfe0d42cac56c512dd5da4b1b347a23f4b70a) created some
> fallout in the form of non matching comments and obsolete code.
> 
> Additionally to the cleanup this patch adds a WARN() statement to catch the
> caller of the wrong filter removal request.
> 
> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>

Applied, thanks.

* Re: question about klen in move_addr_to_user()
From: David Miller @ 2013-03-19 13:55 UTC (permalink / raw)
  To: dan.carpenter; +Cc: netdev
In-Reply-To: <20130318101007.GO9189@mwanda>

From: Dan Carpenter <dan.carpenter@oracle.com>
Date: Mon, 18 Mar 2013 13:10:07 +0300

> The call tree is this:
> 
> __sys_recvmsg() gets the msg->msg_namelen from the user.
> 
> Normally the network protocols set msg->msg_namelen in their
> ->recvmsg() function but some don't like caif_seqpkt_recvmsg() and
> recv_msg() for tipc.

In fact, even TCP will just leave the msg->msg_namelen alone.

I think the best thing to do is to cap the klen to the size of
sockaddr_storage in verify_iovec() when mode is not VERIFY_READ.

But actually, it looks like sendmsg() has a similar problem.
We use m->msg_namelen as-is in verify_iovec() via __sys_sendmsg()
when mode is VERIFY_READ.

This makes me think that we should cap this at the precise moment
we import the user's msghdr.  Which means:

1) Create a helper function copy_msghdr_from_user() and use
   it everywhere we do the straight copy_from_user(msg_sys, ...)

2) In both copy_msghdr_from_user() and get_compat_msghdr(), cap
   the msg_namelen to sizeof(struct sockaddr_storage).

That should eliminate any and all problems in this area.
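
A minimal user-space sketch of the clamp in step 2 follows. The helper
name clamp_msg_namelen() is hypothetical; the real change would sit
inside copy_msghdr_from_user() and get_compat_msghdr() and operate on
the kernel's copy of the msghdr.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: cap a caller-supplied name length at the size of
 * sockaddr_storage, so later copies into such a buffer cannot overrun it.
 */
static void clamp_msg_namelen(struct msghdr *msg)
{
	assert(msg != NULL);

	if (msg->msg_namelen > sizeof(struct sockaddr_storage))
		msg->msg_namelen = sizeof(struct sockaddr_storage);
}
```

Doing this once at import time means no protocol's ->recvmsg() or
->sendmsg() path ever sees an oversized length, whether or not it
rewrites msg_namelen itself.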

Thanks Dan.

* Re: [PATCH net-next 1/4] flow_keys: include thoff into flow_keys for later usage
From: Eric Dumazet @ 2013-03-19 13:55 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: netdev, davem, jasowang
In-Reply-To: <b80164aa55e82d0c3190fde317aed9bfc09d3fcb.1363699285.git.dborkman@redhat.com>

On Tue, 2013-03-19 at 14:28 +0100, Daniel Borkmann wrote:
> In skb_flow_dissect(), we perform a dissection of a skbuff. Since we're
> doing the work here anyway, also store thoff for a later usage, e.g. in
> the BPF filter. We need to reorder choke_skb_cb a bit, though. Also, by
> having thoff 16 Bit, we do not need to pack flow_keys.
> 
> Suggested-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
> ---
>  include/net/flow_keys.h   | 1 +
>  net/core/flow_dissector.c | 5 ++++-
>  net/sched/sch_choke.c     | 4 ++--
>  3 files changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/include/net/flow_keys.h b/include/net/flow_keys.h
> index 80461c1..bb8271d 100644
> --- a/include/net/flow_keys.h
> +++ b/include/net/flow_keys.h
> @@ -9,6 +9,7 @@ struct flow_keys {
>  		__be32 ports;
>  		__be16 port16[2];
>  	};
> +	u16 thoff;
>  	u8 ip_proto;
>  };
>  
> diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
> index f8d9e03..eb9dde1 100644
> --- a/net/core/flow_dissector.c
> +++ b/net/core/flow_dissector.c
> @@ -23,7 +23,8 @@ static void iph_to_flow_copy_addrs(struct flow_keys *flow, const struct iphdr *i
>  
>  bool skb_flow_dissect(const struct sk_buff *skb, struct flow_keys *flow)
>  {
> -	int poff, nhoff = skb_network_offset(skb);
> +	int poff;
> +	u16 nhoff = skb_network_offset(skb);
>  	u8 ip_proto;
>  	__be16 proto = skb->protocol;
>  
> @@ -151,6 +152,8 @@ ipv6:
>  			flow->ports = *ports;
>  	}
>  
> +	flow->thoff = nhoff;
> +
>  	return true;
>  }
>  EXPORT_SYMBOL(skb_flow_dissect);
> diff --git a/net/sched/sch_choke.c b/net/sched/sch_choke.c
> index cc37dd5..9854c2b 100644
> --- a/net/sched/sch_choke.c
> +++ b/net/sched/sch_choke.c
> @@ -141,9 +141,9 @@ static void choke_drop_by_idx(struct Qdisc *sch, unsigned int idx)
>  }
>  
>  struct choke_skb_cb {
> -	u16			classid;
> -	u8			keys_valid;
>  	struct flow_keys	keys;
> +	u8			keys_valid;
> +	u16			classid;
>  };
>  
>  static inline struct choke_skb_cb *choke_skb_cb(const struct sk_buff *skb)

Hmm, we don't need to change choke_skb_cb, as the sizeof(keys) didn't
change

(it's aligned on 4 bytes)

The first version of the patch needed this because we were using a u32
field for thoff
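
The size argument can be checked with a small user-space sketch. The
struct names below are hypothetical, fixed-width types replace the
kernel's __be32/__be16/u16/u8, and the two leading address fields are an
assumption, since the hunk above only shows the tail of struct flow_keys.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout before the patch: 13 bytes of fields, padded to 16. */
struct flow_keys_old {
	uint32_t src;
	uint32_t dst;
	union {
		uint32_t ports;
		uint16_t port16[2];
	};
	uint8_t ip_proto;
};

/* Same layout with a 16-bit thoff: it lands in what used to be padding. */
struct flow_keys_new {
	uint32_t src;
	uint32_t dst;
	union {
		uint32_t ports;
		uint16_t port16[2];
	};
	uint16_t thoff;
	uint8_t ip_proto;
};

/* Compile-time check that adding thoff does not grow the struct. */
static_assert(sizeof(struct flow_keys_old) == sizeof(struct flow_keys_new),
	      "a u16 thoff must fit into the existing padding");
```

On ABIs where uint32_t is 4-byte aligned, both structs come to 16 bytes,
so anything embedding the keys keeps its size. A 32-bit thoff would have
pushed the struct to 20 bytes, which is why the first version had to
touch choke_skb_cb.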

* Re: [PATCH net-next 2/2] net: reset transport header if it was not set before transmission
From: Daniel Borkmann @ 2013-03-19 13:52 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jason Wang, David Miller, netdev, linux-kernel, mst, edumazet
In-Reply-To: <1363697998.21184.54.camel@edumazet-glaptop>

On 03/19/2013 01:59 PM, Eric Dumazet wrote:
> On Tue, 2013-03-19 at 13:58 +0100, Daniel Borkmann wrote:
>
>> Yes, will post them in a couple of minutes.
>
> Please target net tree for the first patch (adding thoff into struct
> flow_keys), so that Jason or I can fix DODGY providers.

Sorry, I received this too late. The patch set is already out, but we
can put a note into the ``[PATCH net-next 1/4] flow_keys: include thoff
into flow_keys for later'' thread to let Dave know.

* RE: [PATCH net-next 2/2] bnx2x: add RSS capability for GRE traffic
From: Dmitry Kravkov @ 2013-03-19 13:43 UTC (permalink / raw)
  To: Eric Dumazet, Maciej Żenczykowski
  Cc: davem@davemloft.net, netdev@vger.kernel.org, Eilon Greenstein,
	Tom Herbert
In-Reply-To: <1363695958.21184.42.camel@edumazet-glaptop>

> -----Original Message-----
> From: Eric Dumazet [mailto:eric.dumazet@gmail.com]
> Sent: Tuesday, March 19, 2013 2:26 PM
> To: Maciej Żenczykowski
> Cc: Dmitry Kravkov; davem@davemloft.net; netdev@vger.kernel.org; Eilon Greenstein; Tom Herbert
> Subject: Re: [PATCH net-next 2/2] bnx2x: add RSS capability for GRE traffic
> 
> Please Maciej do not top post on lkml or netdev mailing lists.
> 
> On Tue, 2013-03-19 at 02:18 -0700, Maciej Żenczykowski wrote:
> > Can the HW calculate and return a 1s complement sum of the entire
> > packet (or a large portion there-of)?
> > Fixing that up to be only of the outer IPv4, inner IPv4 and inner TCP
> > relevant portions should still be simpler (well faster) than
> > calculating the TCP checksum.
> > I'm pretty sure that some relationship between 1s complement sum of
> > all bytes, outer IPv4 checksum, inner IPv4 checksum and TCP checksum
> > could be pulled out of a hat with some deeper thought.  (similarly for
> > IPv4/GRE/IPv6/TCP and other combinations)
> >
> > What portions of the packet can the HW/FW [partially] checksum - and
> > return the value to the driver for further processing?
> > Can it return 1s complement sum of data portion of outer IPv4 (ie. in
> > IPv4/GRE/IPv4/TCP return a 1s complement sum of GRE/IPv4/TCP bytes)
> >
> 
> I assume Dmitry was speaking of this possibility, and our stack should
> handle this just fine.

In the tunneling case, bnx2x HW cannot provide a csum for any portion of the packet.
The XSUM_NO_VALIDATION flag on the cqe will be set for all GRE packets.
As a result, the driver will leave:
skb->ip_summed = CHECKSUM_NONE;

> 
> NICs providing this kind of checksum set:
> 
> skb->ip_summed = CHECKSUM_COMPLETE;
> skb->csum = csum;
> 
> before feeding the packet to the stack.
> 
> When we pull some header, we have to call skb_postpull_rcsum()
> to adjust the skb->csum so that final check can be done.
> 
> About 20 drivers currently provide this kind of checksumming.
> 
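
For reference, the arithmetic behind adjusting a whole-packet sum after a
header pull can be sketched in user-space C. This is an illustration of
ones' complement subtraction only, not the kernel's skb_postpull_rcsum()
implementation; all function names below are hypothetical and the buffer
lengths are assumed to be even.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit ones' complement sum. */
static uint16_t csum_fold16(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/*
 * Ones' complement sum of an even-length buffer, read as big-endian
 * 16-bit words. The 32-bit accumulator is enough for packet-sized input.
 */
static uint16_t ones_sum(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	assert(len % 2 == 0);
	for (i = 0; i < len; i += 2)
		sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
	return csum_fold16(sum);
}

/* Remove a pulled header's contribution: subtract by adding ~pulled. */
static uint16_t ones_sub(uint16_t total, uint16_t pulled)
{
	return csum_fold16((uint32_t)total + (uint16_t)~pulled);
}
```

Given the sum over a whole packet and the sum over the bytes just pulled,
ones_sub() returns the sum over what remains, so the device-provided
value stays usable after each header is stripped.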


* Re: [patch] RxRPC: use copy_to_user() instead of memcpy()
From: David Miller @ 2013-03-19 13:42 UTC (permalink / raw)
  To: dan.carpenter; +Cc: dhowells, netdev, kernel-janitors
In-Reply-To: <20130318105503.GA17102@longonot.mountain>

From: Dan Carpenter <dan.carpenter@oracle.com>
Date: Mon, 18 Mar 2013 13:55:03 +0300

> This is a user pointer.  Changing the memcpy() to copy_to_user()
> fixes a hang on my system.
> 
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
> ---
> I'm not very familiar with this code, so please review this
> carefully.

It really should be a kernel pointer, not a user pointer.

For example, look at how recvfrom() cooks up a recvmsg method
call:

	struct sockaddr_storage address;
 ...
	msg.msg_name = (struct sockaddr *)&address;
	msg.msg_namelen = sizeof(address);
 ...
	err = sock_recvmsg(sock, &msg, size, flags);

recvmsg() proper works similarly, copying the user msghdr into a
kernel space copy via verify_iovec() or verify_compat_iovec() (and
I can understand how it's not obvious that this is the function
that performs this operation).

>  		/* copy the peer address and timestamp */
>  		if (!continue_call) {
> -			if (msg->msg_name && msg->msg_namelen > 0)
> -				memcpy(msg->msg_name,
> -				       &call->conn->trans->peer->srx,
> -				       sizeof(call->conn->trans->peer->srx));

I bet the size is too large for a sockaddr_storage, and therefore we
spam the kernel stack.  So I can only guess that changing this to a
copy_to_user() fixes the hang because it simply faults on the kernel
destination address.

->srx should be a "struct sockaddr_rxrpc" but that doesn't appear to
exceed the 128-byte size of sockaddr_storage.

* Re: [Xen-devel] [PATCH 2/4] xen-netfront: drop skb when skb->len > 65535
From: David Vrabel @ 2013-03-19 13:40 UTC (permalink / raw)
  To: Wei Liu
  Cc: Ian Campbell, netdev@vger.kernel.org, xen-devel@lists.xen.org,
	annie.li@oracle.com, konrad.wilk@oracle.com
In-Reply-To: <1363616388.29093.201.camel@zion.uk.xensource.com>

On 18/03/13 14:19, Wei Liu wrote:
> On Mon, 2013-03-18 at 14:00 +0000, David Vrabel wrote:
>> On 18/03/13 13:48, Ian Campbell wrote:
>>> On Mon, 2013-03-18 at 13:46 +0000, David Vrabel wrote:
>>>> On 18/03/13 10:35, Wei Liu wrote:
>>>>> The `size' field of Xen network wire format is uint16_t, anything bigger than
>>>>> 65535 will cause overflow.
>>>>
>>>> The backend needs to be able to handle these bad packets without
>>>> disconnecting the VIF -- we can't fix all the frontend drivers.
>>>
>>> Agreed, although that doesn't imply that we shouldn't fix the frontend
>>> where we can -- such as upstream as Wei does here.
>>
>> Yes, frontends should be fixed where possible.
>>
>> This is what I came up with for the backend.  I don't have time to look
>> into it further but, Wei, feel free to use it as a starting point.
>>
> 
> Thanks for this patch.
> 
> I haven't gone through XSA-39 discussion, this is why I didn't come up
> with a fix for backend -- I need to make sure dropping packet like this
> won't re-exhibit the security hole.

How are these overlarge packets generated?  How do you reproduce the issue?

David

* Re: [BUG][mvebu] mvneta: cannot request irq 25 on openblocks-ax3
From: Ezequiel Garcia @ 2013-03-19 13:39 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: linux-arm-kernel, thomas.petazzoni, Jason Cooper, netdev,
	linux-kernel, yrl.pp-manager.tt@hitachi.com, Gregory Clement
In-Reply-To: <51486445.8040506@hitachi.com>

[-- Attachment #1: Type: text/plain, Size: 901 bytes --]

Hi Masami,

On Tue, Mar 19, 2013 at 10:12:37PM +0900, Masami Hiramatsu wrote:
> 
> Here I've hit a bug on the recent kernel. As far as I know, this bug
> exists on 3.9-rc1 too.
> 
> When I tried the latest mvebu for-next tree
> (git://git.infradead.org/users/jcooper/linux.git mvebu/for-next),
> I got the warning below at boot time and mvneta didn't work (link was never up).
> I confirmed that "ifconfig ethX up" always triggers it.
> 
> Has anyone booted openblocks-ax3 successfully recently, or hit the same
> trouble?

This is a known bug. Gregory Clement already has a fix and he
will submit it soon. In case you need this fixed ASAP, I'm attaching
you a patch with a fix.

Please note the attached patch is not ready for mainline inclusion,
as I said Gregory will submit a cleaner version soon.

-- 
Ezequiel García, Free Electrons
Embedded Linux, Kernel and Android Engineering
http://free-electrons.com

[-- Attachment #2: 0001-net-mvneta-convert-to-percpu-interrupt.patch --]
[-- Type: text/plain, Size: 1799 bytes --]

From 03080b4e459b103b97b658789658f118053de522 Mon Sep 17 00:00:00 2001
From: Gregory CLEMENT <gregory.clement@free-electrons.com>
Date: Sat, 9 Feb 2013 22:07:54 +0100
Subject: [PATCH] net: mvneta: convert to percpu interrupt

Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
---
 drivers/net/ethernet/marvell/mvneta.c |   12 +++++++++---
 1 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index b6025c3..7f63dd4 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -1800,7 +1800,7 @@ static void mvneta_set_rx_mode(struct net_device *dev)
 /* Interrupt handling - the callback for request_irq() */
 static irqreturn_t mvneta_isr(int irq, void *dev_id)
 {
-	struct mvneta_port *pp = (struct mvneta_port *)dev_id;
+	struct mvneta_port *pp = *(struct mvneta_port **)dev_id;
 
 	/* Mask all interrupts */
 	mvreg_write(pp, MVNETA_INTR_NEW_MASK, 0);
@@ -2368,6 +2368,7 @@ static void mvneta_mdio_remove(struct mvneta_port *pp)
 	phy_disconnect(pp->phy_dev);
 	pp->phy_dev = NULL;
 }
+static struct mvneta_port __percpu **percpu_pp;
 
 static int mvneta_open(struct net_device *dev)
 {
@@ -2386,9 +2387,14 @@ static int mvneta_open(struct net_device *dev)
 	if (ret)
 		goto err_cleanup_rxqs;
 
+	percpu_pp = alloc_percpu(struct mvneta_port *);
+	*__this_cpu_ptr(percpu_pp) = pp;
+
 	/* Connect to port interrupt line */
-	ret = request_irq(pp->dev->irq, mvneta_isr, 0,
-			  MVNETA_DRIVER_NAME, pp);
+	ret = request_percpu_irq(pp->dev->irq, mvneta_isr,
+			  MVNETA_DRIVER_NAME, percpu_pp);
+	enable_percpu_irq(pp->dev->irq, 0);
+
 	if (ret) {
 		netdev_err(pp->dev, "cannot request irq %d\n", pp->dev->irq);
 		goto err_cleanup_txqs;
-- 
1.7.8.6


* Re: [BUG][mvebu] mvneta: cannot request irq 25 on openblocks-ax3
From: Jason Cooper @ 2013-03-19 13:33 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: linux-arm-kernel, thomas.petazzoni, netdev, linux-kernel,
	yrl.pp-manager.tt@hitachi.com
In-Reply-To: <51486445.8040506@hitachi.com>

On Tue, Mar 19, 2013 at 10:12:37PM +0900, Masami Hiramatsu wrote:
> Hi,
> 
> Here I've hit a bug on the recent kernel. As far as I know, this bug
> exists on 3.9-rc1 too.
> 
> When I tried the latest mvebu for-next tree
> (git://git.infradead.org/users/jcooper/linux.git mvebu/for-next),

FYI: that branch isn't stable, it's used as a merge-test of
arm-soc/for-next (also not stable) and any branches I am trying to push
upstream that day.

Gregory has a patch in the works for this.  Hopefully he'll submit
it by the end of the week.

thx,

Jason.


* Re: [PATCH v2 1/4] tcp: fix too short FIN_WAIT2 time out
From: David Miller @ 2013-03-19 13:32 UTC (permalink / raw)
  To: makita.toshiaki; +Cc: eric.dumazet, netdev
In-Reply-To: <1363610344.7121.5.camel@ubuntu-vm-makita>

From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Date: Mon, 18 Mar 2013 21:39:04 +0900

>  		if (tp->linger2 >= 0) {
> -			const int tmo = tcp_fin_time(sk) - TCP_TIMEWAIT_LEN;
> -
> -			if (tmo > 0) {
> -				tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
> -				goto out;
> -			}
> +			tcp_time_wait(sk, TCP_FIN_WAIT2, TCP_TIMEWAIT_LEN);
> +			goto out;
>  		}

Well, now you're completely ignoring the user's linger setting.

I really can't take these patches seriously, and will not apply them,
sorry.


* [PATCH net-next 4/4] filter: add minimal BPF JIT emitted image disassembler
From: Daniel Borkmann @ 2013-03-19 13:28 UTC (permalink / raw)
  To: netdev; +Cc: davem, eric.dumazet, jasowang, Eric Dumazet
In-Reply-To: <cover.1363699285.git.dborkman@redhat.com>

This is a minimal stand-alone user space helper that allows for debugging or
verification of emitted BPF JIT images. This is particularly useful for
debugging emitted opcodes, since minor bugs in the JIT compiler can be fatal.
The disassembler is architecture generic and uses libopcodes and libbfd.

How to get to the disassembly, example:

  1) `echo 2 > /proc/sys/net/core/bpf_jit_enable`
  2) Load a BPF filter (e.g. `tcpdump -p -n -s 0 -i eth1 host 192.168.20.0/24`)
  3) Run e.g. `bpf_jit_disasm -o` to disassemble the most recent JIT code output

`bpf_jit_disasm -o` will additionally display the opcodes that belong to each
instruction. Example for x86_64:

$ ./bpf_jit_disasm
94 bytes emitted from JIT compiler (pass:3, flen:9)
ffffffffa0356000 + <x>:
   0:	push   %rbp
   1:	mov    %rsp,%rbp
   4:	sub    $0x60,%rsp
   8:	mov    %rbx,-0x8(%rbp)
   c:	mov    0x68(%rdi),%r9d
  10:	sub    0x6c(%rdi),%r9d
  14:	mov    0xe0(%rdi),%r8
  1b:	mov    $0xc,%esi
  20:	callq  0xffffffffe0d01b71
  25:	cmp    $0x86dd,%eax
  2a:	jne    0x000000000000003d
  2c:	mov    $0x14,%esi
  31:	callq  0xffffffffe0d01b8d
  36:	cmp    $0x6,%eax
[...]
  5c:	leaveq
  5d:	retq

$ ./bpf_jit_disasm -o
94 bytes emitted from JIT compiler (pass:3, flen:9)
ffffffffa0356000 + <x>:
   0:	push   %rbp
	55
   1:	mov    %rsp,%rbp
	48 89 e5
   4:	sub    $0x60,%rsp
	48 83 ec 60
   8:	mov    %rbx,-0x8(%rbp)
	48 89 5d f8
   c:	mov    0x68(%rdi),%r9d
	44 8b 4f 68
  10:	sub    0x6c(%rdi),%r9d
	44 2b 4f 6c
[...]
  5c:	leaveq
	c9
  5d:	retq
	c3
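
For readers who want to see the log parsing step in isolation, here is a
minimal sketch of how the two kinds of kernel log lines could be decoded.
The names `parse_jit_hdr()`, `parse_jit_bytes()` and `struct jit_hdr` are
made up for this illustration and are not part of the patch. The header
format mirrors the sscanf() string used in bpf_jit_disasm.c; the
"JIT code: <offset>: <hex bytes>" layout of the dump lines is an assumption
inferred from what the tool searches for. The sketch stores the image
address as unsigned long long, whereas the tool uses unsigned long.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for the "flen=... image=..." log line. */
struct jit_hdr {
	int flen;
	int proglen;
	int pass;
	unsigned long long base;
};

/* Returns 1 if all four fields were found and parsed, 0 otherwise. */
static int parse_jit_hdr(const char *line, struct jit_hdr *h)
{
	const char *p = strstr(line, "flen=");

	if (!p)
		return 0;
	return sscanf(p, "flen=%d proglen=%d pass=%d image=%llx",
		      &h->flen, &h->proglen, &h->pass, &h->base) == 4;
}

/*
 * Decodes the hex bytes that follow the last ':' of a "JIT code" line
 * (assumed layout, see above). Returns the number of bytes stored.
 */
static size_t parse_jit_bytes(const char *line, uint8_t *out, size_t max)
{
	const char *p = strrchr(line, ':');
	char *end;
	size_t n = 0;

	if (!p || !strstr(line, "JIT code"))
		return 0;
	p++;
	while (n < max) {
		unsigned long v = strtoul(p, &end, 16);

		if (end == p)	/* no more hex digits on this line */
			break;
		out[n++] = (uint8_t) v;
		p = end;
	}
	return n;
}
```

Running every line of the kernel log through both helpers and keeping only
the last header seen would roughly correspond to what the tool does with
the buffer it reads via klogctl().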

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 scripts/bpf_jit_disasm.c | 216 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 216 insertions(+)
 create mode 100644 scripts/bpf_jit_disasm.c

diff --git a/scripts/bpf_jit_disasm.c b/scripts/bpf_jit_disasm.c
new file mode 100644
index 0000000..1fe9fb5
--- /dev/null
+++ b/scripts/bpf_jit_disasm.c
@@ -0,0 +1,216 @@
+/*
+ * Minimal BPF JIT image disassembler
+ *
+ * Disassembles BPF JIT compiler emitted opcodes back to asm insn's for
+ * debugging or verification purposes.
+ *
+ * There is no Makefile. Compile with
+ *
+ *   `gcc -Wall -O2 bpf_jit_disasm.c -o bpf_jit_disasm -lopcodes -lbfd -ldl`
+ *
+ * or similar.
+ *
+ * To get the disassembly of the JIT code, do the following:
+ *
+ *  1) `echo 2 > /proc/sys/net/core/bpf_jit_enable`
+ *  2) Load a BPF filter (e.g. `tcpdump -p -n -s 0 -i eth1 host 192.168.20.0/24`)
+ *  3) Run e.g. `./bpf_jit_disasm -o` to read out the last JIT code
+ *
+ * Copyright 2013 Daniel Borkmann <borkmann@redhat.com>
+ * Licensed under the GNU General Public License, version 2.0 (GPLv2)
+ */
+
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <assert.h>
+#include <unistd.h>
+#include <string.h>
+#include <bfd.h>
+#include <dis-asm.h>
+#include <sys/klog.h>
+#include <sys/types.h>
+#include <regex.h>
+
+#define VERSION_STRING	"1.0"
+
+static void get_exec_path(char *tpath, size_t size)
+{
+	char *path;
+	ssize_t len;
+
+	snprintf(tpath, size, "/proc/%d/exe", (int) getpid());
+	tpath[size - 1] = 0;
+
+	path = strdup(tpath);
+	assert(path);
+
+	len = readlink(path, tpath, size - 1);
+	tpath[len < 0 ? 0 : len] = 0;
+
+	free(path);
+}
+
+static void get_asm_insns(uint8_t *image, size_t len, unsigned long base,
+			  int opcodes)
+{
+	int count, i, pc = 0;
+	char tpath[256];
+	struct disassemble_info info;
+	disassembler_ftype disassemble;
+	bfd *bfdf;
+
+	memset(tpath, 0, sizeof(tpath));
+	get_exec_path(tpath, sizeof(tpath));
+
+	bfdf = bfd_openr(tpath, NULL);
+	assert(bfdf);
+	assert(bfd_check_format(bfdf, bfd_object));
+
+	init_disassemble_info(&info, stdout, (fprintf_ftype) fprintf);
+	info.arch = bfd_get_arch(bfdf);
+	info.mach = bfd_get_mach(bfdf);
+	info.buffer = image;
+	info.buffer_length = len;
+
+	disassemble_init_for_target(&info);
+
+	disassemble = disassembler(bfdf);
+	assert(disassemble);
+
+	do {
+		printf("%4x:\t", pc);
+
+		count = disassemble(pc, &info);
+
+		if (opcodes) {
+			printf("\n\t");
+			for (i = 0; i < count; ++i)
+				printf("%02x ", (uint8_t) image[pc + i]);
+		}
+		printf("\n");
+
+		pc += count;
+	} while(count > 0 && pc < len);
+
+	bfd_close(bfdf);
+}
+
+static char *get_klog_buff(int *klen)
+{
+	int ret, len = klogctl(10, NULL, 0);
+	char *buff = malloc(len);
+
+	assert(buff && klen);
+	ret = klogctl(3, buff, len);
+	assert(ret >= 0);
+	*klen = ret;
+
+	return buff;
+}
+
+static void put_klog_buff(char *buff)
+{
+	free(buff);
+}
+
+static int get_last_jit_image(char *haystack, size_t hlen,
+			      uint8_t *image, size_t ilen,
+			      unsigned long *base)
+{
+	char *ptr, *pptr, *tmp;
+	off_t off = 0;
+	int ret, flen, proglen, pass, ulen = 0;
+	regmatch_t pmatch[1];
+	regex_t regex;
+
+	if (hlen == 0)
+		return 0;
+
+	ret = regcomp(&regex, "flen=[[:alnum:]]+ proglen=[[:digit:]]+ "
+		      "pass=[[:digit:]]+ image=[[:xdigit:]]+", REG_EXTENDED);
+	assert(ret == 0);
+
+	ptr = haystack;
+	while (1) {
+		ret = regexec(&regex, ptr, 1, pmatch, 0);
+		if (ret == 0) {
+			ptr += pmatch[0].rm_eo;
+			off += pmatch[0].rm_eo;
+			assert(off < hlen);
+		} else
+			break;
+	}
+
+	ptr = haystack + off - (pmatch[0].rm_eo - pmatch[0].rm_so);
+	ret = sscanf(ptr, "flen=%d proglen=%d pass=%d image=%lx",
+		     &flen, &proglen, &pass, base);
+	if (ret != 4)
+		return 0;
+
+	tmp = ptr = haystack + off;
+	while ((ptr = strtok(tmp, "\n")) != NULL && ulen < ilen) {
+		tmp = NULL;
+		if (!strstr(ptr, "JIT code"))
+			continue;
+		pptr = ptr;
+		while ((ptr = strstr(pptr, ":")))
+			pptr = ptr + 1;
+		ptr = pptr;
+		do {
+			image[ulen++] = (uint8_t) strtoul(pptr, &pptr, 16);
+			if (ptr == pptr || ulen >= ilen) {
+				ulen--;
+				break;
+			}
+			ptr = pptr;
+		} while (1);
+	}
+
+	assert(ulen == proglen);
+	printf("%d bytes emitted from JIT compiler (pass:%d, flen:%d)\n",
+	       proglen, pass, flen);
+	printf("%lx + <x>:\n", *base);
+
+	regfree(&regex);
+	return ulen;
+}
+
+static void help(void)
+{
+	printf("Usage: bpf_jit_disasm [-ohv]\n");
+	printf("Version %s, written by Daniel Borkmann <borkmann@redhat.com>\n",
+	       VERSION_STRING);
+	printf("  -o                             Include opcodes in output\n");
+	printf("  -h|-v                          Show help/version\n");
+	exit(0);
+}
+
+int main(int argc, char **argv)
+{
+	int len, klen, opcodes = 0;
+	char *kbuff;
+	unsigned long base;
+	uint8_t image[4096];
+
+	if (argc > 1) {
+		if (!strncmp("-o", argv[argc - 1], 2))
+			opcodes = 1;
+		if (!strncmp("-h", argv[argc - 1], 2) ||
+		    !strncmp("-v", argv[argc - 1], 2))
+			help();
+	}
+
+	bfd_init();
+	memset(image, 0, sizeof(image));
+
+	kbuff = get_klog_buff(&klen);
+
+	len = get_last_jit_image(kbuff, klen, image, sizeof(image), &base);
+	if (len > 0 && base > 0)
+		get_asm_insns(image, len, base, opcodes);
+
+	put_klog_buff(kbuff);
+
+	return 0;
+}
-- 
1.7.11.7


* [PATCH net-next 3/4] filter: add ANC_PAY_OFFSET instruction for loading payload start offset
From: Daniel Borkmann @ 2013-03-19 13:28 UTC (permalink / raw)
  To: netdev; +Cc: davem, eric.dumazet, jasowang
In-Reply-To: <cover.1363699285.git.dborkman@redhat.com>

Dynamic truncation of packets is very useful. In particular, we're
interested in pushing only the necessary header bytes to user space and
cutting off user payload that should not be transferred for some reason
(e.g. privacy, speed, or others). With the ancillary extension PAY_OFFSET,
we can load the payload offset into the accumulator and return it. E.g. in
bpfc syntax ...

        ld #poff        ; { 0x20, 0, 0, 0xfffff034 },
        ret a           ; { 0x16, 0, 0, 0x00000000 },

... as a filter will accomplish this without having to resort to heavy
hackery in the BPF filter itself. Follow-up JIT implementations are welcome.

Thanks to Eric Dumazet for suggesting and discussing this during the
Netfilter Workshop in Copenhagen.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 include/linux/filter.h      | 1 +
 include/uapi/linux/filter.h | 3 ++-
 net/core/filter.c           | 5 +++++
 3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index c45eabc..d2059cb 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -126,6 +126,7 @@ enum {
 	BPF_S_ANC_SECCOMP_LD_W,
 	BPF_S_ANC_VLAN_TAG,
 	BPF_S_ANC_VLAN_TAG_PRESENT,
+	BPF_S_ANC_PAY_OFFSET,
 };
 
 #endif /* __LINUX_FILTER_H__ */
diff --git a/include/uapi/linux/filter.h b/include/uapi/linux/filter.h
index 9cfde69..8eb9cca 100644
--- a/include/uapi/linux/filter.h
+++ b/include/uapi/linux/filter.h
@@ -129,7 +129,8 @@ struct sock_fprog {	/* Required for SO_ATTACH_FILTER. */
 #define SKF_AD_ALU_XOR_X	40
 #define SKF_AD_VLAN_TAG	44
 #define SKF_AD_VLAN_TAG_PRESENT 48
-#define SKF_AD_MAX	52
+#define SKF_AD_PAY_OFFSET	52
+#define SKF_AD_MAX	56
 #define SKF_NET_OFF   (-0x100000)
 #define SKF_LL_OFF    (-0x200000)
 
diff --git a/net/core/filter.c b/net/core/filter.c
index 2e20b55..dad2a17 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -348,6 +348,9 @@ load_b:
 		case BPF_S_ANC_VLAN_TAG_PRESENT:
 			A = !!vlan_tx_tag_present(skb);
 			continue;
+		case BPF_S_ANC_PAY_OFFSET:
+			A = __skb_get_poff(skb);
+			continue;
 		case BPF_S_ANC_NLATTR: {
 			struct nlattr *nla;
 
@@ -612,6 +615,7 @@ int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
 			ANCILLARY(ALU_XOR_X);
 			ANCILLARY(VLAN_TAG);
 			ANCILLARY(VLAN_TAG_PRESENT);
+			ANCILLARY(PAY_OFFSET);
 			}
 
 			/* ancillary operation unknown or unsupported */
@@ -814,6 +818,7 @@ static void sk_decode_filter(struct sock_filter *filt, struct sock_filter *to)
 		[BPF_S_ANC_SECCOMP_LD_W] = BPF_LD|BPF_B|BPF_ABS,
 		[BPF_S_ANC_VLAN_TAG]	= BPF_LD|BPF_B|BPF_ABS,
 		[BPF_S_ANC_VLAN_TAG_PRESENT] = BPF_LD|BPF_B|BPF_ABS,
+		[BPF_S_ANC_PAY_OFFSET]	= BPF_LD|BPF_B|BPF_ABS,
 		[BPF_S_LD_W_LEN]	= BPF_LD|BPF_W|BPF_LEN,
 		[BPF_S_LD_W_IND]	= BPF_LD|BPF_W|BPF_IND,
 		[BPF_S_LD_H_IND]	= BPF_LD|BPF_H|BPF_IND,
-- 
1.7.11.7


* [PATCH net-next 2/4] net: flow_dissector: add __skb_get_poff to get a start offset to payload
From: Daniel Borkmann @ 2013-03-19 13:28 UTC (permalink / raw)
  To: netdev; +Cc: davem, eric.dumazet, jasowang
In-Reply-To: <cover.1363699285.git.dborkman@redhat.com>

__skb_get_poff() returns the offset to the payload as far as it could
be dissected. The main user is currently BPF, so that we can dynamically
truncate packets without needing to push the actual payload to user
space, and can analyze headers only instead.
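
To make the computation easier to follow outside the kernel, here is a
small user space sketch of the same idea for the TCP and UDP cases only.
It is an illustration under assumptions: `payload_offset()` and the
`PROTO_*` constants are hypothetical names, the caller is assumed to
already know the transport header offset (what flow_keys.thoff provides in
the kernel), and the header sizes are written out as plain numbers
(20 bytes minimum for TCP, 8 bytes for UDP).

```c
#include <stddef.h>
#include <stdint.h>

enum { PROTO_TCP = 6, PROTO_UDP = 17 };	/* IANA protocol numbers */

/*
 * Returns the offset of the payload within pkt, given the offset of the
 * transport header (thoff) and the transport protocol. Unknown protocols
 * and truncated TCP headers yield thoff itself.
 */
static size_t payload_offset(const uint8_t *pkt, size_t len,
			     size_t thoff, uint8_t proto)
{
	size_t poff = thoff;

	switch (proto) {
	case PROTO_TCP: {
		size_t hlen;

		/* Need the full fixed TCP header to read the data offset. */
		if (thoff > len || len - thoff < 20)
			return poff;

		/* Data offset: upper 4 bits of byte 12, in 32-bit words. */
		hlen = (size_t) (pkt[thoff + 12] >> 4) * 4;
		poff += hlen < 20 ? 20 : hlen;
		break;
	}
	case PROTO_UDP:
		poff += 8;
		break;
	}

	return poff;
}
```

As in the kernel function, a bogus data offset smaller than the fixed TCP
header is clamped to the fixed header size rather than trusted.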

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 include/linux/skbuff.h    |  2 ++
 net/core/flow_dissector.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 59 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index eb2106f..0e84fd8 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2835,6 +2835,8 @@ static inline void skb_checksum_none_assert(const struct sk_buff *skb)
 
 bool skb_partial_csum_set(struct sk_buff *skb, u16 start, u16 off);
 
+u32 __skb_get_poff(const struct sk_buff *skb);
+
 /**
  * skb_head_is_locked - Determine if the skb->head is locked down
  * @skb: skb to check
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index eb9dde1..8213da7 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -5,6 +5,10 @@
 #include <linux/if_vlan.h>
 #include <net/ip.h>
 #include <net/ipv6.h>
+#include <linux/igmp.h>
+#include <linux/icmp.h>
+#include <linux/sctp.h>
+#include <linux/dccp.h>
 #include <linux/if_tunnel.h>
 #include <linux/if_pppox.h>
 #include <linux/ppp_defs.h>
@@ -229,6 +233,59 @@ u16 __skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb,
 }
 EXPORT_SYMBOL(__skb_tx_hash);
 
+/* __skb_get_poff() returns the offset to the payload as far as it could
+ * be dissected. The main user is currently BPF, so that we can dynamically
+ * truncate packets without needing to push actual payload to the user
+ * space and can analyze headers only, instead.
+ */
+u32 __skb_get_poff(const struct sk_buff *skb)
+{
+	struct flow_keys keys;
+	u32 poff = 0;
+
+	if (!skb_flow_dissect(skb, &keys))
+		return 0;
+
+	poff += keys.thoff;
+	switch (keys.ip_proto) {
+	case IPPROTO_TCP: {
+		const struct tcphdr *tcph;
+		struct tcphdr _tcph;
+
+		tcph = skb_header_pointer(skb, poff, sizeof(_tcph), &_tcph);
+		if (!tcph)
+			return poff;
+
+		poff += max_t(u32, sizeof(struct tcphdr), tcph->doff * 4);
+		break;
+	}
+	case IPPROTO_UDP:
+	case IPPROTO_UDPLITE:
+		poff += sizeof(struct udphdr);
+		break;
+	/* For the rest, we do not really care about header
+	 * extensions at this point for now.
+	 */
+	case IPPROTO_ICMP:
+		poff += sizeof(struct icmphdr);
+		break;
+	case IPPROTO_ICMPV6:
+		poff += sizeof(struct icmp6hdr);
+		break;
+	case IPPROTO_IGMP:
+		poff += sizeof(struct igmphdr);
+		break;
+	case IPPROTO_DCCP:
+		poff += sizeof(struct dccp_hdr);
+		break;
+	case IPPROTO_SCTP:
+		poff += sizeof(struct sctphdr);
+		break;
+	}
+
+	return poff;
+}
+
 static inline u16 dev_cap_txqueue(struct net_device *dev, u16 queue_index)
 {
 	if (unlikely(queue_index >= dev->real_num_tx_queues)) {
-- 
1.7.11.7


* [PATCH net-next 1/4] flow_keys: include thoff into flow_keys for later usage
From: Daniel Borkmann @ 2013-03-19 13:28 UTC (permalink / raw)
  To: netdev; +Cc: davem, eric.dumazet, jasowang
In-Reply-To: <cover.1363699285.git.dborkman@redhat.com>

In skb_flow_dissect(), we perform a dissection of a skbuff. Since we're
doing the work here anyway, also store thoff for later use, e.g. in
the BPF filter. We need to reorder choke_skb_cb a bit, though. Also, by
making thoff 16 bits wide, we do not need to pack flow_keys.
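
The layout argument can be checked with a small user space mirror of the
structures. This is only a sketch: the `*_sketch` types and
`flow_keys_tail_padding()` are hypothetical names, the `src` and `dst`
members are assumed (they are not visible in the hunk of this patch), and
the resulting offsets depend on the ABI aligning 32-bit integers to 4 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* User space mirror of struct flow_keys after this patch. */
struct flow_keys_sketch {
	uint32_t src;		/* assumed member, not shown in the hunk */
	uint32_t dst;		/* assumed member, not shown in the hunk */
	union {
		uint32_t ports;
		uint16_t port16[2];
	};
	uint16_t thoff;		/* new: transport header offset */
	uint8_t ip_proto;
};

/* User space mirror of the reordered struct choke_skb_cb. */
struct choke_cb_sketch {
	struct flow_keys_sketch keys;
	uint8_t keys_valid;
	uint16_t classid;
};

/* Bytes of padding the compiler adds after the last member. */
static size_t flow_keys_tail_padding(void)
{
	return sizeof(struct flow_keys_sketch) -
	       (offsetof(struct flow_keys_sketch, ip_proto) + sizeof(uint8_t));
}
```

With thoff being 16 bits wide, it sits at its natural alignment directly
after the three 32-bit words, so no hole appears inside the structure and
only tail padding remains after ip_proto.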

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 include/net/flow_keys.h   | 1 +
 net/core/flow_dissector.c | 5 ++++-
 net/sched/sch_choke.c     | 4 ++--
 3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/net/flow_keys.h b/include/net/flow_keys.h
index 80461c1..bb8271d 100644
--- a/include/net/flow_keys.h
+++ b/include/net/flow_keys.h
@@ -9,6 +9,7 @@ struct flow_keys {
 		__be32 ports;
 		__be16 port16[2];
 	};
+	u16 thoff;
 	u8 ip_proto;
 };
 
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index f8d9e03..eb9dde1 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -23,7 +23,8 @@ static void iph_to_flow_copy_addrs(struct flow_keys *flow, const struct iphdr *i
 
 bool skb_flow_dissect(const struct sk_buff *skb, struct flow_keys *flow)
 {
-	int poff, nhoff = skb_network_offset(skb);
+	int poff;
+	u16 nhoff = skb_network_offset(skb);
 	u8 ip_proto;
 	__be16 proto = skb->protocol;
 
@@ -151,6 +152,8 @@ ipv6:
 			flow->ports = *ports;
 	}
 
+	flow->thoff = nhoff;
+
 	return true;
 }
 EXPORT_SYMBOL(skb_flow_dissect);
diff --git a/net/sched/sch_choke.c b/net/sched/sch_choke.c
index cc37dd5..9854c2b 100644
--- a/net/sched/sch_choke.c
+++ b/net/sched/sch_choke.c
@@ -141,9 +141,9 @@ static void choke_drop_by_idx(struct Qdisc *sch, unsigned int idx)
 }
 
 struct choke_skb_cb {
-	u16			classid;
-	u8			keys_valid;
 	struct flow_keys	keys;
+	u8			keys_valid;
+	u16			classid;
 };
 
 static inline struct choke_skb_cb *choke_skb_cb(const struct sk_buff *skb)
-- 
1.7.11.7
