Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH v3 1/2] core: Factor out flow calculation from get_rps_cpu
From: Changli Gao @ 2010-08-03  7:18 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: arnd, bhutchings, davem, mst, netdev, therbert
In-Reply-To: <AANLkTikJMih-CAA_d95ot0z57EOhMFsiNc2nnki6v6LY@mail.gmail.com>

On Tue, Aug 3, 2010 at 2:11 PM, Changli Gao <xiaosuo@gmail.com> wrote:
> On Tue, Aug 3, 2010 at 1:57 PM, Krishna Kumar2 <krkumar2@in.ibm.com> wrote:
>> Hi Changli,
>>
>> Good catch.
>>
>> Instead of adding support for ethernet header or pull/push,
>> I could defer the skb_push(ETH_HLEN), something like:
>>
>> static int macvtap_forward(struct net_device *dev, struct sk_buff *skb)
>> {
>>        struct macvtap_queue *q = macvtap_get_queue(dev, skb);
>>        if (!q)
>>                goto drop;
>>
>>        if (skb_queue_len(&q->sk.sk_receive_queue) >= dev->tx_queue_len)
>>                goto drop;
>>
>> +       skb_push(skb, ETH_HLEN);
>>        ...
>> }
>>
>> and remove the same in macvtap_receive. Will this be better?
>>
>
> I am confused by the call sites of macvlan_dev.receive and
> macvlan_dev.forward. They both are possible to be called in both
> RX(skb->data points to network header) and TX(skb->data points to
> ethernet) paths. The current code in macvtap shows that
> macvlan_dev.receive should be called in network layer, and
> macvlan_dev.forward should be called in dev layer. Am I correct?
>

After checking the code carefully, I find macvlan_dev.receive is
called in network layer(RX path), and macvlan_dev.forward is called in
dev layer(TX path). The current code hasn't any issue. Your solution
won't work, as macvtap_forward is called in dev layer, when skb->data
points to ethernet header already.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* Re: RFC: New BGF 'LOOP' instruction
From: David Miller @ 2010-08-03  7:19 UTC (permalink / raw)
  To: leonerd; +Cc: netdev, hagen
In-Reply-To: <20100803070709.GO11110@cel.leo>

From: Paul LeoNerd Evans <leonerd@leonerd.org.uk>
Date: Tue, 3 Aug 2010 08:07:10 +0100

> On Mon, Aug 02, 2010 at 10:18:13PM -0700, David Miller wrote:
>> 1) The limiting scheme will make legitimate scripts USELESS
> 
> Rightnow, BPF is all but useless for parsing, say, IPv6. I only pick
> IPv6 as one example, I'm sure there must exist a great number more
> packet-based protocols that use a "linked-list" style approach to
> headers. None of those are currently filterable on the current set of
> instructions. LOOP would allow these.

It's not meant for detailed packet protocol header analysis,
it's for stateless straight line matching of masked values
in packet headers.

Nothing more.

^ permalink raw reply

* Re: RFC: New BPF 'LOOP' instruction
From: Paul LeoNerd Evans @ 2010-08-03  7:18 UTC (permalink / raw)
  To: Hagen Paul Pfeifer, netdev
In-Reply-To: <20100802201629.GA5973@nuttenaction>

[-- Attachment #1: Type: text/plain, Size: 5159 bytes --]

On Mon, Aug 02, 2010 at 10:16:29PM +0200, Hagen Paul Pfeifer wrote:
> In general: BPF was constructed to address filters rules in a generic manner
> and BPF does not contain any special protocol specific optimization - nor any
> sophisticated connection tracking functionality. In general you should
> pre-filter unneeded packets and shift the real high level filtering to some
> post-processing step. tcpdump filter capabilities are limited and where never
> designed to filter _any_ traffic. For example: you are lost if you want to match
> transport layer fields like port number where the underlying IPv{4,6} packet
> is fragmented.

Oh, I am quite aware of the futility in trying to, for example, match up
IPv4 fragments.

There's nothing about my suggestion that is in any way IPv6-specific. I
used IPv6 simply as an example to motivate the suggestion. It could
quite easily apply to any other sort of protocol that uses a linked-list
of headers.

> Furthermore, I doubt that the loop provides any significant advantages. 
> IPv6 extension header parsing is not that straight forward. For example 
> check  the IPSec AH Extension header where the extension header length 
> must interpreted differently because of a IPSec AH protcol defect. I assume
> that a straight forward pcap encoded BPF opcode (composed of jump and load
> instructions) is more efficient as an highly flexible loop construct. 

I'm not sure I follow your logic here.

By my understanding, pcap's IPv6 header parsing filter is a 6-times
statically-unrolled loop, where each loop body has to parse some
headers. I'm already aware that various headers are hard to parse.

Allow me some pseudocode... Currently, pcap has to do the equivalent of:

X = 0
Look at header at [X]; if it's what we want goto 'got'; else load A with
its length.
X += A
Look at header at [X]; if it's what we want goto 'got'; else load A with
its length.
X += A
Look at header at [X]; if it's what we want goto 'got'; else load A with
its length.
X += A
Look at header at [X]; if it's what we want goto 'got'; else load A with
its length.
X += A
Look at header at [X]; if it's what we want goto 'got'; else load A with
its length.
X += A
Look at header at [X]; if it's what we want goto 'got'; else load A with
its length.
X += A
got:
  ... continue with filter.

That "load A with its length" is the IPv6-specific part; I'm not
suggesting that my LOOP suggestion in any way helps that. It's a
difficult problem, sure. What I _am_ suggesting is that this static
unrolling can be avoided, instead becoming:

X = 0
start:
Look at header at [X]; if it's what we want goto 'got'; else load A with
its length.
LOOP to start
got:
  ... continue with filter.

This results also in a shorter program, because there is a hard limit on
the total number of instructions in a filter.

> Last but not least I am interested in a RFC patch as well as a pcap patch (see
> pcap-opt.c). You should not underrate the effort to generate an generic IPv6
> extension header opcode optimizer - without this the newly introduced opcode
> is pointless.

As above; I was under the impression that pcap already -does- contain
code to have a reasonable attempt to hunt down the requested IPv6
header, which is what implements "ipv6 protochain". I'll quote from
pcap-filter(7):

       ip6 protochain protocol
              True  if the packet is IPv6 packet, and contains protocol header
              with type protocol in its protocol header chain.  For example,
                   ip6 protochain 6
              matches any IPv6 packet with TCP protocol header in the protocol
              header  chain.  The packet may contain, for example, authentica‐
              tion  header,  routing  header,  or  hop-by-hop  option  header,
              between  IPv6  header  and  TCP header.  The BPF code emitted by
              this primitive is complex and cannot be  optimized  by  the  BPF
              optimizer code, so this can be somewhat slow.


> PS: the LOOP opcode must be secure against any ressource attack -> the loop
> must be break after n iterations.

Which is -exactly- what it does. I'll quote my original:

       X += A.
       If X < len, jump backwards jt instructions.
       Otherwise, fallthrough to the next instruction
  ...
  The intention of this instruction is to be able to implement a loop in
  which successive iterations advance the index register along the packet
  buffer. By comparing X to the packet length, we can bound the running
  time of the loop instruction, avoiding it locking up the kernel. By
  banning STX instructions within the body of the loop, we can ensure that
  X must be a strictly monotonically increasing sequence. At absolute
  worst, X is increased by 1 each time, meaning at worst the body of the
  loop must execute for every byte in the packet.

Is this sufficiently secure, or do you suggest a further limit is
required?

-- 
Paul "LeoNerd" Evans

leonerd@leonerd.org.uk
ICQ# 4135350       |  Registered Linux# 179460
http://www.leonerd.org.uk/

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply

* Re: RFC: New BGF 'LOOP' instruction
From: David Miller @ 2010-08-03  7:18 UTC (permalink / raw)
  To: leonerd; +Cc: netdev
In-Reply-To: <20100803070426.GN11110@cel.leo>

From: Paul LeoNerd Evans <leonerd@leonerd.org.uk>
Date: Tue, 3 Aug 2010 08:04:27 +0100

> On Mon, Aug 02, 2010 at 10:13:41PM -0700, David Miller wrote:
>> > Any comments on this, while I proceed? Barring any major complaints,
>> > I'll have a hack at some code and present a patch in due course...
>> 
>> We're not adding loop instructions, it's just asking for trouble
>> since any user can attach BPF filters to a socket and it's just
>> way too easy to make a loop endless.
>> 
>> There's a reason no loop primitives were added to the original
>> BPF specification, perhaps you should take a look at what their
>> reasoning was.
> 
> Yes. I am very aware of that.
> 
> Please read carefully my suggestion. These loops cannot be made endless
> - they will be bounded by, at most, the number of bytes in the packet
> buffer. The loop is required to increment X at least 1 at every
> iteration, and will not allow it to continue past the end of the packet.
> This puts a strict bound on the runtime of the loop.

That makes the looping construct largely useless, which I mentioned in
my second reply to this thread.


^ permalink raw reply

* Re: RFC: New BGF 'LOOP' instruction
From: Paul LeoNerd Evans @ 2010-08-03  7:07 UTC (permalink / raw)
  To: David Miller, netdev; +Cc: hagen
In-Reply-To: <20100802.221813.43045517.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 1164 bytes --]

On Mon, Aug 02, 2010 at 10:18:13PM -0700, David Miller wrote:
> If you just check for a single loop hitting, the user will just use
> a chaining of two looping constructs.  And then three levels of
> indirection, then four, etc.  He can run up to just before exhasting
> the "iteration limit" of one loop, and branch to the next one, and
> so on and so forth.

And this is why part of my suggestion bans the use of a LOOP
instruction within the "body" of another, such that they cannot nest.

> There are probably a million ways to exploit this, and once you come
> up with a validation or limiting scheme one of two things will happen:
> 
> 1) The limiting scheme will make legitimate scripts USELESS

Rightnow, BPF is all but useless for parsing, say, IPv6. I only pick
IPv6 as one example, I'm sure there must exist a great number more
packet-based protocols that use a "linked-list" style approach to
headers. None of those are currently filterable on the current set of
instructions. LOOP would allow these.

-- 
Paul "LeoNerd" Evans

leonerd@leonerd.org.uk
ICQ# 4135350       |  Registered Linux# 179460
http://www.leonerd.org.uk/

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply

* Re: softirq warnings when calling dev_kfree_skb_irq - bug in conntrack?
From: Johannes Berg @ 2010-08-03  7:04 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: NetDev, Xu, Dongxiao, Xen-devel@lists.xensource.com, Ian Campbell,
	Patrick McHardy, Eric Dumazet
In-Reply-To: <4C571476.7070301@goop.org>

On Mon, 2010-08-02 at 11:54 -0700, Jeremy Fitzhardinge wrote:

> ------------[ cut here ]------------
> WARNING: at kernel/softirq.c:143 local_bh_enable+0x40/0x87()
> Modules linked in: xt_state dm_mirror dm_region_hash dm_log microcode [last unloaded: scsi_wait_scan]
> Pid: 0, comm: swapper Not tainted 2.6.35-rc6-next-20100729+ #29
> Call Trace:
>   <IRQ>   [<ffffffff81030de3>] warn_slowpath_common+0x80/0x98
>   [<ffffffff81030e10>] warn_slowpath_null+0x15/0x17
>   [<ffffffff81035ff3>] local_bh_enable+0x40/0x87
>   [<ffffffff814236e5>] destroy_conntrack+0x78/0x9e
>   [<ffffffff810bea55>] ? __kmalloc_track_caller+0xc3/0x135
>   [<ffffffff814203b4>] nf_conntrack_destroy+0x16/0x18
>   [<ffffffff813fadee>] skb_release_head_state+0x97/0xd9
>   [<ffffffff813fabbe>] __kfree_skb+0x11/0x7a
>   [<ffffffff813fac4e>] consume_skb+0x27/0x29
>   [<ffffffff81402d3a>] dev_kfree_skb_irq+0x18/0x62
>   [<ffffffff8130a762>] xennet_tx_buf_gc+0xfc/0x192
>   [<ffffffff8130a8fb>] smart_poll_function+0x50/0x121
>   [<ffffffff8130a8ab>] ? smart_poll_function+0x0/0x121
>   [<ffffffff8104b8d1>] __run_hrtimer+0xcc/0x127
>   [<ffffffff8104bad3>] hrtimer_interrupt+0x9c/0x17b

> It seems the basic problem is that xennet_tx_buf_gc() is being called in 
> interrupt context - with smartpoll it's from the timer interrupt, but 
> even without it is being called from xennet_interrupt(), which in turn 
> calls dev_kfree_skb_irq().
> 
> Since this should be perfectly OK, it appears the problem is actually in 
> conntrack.  I'm not sure where this bug started happening, but its 
> relatively recently I think.

I had this too:
http://article.gmane.org/gmane.linux.network/167590

But I'm not convinced it's conntrack, I'd think it's

commit 15e83ed78864d0625e87a85f09b297c0919a4797
Author: Eric Dumazet <eric.dumazet@gmail.com>
Date:   Wed May 19 23:16:03 2010 +0000

    net: remove zap_completion_queue

which, from the looks of it, ought to be reverted because it failed to
take into account that dev_kfree_skb() can do more things that require
non-irq-context than just calling skb->destructor, like for instance the
conntrack thing we see here.

johannes


^ permalink raw reply

* Re: RFC: New BGF 'LOOP' instruction
From: Paul LeoNerd Evans @ 2010-08-03  7:04 UTC (permalink / raw)
  To: David Miller, netdev
In-Reply-To: <20100802.221341.137851732.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 1046 bytes --]

On Mon, Aug 02, 2010 at 10:13:41PM -0700, David Miller wrote:
> > Any comments on this, while I proceed? Barring any major complaints,
> > I'll have a hack at some code and present a patch in due course...
> 
> We're not adding loop instructions, it's just asking for trouble
> since any user can attach BPF filters to a socket and it's just
> way too easy to make a loop endless.
> 
> There's a reason no loop primitives were added to the original
> BPF specification, perhaps you should take a look at what their
> reasoning was.

Yes. I am very aware of that.

Please read carefully my suggestion. These loops cannot be made endless
- they will be bounded by, at most, the number of bytes in the packet
buffer. The loop is required to increment X at least 1 at every
iteration, and will not allow it to continue past the end of the packet.
This puts a strict bound on the runtime of the loop.

-- 
Paul "LeoNerd" Evans

leonerd@leonerd.org.uk
ICQ# 4135350       |  Registered Linux# 179460
http://www.leonerd.org.uk/

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply

* Re: Is it a possible bug in dev_gro_receive()?
From: Jarek Poplawski @ 2010-08-03  6:45 UTC (permalink / raw)
  To: Xin, Xiaohui
  Cc: netdev@vger.kernel.org, herbert@gondor.apana.org.au,
	davem@davemloft.net
In-Reply-To: <F2E9EB7348B8264F86B6AB8151CE2D791BAB3078E0@shsmsx502.ccr.corp.intel.com>

On Tue, Aug 03, 2010 at 10:33:24AM +0800, Xin, Xiaohui wrote:
> >-----Original Message-----
> >From: Jarek Poplawski [mailto:jarkao2@gmail.com]
> >Sent: Monday, August 02, 2010 6:29 PM
> >To: Xin, Xiaohui
> >Cc: netdev@vger.kernel.org; herbert@gondor.apana.org.au; davem@davemloft.net
> >Subject: Re: Is it a possible bug in dev_gro_receive()?
> >
> >Xin Xiaohui wrote:
> >> I looked into the code dev_gro_receive(), found the code here:
> >> if the frags[0] is pulled to 0, then the page will be released,
> >> and memmove() frags left.
> >> Is that right? I'm not sure if memmove do right or not, but
> >> frags[0].size is never set after memove at least. what I think
> >> a simple way is not to do anything if we found frags[0].size == 0.
> >> The patch is as followed.
> >>
> >> Or am I missing something here?
> >
> >I think, you're right, but fixing memmove looks nicer to me:
> >
> > -	--skb_shinfo(skb)->nr_frags);
> > +	--skb_shinfo(skb)->nr_frags * sizeof(skb_frag_t));
> >
> >Jarek P.
> 
> Is there a little hurt of performance to do memmove() if skb_shinfo(skb)->nr_frags is large?
> We're now working on the zero-copy patches based on napi_gro_frags() interface, and in 
> this case, we have found a lot of skbs which frags[0] is pulled to 0. And after the memmove is
> fixed, each frags[x].size is needed to modify too.
> So I think don't do anything is better. Or is there any side effect with a null page in the stack?

Even if it's better, generally you should separate fixes from
optimizations. On the other hand, it was expected to be "unlikely" by
design, so you should probably explain more why it has to be changed
here too.

Thanks,
Jarek P.

> 
> Thanks
> Xiaohui
> >
> >>
> >> ---
> >>  net/core/dev.c |    7 -------
> >>  1 files changed, 0 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/net/core/dev.c b/net/core/dev.c
> >> index 264137f..28cdbbf 100644
> >> --- a/net/core/dev.c
> >> +++ b/net/core/dev.c
> >> @@ -2730,13 +2730,6 @@ pull:
> >>
> >>  		skb_shinfo(skb)->frags[0].page_offset += grow;
> >>  		skb_shinfo(skb)->frags[0].size -= grow;
> >> -
> >> -		if (unlikely(!skb_shinfo(skb)->frags[0].size)) {
> >> -			put_page(skb_shinfo(skb)->frags[0].page);
> >> -			memmove(skb_shinfo(skb)->frags,
> >> -				skb_shinfo(skb)->frags + 1,
> >> -				--skb_shinfo(skb)->nr_frags);
> >> -		}
> >>  	}
> >>
> >>  ok:
> >
> >
> 

^ permalink raw reply

* Re: [PATCH v3 1/2] core: Factor out flow calculation from get_rps_cpu
From: Changli Gao @ 2010-08-03  6:11 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: arnd, bhutchings, davem, mst, netdev, therbert
In-Reply-To: <OF46AA822C.08F8EAED-ON65257774.001F5AE3-65257774.00209BBF@in.ibm.com>

On Tue, Aug 3, 2010 at 1:57 PM, Krishna Kumar2 <krkumar2@in.ibm.com> wrote:
> Hi Changli,
>
> Good catch.
>
> Instead of adding support for ethernet header or pull/push,
> I could defer the skb_push(ETH_HLEN), something like:
>
> static int macvtap_forward(struct net_device *dev, struct sk_buff *skb)
> {
>        struct macvtap_queue *q = macvtap_get_queue(dev, skb);
>        if (!q)
>                goto drop;
>
>        if (skb_queue_len(&q->sk.sk_receive_queue) >= dev->tx_queue_len)
>                goto drop;
>
> +       skb_push(skb, ETH_HLEN);
>        ...
> }
>
> and remove the same in macvtap_receive. Will this be better?
>

I am confused by the call sites of macvlan_dev.receive and
macvlan_dev.forward. They both are possible to be called in both
RX(skb->data points to network header) and TX(skb->data points to
ethernet) paths. The current code in macvtap shows that
macvlan_dev.receive should be called in network layer, and
macvlan_dev.forward should be called in dev layer. Am I correct?


-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* Re: [PATCH v3 1/2] core: Factor out flow calculation from get_rps_cpu
From: Krishna Kumar2 @ 2010-08-03  5:57 UTC (permalink / raw)
  To: Changli Gao; +Cc: arnd, bhutchings, davem, mst, netdev, therbert
In-Reply-To: <AANLkTikNOstJr3qqMUqtUoOm1xHOjw5uM7NpQUNJz4PR@mail.gmail.com>

Hi Changli,

Good catch.

Instead of adding support for ethernet header or pull/push,
I could defer the skb_push(ETH_HLEN), something like:

static int macvtap_forward(struct net_device *dev, struct sk_buff *skb)
{
	struct macvtap_queue *q = macvtap_get_queue(dev, skb);
	if (!q)
		goto drop;

	if (skb_queue_len(&q->sk.sk_receive_queue) >= dev->tx_queue_len)
		goto drop;

+	skb_push(skb, ETH_HLEN);
	...
}

and remove the same in macvtap_receive. Will this be better?

Your other suggestions also looks good.

Thanks,

- KK

Changli Gao <xiaosuo@gmail.com> wrote on 08/03/2010 09:35:34 AM:

> Changli Gao <xiaosuo@gmail.com>
> 08/03/2010 09:35 AM
>
> To
>
> Krishna Kumar2/India/IBM@IBMIN
>
> cc
>
> davem@davemloft.net, arnd@arndb.de, bhutchings@solarflare.com,
> netdev@vger.kernel.org, therbert@google.com, mst@redhat.com
>
> Subject
>
> Re: [PATCH v3 1/2] core: Factor out flow calculation from get_rps_cpu
>
> On Tue, Aug 3, 2010 at 11:02 AM, Krishna Kumar <krkumar2@in.ibm.com>
wrote:
> > From: Krishna Kumar <krkumar2@in.ibm.com>
> >
> > Factor out flow calculation code from get_rps_cpu, since macvtap
> > driver can use the same code.
> >
> > Revisions:
> >
> > v2 - Ben: Separate flow calcuation out and use in select queue
> > v3 - Arnd: Don't re-implement MIN
> >
> > Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
> > ---
> >  include/linux/netdevice.h |    1
> >  net/core/dev.c            |   94 ++++++++++++++++++++++--------------
> >  2 files changed, 59 insertions(+), 36 deletions(-)
> >
> > diff -ruNp org/include/linux/netdevice.h new/include/linux/netdevice.h
> > --- org/include/linux/netdevice.h       2010-08-03 08:19:57.000000000
+0530
> > +++ new/include/linux/netdevice.h       2010-08-03 08:19:57.000000000
+0530
> > @@ -2253,6 +2253,7 @@ static inline const char *netdev_name(co
> >        return dev->name;
> >  }
> >
> > +extern int skb_calculate_flow(struct net_device *dev, struct sk_buff
*skb);
> >  extern int netdev_printk(const char *level, const struct net_device
*dev,
> >                         const char *format, ...)
> >        __attribute__ ((format (printf, 3, 4)));
> > diff -ruNp org/net/core/dev.c new/net/core/dev.c
> > --- org/net/core/dev.c  2010-08-03 08:19:57.000000000 +0530
> > +++ new/net/core/dev.c  2010-08-03 08:19:57.000000000 +0530
> > @@ -2263,51 +2263,24 @@ static inline void ____napi_schedule(str
> >        __raise_softirq_irqoff(NET_RX_SOFTIRQ);
> >  }
> >
> > -#ifdef CONFIG_RPS
> > -
> > -/* One global table that all flow-based protocols share. */
> > -struct rps_sock_flow_table *rps_sock_flow_table __read_mostly;
> > -EXPORT_SYMBOL(rps_sock_flow_table);
> > -
> >  /*
> > - * get_rps_cpu is called from netif_receive_skb and returns the target
> > - * CPU from the RPS map of the receiving queue for a given skb.
> > - * rcu_read_lock must be held on entry.
> > + * skb_calculate_flow: calculate a flow hash based on src/dst
addresses
> > + * and src/dst port numbers. On success, returns a hash number (> 0),
> > + * otherwise -1.
> >  */
> > -static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
> > -                      struct rps_dev_flow **rflowp)
> > +int skb_calculate_flow(struct net_device *dev, struct sk_buff *skb)
> >  {
> > +       int hash = skb->rxhash;
> >        struct ipv6hdr *ip6;
> >        struct iphdr *ip;
> > -       struct netdev_rx_queue *rxqueue;
> > -       struct rps_map *map;
> > -       struct rps_dev_flow_table *flow_table;
> > -       struct rps_sock_flow_table *sock_flow_table;
> > -       int cpu = -1;
> >        u8 ip_proto;
> > -       u16 tcpu;
> >        u32 addr1, addr2, ihl;
> >        union {
> >                u32 v32;
> >                u16 v16[2];
> >        } ports;
> >
> > -       if (skb_rx_queue_recorded(skb)) {
> > -               u16 index = skb_get_rx_queue(skb);
> > -               if (unlikely(index >= dev->num_rx_queues)) {
> > -                       WARN_ONCE(dev->num_rx_queues > 1, "%s
> received packet "
> > -                               "on queue %u, but number of RX
> queues is %u\n",
> > -                               dev->name, index, dev->num_rx_queues);
> > -                       goto done;
> > -               }
> > -               rxqueue = dev->_rx + index;
> > -       } else
> > -               rxqueue = dev->_rx;
> > -
> > -       if (!rxqueue->rps_map && !rxqueue->rps_flow_table)
> > -               goto done;
> > -
> > -       if (skb->rxhash)
> > +       if (hash)
> >                goto got_hash; /* Skip hash computation on packet header
*/
> >
> >        switch (skb->protocol) {
> > @@ -2334,6 +2307,7 @@ static int get_rps_cpu(struct net_device
> >        default:
> >                goto done;
> >        }
> > +
> >        switch (ip_proto) {
> >        case IPPROTO_TCP:
> >        case IPPROTO_UDP:
> > @@ -2356,11 +2330,59 @@ static int get_rps_cpu(struct net_device
> >        /* get a consistent hash (same value on both flow directions) */
> >        if (addr2 < addr1)
> >                swap(addr1, addr2);
> > -       skb->rxhash = jhash_3words(addr1, addr2, ports.v32, hashrnd);
> > -       if (!skb->rxhash)
> > -               skb->rxhash = 1;
> > +
> > +       hash = jhash_3words(addr1, addr2, ports.v32, hashrnd);
> > +       if (!hash)
> > +               hash = 1;
> >
> >  got_hash:
> > +       return hash;
> > +
> > +done:
> > +       return -1;
> > +}
> > +EXPORT_SYMBOL(skb_calculate_flow);
>
> I have noticed that you use skb_calculate_flow() in
> macvtap_get_queue() where skb->data doesn't point to the network
> header but the ethernet header. However, skb_calculate_flow() assume
> skb->data points to the network header. There are two choices:
>  * update skb_calculate_flow to support called in ethernet layer.
>  * pull skb before skb_calculate_flow, and push skb after
> skb_calculate_flow() in macvtap_get_queue().
>
> I prefer the former way.
>
> BTW: the function name skb_calculate_flow isn't good. How about
> skb_get_rxhash(). Maybe we can implement two versions: fast path and
> slow path. And implement the fast path version as a inline function in
> skbuff.h.
>
> static inline u32 skb_get_rxhash(struct sk_buff *skb)
> {
>         u32 rxhash;
>
>         rxhash = skb->rxhash;
>         if (!rxhash)
>                 return __skb_get_rxhash(skb);
>         return rxhash;
> }
>
>
> > +
> > +#ifdef CONFIG_RPS
> > +
> > +/* One global table that all flow-based protocols share. */
> > +struct rps_sock_flow_table *rps_sock_flow_table __read_mostly;
> > +EXPORT_SYMBOL(rps_sock_flow_table);
> > +
> > +/*
> > + * get_rps_cpu is called from netif_receive_skb and returns the target
> > + * CPU from the RPS map of the receiving queue for a given skb.
> > + * rcu_read_lock must be held on entry.
> > + */
> > +static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
> > +                      struct rps_dev_flow **rflowp)
> > +{
> > +       struct netdev_rx_queue *rxqueue;
> > +       struct rps_map *map;
> > +       struct rps_dev_flow_table *flow_table;
> > +       struct rps_sock_flow_table *sock_flow_table;
> > +       int cpu = -1;
> > +       u16 tcpu;
> > +
> > +       if (skb_rx_queue_recorded(skb)) {
> > +               u16 index = skb_get_rx_queue(skb);
> > +               if (unlikely(index >= dev->num_rx_queues)) {
> > +                       WARN_ONCE(dev->num_rx_queues > 1, "%s
> received packet "
> > +                               "on queue %u, but number of RX
> queues is %u\n",
> > +                               dev->name, index, dev->num_rx_queues);
> > +                       goto done;
> > +               }
> > +               rxqueue = dev->_rx + index;
> > +       } else
> > +               rxqueue = dev->_rx;
> > +
> > +       if (!rxqueue->rps_map && !rxqueue->rps_flow_table)
> > +               goto done;
> > +
> > +       skb->rxhash = skb_calculate_flow(dev, skb);
> > +       if (skb->rxhash < 0)
> > +               goto done;
> > +
> >        flow_table = rcu_dereference(rxqueue->rps_flow_table);
> >        sock_flow_table = rcu_dereference(rps_sock_flow_table);
> >        if (flow_table && sock_flow_table) {
> > --
> > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
>
>
>
> --
> Regards,
> Changli Gao(xiaosuo@gmail.com)


^ permalink raw reply

* reminder: bug fixes only
From: David Miller @ 2010-08-03  5:48 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA; +Cc: linux-wireless-u79uwXL29TY76Z2rM5mHXA


2.6.35 is out, we're about to enter the merge window, so
if it isn't already in my tree I don't want to see it unless
it's a bug fix.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] net: cleanup inclusion
From: David Miller @ 2010-08-03  5:45 UTC (permalink / raw)
  To: xiaosuo; +Cc: netdev
In-Reply-To: <1280787101-5589-1-git-send-email-xiaosuo@gmail.com>

From: Changli Gao <xiaosuo@gmail.com>
Date: Tue,  3 Aug 2010 06:11:41 +0800

> Commit ab95bfe01f9872459c8678572ccadbf646badad0 replaces bridge and macvlan
> hooks in __netif_receive_skb(), so dev.c doesn't need to include their headers.
> 
> Signed-off-by: Changli Gao <xiaosuo@gmail.com>

Applied, thanks.

^ permalink raw reply

* [PATCH] net: cleanup inclusion
From: Changli Gao @ 2010-08-02 22:11 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Changli Gao

Commit ab95bfe01f9872459c8678572ccadbf646badad0 replaces bridge and macvlan
hooks in __netif_receive_skb(), so dev.c doesn't need to include their headers.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
----
 net/core/dev.c |    2 --
 1 file changed, 2 deletions(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index 5d1282d..8c663db 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -101,8 +101,6 @@
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
 #include <linux/stat.h>
-#include <linux/if_bridge.h>
-#include <linux/if_macvlan.h>
 #include <net/dst.h>
 #include <net/pkt_sched.h>
 #include <net/checksum.h>

^ permalink raw reply related

* Re: RFC: New BGF 'LOOP' instruction
From: David Miller @ 2010-08-03  5:18 UTC (permalink / raw)
  To: hagen; +Cc: leonerd, netdev
In-Reply-To: <20100802201629.GA5973@nuttenaction>

From: Hagen Paul Pfeifer <hagen@jauu.net>
Date: Mon, 2 Aug 2010 22:16:29 +0200

> PS: the LOOP opcode must be secure against any ressource attack -> the loop
> must be break after n iterations.

Oh yeah, what is an iteration in your definition?  See this is why I
totally refuse to add a looping construct to BPF.

If you just check for a single loop hitting, the user will just use
a chaining of two looping constructs.  And then three levels of
indirection, then four, etc.  He can run up to just before exhasting
the "iteration limit" of one loop, and branch to the next one, and
so on and so forth.

There are probably a million ways to exploit this, and once you come
up with a validation or limiting scheme one of two things will happen:

1) The limiting scheme will make legitimate scripts USELESS

2) Someone will just figure out another hole to punch through and
   exploit

There's a reason no back branching construct was added to BPF to begin
with.  It violates the core design principles of BPF.

^ permalink raw reply

* Re: RFC: New BGF 'LOOP' instruction
From: David Miller @ 2010-08-03  5:13 UTC (permalink / raw)
  To: leonerd; +Cc: netdev
In-Reply-To: <20100802110334.GK11110@cel.leo>

From: Paul LeoNerd Evans <leonerd@leonerd.org.uk>
Date: Mon, 2 Aug 2010 12:03:34 +0100

> Any comments on this, while I proceed? Barring any major complaints,
> I'll have a hack at some code and present a patch in due course...

We're not adding loop instructions, it's just asking for trouble
since any user can attach BPF filters to a socket and it's just
way too easy to make a loop endless.

There's a reason no loop primitives were added to the original
BPF specification, perhaps you should take a look at what their
reasoning was.

It still applies now.

^ permalink raw reply

* Re: [PATCH 0/2] Minor extensions to marvell phy driver
From: David Miller @ 2010-08-03  5:09 UTC (permalink / raw)
  To: cyril; +Cc: netdev
In-Reply-To: <1280778294-2993-1-git-send-email-cyril@ti.com>

From: Cyril Chemparathy <cyril@ti.com>
Date: Mon,  2 Aug 2010 15:44:52 -0400

>   phy/marvell: add 88e1121 interface mode support

Applied.

>   phy/marvell: add 88ec048 support

You have to update marvel_id_tbl otherwise you break PHY
module autoloading.

Please fix this and resubmit this second patch.

Thanks.


^ permalink raw reply

* Re: [PATCH] u32: negative offset fix
From: David Miller @ 2010-08-03  5:08 UTC (permalink / raw)
  To: shemminger; +Cc: xiaosuo, netdev
In-Reply-To: <20100802164413.6f327ce6@nehalam>

From: Stephen Hemminger <shemminger@vyatta.com>
Date: Mon, 2 Aug 2010 16:44:13 -0700

> It was possible to use a negative offset in a u32 match to reference
> the ethernet header or other parts of the link layer header.
> This fixes the regression caused by:
> 
> commit fbc2e7d9cf49e0bf89b9e91fd60a06851a855c5d
> Author: Changli Gao <xiaosuo@gmail.com>
> Date:   Wed Jun 2 07:32:42 2010 -0700
> 
>     cls_u32: use skb_header_pointer() to dereference data safely
> 
> 
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

Applied.

^ permalink raw reply

* Re: [PATCH] deal with if frags[0].size is pulled to 0 in dev_gro_receive()
From: David Miller @ 2010-08-03  5:03 UTC (permalink / raw)
  To: herbert; +Cc: xiaohui.xin, netdev
In-Reply-To: <20100803045637.GA14173@gondor.apana.org.au>

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Tue, 3 Aug 2010 12:56:38 +0800

> On Tue, Aug 03, 2010 at 11:17:19AM +0800, xiaohui.xin@intel.com wrote:
>> From: Xin Xiaohui <xiaohui.xin@intel.com>
>> 
>> Now in dev_gro_receive(), if frags[0].size is pulled to 0, memmove is called and
>> the null page is released. But it's not enough, we should reset size of each frags
>> left as well. Compared to this, we can have another way to do this, it's not do do
>> anything at all.
>> 
>> Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
> 
> This patch can only work if you audit everything that uses skb
> frags to ensure that they can tolerate a zero-sided frag.
> 
> I think it's much easier to just fix the memmove.

Agreed.

^ permalink raw reply

* Re: [PATCH] Fixes a typo from "dev" to "ndev"
From: David Miller @ 2010-08-03  5:03 UTC (permalink / raw)
  To: henrique.camargo
  Cc: chaithrika, srk, khilman, jpirko, netdev, linux-kernel, segooon
In-Reply-To: <1280805042.2089.6.camel@lemming>

From: Henrique Camargo <henrique.camargo@ensitec.com.br>
Date: Tue, 03 Aug 2010 00:10:42 -0300

> The typo was causing compilation errors since "dev" was not defined.
> 
> Signed-off-by: Henrique Camargo <henrique.camargo@ensitec.com.br>

Applied, thank you.

^ permalink raw reply

* Re: [PATCH] deal with if frags[0].size is pulled to 0 in dev_gro_receive()
From: Herbert Xu @ 2010-08-03  4:56 UTC (permalink / raw)
  To: xiaohui.xin; +Cc: netdev, davem
In-Reply-To: <1280805439-18988-1-git-send-email-xiaohui.xin@intel.com>

On Tue, Aug 03, 2010 at 11:17:19AM +0800, xiaohui.xin@intel.com wrote:
> From: Xin Xiaohui <xiaohui.xin@intel.com>
> 
> Now in dev_gro_receive(), if frags[0].size is pulled to 0, memmove is called and
> the null page is released. But it's not enough, we should reset size of each frags
> left as well. Compared to this, we can have another way to do this, it's not do do
> anything at all.
> 
> Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>

This patch can only work if you audit everything that uses skb
frags to ensure that they can tolerate a zero-sided frag.

I think it's much easier to just fix the memmove.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [net-next-2.6 PATCH] ixgbevf: fix null pointer dereference due to filter being set for VLAN 0
From: David Miller @ 2010-08-03  4:21 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, gospo, bphilips, alexander.h.duyck
In-Reply-To: <20100803005849.4678.10583.stgit@localhost.localdomain>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Mon, 02 Aug 2010 17:59:04 -0700

> From: Alexander Duyck <alexander.h.duyck@intel.com>
> 
> This change corrects an issue that resulted in a null pointer dereference
> for the addition of VLAN 0 without any VLANs being registered.  Also this
> code removes some unnecessary checks for defines and the unnecessary setting
> of VLAN flags since that is now handled within the kernel via the
> vlan_features.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> Tested-by: Emil Tantilov <emil.s.tantilov@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Applied.

^ permalink raw reply

* Re: [net-next-2.6 PATCH] igb: Use irq_synchronize per vector when using MSI-X
From: David Miller @ 2010-08-03  4:21 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, gospo, bphilips, jdelvare, emil.s.tantilov
In-Reply-To: <20100803004044.4441.34335.stgit@localhost.localdomain>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Mon, 02 Aug 2010 17:40:52 -0700

> From: Emil Tantilov <emil.s.tantilov@intel.com>
> 
> Synchronize all IRQs when using MSI-X. Similar to ixgbe.
> Issue was reported on e1000e, but the patch is also valid for igb.
> 
> CC: Jean Delvare <jdelvare@suse.de>
> Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
> Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Applied.

^ permalink raw reply

* Re: [net-next-2.6 PATCH 3/3] e1000e: update to workaround for jumbo frames on 82577
From: David Miller @ 2010-08-03  4:21 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, gospo, bphilips, bruce.w.allan
In-Reply-To: <AANLkTi=TvRmXxJu1+y9B8x7G55Ur9wNhYzDZ8Q=+HJRJ@mail.gmail.com>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Mon, 2 Aug 2010 17:36:56 -0700

> On Mon, Aug 2, 2010 at 17:27, Jeff Kirsher <jeffrey.t.kirsher@intel.com> wrote:
>> From: Bruce Allan <bruce.w.allan@intel.com>
>>
>> For OEM systems with this part that also has Spread Spectrum Clocking (SSC)
>> enabled in the BIOS, there is an Rx performance issue with 4K jumbo frames.
>> Leaving the defaults in PHY page 770 register 26 resolves the issue, and
>> does not negatively impact jumbo frames on systems with SSC disabled.
>>
>> Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
>> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
>> ---
>>
>>  drivers/net/e1000e/netdev.c |    5 -----
>>  1 files changed, 0 insertions(+), 5 deletions(-)
>>
> 
> Please disregard this patch, it was sent out accidentally (my bad).
> During testing issues were found and changes need to be made to this
> patch.

Ok.

^ permalink raw reply

* Re: [net-next-2.6 PATCH 2/3] e1000e: Fix irq_synchronize in MSI-X case
From: David Miller @ 2010-08-03  4:21 UTC (permalink / raw)
  To: jeffrey.t.kirsher
  Cc: netdev, gospo, bphilips, jdelvare, jesse.brandeburg,
	bruce.w.allan
In-Reply-To: <20100803002721.4179.75916.stgit@localhost.localdomain>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Mon, 02 Aug 2010 17:27:23 -0700

> Based on original patch/work from Jean Delvare <jdelvare@suse.de>
> Synchronize all IRQs when in MSI-X IRQ mode.
> 
> Jean's original patch hard coded the sync with the 3 possible vectors,
> this patch incorporates more flexibility for the future and aligns
> with how igb stores the number of vectors into the adapter structure.
> 
> CC: Jean Delvare <jdelvare@suse.de>
> Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
> Acked-by: Bruce Allan <bruce.w.allan@intel.com>

Applied.

^ permalink raw reply

* Re: [net-next-2.6 PATCH 1/3] e1000e: register pm_qos request on hardware activation
From: David Miller @ 2010-08-03  4:21 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, gospo, bphilips, florian
In-Reply-To: <20100803002622.4179.31850.stgit@localhost.localdomain>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Mon, 02 Aug 2010 17:27:00 -0700

> From: Florian Mickler <florian@mickler.org>
> 
> The pm_qos_add_request call has to register the pm_qos request with the pm_qos
> susbsystem before first use of the pm_qos request via
> pm_qos_update_request.
> 
> As pm_qos changed to use plists there is no benefit in registering and
> unregistering the pm_qos request on ifup/ifdown and thus we move the
> registering into e1000_open and the unregistering in e1000_close.
> 
> This fixes the following warning:
 ...
> Signed-off-by: Florian Mickler <florian@mickler.org>
> Tested-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
> Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Applied.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox