Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: RFC: New BGF 'LOOP' instruction
From: Hagen Paul Pfeifer @ 2010-08-03  9:03 UTC (permalink / raw)
  To: David Miller; +Cc: leonerd, netdev
In-Reply-To: <20100802.221813.43045517.davem@davemloft.net>


On Mon, 02 Aug 2010 22:18:13 -0700 (PDT), David Miller wrote:



> Oh yeah, what is an iteration in your definition?  See this is why I

> totally refuse to add a looping construct to BPF.

> 

> If you just check for a single loop hitting, the user will just use

> a chaining of two looping constructs.  And then three levels of

> indirection, then four, etc.  He can run up to just before exhasting

> the "iteration limit" of one loop, and branch to the next one, and

> so on and so forth.



I am aware of any problems caused by complex instructions. David, I was

rather curious to see an unrecognized and ground breaking instructions, I

don't wanted to scotch any (possible) improvement.



HGN

^ permalink raw reply

* RE: [RFC PATCH v8 00/16] Provide a zero-copy method on KVM virtio-net.
From: Xin, Xiaohui @ 2010-08-03  8:48 UTC (permalink / raw)
  To: Shirley Ma
  Cc: netdev@vger.kernel.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, mst@redhat.com, mingo@elte.hu,
	davem@davemloft.net, herbert@gondor.apana.org.au,
	jdike@linux.intel.com
In-Reply-To: <1280442682.9058.15.camel@localhost.localdomain>

>-----Original Message-----
>From: Shirley Ma [mailto:mashirle@us.ibm.com]
>Sent: Friday, July 30, 2010 6:31 AM
>To: Xin, Xiaohui
>Cc: netdev@vger.kernel.org; kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
>mst@redhat.com; mingo@elte.hu; davem@davemloft.net; herbert@gondor.apana.org.au;
>jdike@linux.intel.com
>Subject: Re: [RFC PATCH v8 00/16] Provide a zero-copy method on KVM virtio-net.
>
>Hello Xiaohui,
>
>On Thu, 2010-07-29 at 19:14 +0800, xiaohui.xin@intel.com wrote:
>> The idea is simple, just to pin the guest VM user space and then
>> let host NIC driver has the chance to directly DMA to it.
>> The patches are based on vhost-net backend driver. We add a device
>> which provides proto_ops as sendmsg/recvmsg to vhost-net to
>> send/recv directly to/from the NIC driver. KVM guest who use the
>> vhost-net backend may bind any ethX interface in the host side to
>> get copyless data transfer thru guest virtio-net frontend.
>
>Since vhost-net already supports macvtap/tun backends, do you think
>whether it's better to implement zero copy in macvtap/tun than inducing
>a new media passthrough device here?
>
>> Our goal is to improve the bandwidth and reduce the CPU usage.
>> Exact performance data will be provided later.
>
>I did some vhost performance measurement over 10Gb ixgbe, and found that
>in order to get consistent BW results, netperf/netserver, qemu, vhost
>threads smp affinities are required.
>
>Looking forward to these results for small message size comparison. For
>large message size 10Gb ixgbe BW already reached by doing vhost smp
>affinity w/i offloading support, we will see how much CPU utilization it
>can be reduced.
>
>Please provide latency results as well. I did some experimental on
>macvtap zero copy sendmsg, what I have found that get_user_pages latency
>pretty high.
>
May you share me with your performance results (including BW and latency)on 
vhost-net and how you get them(your configuration and especially with the affinity 
settings)?

Thanks
Xiaohui

>Thanks
>Shirley
>
>
>


^ permalink raw reply

* Re: [PATCH v3 1/2] core: Factor out flow calculation from get_rps_cpu
From: Arnd Bergmann @ 2010-08-03  8:32 UTC (permalink / raw)
  To: Changli Gao; +Cc: Krishna Kumar2, bhutchings, davem, mst, netdev, therbert
In-Reply-To: <AANLkTimP_0e5UMC7mgn9=g=kyhqaghdEZC61=Tx7OesP@mail.gmail.com>

On Tuesday 03 August 2010, Changli Gao wrote:
> > I am confused by the call sites of macvlan_dev.receive and
> > macvlan_dev.forward. They both are possible to be called in both
> > RX(skb->data points to network header) and TX(skb->data points to
> > ethernet) paths. The current code in macvtap shows that
> > macvlan_dev.receive should be called in network layer, and
> > macvlan_dev.forward should be called in dev layer. Am I correct?
> >
> 
> After checking the code carefully, I find macvlan_dev.receive is
> called in network layer(RX path), and macvlan_dev.forward is called in
> dev layer(TX path). The current code hasn't any issue. Your solution
> won't work, as macvtap_forward is called in dev layer, when skb->data
> points to ethernet header already.

Yes, that is correct. Forward is used for "bridge" mode, where we
can send data between two macvlan/macvtap endpoints directly, as
opposed to the default "vepa" mode, where all data is always sent
out to the lowerdev and may be returned by an adjacent switch in
hairpin configuration.

	Arnd

^ permalink raw reply

* Re: [PATCH] [Bug 16494] NFS client over TCP hangs due to packet loss
From: David Miller @ 2010-08-03  8:21 UTC (permalink / raw)
  To: andyc.bluearc
  Cc: kuznet, pekkas, jmorris, yoshfuji, kaber, eric.dumazet,
	William.Allen.Simpson, gilad, ilpo.jarvinen, netdev, linux-kernel,
	akpm
In-Reply-To: <4c57cfe8.887b0e0a.2f79.4772@mx.google.com>

From: "Andy Chittenden" <andyc.bluearc@gmail.com>
Date: Tue, 3 Aug 2010 09:14:31 +0100

> I don't know whether this patch is the correct fix or not but it enables the
> NFS client to recover.
> 
> Kernel version: 2.6.34.1 and 2.6.32.
> 
> Fixes <https://bugzilla.kernel.org/show_bug.cgi?id=16494>. It clears down
> any previous shutdown attempts so that reconnects on a socket that's been
> shutdown leave the socket in a usable state (otherwise tcp_sendmsg() returns
> -EPIPE).

If the SunRPC code wants to close a TCP socket then use it again,
it should disconnect by doing a connect() with sa_family == AF_UNSPEC

^ permalink raw reply

* [PATCH] [Bug 16494] NFS client over TCP hangs due to packet loss
From: Andy Chittenden @ 2010-08-03  8:14 UTC (permalink / raw)
  To: 'David S. Miller', 'Alexey Kuznetsov',
	'Pekka Savola (ipv6', 'James Morris',
	"'Hideaki Y
  Cc: akpm

I don't know whether this patch is the correct fix or not but it enables the
NFS client to recover.

Kernel version: 2.6.34.1 and 2.6.32.

Fixes <https://bugzilla.kernel.org/show_bug.cgi?id=16494>. It clears down
any previous shutdown attempts so that reconnects on a socket that's been
shutdown leave the socket in a usable state (otherwise tcp_sendmsg() returns
-EPIPE).

# diff -up /home/company/software/src/linux-2.6.34.1/net/ipv4/tcp_output.c
net/ipv4
--- /home/company/software/src/linux-2.6.34.1/net/ipv4/tcp_output.c
2010-07-27 08:46:46.917000000 +0100
+++ net/ipv4/tcp_output.c       2010-07-27 09:19:16.000000000 +0100
@@ -2522,6 +2522,13 @@ static void tcp_connect_init(struct sock
        struct tcp_sock *tp = tcp_sk(sk);
        __u8 rcv_wscale;

+       /* clear down any previous shutdown attempts so that
+        * reconnects on a socket that's been shutdown leave the
+        * socket in a usable state (otherwise tcp_sendmsg() returns
+        * -EPIPE).
+        */
+       sk->sk_shutdown = 0;
+
        /* We'll fix this up when we get a response from the other end.
         * See tcp_input.c:tcp_rcv_state_process case TCP_SYN_SENT.
         */

Signed-off-by: Andy Chittenden <andyc.bluearc@gmail.com>




^ permalink raw reply

* Re: [PATCH] can-raw: Fix skb_orphan_try handling
From: David Miller @ 2010-08-03  7:30 UTC (permalink / raw)
  To: socketcan
  Cc: eric.dumazet, patrick.ohly, netdev, socketcan-core,
	matthias.fuchs
In-Reply-To: <4C555168.3090800@hartkopp.net>

From: Oliver Hartkopp <socketcan@hartkopp.net>
Date: Sun, 01 Aug 2010 12:50:16 +0200

> On 01.08.2010 10:03, David Miller wrote:
>> From: Oliver Hartkopp <socketcan@hartkopp.net>
>> Date: Fri, 30 Jul 2010 11:44:27 +0200
>> 
>>> Hello Eric, hello Patrick,
>>>
>>> Commit fc6055a5ba31e2c14e36e8939f9bf2b6d586a7f5 (net: Introduce
>>> skb_orphan_try()) allows an early orphan of the skb and takes care on
>>> tx timestamping, which needs the sk-reference in the skb on driver level.
>>> So does the can-raw socket, which has not been taken into account here.
>>>
>>> The patch below adds a 'prevent_sk_orphan' bit in the skb tx shared info,
>>> which fixes the problem discovered by Matthias Fuchs here:
>>>
>>>       http://marc.info/?t=128030411900003&r=1&w=2
>> 
>> Your patch sets this new value, but I never see it getting tested anywhere.
>> 
>> How does this work?
> 
> 
> The flags are tested all together in skb_orphan_try() ...

This is why I hate using unions in situations like this... it makes
code impossible to audit easily.

This damn thing should just be a "u8 flags" and a bunch of bit mask
CPP macro defines for the various boolean values.

Anyways, I'll apply your patch thanks.

^ permalink raw reply

* Re: softirq warnings when calling dev_kfree_skb_irq - bug in conntrack?
From: David Miller @ 2010-08-03  7:23 UTC (permalink / raw)
  To: johannes
  Cc: jeremy, Xen-devel, Ian.Campbell, eric.dumazet, netdev,
	dongxiao.xu, kaber
In-Reply-To: <1280819074.3874.8.camel@jlt3.sipsolutions.net>

From: Johannes Berg <johannes@sipsolutions.net>
Date: Tue, 03 Aug 2010 09:04:34 +0200

> I had this too:
> http://article.gmane.org/gmane.linux.network/167590
> 
> But I'm not convinced it's conntrack, I'd think it's
> 
> commit 15e83ed78864d0625e87a85f09b297c0919a4797
> Author: Eric Dumazet <eric.dumazet@gmail.com>
> Date:   Wed May 19 23:16:03 2010 +0000
> 
>     net: remove zap_completion_queue
> 
> which, from the looks of it, ought to be reverted because it failed to
> take into account that dev_kfree_skb() can do more things that require
> non-irq-context than just calling skb->destructor, like for instance the
> conntrack thing we see here.

Agreed.  I'll revert this and queue that up for 2.6.35-stable

^ permalink raw reply

* Re: [PATCH v3 1/2] core: Factor out flow calculation from get_rps_cpu
From: Changli Gao @ 2010-08-03  7:18 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: arnd, bhutchings, davem, mst, netdev, therbert
In-Reply-To: <AANLkTikJMih-CAA_d95ot0z57EOhMFsiNc2nnki6v6LY@mail.gmail.com>

On Tue, Aug 3, 2010 at 2:11 PM, Changli Gao <xiaosuo@gmail.com> wrote:
> On Tue, Aug 3, 2010 at 1:57 PM, Krishna Kumar2 <krkumar2@in.ibm.com> wrote:
>> Hi Changli,
>>
>> Good catch.
>>
>> Instead of adding support for ethernet header or pull/push,
>> I could defer the skb_push(ETH_HLEN), something like:
>>
>> static int macvtap_forward(struct net_device *dev, struct sk_buff *skb)
>> {
>>        struct macvtap_queue *q = macvtap_get_queue(dev, skb);
>>        if (!q)
>>                goto drop;
>>
>>        if (skb_queue_len(&q->sk.sk_receive_queue) >= dev->tx_queue_len)
>>                goto drop;
>>
>> +       skb_push(skb, ETH_HLEN);
>>        ...
>> }
>>
>> and remove the same in macvtap_receive. Will this be better?
>>
>
> I am confused by the call sites of macvlan_dev.receive and
> macvlan_dev.forward. They both are possible to be called in both
> RX(skb->data points to network header) and TX(skb->data points to
> ethernet) paths. The current code in macvtap shows that
> macvlan_dev.receive should be called in network layer, and
> macvlan_dev.forward should be called in dev layer. Am I correct?
>

After checking the code carefully, I find macvlan_dev.receive is
called in network layer(RX path), and macvlan_dev.forward is called in
dev layer(TX path). The current code hasn't any issue. Your solution
won't work, as macvtap_forward is called in dev layer, when skb->data
points to ethernet header already.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* Re: RFC: New BGF 'LOOP' instruction
From: David Miller @ 2010-08-03  7:19 UTC (permalink / raw)
  To: leonerd; +Cc: netdev, hagen
In-Reply-To: <20100803070709.GO11110@cel.leo>

From: Paul LeoNerd Evans <leonerd@leonerd.org.uk>
Date: Tue, 3 Aug 2010 08:07:10 +0100

> On Mon, Aug 02, 2010 at 10:18:13PM -0700, David Miller wrote:
>> 1) The limiting scheme will make legitimate scripts USELESS
> 
> Rightnow, BPF is all but useless for parsing, say, IPv6. I only pick
> IPv6 as one example, I'm sure there must exist a great number more
> packet-based protocols that use a "linked-list" style approach to
> headers. None of those are currently filterable on the current set of
> instructions. LOOP would allow these.

It's not meant for detailed packet protocol header analysis,
it's for stateless straight line matching of masked values
in packet headers.

Nothing more.

^ permalink raw reply

* Re: RFC: New BPF 'LOOP' instruction
From: Paul LeoNerd Evans @ 2010-08-03  7:18 UTC (permalink / raw)
  To: Hagen Paul Pfeifer, netdev
In-Reply-To: <20100802201629.GA5973@nuttenaction>

[-- Attachment #1: Type: text/plain, Size: 5159 bytes --]

On Mon, Aug 02, 2010 at 10:16:29PM +0200, Hagen Paul Pfeifer wrote:
> In general: BPF was constructed to address filters rules in a generic manner
> and BPF does not contain any special protocol specific optimization - nor any
> sophisticated connection tracking functionality. In general you should
> pre-filter unneeded packets and shift the real high level filtering to some
> post-processing step. tcpdump filter capabilities are limited and where never
> designed to filter _any_ traffic. For example: you are lost if you want to match
> transport layer fields like port number where the underlying IPv{4,6} packet
> is fragmented.

Oh, I am quite aware of the futility in trying to, for example, match up
IPv4 fragments.

There's nothing about my suggestion that is in any way IPv6-specific. I
used IPv6 simply as an example to motivate the suggestion. It could
quite easily apply to any other sort of protocol that uses a linked-list
of headers.

> Furthermore, I doubt that the loop provides any significant advantages. 
> IPv6 extension header parsing is not that straight forward. For example 
> check  the IPSec AH Extension header where the extension header length 
> must interpreted differently because of a IPSec AH protcol defect. I assume
> that a straight forward pcap encoded BPF opcode (composed of jump and load
> instructions) is more efficient as an highly flexible loop construct. 

I'm not sure I follow your logic here.

By my understanding, pcap's IPv6 header parsing filter is a 6-times
statically-unrolled loop, where each loop body has to parse some
headers. I'm already aware that various headers are hard to parse.

Allow me some pseudocode... Currently, pcap has to do the equivalent of:

X = 0
Look at header at [X]; if it's what we want goto 'got'; else load A with
its length.
X += A
Look at header at [X]; if it's what we want goto 'got'; else load A with
its length.
X += A
Look at header at [X]; if it's what we want goto 'got'; else load A with
its length.
X += A
Look at header at [X]; if it's what we want goto 'got'; else load A with
its length.
X += A
Look at header at [X]; if it's what we want goto 'got'; else load A with
its length.
X += A
Look at header at [X]; if it's what we want goto 'got'; else load A with
its length.
X += A
got:
  ... continue with filter.

That "load A with its length" is the IPv6-specific part; I'm not
suggesting that my LOOP suggestion in any way helps that. It's a
difficult problem, sure. What I _am_ suggesting is that this static
unrolling can be avoided, instead becoming:

X = 0
start:
Look at header at [X]; if it's what we want goto 'got'; else load A with
its length.
LOOP to start
got:
  ... continue with filter.

This results also in a shorter program, because there is a hard limit on
the total number of instructions in a filter.

> Last but not least I am interested in a RFC patch as well as a pcap patch (see
> pcap-opt.c). You should not underrate the effort to generate an generic IPv6
> extension header opcode optimizer - without this the newly introduced opcode
> is pointless.

As above; I was under the impression that pcap already -does- contain
code to have a reasonable attempt to hunt down the requested IPv6
header, which is what implements "ipv6 protochain". I'll quote from
pcap-filter(7):

       ip6 protochain protocol
              True  if the packet is IPv6 packet, and contains protocol header
              with type protocol in its protocol header chain.  For example,
                   ip6 protochain 6
              matches any IPv6 packet with TCP protocol header in the protocol
              header  chain.  The packet may contain, for example, authentica‐
              tion  header,  routing  header,  or  hop-by-hop  option  header,
              between  IPv6  header  and  TCP header.  The BPF code emitted by
              this primitive is complex and cannot be  optimized  by  the  BPF
              optimizer code, so this can be somewhat slow.

> PS: the LOOP opcode must be secure against any ressource attack -> the loop
> must be break after n iterations.

Which is -exactly- what it does. I'll quote my original:

       X += A.
       If X < len, jump backwards jt instructions.
       Otherwise, fallthrough to the next instruction
  ...
  The intention of this instruction is to be able to implement a loop in
  which successive iterations advance the index register along the packet
  buffer. By comparing X to the packet length, we can bound the running
  time of the loop instruction, avoiding it locking up the kernel. By
  banning STX instructions within the body of the loop, we can ensure that
  X must be a strictly monotonically increasing sequence. At absolute
  worst, X is increased by 1 each time, meaning at worst the body of the
  loop must execute for every byte in the packet.

Is this sufficiently secure, or do you suggest a further limit is
required?

-- 
Paul "LeoNerd" Evans

leonerd@leonerd.org.uk
ICQ# 4135350       |  Registered Linux# 179460
http://www.leonerd.org.uk/

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply

* Re: RFC: New BGF 'LOOP' instruction
From: David Miller @ 2010-08-03  7:18 UTC (permalink / raw)
  To: leonerd; +Cc: netdev
In-Reply-To: <20100803070426.GN11110@cel.leo>

From: Paul LeoNerd Evans <leonerd@leonerd.org.uk>
Date: Tue, 3 Aug 2010 08:04:27 +0100

> On Mon, Aug 02, 2010 at 10:13:41PM -0700, David Miller wrote:
>> > Any comments on this, while I proceed? Barring any major complaints,
>> > I'll have a hack at some code and present a patch in due course...
>> 
>> We're not adding loop instructions, it's just asking for trouble
>> since any user can attach BPF filters to a socket and it's just
>> way too easy to make a loop endless.
>> 
>> There's a reason no loop primitives were added to the original
>> BPF specification, perhaps you should take a look at what their
>> reasoning was.
> 
> Yes. I am very aware of that.
> 
> Please read carefully my suggestion. These loops cannot be made endless
> - they will be bounded by, at most, the number of bytes in the packet
> buffer. The loop is required to increment X at least 1 at every
> iteration, and will not allow it to continue past the end of the packet.
> This puts a strict bound on the runtime of the loop.

That makes the looping construct largely useless, which I mentioned in
my second reply to this thread.


^ permalink raw reply

* Re: RFC: New BGF 'LOOP' instruction
From: Paul LeoNerd Evans @ 2010-08-03  7:07 UTC (permalink / raw)
  To: David Miller, netdev; +Cc: hagen
In-Reply-To: <20100802.221813.43045517.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 1164 bytes --]

On Mon, Aug 02, 2010 at 10:18:13PM -0700, David Miller wrote:
> If you just check for a single loop hitting, the user will just use
> a chaining of two looping constructs.  And then three levels of
> indirection, then four, etc.  He can run up to just before exhasting
> the "iteration limit" of one loop, and branch to the next one, and
> so on and so forth.

And this is why part of my suggestion bans the use of a LOOP
instruction within the "body" of another, such that they cannot nest.

> There are probably a million ways to exploit this, and once you come
> up with a validation or limiting scheme one of two things will happen:
> 
> 1) The limiting scheme will make legitimate scripts USELESS

Rightnow, BPF is all but useless for parsing, say, IPv6. I only pick
IPv6 as one example, I'm sure there must exist a great number more
packet-based protocols that use a "linked-list" style approach to
headers. None of those are currently filterable on the current set of
instructions. LOOP would allow these.

-- 
Paul "LeoNerd" Evans

leonerd@leonerd.org.uk
ICQ# 4135350       |  Registered Linux# 179460
http://www.leonerd.org.uk/

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply

* Re: softirq warnings when calling dev_kfree_skb_irq - bug in conntrack?
From: Johannes Berg @ 2010-08-03  7:04 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: NetDev, Xu, Dongxiao, Xen-devel@lists.xensource.com, Ian Campbell,
	Patrick McHardy, Eric Dumazet
In-Reply-To: <4C571476.7070301@goop.org>

On Mon, 2010-08-02 at 11:54 -0700, Jeremy Fitzhardinge wrote:

> ------------[ cut here ]------------
> WARNING: at kernel/softirq.c:143 local_bh_enable+0x40/0x87()
> Modules linked in: xt_state dm_mirror dm_region_hash dm_log microcode [last unloaded: scsi_wait_scan]
> Pid: 0, comm: swapper Not tainted 2.6.35-rc6-next-20100729+ #29
> Call Trace:
>   <IRQ>   [<ffffffff81030de3>] warn_slowpath_common+0x80/0x98
>   [<ffffffff81030e10>] warn_slowpath_null+0x15/0x17
>   [<ffffffff81035ff3>] local_bh_enable+0x40/0x87
>   [<ffffffff814236e5>] destroy_conntrack+0x78/0x9e
>   [<ffffffff810bea55>] ? __kmalloc_track_caller+0xc3/0x135
>   [<ffffffff814203b4>] nf_conntrack_destroy+0x16/0x18
>   [<ffffffff813fadee>] skb_release_head_state+0x97/0xd9
>   [<ffffffff813fabbe>] __kfree_skb+0x11/0x7a
>   [<ffffffff813fac4e>] consume_skb+0x27/0x29
>   [<ffffffff81402d3a>] dev_kfree_skb_irq+0x18/0x62
>   [<ffffffff8130a762>] xennet_tx_buf_gc+0xfc/0x192
>   [<ffffffff8130a8fb>] smart_poll_function+0x50/0x121
>   [<ffffffff8130a8ab>] ? smart_poll_function+0x0/0x121
>   [<ffffffff8104b8d1>] __run_hrtimer+0xcc/0x127
>   [<ffffffff8104bad3>] hrtimer_interrupt+0x9c/0x17b

> It seems the basic problem is that xennet_tx_buf_gc() is being called in 
> interrupt context - with smartpoll it's from the timer interrupt, but 
> even without it is being called from xennet_interrupt(), which in turn 
> calls dev_kfree_skb_irq().
> 
> Since this should be perfectly OK, it appears the problem is actually in 
> conntrack.  I'm not sure where this bug started happening, but its 
> relatively recently I think.

I had this too:
http://article.gmane.org/gmane.linux.network/167590

But I'm not convinced it's conntrack, I'd think it's

commit 15e83ed78864d0625e87a85f09b297c0919a4797
Author: Eric Dumazet <eric.dumazet@gmail.com>
Date:   Wed May 19 23:16:03 2010 +0000

    net: remove zap_completion_queue

which, from the looks of it, ought to be reverted because it failed to
take into account that dev_kfree_skb() can do more things that require
non-irq-context than just calling skb->destructor, like for instance the
conntrack thing we see here.

johannes


^ permalink raw reply

* Re: RFC: New BGF 'LOOP' instruction
From: Paul LeoNerd Evans @ 2010-08-03  7:04 UTC (permalink / raw)
  To: David Miller, netdev
In-Reply-To: <20100802.221341.137851732.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 1046 bytes --]

On Mon, Aug 02, 2010 at 10:13:41PM -0700, David Miller wrote:
> > Any comments on this, while I proceed? Barring any major complaints,
> > I'll have a hack at some code and present a patch in due course...
> 
> We're not adding loop instructions, it's just asking for trouble
> since any user can attach BPF filters to a socket and it's just
> way too easy to make a loop endless.
> 
> There's a reason no loop primitives were added to the original
> BPF specification, perhaps you should take a look at what their
> reasoning was.

Yes. I am very aware of that.

Please read carefully my suggestion. These loops cannot be made endless
- they will be bounded by, at most, the number of bytes in the packet
buffer. The loop is required to increment X at least 1 at every
iteration, and will not allow it to continue past the end of the packet.
This puts a strict bound on the runtime of the loop.

-- 
Paul "LeoNerd" Evans

leonerd@leonerd.org.uk
ICQ# 4135350       |  Registered Linux# 179460
http://www.leonerd.org.uk/

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply

* Re: Is it a possible bug in dev_gro_receive()?
From: Jarek Poplawski @ 2010-08-03  6:45 UTC (permalink / raw)
  To: Xin, Xiaohui
  Cc: netdev@vger.kernel.org, herbert@gondor.apana.org.au,
	davem@davemloft.net
In-Reply-To: <F2E9EB7348B8264F86B6AB8151CE2D791BAB3078E0@shsmsx502.ccr.corp.intel.com>

On Tue, Aug 03, 2010 at 10:33:24AM +0800, Xin, Xiaohui wrote:
> >-----Original Message-----
> >From: Jarek Poplawski [mailto:jarkao2@gmail.com]
> >Sent: Monday, August 02, 2010 6:29 PM
> >To: Xin, Xiaohui
> >Cc: netdev@vger.kernel.org; herbert@gondor.apana.org.au; davem@davemloft.net
> >Subject: Re: Is it a possible bug in dev_gro_receive()?
> >
> >Xin Xiaohui wrote:
> >> I looked into the code dev_gro_receive(), found the code here:
> >> if the frags[0] is pulled to 0, then the page will be released,
> >> and memmove() frags left.
> >> Is that right? I'm not sure if memmove do right or not, but
> >> frags[0].size is never set after memove at least. what I think
> >> a simple way is not to do anything if we found frags[0].size == 0.
> >> The patch is as followed.
> >>
> >> Or am I missing something here?
> >
> >I think, you're right, but fixing memmove looks nicer to me:
> >
> > -	--skb_shinfo(skb)->nr_frags);
> > +	--skb_shinfo(skb)->nr_frags * sizeof(skb_frag_t));
> >
> >Jarek P.
> 
> Is there a little hurt of performance to do memmove() if skb_shinfo(skb)->nr_frags is large?
> We're now working on the zero-copy patches based on napi_gro_frags() interface, and in 
> this case, we have found a lot of skbs which frags[0] is pulled to 0. And after the memmove is
> fixed, each frags[x].size is needed to modify too.
> So I think don't do anything is better. Or is there any side effect with a null page in the stack?

Even if it's better, generally you should separate fixes from
optimizations. On the other hand, it was expected to be "unlikely" by
design, so you should probably explain more why it has to be changed
here too.

Thanks,
Jarek P.

> 
> Thanks
> Xiaohui
> >
> >>
> >> ---
> >>  net/core/dev.c |    7 -------
> >>  1 files changed, 0 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/net/core/dev.c b/net/core/dev.c
> >> index 264137f..28cdbbf 100644
> >> --- a/net/core/dev.c
> >> +++ b/net/core/dev.c
> >> @@ -2730,13 +2730,6 @@ pull:
> >>
> >>  		skb_shinfo(skb)->frags[0].page_offset += grow;
> >>  		skb_shinfo(skb)->frags[0].size -= grow;
> >> -
> >> -		if (unlikely(!skb_shinfo(skb)->frags[0].size)) {
> >> -			put_page(skb_shinfo(skb)->frags[0].page);
> >> -			memmove(skb_shinfo(skb)->frags,
> >> -				skb_shinfo(skb)->frags + 1,
> >> -				--skb_shinfo(skb)->nr_frags);
> >> -		}
> >>  	}
> >>
> >>  ok:
> >
> >
> 

^ permalink raw reply

* Re: [PATCH v3 1/2] core: Factor out flow calculation from get_rps_cpu
From: Changli Gao @ 2010-08-03  6:11 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: arnd, bhutchings, davem, mst, netdev, therbert
In-Reply-To: <OF46AA822C.08F8EAED-ON65257774.001F5AE3-65257774.00209BBF@in.ibm.com>

On Tue, Aug 3, 2010 at 1:57 PM, Krishna Kumar2 <krkumar2@in.ibm.com> wrote:
> Hi Changli,
>
> Good catch.
>
> Instead of adding support for ethernet header or pull/push,
> I could defer the skb_push(ETH_HLEN), something like:
>
> static int macvtap_forward(struct net_device *dev, struct sk_buff *skb)
> {
>        struct macvtap_queue *q = macvtap_get_queue(dev, skb);
>        if (!q)
>                goto drop;
>
>        if (skb_queue_len(&q->sk.sk_receive_queue) >= dev->tx_queue_len)
>                goto drop;
>
> +       skb_push(skb, ETH_HLEN);
>        ...
> }
>
> and remove the same in macvtap_receive. Will this be better?
>

I am confused by the call sites of macvlan_dev.receive and
macvlan_dev.forward. They both are possible to be called in both
RX(skb->data points to network header) and TX(skb->data points to
ethernet) paths. The current code in macvtap shows that
macvlan_dev.receive should be called in network layer, and
macvlan_dev.forward should be called in dev layer. Am I correct?


-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* Re: [PATCH v3 1/2] core: Factor out flow calculation from get_rps_cpu
From: Krishna Kumar2 @ 2010-08-03  5:57 UTC (permalink / raw)
  To: Changli Gao; +Cc: arnd, bhutchings, davem, mst, netdev, therbert
In-Reply-To: <AANLkTikNOstJr3qqMUqtUoOm1xHOjw5uM7NpQUNJz4PR@mail.gmail.com>

Hi Changli,

Good catch.

Instead of adding support for ethernet header or pull/push,
I could defer the skb_push(ETH_HLEN), something like:

static int macvtap_forward(struct net_device *dev, struct sk_buff *skb)
{
	struct macvtap_queue *q = macvtap_get_queue(dev, skb);
	if (!q)
		goto drop;

	if (skb_queue_len(&q->sk.sk_receive_queue) >= dev->tx_queue_len)
		goto drop;

+	skb_push(skb, ETH_HLEN);
	...
}

and remove the same in macvtap_receive. Will this be better?

Your other suggestions also looks good.

Thanks,

- KK

Changli Gao <xiaosuo@gmail.com> wrote on 08/03/2010 09:35:34 AM:

> Changli Gao <xiaosuo@gmail.com>
> 08/03/2010 09:35 AM
>
> To
>
> Krishna Kumar2/India/IBM@IBMIN
>
> cc
>
> davem@davemloft.net, arnd@arndb.de, bhutchings@solarflare.com,
> netdev@vger.kernel.org, therbert@google.com, mst@redhat.com
>
> Subject
>
> Re: [PATCH v3 1/2] core: Factor out flow calculation from get_rps_cpu
>
> On Tue, Aug 3, 2010 at 11:02 AM, Krishna Kumar <krkumar2@in.ibm.com>
wrote:
> > From: Krishna Kumar <krkumar2@in.ibm.com>
> >
> > Factor out flow calculation code from get_rps_cpu, since macvtap
> > driver can use the same code.
> >
> > Revisions:
> >
> > v2 - Ben: Separate flow calcuation out and use in select queue
> > v3 - Arnd: Don't re-implement MIN
> >
> > Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
> > ---
> >  include/linux/netdevice.h |    1
> >  net/core/dev.c            |   94 ++++++++++++++++++++++--------------
> >  2 files changed, 59 insertions(+), 36 deletions(-)
> >
> > diff -ruNp org/include/linux/netdevice.h new/include/linux/netdevice.h
> > --- org/include/linux/netdevice.h       2010-08-03 08:19:57.000000000
+0530
> > +++ new/include/linux/netdevice.h       2010-08-03 08:19:57.000000000
+0530
> > @@ -2253,6 +2253,7 @@ static inline const char *netdev_name(co
> >        return dev->name;
> >  }
> >
> > +extern int skb_calculate_flow(struct net_device *dev, struct sk_buff
*skb);
> >  extern int netdev_printk(const char *level, const struct net_device
*dev,
> >                         const char *format, ...)
> >        __attribute__ ((format (printf, 3, 4)));
> > diff -ruNp org/net/core/dev.c new/net/core/dev.c
> > --- org/net/core/dev.c  2010-08-03 08:19:57.000000000 +0530
> > +++ new/net/core/dev.c  2010-08-03 08:19:57.000000000 +0530
> > @@ -2263,51 +2263,24 @@ static inline void ____napi_schedule(str
> >        __raise_softirq_irqoff(NET_RX_SOFTIRQ);
> >  }
> >
> > -#ifdef CONFIG_RPS
> > -
> > -/* One global table that all flow-based protocols share. */
> > -struct rps_sock_flow_table *rps_sock_flow_table __read_mostly;
> > -EXPORT_SYMBOL(rps_sock_flow_table);
> > -
> >  /*
> > - * get_rps_cpu is called from netif_receive_skb and returns the target
> > - * CPU from the RPS map of the receiving queue for a given skb.
> > - * rcu_read_lock must be held on entry.
> > + * skb_calculate_flow: calculate a flow hash based on src/dst
addresses
> > + * and src/dst port numbers. On success, returns a hash number (> 0),
> > + * otherwise -1.
> >  */
> > -static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
> > -                      struct rps_dev_flow **rflowp)
> > +int skb_calculate_flow(struct net_device *dev, struct sk_buff *skb)
> >  {
> > +       int hash = skb->rxhash;
> >        struct ipv6hdr *ip6;
> >        struct iphdr *ip;
> > -       struct netdev_rx_queue *rxqueue;
> > -       struct rps_map *map;
> > -       struct rps_dev_flow_table *flow_table;
> > -       struct rps_sock_flow_table *sock_flow_table;
> > -       int cpu = -1;
> >        u8 ip_proto;
> > -       u16 tcpu;
> >        u32 addr1, addr2, ihl;
> >        union {
> >                u32 v32;
> >                u16 v16[2];
> >        } ports;
> >
> > -       if (skb_rx_queue_recorded(skb)) {
> > -               u16 index = skb_get_rx_queue(skb);
> > -               if (unlikely(index >= dev->num_rx_queues)) {
> > -                       WARN_ONCE(dev->num_rx_queues > 1, "%s
> received packet "
> > -                               "on queue %u, but number of RX
> queues is %u\n",
> > -                               dev->name, index, dev->num_rx_queues);
> > -                       goto done;
> > -               }
> > -               rxqueue = dev->_rx + index;
> > -       } else
> > -               rxqueue = dev->_rx;
> > -
> > -       if (!rxqueue->rps_map && !rxqueue->rps_flow_table)
> > -               goto done;
> > -
> > -       if (skb->rxhash)
> > +       if (hash)
> >                goto got_hash; /* Skip hash computation on packet header
*/
> >
> >        switch (skb->protocol) {
> > @@ -2334,6 +2307,7 @@ static int get_rps_cpu(struct net_device
> >        default:
> >                goto done;
> >        }
> > +
> >        switch (ip_proto) {
> >        case IPPROTO_TCP:
> >        case IPPROTO_UDP:
> > @@ -2356,11 +2330,59 @@ static int get_rps_cpu(struct net_device
> >        /* get a consistent hash (same value on both flow directions) */
> >        if (addr2 < addr1)
> >                swap(addr1, addr2);
> > -       skb->rxhash = jhash_3words(addr1, addr2, ports.v32, hashrnd);
> > -       if (!skb->rxhash)
> > -               skb->rxhash = 1;
> > +
> > +       hash = jhash_3words(addr1, addr2, ports.v32, hashrnd);
> > +       if (!hash)
> > +               hash = 1;
> >
> >  got_hash:
> > +       return hash;
> > +
> > +done:
> > +       return -1;
> > +}
> > +EXPORT_SYMBOL(skb_calculate_flow);
>
> I have noticed that you use skb_calculate_flow() in
> macvtap_get_queue() where skb->data doesn't point to the network
> header but the ethernet header. However, skb_calculate_flow() assume
> skb->data points to the network header. There are two choices:
>  * update skb_calculate_flow to support called in ethernet layer.
>  * pull skb before skb_calculate_flow, and push skb after
> skb_calculate_flow() in macvtap_get_queue().
>
> I prefer the former way.
>
> BTW: the function name skb_calculate_flow isn't good. How about
> skb_get_rxhash(). Maybe we can implement two versions: fast path and
> slow path. And implement the fast path version as a inline function in
> skbuff.h.
>
> static inline u32 skb_get_rxhash(struct sk_buff *skb)
> {
>         u32 rxhash;
>
>         rxhash = skb->rxhash;
>         if (!rxhash)
>                 return __skb_get_rxhash(skb);
>         return rxhash;
> }
>
>
> > +
> > +#ifdef CONFIG_RPS
> > +
> > +/* One global table that all flow-based protocols share. */
> > +struct rps_sock_flow_table *rps_sock_flow_table __read_mostly;
> > +EXPORT_SYMBOL(rps_sock_flow_table);
> > +
> > +/*
> > + * get_rps_cpu is called from netif_receive_skb and returns the target
> > + * CPU from the RPS map of the receiving queue for a given skb.
> > + * rcu_read_lock must be held on entry.
> > + */
> > +static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
> > +                      struct rps_dev_flow **rflowp)
> > +{
> > +       struct netdev_rx_queue *rxqueue;
> > +       struct rps_map *map;
> > +       struct rps_dev_flow_table *flow_table;
> > +       struct rps_sock_flow_table *sock_flow_table;
> > +       int cpu = -1;
> > +       u16 tcpu;
> > +
> > +       if (skb_rx_queue_recorded(skb)) {
> > +               u16 index = skb_get_rx_queue(skb);
> > +               if (unlikely(index >= dev->num_rx_queues)) {
> > +                       WARN_ONCE(dev->num_rx_queues > 1, "%s
> received packet "
> > +                               "on queue %u, but number of RX
> queues is %u\n",
> > +                               dev->name, index, dev->num_rx_queues);
> > +                       goto done;
> > +               }
> > +               rxqueue = dev->_rx + index;
> > +       } else
> > +               rxqueue = dev->_rx;
> > +
> > +       if (!rxqueue->rps_map && !rxqueue->rps_flow_table)
> > +               goto done;
> > +
> > +       skb->rxhash = skb_calculate_flow(dev, skb);
> > +       if (skb->rxhash < 0)
> > +               goto done;
> > +
> >        flow_table = rcu_dereference(rxqueue->rps_flow_table);
> >        sock_flow_table = rcu_dereference(rps_sock_flow_table);
> >        if (flow_table && sock_flow_table) {
> > --
> > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
>
>
>
> --
> Regards,
> Changli Gao(xiaosuo@gmail.com)


^ permalink raw reply

* reminder: bug fixes only
From: David Miller @ 2010-08-03  5:48 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA; +Cc: linux-wireless-u79uwXL29TY76Z2rM5mHXA


2.6.35 is out, we're about to enter the merge window, so
if it isn't already in my tree I don't want to see it unless
it's a bug fix.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] net: cleanup inclusion
From: David Miller @ 2010-08-03  5:45 UTC (permalink / raw)
  To: xiaosuo; +Cc: netdev
In-Reply-To: <1280787101-5589-1-git-send-email-xiaosuo@gmail.com>

From: Changli Gao <xiaosuo@gmail.com>
Date: Tue,  3 Aug 2010 06:11:41 +0800

> Commit ab95bfe01f9872459c8678572ccadbf646badad0 replaces bridge and macvlan
> hooks in __netif_receive_skb(), so dev.c doesn't need to include their headers.
> 
> Signed-off-by: Changli Gao <xiaosuo@gmail.com>

Applied, thanks.

^ permalink raw reply

* [PATCH] net: cleanup inclusion
From: Changli Gao @ 2010-08-02 22:11 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Changli Gao

Commit ab95bfe01f9872459c8678572ccadbf646badad0 replaces bridge and macvlan
hooks in __netif_receive_skb(), so dev.c doesn't need to include their headers.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
----
 net/core/dev.c |    2 --
 1 file changed, 2 deletions(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index 5d1282d..8c663db 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -101,8 +101,6 @@
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
 #include <linux/stat.h>
-#include <linux/if_bridge.h>
-#include <linux/if_macvlan.h>
 #include <net/dst.h>
 #include <net/pkt_sched.h>
 #include <net/checksum.h>

^ permalink raw reply related

* Re: RFC: New BGF 'LOOP' instruction
From: David Miller @ 2010-08-03  5:18 UTC (permalink / raw)
  To: hagen; +Cc: leonerd, netdev
In-Reply-To: <20100802201629.GA5973@nuttenaction>

From: Hagen Paul Pfeifer <hagen@jauu.net>
Date: Mon, 2 Aug 2010 22:16:29 +0200

> PS: the LOOP opcode must be secure against any ressource attack -> the loop
> must be break after n iterations.

Oh yeah, what is an iteration in your definition?  See this is why I
totally refuse to add a looping construct to BPF.

If you just check for a single loop hitting, the user will just use
a chaining of two looping constructs.  And then three levels of
indirection, then four, etc.  He can run up to just before exhasting
the "iteration limit" of one loop, and branch to the next one, and
so on and so forth.

There are probably a million ways to exploit this, and once you come
up with a validation or limiting scheme one of two things will happen:

1) The limiting scheme will make legitimate scripts USELESS

2) Someone will just figure out another hole to punch through and
   exploit

There's a reason no back branching construct was added to BPF to begin
with.  It violates the core design principles of BPF.

^ permalink raw reply

* Re: RFC: New BGF 'LOOP' instruction
From: David Miller @ 2010-08-03  5:13 UTC (permalink / raw)
  To: leonerd; +Cc: netdev
In-Reply-To: <20100802110334.GK11110@cel.leo>

From: Paul LeoNerd Evans <leonerd@leonerd.org.uk>
Date: Mon, 2 Aug 2010 12:03:34 +0100

> Any comments on this, while I proceed? Barring any major complaints,
> I'll have a hack at some code and present a patch in due course...

We're not adding loop instructions, it's just asking for trouble
since any user can attach BPF filters to a socket and it's just
way too easy to make a loop endless.

There's a reason no loop primitives were added to the original
BPF specification, perhaps you should take a look at what their
reasoning was.

It still applies now.

^ permalink raw reply

* Re: [PATCH 0/2] Minor extensions to marvell phy driver
From: David Miller @ 2010-08-03  5:09 UTC (permalink / raw)
  To: cyril; +Cc: netdev
In-Reply-To: <1280778294-2993-1-git-send-email-cyril@ti.com>

From: Cyril Chemparathy <cyril@ti.com>
Date: Mon,  2 Aug 2010 15:44:52 -0400

>   phy/marvell: add 88e1121 interface mode support

Applied.

>   phy/marvell: add 88ec048 support

You have to update marvel_id_tbl otherwise you break PHY
module autoloading.

Please fix this and resubmit this second patch.

Thanks.


^ permalink raw reply

* Re: [PATCH] u32: negative offset fix
From: David Miller @ 2010-08-03  5:08 UTC (permalink / raw)
  To: shemminger; +Cc: xiaosuo, netdev
In-Reply-To: <20100802164413.6f327ce6@nehalam>

From: Stephen Hemminger <shemminger@vyatta.com>
Date: Mon, 2 Aug 2010 16:44:13 -0700

> It was possible to use a negative offset in a u32 match to reference
> the ethernet header or other parts of the link layer header.
> This fixes the regression caused by:
> 
> commit fbc2e7d9cf49e0bf89b9e91fd60a06851a855c5d
> Author: Changli Gao <xiaosuo@gmail.com>
> Date:   Wed Jun 2 07:32:42 2010 -0700
> 
>     cls_u32: use skb_header_pointer() to dereference data safely
> 
> 
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

Applied.

^ permalink raw reply

* Re: [PATCH] deal with if frags[0].size is pulled to 0 in dev_gro_receive()
From: David Miller @ 2010-08-03  5:03 UTC (permalink / raw)
  To: herbert; +Cc: xiaohui.xin, netdev
In-Reply-To: <20100803045637.GA14173@gondor.apana.org.au>

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Tue, 3 Aug 2010 12:56:38 +0800

> On Tue, Aug 03, 2010 at 11:17:19AM +0800, xiaohui.xin@intel.com wrote:
>> From: Xin Xiaohui <xiaohui.xin@intel.com>
>> 
>> Now in dev_gro_receive(), if frags[0].size is pulled to 0, memmove is called and
>> the null page is released. But it's not enough, we should reset size of each frags
>> left as well. Compared to this, we can have another way to do this, it's not do do
>> anything at all.
>> 
>> Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
> 
> This patch can only work if you audit everything that uses skb
> frags to ensure that they can tolerate a zero-sided frag.
> 
> I think it's much easier to just fix the memmove.

Agreed.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox