Netdev List
 help / color / mirror / Atom feed
* Re: kernel panic receiving flooded VXLAN traffic with OVS
From: Jay Vosburgh @ 2014-11-07 21:13 UTC (permalink / raw)
  To: Jesse Gross; +Cc: Pravin Shelar, netdev, discuss@openvswitch.org
In-Reply-To: <CAEP_g=_ToXZB14MY4Y+FaYfNj_GHpaX59n3nzDrkALXHBzK1yw@mail.gmail.com>

Jesse Gross <jesse@nicira.com> wrote:

>On Fri, Nov 7, 2014 at 10:34 AM, Jesse Gross <jesse@nicira.com> wrote:
>> On Fri, Nov 7, 2014 at 9:40 AM, Pravin Shelar <pshelar@nicira.com> wrote:
>>> On Thu, Nov 6, 2014 at 5:58 PM, Jay Vosburgh <jay.vosburgh@canonical.com> wrote:
[...]
>>>>         I'm not sure if this is an error on the part of the RX / GRO
>>>> processing in assembling the GRO skb, or in how OVS calls skb_segment.
>>>>
>>>
>>> I think this is related skb_segment() issue where it is not able to
>>> handle this type of skb geometry. We need to fix skb-segmentation. I
>>> will investigate it more.
>>
>> One problem that I see is that vxlan_gro_complete() doesn't add
>> SKB_GSO_UDP_TUNNEL to gso_type. This causes us to attempt
>> fragmentation as UDP rather than continuing down to do TCP
>> segmentation. That probably screws up the skb geometry.
>
>I sent out a patch to fix this issue. I'm pretty sure that it is the
>root cause of the originally reported case but I don't have a good way
>to reproduce it so it would be great if you could test it Jay.

	I'm having an issue there; when I set up my recreation on
current net-next (3.18-rc2) without your new patch, I get the following
oops when my ovs script does "ovs-vsctl --if-exists del-br br-ex":

[   18.580812] BUG: unable to handle kernel paging request at 0000000022835df6
[   18.585532] IP: [<ffffffffa01cc5ec>] ovs_flow_tbl_insert+0xdc/0x1f0 [openvswitch]
[   18.585532] PGD b016e067 PUD afdf2067 PMD 0 
[   18.585532] Oops: 0002 [#1] SMP 
[   18.585532] Modules linked in: i915 openvswitch libcrc32c video
[   18.608578] sky2 0000:05:00.0 eth0: Link is up at 1000 Mbps, full duplex, flow control rx
[   18.585532]  drm_kms_helper drm gpio_ich lpc_ich i2c_algo_bit ppdev lp serio_raw coretemp kvm_intel kvm parport_pc parport mac_hid hid_generic usbhid hid psmouse r8169 sky2 mii
[   18.585532] CPU: 0 PID: 843 Comm: ovs-vswitchd Not tainted 3.18.0-rc2+ #7
[   18.585532] Hardware name: LENOVO 0829F3U/To be filled by O.E.M., BIOS 90KT15AUS 07/21/2010
[   18.585532] task: ffff880134af3200 ti: ffff8800b0cc4000 task.ti: ffff8800b0cc4000
[   18.585532] RIP: 0010:[<ffffffffa01cc5ec>]  [<ffffffffa01cc5ec>] ovs_flow_tbl_insert+0xdc/0x1f0 [openvswitch]
[   18.585532] RSP: 0018:ffff8800b0cc77a8  EFLAGS: 00010212
[   18.585532] RAX: 00000000432e9568 RBX: ffff880134cb2120 RCX: 0000000001d3d19d
[   18.585532] RDX: 00000000f4372b69 RSI: 000000006d3fa049 RDI: ffff8800b017c19c
[   18.585532] RBP: ffff8800b0cc77f8 R08: 0000000022835dc6 R09: 000000000974849a
[   18.585532] R10: ffffffffa01cc696 R11: 0000000000000004 R12: ffff880134cb2128
[   18.585532] R13: ffff8800b0cc7850 R14: ffff880134cb2128 R15: ffff8800b2706400
[   18.585532] FS:  00007f0497d3a980(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
[   18.585532] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   18.585532] CR2: 0000000022835df6 CR3: 00000000b060e000 CR4: 00000000000407f0
[   18.585532] Stack:
[   18.585532]  ffff8800b017c000 ffff8800b017c000 ffff8800b0cc7a70 ffff8800b017c1c0
[   18.585532]  ffff8800b076b400 ffff8800b017c000 ffff8800b0cc7a70 0000000000000000
[   18.585532]  ffff8800b076b400 ffff880134cb2120 ffff8800b0cc7a38 ffffffffa01c3ed5
[   18.585532] Call Trace:
[   18.585532]  [<ffffffffa01c3ed5>] ovs_flow_cmd_new+0x175/0x3a0 [openvswitch]
[   18.585532]  [<ffffffff81208688>] ? bh_lru_install+0x178/0x1b0
[   18.585532]  [<ffffffff8137ed83>] ? radix_tree_lookup_slot+0x13/0x30
[   18.585532]  [<ffffffff8165f445>] genl_family_rcv_msg+0x1a5/0x3c0
[   18.585532]  [<ffffffff8165f660>] ? genl_family_rcv_msg+0x3c0/0x3c0
[   18.585532]  [<ffffffff8165f6f1>] genl_rcv_msg+0x91/0xd0
[   18.585532]  [<ffffffff8165d761>] netlink_rcv_skb+0xc1/0xe0
[   18.585532]  [<ffffffff8165dc8c>] genl_rcv+0x2c/0x40
[   18.585532]  [<ffffffff8165ccf6>] netlink_unicast+0xf6/0x200
[   18.585532]  [<ffffffff8165d11d>] netlink_sendmsg+0x31d/0x780
[   18.585532]  [<ffffffff81614173>] sock_sendmsg+0x93/0xd0
[   18.585532]  [<ffffffff8101c375>] ? native_sched_clock+0x35/0x90
[   18.585532]  [<ffffffff8101c3d9>] ? sched_clock+0x9/0x10
[   18.585532]  [<ffffffff810966f5>] ? sched_clock_local+0x25/0x90
[   18.585532]  [<ffffffff81622427>] ? verify_iovec+0x47/0xd0
[   18.585532]  [<ffffffff81614989>] ___sys_sendmsg+0x399/0x3b0
[   18.585532]  [<ffffffff81096cb5>] ? fetch_task_cputime+0x95/0x100
[   18.585532]  [<ffffffff811de4c8>] ? pipe_read+0x1c8/0x2f0
[   18.585532]  [<ffffffff8101c375>] ? native_sched_clock+0x35/0x90
[   18.585532]  [<ffffffff8101c375>] ? native_sched_clock+0x35/0x90
[   18.585532]  [<ffffffff8101c3d9>] ? sched_clock+0x9/0x10
[   18.585532]  [<ffffffff8111cf1c>] ? acct_account_cputime+0x1c/0x20
[   18.585532]  [<ffffffff81096dab>] ? account_user_time+0x8b/0xa0
[   18.585532]  [<ffffffff811f30e5>] ? __fget_light+0x25/0x70
[   18.585532]  [<ffffffff81615082>] __sys_sendmsg+0x42/0x80
[   18.585532]  [<ffffffff816150d2>] SyS_sendmsg+0x12/0x20
[   18.585532]  [<ffffffff817365e4>] tracesys_phase2+0xd8/0xdd
[   18.585532] Code: 24 e8 4c 8b 45 b0 31 d2 4d 89 b8 48 03 00 00 41 0f b7 4f 28 41 0f b7 77 2a 0f b7 c1 29 ce 49 8d 7c 00 38 c1 fe 02 e8 d4 af 1d e1 <41> 89 40 30 4c 8b 2b 4c 89 c6 4c 89 ef e8 a2 f5 ff ff 8b 43 20 
[   18.585532] RIP  [<ffffffffa01cc5ec>] ovs_flow_tbl_insert+0xdc/0x1f0 [openvswitch]
[   18.585532]  RSP <ffff8800b0cc77a8>
[   18.585532] CR2: 0000000022835df6
[   18.969812] ---[ end trace fdb3743001087166 ]---

	I'll go back to 3.17 to test your patch in the meantime.

	-J

---
	-Jay Vosburgh, jay.vosburgh@canonical.com

^ permalink raw reply

* [PATCH] checkpatch: Add --strict preference for #defines using BIT(foo)
From: Joe Perches @ 2014-11-07 21:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: David Miller, jiri, netdev, LKML
In-Reply-To: <20141107.151607.480474516800359791.davem@davemloft.net>

Using BIT(foo) and BIT_ULL(bar) is more common now.
Suggest using these macros over #defines with 1<<value.

Add a --fix option too.

Signed-off-by: Joe Perches <joe@perches.com>
---
 scripts/checkpatch.pl | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 893cbd5..b5dc3f4 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -4973,6 +4973,17 @@ sub process {
 			}
 		}
 
+# check for #defines like: 1 << <digit> that could be BIT(digit)
+		if ($line =~ /#\s*define\s+\w+\s+\(?\s*1\s*([ulUL]*)\s*\<\<\s*(?:\d+|$Ident)\s*\)?/) {
+			my $ull = "";
+			$ull = "_ULL" if (defined($1) && $1 =~ /ll/i);
+			if (CHK("BIT_MACRO",
+				"Prefer using the BIT$ull macro\n" . $herecurr) &&
+			    $fix) {
+				$fixed[$fixlinenr] =~ s/\(?\s*1\s*[ulUL]*\s*<<\s*(\d+|$Ident)\s*\)?/BIT${ull}($1)/;
+			}
+		}
+
 # check for case / default statements not preceded by break/fallthrough/switch
 		if ($line =~ /^.\s*(?:case\s+(?:$Ident|$Constant)\s*|default):/) {
 			my $has_break = 0;

^ permalink raw reply related

* Re: [PATCH 1/4] inet: Add skb_copy_datagram_iter
From: David Miller @ 2014-11-07 21:48 UTC (permalink / raw)
  To: viro; +Cc: herbert, netdev, linux-kernel, bcrl
In-Reply-To: <20141106032533.GU7996@ZenIV.linux.org.uk>

From: Al Viro <viro@ZenIV.linux.org.uk>
Date: Thu, 6 Nov 2014 03:25:34 +0000

> OK, I've taken the beginning of the old queue on top of net-next; it's
> in git://git.kernel.org//pub/scm/linux/kernel/git/viro/vfs.git iov_iter-net.

What I see in there looks good.   I wonder if we can somehow make msghdr
pointer args const... but this is not so important now.

Some minor coding style nits, comments:

	/* Like
	 * this.
	 */

and for multi-line function calls in the networking, align the second
and subsequent lines at the first column after the openning parenthesis
of the first line.

Thanks.

^ permalink raw reply

* Re: [PATCH 1/4] inet: Add skb_copy_datagram_iter
From: David Miller @ 2014-11-07 21:52 UTC (permalink / raw)
  To: viro; +Cc: herbert, netdev, linux-kernel, bcrl
In-Reply-To: <20141106032533.GU7996@ZenIV.linux.org.uk>

From: Al Viro <viro@ZenIV.linux.org.uk>
Date: Thu, 6 Nov 2014 03:25:34 +0000

> 	* a new helper: zerocopy_sg_from_iter().  I have it, actually,
> but I'd rather not step on Herbert's toes - it's too close to the areas
> his series will touch, so that's probably for when his series goes in.
> It will be needed for complete macvtap conversion...

Just a heads up, his series is applied to net-next.

> 	* why doesn't verify_iovec() use rw_copy_check_uvector()?  The only
> real differences I see is that (a) you do allocation in callers (same as
> rw_copy_check_uvector() would've done), (b) you return EMSGSIZE in case of
> too long vector, while rw_copy_check_uvector() returns EINVAL in that case
> and (c) you don't do access_ok().  The last one is described as optimization,
> but for iov_iter primitives it's a serious PITA - for iovec-backed instances
> they are using __copy_from_user()/__copy_to_user(), etc.

The answer is that nobody knew abuot it and looked, that's why.

> 	It certainly would be nice to have the same code doing all copying
> of iovecs from userland - readv/writev/aio/sendmsg/recvmsg/etc.  Am I
> missing something subtle semantical difference in there?  EMSGSIZE vs EINVAL
> is trivial (we can lift that check into the callers, if nothing else), but
> I could miss something more interesting...

We also need compat counterparts.

> 	* various getfrag will need to grow iov_iter-based counterparts,
> but ip_append_output() needs no changes, AFAICS.

Right.

> 	* there's some really weird stuff in there.  Just what is this
> static int raw_probe_proto_opt(struct flowi4 *fl4, struct msghdr *msg)
> {
>         struct iovec *iov;
>         u8 __user *type = NULL;
>         u8 __user *code = NULL;
>         int probed = 0;
>         unsigned int i;
> 
>         if (!msg->msg_iov)
>                 return 0;
> 
>         for (i = 0; i < msg->msg_iovlen; i++) {
>                 iov = &msg->msg_iov[i];
>                 if (!iov)
>                         continue;
> trying to do?  "If non-NULL pointer + i somehow happened to be NULL, skip it
> and try to use the same pointer + i + 1"?  Huh?  Had been that way since
> the function first went in back in 2004 ("[IPV4] XFRM: probe icmp type/code
> when sending packets via raw socket.", according to historical tree)...

This is probably just bogus, because this address-of will never evaluate to
NULL.

> 	* rds, bluetooth and vsock are doing something odd; need to RTFS some
> more.

It is not surprising.... :-/

^ permalink raw reply

* Re: [PATCH net-next] net/fm10k: Avoid double setting of NETIF_F_SG for the HW encapsulation feature mask
From: Vick, Matthew @ 2014-11-07 21:57 UTC (permalink / raw)
  To: Or Gerlitz, Kirsher, Jeffrey T; +Cc: netdev@vger.kernel.org
In-Reply-To: <1415265664-10738-1-git-send-email-ogerlitz@mellanox.com>

On 11/6/14, 1:21 AM, "Or Gerlitz" <ogerlitz@mellanox.com> wrote:

>The networking core does it for the driver during registration time.
>
>Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
>---
> drivers/net/ethernet/intel/fm10k/fm10k_netdev.c |    5 ++---
> 1 files changed, 2 insertions(+), 3 deletions(-)
>
>diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
>b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
>index 8811364..2b17cd8 100644
>--- a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
>+++ b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
>@@ -1414,13 +1414,12 @@ struct net_device *fm10k_alloc_netdev(void)
> 	dev->vlan_features |= dev->features;
> 
> 	/* configure tunnel offloads */
>-	dev->hw_enc_features = NETIF_F_IP_CSUM |
>+	dev->hw_enc_features |= NETIF_F_IP_CSUM |
> 			       NETIF_F_TSO |
> 			       NETIF_F_TSO6 |
> 			       NETIF_F_TSO_ECN |
> 			       NETIF_F_GSO_UDP_TUNNEL |
>-			       NETIF_F_IPV6_CSUM |
>-			       NETIF_F_SG;
>+			       NETIF_F_IPV6_CSUM;
> 
> 	/* we want to leave these both on as we cannot disable VLAN tag
> 	 * insertion or stripping on the hardware since it is contained

Good catch, Or! Thank you for taking care of this!

Someone can correct me if I'm mistaken (I thought checkpatch would
complain about this, but it doesn't seem to be), but I believe the start
of the lines should match up like they are for the dev->features. Would
you like to submit a V2 with this change or would you like me to do it
(giving you credit via your Reported-by)?

Cheers,
Matthew

^ permalink raw reply

* Re: [PATCH net-next] rtnetlink: add babel protocol recognition
From: David Miller @ 2014-11-07 21:58 UTC (permalink / raw)
  To: stephen; +Cc: dave.taht, netdev
In-Reply-To: <20141106081014.38a23889@urahara>

From: Stephen Hemminger <stephen@networkplumber.org>
Date: Thu, 6 Nov 2014 08:10:14 -0800

> From: Dave Taht <dave.taht@bufferbloat.net>
> 
> Babel uses rt_proto 42. Add to userspace visible header file.
> 
> Signed-off-by: Dave Taht <dave.taht@bufferbloat.net>
> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

Applied.

^ permalink raw reply

* Re: [PATCH v2 net-next 1/2] net: gro: add a per device gro flush timer
From: David Miller @ 2014-11-07 22:00 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, ogerlitz, willemb, amirv
In-Reply-To: <1415336984.13896.102.camel@edumazet-glaptop2.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 06 Nov 2014 21:09:44 -0800

> From: Eric Dumazet <edumazet@google.com>
> 
> Tuning coalescing parameters on NIC can be really hard.
> 
> Servers can handle both bulk and RPC like traffic, with conflicting
> goals : bulk flows want as big GRO packets as possible, RPC want minimal
> latencies.
> 
> To reach big GRO packets on 10Gbe NIC, one can use :
> 
> ethtool -C eth0 rx-usecs 4 rx-frames 44
> 
> But this penalizes rpc sessions, with an increase of latencies, up to
> 50% in some cases, as NICs generally do not force an interrupt when
> a packet with TCP Push flag is received.
> 
> Some NICs do not have an absolute timer, only a timer rearmed for every
> incoming packet.
> 
> This patch uses a different strategy : Let GRO stack decides what do do,
> based on traffic pattern.
> 
> Packets with Push flag wont be delayed.
> Packets without Push flag might be held in GRO engine, if we keep
> receiving data.
> 
> This new mechanism is off by default, and shall be enabled by setting
> /sys/class/net/ethX/gro_flush_timeout to a value in nanosecond.
> 
> To fully enable this mechanism, drivers should use napi_complete_done()
> instead of napi_complete().
> 
> Tested:
>  Ran 200 netperf TCP_STREAM from A to B (10Gbe mlx4 link, 8 RX queues)
> 
> Without this feature, we send back about 305,000 ACK per second.
> 
> GRO aggregation ratio is low (811/305 = 2.65 segments per GRO packet)
> 
> Setting a timer of 2000 nsec is enough to increase GRO packet sizes
> and reduce number of ACK packets. (811/19.2 = 42)
> 
> Receiver performs less calls to upper stacks, less wakes up.
> This also reduces cpu usage on the sender, as it receives less ACK
> packets.
> 
> Note that reducing number of wakes up increases cpu efficiency, but can
> decrease QPS, as applications wont have the chance to warmup cpu caches
> doing a partial read of RPC requests/answers if they fit in one skb.
> 
> B:~# sar -n DEV 1 10 | grep eth0 | tail -1
> Average:         eth0 811269.80 305732.30 1199462.57  19705.72      0.00
> 0.00      0.50
> 
> B:~# echo 2000 >/sys/class/net/eth0/gro_flush_timeout
> 
> B:~# sar -n DEV 1 10 | grep eth0 | tail -1
> Average:         eth0 811577.30  19230.80 1199916.51   1239.80      0.00
> 0.00      0.50
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
> v2: As requested by David, drivers should use napi_complete_done()
>     instead of napi_complete() so that we do not have to track if
>     a packet was received during last NAPI poll.

Applied, thanks.

I do think this looks a lot nicer.

^ permalink raw reply

* Re: [PATCH v2 net-next 2/2] mlx4: use napi_complete_done()
From: David Miller @ 2014-11-07 22:00 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, amirv, ogerlitz, willemb
In-Reply-To: <1415337011.13896.103.camel@edumazet-glaptop2.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 06 Nov 2014 21:10:11 -0800

> From: Eric Dumazet <edumazet@google.com>
> 
> To enable gro_flush_timeout, a driver has to use napi_complete_done()
> instead of napi_complete().
> 
> Tested:
>  Ran 200 netperf TCP_STREAM from A to B (10Gbe mlx4 link, 8 RX queues)
> 
> Without this feature, we send back about 305,000 ACK per second.
> 
> GRO aggregation ratio is low (811/305 = 2.65 segments per GRO packet)
> 
> Setting a timer of 2000 nsec is enough to increase GRO packet sizes
> and reduce number of ACK packets. (811/19.2 = 42)
> 
> Receiver performs less calls to upper stacks, less wakes up.
> This also reduces cpu usage on the sender, as it receives less ACK
> packets.
> 
> Note that reducing number of wakes up increases cpu efficiency, but can
> decrease QPS, as applications wont have the chance to warmup cpu caches
> doing a partial read of RPC requests/answers if they fit in one skb.
> 
> B:~# sar -n DEV 1 10 | grep eth0 | tail -1
> Average:         eth0 811269.80 305732.30 1199462.57  19705.72      0.00
> 0.00      0.50
> 
> B:~# echo 2000 >/sys/class/net/eth0/gro_flush_timeout
> 
> B:~# sar -n DEV 1 10 | grep eth0 | tail -1
> Average:         eth0 811577.30  19230.80 1199916.51   1239.80      0.00
> 0.00      0.50
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied.

^ permalink raw reply

* Re: [PATCH 1/4] inet: Add skb_copy_datagram_iter
From: Al Viro @ 2014-11-07 22:11 UTC (permalink / raw)
  To: David Miller; +Cc: herbert, netdev, linux-kernel, bcrl
In-Reply-To: <20141107.164859.951682597018909290.davem@redhat.com>

On Fri, Nov 07, 2014 at 04:48:59PM -0500, David Miller wrote:
> From: Al Viro <viro@ZenIV.linux.org.uk>
> Date: Thu, 6 Nov 2014 03:25:34 +0000
> 
> > OK, I've taken the beginning of the old queue on top of net-next; it's
> > in git://git.kernel.org//pub/scm/linux/kernel/git/viro/vfs.git iov_iter-net.
> 
> What I see in there looks good.   I wonder if we can somehow make msghdr
> pointer args const... but this is not so important now.
> 
> Some minor coding style nits, comments:
> 
> 	/* Like
> 	 * this.
> 	 */
> 
> and for multi-line function calls in the networking, align the second
> and subsequent lines at the first column after the openning parenthesis
> of the first line.

OK...  I played with csum side of things a bit, and I've noticed something
really nasty - iov_iter primitives all assume that iovec has been validated,
_including_ access_ok() on all ranges.  That allows us to use __copy_...()
in primitives, and on the read/write/readv/writev/aio side of things we have
that validation done when we copy iovec from userland (or set a single-element
iovec over the userland-supplied range, as in read(2)/write(2)).

net/* primitives, OTOH, do access_ok() themselves - syscalls do _not_ check
access_ok() on iovec from untrusted source and rely on the low-level stuff
to do such checks.

Result: regular IO syscalls on sockets (i.e. not recvmsg/sendmsg, usual
read/write) do these checks (at least) twice and use of copy_from_iter()
in ->recvmsg() opens quite a nasty hole - one can call recvmsg(2) with
kernel address in ->msg_iov[0].base and have such an instance of ->recvmsg()
stomp on the kernel memory.  At the very least, with Herbert's patches
we need to validate that somewhere on the way to tun and macvtap recvmsg
instances.  We can do that right there, and as a stopgap measure it might
be a good idea.  However, it's not a sane long-term solution.

We could, of course, add those access_ok() in mm/iov_iter.c and drop them
from fs/read_write.c and fs/aio.c, but I really don't see the point - why
not do that along with the checks we do in verify_iovec() anyway?  And drop
them from memcpy_fromiovec() et.al.

I'm looking through the tree right now; so far it looks like we can just
move those suckers to the point where we validate iovec and lose them
from low-level iovec and csum copying completely.  I still haven't finished
tracing all possible paths for address to arrive at the points where we
currently check that stuff, but so far it looks very doable.

Comments?

^ permalink raw reply

* Re: [PATCH] ipvs: Keep skb->sk when allocating headroom on tunnel xmit
From: Calvin Owens @ 2014-11-07 22:12 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: Simon Horman, Wensong Zhang, lvs-devel, linux-kernel, netdev,
	agartrell, kernel-team
In-Reply-To: <alpine.LFD.2.11.1411051048540.1631@ja.home.ssi.bg>

On 11/05/2014 01:21 AM, Julian Anastasov wrote:
>
> 	Hello,
>
> On Tue, 4 Nov 2014, Calvin Owens wrote:
>
>> ip_vs_prepare_tunneled_skb() ignores ->sk when allocating a new
>> skb, either unconditionally setting ->sk to NULL or allowing
>> the uninitialized ->sk from a newly allocated skb to leak through
>> to the caller.
>>
>> This patch properly copies ->sk and increments its reference count.
>>
>> Signed-off-by: Calvin Owens <calvinowens@fb.com>
>
> 	Good catch. Please, extend your patch to
> fix also the second place that has such error,
> ip_vs_tunnel_xmit_v6. This call is missing from long time,
> it was not needed. But commits that allow skb->sk (local
> clients) already need it, eg.

I'm not sure where exactly you mean: ip_vs_tunnel_xmit_v6() calls
ip_vs_prepare_tunneled_skb() to do the allocation, so this patch covers 
that case.

In older versions of the kernel, ip_vs_tunnel_xmit_v6() does it 
directly, could that be what you're looking at?

> - f2428ed5e7bc89c7 ("ipvs: load balance ipv6 connections from a local
> process"), 2.6.28
> - 4856c84c1358b798 ("ipvs: load balance IPv4 connections from a local
> process"), 2.6.28
>
>> ---
>>   net/netfilter/ipvs/ip_vs_xmit.c | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
>> index 437a366..bd90bf8 100644
>> --- a/net/netfilter/ipvs/ip_vs_xmit.c
>> +++ b/net/netfilter/ipvs/ip_vs_xmit.c
>> @@ -846,6 +846,8 @@ ip_vs_prepare_tunneled_skb(struct sk_buff *skb, int skb_af,
>>   		new_skb = skb_realloc_headroom(skb, max_headroom);
>>   		if (!new_skb)
>>   			goto error;
>> +		if (skb->sk)
>> +			skb_set_owner_w(new_skb, skb->sk);
>>   		consume_skb(skb);
>>   		skb = new_skb;
>>   	}
>
> Regards
>
> --
> Julian Anastasov <ja@ssi.bg>
>

^ permalink raw reply

* Re: kernel panic receiving flooded VXLAN traffic with OVS
From: Jesse Gross @ 2014-11-07 22:29 UTC (permalink / raw)
  To: Jay Vosburgh; +Cc: Pravin Shelar, netdev, discuss@openvswitch.org
In-Reply-To: <26235.1415394796@famine>

On Fri, Nov 7, 2014 at 1:13 PM, Jay Vosburgh <jay.vosburgh@canonical.com> wrote:
> Jesse Gross <jesse@nicira.com> wrote:
>
>>On Fri, Nov 7, 2014 at 10:34 AM, Jesse Gross <jesse@nicira.com> wrote:
>>> On Fri, Nov 7, 2014 at 9:40 AM, Pravin Shelar <pshelar@nicira.com> wrote:
>>>> On Thu, Nov 6, 2014 at 5:58 PM, Jay Vosburgh <jay.vosburgh@canonical.com> wrote:
> [...]
>>>>>         I'm not sure if this is an error on the part of the RX / GRO
>>>>> processing in assembling the GRO skb, or in how OVS calls skb_segment.
>>>>>
>>>>
>>>> I think this is related skb_segment() issue where it is not able to
>>>> handle this type of skb geometry. We need to fix skb-segmentation. I
>>>> will investigate it more.
>>>
>>> One problem that I see is that vxlan_gro_complete() doesn't add
>>> SKB_GSO_UDP_TUNNEL to gso_type. This causes us to attempt
>>> fragmentation as UDP rather than continuing down to do TCP
>>> segmentation. That probably screws up the skb geometry.
>>
>>I sent out a patch to fix this issue. I'm pretty sure that it is the
>>root cause of the originally reported case but I don't have a good way
>>to reproduce it so it would be great if you could test it Jay.
>
>         I'm having an issue there; when I set up my recreation on
> current net-next (3.18-rc2) without your new patch, I get the following
> oops when my ovs script does "ovs-vsctl --if-exists del-br br-ex":

Hmm, that looks like a totally different problem. Pravin - any ideas?

^ permalink raw reply

* Re: [PATCH 1/4] inet: Add skb_copy_datagram_iter
From: Al Viro @ 2014-11-07 22:31 UTC (permalink / raw)
  To: David Miller; +Cc: herbert, netdev, linux-kernel, bcrl
In-Reply-To: <20141107221114.GB7996@ZenIV.linux.org.uk>

On Fri, Nov 07, 2014 at 10:11:14PM +0000, Al Viro wrote:

> I'm looking through the tree right now; so far it looks like we can just
> move those suckers to the point where we validate iovec and lose them
> from low-level iovec and csum copying completely.  I still haven't finished
> tracing all possible paths for address to arrive at the points where we
> currently check that stuff, but so far it looks very doable.

BTW, csum side of that is also chock-full of duplicate access_ok() -
e.g. generic csum_and_copy_from_user() checks before calling
csum_partial_copy_from_user().  And generic instance of that is using
__copy_from_user(), all right, but a _lot_ of non-default instances
repeat that access_ok().

^ permalink raw reply

* Re: [PATCH net 3/5] fm10k: Implement ndo_gso_check()
From: Joe Stringer @ 2014-11-07 22:35 UTC (permalink / raw)
  To: Vick, Matthew
  Cc: alexander.duyck@gmail.com, netdev@vger.kernel.org,
	Dept-GELinuxNICDev@qlogic.com, linux-kernel@vger.kernel.org,
	sathya.perla@emulex.com, Kirsher, Jeffrey T, Linux NICS,
	amirv@mellanox.com, shahed.shaikh@qlogic.com, therbert@google.com,
	Or Gerlitz
In-Reply-To: <D0825758.5EC90%matthew.vick@intel.com>

On Friday, November 07, 2014 11:49:38 Vick, Matthew wrote:
> On 11/6/14, 9:05 PM, "Joe Stringer" <joestringer@nicira.com> wrote:
> >Let's merge both discussions into one thread (pick here or there). We
> >have
> >this suggestion or the one which simply checks for tunnels and
> >inner+outer
> >header lengths. Do you have a preference between them?
> 
> Agreed with merging the thread--consider it merged!
> 
> Reflecting on this some more, I prefer the latter option (checking tunnels
> and header lengths). I'm leaning towards pushing the header length check
> into fm10k_tx_encap_offload and then making fm10k_gso_check call that with
> the gso_type. So, it's really the most recent version of the patch you
> proposed:
> 
> static bool fm10k_gso_check(struct sk_buff *skb, struct net_device *dev)
> {
> 	if ((skb_shinfo(skb)->gso_type & (SKB_GSO_UDP_TUNNEL | SKB_GSO_GRE)) &&
> 	    !fm10k_tx_encap_offload(skb))
> 		return false;
> 
> 	return true;
> }
> 
> 
> plus the header length being checked in fm10k_tx_encap_offload. The only
> nit would be that I'd just return the conditional instead of having
> "return true" or "return false" lines.

OK, that sounds reasonable.

> The tunnel length check really should be there in fm10k_tx_encap_offload
> anyway, so I'll get a patch together for that one.

Thanks.

> >We could introduce an "skb_is_gso_encap()" or similar for this purpose.
> >Checking for SKB_GSO_UDP_TUNNEL or SKB_GSO_GRE is pretty closely tied to
> >what
> >fm10k_tx_encap_offload() checks for though; it might not even make sense
> >to call
> >it if some of the other SKB_GSO_* flags are raised.
> 
> A fair point. On the other hand, we have to check the header length both
> in the GSO and non-GSO cases anyway, so only having the check in
> fm10k_tx_encap_offload and calling it from fm10k_gso_check wouldn't be as
> duplicative. What do you think about that approach?

Sure, I think fm10k_tx_encap_offload() is a good place for the header length 
check. Separately, my question above was regarding the idea of a helper for 
SKB_GSO_{GRE,UDP_TUNNEL}. The only reason it might be useful for the fm10k 
driver is because all encap is checked in the fm10k_tx_encap_offload() function. 
Other drivers don't seem to handle different tunnels together like this though, 
so I'm inclined to stick with the below for now.
                                                                                
static bool fm10k_gso_check(struct sk_buff *skb, struct net_device *dev)        
{                                                                               
        return (!(skb_shinfo(skb)->gso_type &                                   
                  (SKB_GSO_UDP_TUNNEL | SKB_GSO_GRE)) ||                        
                fm10k_tx_encap_offload(skb));                                   
}

Cheers,
Joe

^ permalink raw reply

* Re: [PATCH 1/4] inet: Add skb_copy_datagram_iter
From: Al Viro @ 2014-11-07 22:35 UTC (permalink / raw)
  To: David Miller; +Cc: herbert, netdev, linux-kernel, bcrl
In-Reply-To: <20141107223153.GC7996@ZenIV.linux.org.uk>

On Fri, Nov 07, 2014 at 10:31:53PM +0000, Al Viro wrote:
> On Fri, Nov 07, 2014 at 10:11:14PM +0000, Al Viro wrote:
> 
> > I'm looking through the tree right now; so far it looks like we can just
> > move those suckers to the point where we validate iovec and lose them
> > from low-level iovec and csum copying completely.  I still haven't finished
> > tracing all possible paths for address to arrive at the points where we
> > currently check that stuff, but so far it looks very doable.
> 
> BTW, csum side of that is also chock-full of duplicate access_ok() -
> e.g. generic csum_and_copy_from_user() checks before calling
> csum_partial_copy_from_user().  And generic instance of that is using
> __copy_from_user(), all right, but a _lot_ of non-default instances
> repeat that access_ok().

While we are at it: here's the default csum_and_copy_to_user()
static __inline__ __wsum csum_and_copy_to_user
(const void *src, void __user *dst, int len, __wsum sum, int *err_ptr)
{
        sum = csum_partial(src, len, sum);

        if (access_ok(VERIFY_WRITE, dst, len)) {
                if (copy_to_user(dst, src, len) == 0)
                        return sum;
        }
        if (len)
                *err_ptr = -EFAULT;

        return (__force __wsum)-1; /* invalid checksum */
}

Note that we do that access_ok() and follow it with copy_to_user() on exact
same range, i.e. repeat the same damn check...

^ permalink raw reply

* Re: [net-next 1/9] i40e: poll firmware slower
From: Jesse Brandeburg @ 2014-11-07 22:58 UTC (permalink / raw)
  To: David Laight
  Cc: 'Jeff Kirsher', davem@davemloft.net, Kamil Krawczyk,
	netdev@vger.kernel.org, nhorman@redhat.com, sassmann@redhat.com,
	jogreene@redhat.com, jesse.brandeburg
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6D1C9EAE27@AcuExch.aculab.com>

Thanks for the review David, comments follow.

On Fri, 7 Nov 2014 09:40:08 +0000
David Laight <David.Laight@ACULAB.COM> wrote:

> From: Jeff Kirsher
> > From: Kamil Krawczyk <kamil.krawczyk@intel.com>
> > 
> > The code was polling the firmware tail register for completion every
> > 10 microseconds, which is way faster than the firmware can respond.
> > This changes the poll interval to 1ms, which reduces polling CPU
> > utilization, and the number of times we loop.
> 
> Are you sure the code path is allowed to sleep?

Yes, we are never (should never be) in interrupt context when calling
these routines.

> 
> > The maximum delay is still 100ms.
> 
> Actually it is now likely to be up to 200ms or more.
> You could convert the maximum delay check to one that
> looks at jiffies - but maybe it doesn't matter.

Thats okay, this is all init or reset or shutdown level code.  If the
delay goes up it won't hurt anything.

> 
> > --- a/drivers/net/ethernet/intel/i40e/i40e_adminq.c
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_adminq.c
> > @@ -853,7 +853,6 @@ i40e_status i40e_asq_send_command(struct i40e_hw *hw,
> >  	 */
> >  	if (!details->async && !details->postpone) {
> >  		u32 total_delay = 0;
> > -		u32 delay_len = 10;
> > 
> >  		do {
> >  			/* AQ designers suggest use of head for better
> > @@ -862,8 +861,8 @@ i40e_status i40e_asq_send_command(struct i40e_hw *hw,
> >  			if (i40e_asq_done(hw))
> >  				break;
> >  			/* ugh! delay while spin_lock */
> 
> The comment is not right any more.

yes it should have been removed.

> 
> > -			udelay(delay_len);
> > -			total_delay += delay_len;
> > +			usleep_range(1000, 2000);
> > +			total_delay++;
> >  		} while (total_delay < hw->aq.asq_cmd_timeout);
> >  	}
> > 
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> > b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> > index ba38a89..df0bd09 100644
> > --- a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> > @@ -141,7 +141,7 @@ static inline int i40e_aq_rc_to_posix(u16 aq_rc)
> > 
> >  /* general information */
> >  #define I40E_AQ_LARGE_BUF	512
> > -#define I40E_ASQ_CMD_TIMEOUT	100000  /* usecs */
> > +#define I40E_ASQ_CMD_TIMEOUT	100  /* msecs */
> 
> It looks like this value is written to asq_cmd_timeout, that makes
> be wonder whether anything else can change it - otherwise the compile
> time constant would be used.
> Changing the units has broken anything else that modifies the value.

I pretty much agree with you, but I can tell you why it's there.
Currently nothing in the code changes it.  The code was designed such
that it can run on hardware requiring different timeouts.

^ permalink raw reply

* Re: [PATCH net-next] net: introduce SO_INCOMING_CPU
From: Tom Herbert @ 2014-11-07 23:02 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Ying Cai, Willem de Bruijn, Neal Cardwell
In-Reply-To: <1415393472.13896.119.camel@edumazet-glaptop2.roam.corp.google.com>

On Fri, Nov 7, 2014 at 12:51 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> Alternative to RPS/RFS is to use hardware support for multi queue.
>
> Then split a set of million of sockets into worker threads, each
> one using epoll() to manage events on its own socket pool.
>
> Ideally, we want one thread per RX/TX queue/cpu, but we have no way to
> know after accept() or connect() on which queue/cpu a socket is managed.
>
> We normally use one cpu per RX queue (IRQ smp_affinity being properly
> set), so remembering on socket structure which cpu delivered last packet
> is enough to solve the problem.
>
> After accept(), connect(), or even file descriptor passing around
> processes, applications can use :
>
>  int cpu;
>  socklen_t len = sizeof(cpu);
>
>  getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len);
>
> And use this information to put the socket into the right silo
> for optimal performance, as all networking stack should run
> on the appropriate cpu, without need to send IPI (RPS/RFS).
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
>  arch/alpha/include/uapi/asm/socket.h   |    2 ++
>  arch/avr32/include/uapi/asm/socket.h   |    2 ++
>  arch/cris/include/uapi/asm/socket.h    |    2 ++
>  arch/frv/include/uapi/asm/socket.h     |    2 ++
>  arch/ia64/include/uapi/asm/socket.h    |    2 ++
>  arch/m32r/include/uapi/asm/socket.h    |    2 ++
>  arch/mips/include/uapi/asm/socket.h    |    2 ++
>  arch/mn10300/include/uapi/asm/socket.h |    2 ++
>  arch/parisc/include/uapi/asm/socket.h  |    2 ++
>  arch/powerpc/include/uapi/asm/socket.h |    2 ++
>  arch/s390/include/uapi/asm/socket.h    |    2 ++
>  arch/sparc/include/uapi/asm/socket.h   |    2 ++
>  arch/xtensa/include/uapi/asm/socket.h  |    2 ++
>  include/net/sock.h                     |   12 ++++++++++++
>  include/uapi/asm-generic/socket.h      |    2 ++
>  net/core/sock.c                        |    5 +++++
>  net/ipv4/tcp_ipv4.c                    |    1 +
>  net/ipv4/udp.c                         |    1 +
>  net/ipv6/tcp_ipv6.c                    |    1 +
>  net/ipv6/udp.c                         |    1 +
>  net/sctp/ulpqueue.c                    |    5 +++--
>  21 files changed, 52 insertions(+), 2 deletions(-)
>
> diff --git a/arch/alpha/include/uapi/asm/socket.h b/arch/alpha/include/uapi/asm/socket.h
> index 3de1394bcab821984674e89a3ee022cc6dd5f0f2..e2fe0700b3b442bffc1f606b1b8b0bb7759aa157 100644
> --- a/arch/alpha/include/uapi/asm/socket.h
> +++ b/arch/alpha/include/uapi/asm/socket.h
> @@ -87,4 +87,6 @@
>
>  #define SO_BPF_EXTENSIONS      48
>
> +#define SO_INCOMING_CPU                49
> +
>  #endif /* _UAPI_ASM_SOCKET_H */
> diff --git a/arch/avr32/include/uapi/asm/socket.h b/arch/avr32/include/uapi/asm/socket.h
> index 6e6cd159924b1855aa5f1811ad4e4c60b403c431..92121b0f5b989a61c008e0be24030725bab88e36 100644
> --- a/arch/avr32/include/uapi/asm/socket.h
> +++ b/arch/avr32/include/uapi/asm/socket.h
> @@ -80,4 +80,6 @@
>
>  #define SO_BPF_EXTENSIONS      48
>
> +#define SO_INCOMING_CPU                49
> +
>  #endif /* _UAPI__ASM_AVR32_SOCKET_H */
> diff --git a/arch/cris/include/uapi/asm/socket.h b/arch/cris/include/uapi/asm/socket.h
> index ed94e5ed0a238c2750e677ccb806a6bc0a94041a..60f60f5b9b35bd219d7a9834fe5394e8ac5fdbab 100644
> --- a/arch/cris/include/uapi/asm/socket.h
> +++ b/arch/cris/include/uapi/asm/socket.h
> @@ -82,6 +82,8 @@
>
>  #define SO_BPF_EXTENSIONS      48
>
> +#define SO_INCOMING_CPU                49
> +
>  #endif /* _ASM_SOCKET_H */
>
>
> diff --git a/arch/frv/include/uapi/asm/socket.h b/arch/frv/include/uapi/asm/socket.h
> index ca2c6e6f31c6817780d31a246652adcc9847e373..2c6890209ea60c149bf097c2a1b369519cb8c301 100644
> --- a/arch/frv/include/uapi/asm/socket.h
> +++ b/arch/frv/include/uapi/asm/socket.h
> @@ -80,5 +80,7 @@
>
>  #define SO_BPF_EXTENSIONS      48
>
> +#define SO_INCOMING_CPU                49
> +
>  #endif /* _ASM_SOCKET_H */
>
> diff --git a/arch/ia64/include/uapi/asm/socket.h b/arch/ia64/include/uapi/asm/socket.h
> index a1b49bac7951929127ed08db549218c2c16ccf89..09a93fb566f6c6c6fe29c10c95b931881843d1cd 100644
> --- a/arch/ia64/include/uapi/asm/socket.h
> +++ b/arch/ia64/include/uapi/asm/socket.h
> @@ -89,4 +89,6 @@
>
>  #define SO_BPF_EXTENSIONS      48
>
> +#define SO_INCOMING_CPU                49
> +
>  #endif /* _ASM_IA64_SOCKET_H */
> diff --git a/arch/m32r/include/uapi/asm/socket.h b/arch/m32r/include/uapi/asm/socket.h
> index 6c9a24b3aefa3a4f3048c17a7fa06d97b585ec14..e8589819c2743c6e112b15a245fc3ebd146e6313 100644
> --- a/arch/m32r/include/uapi/asm/socket.h
> +++ b/arch/m32r/include/uapi/asm/socket.h
> @@ -80,4 +80,6 @@
>
>  #define SO_BPF_EXTENSIONS      48
>
> +#define SO_INCOMING_CPU                49
> +
>  #endif /* _ASM_M32R_SOCKET_H */
> diff --git a/arch/mips/include/uapi/asm/socket.h b/arch/mips/include/uapi/asm/socket.h
> index a14baa218c76f14de988ef106bdac5dadc48aceb..2e9ee8c55a103a0337d9f80f71fe9ef28be1154b 100644
> --- a/arch/mips/include/uapi/asm/socket.h
> +++ b/arch/mips/include/uapi/asm/socket.h
> @@ -98,4 +98,6 @@
>
>  #define SO_BPF_EXTENSIONS      48
>
> +#define SO_INCOMING_CPU                49
> +
>  #endif /* _UAPI_ASM_SOCKET_H */
> diff --git a/arch/mn10300/include/uapi/asm/socket.h b/arch/mn10300/include/uapi/asm/socket.h
> index 6aa3ce1854aa9523d46bc28851eddabd59edeb37..f3492e8c9f7009c33e07168df916f7337bef3929 100644
> --- a/arch/mn10300/include/uapi/asm/socket.h
> +++ b/arch/mn10300/include/uapi/asm/socket.h
> @@ -80,4 +80,6 @@
>
>  #define SO_BPF_EXTENSIONS      48
>
> +#define SO_INCOMING_CPU                49
> +
>  #endif /* _ASM_SOCKET_H */
> diff --git a/arch/parisc/include/uapi/asm/socket.h b/arch/parisc/include/uapi/asm/socket.h
> index fe35ceacf0e72cad69a43d9b1ce7b8f5ec3da98a..7984a1cab3da980f1f810827967b4b67616eb89b 100644
> --- a/arch/parisc/include/uapi/asm/socket.h
> +++ b/arch/parisc/include/uapi/asm/socket.h
> @@ -79,4 +79,6 @@
>
>  #define SO_BPF_EXTENSIONS      0x4029
>
> +#define SO_INCOMING_CPU                0x402A
> +
>  #endif /* _UAPI_ASM_SOCKET_H */
> diff --git a/arch/powerpc/include/uapi/asm/socket.h b/arch/powerpc/include/uapi/asm/socket.h
> index a9c3e2e18c054a1e952fe33599401de57c6a6544..3474e4ef166df4a573773916b325d0fa9f3b45d0 100644
> --- a/arch/powerpc/include/uapi/asm/socket.h
> +++ b/arch/powerpc/include/uapi/asm/socket.h
> @@ -87,4 +87,6 @@
>
>  #define SO_BPF_EXTENSIONS      48
>
> +#define SO_INCOMING_CPU                49
> +
>  #endif /* _ASM_POWERPC_SOCKET_H */
> diff --git a/arch/s390/include/uapi/asm/socket.h b/arch/s390/include/uapi/asm/socket.h
> index e031332096d7c7b23b5953680289e8f3bcc3b378..8457636c33e1b67a9b7804daa05627839035a8fb 100644
> --- a/arch/s390/include/uapi/asm/socket.h
> +++ b/arch/s390/include/uapi/asm/socket.h
> @@ -86,4 +86,6 @@
>
>  #define SO_BPF_EXTENSIONS      48
>
> +#define SO_INCOMING_CPU                49
> +
>  #endif /* _ASM_SOCKET_H */
> diff --git a/arch/sparc/include/uapi/asm/socket.h b/arch/sparc/include/uapi/asm/socket.h
> index 54d9608681b6947ae25dab008f808841d96125c0..4a8003a9416348006cfa85d5bcdf7553c8d23958 100644
> --- a/arch/sparc/include/uapi/asm/socket.h
> +++ b/arch/sparc/include/uapi/asm/socket.h
> @@ -76,6 +76,8 @@
>
>  #define SO_BPF_EXTENSIONS      0x0032
>
> +#define SO_INCOMING_CPU                0x0033
> +
>  /* Security levels - as per NRL IPv6 - don't actually do anything */
>  #define SO_SECURITY_AUTHENTICATION             0x5001
>  #define SO_SECURITY_ENCRYPTION_TRANSPORT       0x5002
> diff --git a/arch/xtensa/include/uapi/asm/socket.h b/arch/xtensa/include/uapi/asm/socket.h
> index 39acec0cf0b1d500c1c40f9b523ef3a9a142c2f1..c46f6a696849c6f7f8a34b2cc522b48e04b17380 100644
> --- a/arch/xtensa/include/uapi/asm/socket.h
> +++ b/arch/xtensa/include/uapi/asm/socket.h
> @@ -91,4 +91,6 @@
>
>  #define SO_BPF_EXTENSIONS      48
>
> +#define SO_INCOMING_CPU                49
> +
>  #endif /* _XTENSA_SOCKET_H */
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 6767d75ecb17693eb59a99b8218da4319854ccc0..7789b59c0c400eb99f65d1f0e03cd9773664cf93 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -273,6 +273,7 @@ struct cg_proto;
>    *    @sk_rcvtimeo: %SO_RCVTIMEO setting
>    *    @sk_sndtimeo: %SO_SNDTIMEO setting
>    *    @sk_rxhash: flow hash received from netif layer
> +  *    @sk_incoming_cpu: record cpu processing incoming packets
>    *    @sk_txhash: computed flow hash for use on transmit
>    *    @sk_filter: socket filtering instructions
>    *    @sk_protinfo: private area, net family specific, when not using slab
> @@ -350,6 +351,12 @@ struct sock {
>  #ifdef CONFIG_RPS
>         __u32                   sk_rxhash;
>  #endif
> +       u16                     sk_incoming_cpu;
> +       /* 16bit hole
> +        * Warned : sk_incoming_cpu can be set from softirq,
> +        * Do not use this hole without fully understanding possible issues.
> +        */
> +
>         __u32                   sk_txhash;
>  #ifdef CONFIG_NET_RX_BUSY_POLL
>         unsigned int            sk_napi_id;
> @@ -833,6 +840,11 @@ static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
>         return sk->sk_backlog_rcv(sk, skb);
>  }
>
> +static inline void sk_incoming_cpu_update(struct sock *sk)
> +{
> +       sk->sk_incoming_cpu = raw_smp_processor_id();
> +}
> +
>  static inline void sock_rps_record_flow_hash(__u32 hash)
>  {
>  #ifdef CONFIG_RPS
> diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
> index ea0796bdcf88404ef0f127eb6e64ba00c16ea856..f541ccefd4acbeb4ad757be9dbf4b67f204bf21d 100644
> --- a/include/uapi/asm-generic/socket.h
> +++ b/include/uapi/asm-generic/socket.h
> @@ -82,4 +82,6 @@
>
>  #define SO_BPF_EXTENSIONS      48
>
> +#define SO_INCOMING_CPU                49
> +
>  #endif /* __ASM_GENERIC_SOCKET_H */
> diff --git a/net/core/sock.c b/net/core/sock.c
> index ac56dd06c306f3712e57ce8e4724c79565589499..0725cf0cb685787b2122606437da53299fb24621 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1213,6 +1213,10 @@ int sock_getsockopt(struct socket *sock, int level, int optname,
>                 v.val = sk->sk_max_pacing_rate;
>                 break;
>
> +       case SO_INCOMING_CPU:
> +               v.val = sk->sk_incoming_cpu;
> +               break;
> +
>         default:
>                 return -ENOPROTOOPT;
>         }
> @@ -1517,6 +1521,7 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
>
>                 newsk->sk_err      = 0;
>                 newsk->sk_priority = 0;
> +               newsk->sk_incoming_cpu = raw_smp_processor_id();
>                 /*
>                  * Before updating sk_refcnt, we must commit prior changes to memory
>                  * (Documentation/RCU/rculist_nulls.txt for details)
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 9c7d7621466b1241f404a5ca11de809dcff2d02a..3893f51972f28271a6d27a763c05495c5c2554f7 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1662,6 +1662,7 @@ process:
>                 goto discard_and_relse;
>
>         sk_mark_napi_id(sk, skb);
> +       sk_incoming_cpu_update(sk);

Would it be better to get the incoming CPU before packet steering? Or
maybe return this in addition to the CPU where ULP processes packet?

>         skb->dev = NULL;
>
>         bh_lock_sock_nested(sk);
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index df19027f44f3d6fbe13dec78d3b085968dbf2329..f52b6081158e87caa5df32e8e5d27dbf314a01b1 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1445,6 +1445,7 @@ static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
>         if (inet_sk(sk)->inet_daddr) {
>                 sock_rps_save_rxhash(sk, skb);
>                 sk_mark_napi_id(sk, skb);
> +               sk_incoming_cpu_update(sk);
>         }
>
>         rc = sock_queue_rcv_skb(sk, skb);
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index ace29b60813cf8a1d7182ad2262cbcbd21810fa7..ac40d23204b5e55da5172c80dafd1d4854b370d5 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -1455,6 +1455,7 @@ process:
>                 goto discard_and_relse;
>
>         sk_mark_napi_id(sk, skb);
> +       sk_incoming_cpu_update(sk);
>         skb->dev = NULL;
>
>         bh_lock_sock_nested(sk);
> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
> index 9b6809232b178c16d699ce3d152196b8c4cb096b..0125ca3daf47a4a3333e7462a11550d3e2f96875 100644
> --- a/net/ipv6/udp.c
> +++ b/net/ipv6/udp.c
> @@ -577,6 +577,7 @@ static int __udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
>         if (!ipv6_addr_any(&sk->sk_v6_daddr)) {
>                 sock_rps_save_rxhash(sk, skb);
>                 sk_mark_napi_id(sk, skb);
> +               sk_incoming_cpu_update(sk);
>         }
>
>         rc = sock_queue_rcv_skb(sk, skb);
> diff --git a/net/sctp/ulpqueue.c b/net/sctp/ulpqueue.c
> index d49dc2ed30adb97a809eb37902b9956c366a2862..ce469d648ffbe166f9ae1c5650f481256f31a7f8 100644
> --- a/net/sctp/ulpqueue.c
> +++ b/net/sctp/ulpqueue.c
> @@ -205,9 +205,10 @@ int sctp_ulpq_tail_event(struct sctp_ulpq *ulpq, struct sctp_ulpevent *event)
>         if (sock_flag(sk, SOCK_DEAD) || (sk->sk_shutdown & RCV_SHUTDOWN))
>                 goto out_free;
>
> -       if (!sctp_ulpevent_is_notification(event))
> +       if (!sctp_ulpevent_is_notification(event)) {
>                 sk_mark_napi_id(sk, skb);
> -
> +               sk_incoming_cpu_update(sk);
> +       }
>         /* Check if the user wishes to receive this event.  */
>         if (!sctp_ulpevent_is_enabled(event, &sctp_sk(sk)->subscribe))
>                 goto out_free;
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [net-next 1/9] i40e: poll firmware slower
From: Jesse Brandeburg @ 2014-11-07 23:02 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Jeff Kirsher, David Miller, Kamil Krawczyk, Linux Netdev List,
	nhorman, sassmann, jogreene, jesse.brandeburg
In-Reply-To: <CAJ3xEMghFwT8Tm9wONzLPL4MLvgL+3w=CNRNg0Hb=yGJ9uR0Mg@mail.gmail.com>

On Fri, 7 Nov 2014 15:29:15 +0200
Or Gerlitz <gerlitz.or@gmail.com> wrote:

> On Fri, Nov 7, 2014 at 10:57 AM, Jeff Kirsher
> <jeffrey.t.kirsher@intel.com> wrote:
> > The code was polling the firmware tail register for completion
> 
> any reason not to sleep while waiting for this completion? can the
> firmware generate an interrupt?

The code path in question is called from the drivers when they are in
a synchronous context and can sleep or be rescheduled without issue.
An async mode is possible, but not here, because the code complexity
required to change to async mode is unmaintainable (if even doable, how
do you handle an interrupt while you're in probe?).  We use async when
it is practical.

Thanks for your comments,
Jesse

^ permalink raw reply

* Re: kernel panic receiving flooded VXLAN traffic with OVS
From: Pravin Shelar @ 2014-11-07 23:06 UTC (permalink / raw)
  To: Jesse Gross; +Cc: Jay Vosburgh, netdev, discuss@openvswitch.org
In-Reply-To: <CAEP_g=9gWZYtWGiF-JbVwmesa1n7FXYEJSQKdZi3L_yhPGPkAA@mail.gmail.com>

On Fri, Nov 7, 2014 at 2:29 PM, Jesse Gross <jesse@nicira.com> wrote:
> On Fri, Nov 7, 2014 at 1:13 PM, Jay Vosburgh <jay.vosburgh@canonical.com> wrote:
>> Jesse Gross <jesse@nicira.com> wrote:
>>
>>>On Fri, Nov 7, 2014 at 10:34 AM, Jesse Gross <jesse@nicira.com> wrote:
>>>> On Fri, Nov 7, 2014 at 9:40 AM, Pravin Shelar <pshelar@nicira.com> wrote:
>>>>> On Thu, Nov 6, 2014 at 5:58 PM, Jay Vosburgh <jay.vosburgh@canonical.com> wrote:
>> [...]
>>>>>>         I'm not sure if this is an error on the part of the RX / GRO
>>>>>> processing in assembling the GRO skb, or in how OVS calls skb_segment.
>>>>>>
>>>>>
>>>>> I think this is related skb_segment() issue where it is not able to
>>>>> handle this type of skb geometry. We need to fix skb-segmentation. I
>>>>> will investigate it more.
>>>>
>>>> One problem that I see is that vxlan_gro_complete() doesn't add
>>>> SKB_GSO_UDP_TUNNEL to gso_type. This causes us to attempt
>>>> fragmentation as UDP rather than continuing down to do TCP
>>>> segmentation. That probably screws up the skb geometry.
>>>
>>>I sent out a patch to fix this issue. I'm pretty sure that it is the
>>>root cause of the originally reported case but I don't have a good way
>>>to reproduce it so it would be great if you could test it Jay.
>>
>>         I'm having an issue there; when I set up my recreation on
>> current net-next (3.18-rc2) without your new patch, I get the following
>> oops when my ovs script does "ovs-vsctl --if-exists del-br br-ex":
>
> Hmm, that looks like a totally different problem. Pravin - any ideas?

I am not able to reproduce with above command. Jay, Can you send me
steps to reproduce this issue?

Thanks,
Pravin.

^ permalink raw reply

* Re: [PATCH net-next] net: introduce SO_INCOMING_CPU
From: Eric Dumazet @ 2014-11-07 23:26 UTC (permalink / raw)
  To: Tom Herbert
  Cc: David Miller, netdev, Ying Cai, Willem de Bruijn, Neal Cardwell
In-Reply-To: <CA+mtBx9kpW8Q4HgTyNE+O4yPRmMcOgjshrBfZXVnPu=EMVh4Qg@mail.gmail.com>

On Fri, 2014-11-07 at 15:02 -0800, Tom Herbert wrote:

> Would it be better to get the incoming CPU before packet steering? Or
> maybe return this in addition to the CPU where ULP processes packet?

No packet steering : RFS is off.

If for some reason RPS is on (say NIC has a limited number of RX queues)
we'd like to have the cpu processing the packet in TCP stack for better
affinities.

^ permalink raw reply

* [PATCH net] cxgb4 : Fix bug in DCB app deletion
From: Anish Bhatt @ 2014-11-07 23:41 UTC (permalink / raw)
  To: netdev; +Cc: davem, hariprasad, leedom, Anish Bhatt

Unlike CEE, IEEE has a bespoke app delete call and does not rely on priority
for app deletion

Fixes : 2376c879b80c ('cxgb4 : Improve handling of DCB negotiation or loss
 thereof')

Signed-off-by: Anish Bhatt <anish@chelsio.com>
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_dcb.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_dcb.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_dcb.c
index 6fe300e..b6fdb14 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_dcb.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_dcb.c
@@ -79,8 +79,9 @@ static void cxgb4_dcb_cleanup_apps(struct net_device *dev)
 		app.protocol = dcb->app_priority[i].protocolid;
 
 		if (dcb->dcb_version == FW_PORT_DCB_VER_IEEE) {
+			app.priority = dcb->app_priority[i].user_prio_map;
 			app.selector = dcb->app_priority[i].sel_field + 1;
-			err = dcb_ieee_setapp(dev, &app);
+			err = dcb_ieee_delapp(dev, &app);
 		} else {
 			app.selector = !!(dcb->app_priority[i].sel_field);
 			err = dcb_setapp(dev, &app);
-- 
2.1.3

^ permalink raw reply related

* Re: [PATCH 1/4] inet: Add skb_copy_datagram_iter
From: Al Viro @ 2014-11-07 23:42 UTC (permalink / raw)
  To: David Miller; +Cc: herbert, netdev, linux-kernel, bcrl
In-Reply-To: <20141107221114.GB7996@ZenIV.linux.org.uk>

On Fri, Nov 07, 2014 at 10:11:14PM +0000, Al Viro wrote:

> I'm looking through the tree right now; so far it looks like we can just
> move those suckers to the point where we validate iovec and lose them
> from low-level iovec and csum copying completely.  I still haven't finished
> tracing all possible paths for address to arrive at the points where we
> currently check that stuff, but so far it looks very doable.

Definitely doable.  The only remaining interesting part is drivers/vhost
with the stuff it puts in vq->iov[].  If we can guarantee that it satisfies
the sanity checks (access_ok() and size-related ones), we are done -
making verify_iovec() use rw_copy_check_uvector() (and verify_compat_iov()
use compat_rw_copy_check_uvector()) will suffice to guarantee that none of
	csum_partial_copy_fromiovecend
	memcpy_fromiovec
	memcpy_toiovec
	memcpy_toiovecend
	memcpy_fromiovecend
	skb_copy_datagram_iovec
	skb_copy_datagram_iter
	skb_copy_datagram_from_iter
	zerocopy_sg_from_iter
	skb_copy_and_csum_datagram
	skb_copy_and_csum_datagram_iovec
	csum_and_copy_from_user
	csum_and_copy_to_user
	csum_partial_copy_from_user
will ever see an address that doesn't satisfy access_ok() checks.  And
having looked at the data flow...  we definitely want to do those checks
on intake of iovec - as it is, we usually repeat them quite a few times
for the same iovec segment, and we practically never end up _not_ doing them
for some segment of iovec, unless we hit a failure exit before we get around
to copying any data at all.

I'll finish RTFS drivers/vhost and if it turns out to be OK I'll post the
series moving those checks to the moment of copying iovec from userland,
so that kernel-side we could always rely on ->msg_iov elements having been
verified.

^ permalink raw reply

* Re: [PATCH net 3/5] fm10k: Implement ndo_gso_check()
From: Vick, Matthew @ 2014-11-08  0:51 UTC (permalink / raw)
  To: Joe Stringer
  Cc: alexander.duyck@gmail.com, netdev@vger.kernel.org,
	Dept-GELinuxNICDev@qlogic.com, linux-kernel@vger.kernel.org,
	sathya.perla@emulex.com, Kirsher, Jeffrey T, Linux NICS,
	amirv@mellanox.com, shahed.shaikh@qlogic.com, therbert@google.com,
	Or Gerlitz
In-Reply-To: <201411071435.56628.joestringer@nicira.com>

On 11/7/14, 2:35 PM, "Joe Stringer" <joestringer@nicira.com> wrote:

>Sure, I think fm10k_tx_encap_offload() is a good place for the header
>length 
>check. Separately, my question above was regarding the idea of a helper
>for 
>SKB_GSO_{GRE,UDP_TUNNEL}. The only reason it might be useful for the
>fm10k 
>driver is because all encap is checked in the fm10k_tx_encap_offload()
>function. 
>Other drivers don't seem to handle different tunnels together like this
>though, 
>so I'm inclined to stick with the below for now.
>                  
>      
>static bool fm10k_gso_check(struct sk_buff *skb, struct net_device *dev)
>      
>{                 
>      
>        return (!(skb_shinfo(skb)->gso_type &
>      
>                  (SKB_GSO_UDP_TUNNEL | SKB_GSO_GRE)) ||
>      
>                fm10k_tx_encap_offload(skb));
>      
>}
>
>Cheers,
>Joe

I agree. I think that makes the most sense.

Cheers,
Matthew

^ permalink raw reply

* Re: [PATCH net] udptunnel: Add SKB_GSO_UDP_TUNNEL during gro_complete.
From: Jay Vosburgh @ 2014-11-08  1:51 UTC (permalink / raw)
  To: Jesse Gross; +Cc: David Miller, netdev
In-Reply-To: <1415391969-108511-1-git-send-email-jesse@nicira.com>

Jesse Gross <jesse@nicira.com> wrote:

>When doing GRO processing for UDP tunnels, we never add
>SKB_GSO_UDP_TUNNEL to gso_type - only the type of the inner protocol
>is added (such as SKB_GSO_TCPV4). The result is that if the packet is
>later resegmented we will do GSO but not treat it as a tunnel. This
>results in UDP fragmentation and since that is not the original layout
>of the skb, a panic in skb_segment().
>
>Reported-by: Jay Vosburgh <jay.vosburgh@canonical.com>
>Signed-off-by: Jesse Gross <jesse@nicira.com>
>---
>This problem occurs back to 3.14 for VXLAN but the FOU portion needs to
>be removed for kernels other than the 'net' tree.

	This patch does not resolve the problem when applied to a
3.17-ish kernel; the panic message is below, and appears to be
unchanged.

	-J

[  146.960891] kernel BUG at net/core/skbuff.c:3011!
[  146.960891] invalid opcode: 0000 [#1] SMP
[  146.960891] Modules linked in: veth 8021q garp mrp bonding xt_tcpudp bridge s
tp llc iptable_filter ip_tables x_tables openvswitch vxlan udp_tunnel gre libcrc
32c i915 video drm_kms_helper coretemp kvm_intel drm kvm gpio_ich serio_raw i2c_
algo_bit ppdev lpc_ich parport_pc lp parport mac_hid hid_generic usbhid hid psmo
use r8169 mii sky2
[  146.960891] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.17.0-rc7 #11
[  146.960891] Hardware name: LENOVO 0829F3U/To be filled by O.E.M., BIOS 90KT15
AUS 07/21/2010
[  146.960891] task: ffffffff81c1d480 ti: ffffffff81c00000 task.ti: ffffffff81c0
0000
[  146.960891] RIP: 0010:[<ffffffff816117d2>]  [<ffffffff816117d2>] skb_segment+
0x962/0x990
[  146.960891] RSP: 0018:ffff88013fc03908  EFLAGS: 00010216
[  146.960891] RAX: 00000000000002dc RBX: ffff8800a62f0a00 RCX: ffff88003215e4f0
[  146.960891] RDX: 0000000000000074 RSI: ffff88003215e400 RDI: ffff88003215f800
[  146.960891] RBP: ffff88013fc039d0 R08: 0000000000000022 R09: 0000000000000000
[  146.960891] R10: ffff8800a62f0b00 R11: 00000000000005ca R12: ffff88003215f8f0
[  146.960891] R13: 0000000000000000 R14: ffff8800320e4100 R15: 0000000000000074
[  146.960891] FS:  0000000000000000(0000) GS:ffff88013fc00000(0000) knlGS:00000
00000000000
[  146.960891] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  146.960891] CR2: 000000000218c818 CR3: 00000000b1e4e000 CR4: 00000000000407f0
[  146.960891] Stack:
[  146.960891]  ffff8800320e4c00 ffffffffffffffde ffff8800000005ca 0000000000000
022
[  146.960891]  0000000000000040 ffff8800a62f0b00 00000001000005a8 0000000000000
000
[  146.960891]  0000000000000022 00000000000005a8 ffff8800a62f0a00 ffff88003215e
4f0
[  146.960891] Call Trace:
[  146.960891]  <IRQ>
[  146.960891]  [<ffffffff81686612>] udp4_ufo_fragment+0xc2/0x130
[  146.960891]  [<ffffffff8168f372>] inet_gso_segment+0x132/0x360
[  146.960891]  [<ffffffff8161fbe6>] ? dev_hard_start_xmit+0x316/0x5c0
[  146.960891]  [<ffffffff8161f5cb>] skb_mac_gso_segment+0x9b/0x170
[  146.960891]  [<ffffffff8161f700>] __skb_gso_segment+0x60/0xc0
[  147.148004]  [<ffffffffa01a7231>] queue_gso_packets+0x41/0x1b0 [openvswitch]
[  147.148004]  [<ffffffff81616ab3>] ? skb_flow_dissect+0x173/0x430
[  147.148004]  [<ffffffff81616d90>] ? __skb_get_hash+0x20/0x140
[  147.148004]  [<ffffffffa01a7c6e>] ovs_dp_upcall+0x2e/0x70 [openvswitch]      [  147.148004]  [<ffffffffa01a7d9e>] ovs_dp_process_received_packet+0xee/0x110 [
openvswitch]
[  147.148004]  [<ffffffffa01adaea>] ovs_vport_receive+0x2a/0x30 [openvswitch]
[  147.148004]  [<ffffffffa01ae381>] netdev_frame_hook+0xc1/0x120 [openvswitch]
[  147.148004]  [<ffffffff8161daf2>] __netif_receive_skb_core+0x1b2/0x790
[  147.148004]  [<ffffffff8161e0e8>] __netif_receive_skb+0x18/0x60
[  147.148004]  [<ffffffff8161e153>] netif_receive_skb_internal+0x23/0x90
[  147.148004]  [<ffffffff81685e5b>] ? udp_gro_complete+0x5b/0x70
[  147.148004]  [<ffffffff8161e2d4>] napi_gro_complete+0xa4/0xe0
[  147.148004]  [<ffffffff8161e82f>] dev_gro_receive+0x1ef/0x2f0
[  147.148004]  [<ffffffff8161ec1c>] napi_gro_receive+0x2c/0xf0
[  147.148004]  [<ffffffffa001f28f>] sky2_poll+0x78f/0xd70 [sky2]
[  147.148004]  [<ffffffff8161e542>] net_rx_action+0x152/0x250
[  147.148004]  [<ffffffff81070925>] __do_softirq+0xf5/0x2e0
[  147.148004]  [<ffffffff81070de5>] irq_exit+0x105/0x110
[  147.148004]  [<ffffffff81724a78>] do_IRQ+0x58/0xf0
[  147.148004]  [<ffffffff817228ed>] common_interrupt+0x6d/0x6d
[  147.148004]  <EOI>
[  147.148004]  [<ffffffff810c9578>] ? rcu_nocb_kthread+0x158/0x510
[  147.148004]  [<ffffffff81055056>] ? native_safe_halt+0x6/0x10
[  147.148004]  [<ffffffff8101d1af>] default_idle+0x1f/0xc0
[  147.148004]  [<ffffffff8101db1f>] arch_cpu_idle+0xf/0x20
[  147.148004]  [<ffffffff810aadcd>] cpu_startup_entry+0x30d/0x340
[  147.148004]  [<ffffffff8170b307>] rest_init+0x77/0x80
[  147.148004]  [<ffffffff81d37084>] start_kernel+0x42f/0x43a
[  147.148004]  [<ffffffff81d36a4e>] ? set_init_arg+0x53/0x53
[  147.148004]  [<ffffffff81d36120>] ? early_idt_handlers+0x120/0x120
[  147.148004]  [<ffffffff81d365ee>] x86_64_start_reservations+0x2a/0x2c
[  147.148004]  [<ffffffff81d36733>] x86_64_start_kernel+0x143/0x152
[  147.148004] Code: 24 70 4c 8b 54 24 58 44 8b 44 24 68 44 8b 4c 24 28 48 8b 4c
 24 10 44 8b 5c 24 04 8b 14 24 0f 85 19 fd ff ff e9 10 fd ff ff 0f 0b <0f> 0b 0f
 0b c6 44 24 33 01 e9 92 f7 ff ff e8 33 ad 10 00 0f 0b
[  147.148004] RIP  [<ffffffff816117d2>] skb_segment+0x962/0x990
[  147.148004]  RSP <ffff88013fc03908>


> drivers/net/vxlan.c      | 2 ++
> include/net/udp_tunnel.h | 9 +++++++++
> net/ipv4/fou.c           | 2 ++
> 3 files changed, 13 insertions(+)
>
>diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
>index ca30982..cfb892b 100644
>--- a/drivers/net/vxlan.c
>+++ b/drivers/net/vxlan.c
>@@ -621,6 +621,8 @@ static int vxlan_gro_complete(struct sk_buff *skb, int nhoff)
> 	int vxlan_len  = sizeof(struct vxlanhdr) + sizeof(struct ethhdr);
> 	int err = -ENOSYS;
> 
>+	udp_tunnel_gro_complete(skb, nhoff);
>+
> 	eh = (struct ethhdr *)(skb->data + nhoff + sizeof(struct vxlanhdr));
> 	type = eh->h_proto;
> 
>diff --git a/include/net/udp_tunnel.h b/include/net/udp_tunnel.h
>index a47790b..2a50a70 100644
>--- a/include/net/udp_tunnel.h
>+++ b/include/net/udp_tunnel.h
>@@ -100,6 +100,15 @@ static inline struct sk_buff *udp_tunnel_handle_offloads(struct sk_buff *skb,
> 	return iptunnel_handle_offloads(skb, udp_csum, type);
> }
> 
>+static inline void udp_tunnel_gro_complete(struct sk_buff *skb, int nhoff)
>+{
>+	struct udphdr *uh;
>+
>+	uh = (struct udphdr *)(skb->data + nhoff - sizeof(struct udphdr));
>+	skb_shinfo(skb)->gso_type |= uh->check ?
>+				SKB_GSO_UDP_TUNNEL_CSUM : SKB_GSO_UDP_TUNNEL;
>+}
>+
> static inline void udp_tunnel_encap_enable(struct socket *sock)
> {
> #if IS_ENABLED(CONFIG_IPV6)
>diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
>index 32e7892..606c520 100644
>--- a/net/ipv4/fou.c
>+++ b/net/ipv4/fou.c
>@@ -133,6 +133,8 @@ static int fou_gro_complete(struct sk_buff *skb, int nhoff)
> 	int err = -ENOSYS;
> 	const struct net_offload **offloads;
> 
>+	udp_tunnel_gro_complete(skb, nhoff);
>+
> 	rcu_read_lock();
> 	offloads = NAPI_GRO_CB(skb)->is_ipv6 ? inet6_offloads : inet_offloads;
> 	ops = rcu_dereference(offloads[proto]);
>-- 
>1.9.1

---
	-Jay Vosburgh, jay.vosburgh@canonical.com

^ permalink raw reply

* Re: [PATCH 1/4] inet: Add skb_copy_datagram_iter
From: Herbert Xu @ 2014-11-08  2:21 UTC (permalink / raw)
  To: Al Viro; +Cc: David Miller, netdev, linux-kernel, bcrl
In-Reply-To: <20141107234253.GE7996@ZenIV.linux.org.uk>

On Fri, Nov 07, 2014 at 11:42:53PM +0000, Al Viro wrote:
> I'll finish RTFS drivers/vhost and if it turns out to be OK I'll post the
> series moving those checks to the moment of copying iovec from userland,
> so that kernel-side we could always rely on ->msg_iov elements having been
> verified.

Great thanks!

Dave, please hold off on my skb_copy_datagram_iter series until
this verify_iovec change is added.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH net] udptunnel: Add SKB_GSO_UDP_TUNNEL during gro_complete.
From: Jesse Gross @ 2014-11-08  3:54 UTC (permalink / raw)
  To: Jay Vosburgh; +Cc: David Miller, netdev
In-Reply-To: <28743.1415411491@famine>

On Fri, Nov 7, 2014 at 5:51 PM, Jay Vosburgh <jay.vosburgh@canonical.com> wrote:
> Jesse Gross <jesse@nicira.com> wrote:
>
>>When doing GRO processing for UDP tunnels, we never add
>>SKB_GSO_UDP_TUNNEL to gso_type - only the type of the inner protocol
>>is added (such as SKB_GSO_TCPV4). The result is that if the packet is
>>later resegmented we will do GSO but not treat it as a tunnel. This
>>results in UDP fragmentation and since that is not the original layout
>>of the skb, a panic in skb_segment().
>>
>>Reported-by: Jay Vosburgh <jay.vosburgh@canonical.com>
>>Signed-off-by: Jesse Gross <jesse@nicira.com>
>>---
>>This problem occurs back to 3.14 for VXLAN but the FOU portion needs to
>>be removed for kernels other than the 'net' tree.
>
>         This patch does not resolve the problem when applied to a
> 3.17-ish kernel; the panic message is below, and appears to be
> unchanged.

Hmm, OK, thanks for testing. I think this patch is probably still good
to apply as it may solve a problem after we get this one figured out.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox