Occasional oops with IPSec and IPv6.

* Occasional oops with IPSec and IPv6.
@ 2011-11-17 19:09 Nick Bowler
  2011-11-18 16:27 ` Nick Bowler
  0 siblings, 1 reply; 10+ messages in thread
From: Nick Bowler @ 2011-11-17 19:09 UTC
  To: netdev; +Cc: David S. Miller, Timo Teras

Hi folks,

One of the tests we do with IPsec involves sending and receiving UDP
datagrams of all sizes from 1 to N bytes, where N is much larger than
the MTU.  In this particular instance, the MTU is 1500 bytes and N is
10000 bytes.  This test works fine with IPv4, but I'm getting an
occasional oops on Linus' master with IPv6 (output at end of email).  We
also run the same test where N is less than the MTU, and it does not
trigger this issue.  The resulting fallout seems to eventually lock up
the box (although it continues to work for a little while afterwards).

The issue appears timing related, and it doesn't always occur.  This
probably also explains why I've not seen this issue before now, as we
recently upgraded all our lab systems to machines from this century
(with newfangled dual core processors).  This also makes it somewhat
hard to reproduce, but I can trigger it pretty reliably by running 'yes'
in an ssh session (which doesn't use IPsec) while running the test:
it'll usually trigger in 2 or 3 runs.  The choice of cipher suite
appears to be irrelevant.

I built a relatively old kernel (2.6.34) and could not reproduce the
issue there, so I ran a git bisect.  It pointed to the following, which
(unsurprisingly) no longer reverts cleanly.

Let me know if you need any more info.  I'll see if I can reproduce the
issue with a smaller test case...

  80c802f3073e84c956846e921e8a0b02dfa3755f is the first bad commit
  commit 80c802f3073e84c956846e921e8a0b02dfa3755f
  Author: Timo Teräs <timo.teras@iki.fi>
  Date:   Wed Apr 7 00:30:05 2010 +0000
  
      xfrm: cache bundles instead of policies for outgoing flows
      
      __xfrm_lookup() is called for each packet transmitted out of
      system. The xfrm_find_bundle() does a linear search which can
      kill system performance depending on how many bundles are
      required per policy.
      
      This modifies __xfrm_lookup() to store bundles directly in
      the flow cache. If we did not get a hit, we just create a new
      bundle instead of doing slow search. This means that we can now
      get multiple xfrm_dst's for same flow (on per-cpu basis).
      
      Signed-off-by: Timo Teras <timo.teras@iki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  
  :040000 040000 d8e60f5fa4c1329f450d9c7cdf98b34e6a177f22 9f576e68e5bf4ce357d7f0305aee5f410250dfe2 M	include
  :040000 040000 f2876df688ee36907af7b4123eea96592faaed3e a3f6f6f94f0309106856cd99b38ec90b024eb016 M	net

  [  138.024462] skb_under_panic: text:f83aff05 len:1470 put:14 head:f2ee4800 data:f2ee47fa tail:0xf2ee4db8 end:0xf2ee4f40 dev:p10p1
  [  138.036298] ------------[ cut here ]------------
  [  138.037077] kernel BUG at net/core/skbuff.c:147!
  [  138.037077] invalid opcode: 0000 [#1] PREEMPT SMP 
  [  138.037077] Modules linked in: authenc esp6 xfrm6_mode_transport deflate zlib_deflate ctr twofish_generic twofish_common camellia serpent blowfish_generic blowfish_common cast5 des_generic cbc xcbc rmd160 sha512_generic sha256_generic sha1_generic md5 hmac crypto_null af_key nfs lockd auth_rpcgss sunrpc rng_core ip6table_filter ip6_tables iptable_filter ip_tables x_tables psmouse sg r8169 mii evdev button ipv6 autofs4 ehci_hcd sd_mod ohci_hcd usbcore usb_common radeon ttm drm_kms_helper drm backlight i2c_algo_bit cfbcopyarea cfbimgblt cfbfillrect [last unloaded: scsi_wait_scan]
  [  138.067337] 
  [  138.067337] Pid: 2846, comm: udp_scan Not tainted 3.2.0-rc2-00043-gaa1b052 #53 System manufacturer System Product Name/M4A785T-M
  [  138.067337] EIP: 0060:[<c11ff3d7>] EFLAGS: 00010246 CPU: 0
  [  138.067337] EIP is at skb_push+0x52/0x5b
  [  138.067337] EAX: 00000089 EBX: f3abf000 ECX: 00000080 EDX: 00000003
  [  138.067337] ESI: f3abf000 EDI: f2ee4808 EBP: f2655b70 ESP: f2655b44
  [  138.067337]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
  [  138.067337] Process udp_scan (pid: 2846, ti=f2654000 task=f2683420 task.ti=f2654000)
  [  138.067337] Stack:
  [  138.067337]  c13a2802 f83aff05 000005be 0000000e f2ee4800 f2ee47fa f2ee4db8 f2ee4f40
  [  138.067337]  f3abf000 f2725780 f4454d18 f2655b90 f83aff05 f4454d08 00000000 f4454c00
  [  138.067337]  f27256c0 f3dfc528 f26fec38 f2655be4 f83b0f58 00000201 f2655ba4 00000046
  [  138.067337] Call Trace:
  [  138.067337]  [<f83aff05>] ? ip6_finish_output2+0x26c/0x31a [ipv6]
  [  138.067337]  [<f83aff05>] ip6_finish_output2+0x26c/0x31a [ipv6]
  [  138.067337]  [<f83b0f58>] ip6_fragment+0x3b4/0x941 [ipv6]
  [  138.067337]  [<f83afc99>] ? NF_HOOK.constprop.4+0x30/0x30 [ipv6]
  [  138.067337]  [<f83b1524>] ip6_finish_output+0x3f/0x4c [ipv6]
  [  138.067337]  [<f83b15e9>] ip6_output+0xb8/0xc0 [ipv6]
  [  138.067337]  [<c1252241>] xfrm_output_resume+0x75/0x2c5
  [  138.067337]  [<c125249e>] xfrm_output2+0xd/0xf
  [  138.067337]  [<c1252533>] xfrm_output+0x93/0x9c
  [  138.067337]  [<f83cdb5e>] xfrm6_output_finish+0x13/0x15 [ipv6]
  [  138.067337]  [<f83cda4b>] __xfrm6_output+0x108/0x10d [ipv6]
  [  138.067337]  [<f83cdba7>] xfrm6_output+0x47/0x4c [ipv6]
  [  138.067337]  [<f83af7b4>] dst_output+0x12/0x15 [ipv6]
  [  138.067337]  [<f83b036a>] ip6_local_out+0x17/0x1a [ipv6]
  [  138.067337]  [<f83b2283>] ip6_push_pending_frames+0x2a4/0x346 [ipv6]
  [  138.067337]  [<f83bf055>] udp_v6_push_pending_frames+0x213/0x271 [ipv6]
  [  138.067337]  [<f83bfea4>] ? udpv6_sendmsg+0x68d/0x832 [ipv6]
  [  138.067337]  [<f83bfec6>] udpv6_sendmsg+0x6af/0x832 [ipv6]
  [  138.067337]  [<c123ffc4>] ? ip_fast_csum+0x30/0x30
  [  138.067337]  [<c1240500>] inet_sendmsg+0x4e/0x57
  [  138.067337]  [<c11f8f0e>] sock_sendmsg+0xbe/0xd9
  [  138.067337]  [<c10542df>] ? mark_lock+0x26/0x1ea
  [  138.067337]  [<c10542df>] ? mark_lock+0x26/0x1ea
  [  138.067337]  [<c10548e7>] ? __lock_acquire+0x444/0xb17
  [  138.067337]  [<c10acd97>] ? fget_light+0x28/0x7c
  [  138.067337]  [<c11fa362>] sys_sendto+0xb1/0xcd
  [  138.067337]  [<c10548e7>] ? __lock_acquire+0x444/0xb17
  [  138.067337]  [<c1021085>] ? __wake_up+0x15/0x3b
  [  138.067337]  [<c10d2f0f>] ? fsnotify+0x64/0x208
  [  138.067337]  [<c102866b>] ? get_parent_ip+0xb/0x31
  [  138.067337]  [<c1055038>] ? lock_release_non_nested+0x7e/0x1bb
  [  138.067337]  [<c11fa396>] sys_send+0x18/0x1a
  [  138.067337]  [<c11fa99f>] sys_socketcall+0xce/0x19a
  [  138.067337]  [<c11508f0>] ? trace_hardirqs_on_thunk+0xc/0x10
  [  138.067337]  [<c12717d0>] sysenter_do_call+0x12/0x36
  [  138.067337] Code: c1 85 f6 0f 45 de 53 ff b1 98 00 00 00 ff b1 94 00 00 00 50 ff b1 9c 00 00 00 52 ff 71 50 ff 75 04 68 02 28 3a c1 e8 86 c7 06 00 <0f> 0b 8d 65 f8 5b 5e 5d c3 55 89 c1 89 e5 56 53 83 79 54 00 8b 
  [  138.067337] EIP: [<c11ff3d7>] skb_push+0x52/0x5b SS:ESP 0068:f2655b44
  [  138.398457] ---[ end trace cb87617e5ef07196 ]---
  [  138.404512] BUG: sleeping function called from invalid context at kernel/rwsem.c:21
  [  138.412662] in_atomic(): 0, irqs_disabled(): 0, pid: 2846, name: udp_scan
  [  138.420076] INFO: lockdep is turned off.
  [  138.424721] Pid: 2846, comm: udp_scan Tainted: G      D      3.2.0-rc2-00043-gaa1b052 #53
  [  138.433542] Call Trace:
  [  138.436387]  [<c10307b1>] ? console_unlock+0x1b6/0x1c9
  [  138.442035]  [<c1024dbd>] __might_sleep+0xe2/0xe9
  [  138.447249]  [<c127009f>] down_read+0x17/0x3b
  [  138.452139]  [<c105fc85>] acct_collect+0x39/0x134
  [  138.457431]  [<c1032c08>] do_exit+0x188/0x5de
  [  138.462369]  [<c1031464>] ? kmsg_dump+0xdf/0xe7
  [  138.467328]  [<c1004737>] oops_end+0x92/0x9a
  [  138.472238]  [<c1004868>] die+0x51/0x59
  [  138.476546]  [<c1002626>] do_trap+0x89/0xa2
  [  138.481264]  [<c1002776>] ? do_bounds+0x52/0x52
  [  138.486308]  [<c10027e7>] do_invalid_op+0x71/0x7b
  [  138.491727]  [<c11ff3d7>] ? skb_push+0x52/0x5b
  [  138.496685]  [<c12710a0>] ? restore_all+0xf/0xf
  [  138.501659]  [<c10307b1>] ? console_unlock+0x1b6/0x1c9
  [  138.507479]  [<c102369b>] ? need_resched+0x14/0x1e
  [  138.512845]  [<c126f34f>] ? preempt_schedule+0x40/0x46
  [  138.518685]  [<c1030c19>] ? vprintk+0x390/0x3ae
  [  138.523751]  [<c1052d01>] ? trace_hardirqs_off_caller+0x2e/0x86
  [  138.530302]  [<c1150900>] ? trace_hardirqs_off_thunk+0xc/0x10
  [  138.536680]  [<c1271563>] error_code+0x5f/0x64
  [  138.541625]  [<c1002776>] ? do_bounds+0x52/0x52
  [  138.546741]  [<c11ff3d7>] ? skb_push+0x52/0x5b
  [  138.551812]  [<f83aff05>] ? ip6_finish_output2+0x26c/0x31a [ipv6]
  [  138.558652]  [<f83aff05>] ip6_finish_output2+0x26c/0x31a [ipv6]
  [  138.565308]  [<f83b0f58>] ip6_fragment+0x3b4/0x941 [ipv6]
  [  138.571373]  [<f83afc99>] ? NF_HOOK.constprop.4+0x30/0x30 [ipv6]
  [  138.578173]  [<f83b1524>] ip6_finish_output+0x3f/0x4c [ipv6]
  [  138.584534]  [<f83b15e9>] ip6_output+0xb8/0xc0 [ipv6]
  [  138.590172]  [<c1252241>] xfrm_output_resume+0x75/0x2c5
  [  138.596199]  [<c125249e>] xfrm_output2+0xd/0xf
  [  138.601362]  [<c1252533>] xfrm_output+0x93/0x9c
  [  138.606581]  [<f83cdb5e>] xfrm6_output_finish+0x13/0x15 [ipv6]
  [  138.613283]  [<f83cda4b>] __xfrm6_output+0x108/0x10d [ipv6]
  [  138.619672]  [<f83cdba7>] xfrm6_output+0x47/0x4c [ipv6]
  [  138.625676]  [<f83af7b4>] dst_output+0x12/0x15 [ipv6]
  [  138.631628]  [<f83b036a>] ip6_local_out+0x17/0x1a [ipv6]
  [  138.637749]  [<f83b2283>] ip6_push_pending_frames+0x2a4/0x346 [ipv6]
  [  138.644714]  [<f83bf055>] udp_v6_push_pending_frames+0x213/0x271 [ipv6]
  [  138.652186]  [<f83bfea4>] ? udpv6_sendmsg+0x68d/0x832 [ipv6]
  [  138.658621]  [<f83bfec6>] udpv6_sendmsg+0x6af/0x832 [ipv6]
  [  138.665021]  [<c123ffc4>] ? ip_fast_csum+0x30/0x30
  [  138.670635]  [<c1240500>] inet_sendmsg+0x4e/0x57
  [  138.676069]  [<c11f8f0e>] sock_sendmsg+0xbe/0xd9
  [  138.681502]  [<c10542df>] ? mark_lock+0x26/0x1ea
  [  138.686811]  [<c10542df>] ? mark_lock+0x26/0x1ea
  [  138.692188]  [<c10548e7>] ? __lock_acquire+0x444/0xb17
  [  138.698257]  [<c10acd97>] ? fget_light+0x28/0x7c
  [  138.703692]  [<c11fa362>] sys_sendto+0xb1/0xcd
  [  138.708962]  [<c10548e7>] ? __lock_acquire+0x444/0xb17
  [  138.714935]  [<c1021085>] ? __wake_up+0x15/0x3b
  [  138.720165]  [<c10d2f0f>] ? fsnotify+0x64/0x208
  [  138.725623]  [<c102866b>] ? get_parent_ip+0xb/0x31
  [  138.731295]  [<c1055038>] ? lock_release_non_nested+0x7e/0x1bb
  [  138.737980]  [<c11fa396>] sys_send+0x18/0x1a
  [  138.743113]  [<c11fa99f>] sys_socketcall+0xce/0x19a
  [  138.748806]  [<c11508f0>] ? trace_hardirqs_on_thunk+0xc/0x10
  [  138.755407]  [<c12717d0>] sysenter_do_call+0x12/0x36
  [  198.038028] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 1, t=60002 jiffies)
  [  198.039017] INFO: Stall ended before state dump start

Thanks,
-- 
Nick Bowler, Elliptic Technologies (http://www.elliptictech.com/)

* Re: Occasional oops with IPSec and IPv6.
  2011-11-17 19:09 Occasional oops with IPSec and IPv6 Nick Bowler
@ 2011-11-18 16:27 ` Nick Bowler
  2011-11-18 16:39   ` Eric Dumazet
  0 siblings, 1 reply; 10+ messages in thread
From: Nick Bowler @ 2011-11-18 16:27 UTC
  To: netdev; +Cc: David S. Miller, Timo Teras

On 2011-11-17 14:09 -0500, Nick Bowler wrote:
> One of the tests we do with IPsec involves sending and receiving UDP
> datagrams of all sizes from 1 to N bytes, where N is much larger than
> the MTU.  In this particular instance, the MTU is 1500 bytes and N is
> 10000 bytes.  This test works fine with IPv4, but I'm getting an
> occasional oops on Linus' master with IPv6 (output at end of email).  We
> also run the same test where N is less than the MTU, and it does not
> trigger this issue.  The resulting fallout seems to eventually lock up
> the box (although it continues to work for a little while afterwards).
> 
> The issue appears timing related, and it doesn't always occur.  This
> probably also explains why I've not seen this issue before now, as we
> recently upgraded all our lab systems to machines from this century
> (with newfangled dual core processors).  This also makes it somewhat
> hard to reproduce, but I can trigger it pretty reliably by running 'yes'
> in an ssh session (which doesn't use IPsec) while running the test:
> it'll usually trigger in 2 or 3 runs.  The choice of cipher suite
> appears to be irrelevant.
> 
> I built a relatively old kernel (2.6.34) and could not reproduce the
> issue there, so I ran a git bisect.  It pointed to the following, which
> (unsurprisingly) no longer reverts cleanly.
> 
> Let me know if you need any more info.  I'll see if I can reproduce the
> issue with a smaller test case...

OK, here's a somewhat straightforward way to reproduce it that I've
found.  It uses a short test program called "udp_burst" which simply
transmits a bunch of UDP datagrams at all sizes between 1 and 10000,
included at the end of this mail.

 * Build the test program

    % gcc -o udp_burst udp_burst.c

 * Set up transport-mode IPv6 SAs between two hosts so that they can
   communicate using IPsec.  Choose your favourite cipher suite.
   In this example, my two hosts are "fec0::3/64" and "fec0::2/64": I
   will be crashing the former.

   It can be reproduced with just one host transmitting to the bit
   bucket, but it seems to go much faster with two.

 * Create some constant non-IPsec network traffic on the machine to be
   crashed (for example, log in via SSH and run "yes").
 
 * On the machine to be crashed, run

    % while :; do ./udp_burst remote; done

   where remote is the other host (fec0::2 in my case).
 
 * Wait a few seconds and watch the fireworks.

% cat >udp_burst.c <<'EOF'
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

#define MAX_DGRAM_SIZE 10000

static char buf[MAX_DGRAM_SIZE];

int main(int argc, char **argv)
{
	char *addr = NULL, *port = "9000";
	struct addrinfo *info, hints = {
		.ai_family   = AF_UNSPEC,
		.ai_socktype = SOCK_DGRAM,
		.ai_flags    = AI_PASSIVE,
	};
	int i, rc, sock;

	if (argc > 1)
		addr = argv[1];
	if (argc > 2)
		port = argv[2];
	if (!addr) {
		fprintf(stderr, "usage: %s addr [port]\n", argv[0]);
		return EXIT_FAILURE;
	}

	rc = getaddrinfo(addr, port, &hints, &info);
	if (rc != 0) {
		fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(rc));
		return EXIT_FAILURE;
	}

	sock = socket(info->ai_family, info->ai_socktype, info->ai_protocol);
	if (sock == -1) {
		perror("socket");
		return EXIT_FAILURE;
	}

	if (connect(sock, info->ai_addr, info->ai_addrlen) == -1) {
		perror("connect");
		return EXIT_FAILURE;
	}

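	/*
	 * Send one datagram of each size from 1 to MAX_DGRAM_SIZE without
	 * blocking.  EAGAIN (send buffer full) and ECONNREFUSED (nothing
	 * listening on the remote port) are expected and ignored.
	 */
	for (i = 0; i < MAX_DGRAM_SIZE; i++) {
		if (send(sock, buf, i+1, MSG_DONTWAIT) == -1) {
			if (errno != EAGAIN && errno != ECONNREFUSED) {
				perror("send");
			}
		}
	}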

	return 0;
}
EOF

Cheers,
-- 
Nick Bowler, Elliptic Technologies (http://www.elliptictech.com/)

* Re: Occasional oops with IPSec and IPv6.
  2011-11-18 16:27 ` Nick Bowler
@ 2011-11-18 16:39   ` Eric Dumazet
  2011-11-18 18:27     ` Timo Teräs
  0 siblings, 1 reply; 10+ messages in thread
From: Eric Dumazet @ 2011-11-18 16:39 UTC
  To: Nick Bowler; +Cc: netdev, David S. Miller, Timo Teras

On Friday, 18 November 2011 at 11:27 -0500, Nick Bowler wrote:
> On 2011-11-17 14:09 -0500, Nick Bowler wrote:
> > One of the tests we do with IPsec involves sending and receiving UDP
> > datagrams of all sizes from 1 to N bytes, where N is much larger than
> > the MTU.  In this particular instance, the MTU is 1500 bytes and N is
> > 10000 bytes.  This test works fine with IPv4, but I'm getting an
> > occasional oops on Linus' master with IPv6 (output at end of email).  We
> > also run the same test where N is less than the MTU, and it does not
> > trigger this issue.  The resulting fallout seems to eventually lock up
> > the box (although it continues to work for a little while afterwards).
> > 
> > The issue appears timing related, and it doesn't always occur.  This
> > probably also explains why I've not seen this issue before now, as we
> > recently upgraded all our lab systems to machines from this century
> > (with newfangled dual core processors).  This also makes it somewhat
> > hard to reproduce, but I can trigger it pretty reliably by running 'yes'
> > in an ssh session (which doesn't use IPsec) while running the test:
> > it'll usually trigger in 2 or 3 runs.  The choice of cipher suite
> > appears to be irrelevant.
> > 
> > I built a relatively old kernel (2.6.34) and could not reproduce the
> > issue there, so I ran a git bisect.  It pointed to the following, which
> > (unsurprisingly) no longer reverts cleanly.
> > 
> > Let me know if you need any more info.  I'll see if I can reproduce the
> > issue with a smaller test case...
> 
> OK, here's a somewhat straightforward way to reproduce it that I've
> found.  It uses a short test program called "udp_burst" which simply
> transmits a bunch of UDP datagrams at all sizes between 1 and 10000,
> included at the end of this mail.
> 
>  * Build the test program
> 
>     % gcc -o udp_burst udp_burst.c
> 
>  * Set up transport-mode IPv6 SAs between two hosts so that they can
>    communicate using IPsec.  Choose your favourite cipher suite.
>    In this example, my two hosts are "fec0::3/64" and "fec0::2/64": I
>    will be crashing the former.
> 
>    It can be reproduced with just one host transmitting to the bit
>    bucket, but it seems to go much faster with two.
> 
>  * Create some constant non-IPsec network traffic on the machine to be
>    crashed (for example, log in via SSH and run "yes").
>  
>  * On the machine to be crashed, run
> 
>     % while :; do ./udp_burst remote; done
> 
>    where remote is the other host (fec0::2 in my case).
>  
>  * Wait a few seconds and watch the fireworks.
> 
> % cat >udp_burst.c <<'EOF'
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <errno.h>
> #include <sys/types.h>
> #include <sys/socket.h>
> #include <netdb.h>
> 
> #define MAX_DGRAM_SIZE 10000
> 
> static char buf[MAX_DGRAM_SIZE];
> 
> int main(int argc, char **argv)
> {
> 	char *addr = NULL, *port = "9000";
> 	struct addrinfo *info, hints = {
> 		.ai_family   = AF_UNSPEC,
> 		.ai_socktype = SOCK_DGRAM,
> 		.ai_flags    = AI_PASSIVE,
> 	};
> 	int i, rc, sock;
> 
> 	if (argc > 1)
> 		addr = argv[1];
> 	if (argc > 2)
> 		port = argv[2];
> 	if (!addr) {
> 		fprintf(stderr, "usage: %s addr [port]\n", argv[0]);
> 		return EXIT_FAILURE;
> 	}
> 
> 	rc = getaddrinfo(addr, port, &hints, &info);
> 	if (rc != 0) {
> 		fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(rc));
> 		return EXIT_FAILURE;
> 	}
> 
> 	sock = socket(info->ai_family, info->ai_socktype, info->ai_protocol);
> 	if (sock == -1) {
> 		perror("socket");
> 		return EXIT_FAILURE;
> 	}
> 
> 	if (connect(sock, info->ai_addr, info->ai_addrlen) == -1) {
> 		perror("connect");
> 		return EXIT_FAILURE;
> 	}
> 
> 	for (i = 0; i < MAX_DGRAM_SIZE; i++) {
> 		if (send(sock, buf, i+1, MSG_DONTWAIT) == -1) {
> 			if (errno != EAGAIN && errno != ECONNREFUSED) {
> 				perror("send");
> 			}
> 		}
> 	}
> 
> 	return 0;
> }
> EOF
> 

Please note commit 80c802f307 added a known bug, fixed in commit
0b150932197b (xfrm: avoid possible oopse in xfrm_alloc_dst)

Given the complexity of commit 80c802f307, we can assume other bugs
remain to be fixed as well.

Unfortunately, Timo seems unresponsive.

* Re: Occasional oops with IPSec and IPv6.
  2011-11-18 16:39   ` Eric Dumazet
@ 2011-11-18 18:27     ` Timo Teräs
  2011-11-18 19:26       ` Nick Bowler
  0 siblings, 1 reply; 10+ messages in thread
From: Timo Teräs @ 2011-11-18 18:27 UTC
  To: Eric Dumazet; +Cc: Nick Bowler, netdev, David S. Miller

On 11/18/2011 06:39 PM, Eric Dumazet wrote:
> On Friday, 18 November 2011 at 11:27 -0500, Nick Bowler wrote:
>> On 2011-11-17 14:09 -0500, Nick Bowler wrote:
>>> One of the tests we do with IPsec involves sending and receiving UDP
>>> datagrams of all sizes from 1 to N bytes, where N is much larger than
>>> the MTU.  In this particular instance, the MTU is 1500 bytes and N is
>>> 10000 bytes.  This test works fine with IPv4, but I'm getting an
>>> occasional oops on Linus' master with IPv6 (output at end of email).  We
>>> also run the same test where N is less than the MTU, and it does not
>>> trigger this issue.  The resulting fallout seems to eventually lock up
>>> the box (although it continues to work for a little while afterwards).
>>>
>>> The issue appears timing related, and it doesn't always occur.  This
>>> probably also explains why I've not seen this issue before now, as we
>>> recently upgraded all our lab systems to machines from this century
>>> (with newfangled dual core processors).  This also makes it somewhat
>>> hard to reproduce, but I can trigger it pretty reliably by running 'yes'
>>> in an ssh session (which doesn't use IPsec) while running the test:
>>> it'll usually trigger in 2 or 3 runs.  The choice of cipher suite
>>> appears to be irrelevant.
>>>
>>> I built a relatively old kernel (2.6.34) and could not reproduce the
>>> issue there, so I ran a git bisect.  It pointed to the following, which
>>> (unsurprisingly) no longer reverts cleanly.
>>>
>>> Let me know if you need any more info.  I'll see if I can reproduce the
>>> issue with a smaller test case...
>>
>> OK, here's a somewhat straightforward way to reproduce it that I've
>> found.  It uses a short test program called "udp_burst" which simply
>> transmits a bunch of UDP datagrams at all sizes between 1 and 10000,
>> included at the end of this mail.
>>[snip]
> 
> Please note commit 80c802f307 added a known bug, fixed in commit
> 0b150932197b (xfrm: avoid possible oopse in xfrm_alloc_dst)
> 
> Given commit 80c802f307 complexity, we can assume other bugs are to be
> fixed as well.
> 
> Unfortunately, Timo seems unresponsive.

This looks quite different, and I've been trying to figure out what
causes it.  However, the OOPS happens under ip6_fragment(), indicating
that there was not enough allocated headroom (skb underrun).  My initial
thought is that this is an IPv6 bug that just got uncovered by my commit,
especially since the IPv4 side is happy.  But I haven't yet been able to
figure this one out.

Could you also try Herbert's latest patch set:
  [0/6] Replace LL_ALLOCATED_SPACE to allow needed_headroom adjustment

This changes how the headroom is calculated, and *might* fix this issue
too if it's caused by the same SMP race condition which got uncovered by
my other commit earlier.

- Timo

* Re: Occasional oops with IPSec and IPv6.
  2011-11-18 18:27     ` Timo Teräs
@ 2011-11-18 19:26       ` Nick Bowler
  2011-11-18 20:06         ` Timo Teräs
  0 siblings, 1 reply; 10+ messages in thread
From: Nick Bowler @ 2011-11-18 19:26 UTC
  To: Timo Teräs; +Cc: Eric Dumazet, netdev, David S. Miller

On 2011-11-18 20:27 +0200, Timo Teräs wrote:
> On 11/18/2011 06:39 PM, Eric Dumazet wrote:
> > On Friday, 18 November 2011 at 11:27 -0500, Nick Bowler wrote:
> >> On 2011-11-17 14:09 -0500, Nick Bowler wrote:
> >>> One of the tests we do with IPsec involves sending and receiving UDP
> >>> datagrams of all sizes from 1 to N bytes, where N is much larger than
> >>> the MTU.  In this particular instance, the MTU is 1500 bytes and N is
> >>> 10000 bytes.  This test works fine with IPv4, but I'm getting an
> >>> occasional oops on Linus' master with IPv6 (output at end of email).  We
> >>> also run the same test where N is less than the MTU, and it does not
> >>> trigger this issue.  The resulting fallout seems to eventually lock up
> >>> the box (although it continues to work for a little while afterwards).
> >>>
> >>> The issue appears timing related, and it doesn't always occur.  This
> >>> probably also explains why I've not seen this issue before now, as we
> >>> recently upgraded all our lab systems to machines from this century
> >>> (with newfangled dual core processors).  This also makes it somewhat
> >>> hard to reproduce, but I can trigger it pretty reliably by running 'yes'
> >>> in an ssh session (which doesn't use IPsec) while running the test:
> >>> it'll usually trigger in 2 or 3 runs.  The choice of cipher suite
> >>> appears to be irrelevant.
[...]
> > Please note commit 80c802f307 added a known bug, fixed in commit
> > 0b150932197b (xfrm: avoid possible oopse in xfrm_alloc_dst)
> > 
> > Given commit 80c802f307 complexity, we can assume other bugs are to be
> > fixed as well.
[...]
> This looks quite different. And I've been trying to figure out what
> causes this. However, the OOPS happens at ip6_fragment(), indicating
> that there was not enough allocated headroom (skb underrun). My initial
> thought is ipv6 bug that just got uncovered by my commit; especially
> since ipv4 side is happy. But I haven't yet been able to figure this one
> out.
> 
> Could you also try Herbert's latest patch set:
>   [0/6] Replace LL_ALLOCATED_SPACE to allow needed_headroom adjustment
> 
> This changes how the headroom is calculated, and *might* fix this issue
> too if it's caused by the same SMP race condition which got uncovered by
> my other commit earlier.

I applied all six of those patches, but I still see a crash.  However,
the call trace seems to be slightly different.  I've appended the trace
from the run with these patches applied, just in case it's significant.

NOTE: I did not carefully look at the traces of all the crashes I've
triggered.  This particular backtrace could potentially have appeared
before applying these patches and I would not have noticed.

[   45.318137] NET: Registered protocol family 15
[  125.153082] skb_under_panic: text:c1215d1d len:1462 put:14 head:f2ff1000 data:f2ff0ffa tail:0xf2ff15b0 end:0xf2ff1780 dev:p10p1
[  125.165124] ------------[ cut here ]------------
[  125.166001] kernel BUG at net/core/skbuff.c:147!
[  125.166001] invalid opcode: 0000 [#1] PREEMPT SMP 
[  125.166001] Modules linked in: authenc esp6 xfrm6_mode_transport deflate zlib_deflate ctr twofish_generic twofish_common camellia serpent blowfish_generic blowfish_common cast5 des_generic cbc xcbc rmd160 sha512_generic sha256_generic sha1_generic md5 hmac crypto_null af_key nfs lockd auth_rpcgss sunrpc rng_core iptable_filter ip_tables ip6table_filter ip6_tables x_tables psmouse sg r8169 mii evdev button ipv6 autofs4 usbhid ohci_hcd ehci_hcd usbcore usb_common sd_mod radeon ttm drm_kms_helper drm backlight i2c_algo_bit cfbcopyarea cfbimgblt cfbfillrect [last unloaded: scsi_wait_scan]
[  125.196579] 
[  125.196579] Pid: 2792, comm: udp_burst Not tainted 3.2.0-rc2-00115-g8b662f5 #54 System manufacturer System Product Name/M4A785T-M
[  125.196579] EIP: 0060:[<c11ff2af>] EFLAGS: 00010246 CPU: 0
[  125.196579] EIP is at skb_push+0x52/0x5b
[  125.196579] EAX: 00000089 EBX: f39cb000 ECX: 00000080 EDX: 00000003
[  125.196579] ESI: f39cb000 EDI: f39cb000 EBP: f29abb10 ESP: f29abae4
[  125.196579]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[  125.196579] Process udp_burst (pid: 2792, ti=f29aa000 task=f3d8a2c0 task.ti=f29aa000)
[  125.196579] Stack:
[  125.196579]  c13a2756 c1215d1d 000005b6 0000000e f2ff1000 f2ff0ffa f2ff15b0 f2ff1780
[  125.196579]  f39cb000 00000000 f4fcdec0 f29abb28 c1215d1d 000086dd c1215d01 f29a5600
[  125.196579]  00000002 f29abb44 c120ee6d f4fcdec0 00000000 000005a8 f4fcde00 f29a5600
[  125.196579] Call Trace:
[  125.196579]  [<c1215d1d>] ? eth_header+0x1c/0x8b
[  125.196579]  [<c1215d1d>] eth_header+0x1c/0x8b
[  125.196579]  [<c1215d01>] ? eth_rebuild_header+0x53/0x53
[  125.196579]  [<c120ee6d>] dev_hard_header.constprop.12+0x28/0x32
[  125.196579]  [<c120ef74>] neigh_resolve_output+0xfd/0x138
[  125.196579]  [<f838af19>] ip6_finish_output2+0x280/0x31a [ipv6]
[  125.196579]  [<f838bf61>] ip6_fragment+0x3bd/0x939 [ipv6]
[  125.196579]  [<f838ac99>] ? NF_HOOK.constprop.4+0x30/0x30 [ipv6]
[  125.196579]  [<f838c51c>] ip6_finish_output+0x3f/0x4c [ipv6]
[  125.196579]  [<f838c5e1>] ip6_output+0xb8/0xc0 [ipv6]
[  125.196579]  [<c12520f1>] xfrm_output_resume+0x75/0x2c5
[  125.196579]  [<c125234e>] xfrm_output2+0xd/0xf
[  125.196579]  [<c12523e3>] xfrm_output+0x93/0x9c
[  125.196579]  [<f83a8b32>] xfrm6_output_finish+0x13/0x15 [ipv6]
[  125.196579]  [<f83a8a1f>] __xfrm6_output+0x108/0x10d [ipv6]
[  125.196579]  [<f83a8b7b>] xfrm6_output+0x47/0x4c [ipv6]
[  125.196579]  [<f838a7b4>] dst_output+0x12/0x15 [ipv6]
[  125.196579]  [<f838b36a>] ip6_local_out+0x17/0x1a [ipv6]
[  125.196579]  [<f838d27b>] ip6_push_pending_frames+0x2a4/0x346 [ipv6]
[  125.196579]  [<f839a035>] udp_v6_push_pending_frames+0x213/0x271 [ipv6]
[  125.196579]  [<f839ae84>] ? udpv6_sendmsg+0x68d/0x832 [ipv6]
[  125.196579]  [<f839aea6>] udpv6_sendmsg+0x6af/0x832 [ipv6]
[  125.196579]  [<c123fe84>] ? ip_fast_csum+0x30/0x30
[  125.196579]  [<c12403c0>] inet_sendmsg+0x4e/0x57
[  125.196579]  [<c11f8de6>] sock_sendmsg+0xbe/0xd9
[  125.196579]  [<c1052d64>] ? trace_hardirqs_off+0xb/0xd
[  125.196579]  [<c1270f48>] ? restore_all+0xf/0xf
[  125.196579]  [<c1055715>] ? trace_hardirqs_on_caller+0x10e/0x13f
[  125.196579]  [<c10542df>] ? mark_lock+0x26/0x1ea
[  125.196579]  [<c10acdbb>] ? fget_light+0x28/0x7c
[  125.196579]  [<c11fa23a>] sys_sendto+0xb1/0xcd
[  125.196579]  [<c10548e7>] ? __lock_acquire+0x444/0xb17
[  125.196579]  [<c1270bb1>] ? _raw_spin_unlock_irq+0x39/0x45
[  125.196579]  [<c1055038>] ? lock_release_non_nested+0x7e/0x1bb
[  125.196579]  [<c11fa26e>] sys_send+0x18/0x1a
[  125.196579]  [<c11fa877>] sys_socketcall+0xce/0x19a
[  125.196579]  [<c11507c0>] ? trace_hardirqs_on_thunk+0xc/0x10
[  125.196579]  [<c1271650>] sysenter_do_call+0x12/0x36
[  125.196579] Code: c1 85 f6 0f 45 de 53 ff b1 98 00 00 00 ff b1 94 00 00 00 50 ff b1 9c 00 00 00 52 ff 71 50 ff 75 04 68 56 27 3a c1 e8 5a c7 06 00 <0f> 0b 8d 65 f8 5b 5e 5d c3 55 89 c1 89 e5 56 53 83 79 54 00 8b 
[  125.196579] EIP: [<c11ff2af>] skb_push+0x52/0x5b SS:ESP 0068:f29abae4
[  125.544777] ---[ end trace 3ca7fd586035bfb5 ]---
[  125.549588] BUG: sleeping function called from invalid context at kernel/rwsem.c:21
[  125.557655] in_atomic(): 0, irqs_disabled(): 0, pid: 2792, name: udp_burst
[  125.565415] INFO: lockdep is turned off.
[  125.569682] Pid: 2792, comm: udp_burst Tainted: G      D      3.2.0-rc2-00115-g8b662f5 #54
[  125.578640] Call Trace:
[  125.581476]  [<c10307b1>] ? console_unlock+0x1b6/0x1c9
[  125.587209]  [<c1024dbd>] __might_sleep+0xe2/0xe9
[  125.592457]  [<c126ff47>] down_read+0x17/0x3b
[  125.597311]  [<c105fc85>] acct_collect+0x39/0x134
[  125.602749]  [<c1032c08>] do_exit+0x188/0x5de
[  125.607604]  [<c1031464>] ? kmsg_dump+0xdf/0xe7
[  125.612710]  [<c1004737>] oops_end+0x92/0x9a
[  125.617647]  [<c1004868>] die+0x51/0x59
[  125.622008]  [<c1002626>] do_trap+0x89/0xa2
[  125.626665]  [<c1002776>] ? do_bounds+0x52/0x52
[  125.631781]  [<c10027e7>] do_invalid_op+0x71/0x7b
[  125.637157]  [<c11ff2af>] ? skb_push+0x52/0x5b
[  125.642175]  [<c1270f48>] ? restore_all+0xf/0xf
[  125.647256]  [<c10307b1>] ? console_unlock+0x1b6/0x1c9
[  125.653106]  [<c102369b>] ? need_resched+0x14/0x1e
[  125.658517]  [<c126f1f7>] ? preempt_schedule+0x40/0x46
[  125.664271]  [<c1030c19>] ? vprintk+0x390/0x3ae
[  125.669417]  [<c1052d01>] ? trace_hardirqs_off_caller+0x2e/0x86
[  125.675999]  [<c11507d0>] ? trace_hardirqs_off_thunk+0xc/0x10
[  125.682561]  [<c127140b>] error_code+0x5f/0x64
[  125.687553]  [<c1002776>] ? do_bounds+0x52/0x52
[  125.692621]  [<c11ff2af>] ? skb_push+0x52/0x5b
[  125.697723]  [<c1215d1d>] ? eth_header+0x1c/0x8b
[  125.702905]  [<c1215d1d>] eth_header+0x1c/0x8b
[  125.707963]  [<c1215d01>] ? eth_rebuild_header+0x53/0x53
[  125.713945]  [<c120ee6d>] dev_hard_header.constprop.12+0x28/0x32
[  125.720617]  [<c120ef74>] neigh_resolve_output+0xfd/0x138
[  125.726714]  [<f838af19>] ip6_finish_output2+0x280/0x31a [ipv6]
[  125.733397]  [<f838bf61>] ip6_fragment+0x3bd/0x939 [ipv6]
[  125.739483]  [<f838ac99>] ? NF_HOOK.constprop.4+0x30/0x30 [ipv6]
[  125.746261]  [<f838c51c>] ip6_finish_output+0x3f/0x4c [ipv6]
[  125.752772]  [<f838c5e1>] ip6_output+0xb8/0xc0 [ipv6]
[  125.758684]  [<c12520f1>] xfrm_output_resume+0x75/0x2c5
[  125.764729]  [<c125234e>] xfrm_output2+0xd/0xf
[  125.769960]  [<c12523e3>] xfrm_output+0x93/0x9c
[  125.775292]  [<f83a8b32>] xfrm6_output_finish+0x13/0x15 [ipv6]
[  125.781988]  [<f83a8a1f>] __xfrm6_output+0x108/0x10d [ipv6]
[  125.788515]  [<f83a8b7b>] xfrm6_output+0x47/0x4c [ipv6]
[  125.794659]  [<f838a7b4>] dst_output+0x12/0x15 [ipv6]
[  125.800633]  [<f838b36a>] ip6_local_out+0x17/0x1a [ipv6]
[  125.806889]  [<f838d27b>] ip6_push_pending_frames+0x2a4/0x346 [ipv6]
[  125.814176]  [<f839a035>] udp_v6_push_pending_frames+0x213/0x271 [ipv6]
[  125.821792]  [<f839ae84>] ? udpv6_sendmsg+0x68d/0x832 [ipv6]
[  125.828447]  [<f839aea6>] udpv6_sendmsg+0x6af/0x832 [ipv6]
[  125.834931]  [<c123fe84>] ? ip_fast_csum+0x30/0x30
[  125.840522]  [<c12403c0>] inet_sendmsg+0x4e/0x57
[  125.845919]  [<c11f8de6>] sock_sendmsg+0xbe/0xd9
[  125.851343]  [<c1052d64>] ? trace_hardirqs_off+0xb/0xd
[  125.857271]  [<c1270f48>] ? restore_all+0xf/0xf
[  125.862642]  [<c1055715>] ? trace_hardirqs_on_caller+0x10e/0x13f
[  125.869618]  [<c10542df>] ? mark_lock+0x26/0x1ea
[  125.875028]  [<c10acdbb>] ? fget_light+0x28/0x7c
[  125.880431]  [<c11fa23a>] sys_sendto+0xb1/0xcd
[  125.885688]  [<c10548e7>] ? __lock_acquire+0x444/0xb17
[  125.891665]  [<c1270bb1>] ? _raw_spin_unlock_irq+0x39/0x45
[  125.898057]  [<c1055038>] ? lock_release_non_nested+0x7e/0x1bb
[  125.904803]  [<c11fa26e>] sys_send+0x18/0x1a
[  125.909815]  [<c11fa877>] sys_socketcall+0xce/0x19a
[  125.915539]  [<c11507c0>] ? trace_hardirqs_on_thunk+0xc/0x10
[  125.922127]  [<c1271650>] sysenter_do_call+0x12/0x36
[  185.166028] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 1, t=60002 jiffies)
[  185.167017] INFO: Stall ended before state dump start

-- 
Nick Bowler, Elliptic Technologies (http://www.elliptictech.com/)

* Re: Occasional oops with IPSec and IPv6.
  2011-11-18 19:26       ` Nick Bowler
@ 2011-11-18 20:06         ` Timo Teräs
  2011-11-18 20:10           ` David Miller
  2011-11-18 21:21           ` Nick Bowler
  0 siblings, 2 replies; 10+ messages in thread
From: Timo Teräs @ 2011-11-18 20:06 UTC (permalink / raw)
  To: Nick Bowler; +Cc: Eric Dumazet, netdev, David S. Miller

On 11/18/2011 09:26 PM, Nick Bowler wrote:
> On 2011-11-18 20:27 +0200, Timo Teräs wrote:
>> On 11/18/2011 06:39 PM, Eric Dumazet wrote:
>>> On Friday, 18 November 2011 at 11:27 -0500, Nick Bowler wrote:
>>>> On 2011-11-17 14:09 -0500, Nick Bowler wrote:
>>>>> One of the tests we do with IPsec involves sending and receiving UDP
>>>>> datagrams of all sizes from 1 to N bytes, where N is much larger than
>>>>> the MTU.  In this particular instance, the MTU is 1500 bytes and N is
>>>>> 10000 bytes.  This test works fine with IPv4, but I'm getting an
>>>>> occasional oops on Linus' master with IPv6 (output at end of email).  We
>>>>> also run the same test where N is less than the MTU, and it does not
>>>>> trigger this issue.  The resulting fallout seems to eventually lock up
>>>>> the box (although it continues to work for a little while afterwards).
>>>>>
>>>>> The issue appears timing related, and it doesn't always occur.  This
>>>>> probably also explains why I've not seen this issue before now, as we
>>>>> recently upgraded all our lab systems to machines from this century
>>>>> (with newfangled dual core processors).  This also makes it somewhat
>>>>> hard to reproduce, but I can trigger it pretty reliably by running 'yes'
>>>>> in an ssh session (which doesn't use IPsec) while running the test:
>>>>> it'll usually trigger in 2 or 3 runs.  The choice of cipher suite
>>>>> appears to be irrelevant.
> [...]
>>> Please note commit 80c802f307 added a known bug, fixed in commit
>>> 0b150932197b (xfrm: avoid possible oopse in xfrm_alloc_dst)
>>>
>>> Given commit 80c802f307 complexity, we can assume other bugs are to be
>>> fixed as well.
> [...]
>> This looks quite different. And I've been trying to figure out what
>> causes this. However, the OOPS happens at ip6_fragment(), indicating
>> that there was not enough allocated headroom (skb underrun). My initial
>> thought is ipv6 bug that just got uncovered by my commit; especially
>> since ipv4 side is happy. But I haven't yet been able to figure this one
>> out.
>>
>> Could you also try Herbert's latest patch set:
>>   [0/6] Replace LL_ALLOCATED_SPACE to allow needed_headroom adjustment
>>
>> This changes how the headroom is calculated, and *might* fix this issue
>> too if it's caused by the same SMP race condition which got uncovered by
>> my other commit earlier.
> 
> I applied all six of those patches, but I still see a crash.  However,
> the call trace seems to be slightly different.  I've appended the trace
> from the run with these patches applied, just in case it's significant.
> 
> NOTE: I did not carefully look at the traces of all the crashes I've
> triggered.  This particular backtrace could potentially have appeared
> before applying these patches and I would not have noticed.

It's still a headroom underrun.

I'm not too familiar with the relevant IPv6 code, but it seems to be
mostly modelled after the IPv4 side. Looking at the backtrace offset
inside ip6_fragment, I'd say it was taking the "fast path" for
constructing the fragments. So my first guess is that the headroom check
that allows the fast path is not right.

Since the code seems to treat hlen and struct frag_hdr separately,
I'm wondering if the following patch would be in order?

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 1c9bf8b..c35d9fc 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -675,7 +675,7 @@ int ip6_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 			/* Correct geometry. */
 			if (frag->len > mtu ||
 			    ((frag->len & 7) && frag->next) ||
-			    skb_headroom(frag) < hlen)
+			    skb_headroom(frag) < hlen + sizeof(struct frag_hdr))
 				goto slow_path_clean;

 			/* Partially cloned skb? */


Alternatively, we could just run the "slow path" unconditionally with
the test load to see if it fixes the issue. At least that would be a
pretty good test of whether the problem is in the IPv6 fragmentation
code or somewhere else.
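
[Editorial sketch] To make "headroom underrun" concrete, here is a
minimal userspace model of the two operations involved. It is not
kernel code: struct toy_skb and every toy_* name are hypothetical
stand-ins, and TOY_HLEN assumes a bare 40-byte IPv6 header with no
extension headers. It only shows how a push can run past the start of
the buffer and how the existing and proposed checks differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Header sizes; TOY_HLEN assumes an IPv6 header without extension headers. */
#define TOY_HLEN	40u	/* unfragmentable part (hlen in ip6_fragment) */
#define TOY_FRAG_HDR	8u	/* sizeof(struct frag_hdr) */
#define TOY_ETH_HLEN	14u	/* Ethernet header */

/* Toy stand-in for struct sk_buff: only the two pointers that define headroom. */
struct toy_skb {
	unsigned char *head;	/* start of the allocated buffer */
	unsigned char *data;	/* start of the bytes currently in use */
};

/* Bytes free in front of the packet, analogous to skb_headroom(). */
static size_t toy_headroom(const struct toy_skb *skb)
{
	return (size_t)(skb->data - skb->head);
}

/*
 * Analogous to skb_push(), except that an underrun returns NULL
 * instead of triggering the BUG seen in the trace.
 */
static unsigned char *toy_push(struct toy_skb *skb, size_t len)
{
	if (toy_headroom(skb) < len)
		return NULL;
	skb->data -= len;
	return skb->data;
}

/* The existing check: true means the fragment may stay on the fast path. */
static bool toy_check_old(const struct toy_skb *frag)
{
	return toy_headroom(frag) >= TOY_HLEN;
}

/* The check proposed in the patch above. */
static bool toy_check_proposed(const struct toy_skb *frag)
{
	return toy_headroom(frag) >= TOY_HLEN + TOY_FRAG_HDR;
}
```

With 44 bytes of headroom, toy_check_old() accepts the fragment but
pushing the 48 bytes of headers fails, while toy_check_proposed()
rejects it. With exactly 48 bytes both checks pass, yet a later
14-byte link-layer push (the eth_header call in the trace) would still
underrun. That second case illustrates the mechanism only; it is not a
diagnosis of this crash.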

- Timo

* Re: Occasional oops with IPSec and IPv6.
  2011-11-18 20:06         ` Timo Teräs
@ 2011-11-18 20:10           ` David Miller
  2011-11-18 21:21           ` Nick Bowler
  1 sibling, 0 replies; 10+ messages in thread
From: David Miller @ 2011-11-18 20:10 UTC (permalink / raw)
  To: timo.teras; +Cc: nbowler, eric.dumazet, netdev

From: Timo Teräs <timo.teras@iki.fi>
Date: Fri, 18 Nov 2011 22:06:47 +0200

> I'm not too familiar with the relevant IPv6 code, but it seems to be
> mostly modelled after the IPv4 side. Looking at the back trace offset
> inside ip6_fragment, I'd say it was taking the "fast path" for
> constructing the fragments. So first guess is that the headroom check
> for allowing fast path to happen is not right.
> 
> Since the code seems to be treating separately hlen and struct frag_hdr,
> I'm wondering if the following patch would be in place?
> 
> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> index 1c9bf8b..c35d9fc 100644
> --- a/net/ipv6/ip6_output.c
> +++ b/net/ipv6/ip6_output.c
> @@ -675,7 +675,7 @@ int ip6_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
>  			/* Correct geometry. */
>  			if (frag->len > mtu ||
>  			    ((frag->len & 7) && frag->next) ||
> -			    skb_headroom(frag) < hlen)
> +			    skb_headroom(frag) < hlen + sizeof(struct frag_hdr))
>  				goto slow_path_clean;
> 
>  			/* Partially cloned skb? */
> 
> 
> Alternatively, we could just run the "slow path" unconditionally with
> the test load to see if it fixes the issue. At least that'd be pretty
> good test if it's a problem in the ipv6 fragmentation code or something
> else.

This reminds me of the following change from Steffen Klassert in net-next:

commit 299b0767642a65f0c5446ab6d35e6df0daf43d33
Author: Steffen Klassert <steffen.klassert@secunet.com>
Date:   Tue Oct 11 01:43:33 2011 +0000

    ipv6: Fix IPsec slowpath fragmentation problem
    
    ip6_append_data() builds packets based on the mtu from dst_mtu(rt->dst.path).
    On IPsec the effective mtu is lower because we need to add the protocol
    headers and trailers later when we do the IPsec transformations. So after
    the IPsec transformations the packet might be too big, which leads to a
    slowpath fragmentation then. This patch fixes this by building the packets
    based on the lower IPsec mtu from dst_mtu(&rt->dst) and adapts the exthdr
    handling to this.
    
    Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 835c04b..1e20b64 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1193,6 +1193,7 @@ int ip6_append_data(struct sock *sk, int getfrag(void *from, char *to,
 	struct sk_buff *skb;
 	unsigned int maxfraglen, fragheaderlen;
 	int exthdrlen;
+	int dst_exthdrlen;
 	int hh_len;
 	int mtu;
 	int copy;
@@ -1248,7 +1249,7 @@ int ip6_append_data(struct sock *sk, int getfrag(void *from, char *to,
 		np->cork.hop_limit = hlimit;
 		np->cork.tclass = tclass;
 		mtu = np->pmtudisc == IPV6_PMTUDISC_PROBE ?
-		      rt->dst.dev->mtu : dst_mtu(rt->dst.path);
+		      rt->dst.dev->mtu : dst_mtu(&rt->dst);
 		if (np->frag_size < mtu) {
 			if (np->frag_size)
 				mtu = np->frag_size;
@@ -1259,16 +1260,17 @@ int ip6_append_data(struct sock *sk, int getfrag(void *from, char *to,
 		cork->length = 0;
 		sk->sk_sndmsg_page = NULL;
 		sk->sk_sndmsg_off = 0;
-		exthdrlen = rt->dst.header_len + (opt ? opt->opt_flen : 0) -
-			    rt->rt6i_nfheader_len;
+		exthdrlen = (opt ? opt->opt_flen : 0) - rt->rt6i_nfheader_len;
 		length += exthdrlen;
 		transhdrlen += exthdrlen;
+		dst_exthdrlen = rt->dst.header_len;
 	} else {
 		rt = (struct rt6_info *)cork->dst;
 		fl6 = &inet->cork.fl.u.ip6;
 		opt = np->cork.opt;
 		transhdrlen = 0;
 		exthdrlen = 0;
+		dst_exthdrlen = 0;
 		mtu = cork->fragsize;
 	}
 
@@ -1368,6 +1370,8 @@ alloc_new_skb:
 			else
 				alloclen = datalen + fragheaderlen;
 
+			alloclen += dst_exthdrlen;
+
 			/*
 			 * The last fragment gets additional space at tail.
 			 * Note: we overallocate on fragments with MSG_MODE
@@ -1419,9 +1423,9 @@ alloc_new_skb:
 			/*
 			 *	Find where to start putting bytes
 			 */
-			data = skb_put(skb, fraglen);
-			skb_set_network_header(skb, exthdrlen);
-			data += fragheaderlen;
+			data = skb_put(skb, fraglen + dst_exthdrlen);
+			skb_set_network_header(skb, exthdrlen + dst_exthdrlen);
+			data += fragheaderlen + dst_exthdrlen;
 			skb->transport_header = (skb->network_header +
 						 fragheaderlen);
 			if (fraggap) {
@@ -1434,6 +1438,7 @@ alloc_new_skb:
 				pskb_trim_unique(skb_prev, maxfraglen);
 			}
 			copy = datalen - transhdrlen - fraggap;
+
 			if (copy < 0) {
 				err = -EINVAL;
 				kfree_skb(skb);
@@ -1448,6 +1453,7 @@ alloc_new_skb:
 			length -= datalen - fraggap;
 			transhdrlen = 0;
 			exthdrlen = 0;
+			dst_exthdrlen = 0;
 			csummode = CHECKSUM_NONE;
 
 			/*
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 3486f62..6f7824e 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -542,8 +542,7 @@ static int rawv6_push_pending_frames(struct sock *sk, struct flowi6 *fl6,
 		goto out;
 
 	offset = rp->offset;
-	total_len = inet_sk(sk)->cork.base.length - (skb_network_header(skb) -
-						     skb->data);
+	total_len = inet_sk(sk)->cork.base.length;
 	if (offset >= total_len - 1) {
 		err = -EINVAL;
 		ip6_flush_pending_frames(sk);
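
[Editorial sketch] The reasoning in the commit message quoted above can
be restated as simple arithmetic. The functions below are hypothetical
helpers, not kernel APIs, and the overhead figure used in the note
afterwards is an assumed round number for illustration; the kernel's
own MTU calculation is more involved than a plain subtraction.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Simplified effective MTU for a flow that goes through an IPsec
 * transform: whatever the transform adds must be left free.
 */
static int toy_ipsec_mtu(int path_mtu, int xfrm_overhead)
{
	return path_mtu - xfrm_overhead;
}

/*
 * Does a packet that was built up to build_mtu bytes still fit the
 * path once the transform has added xfrm_overhead bytes to it?
 */
static bool toy_fits_after_xfrm(int build_mtu, int xfrm_overhead, int path_mtu)
{
	return build_mtu + xfrm_overhead <= path_mtu;
}
```

Assuming a 1500-byte path MTU and 56 bytes of IPsec overhead, a packet
built to the full 1500 bytes no longer fits after the transform and has
to be fragmented again, whereas one built to toy_ipsec_mtu(1500, 56)
bytes, that is 1444, still fits.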

* Re: Occasional oops with IPSec and IPv6.
  2011-11-18 20:06         ` Timo Teräs
  2011-11-18 20:10           ` David Miller
@ 2011-11-18 21:21           ` Nick Bowler
  2011-11-19  7:36             ` Timo Teräs
  1 sibling, 1 reply; 10+ messages in thread
From: Nick Bowler @ 2011-11-18 21:21 UTC (permalink / raw)
  To: Timo Teräs; +Cc: Eric Dumazet, netdev, David S. Miller

On 2011-11-18 22:06 +0200, Timo Teräs wrote:
> It's still headroom underrun.
> 
> I'm not too familiar with the relevant IPv6 code, but it seems to be
> mostly modelled after the IPv4 side. Looking at the back trace offset
> inside ip6_fragment, I'd say it was taking the "fast path" for
> constructing the fragments. So first guess is that the headroom check
> for allowing fast path to happen is not right.
> 
> Since the code seems to be treating separately hlen and struct frag_hdr,
> I'm wondering if the following patch would be in place?
> 
> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> index 1c9bf8b..c35d9fc 100644
> --- a/net/ipv6/ip6_output.c
> +++ b/net/ipv6/ip6_output.c
> @@ -675,7 +675,7 @@ int ip6_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
>  			/* Correct geometry. */
>  			if (frag->len > mtu ||
>  			    ((frag->len & 7) && frag->next) ||
> -			    skb_headroom(frag) < hlen)
> +			    skb_headroom(frag) < hlen + sizeof(struct frag_hdr))
>  				goto slow_path_clean;
> 
>  			/* Partially cloned skb? */
> 
> 
> Alternatively, we could just run the "slow path" unconditionally with
> the test load to see if it fixes the issue. At least that'd be pretty
> good test if it's a problem in the ipv6 fragmentation code or something
> else.

Good call.  I replaced the "correct geometry" check with an
unconditional "goto slow_path_clean;", and I can no longer reproduce the
crash.  So at the very least, I have a workaround now.  (I still have
Herbert Xu's six patches applied on top of Linus' master).

I then tried the smaller change above, but this does not correct the
issue.

Cheers,
-- 
Nick Bowler, Elliptic Technologies (http://www.elliptictech.com/)

* Re: Occasional oops with IPSec and IPv6.
  2011-11-18 21:21           ` Nick Bowler
@ 2011-11-19  7:36             ` Timo Teräs
  2011-11-21 15:12               ` Nick Bowler
  0 siblings, 1 reply; 10+ messages in thread
From: Timo Teräs @ 2011-11-19  7:36 UTC (permalink / raw)
  To: Nick Bowler; +Cc: Eric Dumazet, netdev, David S. Miller

On 11/18/2011 11:21 PM, Nick Bowler wrote:
> On 2011-11-18 22:06 +0200, Timo Teräs wrote:
>> It's still headroom underrun.
>>
>> I'm not too familiar with the relevant IPv6 code, but it seems to be
>> mostly modelled after the IPv4 side. Looking at the back trace offset
>> inside ip6_fragment, I'd say it was taking the "fast path" for
>> constructing the fragments. So first guess is that the headroom check
>> for allowing fast path to happen is not right.
>>
>> Since the code seems to be treating separately hlen and struct frag_hdr,
>> I'm wondering if the following patch would be in place?
>>
>> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
>> index 1c9bf8b..c35d9fc 100644
>> --- a/net/ipv6/ip6_output.c
>> +++ b/net/ipv6/ip6_output.c
>> @@ -675,7 +675,7 @@ int ip6_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
>>  			/* Correct geometry. */
>>  			if (frag->len > mtu ||
>>  			    ((frag->len & 7) && frag->next) ||
>> -			    skb_headroom(frag) < hlen)
>> +			    skb_headroom(frag) < hlen + sizeof(struct frag_hdr))
>>  				goto slow_path_clean;
>>
>>  			/* Partially cloned skb? */
>>
>>
>> Alternatively, we could just run the "slow path" unconditionally with
>> the test load to see if it fixes the issue. At least that'd be pretty
>> good test if it's a problem in the ipv6 fragmentation code or something
>> else.
> 
> Good call.  I replaced the "correct geometry" check with an
> unconditional "goto slow_path_clean;", and I can no longer reproduce the
> crash.  So at the very least, I have a workaround now.  (I still have
> Herbert Xu's six patches applied on top of Linus' master).

Ok, so it's most likely an IPv6 code issue then. My change just happened
to trigger it.

> I then tried the smaller change above, but this does not correct the
> issue.

That's not it then (likely).

I did notice that the headroom of the main skb is never checked. So my
other suggestion is to try something like:

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 1c9bf8b..735c4dc 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -668,7 +668,8 @@ int ip6_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))

 		if (first_len - hlen > mtu ||
 		    ((first_len - hlen) & 7) ||
-		    skb_cloned(skb))
+		    skb_cloned(skb) ||
+		    skb_headroom(skb) < sizeof(struct frag_hdr))
 			goto slow_path;

 		skb_walk_frags(skb, frag) {

Other than that, I hope some of the IPv6 people can take a look at it.
But the problem is that somewhere a headroom check isn't taking
place, or is checking for too little headroom.
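
[Editorial sketch] For reference, the fast-path eligibility conditions
visible in the two hunks posted in this thread can be collected into
one predicate. This is a userspace model with hypothetical toy_* types,
not the kernel function: it covers only the conditions quoted in the
diffs (the "Partially cloned skb?" test is omitted), treats mtu as the
already-adjusted per-fragment limit, and uses the `extra` flag to switch
on both proposed additional headroom checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_FRAG_HDR 8u	/* sizeof(struct frag_hdr) */

/* One entry of the head skb's frag list. */
struct toy_frag {
	size_t len;
	size_t headroom;
	const struct toy_frag *next;
};

/* The head skb plus its frag list. */
struct toy_pkt {
	size_t first_len;	/* bytes in the head skb; assumed >= hlen */
	size_t headroom;
	bool cloned;
	const struct toy_frag *frags;
};

/* Returns true if the packet may take the fast path, false for the slow path. */
static bool toy_fast_path_ok(const struct toy_pkt *p, size_t hlen,
			     size_t mtu, bool extra)
{
	const struct toy_frag *f;

	/* Checks on the head skb (the hunk at line 668). */
	if (p->first_len - hlen > mtu ||
	    ((p->first_len - hlen) & 7) ||
	    p->cloned)
		return false;
	if (extra && p->headroom < TOY_FRAG_HDR)
		return false;

	/* "Correct geometry" checks on each fragment (the hunk at line 675). */
	for (f = p->frags; f; f = f->next) {
		size_t need = extra ? hlen + TOY_FRAG_HDR : hlen;

		if (f->len > mtu ||
		    ((f->len & 7) && f->next) ||
		    f->headroom < need)
			return false;
	}
	return true;
}
```

A packet whose fragments each have exactly hlen bytes of headroom passes
with extra == false and is sent to the slow path with extra == true. The
reports in this thread say that neither extra check stopped the crash on
its own, so treat the model as a reading aid rather than a fix.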

- Timo

* Re: Occasional oops with IPSec and IPv6.
  2011-11-19  7:36             ` Timo Teräs
@ 2011-11-21 15:12               ` Nick Bowler
  0 siblings, 0 replies; 10+ messages in thread
From: Nick Bowler @ 2011-11-21 15:12 UTC (permalink / raw)
  To: Timo Teräs; +Cc: Eric Dumazet, netdev, David S. Miller

On 2011-11-19 09:36 +0200, Timo Teräs wrote:
> On 11/18/2011 11:21 PM, Nick Bowler wrote:
> > On 2011-11-18 22:06 +0200, Timo Teräs wrote:
> >> Alternatively, we could just run the "slow path" unconditionally with
> >> the test load to see if it fixes the issue. At least that'd be pretty
> >> good test if it's a problem in the ipv6 fragmentation code or something
> >> else.
> > 
> > Good call.  I replaced the "correct geometry" check with an
> > unconditional "goto slow_path_clean;", and I can no longer reproduce the
> > crash.  So at the very least, I have a workaround now.  (I still have
> > Herbert Xu's six patches applied on top of Linus' master).
> 
> Ok, so it's most likely ipv6 code issue then. My change just happened to
> trigger it.
> 
> > I then tried the smaller change above, but this does not correct the
> > issue.
> 
> That's not it then (likely).
> 
> I did notice that the headroom of the main skb is never checked. So my
> other suggestion is to try something like:
> 
> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> index 1c9bf8b..735c4dc 100644
> --- a/net/ipv6/ip6_output.c
> +++ b/net/ipv6/ip6_output.c
> @@ -668,7 +668,8 @@ int ip6_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
> 
>  		if (first_len - hlen > mtu ||
>  		    ((first_len - hlen) & 7) ||
> -		    skb_cloned(skb))
> +		    skb_cloned(skb) ||
> +		    skb_headroom(skb) < sizeof(struct frag_hdr))
>  			goto slow_path;
> 
>  		skb_walk_frags(skb, frag) {

Tried this (still on top of Herbert Xu's patch set) to no avail; the
crash still occurs :(.

Using the unconditional slow path workaround, I ran the test over the
weekend and it did not crash, so that seems to be stable.

> Other than that, I hope some of the ipv6 people could take a look at it.
> But the problem is that somewhere some headroom check isn't taking
> place, or is checking for too little of headroom.

Thanks,
-- 
Nick Bowler, Elliptic Technologies (http://www.elliptictech.com/)

Thread overview: 10+ messages
2011-11-17 19:09 Occasional oops with IPSec and IPv6 Nick Bowler
2011-11-18 16:27 ` Nick Bowler
2011-11-18 16:39   ` Eric Dumazet
2011-11-18 18:27     ` Timo Teräs
2011-11-18 19:26       ` Nick Bowler
2011-11-18 20:06         ` Timo Teräs
2011-11-18 20:10           ` David Miller
2011-11-18 21:21           ` Nick Bowler
2011-11-19  7:36             ` Timo Teräs
2011-11-21 15:12               ` Nick Bowler
