netdev.vger.kernel.org archive mirror
* race in skb_splice_bits?
@ 2008-05-27  0:25 Octavian Purdila
  2008-05-27  2:08 ` Ben Hutchings
  2008-05-27 11:01 ` Evgeniy Polyakov
  0 siblings, 2 replies; 29+ messages in thread
From: Octavian Purdila @ 2008-05-27  0:25 UTC (permalink / raw)
  To: netdev


Hi,

The following socket lock dropping in skb_splice_bits seems to open a race 
condition which causes an invalid kernel access:

>        if (spd.nr_pages) {
>                int ret;
>
>                /*
>                 * Drop the socket lock, otherwise we have reverse
>                 * locking dependencies between sk_lock and i_mutex
>                 * here as compared to sendfile(). We enter here
>                 * with the socket lock held, and splice_to_pipe() will
>                 * grab the pipe inode lock. For sendfile() emulation,
>                 * we call into ->sendpage() with the i_mutex lock held
>                 * and networking will grab the socket lock.
>                 */
>                release_sock(__skb->sk);
>                ret = splice_to_pipe(pipe, &spd);
>                lock_sock(__skb->sk);
>                return ret;
>        }

Setup:

- powerpc, non-SMP, no preemption, 2.6.25
- RX side: LRO enabled, splice from socket to /dev/null
- TX side: MTU set to 128 bytes, GSO enabled, splice from file to socket

The oops, on the RX side:

Unable to handle kernel paging request for data at address 0x00000030
Faulting instruction address: 0x80109ee0
Oops: Kernel access of bad area, sig: 11 [#1]
Ixia TCPX
Modules linked in: almfmanager(P) filtermanager ixnam_llm(P) ixna
m_tcpx(P) hwstate ixllm ixhostm ixsysctl(P) nlproc_driver
NIP: 80109ee0 LR: 80109edc CTR: 8010c52c
REGS: bcd25b90 TRAP: 0300   Tainted: P          (2.6.25-00005-gf7b547d)
MSR: 00009032 <EE,ME,IR,DR>  CR: 24000822  XER: 20000000
DAR: 00000030, DSISR: 40000000
TASK = bfbe1bf0[156] 'splice' THREAD: bcd24000
GPR00: 8010c94c bcd25c40 bfbe1bf0 00000000 00000000 802835f8 00000001 0000004c 
GPR08: 00024000 00000100 00000032 bcd24000 00010dc4 100198b4 390046a8 0a5042f3 
GPR16: 8028238c bd18fe00 00000008 10010000 6fbcbac0 00000000 10001060 bcd25dd8 
GPR24: 8014b520 00000000 bcd25e30 bccefa00 bf33e300 fffffe00 bcd25d70 00000000 
NIP [80109ee0] lock_sock_nested+0x1c/0x50
LR [80109edc] lock_sock_nested+0x18/0x50
Call Trace:
[bcd25c60] [8010c94c] skb_splice_bits+0x130/0x134
[bcd25dc0] [8014b548] tcp_splice_data_recv+0x28/0x38
[bcd25dd0] [8014d08c] tcp_read_sock+0x108/0x1f8
[bcd25e20] [8014b58c] __tcp_splice_read+0x34/0x44
[bcd25e40] [8014b61c] tcp_splice_read+0x80/0x220
[bcd25e90] [80105730] sock_splice_read+0x2c/0x44
[bcd25ea0] [8008a374] do_splice_to+0x90/0xac
[bcd25ed0] [8008a850] do_splice+0x258/0x2f0
[bcd25f10] [8008b1d4] sys_splice+0xe0/0xe8
[bcd25f40] [8000ff14] ret_from_syscall+0x0/0x38
 --- Exception: c01 at 0x10000894
     LR = 0x10000e2c


Analysis:

Printks show that __skb->sk is non-NULL before splice_to_pipe() and NULL after.
Using a hardware watchpoint I was able to see that the write to __skb->sk is
caused by __alloc_skb()'s memset(), which indicates that the __skb is freed
between release_sock() and lock_sock(). Turning on slab debugging together
with the hardware watchpoint shows that the free happens during
tcp_collapse(), which was initiated as a result of a timer interrupt ->
softirq -> NAPI polling -> lro_flush_all().

Commenting out the sequence that drops the socket lock seems to fix the
problem on my setup.

Regards,
tavi

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: race in skb_splice_bits?
  2008-05-27  0:25 race in skb_splice_bits? Octavian Purdila
@ 2008-05-27  2:08 ` Ben Hutchings
  2008-05-27 10:41   ` Octavian Purdila
  2008-05-27 11:01 ` Evgeniy Polyakov
  1 sibling, 1 reply; 29+ messages in thread
From: Ben Hutchings @ 2008-05-27  2:08 UTC (permalink / raw)
  To: Octavian Purdila; +Cc: netdev

Octavian Purdila wrote:
> 
> Hi,
> 
> The following socket lock dropping in skb_splice_bits seems to open a race 
> condition which causes an invalid kernel access:
> 
> >        if (spd.nr_pages) {
> >                int ret;
> >
> >                /*
> >                 * Drop the socket lock, otherwise we have reverse
> >                 * locking dependencies between sk_lock and i_mutex
> >                 * here as compared to sendfile(). We enter here
> >                 * with the socket lock held, and splice_to_pipe() will
> >                 * grab the pipe inode lock. For sendfile() emulation,
> >                 * we call into ->sendpage() with the i_mutex lock held
> >                 * and networking will grab the socket lock.
> >                 */
> >                release_sock(__skb->sk);
> >                ret = splice_to_pipe(pipe, &spd);
> >                lock_sock(__skb->sk);
> >                return ret;
> >        }

Given the previous comment, that certainly looks wrong.

<snip>
> Commenting out the sequence that drops the socket lock seems to fix the
> problem on my setup.

But this could apparently cause deadlock.  Surely the correct fix is
to copy __skb->sk to a local variable before calling splice_to_pipe()
so we can re-lock it?
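
That suggestion would look something like this (untested sketch):

	struct sock *sk = __skb->sk;

	release_sock(sk);
	ret = splice_to_pipe(pipe, &spd);
	lock_sock(sk);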

Ben.

--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.


* Re: race in skb_splice_bits?
  2008-05-27  2:08 ` Ben Hutchings
@ 2008-05-27 10:41   ` Octavian Purdila
  0 siblings, 0 replies; 29+ messages in thread
From: Octavian Purdila @ 2008-05-27 10:41 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: netdev

On Tuesday 27 May 2008, Ben Hutchings wrote:

> > Commenting out the sequence that drops the socket lock seems to fix the
> > problem on my setup.
>
> But this could apparently cause deadlock.  Surely the correct fix is
> to copy __skb->sk to a local variable before calling splice_to_pipe()
> so we can re-lock it?
>

I've tried that, but I think the freed __skb might still be touched later in
tcp_read_sock():

Faulting instruction address: 0x8014d0c0
Oops: Kernel access of bad area, sig: 11 [#1]
Ixia TCPX
Modules linked in: almfmanager(P) filtermanager ixnam_llm(P) ixna
m_tcpx(P) hwstate ixllm ixhostm ixsysctl(P) nlproc_driver
NIP: 8014d0c0 LR: 8014d090 CTR: 8010c52c
REGS: bcd27d20 TRAP: 0300   Tainted: P          (2.6.25-00005-gf7
b547d-dirty)
MSR: 00009032 <EE,ME,IR,DR>  CR: 24000822  XER: 20000000
DAR: 0000000c, DSISR: 40000000
TASK = bd0d7bd0[172] 'splice' THREAD: bcd26000
GPR00: 00000000 bcd27dd0 bd0d7bd0 fffffe00 00000000 802835f8 00000001 0000004c 
GPR08: 00024000 00000000 00000062 bcd26000 0023ac37 100198b4 390046a8 0a5042f3 
GPR16: 8028238c bd18fe00 00000008 10010000 6fd01ac0 00000000 10001060 bcd27dd8 
GPR24: 8014b524 00000000 bcd27e30 bcd07180 0000004c bcd071e4 a7c41e8b bfb3aa60 
NIP [8014d0c0] tcp_read_sock+0x138/0x1f8
LR [8014d090] tcp_read_sock+0x108/0x1f8
Call Trace:
[bcd27dd0] [8014d090] tcp_read_sock+0x108/0x1f8 (unreliable)
[bcd27e20] [8014b590] __tcp_splice_read+0x34/0x44
[bcd27e40] [8014b620] tcp_splice_read+0x80/0x220
[bcd27e90] [80105730] sock_splice_read+0x2c/0x44
[bcd27ea0] [8008a374] do_splice_to+0x90/0xac
[bcd27ed0] [8008a850] do_splice+0x258/0x2f0
[bcd27f10] [8008b1d4] sys_splice+0xe0/0xe8
[bcd27f40] [8000ff14] ret_from_syscall+0x0/0x38
 --- Exception: c01 at 0x10000894
     LR = 0x10000e2c

Thanks,
tavi


* Re: race in skb_splice_bits?
  2008-05-27  0:25 race in skb_splice_bits? Octavian Purdila
  2008-05-27  2:08 ` Ben Hutchings
@ 2008-05-27 11:01 ` Evgeniy Polyakov
  2008-05-27 11:08   ` Ben Hutchings
  1 sibling, 1 reply; 29+ messages in thread
From: Evgeniy Polyakov @ 2008-05-27 11:01 UTC (permalink / raw)
  To: Octavian Purdila; +Cc: netdev

On Tue, May 27, 2008 at 03:25:23AM +0300, Octavian Purdila (opurdila@ixiacom.com) wrote:
> 
> Hi,
> 
> The following socket lock dropping in skb_splice_bits seems to open a race 
> condition which causes an invalid kernel access:
> 
> >        if (spd.nr_pages) {
> >                int ret;
> >
> >                /*
> >                 * Drop the socket lock, otherwise we have reverse
> >                 * locking dependencies between sk_lock and i_mutex
> >                 * here as compared to sendfile(). We enter here
> >                 * with the socket lock held, and splice_to_pipe() will
> >                 * grab the pipe inode lock. For sendfile() emulation,
> >                 * we call into ->sendpage() with the i_mutex lock held
> >                 * and networking will grab the socket lock.
> >                 */

What about sock_hold() here?
That would prevent the socket from being freed, and read/write on it would
immediately return an error if the socket was closed by another thread.

> >                release_sock(__skb->sk);
> >                ret = splice_to_pipe(pipe, &spd);
> >                lock_sock(__skb->sk);

And appropriate sock_put(sk);
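
That is, roughly (untested sketch):

	struct sock *sk = __skb->sk;

	sock_hold(sk);
	release_sock(sk);
	ret = splice_to_pipe(pipe, &spd);
	lock_sock(sk);
	sock_put(sk);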

-- 
	Evgeniy Polyakov


* Re: race in skb_splice_bits?
  2008-05-27 11:01 ` Evgeniy Polyakov
@ 2008-05-27 11:08   ` Ben Hutchings
  2008-05-27 11:52     ` Evgeniy Polyakov
  0 siblings, 1 reply; 29+ messages in thread
From: Ben Hutchings @ 2008-05-27 11:08 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Octavian Purdila, netdev

Evgeniy Polyakov wrote:
> On Tue, May 27, 2008 at 03:25:23AM +0300, Octavian Purdila (opurdila@ixiacom.com) wrote:
> >
> > Hi,
> >
> > The following socket lock dropping in skb_splice_bits seems to open a race
> > condition which causes an invalid kernel access:
> >
> > >        if (spd.nr_pages) {
> > >                int ret;
> > >
> > >                /*
> > >                 * Drop the socket lock, otherwise we have reverse
> > >                 * locking dependencies between sk_lock and i_mutex
> > >                 * here as compared to sendfile(). We enter here
> > >                 * with the socket lock held, and splice_to_pipe() will
> > >                 * grab the pipe inode lock. For sendfile() emulation,
> > >                 * we call into ->sendpage() with the i_mutex lock held
> > >                 * and networking will grab the socket lock.
> > >                 */
>
> What about sock_hold() here?
> That would prevent the socket from being freed, and read/write on it would
> immediately return an error if the socket was closed by another thread.
<snip>

We know the socket isn't going to go away because somewhere up the call
stack someone has to be holding the socket in order to lock it.  However,
the skb may (and evidently sometimes does) go away during splice_to_pipe()
so we can't look up the socket through it.

That said, from Octavian's later mail it seems we can't let the skb go away
at all, so some wider changes seem to be required.

Ben.

--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.


* Re: race in skb_splice_bits?
  2008-05-27 11:08   ` Ben Hutchings
@ 2008-05-27 11:52     ` Evgeniy Polyakov
  2008-05-27 11:56       ` Evgeniy Polyakov
  0 siblings, 1 reply; 29+ messages in thread
From: Evgeniy Polyakov @ 2008-05-27 11:52 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: Octavian Purdila, netdev

On Tue, May 27, 2008 at 12:08:41PM +0100, Ben Hutchings (bhutchings@solarflare.com) wrote:
> However, from Octavian's later mail it seems we can't let the skb go away
> at all.  So some wider changes seem to be required.

Yes, I misread the original message.

Octavian, could you please decode where the bug occurred via gdb:

$ gdb vmlinux
(gdb) l *(skb_splice_bits+0x130)

-- 
	Evgeniy Polyakov


* Re: race in skb_splice_bits?
  2008-05-27 11:52     ` Evgeniy Polyakov
@ 2008-05-27 11:56       ` Evgeniy Polyakov
  2008-05-27 12:53         ` Octavian Purdila
  0 siblings, 1 reply; 29+ messages in thread
From: Evgeniy Polyakov @ 2008-05-27 11:56 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: Octavian Purdila, netdev

On Tue, May 27, 2008 at 03:52:06PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> On Tue, May 27, 2008 at 12:08:41PM +0100, Ben Hutchings (bhutchings@solarflare.com) wrote:
> > However, from Octavian's later mail it seems we can't let the skb go away
> > at all.  So some wider changes seem to be required.
> 
> Yes, I misread the original message.
> 
> Octavian, could you please decode where the bug occurred via gdb:
>
> $ gdb vmlinux
> (gdb) l *(skb_splice_bits+0x130)

Is it lock_sock()?

And also please show what
tcp_read_sock+0x108
tcp_read_sock+0x138

are. And do you have a test case for that?

-- 
	Evgeniy Polyakov


* Re: race in skb_splice_bits?
  2008-05-27 11:56       ` Evgeniy Polyakov
@ 2008-05-27 12:53         ` Octavian Purdila
  2008-05-27 13:21           ` Evgeniy Polyakov
  0 siblings, 1 reply; 29+ messages in thread
From: Octavian Purdila @ 2008-05-27 12:53 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Ben Hutchings, netdev

[-- Attachment #1: Type: text/plain, Size: 2157 bytes --]

On Tuesday 27 May 2008, Evgeniy Polyakov wrote:

> > > However, from Octavian's later mail it seems we can't let the skb go
> > > away at all.  So some wider changes seem to be required.
> >
> > Yes, I misread the original message.
> >
> > Octavian, could you please decode where bug occured via gdb:
> > gdb
> > $ l *(skb_splice_bits+0x130)
>
> Is it lock_sock()?
>

Yes, it is lock_sock(). The crash in lock_sock() occurs with the following
sequence:

release_sock(__skb->sk);
ret = splice_to_pipe(pipe, &spd);
lock_sock(__skb->sk);

0x8010c94c is in skb_splice_bits (include/net/sock.h:837).
832
833     extern void lock_sock_nested(struct sock *sk, int subclass);
834
835     static inline void lock_sock(struct sock *sk)
836     {
837             lock_sock_nested(sk, 0);
838     }
839
840     extern void release_sock(struct sock *sk);
841


> And also please show what
> tcp_read_sock+0x108
> tcp_read_sock+0x138
>
> are.

This crash occurs with the following sequence:

struct sock *sk= __skb->sk;

release_sock(sk);
ret = splice_to_pipe(pipe, &spd);
lock_sock(sk);

0x8014d090 is in tcp_read_sock (net/ipv4/tcp.c:1225).
1220                            used = recv_actor(desc, skb, offset, len);
1221                            if (used < 0) {
1222                                    if (!copied)
1223                                            copied = used;
1224                                    break;
1225                            } else if (used <= len) {
1226                                    seq += used;
1227                                    copied += used;
1228                                    offset += used;
1229                            }

> And do you have a test case for that? 

Yes. I've attached my test program, which triggers both crashes. To improve
the reproducibility rate you need to use a relatively large splice buffer (I
used 16384) and a small MTU (I used 128). This makes the first splice on the
receive side block (due to insufficient PIPE_BUFFERS, it seems), which gives
the race condition plenty of time to manifest itself. When you interrupt the
program, the system will crash.
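
For reference, a rough invocation of the attached program (the port number and
addresses below are placeholders; the server splices the socket to stdout, the
client splices stdin to the socket):

	# RX side (server), stdout redirected to /dev/null:
	./splice -p 5000 > /dev/null
	# TX side (client), reading the input file from stdin:
	./splice -d <server-ip> -p 5000 < big-file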

Thanks,
tavi


[-- Attachment #2: splice.c --]
[-- Type: text/x-csrc, Size: 7932 bytes --]

#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
#include <error.h>
#include <argp.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/epoll.h>
#include <sys/fcntl.h>
#include <sys/wait.h>
#include <stdio.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netdb.h>
#include <net/if.h>
#include <signal.h>
#include <sys/socket.h>
#include <netpacket/packet.h>
#include <net/ethernet.h>     /* the L2 protocols */
#include <assert.h>
#include <netinet/tcp.h>
#include <linux/unistd.h>

#ifdef __PPC__

#undef __syscall_nr
#define __syscall_nr(nr, type, name, args...)                          \
       unsigned long __sc_ret, __sc_err;                               \
       {                                                               \
               register unsigned long __sc_0  __asm__ ("r0");          \
               register unsigned long __sc_3  __asm__ ("r3");          \
               register unsigned long __sc_4  __asm__ ("r4");          \
               register unsigned long __sc_5  __asm__ ("r5");          \
               register unsigned long __sc_6  __asm__ ("r6");          \
               register unsigned long __sc_7  __asm__ ("r7");          \
               register unsigned long __sc_8  __asm__ ("r8");          \
                                                                       \
               __sc_loadargs_##nr(name, args);                         \
               __asm__ __volatile__                                    \
                       ("sc           \n\t"                            \
                        "mfcr %0      "                                \
                       : "=&r" (__sc_0),                               \
                         "=&r" (__sc_3),  "=&r" (__sc_4),              \
                         "=&r" (__sc_5),  "=&r" (__sc_6),              \
                         "=&r" (__sc_7),  "=&r" (__sc_8)               \
                       : __sc_asm_input_##nr                           \
                       : "cr0", "ctr", "memory",                       \
                         "r9", "r10","r11", "r12");                    \
               __sc_ret = __sc_3;                                      \
               __sc_err = __sc_0;                                      \
       }                                                               \
       if (__sc_err & 0x10000000)                                      \
       {                                                               \
               errno = __sc_ret;                                       \
               __sc_ret = -1;                                          \
       }                                                               \
       return (type) __sc_ret

#define __sc_loadargs_6(name, arg1, arg2, arg3, arg4, arg5, arg6)		\
	__sc_loadargs_5(name, arg1, arg2, arg3, arg4, arg5);			\
	__sc_8 = (unsigned long) (arg6)

#define __sc_asm_input_6 __sc_asm_input_5, "6" (__sc_8)

#define _syscall6(type,name,type1,arg1,type2,arg2,type3,arg3,type4,arg4,type5,arg5,type6,arg6) \
type name(type1 arg1, type2 arg2, type3 arg3, type4 arg4, type5 arg5, type6 arg6) \
{                                                                      \
       __syscall_nr(6, type, name, arg1, arg2, arg3, arg4, arg5, arg6); \
}

#define SPLICE_F_MOVE 0x01
#define SPLICE_F_NONBLOCK 0x02
#define SPLICE_F_MORE 0x04
#define SPLICE_F_GIFT 0x08
#define __NR_splice 283
_syscall6(int, splice, int, fd_in, loff_t*, off_in, int, fd_out, loff_t*, off_out, size_t, len, unsigned int, flags);
#endif

/* 
 * Command line parser
 */
const char *argp_program_version = "0.1";
const char *argp_program_bug_address = "opurdila@ixiacom.com";
static char doc[] = "";
static char args_doc[] = "";
static struct argp_option options[] = {
    {"interface", 'i', "string", 0, "interface"},
    {"dst", 'd', "string", 0, "dst"},
    {"src", 's', "string", 0, "src"},
    {"port", 'p', "string", 0, "port"},
    {"socket-buffer", 'S', "string", 0, "packet size"},
    {"interval", 'I', "int", 0, "interval between packet send operations"},
    {"count", 'c', "int", 0, "counter"},
    {0},
};

typedef struct {
	int size, ifindex, interval, count, port, so_size;
	struct sockaddr_in src, dst;
	const char *ifname;
} cl_args_t;

cl_args_t cla;

static error_t parse_opt(int key, char *arg, struct argp_state *state)
{
	cl_args_t *cla = state->input;
	switch (key) {
	case 'p':
	{
		cla->port=strtol(arg, NULL, 0);
		break;
	}
	case 'c':
	{
		cla->count=strtol(arg, NULL, 0);
		break;
	}
	case 'I':
	{
		cla->interval=strtol(arg, NULL, 0);
		break;
	}
	case 'S':
	{
		cla->so_size=strtol(arg, NULL, 0);
		break;
	}
	case 'd':
	{
		cla->dst.sin_family=AF_INET;
		if (inet_pton(AF_INET, arg, &cla->dst.sin_addr) < 0) {
			printf("bad address: %s\n", arg);
			return -1;
		}
		break;
	}
	case 's':
	{
		cla->src.sin_family=AF_INET;
		if (inet_pton(AF_INET, arg, &cla->src.sin_addr) < 0) {
			printf("bad address: %s\n", arg);
			return -1;
		}
		break;
	}
	case 'i':
	{
		cla->ifindex=if_nametoindex(arg);
		if (cla->ifindex == 0) {
			printf("invalid interface: %s\n", arg);
			return -1;
		}
		cla->ifname=arg;
		break;
	}
	case ARGP_KEY_ARG:
		break;
	default:
		return ARGP_ERR_UNKNOWN;
	}
	return 0;
}

int prepare_socket(void)
{
	int s=socket(PF_INET, SOCK_STREAM, 0);

	if (s < 0) {
		printf("failed to create socket: %s\n", strerror(errno));
		return -1;
	}
	
	if ((cla.src.sin_addr.s_addr || cla.src.sin_port)
	     && bind(s, (const struct sockaddr*)&cla.src, sizeof(cla.src)) < 0) {
		printf("failed to bind socket: %s\n", strerror(errno));
		return -1;
	}
	
	if (cla.ifname) {
		struct ifreq ifr;
		strncpy(ifr.ifr_name, cla.ifname, sizeof(ifr.ifr_name));
		if (setsockopt(s, SOL_SOCKET, SO_BINDTODEVICE, &ifr, sizeof (ifr)) < 0) {
			printf("failed to bind to interface %s: %s\n", cla.ifname, strerror(errno));
			return -1;
		}
        }

	return s;
}

int do_server(int s)
{
	int error, s2;
	int pipefd[2];
    
	if (pipe(pipefd) < 0) {
	    perror("pipe error:");
	    return -1;
	}
	
	if (listen(s, 1) < 0) {
	    perror("listen error:");
	    return -1;
	}

	if ((s2=accept(s, NULL, NULL)) < 0) {
	    perror("accept error:");
	    return -1;
	}
	
	/* 6% improvement by _lowering_ the buffer size :O */
	if (cla.so_size && setsockopt(s2, SOL_SOCKET, SO_RCVBUF, &cla.so_size, sizeof cla.so_size) < 0) {
		perror("SO_RCVBUF");
	}

	while (1) {
		error=splice(s2, NULL, pipefd[1], NULL, 4*4096, SPLICE_F_MOVE);
		if (error <= 0) {
			perror("in splice");
			return -1;
		} 
		error=splice(pipefd[0], NULL, STDOUT_FILENO, NULL, error, SPLICE_F_MOVE);
		if (error <= 0) {
			perror("out splice");
			return -1;
		} 
	}
	return -1;
}

int do_client(int s)
{
	int error;
	int pipefd[2], fd=STDIN_FILENO;
	loff_t off=0;
    
	if (pipe(pipefd) < 0) {
	    perror("pipe error:");
	    return -1;
	}
	
	if (connect(s, (struct sockaddr*)&cla.dst, sizeof(cla.dst)) < 0) {
	    perror("connect error:");
	    return -1;
	}
	
	if (cla.so_size && setsockopt(s, SOL_SOCKET, SO_SNDBUF, &cla.so_size, sizeof cla.so_size) < 0) {
		perror("SO_SNDBUF");
	}

	while(1) {
		off=0;
		error=splice(fd, &off, pipefd[1], NULL, 4*4096, SPLICE_F_MOVE);
		if (error <= 0) {
			perror("in splice");
			return -1;
		} 
		error=splice(pipefd[0], NULL, s, NULL, error, SPLICE_F_MOVE);
		if (error <= 0) {
			perror("out splice");
			return -1;
		}
	}
	close(s);
	return -1;
}


static struct argp argp = { options, parse_opt, args_doc, doc };
int main(int argc, char **argv)
{
	int s;

	if (argp_parse(&argp, argc, argv, 0, 0, &cla) < 0)
		return -1;

	if (cla.dst.sin_addr.s_addr)
		cla.dst.sin_port=htons(cla.port);
	else
		cla.src.sin_port=htons(cla.port);

	if ((s=prepare_socket()) < 0)
		return -1;

	if (cla.dst.sin_addr.s_addr)
		return do_client(s);
	else
		return do_server(s);
}


* Re: race in skb_splice_bits?
  2008-05-27 12:53         ` Octavian Purdila
@ 2008-05-27 13:21           ` Evgeniy Polyakov
  2008-05-27 14:03             ` Evgeniy Polyakov
  0 siblings, 1 reply; 29+ messages in thread
From: Evgeniy Polyakov @ 2008-05-27 13:21 UTC (permalink / raw)
  To: Octavian Purdila; +Cc: Ben Hutchings, netdev

On Tue, May 27, 2008 at 03:53:49PM +0300, Octavian Purdila (opurdila@ixiacom.com) wrote:
> When you interrupt the program, the system will crash.

Cool!

I've reproduced the problem and will try to work it out, thank you.

-- 
	Evgeniy Polyakov


* Re: race in skb_splice_bits?
  2008-05-27 13:21           ` Evgeniy Polyakov
@ 2008-05-27 14:03             ` Evgeniy Polyakov
  2008-05-27 14:39               ` Octavian Purdila
  0 siblings, 1 reply; 29+ messages in thread
From: Evgeniy Polyakov @ 2008-05-27 14:03 UTC (permalink / raw)
  To: Octavian Purdila; +Cc: Ben Hutchings, netdev, davem

On Tue, May 27, 2008 at 05:21:48PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> On Tue, May 27, 2008 at 03:53:49PM +0300, Octavian Purdila (opurdila@ixiacom.com) wrote:
> > When you interrupt the program, the system will crash.
> 
> Cool!
> 
> I've reproduced the problem and will try to work it out, thank you.

The attached patch fixes the crash for me; Octavian, could you please test
it?

David, is the __kfree_skb() usage in tcp_read_sock() an optimisation only?
With this patch I do not see any leaks, but I did not investigate it deeply
enough. If the approach seems correct, I will clean things up.

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6087013..d285817 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1349,6 +1349,7 @@ done:
 
 	if (spd.nr_pages) {
 		int ret;
+		struct sock *sk = __skb->sk;
 
 		/*
 		 * Drop the socket lock, otherwise we have reverse
@@ -1359,9 +1360,9 @@ done:
 		 * we call into ->sendpage() with the i_mutex lock held
 		 * and networking will grab the socket lock.
 		 */
-		release_sock(__skb->sk);
+		release_sock(sk);
 		ret = splice_to_pipe(pipe, &spd);
-		lock_sock(__skb->sk);
+		lock_sock(sk);
 		return ret;
 	}
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 39b629a..b8318e9 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1182,6 +1182,23 @@ static inline struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off)
 	return NULL;
 }
 
+#ifdef CONFIG_NET_DMA
+static inline void __sk_eat_skb(struct sock *sk, struct sk_buff *skb, int copied_early)
+{
+	__skb_unlink(skb, &sk->sk_receive_queue);
+	if (!copied_early)
+		kfree_skb(skb);
+	else
+		__skb_queue_tail(&sk->sk_async_wait_queue, skb);
+}
+#else
+static inline void __sk_eat_skb(struct sock *sk, struct sk_buff *skb, int copied_early)
+{
+	__skb_unlink(skb, &sk->sk_receive_queue);
+	kfree_skb(skb);
+}
+#endif
+
 /*
  * This routine provides an alternative to tcp_recvmsg() for routines
  * that would like to handle copying from skbuffs directly in 'sendfile'
@@ -1231,11 +1248,11 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 				break;
 		}
 		if (tcp_hdr(skb)->fin) {
-			sk_eat_skb(sk, skb, 0);
+			__sk_eat_skb(sk, skb, 0);
 			++seq;
 			break;
 		}
-		sk_eat_skb(sk, skb, 0);
+		__sk_eat_skb(sk, skb, 0);
 		if (!desc->count)
 			break;
 	}

-- 
	Evgeniy Polyakov


* Re: race in skb_splice_bits?
  2008-05-27 14:03             ` Evgeniy Polyakov
@ 2008-05-27 14:39               ` Octavian Purdila
  2008-05-27 15:09                 ` Evgeniy Polyakov
  0 siblings, 1 reply; 29+ messages in thread
From: Octavian Purdila @ 2008-05-27 14:39 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Ben Hutchings, netdev, davem

On Tuesday 27 May 2008, Evgeniy Polyakov wrote:
> On Tue, May 27, 2008 at 05:21:48PM +0400, Evgeniy Polyakov 
(johnpol@2ka.mipt.ru) wrote:
> > On Tue, May 27, 2008 at 03:53:49PM +0300, Octavian Purdila 
(opurdila@ixiacom.com) wrote:
> > > When you interrupt the program, the system will crash.
> >
> > Cool!
> >
> > I've reproduced the problem and will try to work it out, thank you.
>
> Attached patch fixes the crash for me, Octavian could you please test
> it.
>

On my system it still crashes if I interrupt the program before splice
blocks:

[   16.170009] NIP [8014d0c0] tcp_read_sock+0x138/0x1f8
[   16.170009] LR [8014d090] tcp_read_sock+0x108/0x1f8
[   16.170009] Call Trace:
[   16.170009] [bcd21dd0] [8014d090] tcp_read_sock+0x108/0x1f8 (unreliable)
[   16.170009] [bcd21e20] [8014b590] __tcp_splice_read+0x34/0x44
[   16.170009] [bcd21e40] [8014b620] tcp_splice_read+0x80/0x220
[   16.170009] [bcd21e90] [80105730] sock_splice_read+0x2c/0x44
[   16.170009] [bcd21ea0] [8008a374] do_splice_to+0x90/0xac
[   16.170009] [bcd21ed0] [8008a850] do_splice+0x258/0x2f0
[   16.170009] [bcd21f10] [8008b1d4] sys_splice+0xe0/0xe8
[   16.170009] [bcd21f40] [8000ff14] ret_from_syscall+0x0/0x38


Thanks,
tavi


* Re: race in skb_splice_bits?
  2008-05-27 14:39               ` Octavian Purdila
@ 2008-05-27 15:09                 ` Evgeniy Polyakov
  2008-05-27 15:12                   ` Evgeniy Polyakov
  0 siblings, 1 reply; 29+ messages in thread
From: Evgeniy Polyakov @ 2008-05-27 15:09 UTC (permalink / raw)
  To: Octavian Purdila; +Cc: Ben Hutchings, netdev, davem

On Tue, May 27, 2008 at 05:39:11PM +0300, Octavian Purdila (opurdila@ixiacom.com) wrote:
> On my system it still crashes, if I interrupt the program before splice 
> blocks:

Yes, that was a somewhat stupid patch.
Since I cannot reproduce it anymore, let me offload this task to you :)
Please test this one, assuming no other patches are applied beforehand.

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6087013..e2a99a6 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1295,7 +1295,7 @@ err:
  * the frag list, if such a thing exists. We'd probably need to recurse to
  * handle that cleanly.
  */
-int skb_splice_bits(struct sk_buff *__skb, unsigned int offset,
+int skb_splice_bits(struct sk_buff *skb, unsigned int offset,
 		    struct pipe_inode_info *pipe, unsigned int tlen,
 		    unsigned int flags)
 {
@@ -1308,16 +1308,6 @@ int skb_splice_bits(struct sk_buff *__skb, unsigned int offset,
 		.ops = &sock_pipe_buf_ops,
 		.spd_release = sock_spd_release,
 	};
-	struct sk_buff *skb;
-
-	/*
-	 * I'd love to avoid the clone here, but tcp_read_sock()
-	 * ignores reference counts and unconditonally kills the sk_buff
-	 * on return from the actor.
-	 */
-	skb = skb_clone(__skb, GFP_KERNEL);
-	if (unlikely(!skb))
-		return -ENOMEM;
 
 	/*
 	 * __skb_splice_bits() only fails if the output has no room left,
@@ -1341,14 +1331,9 @@ int skb_splice_bits(struct sk_buff *__skb, unsigned int offset,
 	}
 
 done:
-	/*
-	 * drop our reference to the clone, the pipe consumption will
-	 * drop the rest.
-	 */
-	kfree_skb(skb);
-
 	if (spd.nr_pages) {
 		int ret;
+		struct sock *sk = __skb->sk;
 
 		/*
 		 * Drop the socket lock, otherwise we have reverse
@@ -1359,9 +1344,9 @@ done:
 		 * we call into ->sendpage() with the i_mutex lock held
 		 * and networking will grab the socket lock.
 		 */
-		release_sock(__skb->sk);
+		release_sock(sk);
 		ret = splice_to_pipe(pipe, &spd);
-		lock_sock(__skb->sk);
+		lock_sock(sk);
 		return ret;
 	}
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 39b629a..54bb460 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1196,7 +1196,7 @@ static inline struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off)
 int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 		  sk_read_actor_t recv_actor)
 {
-	struct sk_buff *skb;
+	struct sk_buff *skb, *__skb;
 	struct tcp_sock *tp = tcp_sk(sk);
 	u32 seq = tp->copied_seq;
 	u32 offset;
@@ -1204,10 +1204,14 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 
 	if (sk->sk_state == TCP_LISTEN)
 		return -ENOTCONN;
-	while ((skb = tcp_recv_skb(sk, seq, &offset)) != NULL) {
+	while ((__skb = tcp_recv_skb(sk, seq, &offset)) != NULL) {
 		if (offset < skb->len) {
 			size_t used, len;
 
+			skb = skb_clone(__skb, GFP_KERNEL);
+			if (unlikely(!skb))
+				break;
+
 			len = skb->len - offset;
 			/* Stop reading if we hit a patch of urgent data */
 			if (tp->urg_data) {
@@ -1215,29 +1219,35 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 				if (urg_offset < len)
 					len = urg_offset;
 				if (!len)
-					break;
+					goto out_break;
 			}
 			used = recv_actor(desc, skb, offset, len);
 			if (used < 0) {
 				if (!copied)
 					copied = used;
-				break;
+				goto out_break;
 			} else if (used <= len) {
 				seq += used;
 				copied += used;
 				offset += used;
 			}
 			if (offset != skb->len)
-				break;
+				goto out_break;
 		}
 		if (tcp_hdr(skb)->fin) {
 			sk_eat_skb(sk, skb, 0);
 			++seq;
-			break;
+			goto out_break;
 		}
 		sk_eat_skb(sk, skb, 0);
+out:
+		kfree_skb(skb);
 		if (!desc->count)
-			break;
+			goto out_break;
+		continue;
+out_break:
+		kfree_skb(skb);
+		break;
 	}
 	tp->copied_seq = seq;
 


-- 
	Evgeniy Polyakov

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: race in skb_splice_bits?
  2008-05-27 15:09                 ` Evgeniy Polyakov
@ 2008-05-27 15:12                   ` Evgeniy Polyakov
  2008-05-27 15:22                     ` Evgeniy Polyakov
  0 siblings, 1 reply; 29+ messages in thread
From: Evgeniy Polyakov @ 2008-05-27 15:12 UTC (permalink / raw)
  To: Octavian Purdila; +Cc: Ben Hutchings, netdev, davem

On Tue, May 27, 2008 at 07:09:31PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> Please test this one, assuming no other patches are applied on top.

A typo sneaked into the patch, please try this one.

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6087013..e2a99a6 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1295,7 +1295,7 @@ err:
  * the frag list, if such a thing exists. We'd probably need to recurse to
  * handle that cleanly.
  */
-int skb_splice_bits(struct sk_buff *__skb, unsigned int offset,
+int skb_splice_bits(struct sk_buff *skb, unsigned int offset,
 		    struct pipe_inode_info *pipe, unsigned int tlen,
 		    unsigned int flags)
 {
@@ -1308,16 +1308,6 @@ int skb_splice_bits(struct sk_buff *__skb, unsigned int offset,
 		.ops = &sock_pipe_buf_ops,
 		.spd_release = sock_spd_release,
 	};
-	struct sk_buff *skb;
-
-	/*
-	 * I'd love to avoid the clone here, but tcp_read_sock()
-	 * ignores reference counts and unconditonally kills the sk_buff
-	 * on return from the actor.
-	 */
-	skb = skb_clone(__skb, GFP_KERNEL);
-	if (unlikely(!skb))
-		return -ENOMEM;
 
 	/*
 	 * __skb_splice_bits() only fails if the output has no room left,
@@ -1341,14 +1331,9 @@ int skb_splice_bits(struct sk_buff *__skb, unsigned int offset,
 	}
 
 done:
-	/*
-	 * drop our reference to the clone, the pipe consumption will
-	 * drop the rest.
-	 */
-	kfree_skb(skb);
-
 	if (spd.nr_pages) {
 		int ret;
+		struct sock *sk = __skb->sk;
 
 		/*
 		 * Drop the socket lock, otherwise we have reverse
@@ -1359,9 +1344,9 @@ done:
 		 * we call into ->sendpage() with the i_mutex lock held
 		 * and networking will grab the socket lock.
 		 */
-		release_sock(__skb->sk);
+		release_sock(sk);
 		ret = splice_to_pipe(pipe, &spd);
-		lock_sock(__skb->sk);
+		lock_sock(sk);
 		return ret;
 	}
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 39b629a..fb8bfc2 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1196,7 +1196,7 @@ static inline struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off)
 int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 		  sk_read_actor_t recv_actor)
 {
-	struct sk_buff *skb;
+	struct sk_buff *skb, *__skb;
 	struct tcp_sock *tp = tcp_sk(sk);
 	u32 seq = tp->copied_seq;
 	u32 offset;
@@ -1204,10 +1204,14 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 
 	if (sk->sk_state == TCP_LISTEN)
 		return -ENOTCONN;
-	while ((skb = tcp_recv_skb(sk, seq, &offset)) != NULL) {
+	while ((__skb = tcp_recv_skb(sk, seq, &offset)) != NULL) {
 		if (offset < skb->len) {
 			size_t used, len;
 
+			skb = skb_clone(__skb, GFP_KERNEL);
+			if (unlikely(!skb))
+				break;
+
 			len = skb->len - offset;
 			/* Stop reading if we hit a patch of urgent data */
 			if (tp->urg_data) {
@@ -1215,29 +1219,35 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 				if (urg_offset < len)
 					len = urg_offset;
 				if (!len)
-					break;
+					goto out_break;
 			}
 			used = recv_actor(desc, skb, offset, len);
 			if (used < 0) {
 				if (!copied)
 					copied = used;
-				break;
+				goto out_break;
 			} else if (used <= len) {
 				seq += used;
 				copied += used;
 				offset += used;
 			}
 			if (offset != skb->len)
-				break;
+				goto out_break;
 		}
 		if (tcp_hdr(skb)->fin) {
 			sk_eat_skb(sk, skb, 0);
 			++seq;
-			break;
+			goto out_break;
 		}
 		sk_eat_skb(sk, skb, 0);
+out:
+		kfree_skb(skb);
 		if (!desc->count)
 			break;
+		continue;
+out_break:
+		kfree_skb(skb);
+		break;
 	}
 	tp->copied_seq = seq;
 

-- 
	Evgeniy Polyakov

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: race in skb_splice_bits?
  2008-05-27 15:12                   ` Evgeniy Polyakov
@ 2008-05-27 15:22                     ` Evgeniy Polyakov
  2008-05-27 15:33                       ` Octavian Purdila
  0 siblings, 1 reply; 29+ messages in thread
From: Evgeniy Polyakov @ 2008-05-27 15:22 UTC (permalink / raw)
  To: Octavian Purdila; +Cc: Ben Hutchings, netdev, davem

On Tue, May 27, 2008 at 07:12:59PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> On Tue, May 27, 2008 at 07:09:31PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> > Please test this one, assuming no other patches are applied on top.
> 
> A typo sneaked into the patch, please try this one.

The same wrong one, sorry about that.

The idea is to hold a reference on the skb between the release/lock sock
calls, so the core stack cannot free it while the socket lock is dropped.
The patch still misses the case where the socket is released and the skb
has already been dequeued, so splice will try to dequeue it again, which
will crash. I will think about how to fix the issue.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: race in skb_splice_bits?
  2008-05-27 15:22                     ` Evgeniy Polyakov
@ 2008-05-27 15:33                       ` Octavian Purdila
  2008-05-27 15:47                         ` Evgeniy Polyakov
  0 siblings, 1 reply; 29+ messages in thread
From: Octavian Purdila @ 2008-05-27 15:33 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Ben Hutchings, netdev, davem

On Tuesday 27 May 2008, Evgeniy Polyakov wrote:
>
> The same wrong one, sorry about that.
>
> Idea is to hold skb between release/lock sock calls and thus do not
> allow to free it by core stack when it is being released. Patch still
> misses the case, when socket is released and skb was dequeued, so splice
> will try to dequeue it again, which will crash. I will think on how to
> fix the issue.


Yes, I think I got the idea you are trying to use here. But somehow I feel 
uneasy with this approach :) Isn't it cleaner to keep the lock and try to 
avoid the deadlock on the sendfile() side? Or is that unfeasible?

I don't think we can drop the socket lock; it will introduce at least one 
type of race: since the skb we are processing is still on the socket queue, 
any entity accessing the socket queue could collide with us. 

Thanks,
tavi

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: race in skb_splice_bits?
  2008-05-27 15:33                       ` Octavian Purdila
@ 2008-05-27 15:47                         ` Evgeniy Polyakov
  2008-05-27 17:28                           ` Evgeniy Polyakov
  0 siblings, 1 reply; 29+ messages in thread
From: Evgeniy Polyakov @ 2008-05-27 15:47 UTC (permalink / raw)
  To: Octavian Purdila; +Cc: Ben Hutchings, netdev, davem

On Tue, May 27, 2008 at 06:33:55PM +0300, Octavian Purdila (opurdila@ixiacom.com) wrote:
> Yes, I think I got the idea you are trying to use here. But, somehow I feel 
> uneasy with this approach :) Isn't it cleaner to keep the lock and try to 
> avoid the deadlock on the sendfile() side? Or is that unfeasible?

Well, providing some flags to ->sendpage() callbacks to say that they
should not grab locks is a bit ugly to me; I think we can fix the splice
code not to do the wrong thing.

> I don't think we can drop the socket lock; it will introduce at least one 
> type of race: since the skb we are processing is still on the socket queue, 
> any entity accessing the socket queue could collide with us. 

Each access is done under the socket lock, so it should be safe. But the
socket release path drops the skb from the list before dropping its
reference counter, so we are not allowed to unlink it again; we can,
however, check whether the skb is still linked and make a decision based
on that.


-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: race in skb_splice_bits?
  2008-05-27 15:47                         ` Evgeniy Polyakov
@ 2008-05-27 17:28                           ` Evgeniy Polyakov
  2008-05-27 23:59                             ` Octavian Purdila
  0 siblings, 1 reply; 29+ messages in thread
From: Evgeniy Polyakov @ 2008-05-27 17:28 UTC (permalink / raw)
  To: Octavian Purdila; +Cc: Ben Hutchings, netdev, davem

On Tue, May 27, 2008 at 07:47:10PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> Well, providing some flags to ->sendpage() callbacks to say that they
> should not grab locks is a bit ugly to me; I think we can fix the splice
> code not to do the wrong thing.

And actually this will lead to data corruption, since a different process
can enter and change socket values under us. Without the temporary lock
drop we can deadlock (the first process locks the socket and then waits
for i_mutex, while another one has grabbed i_mutex and is waiting for the
socket lock in sendfile()).

> > I don't think we can drop the socket lock, it will introduce at least on type 
> > of races: since the skb we are processing is still on the socket queue, any 
> > entity accessing the socket queue will possible collide with us. 
> 
> Each access is being done under socket lock, so it should be safe. But
> socket release path drops skb from the list before dropping its
> reference counter, so we are not allowed to unlink it again, but we can
> check if skb is linked or not and make a decision based on that.

Please try the attached patch on top of the vanilla tree.
It does not use the skb after the socket lock was dropped; instead it
searches for it again once the socket is locked, so if the socket is
alive it will find and clean it, otherwise it will exit.

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6087013..d285817 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1349,6 +1349,7 @@ done:
 
 	if (spd.nr_pages) {
 		int ret;
+		struct sock *sk = __skb->sk;
 
 		/*
 		 * Drop the socket lock, otherwise we have reverse
@@ -1359,9 +1360,9 @@ done:
 		 * we call into ->sendpage() with the i_mutex lock held
 		 * and networking will grab the socket lock.
 		 */
-		release_sock(__skb->sk);
+		release_sock(sk);
 		ret = splice_to_pipe(pipe, &spd);
-		lock_sock(__skb->sk);
+		lock_sock(sk);
 		return ret;
 	}
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 39b629a..217dc59 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1200,15 +1200,17 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 	struct tcp_sock *tp = tcp_sk(sk);
 	u32 seq = tp->copied_seq;
 	u32 offset;
-	int copied = 0;
+	int copied = 0, fin;
+	size_t used, len, skb_len;
 
 	if (sk->sk_state == TCP_LISTEN)
 		return -ENOTCONN;
 	while ((skb = tcp_recv_skb(sk, seq, &offset)) != NULL) {
+		used = 0;
 		if (offset < skb->len) {
-			size_t used, len;
 
-			len = skb->len - offset;
+			skb_len = skb->len;
+			len = skb_len - offset;
 			/* Stop reading if we hit a patch of urgent data */
 			if (tp->urg_data) {
 				u32 urg_offset = tp->urg_seq - seq;
@@ -1217,6 +1219,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 				if (!len)
 					break;
 			}
+
 			used = recv_actor(desc, skb, offset, len);
 			if (used < 0) {
 				if (!copied)
@@ -1227,15 +1230,22 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 				copied += used;
 				offset += used;
 			}
-			if (offset != skb->len)
+			if (offset != skb_len)
 				break;
 		}
-		if (tcp_hdr(skb)->fin) {
-			sk_eat_skb(sk, skb, 0);
-			++seq;
+		skb = tcp_recv_skb(sk, seq-used, &offset);
+		if (!skb) {
+			if (!copied)
+				copied = -EINVAL;
 			break;
 		}
+		fin = tcp_hdr(skb)->fin;
 		sk_eat_skb(sk, skb, 0);
+
+		if (fin) {
+			++seq;
+			break;
+		}
 		if (!desc->count)
 			break;
 	}

-- 
	Evgeniy Polyakov

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: race in skb_splice_bits?
  2008-05-27 17:28                           ` Evgeniy Polyakov
@ 2008-05-27 23:59                             ` Octavian Purdila
  2008-05-28  8:52                               ` Evgeniy Polyakov
  0 siblings, 1 reply; 29+ messages in thread
From: Octavian Purdila @ 2008-05-27 23:59 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Ben Hutchings, netdev, davem

[-- Attachment #1: Type: text/plain, Size: 740 bytes --]

On Tuesday 27 May 2008, Evgeniy Polyakov wrote:
>
> Please try the attached patch on top of the vanilla tree.
> It does not use the skb after the socket lock was dropped; instead it
> searches for it again once the socket is locked, so if the socket is
> alive it will find and clean it, otherwise it will exit.
>

This fixes the crash, thanks.

One doubt though: suppose that while we drop the lock the skb gets aggregated 
with the one after it. If the original skb is fully consumed in the receive 
actor, then we will eat the new, aggregated skb, losing data. 

Here is a patch, based on your idea, which tries to cope with the above 
scenario. The !skb check was added for the case in which the actor does not 
consume anything in the current iteration. 

tavi


[-- Attachment #2: a.diff --]
[-- Type: text/x-diff, Size: 1040 bytes --]

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 0a9002b..0a0a663 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1358,7 +1358,8 @@ done:
 
 	if (spd.nr_pages) {
 		int ret;
-
+		struct sock *sk= __skb->sk;
+ 
 		/*
 		 * Drop the socket lock, otherwise we have reverse
 		 * locking dependencies between sk_lock and i_mutex
@@ -1368,9 +1369,9 @@ done:
 		 * we call into ->sendpage() with the i_mutex lock held
 		 * and networking will grab the socket lock.
 		 */
-		release_sock(__skb->sk);
+		release_sock(sk);
 		ret = splice_to_pipe(pipe, &spd);
-		lock_sock(__skb->sk);
+		lock_sock(sk);
 		return ret;
 	}
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 3e91b28..34049d0 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1227,7 +1227,8 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 				copied += used;
 				offset += used;
 			}
-			if (offset != skb->len)
+			skb = tcp_recv_skb(sk, seq-1, &offset);
+			if (!skb || (offset+1 != skb->len))
 				break;
 		}
 		if (tcp_hdr(skb)->fin) {

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: race in skb_splice_bits?
  2008-05-27 23:59                             ` Octavian Purdila
@ 2008-05-28  8:52                               ` Evgeniy Polyakov
  2008-05-28 13:20                                 ` Octavian Purdila
  0 siblings, 1 reply; 29+ messages in thread
From: Evgeniy Polyakov @ 2008-05-28  8:52 UTC (permalink / raw)
  To: Octavian Purdila; +Cc: Ben Hutchings, netdev, davem

On Wed, May 28, 2008 at 02:59:30AM +0300, Octavian Purdila (opurdila@ixiacom.com) wrote:
> One doubt though: suppose that while we drop the lock the skb gets aggregated 
> with the one after it. If the original skb is fully consumed in the receive 
> actor, then we will eat the new, aggregated skb, losing data. 

How can it be aggregated with another skb? It is only possible that some
other reader consumes the data, but in that case the sequence number will
not match and we will not find the skb.

> Here is a patch, based on your idea, which tries to cope with the above 
> scenario. The !skb check was added for the case in which the actor does not 
> consume anything in the current iteration. 

If it does not get any data, then the skb will likely still exist and will
be consumed in the next run. I preserved the old semantics, where we free
the skb only if we read it whole or in case of a FIN. With your changes we
can also free the skb if it was partially consumed, and not free it at all
if the skb was not processed because it is too old (i.e. it lives in the
receive queue, but we already read the data with the sequence number which
corresponds to it), no?


-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: race in skb_splice_bits?
  2008-05-28  8:52                               ` Evgeniy Polyakov
@ 2008-05-28 13:20                                 ` Octavian Purdila
  2008-05-28 14:11                                   ` Evgeniy Polyakov
  0 siblings, 1 reply; 29+ messages in thread
From: Octavian Purdila @ 2008-05-28 13:20 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Ben Hutchings, netdev, davem

On Wednesday 28 May 2008, Evgeniy Polyakov wrote:

> > One doubt though: suppose that while we drop the lock the skb gets
> > aggregated with the one after it. If the original skb is fully consumed
> > in the receive actor, then we will eat the new, aggregated skb,
> > losing data.
>
> How can it be aggregated with another skb? 

I think tcp_collapse() does this kind of aggregation when we are under
memory pressure: it will create a new skb and move the data from the old
skbs into the new one, freeing the old skbs. Right?

> It is only possible that some
> other reader consumes the data, but in that case the sequence number will
> not match and we will not find the skb. 
>

Hmm, I hadn't thought about this, but you are right, we could have another 
consumer sneaking in.

> > Here is a patch, based on your idea, which tries to cope with the above
> > scenario. The !skb check was added for the case in which the actor does
> > not consume anything in the current iteration.
>
> If it does not get any data, then the skb will likely still exist and will
> be consumed in the next run. 

If the consumer didn't consume data, the seq number will not be updated, and 
seq-1 will correspond to a previous skb, which was already dequeued and 
processed. Thus tcp_recv_skb(seq-1) will return NULL.

> I preserved the old semantics, where we free the skb 
> only if we read it whole or in case of a FIN. With your changes we can
> also free the skb if it was partially consumed 

If the skb was partially consumed then tcp_recv_skb(seq-1) will return the 
same skb and the offset +1 != skb->len, thus we will not free it.

> and not free it at all if the 
> skb was not processed because it is too old (i.e. it lives in the receive
> queue, but we already read the data with the sequence number which
> corresponds to it), no?

I didn't get this part, shouldn't the entity which dequeues the packet also 
free it? 

Anyways, I am a newbie in this area, so if my logic doesn't make any sense 
please be patient with me :)

Thanks,
tavi


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: race in skb_splice_bits?
  2008-05-28 13:20                                 ` Octavian Purdila
@ 2008-05-28 14:11                                   ` Evgeniy Polyakov
  2008-05-28 15:20                                     ` Octavian Purdila
  0 siblings, 1 reply; 29+ messages in thread
From: Evgeniy Polyakov @ 2008-05-28 14:11 UTC (permalink / raw)
  To: Octavian Purdila; +Cc: Ben Hutchings, netdev, davem

On Wed, May 28, 2008 at 04:20:55PM +0300, Octavian Purdila (opurdila@ixiacom.com) wrote:
> > I preserved the old semantics, where we free the skb 
> > only if we read it whole or in case of a FIN. With your changes we can
> > also free the skb if it was partially consumed 
> 
> If the skb was partially consumed then tcp_recv_skb(seq-1) will return the 
> same skb and the offset +1 != skb->len, thus we will not free it.

I understand now, please correct me if I got your idea wrong.
We only need to search for the skb again in the case where we processed it
and the socket lock may have been dropped, so the only place where
tcp_recv_skb() needs to go is where you pointed. Next, to find the current
skb we have to use a sequence number inside its data, and seq-1 is indeed
that: even if the skb was changed under us (e.g. combined with others), we
will still find it in the queue. The offset in tcp_recv_skb() is set to
the number of bytes between the skb's starting sequence number and the
given sequence number, so if we search for (seq-1) the returned offset
'points' to the last byte processed by the code, not to the byte to start
from. Hence we check against (offset+1): if (offset+1) equals the size of
the skb found, we can free this skb (in the code below). If that is not
the case and the skb contains data beyond (offset+1), we break and the
rest of the data will be processed in the next run. If there is no skb,
we just break out of the loop.

So yes, your patch is simpler and faster than mine, so you should push it
upstream. Fortunately David Miller is on copy and will (David, will
you?) pick it up and push it; likely it is also needed for stable?

> Anyways, I am a newbie in this area, so if my logic doesn't make any sense 
> please be patient with me :)

We are all newbies somewhere, and even the splice code itself is rather new :)


-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: race in skb_splice_bits?
  2008-05-28 14:11                                   ` Evgeniy Polyakov
@ 2008-05-28 15:20                                     ` Octavian Purdila
  2008-05-28 15:42                                       ` Evgeniy Polyakov
  2008-05-28 17:08                                       ` Octavian Purdila
  0 siblings, 2 replies; 29+ messages in thread
From: Octavian Purdila @ 2008-05-28 15:20 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Ben Hutchings, netdev, davem

On Wednesday 28 May 2008, Evgeniy Polyakov wrote:

> > > I preserved the old semantics, where we free the skb
> > > only if we read it whole or in case of a FIN. With your changes we can
> > > also free the skb if it was partially consumed
> >
> > If the skb was partially consumed then tcp_recv_skb(seq-1) will return
> > the same skb and the offset +1 != skb->len, thus we will not free it.
>
> I understand now, please correct me if I got your idea wrong.
> We only need to search for the skb again in the case where we processed it
> and the socket lock may have been dropped, so the only place where
> tcp_recv_skb() needs to go is where you pointed. Next, to find the current skb we
<snip>

Yes, that is correct. 

> So yes, your patch is simpler and faster than mine, so you should push it
> upstream. Fortunately David Miller is on copy and will (David, will
> you?) pick it up and push it; likely it is also needed for stable?
>

OK, will clean it up and post a proper patch soon. Which tree should I base it 
on?

Thanks,
tavi

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: race in skb_splice_bits?
  2008-05-28 15:20                                     ` Octavian Purdila
@ 2008-05-28 15:42                                       ` Evgeniy Polyakov
  2008-05-28 17:08                                       ` Octavian Purdila
  1 sibling, 0 replies; 29+ messages in thread
From: Evgeniy Polyakov @ 2008-05-28 15:42 UTC (permalink / raw)
  To: Octavian Purdila; +Cc: Ben Hutchings, netdev, davem

On Wed, May 28, 2008 at 06:20:02PM +0300, Octavian Purdila (opurdila@ixiacom.com) wrote:
> OK, will clean it up and post a proper patch soon. Which tree should I base it 
> on?

The vanilla tree, I think; it should apply without problems to all trees,
since the patch is rather small.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: race in skb_splice_bits?
  2008-05-28 15:20                                     ` Octavian Purdila
  2008-05-28 15:42                                       ` Evgeniy Polyakov
@ 2008-05-28 17:08                                       ` Octavian Purdila
  2008-05-28 17:51                                         ` Evgeniy Polyakov
  1 sibling, 1 reply; 29+ messages in thread
From: Octavian Purdila @ 2008-05-28 17:08 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Ben Hutchings, netdev, davem

On Wednesday 28 May 2008, Octavian Purdila wrote:
>
> > So yes, your patch is simpler and faster than mine, so you should push it
> > upstream. Fortunately David Miller is on copy and will (David, will
> > you?) pick it up and push it; likely it is also needed for stable?
>

Hmm, I might have found a problem with this approach: say we drop the lock, 
queue the data into the pipe, and at the same time tcp_collapse() is called, 
which frees some of the skbs whose data we just queued in the pipe. Later, 
when we read from the pipe, we will get random data instead of socket 
data...


Thanks,
tavi


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: race in skb_splice_bits?
  2008-05-28 17:08                                       ` Octavian Purdila
@ 2008-05-28 17:51                                         ` Evgeniy Polyakov
  2008-05-28 18:02                                           ` Octavian Purdila
  0 siblings, 1 reply; 29+ messages in thread
From: Evgeniy Polyakov @ 2008-05-28 17:51 UTC (permalink / raw)
  To: Octavian Purdila; +Cc: Ben Hutchings, netdev, davem

On Wed, May 28, 2008 at 08:08:01PM +0300, Octavian Purdila (opurdila@ixiacom.com) wrote:
> Hmm, I might have found a problem with this approach: say we drop the lock, 
> queue the data into the pipe, and at the same time tcp_collapse() is called, 
> which frees some of the skbs whose data we just queued in the pipe. Later, 
> when we read from the pipe, we will get random data instead of socket 
> data...
 
We queue data under the lock and clone the appropriate skb (and then grab it
multiple times), so even if it is dropped, its data will not be freed, and
thus we will be able to read it. Or are you talking about different
skbs?

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: race in skb_splice_bits?
  2008-05-28 17:51                                         ` Evgeniy Polyakov
@ 2008-05-28 18:02                                           ` Octavian Purdila
  2008-05-28 20:01                                             ` Jarek Poplawski
  0 siblings, 1 reply; 29+ messages in thread
From: Octavian Purdila @ 2008-05-28 18:02 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Ben Hutchings, netdev, davem

On Wednesday 28 May 2008, Evgeniy Polyakov wrote:

> We queue data under the lock and clone the appropriate skb (and then grab it
> multiple times), so even if it is dropped, its data will not be freed, and
> thus we will be able to read it. Or are you talking about different
> skbs?

You are right, I forgot about the clone.

Thanks,
tavi

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: race in skb_splice_bits?
  2008-05-28 18:02                                           ` Octavian Purdila
@ 2008-05-28 20:01                                             ` Jarek Poplawski
  2008-05-28 20:09                                               ` Octavian Purdila
  0 siblings, 1 reply; 29+ messages in thread
From: Jarek Poplawski @ 2008-05-28 20:01 UTC (permalink / raw)
  To: Octavian Purdila; +Cc: Evgeniy Polyakov, Ben Hutchings, netdev, davem

Octavian Purdila wrote, On 05/28/2008 08:02 PM:

> On Wednesday 28 May 2008, Evgeniy Polyakov wrote:
> 
>> We queue data under the lock and clone the appropriate skb (and then grab it
>> multiple times), so even if it is dropped, its data will not be freed, and
>> thus we will be able to read it. Or are you talking about different
>> skbs?
> 
> You are right, I forgot about the clone.
> 

I'm probably missing something, but how does it help when tcp_collapse()
uses __kfree_skb()?

Regards,
Jarek P.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: race in skb_splice_bits?
  2008-05-28 20:01                                             ` Jarek Poplawski
@ 2008-05-28 20:09                                               ` Octavian Purdila
  2008-05-28 20:16                                                 ` Jarek Poplawski
  0 siblings, 1 reply; 29+ messages in thread
From: Octavian Purdila @ 2008-05-28 20:09 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: Evgeniy Polyakov, Ben Hutchings, netdev, davem

On Wednesday 28 May 2008, Jarek Poplawski wrote:

> >> We queue data under the lock and clone the appropriate skb (and then grab
> >> it multiple times), so even if it is dropped, its data will not be freed,
> >> and thus we will be able to read it. Or are you talking about different
> >> skbs?
> >
> > You are right, I forgot about the clone.
>
> I'm probably missing something, but how does it help when tcp_collapse()
> uses __kfree_skb()?
>

__kfree_skb() -> skb_release_all() -> skb_release_data():

static void skb_release_data(struct sk_buff *skb)
{
        if (!skb->cloned ||
<snip>
                kfree(skb->head);

Since we clone the skb in skb_splice_bits(), the skb's data will only be
freed when the last clone is deleted.

tavi

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: race in skb_splice_bits?
  2008-05-28 20:09                                               ` Octavian Purdila
@ 2008-05-28 20:16                                                 ` Jarek Poplawski
  0 siblings, 0 replies; 29+ messages in thread
From: Jarek Poplawski @ 2008-05-28 20:16 UTC (permalink / raw)
  To: Octavian Purdila; +Cc: Evgeniy Polyakov, Ben Hutchings, netdev, davem

On Wed, May 28, 2008 at 11:09:13PM +0300, Octavian Purdila wrote:
> On Wednesday 28 May 2008, Jarek Poplawski wrote:
> 
> > >> We queue data under the lock and clone the appropriate skb (and then
> > >> grab it multiple times), so even if it is dropped, its data will not be
> > >> freed, and thus we will be able to read it. Or are you talking about
> > >> different skbs?
> > >
> > > You are right, I forgot about the clone.
> >
> > I'm probably missing something, but how does it help when tcp_collapse()
> > uses __kfree_skb()?
> >
> 
> __kfree_skb() -> skb_release_all() -> skb_release_data():
> 
> static void skb_release_data(struct sk_buff *skb)
> {
>         if (!skb->cloned ||
> <snip>
>                 kfree(skb->head);
> 
> Since we clone the skb in skb_splice_bits() the skb's data will only be freed 
> when the last clone is deleted.

Right!

Thanks for the explanation,
Jarek P.

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2008-05-28 20:17 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-05-27  0:25 race in skb_splice_bits? Octavian Purdila
2008-05-27  2:08 ` Ben Hutchings
2008-05-27 10:41   ` Octavian Purdila
2008-05-27 11:01 ` Evgeniy Polyakov
2008-05-27 11:08   ` Ben Hutchings
2008-05-27 11:52     ` Evgeniy Polyakov
2008-05-27 11:56       ` Evgeniy Polyakov
2008-05-27 12:53         ` Octavian Purdila
2008-05-27 13:21           ` Evgeniy Polyakov
2008-05-27 14:03             ` Evgeniy Polyakov
2008-05-27 14:39               ` Octavian Purdila
2008-05-27 15:09                 ` Evgeniy Polyakov
2008-05-27 15:12                   ` Evgeniy Polyakov
2008-05-27 15:22                     ` Evgeniy Polyakov
2008-05-27 15:33                       ` Octavian Purdila
2008-05-27 15:47                         ` Evgeniy Polyakov
2008-05-27 17:28                           ` Evgeniy Polyakov
2008-05-27 23:59                             ` Octavian Purdila
2008-05-28  8:52                               ` Evgeniy Polyakov
2008-05-28 13:20                                 ` Octavian Purdila
2008-05-28 14:11                                   ` Evgeniy Polyakov
2008-05-28 15:20                                     ` Octavian Purdila
2008-05-28 15:42                                       ` Evgeniy Polyakov
2008-05-28 17:08                                       ` Octavian Purdila
2008-05-28 17:51                                         ` Evgeniy Polyakov
2008-05-28 18:02                                           ` Octavian Purdila
2008-05-28 20:01                                             ` Jarek Poplawski
2008-05-28 20:09                                               ` Octavian Purdila
2008-05-28 20:16                                                 ` Jarek Poplawski
