* race in skb_splice_bits?
@ 2008-05-27 0:25 Octavian Purdila
2008-05-27 2:08 ` Ben Hutchings
2008-05-27 11:01 ` Evgeniy Polyakov
0 siblings, 2 replies; 29+ messages in thread
From: Octavian Purdila @ 2008-05-27 0:25 UTC (permalink / raw)
To: netdev
Hi,
The following socket lock dropping in skb_splice_bits seems to open a race
condition which causes an invalid kernel access:
> 	if (spd.nr_pages) {
> 		int ret;
>
> 		/*
> 		 * Drop the socket lock, otherwise we have reverse
> 		 * locking dependencies between sk_lock and i_mutex
> 		 * here as compared to sendfile(). We enter here
> 		 * with the socket lock held, and splice_to_pipe() will
> 		 * grab the pipe inode lock. For sendfile() emulation,
> 		 * we call into ->sendpage() with the i_mutex lock held
> 		 * and networking will grab the socket lock.
> 		 */
> 		release_sock(__skb->sk);
> 		ret = splice_to_pipe(pipe, &spd);
> 		lock_sock(__skb->sk);
>
> 		return ret;
> 	}
Setup:
- powerpc, non-SMP, no preemption, 2.6.25
- RX side: LRO enabled, splice from socket to /dev/null;
- TX side: MTU set to 128 bytes, GSO enabled, splice from file to socket
The oops - on the RX side:
Unable to handle kernel paging request for data at address 0x00000030
Faulting instruction address: 0x80109ee0
Oops: Kernel access of bad area, sig: 11 [#1]
Ixia TCPX
Modules linked in: almfmanager(P) filtermanager ixnam_llm(P) ixnam_tcpx(P) hwstate ixllm ixhostm ixsysctl(P) nlproc_driver
NIP: 80109ee0 LR: 80109edc CTR: 8010c52c
REGS: bcd25b90 TRAP: 0300 Tainted: P (2.6.25-00005-gf7b547d)
MSR: 00009032 <EE,ME,IR,DR> CR: 24000822 XER: 20000000
DAR: 00000030, DSISR: 40000000
TASK = bfbe1bf0[156] 'splice' THREAD: bcd24000
GPR00: 8010c94c bcd25c40 bfbe1bf0 00000000 00000000 802835f8 00000001 0000004c
GPR08: 00024000 00000100 00000032 bcd24000 00010dc4 100198b4 390046a8 0a5042f3
GPR16: 8028238c bd18fe00 00000008 10010000 6fbcbac0 00000000 10001060 bcd25dd8
GPR24: 8014b520 00000000 bcd25e30 bccefa00 bf33e300 fffffe00 bcd25d70 00000000
NIP [80109ee0] lock_sock_nested+0x1c/0x50
LR [80109edc] lock_sock_nested+0x18/0x50
Call Trace:
[bcd25c60] [8010c94c] skb_splice_bits+0x130/0x134
[bcd25dc0] [8014b548] tcp_splice_data_recv+0x28/0x38
[bcd25dd0] [8014d08c] tcp_read_sock+0x108/0x1f8
[bcd25e20] [8014b58c] __tcp_splice_read+0x34/0x44
[bcd25e40] [8014b61c] tcp_splice_read+0x80/0x220
[bcd25e90] [80105730] sock_splice_read+0x2c/0x44
[bcd25ea0] [8008a374] do_splice_to+0x90/0xac
[bcd25ed0] [8008a850] do_splice+0x258/0x2f0
[bcd25f10] [8008b1d4] sys_splice+0xe0/0xe8
[bcd25f40] [8000ff14] ret_from_syscall+0x0/0x38
--- Exception: c01 at 0x10000894
LR = 0x10000e2c
Analysis:
Printks show that __skb->sk is non-NULL before splice_to_pipe() and NULL after.
Using a hardware watchpoint I was able to see that the write to __skb->sk is
caused by __alloc_skb()'s memset(), which seems to indicate that the __skb is
freed between release_sock() and lock_sock(). Turning on slab debugging and
the hardware watchpoint shows that the free happens during tcp_collapse(),
which was initiated as a result of a timer interrupt -> softirq -> NAPI
polling -> lro_flush_all().
Commenting out the sequence that drops the socket lock seems to fix the
problem on my setup.
Regards,
tavi
^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: race in skb_splice_bits?
From: Ben Hutchings @ 2008-05-27  2:08 UTC (permalink / raw)
To: Octavian Purdila; +Cc: netdev

Octavian Purdila wrote:
>
> Hi,
>
> The following socket lock dropping in skb_splice_bits seems to open a race
> condition which causes an invalid kernel access:
>
> > 	if (spd.nr_pages) {
> > 		int ret;
> >
> > 		/*
> > 		 * Drop the socket lock, otherwise we have reverse
> > 		 * locking dependencies between sk_lock and i_mutex
> > 		 * here as compared to sendfile(). We enter here
> > 		 * with the socket lock held, and splice_to_pipe() will
> > 		 * grab the pipe inode lock. For sendfile() emulation,
> > 		 * we call into ->sendpage() with the i_mutex lock held
> > 		 * and networking will grab the socket lock.
> > 		 */
> > 		release_sock(__skb->sk);
> > 		ret = splice_to_pipe(pipe, &spd);
> > 		lock_sock(__skb->sk);
> >
> > 		return ret;
> > 	}

Given the previous comment, that certainly looks wrong.

<snip>
> Commenting out the sequence that drops the socket lock seems to fix the
> problem on my setup.

But this could apparently cause deadlock.  Surely the correct fix is to
copy __skb->sk to a local variable before calling splice_to_pipe() so we
can re-lock it?

Ben.
--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
* Re: race in skb_splice_bits?
From: Octavian Purdila @ 2008-05-27 10:41 UTC (permalink / raw)
To: Ben Hutchings; +Cc: netdev

On Tuesday 27 May 2008, Ben Hutchings wrote:

> > Commenting out the sequence that drops the socket lock seems to fix the
> > problem on my setup.
>
> But this could apparently cause deadlock.  Surely the correct fix is
> to copy __skb->sk to a local variable before calling splice_to_pipe()
> so we can re-lock it?

I've tried that, but I think that the freed __skb might be touched later in
tcp_read_sock:

Faulting instruction address: 0x8014d0c0
Oops: Kernel access of bad area, sig: 11 [#1]
Ixia TCPX
Modules linked in: almfmanager(P) filtermanager ixnam_llm(P) ixnam_tcpx(P) hwstate ixllm ixhostm ixsysctl(P) nlproc_driver
NIP: 8014d0c0 LR: 8014d090 CTR: 8010c52c
REGS: bcd27d20 TRAP: 0300   Tainted: P  (2.6.25-00005-gf7b547d-dirty)
MSR: 00009032 <EE,ME,IR,DR>  CR: 24000822  XER: 20000000
DAR: 0000000c, DSISR: 40000000
TASK = bd0d7bd0[172] 'splice' THREAD: bcd26000
GPR00: 00000000 bcd27dd0 bd0d7bd0 fffffe00 00000000 802835f8 00000001 0000004c
GPR08: 00024000 00000000 00000062 bcd26000 0023ac37 100198b4 390046a8 0a5042f3
GPR16: 8028238c bd18fe00 00000008 10010000 6fd01ac0 00000000 10001060 bcd27dd8
GPR24: 8014b524 00000000 bcd27e30 bcd07180 0000004c bcd071e4 a7c41e8b bfb3aa60
NIP [8014d0c0] tcp_read_sock+0x138/0x1f8
LR [8014d090] tcp_read_sock+0x108/0x1f8
Call Trace:
[bcd27dd0] [8014d090] tcp_read_sock+0x108/0x1f8 (unreliable)
[bcd27e20] [8014b590] __tcp_splice_read+0x34/0x44
[bcd27e40] [8014b620] tcp_splice_read+0x80/0x220
[bcd27e90] [80105730] sock_splice_read+0x2c/0x44
[bcd27ea0] [8008a374] do_splice_to+0x90/0xac
[bcd27ed0] [8008a850] do_splice+0x258/0x2f0
[bcd27f10] [8008b1d4] sys_splice+0xe0/0xe8
[bcd27f40] [8000ff14] ret_from_syscall+0x0/0x38
--- Exception: c01 at 0x10000894
    LR = 0x10000e2c

Thanks,
tavi
* Re: race in skb_splice_bits?
From: Evgeniy Polyakov @ 2008-05-27 11:01 UTC (permalink / raw)
To: Octavian Purdila; +Cc: netdev

On Tue, May 27, 2008 at 03:25:23AM +0300, Octavian Purdila (opurdila@ixiacom.com) wrote:
>
> Hi,
>
> The following socket lock dropping in skb_splice_bits seems to open a race
> condition which causes an invalid kernel access:
>
> > 	if (spd.nr_pages) {
> > 		int ret;
> >
> > 		/*
> > 		 * Drop the socket lock, otherwise we have reverse
> > 		 * locking dependencies between sk_lock and i_mutex
> > 		 * here as compared to sendfile(). We enter here
> > 		 * with the socket lock held, and splice_to_pipe() will
> > 		 * grab the pipe inode lock. For sendfile() emulation,
> > 		 * we call into ->sendpage() with the i_mutex lock held
> > 		 * and networking will grab the socket lock.
> > 		 */

What about sock_hold() here?
It will prevent the socket from being freed, and reads/writes to it will
immediately return with an error if the socket was closed by another thread.

> > 		release_sock(__skb->sk);
> > 		ret = splice_to_pipe(pipe, &spd);
> > 		lock_sock(__skb->sk);

And the appropriate sock_put(sk);

--
	Evgeniy Polyakov
* Re: race in skb_splice_bits?
From: Ben Hutchings @ 2008-05-27 11:08 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: Octavian Purdila, netdev

Evgeniy Polyakov wrote:
> On Tue, May 27, 2008 at 03:25:23AM +0300, Octavian Purdila (opurdila@ixiacom.com) wrote:
> >
> > Hi,
> >
> > The following socket lock dropping in skb_splice_bits seems to open a race
> > condition which causes an invalid kernel access:
>
> What about sock_hold() here?
> It will prevent the socket from being freed, and reads/writes to it will
> immediately return with an error if the socket was closed by another thread.

<snip>

We know the socket isn't going to go away, because somewhere up the call
stack someone has to be holding the socket in order to lock it.  However,
the skb may (and evidently sometimes does) go away during splice_to_pipe(),
so we can't look up the socket through it.

However, from Octavian's later mail it seems we can't let the skb go away
at all.  So some wider changes seem to be required.

Ben.
--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
* Re: race in skb_splice_bits?
From: Evgeniy Polyakov @ 2008-05-27 11:52 UTC (permalink / raw)
To: Ben Hutchings; +Cc: Octavian Purdila, netdev

On Tue, May 27, 2008 at 12:08:41PM +0100, Ben Hutchings (bhutchings@solarflare.com) wrote:
> However, from Octavian's later mail it seems we can't let the skb go away
> at all.  So some wider changes seem to be required.

Yes, I misread the original message.

Octavian, could you please decode where the bug occurred via gdb:

	$ gdb vmlinux
	(gdb) l *(skb_splice_bits+0x130)

--
	Evgeniy Polyakov
* Re: race in skb_splice_bits?
From: Evgeniy Polyakov @ 2008-05-27 11:56 UTC (permalink / raw)
To: Ben Hutchings; +Cc: Octavian Purdila, netdev

On Tue, May 27, 2008 at 03:52:06PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> On Tue, May 27, 2008 at 12:08:41PM +0100, Ben Hutchings (bhutchings@solarflare.com) wrote:
> > However, from Octavian's later mail it seems we can't let the skb go away
> > at all.  So some wider changes seem to be required.
>
> Yes, I misread the original message.
>
> Octavian, could you please decode where the bug occurred via gdb:
> 	$ gdb vmlinux
> 	(gdb) l *(skb_splice_bits+0x130)

Is it lock_sock()?

And also please show what
	tcp_read_sock+0x108
	tcp_read_sock+0x138
are.

And do you have a test case for that?

--
	Evgeniy Polyakov
* Re: race in skb_splice_bits?
From: Octavian Purdila @ 2008-05-27 12:53 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: Ben Hutchings, netdev

[-- Attachment #1: Type: text/plain, Size: 2157 bytes --]

On Tuesday 27 May 2008, Evgeniy Polyakov wrote:

> > > However, from Octavian's later mail it seems we can't let the skb go
> > > away at all.  So some wider changes seem to be required.
> >
> > Yes, I misread the original message.
> >
> > Octavian, could you please decode where the bug occurred via gdb:
> > 	$ gdb vmlinux
> > 	(gdb) l *(skb_splice_bits+0x130)
>
> Is it lock_sock()?

Yes, it is lock_sock(). The crash in lock_sock() occurs with the following
sequence:

	release_sock(__skb->sk);
	ret = splice_to_pipe(pipe, &spd);
	lock_sock(__skb->sk);

0x8010c94c is in skb_splice_bits (include/net/sock.h:837).
832
833     extern void lock_sock_nested(struct sock *sk, int subclass);
834
835     static inline void lock_sock(struct sock *sk)
836     {
837             lock_sock_nested(sk, 0);
838     }
839
840     extern void release_sock(struct sock *sk);
841

> And also please show what
> 	tcp_read_sock+0x108
> 	tcp_read_sock+0x138
> are.

This crash occurs with the following sequence:

	struct sock *sk = __skb->sk;
	release_sock(sk);
	ret = splice_to_pipe(pipe, &spd);
	lock_sock(sk);

0x8014d090 is in tcp_read_sock (net/ipv4/tcp.c:1225).
1220                            used = recv_actor(desc, skb, offset, len);
1221                            if (used < 0) {
1222                                    if (!copied)
1223                                            copied = used;
1224                                    break;
1225                            } else if (used <= len) {
1226                                    seq += used;
1227                                    copied += used;
1228                                    offset += used;
1229                            }

> And do you have a test case for that?

Yes. I've attached my test program, which triggers both crashes. To improve
the reproducibility rate you need to use a relatively large buffer for
splice (I used 16384) and a small MTU (I used 128). This will make the
first splice on the receive side block (due to not enough PIPE_BUFFERS, it
seems), and will give the race condition plenty of time to manifest itself.
When you interrupt the program, the system will crash.

Thanks,
tavi

[-- Attachment #2: splice.c --]
[-- Type: text/x-csrc, Size: 7932 bytes --]

#define _GNU_SOURCE

#include <stdlib.h>
#include <string.h>
#include <error.h>
#include <argp.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/epoll.h>
#include <sys/fcntl.h>
#include <sys/wait.h>
#include <stdio.h>
#include <errno.h>
#include <unistd.h>
#include <sys/types.h>
#include <netdb.h>
#include <net/if.h>
#include <signal.h>
#include <netpacket/packet.h>
#include <net/ethernet.h>	/* the L2 protocols */
#include <assert.h>
#include <netinet/tcp.h>
#include <linux/unistd.h>

#ifdef __PPC__

#undef __syscall_nr
#define __syscall_nr(nr, type, name, args...)				\
	unsigned long __sc_ret, __sc_err;				\
	{								\
		register unsigned long __sc_0 __asm__ ("r0");		\
		register unsigned long __sc_3 __asm__ ("r3");		\
		register unsigned long __sc_4 __asm__ ("r4");		\
		register unsigned long __sc_5 __asm__ ("r5");		\
		register unsigned long __sc_6 __asm__ ("r6");		\
		register unsigned long __sc_7 __asm__ ("r7");		\
		register unsigned long __sc_8 __asm__ ("r8");		\
									\
		__sc_loadargs_##nr(name, args);				\
		__asm__ __volatile__					\
			("sc           \n\t"				\
			 "mfcr %0      "				\
			: "=&r" (__sc_0),				\
			  "=&r" (__sc_3),  "=&r" (__sc_4),		\
			  "=&r" (__sc_5),  "=&r" (__sc_6),		\
			  "=&r" (__sc_7),  "=&r" (__sc_8)		\
			: __sc_asm_input_##nr				\
			: "cr0", "ctr", "memory",			\
			  "r9", "r10","r11", "r12");			\
		__sc_ret = __sc_3;					\
		__sc_err = __sc_0;					\
	}								\
	if (__sc_err & 0x10000000)					\
	{								\
		errno = __sc_ret;					\
		__sc_ret = -1;						\
	}								\
	return (type) __sc_ret

#define __sc_loadargs_6(name, arg1, arg2, arg3, arg4, arg5, arg6)	\
	__sc_loadargs_5(name, arg1, arg2, arg3, arg4, arg5);		\
	__sc_8 = (unsigned long) (arg6)
#define __sc_asm_input_6 __sc_asm_input_5, "6" (__sc_8)

#define _syscall6(type,name,type1,arg1,type2,arg2,type3,arg3,type4,arg4,type5,arg5,type6,arg6) \
type name(type1 arg1, type2 arg2, type3 arg3, type4 arg4, type5 arg5, type6 arg6) \
{									\
	__syscall_nr(6, type, name, arg1, arg2, arg3, arg4, arg5, arg6); \
}

#define SPLICE_F_MOVE		0x01
#define SPLICE_F_NONBLOCK	0x02
#define SPLICE_F_MORE		0x04
#define SPLICE_F_GIFT		0x08

#define __NR_splice 283

_syscall6(int, splice, int, fd_in, loff_t*, off_in, int, fd_out,
	  loff_t*, off_out, size_t, len, unsigned int, flags);

#endif

/*
 * Command line parser
 */
const char *argp_program_version = "0.1";
const char *argp_program_bug_address = "opurdila@ixiacom.com";

static char doc[] = "";
static char args_doc[] = "";

static struct argp_option options[] = {
	{"interface", 'i', "string", 0, "interface"},
	{"dst", 'd', "string", 0, "dst"},
	{"src", 's', "string", 0, "src"},
	{"port", 'p', "string", 0, "port"},
	{"socket-buffer", 'S', "string", 0, "packet size"},
	{"interval", 'I', "int", 0, "interval between packet send operations"},
	{"count", 'c', "int", 0, "counter"},
	{0},
};

typedef struct {
	int size, ifindex, interval, count, port, so_size;
	struct sockaddr_in src, dst;
	const char *ifname;
} cl_args_t;

cl_args_t cla;

static error_t parse_opt(int key, char *arg, struct argp_state *state)
{
	cl_args_t *cla = state->input;

	switch (key) {
	case 'p': {
		cla->port = strtol(arg, NULL, 0);
		break;
	}
	case 'c': {
		cla->count = strtol(arg, NULL, 0);
		break;
	}
	case 'I': {
		cla->interval = strtol(arg, NULL, 0);
		break;
	}
	case 'S': {
		cla->so_size = strtol(arg, NULL, 0);
		break;
	}
	case 'd': {
		cla->dst.sin_family = AF_INET;
		if (inet_pton(AF_INET, arg, &cla->dst.sin_addr) < 0) {
			printf("bad address: %s\n", arg);
			return -1;
		}
		break;
	}
	case 's': {
		cla->src.sin_family = AF_INET;
		if (inet_pton(AF_INET, arg, &cla->src.sin_addr) < 0) {
			printf("bad address: %s\n", arg);
			return -1;
		}
		break;
	}
	case 'i': {
		cla->ifindex = if_nametoindex(arg);
		if (cla->ifindex == 0) {
			printf("invalid interface: %s\n", arg);
			return -1;
		}
		cla->ifname = arg;
		break;
	}
	case ARGP_KEY_ARG:
		break;
	default:
		return ARGP_ERR_UNKNOWN;
	}

	return 0;
}

int prepare_socket(void)
{
	int s = socket(PF_INET, SOCK_STREAM, 0);

	if (s < 0) {
		printf("failed to create socket: %s\n", strerror(errno));
		return -1;
	}

	if ((cla.src.sin_addr.s_addr || cla.src.sin_port) &&
	    bind(s, (const struct sockaddr *)&cla.src, sizeof(cla.src)) < 0) {
		printf("failed to bind socket: %s\n", strerror(errno));
		return -1;
	}

	if (cla.ifname) {
		struct ifreq ifr;

		strcpy((char *)&ifr.ifr_name, cla.ifname);
		if (setsockopt(s, SOL_SOCKET, SO_BINDTODEVICE, &ifr,
			       sizeof(ifr)) < 0) {
			printf("failed to bind to interface %s: %s\n",
			       cla.ifname, strerror(errno));
			return -1;
		}
	}

	return s;
}

int do_server(int s)
{
	int error, s2;
	int pipefd[2];

	if (pipe(pipefd) < 0) {
		perror("pipe error:");
		return -1;
	}

	if (listen(s, 1) < 0) {
		perror("listen error:");
		return -1;
	}

	if ((s2 = accept(s, NULL, NULL)) < 0) {
		perror("accept error:");
		return -1;
	}

	/* 6% improvement by _lowering_ the buffer size :O */
	if (cla.so_size && setsockopt(s2, SOL_SOCKET, SO_RCVBUF,
				      &cla.so_size, sizeof cla.so_size) < 0) {
		perror("SO_RCVBUF");
	}

	while (1) {
		error = splice(s2, NULL, pipefd[1], NULL, 4*4096,
			       SPLICE_F_MOVE);
		if (error <= 0) {
			perror("in splice");
			return -1;
		}

		error = splice(pipefd[0], NULL, STDOUT_FILENO, NULL, error,
			       SPLICE_F_MOVE);
		if (error <= 0) {
			perror("out splice");
			return -1;
		}
	}

	return -1;
}

int do_client(int s)
{
	int error;
	int pipefd[2], fd = STDIN_FILENO;
	loff_t off = 0;

	if (pipe(pipefd) < 0) {
		perror("pipe error:");
		return -1;
	}

	if (connect(s, (struct sockaddr *)&cla.dst, sizeof(cla.dst)) < 0) {
		perror("connect error:");
		return -1;
	}

	if (cla.so_size && setsockopt(s, SOL_SOCKET, SO_SNDBUF,
				      &cla.so_size, sizeof cla.so_size) < 0) {
		perror("SO_RCVBUF");
	}

	while (1) {
		off = 0;
		error = splice(fd, &off, pipefd[1], NULL, 4*4096,
			       SPLICE_F_MOVE);
		if (error <= 0) {
			perror("in splice");
			return -1;
		}

		error = splice(pipefd[0], NULL, s, NULL, error,
			       SPLICE_F_MOVE);
		if (error <= 0) {
			perror("out splice");
			return -1;
		}
	}

	close(s);
	return -1;
}

static struct argp argp = { options, parse_opt, args_doc, doc };

int main(int argc, char **argv)
{
	int s;

	if (argp_parse(&argp, argc, argv, 0, 0, &cla) < 0)
		return -1;

	if (cla.dst.sin_addr.s_addr)
		cla.dst.sin_port = htons(cla.port);
	else
		cla.src.sin_port = htons(cla.port);

	if ((s = prepare_socket()) < 0)
		return -1;

	if (cla.dst.sin_addr.s_addr)
		return do_client(s);
	else
		return do_server(s);
}
* Re: race in skb_splice_bits?
From: Evgeniy Polyakov @ 2008-05-27 13:21 UTC (permalink / raw)
To: Octavian Purdila; +Cc: Ben Hutchings, netdev

On Tue, May 27, 2008 at 03:53:49PM +0300, Octavian Purdila (opurdila@ixiacom.com) wrote:
> When you interrupt the program, the system will crash.

Cool!

I've reproduced the problem and will try to work it out, thank you.

--
	Evgeniy Polyakov
* Re: race in skb_splice_bits?
From: Evgeniy Polyakov @ 2008-05-27 14:03 UTC (permalink / raw)
To: Octavian Purdila; +Cc: Ben Hutchings, netdev, davem

On Tue, May 27, 2008 at 05:21:48PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> On Tue, May 27, 2008 at 03:53:49PM +0300, Octavian Purdila (opurdila@ixiacom.com) wrote:
> > When you interrupt the program, the system will crash.
>
> Cool!
>
> I've reproduced the problem and will try to work it out, thank you.

The attached patch fixes the crash for me; Octavian, could you please test it.

David, is the __kfree_skb() usage in tcp_read_sock() an optimisation only?
With this patch I do not see any leaks, but I did not investigate it deeply
enough. If the approach seems correct, I will clean things up.

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6087013..d285817 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1349,6 +1349,7 @@ done:
 
 	if (spd.nr_pages) {
 		int ret;
+		struct sock *sk = __skb->sk;
 
 		/*
 		 * Drop the socket lock, otherwise we have reverse
@@ -1359,9 +1360,9 @@ done:
 		 * we call into ->sendpage() with the i_mutex lock held
 		 * and networking will grab the socket lock.
 		 */
-		release_sock(__skb->sk);
+		release_sock(sk);
 		ret = splice_to_pipe(pipe, &spd);
-		lock_sock(__skb->sk);
+		lock_sock(sk);
 
 		return ret;
 	}
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 39b629a..b8318e9 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1182,6 +1182,23 @@ static inline struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off)
 	return NULL;
 }
 
+#ifdef CONFIG_NET_DMA
+static inline void __sk_eat_skb(struct sock *sk, struct sk_buff *skb, int copied_early)
+{
+	__skb_unlink(skb, &sk->sk_receive_queue);
+	if (!copied_early)
+		kfree_skb(skb);
+	else
+		__skb_queue_tail(&sk->sk_async_wait_queue, skb);
+}
+#else
+static inline void __sk_eat_skb(struct sock *sk, struct sk_buff *skb, int copied_early)
+{
+	__skb_unlink(skb, &sk->sk_receive_queue);
+	kfree_skb(skb);
+}
+#endif
+
 /*
  * This routine provides an alternative to tcp_recvmsg() for routines
  * that would like to handle copying from skbuffs directly in 'sendfile'
@@ -1231,11 +1248,11 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 			break;
 		}
 		if (tcp_hdr(skb)->fin) {
-			sk_eat_skb(sk, skb, 0);
+			__sk_eat_skb(sk, skb, 0);
 			++seq;
 			break;
 		}
-		sk_eat_skb(sk, skb, 0);
+		__sk_eat_skb(sk, skb, 0);
 		if (!desc->count)
 			break;
 	}

--
	Evgeniy Polyakov
* Re: race in skb_splice_bits?
From: Octavian Purdila @ 2008-05-27 14:39 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: Ben Hutchings, netdev, davem

On Tuesday 27 May 2008, Evgeniy Polyakov wrote:
> On Tue, May 27, 2008 at 05:21:48PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> > On Tue, May 27, 2008 at 03:53:49PM +0300, Octavian Purdila (opurdila@ixiacom.com) wrote:
> > > When you interrupt the program, the system will crash.
> >
> > Cool!
> >
> > I've reproduced the problem and will try to work it out, thank you.
>
> The attached patch fixes the crash for me; Octavian, could you please test
> it.

On my system it still crashes, if I interrupt the program before splice
blocks:

[   16.170009] NIP [8014d0c0] tcp_read_sock+0x138/0x1f8
[   16.170009] LR [8014d090] tcp_read_sock+0x108/0x1f8
[   16.170009] Call Trace:
[   16.170009] [bcd21dd0] [8014d090] tcp_read_sock+0x108/0x1f8 (unreliable)
[   16.170009] [bcd21e20] [8014b590] __tcp_splice_read+0x34/0x44
[   16.170009] [bcd21e40] [8014b620] tcp_splice_read+0x80/0x220
[   16.170009] [bcd21e90] [80105730] sock_splice_read+0x2c/0x44
[   16.170009] [bcd21ea0] [8008a374] do_splice_to+0x90/0xac
[   16.170009] [bcd21ed0] [8008a850] do_splice+0x258/0x2f0
[   16.170009] [bcd21f10] [8008b1d4] sys_splice+0xe0/0xe8
[   16.170009] [bcd21f40] [8000ff14] ret_from_syscall+0x0/0x38

Thanks,
tavi
* Re: race in skb_splice_bits?
From: Evgeniy Polyakov @ 2008-05-27 15:09 UTC (permalink / raw)
To: Octavian Purdila; +Cc: Ben Hutchings, netdev, davem

On Tue, May 27, 2008 at 05:39:11PM +0300, Octavian Purdila (opurdila@ixiacom.com) wrote:
> On my system it still crashes, if I interrupt the program before splice
> blocks:

Yes, that was a somewhat stupid patch. Since I can not reproduce it anymore,
let's try to upload this task to you :)

Please test this one, assuming there are no other patches applied before.

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6087013..e2a99a6 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1295,7 +1295,7 @@ err:
  * the frag list, if such a thing exists. We'd probably need to recurse to
  * handle that cleanly.
  */
-int skb_splice_bits(struct sk_buff *__skb, unsigned int offset,
+int skb_splice_bits(struct sk_buff *skb, unsigned int offset,
 		    struct pipe_inode_info *pipe, unsigned int tlen,
 		    unsigned int flags)
 {
@@ -1308,16 +1308,6 @@ int skb_splice_bits(struct sk_buff *__skb, unsigned int offset,
 		.ops = &sock_pipe_buf_ops,
 		.spd_release = sock_spd_release,
 	};
-	struct sk_buff *skb;
-
-	/*
-	 * I'd love to avoid the clone here, but tcp_read_sock()
-	 * ignores reference counts and unconditonally kills the sk_buff
-	 * on return from the actor.
-	 */
-	skb = skb_clone(__skb, GFP_KERNEL);
-	if (unlikely(!skb))
-		return -ENOMEM;
 
 	/*
 	 * __skb_splice_bits() only fails if the output has no room left,
@@ -1341,14 +1331,9 @@ int skb_splice_bits(struct sk_buff *__skb, unsigned int offset,
 	}
 
 done:
-	/*
-	 * drop our reference to the clone, the pipe consumption will
-	 * drop the rest.
-	 */
-	kfree_skb(skb);
-
 	if (spd.nr_pages) {
 		int ret;
+		struct sock *sk = __skb->sk;
 
 		/*
 		 * Drop the socket lock, otherwise we have reverse
@@ -1359,9 +1344,9 @@ done:
 		 * we call into ->sendpage() with the i_mutex lock held
 		 * and networking will grab the socket lock.
 		 */
-		release_sock(__skb->sk);
+		release_sock(sk);
 		ret = splice_to_pipe(pipe, &spd);
-		lock_sock(__skb->sk);
+		lock_sock(sk);
 
 		return ret;
 	}
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 39b629a..54bb460 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1196,7 +1196,7 @@ static inline struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off)
 int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 		  sk_read_actor_t recv_actor)
 {
-	struct sk_buff *skb;
+	struct sk_buff *skb, *__skb;
 	struct tcp_sock *tp = tcp_sk(sk);
 	u32 seq = tp->copied_seq;
 	u32 offset;
@@ -1204,10 +1204,14 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 	if (sk->sk_state == TCP_LISTEN)
 		return -ENOTCONN;
-	while ((skb = tcp_recv_skb(sk, seq, &offset)) != NULL) {
+	while ((__skb = tcp_recv_skb(sk, seq, &offset)) != NULL) {
 		if (offset < skb->len) {
 			size_t used, len;
 
+			skb = skb_clone(__skb, GFP_KERNEL);
+			if (unlikely(!skb))
+				break;
+
 			len = skb->len - offset;
 			/* Stop reading if we hit a patch of urgent data */
 			if (tp->urg_data) {
@@ -1215,29 +1219,35 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 				if (urg_offset < len)
 					len = urg_offset;
 				if (!len)
-					break;
+					goto out_break;
 			}
 			used = recv_actor(desc, skb, offset, len);
 			if (used < 0) {
 				if (!copied)
 					copied = used;
-				break;
+				goto out_break;
 			} else if (used <= len) {
 				seq += used;
 				copied += used;
 				offset += used;
 			}
 			if (offset != skb->len)
-				break;
+				goto out_break;
 		}
 		if (tcp_hdr(skb)->fin) {
 			sk_eat_skb(sk, skb, 0);
 			++seq;
-			break;
+			goto out_break;
 		}
 		sk_eat_skb(sk, skb, 0);
+out:
+		kfree_skb(skb);
 		if (!desc->count)
-			break;
+			goto out_break;
+		continue;
+out_break:
+		kfree_skb(skb);
+		break;
 	}
 	tp->copied_seq = seq;

--
	Evgeniy Polyakov
* Re: race in skb_splice_bits?
From: Evgeniy Polyakov @ 2008-05-27 15:12 UTC (permalink / raw)
To: Octavian Purdila; +Cc: Ben Hutchings, netdev, davem

On Tue, May 27, 2008 at 07:09:31PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> Please test this one, assuming there are no other patches applied before.

A typo sneaked into the patch, please try this one.

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6087013..e2a99a6 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1295,7 +1295,7 @@ err:
  * the frag list, if such a thing exists. We'd probably need to recurse to
  * handle that cleanly.
  */
-int skb_splice_bits(struct sk_buff *__skb, unsigned int offset,
+int skb_splice_bits(struct sk_buff *skb, unsigned int offset,
 		    struct pipe_inode_info *pipe, unsigned int tlen,
 		    unsigned int flags)
 {
@@ -1308,16 +1308,6 @@ int skb_splice_bits(struct sk_buff *__skb, unsigned int offset,
 		.ops = &sock_pipe_buf_ops,
 		.spd_release = sock_spd_release,
 	};
-	struct sk_buff *skb;
-
-	/*
-	 * I'd love to avoid the clone here, but tcp_read_sock()
-	 * ignores reference counts and unconditonally kills the sk_buff
-	 * on return from the actor.
-	 */
-	skb = skb_clone(__skb, GFP_KERNEL);
-	if (unlikely(!skb))
-		return -ENOMEM;
 
 	/*
 	 * __skb_splice_bits() only fails if the output has no room left,
@@ -1341,14 +1331,9 @@ int skb_splice_bits(struct sk_buff *__skb, unsigned int offset,
 	}
 
 done:
-	/*
-	 * drop our reference to the clone, the pipe consumption will
-	 * drop the rest.
-	 */
-	kfree_skb(skb);
-
 	if (spd.nr_pages) {
 		int ret;
+		struct sock *sk = __skb->sk;
 
 		/*
 		 * Drop the socket lock, otherwise we have reverse
@@ -1359,9 +1344,9 @@ done:
 		 * we call into ->sendpage() with the i_mutex lock held
 		 * and networking will grab the socket lock.
 		 */
-		release_sock(__skb->sk);
+		release_sock(sk);
 		ret = splice_to_pipe(pipe, &spd);
-		lock_sock(__skb->sk);
+		lock_sock(sk);
 
 		return ret;
 	}
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 39b629a..fb8bfc2 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1196,7 +1196,7 @@ static inline struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off)
 int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 		  sk_read_actor_t recv_actor)
 {
-	struct sk_buff *skb;
+	struct sk_buff *skb, *__skb;
 	struct tcp_sock *tp = tcp_sk(sk);
 	u32 seq = tp->copied_seq;
 	u32 offset;
@@ -1204,10 +1204,14 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 	if (sk->sk_state == TCP_LISTEN)
 		return -ENOTCONN;
-	while ((skb = tcp_recv_skb(sk, seq, &offset)) != NULL) {
+	while ((__skb = tcp_recv_skb(sk, seq, &offset)) != NULL) {
 		if (offset < skb->len) {
 			size_t used, len;
 
+			skb = skb_clone(__skb, GFP_KERNEL);
+			if (unlikely(!skb))
+				break;
+
 			len = skb->len - offset;
 			/* Stop reading if we hit a patch of urgent data */
 			if (tp->urg_data) {
@@ -1215,29 +1219,35 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 				if (urg_offset < len)
 					len = urg_offset;
 				if (!len)
-					break;
+					goto out_break;
 			}
 			used = recv_actor(desc, skb, offset, len);
 			if (used < 0) {
 				if (!copied)
 					copied = used;
-				break;
+				goto out_break;
 			} else if (used <= len) {
 				seq += used;
 				copied += used;
 				offset += used;
 			}
 			if (offset != skb->len)
-				break;
+				goto out_break;
 		}
 		if (tcp_hdr(skb)->fin) {
 			sk_eat_skb(sk, skb, 0);
 			++seq;
-			break;
+			goto out_break;
 		}
 		sk_eat_skb(sk, skb, 0);
+out:
+		kfree_skb(skb);
 		if (!desc->count)
 			break;
+		continue;
+out_break:
+		kfree_skb(skb);
+		break;
 	}
 	tp->copied_seq = seq;

--
	Evgeniy Polyakov
* Re: race in skb_splice_bits?
  2008-05-27 15:12 ` Evgeniy Polyakov
@ 2008-05-27 15:22 ` Evgeniy Polyakov
  2008-05-27 15:33 ` Octavian Purdila
  0 siblings, 1 reply; 29+ messages in thread
From: Evgeniy Polyakov @ 2008-05-27 15:22 UTC (permalink / raw)
To: Octavian Purdila; +Cc: Ben Hutchings, netdev, davem

On Tue, May 27, 2008 at 07:12:59PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> On Tue, May 27, 2008 at 07:09:31PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> > Please test this one assuming there are no other patches applied before.
>
> A typo sneaked into the patch, please try this one.

The same wrong one, sorry about that.

The idea is to hold the skb across the release/lock socket calls and thus keep the core stack from freeing it while the socket lock is dropped. The patch still misses the case where the socket is released and the skb has already been dequeued, so splice will try to dequeue it again, which will crash. I will think about how to fix the issue.

-- 
	Evgeniy Polyakov

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: race in skb_splice_bits?
  2008-05-27 15:22 ` Evgeniy Polyakov
@ 2008-05-27 15:33 ` Octavian Purdila
  2008-05-27 15:47 ` Evgeniy Polyakov
  0 siblings, 1 reply; 29+ messages in thread
From: Octavian Purdila @ 2008-05-27 15:33 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: Ben Hutchings, netdev, davem

On Tuesday 27 May 2008, Evgeniy Polyakov wrote:
>
> The same wrong one, sorry about that.
>
> Idea is to hold skb between release/lock sock calls and thus do not
> allow to free it by core stack when it is being released. Patch still
> misses the case, when socket is released and skb was dequeued, so splice
> will try to dequeue it again, which will crash. I will think on how to
> fix the issue.

Yes, I think I got the idea you are trying to use here. But somehow I feel uneasy with this approach :) Isn't it cleaner to keep the lock and try to avoid the deadlock on the sendfile() side? Or is that unfeasible?

I don't think we can drop the socket lock; it will introduce at least one type of race: since the skb we are processing is still on the socket queue, any entity accessing the socket queue will possibly collide with us.

Thanks,
tavi

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: race in skb_splice_bits?
  2008-05-27 15:33 ` Octavian Purdila
@ 2008-05-27 15:47 ` Evgeniy Polyakov
  2008-05-27 17:28 ` Evgeniy Polyakov
  0 siblings, 1 reply; 29+ messages in thread
From: Evgeniy Polyakov @ 2008-05-27 15:47 UTC (permalink / raw)
To: Octavian Purdila; +Cc: Ben Hutchings, netdev, davem

On Tue, May 27, 2008 at 06:33:55PM +0300, Octavian Purdila (opurdila@ixiacom.com) wrote:
> Yes, I think I got the idea you are trying to use here. But somehow I feel
> uneasy with this approach :) Isn't it cleaner to keep the lock and try to
> avoid the deadlock on the sendfile() side? Or is that unfeasible?

Well, providing some flags to the ->sendpage() callbacks to say that they should not grab locks looks a bit ugly to me; I think we can fix the splice code not to do the wrong thing instead.

> I don't think we can drop the socket lock; it will introduce at least one
> type of race: since the skb we are processing is still on the socket queue,
> any entity accessing the socket queue will possibly collide with us.

Each access is done under the socket lock, so it should be safe. But the socket release path drops the skb from the list before dropping its reference counter, so we are not allowed to unlink it again; we can, however, check whether the skb is still linked and make a decision based on that.

-- 
	Evgeniy Polyakov

^ permalink raw reply [flat|nested] 29+ messages in thread
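The "check whether the skb is still linked" idea can be modelled outside the kernel. The following is only a sketch with invented names (`struct node`, `list_unlink_if_linked` are made up for illustration, not kernel APIs); it shows how clearing a node's pointers on unlink, similar in spirit to what skb_unlink() does with an skb's list pointers, lets a second unlink attempt be detected instead of corrupting the queue:

```c
#include <stddef.h>

/* Minimal userspace model of a doubly-linked queue whose nodes may be
 * removed by another path while a lock is dropped.  Before unlinking a
 * node again, we check whether it is still on the list. */
struct node {
	struct node *next, *prev;
};

static void list_init(struct node *head)
{
	head->next = head->prev = head;
}

static void list_add_tail(struct node *head, struct node *n)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void list_unlink(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = n->prev = NULL;	/* mark the node as unlinked */
}

/* Safe variant: unlink only if the node is still linked.
 * Returns 1 if we unlinked it, 0 if someone else already did. */
static int list_unlink_if_linked(struct node *n)
{
	if (!n->next)
		return 0;
	list_unlink(n);
	return 1;
}
```

The second unlink attempt then becomes a cheap no-op rather than a list corruption, which is the property the discussion above is after.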
* Re: race in skb_splice_bits?
  2008-05-27 15:47 ` Evgeniy Polyakov
@ 2008-05-27 17:28 ` Evgeniy Polyakov
  2008-05-27 23:59 ` Octavian Purdila
  0 siblings, 1 reply; 29+ messages in thread
From: Evgeniy Polyakov @ 2008-05-27 17:28 UTC (permalink / raw)
To: Octavian Purdila; +Cc: Ben Hutchings, netdev, davem

On Tue, May 27, 2008 at 07:47:10PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> Well, providing some flags to ->sendpage() callbacks to say that they
> should not grab locks is a bit ugly to me, I think we can fix splice
> code not to do wrong things.

And actually that would lead to data corruption, since a different process could enter and change socket state under us. Without the temporary lock drop we can deadlock: the first process locks the socket and then waits for i_mutex, while another one has grabbed i_mutex and waits for the socket lock in sendfile().

> > I don't think we can drop the socket lock; it will introduce at least one
> > type of race: since the skb we are processing is still on the socket queue,
> > any entity accessing the socket queue will possibly collide with us.
>
> Each access is being done under socket lock, so it should be safe. But
> socket release path drops skb from the list before dropping its
> reference counter, so we are not allowed to unlink it again, but we can
> check if skb is linked or not and make a decision based on that.

Please try the attached patch on top of a vanilla tree. It does not use the skb after the socket lock was dropped, but instead searches for it again once the socket is locked again: if the socket is alive, it will find the skb and clean it up, otherwise it will exit.

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6087013..d285817 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1349,6 +1349,7 @@ done:
 	if (spd.nr_pages) {
 		int ret;
+		struct sock *sk = __skb->sk;

 		/*
 		 * Drop the socket lock, otherwise we have reverse
 		 * locking dependencies between sk_lock and i_mutex
 		 * here as compared to sendfile(). We enter here
 		 * with the socket lock held, and splice_to_pipe() will
 		 * grab the pipe inode lock. For sendfile() emulation,
 		 * we call into ->sendpage() with the i_mutex lock held
 		 * and networking will grab the socket lock.
 		 */
-		release_sock(__skb->sk);
+		release_sock(sk);
 		ret = splice_to_pipe(pipe, &spd);
-		lock_sock(__skb->sk);
+		lock_sock(sk);
 		return ret;
 	}

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 39b629a..217dc59 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1200,15 +1200,17 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 	struct tcp_sock *tp = tcp_sk(sk);
 	u32 seq = tp->copied_seq;
 	u32 offset;
-	int copied = 0;
+	int copied = 0, fin;
+	size_t used, len, skb_len;

 	if (sk->sk_state == TCP_LISTEN)
 		return -ENOTCONN;

 	while ((skb = tcp_recv_skb(sk, seq, &offset)) != NULL) {
+		used = 0;
 		if (offset < skb->len) {
-			size_t used, len;
-			len = skb->len - offset;
+			skb_len = skb->len;
+			len = skb_len - offset;
 			/* Stop reading if we hit a patch of urgent data */
 			if (tp->urg_data) {
 				u32 urg_offset = tp->urg_seq - seq;
@@ -1217,6 +1219,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 				if (!len)
 					break;
 			}
+
 			used = recv_actor(desc, skb, offset, len);
 			if (used < 0) {
 				if (!copied)
@@ -1227,15 +1230,22 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 				copied += used;
 				offset += used;
 			}
-			if (offset != skb->len)
+			if (offset != skb_len)
 				break;
 		}
-		if (tcp_hdr(skb)->fin) {
-			sk_eat_skb(sk, skb, 0);
-			++seq;
+		skb = tcp_recv_skb(sk, seq-used, &offset);
+		if (!skb) {
+			if (!copied)
+				copied = -EINVAL;
 			break;
 		}
+		fin = tcp_hdr(skb)->fin;
 		sk_eat_skb(sk, skb, 0);
+
+		if (fin) {
+			++seq;
+			break;
+		}
 		if (!desc->count)
 			break;
 	}

-- 
	Evgeniy Polyakov

^ permalink raw reply related [flat|nested] 29+ messages in thread
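The lookup-by-sequence-number approach used in the patch above, where the sequence number rather than the skb pointer is carried across the lock drop, can be illustrated with a small userspace model. The names here (`struct seg`, `find_seg`) are invented for the sketch; this is not the kernel's tcp_recv_skb(), only the same idea of searching the receive queue again for the segment containing a given sequence number:

```c
#include <stddef.h>

/* Userspace model of a receive queue entry: a run of bytes starting
 * at a given TCP sequence number. */
struct seg {
	unsigned int seq;	/* sequence number of the first byte */
	unsigned int len;	/* number of bytes in the segment */
};

/* Analogue of tcp_recv_skb(): find the segment containing 'seq' and
 * report the offset of 'seq' within it.  Returns NULL if that byte is
 * no longer (or not yet) in the queue, e.g. because another reader
 * consumed it while the lock was dropped. */
static struct seg *find_seg(struct seg *q, size_t n, unsigned int seq,
			    unsigned int *off)
{
	for (size_t i = 0; i < n; i++) {
		if (seq >= q[i].seq && seq < q[i].seq + q[i].len) {
			*off = seq - q[i].seq;
			return &q[i];
		}
	}
	return NULL;
}
```

If another consumer advanced past the remembered sequence number in the meantime, the lookup simply fails, which is exactly the "if the socket is alive it will find it, otherwise exit" behaviour described above.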
* Re: race in skb_splice_bits?
  2008-05-27 17:28 ` Evgeniy Polyakov
@ 2008-05-27 23:59 ` Octavian Purdila
  2008-05-28  8:52 ` Evgeniy Polyakov
  0 siblings, 1 reply; 29+ messages in thread
From: Octavian Purdila @ 2008-05-27 23:59 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: Ben Hutchings, netdev, davem

[-- Attachment #1: Type: text/plain, Size: 740 bytes --]

On Tuesday 27 May 2008, Evgeniy Polyakov wrote:
>
> Please try attached patch on top of vanilla tree.
> It does not use skb after socket was dropped, but instead search it
> again when socket is locked, so if socket is alive, it will find it and
> clean otherwise it will exit.
>

This fixes the crash, thanks.

One doubt though: suppose that while we drop the lock the skb gets aggregated with the one after it. If the original skb is fully consumed in the receive actor, then we will eat the new, aggregated skb, losing data.

Here is a patch, based on your idea, which tries to cope with the above scenario. The !skb check was added for the case in which the actor does not consume anything in the current iteration.

tavi

[-- Attachment #2: a.diff --]
[-- Type: text/x-diff, Size: 1040 bytes --]

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 0a9002b..0a0a663 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1358,7 +1358,8 @@ done:
 	if (spd.nr_pages) {
 		int ret;
-
+		struct sock *sk= __skb->sk;
+
 		/*
 		 * Drop the socket lock, otherwise we have reverse
 		 * locking dependencies between sk_lock and i_mutex
 		 * here as compared to sendfile(). We enter here
 		 * with the socket lock held, and splice_to_pipe() will
 		 * grab the pipe inode lock. For sendfile() emulation,
 		 * we call into ->sendpage() with the i_mutex lock held
 		 * and networking will grab the socket lock.
 		 */
-		release_sock(__skb->sk);
+		release_sock(sk);
 		ret = splice_to_pipe(pipe, &spd);
-		lock_sock(__skb->sk);
+		lock_sock(sk);
 		return ret;
 	}

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 3e91b28..34049d0 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1227,7 +1227,8 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 			copied += used;
 			offset += used;
 		}
-		if (offset != skb->len)
+		skb = tcp_recv_skb(sk, seq-1, &offset);
+		if (!skb || (offset+1 != skb->len))
 			break;
 	}
 	if (tcp_hdr(skb)->fin) {

^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: race in skb_splice_bits?
  2008-05-27 23:59 ` Octavian Purdila
@ 2008-05-28  8:52 ` Evgeniy Polyakov
  2008-05-28 13:20 ` Octavian Purdila
  0 siblings, 1 reply; 29+ messages in thread
From: Evgeniy Polyakov @ 2008-05-28 8:52 UTC (permalink / raw)
To: Octavian Purdila; +Cc: Ben Hutchings, netdev, davem

On Wed, May 28, 2008 at 02:59:30AM +0300, Octavian Purdila (opurdila@ixiacom.com) wrote:
> One doubt though: suppose that while we drop the lock the skb gets aggregated
> with the one after it. If the original skb is fully consumed in the receive
> actor, then we will eat the new, aggregated skb, losing data.

How can it be aggregated with another skb? It is only possible that some other reader consumes the data, but in that case the sequence number will not match and we will not find the skb.

> Here is a patch, based on your idea, which tries to cope with the above
> scenario. The !skb check was added for the case in which the actor does not
> consume anything in the current iteration.

If it does not get any data, then the skb will likely still exist and will be consumed in the next run. I preserved the old semantics, where we free the skb only if we read it whole or in case of a fin. With your changes we can also free an skb that was only partially consumed, and not free it at all if it was not processed because it is too old (i.e. it lives in the receive queue, but we already read the data with the sequence number that corresponds to it), no?

-- 
	Evgeniy Polyakov

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: race in skb_splice_bits?
  2008-05-28  8:52 ` Evgeniy Polyakov
@ 2008-05-28 13:20 ` Octavian Purdila
  2008-05-28 14:11 ` Evgeniy Polyakov
  0 siblings, 1 reply; 29+ messages in thread
From: Octavian Purdila @ 2008-05-28 13:20 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: Ben Hutchings, netdev, davem

On Wednesday 28 May 2008, Evgeniy Polyakov wrote:
> > One doubt though: suppose that while we drop the lock the skb gets
> > aggregated with the one after it. If the original skb is fully consumed
> > in the receive actor, then we will eat the new, aggregated skb,
> > losing data.
>
> How can it be aggregated with another skb?

I think that tcp_collapse() does this kind of aggregation when we have memory pressure: it will create a new skb and move the data from the old skbs into the new one, freeing the old skbs. Right?

> It is only possible that some other reader consumes the data, but in that
> case sequence number will not match and we will not find skb.

Hmm, I didn't think about this, but you are right, we could have another consumer sneaking in.

> > Here is a patch, based on your idea, which tries to cope with the above
> > scenario. The !skb check was added for the case in which the actor does
> > not consume anything in the current iteration.
>
> If it does not get any data, then skb will likely exist and will be
> consumed in the next run.

If the consumer didn't consume data, the seq number will not be updated, and seq-1 will correspond to a previous skb, which was already dequeued and processed. Thus tcp_recv_skb(seq-1) will return NULL.

> I preserved old semantic, when we free skb
> only if we read it whole or in case of fin. With your changes we can
> also free skb, if it was partially consumed

If the skb was partially consumed then tcp_recv_skb(seq-1) will return the same skb and offset+1 != skb->len, thus we will not free it.

> and do not free it at all if
> skb was not processed because it is too old (i.e. it lives in receive
> queue, but we already read data with sequence number, which corresponds
> to it), no?

I didn't get this part; shouldn't the entity that dequeues the packet also free it?

Anyways, I am a newbie in this area, so if my logic doesn't make any sense please be patient with me :)

Thanks,
tavi

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: race in skb_splice_bits?
  2008-05-28 13:20 ` Octavian Purdila
@ 2008-05-28 14:11 ` Evgeniy Polyakov
  2008-05-28 15:20 ` Octavian Purdila
  0 siblings, 1 reply; 29+ messages in thread
From: Evgeniy Polyakov @ 2008-05-28 14:11 UTC (permalink / raw)
To: Octavian Purdila; +Cc: Ben Hutchings, netdev, davem

On Wed, May 28, 2008 at 04:20:55PM +0300, Octavian Purdila (opurdila@ixiacom.com) wrote:
> > I preserved old semantic, when we free skb
> > only if we read it whole or in case of fin. With your changes we can
> > also free skb, if it was partially consumed
>
> If the skb was partially consumed then tcp_recv_skb(seq-1) will return the
> same skb and offset+1 != skb->len, thus we will not free it.

I understand now; please correct me if I got your idea wrong.

We only need to search for the skb again in the case where we processed it and the socket lock may have been dropped, so the only place tcp_recv_skb() is needed is where you pointed. Next, to find the current skb we have to use a sequence number inside its data, and seq-1 is exactly that: even if the skb was changed under us (e.g. combined with others), we will still find it in the queue.

The offset in tcp_recv_skb() is set to the number of bytes between the starting sequence number of the skb and the given sequence number. Thus, if we search for (seq-1), the offset will point to the last byte processed by the code, so we check against (offset+1): if (offset+1) equals the size of the skb found, we can free this skb (in the code below). If that is not the case and the skb contains data beyond (offset+1), we break, and the rest of the data will be processed in the next run. If there is no skb, we just break out of the loop.

So yes, your patch is simpler and faster than mine, so you should push it upstream. Fortunately David Miller is in copy and will (David, will you?) pick it up and push it; likely it is also needed for stable?

> Anyways, I am a newbie in this area, so if my logic doesn't make any sense
> please be patient with me :)

We all are newbies somewhere, and even the splice code itself is rather new :)

-- 
	Evgeniy Polyakov

^ permalink raw reply [flat|nested] 29+ messages in thread
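The (seq-1)/(offset+1) arithmetic explained above can be checked in isolation. A minimal sketch, assuming a single segment with a known starting sequence number (`seg_fully_consumed` is an invented name for illustration, not kernel code):

```c
/* If 'next_seq' is the next byte we expect to read, then next_seq-1 is
 * the last byte already processed.  Looking that byte up in its segment
 * yields an offset pointing at the last consumed byte, so the segment
 * is fully eaten exactly when offset + 1 == seg_len. */
static int seg_fully_consumed(unsigned int seg_seq, unsigned int seg_len,
			      unsigned int next_seq)
{
	unsigned int off = (next_seq - 1) - seg_seq; /* offset of last byte read */

	return off + 1 == seg_len;
}
```

For a segment covering sequence numbers [1000, 1100), reading up to next_seq == 1100 gives off == 99 and off+1 == 100 == seg_len, so the segment may be freed; stopping at next_seq == 1050 gives off+1 == 50 and the segment must stay queued.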
* Re: race in skb_splice_bits?
  2008-05-28 14:11 ` Evgeniy Polyakov
@ 2008-05-28 15:20 ` Octavian Purdila
  2008-05-28 15:42 ` Evgeniy Polyakov
  2008-05-28 17:08 ` Octavian Purdila
  0 siblings, 2 replies; 29+ messages in thread
From: Octavian Purdila @ 2008-05-28 15:20 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: Ben Hutchings, netdev, davem

On Wednesday 28 May 2008, Evgeniy Polyakov wrote:
> > > I preserved old semantic, when we free skb
> > > only if we read it whole or in case of fin. With your changes we can
> > > also free skb, if it was partially consumed
> >
> > If the skb was partially consumed then tcp_recv_skb(seq-1) will return
> > the same skb and offset+1 != skb->len, thus we will not free it.
>
> I understand now, please correct me if I got your idea wrong.
> We only need to search for the skb again in case we processed it and
> it was possible that socket lock was dropped. So, the only needed place
> to put tcp_recv_skb() is where you pointed. Next, to find current skb we
<snip>

Yes, that is correct.

> So yes, your patch is simpler and faster than mine so you should push it
> upstream. Fortunately David Miller is in copy and will (David, will
> you?) pick it up and push, likely it is also needed for stable?

OK, will clean it up and post a proper patch soon. Which tree should I base it on?

Thanks,
tavi

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: race in skb_splice_bits?
  2008-05-28 15:20 ` Octavian Purdila
@ 2008-05-28 15:42 ` Evgeniy Polyakov
  0 siblings, 0 replies; 29+ messages in thread
From: Evgeniy Polyakov @ 2008-05-28 15:42 UTC (permalink / raw)
To: Octavian Purdila; +Cc: Ben Hutchings, netdev, davem

On Wed, May 28, 2008 at 06:20:02PM +0300, Octavian Purdila (opurdila@ixiacom.com) wrote:
> OK, will clean it up and post a proper patch soon. Which tree should I base
> it on?

The vanilla tree, I think; it should apply without problems to all trees, since the patch is rather small.

-- 
	Evgeniy Polyakov

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: race in skb_splice_bits?
  2008-05-28 15:20 ` Octavian Purdila
  2008-05-28 15:42 ` Evgeniy Polyakov
@ 2008-05-28 17:08 ` Octavian Purdila
  2008-05-28 17:51 ` Evgeniy Polyakov
  1 sibling, 1 reply; 29+ messages in thread
From: Octavian Purdila @ 2008-05-28 17:08 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: Ben Hutchings, netdev, davem

On Wednesday 28 May 2008, Octavian Purdila wrote:
> > So yes, your patch is simpler and faster than mine so you should push it
> > upstream. Fortunately David Miller is in copy and will (David, will
> > you?) pick it up and push, likely it is also needed for stable?

Hmm, I might have found a problem with this approach: say we drop the lock, queue the data into the pipe, and at the same time tcp_collapse() is called, which frees some of the skbs whose data we just queued in the pipe. Later, when we read from the pipe, we will get random data instead of socket data...

Thanks,
tavi

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: race in skb_splice_bits?
  2008-05-28 17:08 ` Octavian Purdila
@ 2008-05-28 17:51 ` Evgeniy Polyakov
  2008-05-28 18:02 ` Octavian Purdila
  0 siblings, 1 reply; 29+ messages in thread
From: Evgeniy Polyakov @ 2008-05-28 17:51 UTC (permalink / raw)
To: Octavian Purdila; +Cc: Ben Hutchings, netdev, davem

On Wed, May 28, 2008 at 08:08:01PM +0300, Octavian Purdila (opurdila@ixiacom.com) wrote:
> Hmm, I might have found a problem with this approach: say we drop the lock,
> queue the data into the pipe, and at the same time tcp_collapse() is called,
> which frees some of the skbs whose data we just queued in the pipe. Later,
> when we read from the pipe, we will get random data instead of socket
> data...

We queue the data under the lock and clone the appropriate skb (and then grab references to it multiple times), so even if it is dropped, its data will not be freed, and thus we will be able to read it. Or are you talking about different skbs?

-- 
	Evgeniy Polyakov

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: race in skb_splice_bits?
  2008-05-28 17:51 ` Evgeniy Polyakov
@ 2008-05-28 18:02 ` Octavian Purdila
  2008-05-28 20:01 ` Jarek Poplawski
  0 siblings, 1 reply; 29+ messages in thread
From: Octavian Purdila @ 2008-05-28 18:02 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: Ben Hutchings, netdev, davem

On Wednesday 28 May 2008, Evgeniy Polyakov wrote:
> We queue data under the lock and clone appropriate skb (and then grab it
> multiple times), so even it will be dropped, its data will not freed, and
> thus we will be able to read it. Or you are talking about different
> skbs?

You are right, I forgot about the clone.

Thanks,
tavi

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: race in skb_splice_bits?
  2008-05-28 18:02 ` Octavian Purdila
@ 2008-05-28 20:01 ` Jarek Poplawski
  2008-05-28 20:09 ` Octavian Purdila
  0 siblings, 1 reply; 29+ messages in thread
From: Jarek Poplawski @ 2008-05-28 20:01 UTC (permalink / raw)
To: Octavian Purdila; +Cc: Evgeniy Polyakov, Ben Hutchings, netdev, davem

Octavian Purdila wrote, On 05/28/2008 08:02 PM:
> On Wednesday 28 May 2008, Evgeniy Polyakov wrote:
>
>> We queue data under the lock and clone appropriate skb (and then grab it
>> multiple times), so even it will be dropped, its data will not freed, and
>> thus we will be able to read it. Or you are talking about different
>> skbs?
>
> You are right, I forgot about the clone.
>

Probably I miss something, but how does it help when tcp_collapse() uses __kfree_skb()?

Regards,
Jarek P.

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: race in skb_splice_bits?
  2008-05-28 20:01 ` Jarek Poplawski
@ 2008-05-28 20:09 ` Octavian Purdila
  2008-05-28 20:16 ` Jarek Poplawski
  0 siblings, 1 reply; 29+ messages in thread
From: Octavian Purdila @ 2008-05-28 20:09 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: Evgeniy Polyakov, Ben Hutchings, netdev, davem

On Wednesday 28 May 2008, Jarek Poplawski wrote:
> >> We queue data under the lock and clone appropriate skb (and then grab it
> >> multiple times), so even it will be dropped, its data will not freed,
> >> and thus we will be able to read it. Or you are talking about different
> >> skbs?
> >
> > You are right, I forgot about the clone.
>
> Probably I miss something, but how does it help when tcp_collapse()
> uses __kfree_skb()?
>

__kfree_skb() -> skb_release_all() -> skb_release_data():

static void skb_release_data(struct sk_buff *skb)
{
	if (!skb->cloned ||
<snip>
		kfree(skb->head);

Since we clone the skb in skb_splice_bits(), the skb's data will only be freed when the last clone is deleted.

tavi

^ permalink raw reply [flat|nested] 29+ messages in thread
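The clone behaviour being relied on here can be modelled in userspace. This is only a sketch with invented names (`mini_skb`, `mini_clone`, `mini_free`); it mirrors the spirit of skb_clone()/skb_release_data(), where clones share one reference-counted data buffer that is freed only when its last user goes away:

```c
#include <stdlib.h>

/* Shared data buffer with a reference count, standing in for the skb's
 * shared data area and its dataref counter. */
struct buf {
	int dataref;
	char *head;
};

struct mini_skb {
	struct buf *shared;
};

static struct mini_skb *mini_alloc(size_t len)
{
	struct mini_skb *skb = malloc(sizeof(*skb));

	skb->shared = malloc(sizeof(*skb->shared));
	skb->shared->dataref = 1;
	skb->shared->head = malloc(len);
	return skb;
}

/* A clone gets its own header but shares the data buffer and bumps
 * its reference count. */
static struct mini_skb *mini_clone(struct mini_skb *skb)
{
	struct mini_skb *n = malloc(sizeof(*n));

	n->shared = skb->shared;
	n->shared->dataref++;
	return n;
}

/* Freeing an skb releases the header immediately, but the data buffer
 * only when the last reference is dropped.
 * Returns 1 if the data buffer itself was freed, 0 otherwise. */
static int mini_free(struct mini_skb *skb)
{
	int freed_data = 0;

	if (--skb->shared->dataref == 0) {
		free(skb->shared->head);
		free(skb->shared);
		freed_data = 1;
	}
	free(skb);
	return freed_data;
}
```

So when tcp_collapse() drops its reference via __kfree_skb(), the pages already handed to the pipe stay valid because the clone made in skb_splice_bits() still holds a data reference, which is the point of the exchange above.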
* Re: race in skb_splice_bits?
  2008-05-28 20:09 ` Octavian Purdila
@ 2008-05-28 20:16 ` Jarek Poplawski
  0 siblings, 0 replies; 29+ messages in thread
From: Jarek Poplawski @ 2008-05-28 20:16 UTC (permalink / raw)
To: Octavian Purdila; +Cc: Evgeniy Polyakov, Ben Hutchings, netdev, davem

On Wed, May 28, 2008 at 11:09:13PM +0300, Octavian Purdila wrote:
> On Wednesday 28 May 2008, Jarek Poplawski wrote:
>
> > Probably I miss something, but how does it help when tcp_collapse()
> > uses __kfree_skb()?
>
> __kfree_skb() -> skb_release_all() -> skb_release_data():
>
> static void skb_release_data(struct sk_buff *skb)
> {
> 	if (!skb->cloned ||
> <snip>
> 		kfree(skb->head);
>
> Since we clone the skb in skb_splice_bits() the skb's data will only be freed
> when the last clone is deleted.

Right! Thanks for explanation,
Jarek P.

^ permalink raw reply [flat|nested] 29+ messages in thread
end of thread, other threads:[~2008-05-28 20:17 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2008-05-27  0:25 race in skb_splice_bits? Octavian Purdila
2008-05-27  2:08 ` Ben Hutchings
2008-05-27 10:41 ` Octavian Purdila
2008-05-27 11:01 ` Evgeniy Polyakov
2008-05-27 11:08 ` Ben Hutchings
2008-05-27 11:52 ` Evgeniy Polyakov
2008-05-27 11:56 ` Evgeniy Polyakov
2008-05-27 12:53 ` Octavian Purdila
2008-05-27 13:21 ` Evgeniy Polyakov
2008-05-27 14:03 ` Evgeniy Polyakov
2008-05-27 14:39 ` Octavian Purdila
2008-05-27 15:09 ` Evgeniy Polyakov
2008-05-27 15:12 ` Evgeniy Polyakov
2008-05-27 15:22 ` Evgeniy Polyakov
2008-05-27 15:33 ` Octavian Purdila
2008-05-27 15:47 ` Evgeniy Polyakov
2008-05-27 17:28 ` Evgeniy Polyakov
2008-05-27 23:59 ` Octavian Purdila
2008-05-28  8:52 ` Evgeniy Polyakov
2008-05-28 13:20 ` Octavian Purdila
2008-05-28 14:11 ` Evgeniy Polyakov
2008-05-28 15:20 ` Octavian Purdila
2008-05-28 15:42 ` Evgeniy Polyakov
2008-05-28 17:08 ` Octavian Purdila
2008-05-28 17:51 ` Evgeniy Polyakov
2008-05-28 18:02 ` Octavian Purdila
2008-05-28 20:01 ` Jarek Poplawski
2008-05-28 20:09 ` Octavian Purdila
2008-05-28 20:16 ` Jarek Poplawski