From: "Bruce \"Brutus\" Curtis"
Subject: [RFC][PATCH v2] net-tcp: TCP/IP stack bypass for loopback connections
Date: Mon, 6 Aug 2012 17:13:49 -0700
Message-ID: <1344298429-19765-1-git-send-email-brutus@google.com>
To: "David S. Miller"
Cc: Eric Dumazet, netdev@vger.kernel.org, "Bruce \"Brutus\" Curtis"

From: "Bruce \"Brutus\" Curtis"

TCP/IP loopback socket pair stack bypass, based on an idea by, and a
rough upstream patch from, David Miller called "friends". The data
structure modifications and connection scheme are reused, with
extensive data-path changes.

A new sysctl, net.ipv4.tcp_friends, is added:

  0: disable friends and use the stock data path.
  1: enable friends and bypass the stack data path, the default.

Note: when friends is enabled, any loopback interposer (e.g. tcpdump)
will only see the TCP/IP packets during connection establishment and
finish; all data bypasses the stack and is delivered directly to the
destination socket.

This is the 2nd version; the 1st was implemented as an interpose with
new, orthogonal friends functions, while this version is coded in-line
with the existing TCP code.

Testing was done on a 4-socket, 2.2GHz "Quad-Core AMD Opteron(tm)
Processor 8354 CPU" based system. netperf results for a single
connection show increased TCP_STREAM throughput, and increased TCP_RR
and TCP_CRR transaction rates for most message sizes, vs. baseline and
comparable to AF_UNIX. A significant increase (up to 5x) in aggregate
throughput for multiple netperf runs (STREAM 32KB I/O x N) is seen.

Some results:

 Default netperf:

	netperf
	netperf -t STREAM_STREAM
	netperf -t STREAM_STREAM -- -s 51882 -m 16384 -M 87380
	netperf

	Baseline   AF_UNIX       AF_UNIX             Friends
	Mbits/S    Mbits/S       Mbits/S             Mbits/S

	6860       714   8%      9444  138% 1323%    10576  154% 1481% 112%

 Note: for the AF_UNIX (STREAM_STREAM) test two results are listed, the
 1st with no options; but as the defaults for AF_UNIX sockets are much
 lower performing, a 2nd set of runs was done with a socket buffer size
 and send/recv buffer sizes equivalent to AF_INET (TCP_STREAM).

 Note: all subsequent AF_UNIX (STREAM_STREAM, STREAM_RR) tests are done
 with "-s 51882" such that the same total effective socket buffering is
 used as for the AF_INET runs' defaults (16384+NNNNN/2).
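 For reference, a minimal sketch of how the two data paths can be
 compared on a patched kernel, using the sysctl added by this patch
 (the Baseline column may equally be taken from an unpatched kernel;
 exact netperf invocations are listed with each table below):

	# baseline: stock TCP/IP loopback data path
	sysctl -w net.ipv4.tcp_friends=0
	netperf -l 100 -t TCP_STREAM -- -m 32K -M 32K

	# friends: loopback stack bypass (the default)
	sysctl -w net.ipv4.tcp_friends=1
	netperf -l 100 -t TCP_STREAM -- -m 32K -M 32K

 Since friendship is negotiated at connection establishment
 (SYN/SYN-ACK), changing the sysctl only affects connections set up
 after the change.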
 STREAM 32KB I/O x N:

	netperf -l 100 -t TCP_STREAM -- -m 32K -M 32K
	netperf -l 100 -t STREAM_STREAM -- -s 51882 -m 32K -M 32K
	netperf -l 100 -t TCP_STREAM -- -m 32K -M 32K

		Baseline   AF_UNIX          Friends
  N	COC	Mbits/S    Mbits/S          Mbits/S

  1	-	8616       9416    109%     11116   129% 118%
  2	-	15419      17076   111%     20267   131% 119%
  16	2	59497      303029  509%     347349  584% 115%
  32	4	54223      273637  505%     272891  503% 100%
  256	32	58244      85476   147%     273696  470% 320%
  512	64	58745      87402   149%     260837  444% 298%
  1600	200	83161      158915  191%     383947  462% 242%

 COC = Cpu Over Commit ratio (16 core platform)

 STREAM:

	netperf -l 100 -t TCP_STREAM
	netperf -l 100 -t STREAM_STREAM -- -s 51882
	netperf -l 100 -t TCP_STREAM

  netperf	Baseline   AF_UNIX          Friends
  -m/-M N	Mbits/S    Mbits/S          Mbits/S

  64		1020       445     44%      515     50%  116%
  1K		4881       4340    89%      5070    104% 117%
  8K		5933       8387    141%     9770    165% 116%
  32K		8168       9538    117%     11067   135% 116%
  64K		9116       9774    107%     11515   126% 118%
  128K		9053       10044   111%     13082   145% 130%
  256K		9642       10351   107%     13470   140% 130%
  512K		10050      10142   101%     13327   133% 131%
  1M		8640       9843    114%     12201   141% 124%
  16M		7179       9619    134%     11316   158% 118%

 RR:

	netperf -l 100 -t TCP_RR
	netperf -l 100 -t STREAM_RR -- -s 51882 -m 16384 -M 87380
	netperf -l 100 -t TCP_RR

  netperf	Baseline   AF_UNIX          Friends
  -r N,N	Trans./S   Trans./S         Trans./S

  64		47913      99681   208%     98225   205% 99%
  1K		44045      92327   210%     91608   208% 99%
  8K		26732      33201   124%     33025   124% 99%
  32K		10903      11972   110%     13574   124% 113%
  64K		7113       6718    94%      7176    101% 107%
  128K		4191       3431    82%      3695    88%  108%
  256K		2324       1937    83%      2147    92%  111%
  512K		958        1056    110%     1202    125% 114%
  1M		404        508     126%     497     123% 98%
  16M		26.1       34.1    131%     32.9    126% 96%

 CRR:

	netperf -l 100 -t TCP_CRR
	netperf -l 100 -t TCP_CRR

  netperf	Baseline   AF_UNIX   Friends
  -r N		Trans./S   Trans./S  Trans./S

  64		14690      -         18191   124% -
  1K		14258      -         17492   123% -
  8K		11535      -         14012   121% -
  32K		7035       -         8974    128% -
  64K		4312       -         5654    131% -
  128K		2252       -         3179    141% -
  256K		1237       -         2008    162% -
  512K		17.5*      -         1079    ?    -
  1M		4.93*      -         458     ?    -
  16M		8.29*      -         32.5    ?    -

 Note, "-" denotes test not supported for transport.
       "*" denotes test results reported without statistical confidence.
       "?" denotes results not comparable.

 SPLICE 32KB I/O:

  Source/Sink	Baseline   Friends
  FSFS		Mbits/S    Mbits/S

  ----		8042       10810   134%
  Z---		7071       9773    138%
  --N-		8039       10820   135%
  Z-N-		7902       9796    124%
  -S--		17326      37496   216%
  ZS--		9008       9573    106%
  -SN-		16154      36269   225%
  ZSN-		9531       9640    101%
  ---S		8466       9933    117%
  Z--S		8000       9453    118%
  --NS		12783      11379   89%
  Z-NS		11055      9489    86%
  -S-S		12741      24380   191%
  ZS-S		8097       10242   126%
  -SNS		16657      30954   186%
  ZSNS		12108      12763   105%

 Note, "Z" source File /dev/zero,  "-" source user memory
       "N" sink File /dev/null,    "-" sink user memory
       "S" Splice on,              "-" Splice off

Signed-off-by: Bruce "Brutus" Curtis
---
 include/linux/skbuff.h          |    2 +
 include/net/request_sock.h      |    1 +
 include/net/sock.h              |   32 +++-
 include/net/tcp.h               |    3 +-
 net/core/skbuff.c               |    1 +
 net/core/stream.c               |   36 +++
 net/ipv4/inet_connection_sock.c |   21 ++
 net/ipv4/sysctl_net_ipv4.c      |    7 +
 net/ipv4/tcp.c                  |  500 ++++++++++++++++++++++++++++++++++-----
 net/ipv4/tcp_input.c            |   24 ++-
 net/ipv4/tcp_ipv4.c             |    2 +
 net/ipv4/tcp_minisocks.c        |    5 +
 net/ipv4/tcp_output.c           |   18 ++-
 net/ipv6/tcp_ipv6.c             |    1 +
 14 files changed, 578 insertions(+), 75 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 642cb73..2fbca93 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -332,6 +332,7 @@ typedef unsigned char *sk_buff_data_t;
  *	@cb: Control buffer. Free for use by every layer. Put private vars here
  *	@_skb_refdst: destination entry (with norefcount bit)
  *	@sp: the security path, used for xfrm
+ *	@friend: loopback friend socket
  *	@len: Length of actual data
  *	@data_len: Data length
  *	@mac_len: Length of link layer header
@@ -407,6 +408,7 @@ struct sk_buff {
 #ifdef CONFIG_XFRM
 	struct	sec_path	*sp;
 #endif
+	struct sock		*friend;
 	unsigned int		len,
 				data_len;
 	__u16			mac_len,
diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index 4c0766e..2c74420 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -63,6 +63,7 @@ struct request_sock {
 	unsigned long			expires;
 	const struct request_sock_ops	*rsk_ops;
 	struct sock			*sk;
+	struct sock			*friend;
 	u32				secid;
 	u32				peer_secid;
 };
diff --git a/include/net/sock.h b/include/net/sock.h
index dcb54a0..3b371f5 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -197,6 +197,7 @@ struct cg_proto;
  *	@sk_userlocks: %SO_SNDBUF and %SO_RCVBUF settings
  *	@sk_lock:	synchronizer
  *	@sk_rcvbuf: size of receive buffer in bytes
+ *	@sk_friend: loopback friend socket
  *	@sk_wq: sock wait queue and async head
  *	@sk_rx_dst: receive input route used by early tcp demux
  *	@sk_dst_cache: destination cache
@@ -286,6 +287,14 @@ struct sock {
 	socket_lock_t		sk_lock;
 	struct sk_buff_head	sk_receive_queue;
 	/*
+	 * If socket has a friend (sk_friend != NULL) then a send skb is
+	 * enqueued directly to the friend's sk_receive_queue such that:
+	 *
+	 *	sk_sndbuf -> sk_sndbuf + sk_friend->sk_rcvbuf
+	 *	sk_wmem_queued -> sk_friend->sk_rmem_alloc
+	 */
+	struct sock		*sk_friend;
+	/*
 	 * The backlog queue is special, it is always used with
 	 * the per-socket spinlock held and requires low latency
 	 * access. Therefore we special case it's implementation.
@@ -673,24 +682,40 @@ static inline bool sk_acceptq_is_full(const struct sock *sk)
 	return sk->sk_ack_backlog > sk->sk_max_ack_backlog;
 }
 
+static inline int sk_wmem_queued_get(const struct sock *sk)
+{
+	if (sk->sk_friend)
+		return atomic_read(&sk->sk_friend->sk_rmem_alloc);
+	else
+		return sk->sk_wmem_queued;
+}
+
+static inline int sk_sndbuf_get(const struct sock *sk)
+{
+	if (sk->sk_friend)
+		return sk->sk_sndbuf + sk->sk_friend->sk_rcvbuf;
+	else
+		return sk->sk_sndbuf;
+}
+
 /*
  * Compute minimal free write space needed to queue new packets.
 */
 static inline int sk_stream_min_wspace(const struct sock *sk)
 {
-	return sk->sk_wmem_queued >> 1;
+	return sk_wmem_queued_get(sk) >> 1;
 }
 
 static inline int sk_stream_wspace(const struct sock *sk)
 {
-	return sk->sk_sndbuf - sk->sk_wmem_queued;
+	return sk_sndbuf_get(sk) - sk_wmem_queued_get(sk);
 }
 
 extern void sk_stream_write_space(struct sock *sk);
 
 static inline bool sk_stream_memory_free(const struct sock *sk)
 {
-	return sk->sk_wmem_queued < sk->sk_sndbuf;
+	return sk_wmem_queued_get(sk) < sk_sndbuf_get(sk);
 }
 
 /* OOB backlog add */
@@ -794,6 +819,7 @@ static inline void sock_rps_reset_rxhash(struct sock *sk)
 })
 
 extern int sk_stream_wait_connect(struct sock *sk, long *timeo_p);
+extern int sk_stream_wait_friend(struct sock *sk, long *timeo_p);
 extern int sk_stream_wait_memory(struct sock *sk, long *timeo_p);
 extern void sk_stream_wait_close(struct sock *sk, long timeo_p);
 extern int sk_stream_error(struct sock *sk, int flags, int err);
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 53fb7d8..011ba42 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -253,6 +253,7 @@ extern int sysctl_tcp_cookie_size;
 extern int sysctl_tcp_thin_linear_timeouts;
 extern int sysctl_tcp_thin_dupack;
 extern int sysctl_tcp_early_retrans;
+extern int sysctl_tcp_friends;
 
 extern atomic_long_t tcp_memory_allocated;
 extern struct percpu_counter tcp_sockets_allocated;
@@ -978,7 +979,7 @@ static inline bool tcp_prequeue(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	if (sysctl_tcp_low_latency || !tp->ucopy.task)
+	if (sysctl_tcp_low_latency || !tp->ucopy.task || sk->sk_friend)
 		return false;
 
 	__skb_queue_tail(&tp->ucopy.prequeue, skb);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5a789a8..5702145 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -634,6 +634,7 @@ static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 #ifdef CONFIG_XFRM
 	new->sp = secpath_get(old->sp);
 #endif
+	new->friend = old->friend;
 	memcpy(new->cb, old->cb, sizeof(old->cb));
 	new->csum = old->csum;
 	new->local_df = old->local_df;
diff --git a/net/core/stream.c b/net/core/stream.c
index f5df85d..85e5b03 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -83,6 +83,42 @@ int sk_stream_wait_connect(struct sock *sk, long *timeo_p)
 EXPORT_SYMBOL(sk_stream_wait_connect);
 
 /**
+ * sk_stream_wait_friend - Wait for a socket to make friends
+ * @sk: sock to wait on
+ * @timeo_p: for how long to wait
+ *
+ * Must be called with the socket locked.
+ */
+int sk_stream_wait_friend(struct sock *sk, long *timeo_p)
+{
+	struct task_struct *tsk = current;
+	DEFINE_WAIT(wait);
+	int done;
+
+	do {
+		int err = sock_error(sk);
+		if (err)
+			return err;
+		if (!sk->sk_friend)
+			return -EBADFD;
+		if (!*timeo_p)
+			return -EAGAIN;
+		if (signal_pending(tsk))
+			return sock_intr_errno(*timeo_p);
+
+		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
+		sk->sk_write_pending++;
+		done = sk_wait_event(sk, timeo_p,
+				     !sk->sk_err &&
+				     sk->sk_friend->sk_friend);
+		finish_wait(sk_sleep(sk), &wait);
+		sk->sk_write_pending--;
+	} while (!done);
+	return 0;
+}
+EXPORT_SYMBOL(sk_stream_wait_friend);
+
+/**
  * sk_stream_closing - Return 1 if we still have things to send in our buffers.
  * @sk: socket to verify
  */
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 034ddbe..d5589e3 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -626,6 +626,27 @@ struct sock *inet_csk_clone_lock(const struct sock *sk,
 	if (newsk != NULL) {
 		struct inet_connection_sock *newicsk = inet_csk(newsk);
 
+		if (req->friend) {
+			/*
+			 * Make friends with the requestor but the ACK of
+			 * the request is already in-flight so the race is
+			 * on to make friends before the ACK is processed.
+			 * If the requestor's sk_friend value is != NULL
+			 * then the requestor has already processed the
+			 * ACK so indicate state change to wake'm up.
+			 */
+			u64 was;
+
+			sock_hold(req->friend);
+			newsk->sk_friend = req->friend;
+			sock_hold(newsk);
+			was = atomic_long_xchg(&req->friend->sk_friend,
+					       (u64)newsk);
+			/* If requester already connect()ed, maybe sleeping */
+			if (was && !sock_flag(req->friend, SOCK_DEAD))
+				sk->sk_state_change(req->friend);
+		}
+
 		newsk->sk_state = TCP_SYN_RECV;
 		newicsk->icsk_bind_hash = NULL;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 12aa0c5..3fc3d50 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -716,6 +716,13 @@ static struct ctl_table ipv4_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &zero
 	},
+	{
+		.procname	= "tcp_friends",
+		.data		= &sysctl_tcp_friends,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
 	{ }
 };
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 3ba605f..a87f0f5 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -309,6 +309,38 @@ struct tcp_splice_state {
 };
 
 /*
+ * Friends? If not a friend return 0, else if friend is also a friend
+ * return 1, else wait for friend to be ready and return 1 if friends
+ * else -errno. In all cases if *friendp != NULL return friend pointer
+ * else NULL.
+ */
+static inline int tcp_friends(struct sock *sk, struct sock **friendp,
+			      long *timeo)
+{
+	struct sock *friend = sk->sk_friend;
+	int ret = 0;
+
+	if (!friend)
+		goto out;
+	if (unlikely(!friend->sk_friend)) {
+		/* Friendship not complete, wait? */
+		if (!timeo) {
+			ret = -EAGAIN;
+			goto out;
+		}
+		ret = sk_stream_wait_friend(sk, timeo);
+		if (ret != 0)
+			goto out;
+		friend = sk->sk_friend;
+	}
+	ret = 1;
+out:
+	if (friendp)
+		*friendp = friend;
+	return ret;
+}
+
+/*
  * Pressure flag: try to collapse.
  * Technical note: it is used by multiple contexts non atomically.
  * All the __sk_mem_schedule() is of this nature: accounting
@@ -587,6 +619,73 @@ int tcp_ioctl(struct sock *sk, int cmd, unsigned long arg)
 }
 EXPORT_SYMBOL(tcp_ioctl);
 
+static inline struct sk_buff *tcp_friend_tail(struct sock *sk, int *copy)
+{
+	struct sock	*friend = sk->sk_friend;
+	struct sk_buff	*skb = NULL;
+	int		sz = 0;
+
+	if (skb_peek_tail(&friend->sk_receive_queue)) {
+		spin_lock_bh(&friend->sk_lock.slock);
+		skb = skb_peek_tail(&friend->sk_receive_queue);
+		if (skb && skb->friend) {
+			if (!*copy)
+				sz = skb_tailroom(skb);
+			else
+				sz = *copy - skb->len;
+		}
+		if (!skb || sz <= 0)
+			spin_unlock_bh(&friend->sk_lock.slock);
+	}
+
+	*copy = sz;
+	return skb;
+}
+
+static inline void tcp_friend_seq(struct sock *sk, int copy, int charge)
+{
+	struct sock	*friend = sk->sk_friend;
+	struct tcp_sock *tp = tcp_sk(friend);
+
+	if (charge) {
+		sk_mem_charge(friend, charge);
+		atomic_add(charge, &friend->sk_rmem_alloc);
+	}
+	tp->rcv_nxt += copy;
+	tp->rcv_wup += copy;
+	spin_unlock_bh(&friend->sk_lock.slock);
+
+	friend->sk_data_ready(friend, copy);
+
+	tp = tcp_sk(sk);
+	tp->snd_nxt += copy;
+	tp->pushed_seq += copy;
+	tp->snd_una += copy;
+	tp->snd_up += copy;
+}
+
+static inline int tcp_friend_push(struct sock *sk, struct sk_buff *skb)
+{
+	struct sock	*friend = sk->sk_friend;
+	int		ret = 0;
+
+	if (friend->sk_shutdown & RCV_SHUTDOWN) {
+		__kfree_skb(skb);
+		return -ECONNRESET;
+	}
+
+	spin_lock_bh(&friend->sk_lock.slock);
+	skb->friend = sk;
+	skb_set_owner_r(skb, friend);
+	__skb_queue_tail(&friend->sk_receive_queue, skb);
+	if (!sk_rmem_schedule(friend, skb->truesize))
+		ret = 1;
+
+	tcp_friend_seq(sk, skb->len, 0);
+
+	return ret;
+}
+
 static inline void tcp_mark_push(struct tcp_sock *tp, struct sk_buff *skb)
 {
 	TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_PSH;
@@ -603,8 +702,12 @@ static inline void skb_entail(struct sock *sk, struct sk_buff *skb)
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
 
-	skb->csum = 0;
 	tcb->seq = tcb->end_seq = tp->write_seq;
+	if (sk->sk_friend) {
+		skb->friend = sk->sk_friend;
+		return;
+	}
+	skb->csum = 0;
 	tcb->tcp_flags = TCPHDR_ACK;
 	tcb->sacked = 0;
 	skb_header_release(skb);
@@ -756,6 +859,21 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
 }
 EXPORT_SYMBOL(tcp_splice_read);
 
+static inline struct sk_buff *tcp_friend_alloc_skb(struct sock *sk, int size)
+{
+	struct sk_buff *skb;
+
+	skb = alloc_skb(size, sk->sk_allocation);
+	if (skb)
+		skb->avail_size = skb_tailroom(skb);
+	else {
+		sk->sk_prot->enter_memory_pressure(sk);
+		sk_stream_moderate_sndbuf(sk);
+	}
+
+	return skb;
+}
+
 struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp)
 {
 	struct sk_buff *skb;
@@ -813,13 +931,47 @@ static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
 	return max(xmit_size_goal, mss_now);
 }
 
+static unsigned int tcp_friend_xmit_size_goal(struct sock *sk, int size_goal)
+{
+	u32 tmp = SKB_TRUESIZE(size_goal);
+
+	/*
+	 * If goal is zero (for non linear) or truesize of goal >= largest
+	 * skb return largest, else for tail fill find smallest order that
+	 * fits 8 or more truesized, else use requested truesize.
+	 */
+	if (size_goal == 0 || tmp >= SKB_MAX_ORDER(0, 3))
+		tmp = SKB_MAX_ORDER(0, 3);
+	else if (tmp <= (SKB_MAX_ORDER(0, 0) >> 3))
+		tmp = SKB_MAX_ORDER(0, 0);
+	else if (tmp <= (SKB_MAX_ORDER(0, 1) >> 3))
+		tmp = SKB_MAX_ORDER(0, 1);
+	else if (tmp <= (SKB_MAX_ORDER(0, 2) >> 3))
+		tmp = SKB_MAX_ORDER(0, 2);
+	else if (tmp <= (SKB_MAX_ORDER(0, 3) >> 3))
+		tmp = SKB_MAX_ORDER(0, 3);
+
+	/* At least 2 truesized in sk_buf */
+	if (tmp > (sk_sndbuf_get(sk) >> 1))
+		tmp = (sk_sndbuf_get(sk) >> 1) - SKB_TRUESIZE(0);
+
+	return tmp;
+}
+
 static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
 {
 	int mss_now;
+	int tmp;
 
-	mss_now = tcp_current_mss(sk);
-	*size_goal = tcp_xmit_size_goal(sk, mss_now, !(flags & MSG_OOB));
+	if (sk->sk_friend) {
+		mss_now = tcp_friend_xmit_size_goal(sk, *size_goal);
+		tmp = mss_now;
+	} else {
+		mss_now = tcp_current_mss(sk);
+		tmp = tcp_xmit_size_goal(sk, mss_now, !(flags & MSG_OOB));
+	}
 
+	*size_goal = tmp;
 	return mss_now;
 }
@@ -830,6 +982,8 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffse
 	int mss_now, size_goal;
 	int err;
 	ssize_t copied;
+	struct sock *friend;
+	bool friend_tail = false;
 	long timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
 
 	/* Wait for a connection to finish. */
@@ -837,6 +991,10 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffse
 	if ((err = sk_stream_wait_connect(sk, &timeo)) != 0)
 		goto out_err;
 
+	err = tcp_friends(sk, &friend, &timeo);
+	if (err < 0)
+		goto out_err;
+
 	clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);
 
 	mss_now = tcp_send_mss(sk, &size_goal, flags);
@@ -847,19 +1005,40 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffse
 		goto out_err;
 
 	while (psize > 0) {
-		struct sk_buff *skb = tcp_write_queue_tail(sk);
+		struct sk_buff *skb;
 		struct page *page = pages[poffset / PAGE_SIZE];
 		int copy, i;
 		int offset = poffset % PAGE_SIZE;
 		int size = min_t(size_t, psize, PAGE_SIZE - offset);
 		bool can_coalesce;
 
-		if (!tcp_send_head(sk) || (copy = size_goal - skb->len) <= 0) {
+		if (sk->sk_friend) {
+			if (sk->sk_friend->sk_shutdown & RCV_SHUTDOWN) {
+				sk->sk_err = ECONNRESET;
+				err = -EPIPE;
+				goto out_err;
+			}
+			copy = size_goal;
+			skb = tcp_friend_tail(sk, &copy);
+			if (copy > 0)
+				friend_tail = true;
+		} else if (!tcp_send_head(sk)) {
+			copy = 0;
+		} else {
+			skb = tcp_write_queue_tail(sk);
+			copy = size_goal - skb->len;
+		}
+
+		if (copy <= 0) {
new_segment:
 			if (!sk_stream_memory_free(sk))
 				goto wait_for_sndbuf;
 
-			skb = sk_stream_alloc_skb(sk, 0, sk->sk_allocation);
+			if (sk->sk_friend)
+				skb = tcp_friend_alloc_skb(sk, 0);
+			else
+				skb = sk_stream_alloc_skb(sk, 0,
+							  sk->sk_allocation);
 			if (!skb)
 				goto wait_for_memory;
 
@@ -873,10 +1052,16 @@ new_segment:
 		i = skb_shinfo(skb)->nr_frags;
 		can_coalesce = skb_can_coalesce(skb, i, page, offset);
 		if (!can_coalesce && i >= MAX_SKB_FRAGS) {
-			tcp_mark_push(tp, skb);
+			if (friend) {
+				if (friend_tail) {
+					tcp_friend_seq(sk, 0, 0);
+					friend_tail = false;
+				}
+			} else
+				tcp_mark_push(tp, skb);
 			goto new_segment;
 		}
-		if (!sk_wmem_schedule(sk, copy))
+		if (!friend && !sk_wmem_schedule(sk, copy))
 			goto wait_for_memory;
 
 		if (can_coalesce) {
@@ -889,19 +1074,40 @@ new_segment:
 		skb->len += copy;
 		skb->data_len += copy;
 		skb->truesize += copy;
-		sk->sk_wmem_queued += copy;
-		sk_mem_charge(sk, copy);
-		skb->ip_summed = CHECKSUM_PARTIAL;
 		tp->write_seq += copy;
 		TCP_SKB_CB(skb)->end_seq += copy;
 		skb_shinfo(skb)->gso_segs = 0;
 
-		if (!copied)
-			TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_PSH;
-
 		copied += copy;
 		poffset += copy;
-		if (!(psize -= copy))
+		psize -= copy;
+
+		if (friend) {
+			if (friend_tail) {
+				tcp_friend_seq(sk, copy, copy);
+				friend_tail = false;
+			} else {
+				err = tcp_friend_push(sk, skb);
+				if (err < 0) {
+					sk->sk_err = -err;
+					goto out_err;
+				}
+				if (err > 0)
+					goto wait_for_sndbuf;
+			}
+			if (!psize)
+				goto out;
+			continue;
+		}
+
+		sk->sk_wmem_queued += copy;
+		sk_mem_charge(sk, copy);
+		skb->ip_summed = CHECKSUM_PARTIAL;
+
+		if (copied == copy)
+			TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_PSH;
+
+		if (!psize)
 			goto out;
 
 		if (skb->len < size_goal || (flags & MSG_OOB))
@@ -922,6 +1128,7 @@ wait_for_memory:
 		if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
 			goto do_error;
 
+		size_goal = -mss_now;
 		mss_now = tcp_send_mss(sk, &size_goal, flags);
 	}
 
@@ -984,8 +1191,9 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *skb;
 	int iovlen, flags, err, copied;
-	int mss_now = 0, size_goal;
-	bool sg;
+	int mss_now = 0, size_goal = size;
+	struct sock *friend;
+	bool sg, friend_tail = false;
 	long timeo;
 
 	lock_sock(sk);
@@ -998,6 +1206,10 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	if ((err = sk_stream_wait_connect(sk, &timeo)) != 0)
 		goto out_err;
 
+	err = tcp_friends(sk, &friend, &timeo);
+	if (err < 0)
+		goto out_err;
+
 	if (unlikely(tp->repair)) {
 		if (tp->repair_queue == TCP_RECV_QUEUE) {
 			copied = tcp_send_rcvq(sk, msg, size);
@@ -1037,24 +1249,40 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		int copy = 0;
 		int max = size_goal;
 
-		skb = tcp_write_queue_tail(sk);
-		if (tcp_send_head(sk)) {
-			if (skb->ip_summed == CHECKSUM_NONE)
-				max = mss_now;
-			copy = max - skb->len;
+		if (friend) {
+			if (friend->sk_shutdown & RCV_SHUTDOWN) {
+				sk->sk_err = ECONNRESET;
+				err = -EPIPE;
+				goto out_err;
+			}
+			skb = tcp_friend_tail(sk, &copy);
+			if (copy)
+				friend_tail = true;
+		} else {
+			skb = tcp_write_queue_tail(sk);
+			if (tcp_send_head(sk)) {
+				if (skb->ip_summed == CHECKSUM_NONE)
+					max = mss_now;
+				copy = max - skb->len;
+			}
 		}
 
 		if (copy <= 0) {
new_segment:
-			/* Allocate new segment. If the interface is SG,
-			 * allocate skb fitting to single page.
-			 */
 			if (!sk_stream_memory_free(sk))
 				goto wait_for_sndbuf;
 
-			skb = sk_stream_alloc_skb(sk,
-						  select_size(sk, sg),
-						  sk->sk_allocation);
+			if (friend)
+				skb = tcp_friend_alloc_skb(sk, max);
+			else {
+				/* Allocate new segment. If the
+				 * interface is SG, allocate skb
+				 * fitting to single page.
+				 */
+				skb = sk_stream_alloc_skb(sk,
+							  select_size(sk, sg),
+							  sk->sk_allocation);
+			}
 			if (!skb)
 				goto wait_for_memory;
 
@@ -1086,6 +1314,8 @@ new_segment:
 			struct page *page = sk->sk_sndmsg_page;
 			int off;
 
+			BUG_ON(friend);
+
 			if (page && page_count(page) == 1)
 				sk->sk_sndmsg_off = 0;
 
@@ -1155,16 +1385,34 @@ new_segment:
 				sk->sk_sndmsg_off = off + copy;
 			}
 
-		if (!copied)
-			TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_PSH;
-
 		tp->write_seq += copy;
 		TCP_SKB_CB(skb)->end_seq += copy;
 		skb_shinfo(skb)->gso_segs = 0;
 
 		from += copy;
 		copied += copy;
-		if ((seglen -= copy) == 0 && iovlen == 0)
+		seglen -= copy;
+
+		if (friend) {
+			if (friend_tail) {
+				tcp_friend_seq(sk, copy, 0);
+				friend_tail = false;
+			} else {
+				err = tcp_friend_push(sk, skb);
+				if (err < 0) {
+					sk->sk_err = -err;
+					goto out_err;
+				}
+				if (err > 0)
+					goto wait_for_sndbuf;
+			}
+			continue;
+		}
+
+		if (copied == copy)
+			TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_PSH;
+
+		if (seglen == 0 && iovlen == 0)
 			goto out;
 
 		if (skb->len < max || (flags & MSG_OOB) || unlikely(tp->repair))
@@ -1186,6 +1434,7 @@ wait_for_memory:
 		if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
 			goto do_error;
 
+		size_goal = -mss_now;
 		mss_now = tcp_send_mss(sk, &size_goal, flags);
 	}
 }
@@ -1197,13 +1446,19 @@ out:
 	return copied;
 
 do_fault:
-	if (!skb->len) {
-		tcp_unlink_write_queue(skb, sk);
-		/* It is the one place in all of TCP, except connection
-		 * reset, where we can be unlinking the send_head.
-		 */
-		tcp_check_send_head(sk, skb);
-		sk_wmem_free_skb(sk, skb);
+	if (friend_tail)
+		spin_unlock_bh(&friend->sk_lock.slock);
+	else if (!skb->len) {
+		if (friend) {
+			__kfree_skb(skb);
+		} else {
+			tcp_unlink_write_queue(skb, sk);
+			/* It is the one place in all of TCP, except connection
+			 * reset, where we can be unlinking the send_head.
+			 */
+			tcp_check_send_head(sk, skb);
+			sk_wmem_free_skb(sk, skb);
+		}
 	}
 
 do_error:
@@ -1216,6 +1471,13 @@ out_err:
 }
 EXPORT_SYMBOL(tcp_sendmsg);
 
+static inline void tcp_friend_write_space(struct sock *sk)
+{
+	/* Queued data below 1/4th of sndbuf? */
+	if ((sk_sndbuf_get(sk) >> 2) > sk_wmem_queued_get(sk))
+		sk->sk_friend->sk_write_space(sk->sk_friend);
+}
+
 /*
  * Handle reading urgent data. BSD has very simple semantics for
  * this, no blocking and very strange errors 8)
@@ -1294,7 +1556,12 @@ void tcp_cleanup_rbuf(struct sock *sk, int copied)
 	struct tcp_sock *tp = tcp_sk(sk);
 	bool time_to_ack = false;
 
-	struct sk_buff *skb = skb_peek(&sk->sk_receive_queue);
+	struct sk_buff *skb;
+
+	if (sk->sk_friend)
+		return;
+
+	skb = skb_peek(&sk->sk_receive_queue);
 
 	WARN(skb && !before(tp->copied_seq, TCP_SKB_CB(skb)->end_seq),
 	     "cleanup rbuf bug: copied %X seq %X rcvnxt %X\n",
@@ -1405,9 +1672,9 @@ static inline struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off)
 
 	skb_queue_walk(&sk->sk_receive_queue, skb) {
 		offset = seq - TCP_SKB_CB(skb)->seq;
-		if (tcp_hdr(skb)->syn)
+		if (!skb->friend && tcp_hdr(skb)->syn)
 			offset--;
-		if (offset < skb->len || tcp_hdr(skb)->fin) {
+		if (offset < skb->len || (!skb->friend && tcp_hdr(skb)->fin)) {
 			*off = offset;
 			return skb;
 		}
@@ -1434,14 +1701,27 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 	u32 seq = tp->copied_seq;
 	u32 offset;
 	int copied = 0;
+	struct sock *friend = sk->sk_friend;
 
 	if (sk->sk_state == TCP_LISTEN)
 		return -ENOTCONN;
+
+	if (friend) {
+		int err;
+		long timeo = sock_rcvtimeo(sk, false);
+
+		err = tcp_friends(sk, &friend, &timeo);
+		if (err < 0)
+			return err;
+		spin_lock_bh(&sk->sk_lock.slock);
+	}
+
 	while ((skb = tcp_recv_skb(sk, seq, &offset)) != NULL) {
 		if (offset < skb->len) {
 			int used;
 			size_t len;
 
+again:
 			len = skb->len - offset;
 			/* Stop reading if we hit a patch of urgent data */
 			if (tp->urg_data) {
@@ -1451,7 +1731,13 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 				if (!len)
 					break;
 			}
+			if (sk->sk_friend)
+				spin_unlock_bh(&sk->sk_lock.slock);
+
 			used = recv_actor(desc, skb, offset, len);
+
+			if (sk->sk_friend)
+				spin_lock_bh(&sk->sk_lock.slock);
 			if (used < 0) {
 				if (!copied)
 					copied = used;
@@ -1461,17 +1747,31 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 				copied += used;
 				offset += used;
 			}
-			/*
-			 * If recv_actor drops the lock (e.g. TCP splice
-			 * receive) the skb pointer might be invalid when
-			 * getting here: tcp_collapse might have deleted it
-			 * while aggregating skbs from the socket queue.
-			 */
-			skb = tcp_recv_skb(sk, seq-1, &offset);
-			if (!skb || (offset+1 != skb->len))
-				break;
+			if (skb->friend) {
+				if (offset < skb->len) {
+					/*
+					 * Friend did an skb_put() while we
+					 * were away so process the same skb.
+					 */
+					tp->copied_seq = seq;
+					if (!desc->count)
+						break;
+					goto again;
+				}
+			} else {
+				/*
+				 * If recv_actor drops the lock (e.g. TCP
+				 * splice receive) the skb pointer might be
+				 * invalid when getting here: tcp_collapse
+				 * might have deleted it while aggregating
+				 * skbs from the socket queue.
+				 */
+				skb = tcp_recv_skb(sk, seq-1, &offset);
+				if (!skb || (offset+1 != skb->len))
+					break;
+			}
 		}
-		if (tcp_hdr(skb)->fin) {
+		if (!skb->friend && tcp_hdr(skb)->fin) {
 			sk_eat_skb(sk, skb, false);
 			++seq;
 			break;
@@ -1483,11 +1783,16 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 	}
 	tp->copied_seq = seq;
 
-	tcp_rcv_space_adjust(sk);
+	if (sk->sk_friend) {
+		spin_unlock_bh(&sk->sk_lock.slock);
+		tcp_friend_write_space(sk);
+	} else {
+		tcp_rcv_space_adjust(sk);
 
-	/* Clean up data we have read: This will do ACK frames. */
-	if (copied > 0)
-		tcp_cleanup_rbuf(sk, copied);
+		/* Clean up data we have read: This will do ACK frames. */
+		if (copied > 0)
+			tcp_cleanup_rbuf(sk, copied);
+	}
 	return copied;
 }
 EXPORT_SYMBOL(tcp_read_sock);
@@ -1515,6 +1820,9 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	bool copied_early = false;
 	struct sk_buff *skb;
 	u32 urg_hole = 0;
+	int skb_len;
+	struct sock *friend;
+	bool locked = false;
 
 	lock_sock(sk);
 
@@ -1524,6 +1832,10 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 
 	timeo = sock_rcvtimeo(sk, nonblock);
 
+	err = tcp_friends(sk, &friend, &timeo);
+	if (err < 0)
+		goto out;
+
 	/* Urgent data needs to be handled specially. */
 	if (flags & MSG_OOB)
 		goto recv_urg;
@@ -1562,7 +1874,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		available = TCP_SKB_CB(skb)->seq + skb->len - (*seq);
 		if ((available < target) &&
 		    (len > sysctl_tcp_dma_copybreak) && !(flags & MSG_PEEK) &&
-		    !sysctl_tcp_low_latency &&
+		    !sysctl_tcp_low_latency && !friend &&
 		    net_dma_find_channel()) {
 			preempt_enable_no_resched();
 			tp->ucopy.pinned_list =
@@ -1586,9 +1898,30 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 			}
 		}
 
-		/* Next get a buffer. */
+		/*
+		 * Next get a buffer. Note, for socket friends a sk_friend
+		 * sendmsg() can either skb_queue_tail() a new skb directly
+		 * or skb_put() to the tail skb while holding sk_lock.slock.
+		 */
+		if (friend && !locked) {
+			spin_lock_bh(&sk->sk_lock.slock);
+			locked = true;
+		}
 
 		skb_queue_walk(&sk->sk_receive_queue, skb) {
+			offset = *seq - TCP_SKB_CB(skb)->seq;
+			skb_len = skb->len;
+			if (friend) {
+				spin_unlock_bh(&sk->sk_lock.slock);
+				locked = false;
+				if (skb->friend) {
+					if (offset < skb_len)
+						goto found_ok_skb;
+					BUG_ON(!(flags & MSG_PEEK));
+					break;
+				}
+			}
+
 			/* Now that we have two receive queues this
 			 * shouldn't happen.
 			 */
@@ -1598,10 +1931,9 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 				 flags))
 				break;
 
-			offset = *seq - TCP_SKB_CB(skb)->seq;
 			if (tcp_hdr(skb)->syn)
 				offset--;
-			if (offset < skb->len)
+			if (offset < skb_len)
 				goto found_ok_skb;
 			if (tcp_hdr(skb)->fin)
 				goto found_fin_ok;
 
 		/* Well, if we have backlog, try to process it now yet. */
 
+		if (friend && locked) {
+			spin_unlock_bh(&sk->sk_lock.slock);
+			locked = false;
+		}
+
 		if (copied >= target && !sk->sk_backlog.tail)
 			break;
 
@@ -1658,7 +1995,8 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 
 		tcp_cleanup_rbuf(sk, copied);
 
-		if (!sysctl_tcp_low_latency && tp->ucopy.task == user_recv) {
+		if (!sysctl_tcp_low_latency && !friend &&
+		    tp->ucopy.task == user_recv) {
 			/* Install new reader */
 			if (!user_recv && !(flags & (MSG_TRUNC | MSG_PEEK))) {
 				user_recv = current;
@@ -1753,7 +2091,7 @@ do_prequeue:
 
 	found_ok_skb:
 		/* Ok so how much can we use? */
-		used = skb->len - offset;
+		used = skb_len - offset;
 		if (len < used)
 			used = len;
 
@@ -1799,7 +2137,7 @@ do_prequeue:
 
 				dma_async_memcpy_issue_pending(tp->ucopy.dma_chan);
 
-				if ((offset + used) == skb->len)
+				if ((offset + used) == skb_len)
 					copied_early = true;
 
 			} else
@@ -1819,6 +2157,7 @@ do_prequeue:
 
 		*seq += used;
 		copied += used;
 		len -= used;
+		offset += used;
 
 		tcp_rcv_space_adjust(sk);
 
@@ -1827,11 +2166,36 @@ skip_copy:
 			tp->urg_data = 0;
 			tcp_fast_path_check(sk);
 		}
-		if (used + offset < skb->len)
+
+		if (friend) {
+			spin_lock_bh(&sk->sk_lock.slock);
+			locked = true;
+			skb_len = skb->len;
+			if (offset < skb_len) {
+				if (skb->friend && len > 0) {
+					/*
+					 * Friend did an skb_put() while we
+					 * were away so process the same skb.
+					 */
+					spin_unlock_bh(&sk->sk_lock.slock);
+					locked = false;
+					goto found_ok_skb;
+				}
+				continue;
+			}
+			if (!(flags & MSG_PEEK)) {
+				__skb_unlink(skb, &sk->sk_receive_queue);
+				__kfree_skb(skb);
+				tcp_friend_write_space(sk);
+			}
 			continue;
+		}
 
-		if (tcp_hdr(skb)->fin)
+		if (offset < skb_len)
+			continue;
+		else if (tcp_hdr(skb)->fin)
 			goto found_fin_ok;
+
 		if (!(flags & MSG_PEEK)) {
 			sk_eat_skb(sk, skb, copied_early);
 			copied_early = false;
@@ -1848,6 +2212,9 @@ skip_copy:
 		break;
 	} while (len > 0);
 
+	if (friend && locked)
+		spin_unlock_bh(&sk->sk_lock.slock);
+
 	if (user_recv) {
 		if (!skb_queue_empty(&tp->ucopy.prequeue)) {
 			int chunk;
@@ -2026,6 +2393,9 @@ void tcp_close(struct sock *sk, long timeout)
 		goto adjudge_to_death;
 	}
 
+	if (sk->sk_friend)
+		sock_put(sk->sk_friend);
+
 	/* We need to flush the recv. buffs.  We do this only on the
 	 * descriptor close, not protocol-sourced closes, because the
 	 * reader process may not have drained the data yet!
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ca0d0e7..52ac297 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -524,6 +524,9 @@ void tcp_rcv_space_adjust(struct sock *sk)
 	int time;
 	int space;
 
+	if (sk->sk_friend)
+		return;
+
 	if (tp->rcvq_space.time == 0)
 		goto new_measure;
 
@@ -4516,8 +4519,9 @@ static int tcp_prune_queue(struct sock *sk);
 
 static int tcp_try_rmem_schedule(struct sock *sk, unsigned int size)
 {
-	if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
-	    !sk_rmem_schedule(sk, size)) {
+	if (!sk->sk_friend &&
+	    (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
+	     !sk_rmem_schedule(sk, size))) {
 
 		if (tcp_prune_queue(sk) < 0)
 			return -1;
@@ -5839,6 +5843,18 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 	 *    state to ESTABLISHED..."
 	 */
 
+	if (skb->friend) {
+		/*
+		 * If friends haven't been made yet, our sk_friend
+		 * still == NULL, then update with the ACK's friend
+		 * value (the listen()er's sock addr) which is used
+		 * as a place holder.
+		 */
+		atomic_long_cmpxchg(&sk->sk_friend, 0,
+				    (u64)skb->friend);
+	} else
+		sk->sk_friend = NULL;
+
 	TCP_ECN_rcv_synack(tp, th);
 
 	tp->snd_wl1 = TCP_SKB_CB(skb)->seq;
@@ -5911,9 +5927,9 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 
 		tcp_finish_connect(sk, skb);
 
-		if (sk->sk_write_pending ||
+		if (!skb->friend && (sk->sk_write_pending ||
 		    icsk->icsk_accept_queue.rskq_defer_accept ||
-		    icsk->icsk_ack.pingpong) {
+		    icsk->icsk_ack.pingpong)) {
 			/* Save one ACK. Data will be ready after
 			 * several ticks, if write_pending is set.
 			 *
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 64568fa..45ccafd 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1314,6 +1314,8 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 	tcp_rsk(req)->af_specific = &tcp_request_sock_ipv4_ops;
 #endif
 
+	req->friend = skb->friend;
+
 	tcp_clear_options(&tmp_opt);
 	tmp_opt.mss_clamp = TCP_MSS_DEFAULT;
 	tmp_opt.user_mss  = tp->rx_opt.user_mss;
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 72b7c63..4ff285b 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -315,6 +315,11 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
 	bool recycle_ok = false;
 	bool recycle_on = false;
 
+	if (sk->sk_friend) {
+		tcp_done(sk);
+		return;
+	}
+
 	if (tcp_death_row.sysctl_tw_recycle && tp->rx_opt.ts_recent_stamp) {
 		recycle_ok = tcp_remember_stamp(sk);
 		recycle_on = true;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index c465d3e..542b34a 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -65,6 +65,9 @@ int sysctl_tcp_slow_start_after_idle __read_mostly = 1;
 int sysctl_tcp_cookie_size __read_mostly = 0; /* TCP_COOKIE_MAX */
 EXPORT_SYMBOL_GPL(sysctl_tcp_cookie_size);
 
+/* TCP loopback bypass */
+int sysctl_tcp_friends __read_mostly = 1;
+
 /* Account for new data that has been sent to the network. */
 static void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
@@ -829,9 +832,14 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 	tcb = TCP_SKB_CB(skb);
 	memset(&opts, 0, sizeof(opts));
 
-	if (unlikely(tcb->tcp_flags & TCPHDR_SYN))
+	if (unlikely(tcb->tcp_flags & TCPHDR_SYN)) {
+		if (sysctl_tcp_friends) {
+			/* Only try to make friends if enabled */
+			skb->friend = sk;
+		}
+
 		tcp_options_size = tcp_syn_options(sk, skb, &opts, &md5);
-	else
+	} else
 		tcp_options_size = tcp_established_options(sk, skb, &opts,
 							   &md5);
 	tcp_header_size = tcp_options_size + sizeof(struct tcphdr);
@@ -2506,6 +2514,12 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
 	}
 
 	memset(&opts, 0, sizeof(opts));
+
+	if (sysctl_tcp_friends) {
+		/* Only try to make friends if enabled */
+		skb->friend = sk;
+	}
+
 #ifdef CONFIG_SYN_COOKIES
 	if (unlikely(req->cookie_ts))
 		TCP_SKB_CB(skb)->when = cookie_init_timestamp(req);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 6cc67ed..33f9d47 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1066,6 +1066,7 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 	tcp_rsk(req)->af_specific = &tcp_request_sock_ipv6_ops;
 #endif
 
+	req->friend = skb->friend;
 	tcp_clear_options(&tmp_opt);
 	tmp_opt.mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) - sizeof(struct ipv6hdr);
 	tmp_opt.user_mss = tp->rx_opt.user_mss;
-- 
1.7.7.3