* [PATCH v2 net-next 1/8] tcp: tsq: add tsq_flags / tsq_enum
2016-12-03 19:14 [PATCH v2 net-next 0/8] tcp: tsq: performance series Eric Dumazet
@ 2016-12-03 19:14 ` Eric Dumazet
2016-12-03 19:14 ` [PATCH v2 net-next 2/8] tcp: tsq: remove one locked operation in tcp_wfree() Eric Dumazet
` (7 subsequent siblings)
8 siblings, 0 replies; 16+ messages in thread
From: Eric Dumazet @ 2016-12-03 19:14 UTC (permalink / raw)
To: David S . Miller; +Cc: netdev, Eric Dumazet, Yuchung Cheng, Eric Dumazet
This is a cleanup to ease code review of the following patches.
The old 'enum tsq_flags' is renamed to 'tsq_enum', and a new 'enum tsq_flags'
is added with the bit masks used in cmpxchg() operations, as opposed to the
bit numbers used in single-bit operations.
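For context, a condensed sketch of the two enumerations and why both forms
exist (illustrative only; the real definitions are in the hunk below):

	enum tsq_enum  { TSQ_THROTTLED, TSQ_QUEUED, TCP_TSQ_DEFERRED }; /* bit numbers */
	enum tsq_flags {                                                /* bit masks   */
		TSQF_THROTTLED    = (1UL << TSQ_THROTTLED),
		TSQF_QUEUED       = (1UL << TSQ_QUEUED),
		TCPF_TSQ_DEFERRED = (1UL << TCP_TSQ_DEFERRED),
	};

	/* Bit numbers feed the single-bit helpers (set_bit()/test_bit()),
	 * while whole-word tests and cmpxchg() work on masks; the second
	 * enum pre-computes those masks so call sites stay readable:
	 *
	 *	if (flags & TCPF_TSQ_DEFERRED)		instead of
	 *	if (flags & (1UL << TCP_TSQ_DEFERRED))
	 */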
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/linux/tcp.h | 11 ++++++++++-
net/ipv4/tcp_output.c | 16 ++++++++--------
2 files changed, 18 insertions(+), 9 deletions(-)
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 734bab4c3bef..d8be083ab0b0 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -364,7 +364,7 @@ struct tcp_sock {
u32 *saved_syn;
};
-enum tsq_flags {
+enum tsq_enum {
TSQ_THROTTLED,
TSQ_QUEUED,
TCP_TSQ_DEFERRED, /* tcp_tasklet_func() found socket was owned */
@@ -375,6 +375,15 @@ enum tsq_flags {
*/
};
+enum tsq_flags {
+ TSQF_THROTTLED = (1UL << TSQ_THROTTLED),
+ TSQF_QUEUED = (1UL << TSQ_QUEUED),
+ TCPF_TSQ_DEFERRED = (1UL << TCP_TSQ_DEFERRED),
+ TCPF_WRITE_TIMER_DEFERRED = (1UL << TCP_WRITE_TIMER_DEFERRED),
+ TCPF_DELACK_TIMER_DEFERRED = (1UL << TCP_DELACK_TIMER_DEFERRED),
+ TCPF_MTU_REDUCED_DEFERRED = (1UL << TCP_MTU_REDUCED_DEFERRED),
+};
+
static inline struct tcp_sock *tcp_sk(const struct sock *sk)
{
return (struct tcp_sock *)sk;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index c7adcb57654e..8f0289b0fb24 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -784,10 +784,10 @@ static void tcp_tasklet_func(unsigned long data)
}
}
-#define TCP_DEFERRED_ALL ((1UL << TCP_TSQ_DEFERRED) | \
- (1UL << TCP_WRITE_TIMER_DEFERRED) | \
- (1UL << TCP_DELACK_TIMER_DEFERRED) | \
- (1UL << TCP_MTU_REDUCED_DEFERRED))
+#define TCP_DEFERRED_ALL (TCPF_TSQ_DEFERRED | \
+ TCPF_WRITE_TIMER_DEFERRED | \
+ TCPF_DELACK_TIMER_DEFERRED | \
+ TCPF_MTU_REDUCED_DEFERRED)
/**
* tcp_release_cb - tcp release_sock() callback
* @sk: socket
@@ -808,7 +808,7 @@ void tcp_release_cb(struct sock *sk)
nflags = flags & ~TCP_DEFERRED_ALL;
} while (cmpxchg(&tp->tsq_flags, flags, nflags) != flags);
- if (flags & (1UL << TCP_TSQ_DEFERRED))
+ if (flags & TCPF_TSQ_DEFERRED)
tcp_tsq_handler(sk);
/* Here begins the tricky part :
@@ -822,15 +822,15 @@ void tcp_release_cb(struct sock *sk)
*/
sock_release_ownership(sk);
- if (flags & (1UL << TCP_WRITE_TIMER_DEFERRED)) {
+ if (flags & TCPF_WRITE_TIMER_DEFERRED) {
tcp_write_timer_handler(sk);
__sock_put(sk);
}
- if (flags & (1UL << TCP_DELACK_TIMER_DEFERRED)) {
+ if (flags & TCPF_DELACK_TIMER_DEFERRED) {
tcp_delack_timer_handler(sk);
__sock_put(sk);
}
- if (flags & (1UL << TCP_MTU_REDUCED_DEFERRED)) {
+ if (flags & TCPF_MTU_REDUCED_DEFERRED) {
inet_csk(sk)->icsk_af_ops->mtu_reduced(sk);
__sock_put(sk);
}
--
2.8.0.rc3.226.g39d4020
* [PATCH v2 net-next 2/8] tcp: tsq: remove one locked operation in tcp_wfree()
2016-12-03 19:14 [PATCH v2 net-next 0/8] tcp: tsq: performance series Eric Dumazet
2016-12-03 19:14 ` [PATCH v2 net-next 1/8] tcp: tsq: add tsq_flags / tsq_enum Eric Dumazet
@ 2016-12-03 19:14 ` Eric Dumazet
2016-12-03 19:14 ` [PATCH v2 net-next 3/8] tcp: tsq: add shortcut in tcp_tasklet_func() Eric Dumazet
` (6 subsequent siblings)
8 siblings, 0 replies; 16+ messages in thread
From: Eric Dumazet @ 2016-12-03 19:14 UTC (permalink / raw)
To: David S . Miller; +Cc: netdev, Eric Dumazet, Yuchung Cheng, Eric Dumazet
Instead of atomically clearing the TSQ_THROTTLED bit and then atomically
setting the TSQ_QUEUED bit, use one cmpxchg() to perform a single locked
operation.
Since the following patch will also set TCP_TSQ_DEFERRED here, this
cmpxchg() makes that addition free.
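As a standalone illustration of the pattern (not the kernel code: it uses
GCC's __sync builtin in place of cmpxchg(), and plain reads in place of
READ_ONCE()):

	#define TSQF_THROTTLED	(1UL << 0)
	#define TSQF_QUEUED	(1UL << 1)

	/* Transition THROTTLED -> QUEUED with one locked operation instead
	 * of a test_and_clear_bit() followed by a test_and_set_bit().
	 * Returns 1 if this caller won the transition, 0 otherwise.
	 */
	static int throttled_to_queued(unsigned long *flags)
	{
		unsigned long oval, nval;

		for (oval = *flags; ; oval = nval) {
			if (!(oval & TSQF_THROTTLED) || (oval & TSQF_QUEUED))
				return 0;

			nval = (oval & ~TSQF_THROTTLED) | TSQF_QUEUED;
			/* Returns the value seen in *flags before the CAS. */
			nval = __sync_val_compare_and_swap(flags, oval, nval);
			if (nval == oval)
				return 1;	/* swap succeeded */
			/* Another cpu changed the flags; retry with new value. */
		}
	}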
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
net/ipv4/tcp_output.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 8f0289b0fb24..4adaf8e1bb63 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -860,6 +860,7 @@ void tcp_wfree(struct sk_buff *skb)
{
struct sock *sk = skb->sk;
struct tcp_sock *tp = tcp_sk(sk);
+ unsigned long flags, nval, oval;
int wmem;
/* Keep one reference on sk_wmem_alloc.
@@ -877,11 +878,17 @@ void tcp_wfree(struct sk_buff *skb)
if (wmem >= SKB_TRUESIZE(1) && this_cpu_ksoftirqd() == current)
goto out;
- if (test_and_clear_bit(TSQ_THROTTLED, &tp->tsq_flags) &&
- !test_and_set_bit(TSQ_QUEUED, &tp->tsq_flags)) {
- unsigned long flags;
+ for (oval = READ_ONCE(tp->tsq_flags);; oval = nval) {
struct tsq_tasklet *tsq;
+ if (!(oval & TSQF_THROTTLED) || (oval & TSQF_QUEUED))
+ goto out;
+
+ nval = (oval & ~TSQF_THROTTLED) | TSQF_QUEUED;
+ nval = cmpxchg(&tp->tsq_flags, oval, nval);
+ if (nval != oval)
+ continue;
+
/* queue this socket to tasklet queue */
local_irq_save(flags);
tsq = this_cpu_ptr(&tsq_tasklet);
--
2.8.0.rc3.226.g39d4020
* [PATCH v2 net-next 3/8] tcp: tsq: add shortcut in tcp_tasklet_func()
2016-12-03 19:14 [PATCH v2 net-next 0/8] tcp: tsq: performance series Eric Dumazet
2016-12-03 19:14 ` [PATCH v2 net-next 1/8] tcp: tsq: add tsq_flags / tsq_enum Eric Dumazet
2016-12-03 19:14 ` [PATCH v2 net-next 2/8] tcp: tsq: remove one locked operation in tcp_wfree() Eric Dumazet
@ 2016-12-03 19:14 ` Eric Dumazet
2016-12-03 19:14 ` [PATCH v2 net-next 4/8] tcp: tsq: avoid one atomic in tcp_wfree() Eric Dumazet
` (5 subsequent siblings)
8 siblings, 0 replies; 16+ messages in thread
From: Eric Dumazet @ 2016-12-03 19:14 UTC (permalink / raw)
To: David S . Miller; +Cc: netdev, Eric Dumazet, Yuchung Cheng, Eric Dumazet
Under high stress, I have seen tcp_tasklet_func() consume ~700 usec
while handling ~150 tcp sockets.
By setting TCP_TSQ_DEFERRED in tcp_wfree(), we give other cpus/threads
entering tcp_write_xmit() a chance to grab it, allowing tcp_tasklet_func()
to skip sockets that already did an xmit cycle.
In the future, we might give ACK processing an increased budget to further
reduce the amount of work done in tcp_tasklet_func().
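A rough standalone sketch of the idea (illustrative only; the real code
also checks socket ownership and uses the kernel's bitops, not __sync
builtins; xmit_path()/tasklet_path() are made-up names):

	#define TSQF_QUEUED		(1UL << 1)
	#define TCPF_TSQ_DEFERRED	(1UL << 2)

	/* Fast path (tcp_write_xmit()-like), possibly on another cpu:
	 * consume the deferred mark while doing its own xmit cycle.
	 */
	static void xmit_path(unsigned long *flags)
	{
		if (*flags & TCPF_TSQ_DEFERRED)
			__sync_fetch_and_and(flags, ~TCPF_TSQ_DEFERRED);
		/* ... push packets to the NIC ... */
	}

	/* Tasklet (tcp_tasklet_func()-like): if the mark is already gone,
	 * another cpu did an xmit cycle for this socket, so the expensive
	 * lock-socket/handler path can be skipped entirely.
	 */
	static void tasklet_path(unsigned long *flags)
	{
		__sync_fetch_and_and(flags, ~TSQF_QUEUED);
		if (*flags & TCPF_TSQ_DEFERRED) {
			__sync_fetch_and_and(flags, ~TCPF_TSQ_DEFERRED);
			/* ... bh_lock_sock() and run the tsq handler ... */
		}
	}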
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
net/ipv4/tcp_output.c | 22 ++++++++++++----------
1 file changed, 12 insertions(+), 10 deletions(-)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 4adaf8e1bb63..fa23b688a6f3 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -767,19 +767,19 @@ static void tcp_tasklet_func(unsigned long data)
list_for_each_safe(q, n, &list) {
tp = list_entry(q, struct tcp_sock, tsq_node);
list_del(&tp->tsq_node);
+ clear_bit(TSQ_QUEUED, &tp->tsq_flags);
sk = (struct sock *)tp;
- bh_lock_sock(sk);
-
- if (!sock_owned_by_user(sk)) {
- tcp_tsq_handler(sk);
- } else {
- /* defer the work to tcp_release_cb() */
- set_bit(TCP_TSQ_DEFERRED, &tp->tsq_flags);
+ if (!sk->sk_lock.owned &&
+ test_bit(TCP_TSQ_DEFERRED, &tp->tsq_flags)) {
+ bh_lock_sock(sk);
+ if (!sock_owned_by_user(sk)) {
+ clear_bit(TCP_TSQ_DEFERRED, &tp->tsq_flags);
+ tcp_tsq_handler(sk);
+ }
+ bh_unlock_sock(sk);
}
- bh_unlock_sock(sk);
- clear_bit(TSQ_QUEUED, &tp->tsq_flags);
sk_free(sk);
}
}
@@ -884,7 +884,7 @@ void tcp_wfree(struct sk_buff *skb)
if (!(oval & TSQF_THROTTLED) || (oval & TSQF_QUEUED))
goto out;
- nval = (oval & ~TSQF_THROTTLED) | TSQF_QUEUED;
+ nval = (oval & ~TSQF_THROTTLED) | TSQF_QUEUED | TCPF_TSQ_DEFERRED;
nval = cmpxchg(&tp->tsq_flags, oval, nval);
if (nval != oval)
continue;
@@ -2229,6 +2229,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
break;
+ if (test_bit(TCP_TSQ_DEFERRED, &tp->tsq_flags))
+ clear_bit(TCP_TSQ_DEFERRED, &tp->tsq_flags);
if (tcp_small_queue_check(sk, skb, 0))
break;
--
2.8.0.rc3.226.g39d4020
* [PATCH v2 net-next 4/8] tcp: tsq: avoid one atomic in tcp_wfree()
2016-12-03 19:14 [PATCH v2 net-next 0/8] tcp: tsq: performance series Eric Dumazet
` (2 preceding siblings ...)
2016-12-03 19:14 ` [PATCH v2 net-next 3/8] tcp: tsq: add shortcut in tcp_tasklet_func() Eric Dumazet
@ 2016-12-03 19:14 ` Eric Dumazet
2016-12-03 19:14 ` [PATCH v2 net-next 5/8] tcp: tsq: add a shortcut in tcp_small_queue_check() Eric Dumazet
` (4 subsequent siblings)
8 siblings, 0 replies; 16+ messages in thread
From: Eric Dumazet @ 2016-12-03 19:14 UTC (permalink / raw)
To: David S . Miller; +Cc: netdev, Eric Dumazet, Yuchung Cheng, Eric Dumazet
Under high load, tcp_wfree() performs an atomic operation trying to
schedule a tasklet over and over.
We only need to schedule it if our per-cpu list was empty.
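A minimal sketch of the idea, with a toy singly-linked list standing in
for the per-cpu tsq list (the kernel uses list_add()/tasklet_schedule()
under local_irq_save(); kick_worker() is a made-up stand-in):

	struct node { struct node *next; };

	struct tsq_queue {
		struct node *head;	/* per-cpu list of sockets to service */
	};

	static void kick_worker(struct tsq_queue *q)
	{
		(void)q;	/* stand-in for tasklet_schedule() */
	}

	static void queue_socket(struct tsq_queue *q, struct node *n)
	{
		int was_empty = (q->head == NULL);

		n->next = q->head;
		q->head = n;

		/* Only the empty -> non-empty transition needs to schedule
		 * the worker; if it is already pending, it will drain this
		 * entry too.
		 */
		if (was_empty)
			kick_worker(q);
	}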
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
net/ipv4/tcp_output.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index fa23b688a6f3..0db63efe5b8b 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -880,6 +880,7 @@ void tcp_wfree(struct sk_buff *skb)
for (oval = READ_ONCE(tp->tsq_flags);; oval = nval) {
struct tsq_tasklet *tsq;
+ bool empty;
if (!(oval & TSQF_THROTTLED) || (oval & TSQF_QUEUED))
goto out;
@@ -892,8 +893,10 @@ void tcp_wfree(struct sk_buff *skb)
/* queue this socket to tasklet queue */
local_irq_save(flags);
tsq = this_cpu_ptr(&tsq_tasklet);
+ empty = list_empty(&tsq->head);
list_add(&tp->tsq_node, &tsq->head);
- tasklet_schedule(&tsq->tasklet);
+ if (empty)
+ tasklet_schedule(&tsq->tasklet);
local_irq_restore(flags);
return;
}
--
2.8.0.rc3.226.g39d4020
* [PATCH v2 net-next 5/8] tcp: tsq: add a shortcut in tcp_small_queue_check()
2016-12-03 19:14 [PATCH v2 net-next 0/8] tcp: tsq: performance series Eric Dumazet
` (3 preceding siblings ...)
2016-12-03 19:14 ` [PATCH v2 net-next 4/8] tcp: tsq: avoid one atomic in tcp_wfree() Eric Dumazet
@ 2016-12-03 19:14 ` Eric Dumazet
2016-12-03 19:14 ` [PATCH v2 net-next 6/8] tcp: tcp_mtu_probe() is likely to exit early Eric Dumazet
` (3 subsequent siblings)
8 siblings, 0 replies; 16+ messages in thread
From: Eric Dumazet @ 2016-12-03 19:14 UTC (permalink / raw)
To: David S . Miller; +Cc: netdev, Eric Dumazet, Yuchung Cheng, Eric Dumazet
Always allow the first two skbs in the write queue to be sent,
regardless of sk_wmem_alloc/sk_pacing_rate values.
This helps a lot in situations where TX completions are delayed, either
because of driver latencies or softirq latencies.
The added test is done with no cache line misses.
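For reference, sk_write_queue is a circular sk_buff list headed by a
sentinel, so sk_write_queue.next is the first packet; a simplified
standalone sketch of the test added below (made-up types):

	struct pkt { struct pkt *next, *prev; };

	struct pkt_queue {		/* circular list with a sentinel head */
		struct pkt *next;	/* first packet */
		struct pkt *prev;	/* last packet  */
	};

	/* True if p is the first or the second packet in the queue:
	 * - first:  p itself is q->next
	 * - second: p->prev is the first packet, i.e. q->next
	 */
	static int first_or_second(const struct pkt_queue *q, const struct pkt *p)
	{
		return p == q->next || p->prev == q->next;
	}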
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
net/ipv4/tcp_output.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0db63efe5b8b..d5c46749adab 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2091,6 +2091,15 @@ static bool tcp_small_queue_check(struct sock *sk, const struct sk_buff *skb,
limit <<= factor;
if (atomic_read(&sk->sk_wmem_alloc) > limit) {
+ /* Always send the 1st or 2nd skb in write queue.
+ * No need to wait for TX completion to call us back,
+ * after softirq/tasklet schedule.
+ * This helps when TX completions are delayed too much.
+ */
+ if (skb == sk->sk_write_queue.next ||
+ skb->prev == sk->sk_write_queue.next)
+ return false;
+
set_bit(TSQ_THROTTLED, &tcp_sk(sk)->tsq_flags);
/* It is possible TX completion already happened
* before we set TSQ_THROTTLED, so we must
--
2.8.0.rc3.226.g39d4020
* [PATCH v2 net-next 6/8] tcp: tcp_mtu_probe() is likely to exit early
2016-12-03 19:14 [PATCH v2 net-next 0/8] tcp: tsq: performance series Eric Dumazet
` (4 preceding siblings ...)
2016-12-03 19:14 ` [PATCH v2 net-next 5/8] tcp: tsq: add a shortcut in tcp_small_queue_check() Eric Dumazet
@ 2016-12-03 19:14 ` Eric Dumazet
2016-12-03 19:14 ` [PATCH v2 net-next 7/8] net: reorganize struct sock for better data locality Eric Dumazet
` (2 subsequent siblings)
8 siblings, 0 replies; 16+ messages in thread
From: Eric Dumazet @ 2016-12-03 19:14 UTC (permalink / raw)
To: David S . Miller; +Cc: netdev, Eric Dumazet, Yuchung Cheng, Eric Dumazet
Adding a likely() in tcp_mtu_probe() moves its code, which used to be
inlined in front of tcp_write_xmit(), out of the hot path.
We still have a cache line miss when accessing icsk->icsk_mtup.enabled;
we will probably have to reorganize fields to help data locality.
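For reference, likely() is essentially a wrapper around GCC's
__builtin_expect(); a minimal standalone sketch of the effect (not the
kernel's exact definition, which adds profiling hooks under some configs;
mtu_probe_sketch() is a made-up name):

	#define likely(x)	__builtin_expect(!!(x), 1)

	static int mtu_probe_sketch(int enabled, int probing)
	{
		/* The common case is to bail out immediately.  Telling the
		 * compiler so lets it keep this early return on the
		 * fall-through path and push the cold probing code out of
		 * the way, instead of inlining it in front of the caller's
		 * hot path.
		 */
		if (likely(!enabled || probing))
			return -1;

		/* ... rare MTU probing work would go here ... */
		return 0;
	}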
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
net/ipv4/tcp_output.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index d5c46749adab..5f04bee4c86a 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1932,26 +1932,26 @@ static inline void tcp_mtu_check_reprobe(struct sock *sk)
*/
static int tcp_mtu_probe(struct sock *sk)
{
- struct tcp_sock *tp = tcp_sk(sk);
struct inet_connection_sock *icsk = inet_csk(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
struct sk_buff *skb, *nskb, *next;
struct net *net = sock_net(sk);
- int len;
int probe_size;
int size_needed;
- int copy;
+ int copy, len;
int mss_now;
int interval;
/* Not currently probing/verifying,
* not in recovery,
* have enough cwnd, and
- * not SACKing (the variable headers throw things off) */
- if (!icsk->icsk_mtup.enabled ||
- icsk->icsk_mtup.probe_size ||
- inet_csk(sk)->icsk_ca_state != TCP_CA_Open ||
- tp->snd_cwnd < 11 ||
- tp->rx_opt.num_sacks || tp->rx_opt.dsack)
+ * not SACKing (the variable headers throw things off)
+ */
+ if (likely(!icsk->icsk_mtup.enabled ||
+ icsk->icsk_mtup.probe_size ||
+ inet_csk(sk)->icsk_ca_state != TCP_CA_Open ||
+ tp->snd_cwnd < 11 ||
+ tp->rx_opt.num_sacks || tp->rx_opt.dsack))
return -1;
/* Use binary search for probe_size between tcp_mss_base,
--
2.8.0.rc3.226.g39d4020
* [PATCH v2 net-next 7/8] net: reorganize struct sock for better data locality
2016-12-03 19:14 [PATCH v2 net-next 0/8] tcp: tsq: performance series Eric Dumazet
` (5 preceding siblings ...)
2016-12-03 19:14 ` [PATCH v2 net-next 6/8] tcp: tcp_mtu_probe() is likely to exit early Eric Dumazet
@ 2016-12-03 19:14 ` Eric Dumazet
2016-12-05 12:36 ` Paolo Abeni
2016-12-03 19:14 ` [PATCH v2 net-next 8/8] tcp: tsq: move tsq_flags close to sk_wmem_alloc Eric Dumazet
2016-12-05 19:06 ` [PATCH v2 net-next 0/8] tcp: tsq: performance series David Miller
8 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2016-12-03 19:14 UTC (permalink / raw)
To: David S . Miller; +Cc: netdev, Eric Dumazet, Yuchung Cheng, Eric Dumazet
Group fields used in the TX path, and keep some cache lines mostly read
to permit sharing among cpus.
This gains back two 4-byte holes on 64-bit arches.
Add a placeholder for tcp tsq_flags, next to sk_wmem_alloc,
to speed up tcp_wfree() in the following patch.
I have not added ____cacheline_aligned_in_smp; this might be done later.
I prefer doing this once the inet and tcp/udp sockets reorg is also done.
Tested with both TCP and UDP.
UDP receiver performance under flood increased by ~20%:
accessing sk_filter/sk_wq/sk_napi_id no longer stalls, because sk_drops
was moved away from a critical cache line, which is now mostly read and shared.
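(For reference, the layout below is pahole output; assuming a vmlinux built
with debug info, something like 'pahole -C sock vmlinux' reproduces it.)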
/* --- cacheline 4 boundary (256 bytes) --- */
unsigned int sk_napi_id; /* 0x100 0x4 */
int sk_rcvbuf; /* 0x104 0x4 */
struct sk_filter * sk_filter; /* 0x108 0x8 */
union {
struct socket_wq * sk_wq; /* 0x8 */
struct socket_wq * sk_wq_raw; /* 0x8 */
}; /* 0x110 0x8 */
struct xfrm_policy * sk_policy[2]; /* 0x118 0x10 */
struct dst_entry * sk_rx_dst; /* 0x128 0x8 */
struct dst_entry * sk_dst_cache; /* 0x130 0x8 */
atomic_t sk_omem_alloc; /* 0x138 0x4 */
int sk_sndbuf; /* 0x13c 0x4 */
/* --- cacheline 5 boundary (320 bytes) --- */
int sk_wmem_queued; /* 0x140 0x4 */
atomic_t sk_wmem_alloc; /* 0x144 0x4 */
long unsigned int sk_tsq_flags; /* 0x148 0x8 */
struct sk_buff * sk_send_head; /* 0x150 0x8 */
struct sk_buff_head sk_write_queue; /* 0x158 0x18 */
__s32 sk_peek_off; /* 0x170 0x4 */
int sk_write_pending; /* 0x174 0x4 */
long int sk_sndtimeo; /* 0x178 0x8 */
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/net/sock.h | 51 +++++++++++++++++++++++++++------------------------
1 file changed, 27 insertions(+), 24 deletions(-)
diff --git a/include/net/sock.h b/include/net/sock.h
index 69afda6bea15..6dfe3aa22b97 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -343,6 +343,9 @@ struct sock {
#define sk_rxhash __sk_common.skc_rxhash
socket_lock_t sk_lock;
+ atomic_t sk_drops;
+ int sk_rcvlowat;
+ struct sk_buff_head sk_error_queue;
struct sk_buff_head sk_receive_queue;
/*
* The backlog queue is special, it is always used with
@@ -359,14 +362,13 @@ struct sock {
struct sk_buff *tail;
} sk_backlog;
#define sk_rmem_alloc sk_backlog.rmem_alloc
- int sk_forward_alloc;
- __u32 sk_txhash;
+ int sk_forward_alloc;
#ifdef CONFIG_NET_RX_BUSY_POLL
- unsigned int sk_napi_id;
unsigned int sk_ll_usec;
+ /* ===== mostly read cache line ===== */
+ unsigned int sk_napi_id;
#endif
- atomic_t sk_drops;
int sk_rcvbuf;
struct sk_filter __rcu *sk_filter;
@@ -379,11 +381,30 @@ struct sock {
#endif
struct dst_entry *sk_rx_dst;
struct dst_entry __rcu *sk_dst_cache;
- /* Note: 32bit hole on 64bit arches */
- atomic_t sk_wmem_alloc;
atomic_t sk_omem_alloc;
int sk_sndbuf;
+
+ /* ===== cache line for TX ===== */
+ int sk_wmem_queued;
+ atomic_t sk_wmem_alloc;
+ unsigned long sk_tsq_flags;
+ struct sk_buff *sk_send_head;
struct sk_buff_head sk_write_queue;
+ __s32 sk_peek_off;
+ int sk_write_pending;
+ long sk_sndtimeo;
+ struct timer_list sk_timer;
+ __u32 sk_priority;
+ __u32 sk_mark;
+ u32 sk_pacing_rate; /* bytes per second */
+ u32 sk_max_pacing_rate;
+ struct page_frag sk_frag;
+ netdev_features_t sk_route_caps;
+ netdev_features_t sk_route_nocaps;
+ int sk_gso_type;
+ unsigned int sk_gso_max_size;
+ gfp_t sk_allocation;
+ __u32 sk_txhash;
/*
* Because of non atomicity rules, all
@@ -414,42 +435,24 @@ struct sock {
#define SK_PROTOCOL_MAX U8_MAX
kmemcheck_bitfield_end(flags);
- int sk_wmem_queued;
- gfp_t sk_allocation;
- u32 sk_pacing_rate; /* bytes per second */
- u32 sk_max_pacing_rate;
- netdev_features_t sk_route_caps;
- netdev_features_t sk_route_nocaps;
- int sk_gso_type;
- unsigned int sk_gso_max_size;
u16 sk_gso_max_segs;
- int sk_rcvlowat;
unsigned long sk_lingertime;
- struct sk_buff_head sk_error_queue;
struct proto *sk_prot_creator;
rwlock_t sk_callback_lock;
int sk_err,
sk_err_soft;
u32 sk_ack_backlog;
u32 sk_max_ack_backlog;
- __u32 sk_priority;
- __u32 sk_mark;
kuid_t sk_uid;
struct pid *sk_peer_pid;
const struct cred *sk_peer_cred;
long sk_rcvtimeo;
- long sk_sndtimeo;
- struct timer_list sk_timer;
ktime_t sk_stamp;
u16 sk_tsflags;
u8 sk_shutdown;
u32 sk_tskey;
struct socket *sk_socket;
void *sk_user_data;
- struct page_frag sk_frag;
- struct sk_buff *sk_send_head;
- __s32 sk_peek_off;
- int sk_write_pending;
#ifdef CONFIG_SECURITY
void *sk_security;
#endif
--
2.8.0.rc3.226.g39d4020
* Re: [PATCH v2 net-next 7/8] net: reorganize struct sock for better data locality
2016-12-03 19:14 ` [PATCH v2 net-next 7/8] net: reorganize struct sock for better data locality Eric Dumazet
@ 2016-12-05 12:36 ` Paolo Abeni
2016-12-05 14:30 ` Eric Dumazet
0 siblings, 1 reply; 16+ messages in thread
From: Paolo Abeni @ 2016-12-05 12:36 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David S . Miller, netdev, Yuchung Cheng, Eric Dumazet
Hi Eric,
On Sat, 2016-12-03 at 11:14 -0800, Eric Dumazet wrote:
> Group fields used in the TX path, and keep some cache lines mostly read
> to permit sharing among cpus.
>
> This gains back two 4-byte holes on 64-bit arches.
>
> Add a placeholder for tcp tsq_flags, next to sk_wmem_alloc,
> to speed up tcp_wfree() in the following patch.
>
> I have not added ____cacheline_aligned_in_smp; this might be done later.
> I prefer doing this once the inet and tcp/udp sockets reorg is also done.
>
> Tested with both TCP and UDP.
>
> UDP receiver performance under flood increased by ~20%:
> accessing sk_filter/sk_wq/sk_napi_id no longer stalls, because sk_drops
> was moved away from a critical cache line, which is now mostly read and shared.
I cherry-picked only this patch for some UDP benchmarks. Under flood with
many concurrent flows, I see the 20% improvement and a noticeable
decrease in system load.
Nice work, thanks Eric!
Tested-by: Paolo Abeni <pabeni@redhat.com>
* Re: [PATCH v2 net-next 7/8] net: reorganize struct sock for better data locality
2016-12-05 12:36 ` Paolo Abeni
@ 2016-12-05 14:30 ` Eric Dumazet
0 siblings, 0 replies; 16+ messages in thread
From: Eric Dumazet @ 2016-12-05 14:30 UTC (permalink / raw)
To: Paolo Abeni; +Cc: Eric Dumazet, David S . Miller, netdev, Yuchung Cheng
On Mon, 2016-12-05 at 13:36 +0100, Paolo Abeni wrote:
> Hi Eric,
> I cherry-picked this patch only for some UDP benchmark. Under flood with
> many concurrent flows, I see this 20% improvement and a relevant
> decrease in system load.
>
> Nice work, thanks Eric!
>
> Tested-by: Paolo Abeni <pabeni@redhat.com>
Excellent, thanks a lot for testing !
* [PATCH v2 net-next 8/8] tcp: tsq: move tsq_flags close to sk_wmem_alloc
2016-12-03 19:14 [PATCH v2 net-next 0/8] tcp: tsq: performance series Eric Dumazet
` (6 preceding siblings ...)
2016-12-03 19:14 ` [PATCH v2 net-next 7/8] net: reorganize struct sock for better data locality Eric Dumazet
@ 2016-12-03 19:14 ` Eric Dumazet
2016-12-04 0:16 ` David Miller
2016-12-05 19:06 ` [PATCH v2 net-next 0/8] tcp: tsq: performance series David Miller
8 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2016-12-03 19:14 UTC (permalink / raw)
To: David S . Miller; +Cc: netdev, Eric Dumazet, Yuchung Cheng, Eric Dumazet
Having tsq_flags in the same cache line as sk_wmem_alloc makes a lot of
sense: both fields are changed from tcp_wfree() and, more generally, by
various TSQ related functions.
The prior patch made room in struct sock and added sk_tsq_flags;
this patch deletes tsq_flags from struct tcp_sock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/linux/tcp.h | 1 -
net/ipv4/tcp.c | 4 ++--
net/ipv4/tcp_ipv4.c | 2 +-
net/ipv4/tcp_output.c | 24 +++++++++++-------------
net/ipv4/tcp_timer.c | 4 ++--
net/ipv6/tcp_ipv6.c | 2 +-
6 files changed, 17 insertions(+), 20 deletions(-)
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index d8be083ab0b0..fc5848dad7a4 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -186,7 +186,6 @@ struct tcp_sock {
u32 tsoffset; /* timestamp offset */
struct list_head tsq_node; /* anchor in tsq_tasklet.head list */
- unsigned long tsq_flags;
/* Data for direct copy to user */
struct {
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 1149b48700a1..1ef3165114ba 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -663,9 +663,9 @@ static void tcp_push(struct sock *sk, int flags, int mss_now,
if (tcp_should_autocork(sk, skb, size_goal)) {
/* avoid atomic op if TSQ_THROTTLED bit is already set */
- if (!test_bit(TSQ_THROTTLED, &tp->tsq_flags)) {
+ if (!test_bit(TSQ_THROTTLED, &sk->sk_tsq_flags)) {
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPAUTOCORKING);
- set_bit(TSQ_THROTTLED, &tp->tsq_flags);
+ set_bit(TSQ_THROTTLED, &sk->sk_tsq_flags);
}
/* It is possible TX completion already happened
* before we set TSQ_THROTTLED.
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index b50f05905ced..30d81f533ada 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -443,7 +443,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
if (!sock_owned_by_user(sk)) {
tcp_v4_mtu_reduced(sk);
} else {
- if (!test_and_set_bit(TCP_MTU_REDUCED_DEFERRED, &tp->tsq_flags))
+ if (!test_and_set_bit(TCP_MTU_REDUCED_DEFERRED, &sk->sk_tsq_flags))
sock_hold(sk);
}
goto out;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 5f04bee4c86a..b45101f3d2bd 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -767,14 +767,15 @@ static void tcp_tasklet_func(unsigned long data)
list_for_each_safe(q, n, &list) {
tp = list_entry(q, struct tcp_sock, tsq_node);
list_del(&tp->tsq_node);
- clear_bit(TSQ_QUEUED, &tp->tsq_flags);
sk = (struct sock *)tp;
+ clear_bit(TSQ_QUEUED, &sk->sk_tsq_flags);
+
if (!sk->sk_lock.owned &&
- test_bit(TCP_TSQ_DEFERRED, &tp->tsq_flags)) {
+ test_bit(TCP_TSQ_DEFERRED, &sk->sk_tsq_flags)) {
bh_lock_sock(sk);
if (!sock_owned_by_user(sk)) {
- clear_bit(TCP_TSQ_DEFERRED, &tp->tsq_flags);
+ clear_bit(TCP_TSQ_DEFERRED, &sk->sk_tsq_flags);
tcp_tsq_handler(sk);
}
bh_unlock_sock(sk);
@@ -797,16 +798,15 @@ static void tcp_tasklet_func(unsigned long data)
*/
void tcp_release_cb(struct sock *sk)
{
- struct tcp_sock *tp = tcp_sk(sk);
unsigned long flags, nflags;
/* perform an atomic operation only if at least one flag is set */
do {
- flags = tp->tsq_flags;
+ flags = sk->sk_tsq_flags;
if (!(flags & TCP_DEFERRED_ALL))
return;
nflags = flags & ~TCP_DEFERRED_ALL;
- } while (cmpxchg(&tp->tsq_flags, flags, nflags) != flags);
+ } while (cmpxchg(&sk->sk_tsq_flags, flags, nflags) != flags);
if (flags & TCPF_TSQ_DEFERRED)
tcp_tsq_handler(sk);
@@ -878,7 +878,7 @@ void tcp_wfree(struct sk_buff *skb)
if (wmem >= SKB_TRUESIZE(1) && this_cpu_ksoftirqd() == current)
goto out;
- for (oval = READ_ONCE(tp->tsq_flags);; oval = nval) {
+ for (oval = READ_ONCE(sk->sk_tsq_flags);; oval = nval) {
struct tsq_tasklet *tsq;
bool empty;
@@ -886,7 +886,7 @@ void tcp_wfree(struct sk_buff *skb)
goto out;
nval = (oval & ~TSQF_THROTTLED) | TSQF_QUEUED | TCPF_TSQ_DEFERRED;
- nval = cmpxchg(&tp->tsq_flags, oval, nval);
+ nval = cmpxchg(&sk->sk_tsq_flags, oval, nval);
if (nval != oval)
continue;
@@ -2100,7 +2100,7 @@ static bool tcp_small_queue_check(struct sock *sk, const struct sk_buff *skb,
skb->prev == sk->sk_write_queue.next)
return false;
- set_bit(TSQ_THROTTLED, &tcp_sk(sk)->tsq_flags);
+ set_bit(TSQ_THROTTLED, &sk->sk_tsq_flags);
/* It is possible TX completion already happened
* before we set TSQ_THROTTLED, so we must
* test again the condition.
@@ -2241,8 +2241,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
break;
- if (test_bit(TCP_TSQ_DEFERRED, &tp->tsq_flags))
- clear_bit(TCP_TSQ_DEFERRED, &tp->tsq_flags);
+ if (test_bit(TCP_TSQ_DEFERRED, &sk->sk_tsq_flags))
+ clear_bit(TCP_TSQ_DEFERRED, &sk->sk_tsq_flags);
if (tcp_small_queue_check(sk, skb, 0))
break;
@@ -3545,8 +3545,6 @@ void tcp_send_ack(struct sock *sk)
/* We do not want pure acks influencing TCP Small Queues or fq/pacing
* too much.
* SKB_TRUESIZE(max(1 .. 66, MAX_TCP_HEADER)) is unfortunately ~784
- * We also avoid tcp_wfree() overhead (cache line miss accessing
- * tp->tsq_flags) by using regular sock_wfree()
*/
skb_set_tcp_pure_ack(buff);
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 3ea1cf804748..3705075f42c3 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -310,7 +310,7 @@ static void tcp_delack_timer(unsigned long data)
inet_csk(sk)->icsk_ack.blocked = 1;
__NET_INC_STATS(sock_net(sk), LINUX_MIB_DELAYEDACKLOCKED);
/* deleguate our work to tcp_release_cb() */
- if (!test_and_set_bit(TCP_DELACK_TIMER_DEFERRED, &tcp_sk(sk)->tsq_flags))
+ if (!test_and_set_bit(TCP_DELACK_TIMER_DEFERRED, &sk->sk_tsq_flags))
sock_hold(sk);
}
bh_unlock_sock(sk);
@@ -592,7 +592,7 @@ static void tcp_write_timer(unsigned long data)
tcp_write_timer_handler(sk);
} else {
/* delegate our work to tcp_release_cb() */
- if (!test_and_set_bit(TCP_WRITE_TIMER_DEFERRED, &tcp_sk(sk)->tsq_flags))
+ if (!test_and_set_bit(TCP_WRITE_TIMER_DEFERRED, &sk->sk_tsq_flags))
sock_hold(sk);
}
bh_unlock_sock(sk);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index a2185a214abc..73bc8fc68acd 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -399,7 +399,7 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
if (!sock_owned_by_user(sk))
tcp_v6_mtu_reduced(sk);
else if (!test_and_set_bit(TCP_MTU_REDUCED_DEFERRED,
- &tp->tsq_flags))
+ &sk->sk_tsq_flags))
sock_hold(sk);
goto out;
}
--
2.8.0.rc3.226.g39d4020
* Re: [PATCH v2 net-next 8/8] tcp: tsq: move tsq_flags close to sk_wmem_alloc
2016-12-03 19:14 ` [PATCH v2 net-next 8/8] tcp: tsq: move tsq_flags close to sk_wmem_alloc Eric Dumazet
@ 2016-12-04 0:16 ` David Miller
2016-12-04 1:13 ` Eric Dumazet
0 siblings, 1 reply; 16+ messages in thread
From: David Miller @ 2016-12-04 0:16 UTC (permalink / raw)
To: edumazet; +Cc: netdev, ycheng, eric.dumazet
From: Eric Dumazet <edumazet@google.com>
Date: Sat, 3 Dec 2016 11:14:57 -0800
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index d8be083ab0b0..fc5848dad7a4 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -186,7 +186,6 @@ struct tcp_sock {
> u32 tsoffset; /* timestamp offset */
>
> struct list_head tsq_node; /* anchor in tsq_tasklet.head list */
> - unsigned long tsq_flags;
>
> /* Data for direct copy to user */
> struct {
Hmmm, did you forget to "git add include/net/sock.h" before making
this commit?
* Re: [PATCH v2 net-next 8/8] tcp: tsq: move tsq_flags close to sk_wmem_alloc
2016-12-04 0:16 ` David Miller
@ 2016-12-04 1:13 ` Eric Dumazet
2016-12-04 1:37 ` David Miller
0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2016-12-04 1:13 UTC (permalink / raw)
To: David Miller; +Cc: edumazet, netdev, ycheng
On Sat, 2016-12-03 at 19:16 -0500, David Miller wrote:
> From: Eric Dumazet <edumazet@google.com>
> Date: Sat, 3 Dec 2016 11:14:57 -0800
>
> > diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> > index d8be083ab0b0..fc5848dad7a4 100644
> > --- a/include/linux/tcp.h
> > +++ b/include/linux/tcp.h
> > @@ -186,7 +186,6 @@ struct tcp_sock {
> > u32 tsoffset; /* timestamp offset */
> >
> > struct list_head tsq_node; /* anchor in tsq_tasklet.head list */
> > - unsigned long tsq_flags;
> >
> > /* Data for direct copy to user */
> > struct {
>
> Hmmm, did you forget to "git add include/net/sock.h" before making
> this commit?
sk_tsq_flags was added in the prior patch in the series (7/8 net:
reorganize struct sock for better data locality).
What is the problem with this part?
Thanks
* Re: [PATCH v2 net-next 8/8] tcp: tsq: move tsq_flags close to sk_wmem_alloc
2016-12-04 1:13 ` Eric Dumazet
@ 2016-12-04 1:37 ` David Miller
2016-12-05 2:45 ` Eric Dumazet
0 siblings, 1 reply; 16+ messages in thread
From: David Miller @ 2016-12-04 1:37 UTC (permalink / raw)
To: eric.dumazet; +Cc: edumazet, netdev, ycheng
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 03 Dec 2016 17:13:51 -0800
> On Sat, 2016-12-03 at 19:16 -0500, David Miller wrote:
>> From: Eric Dumazet <edumazet@google.com>
>> Date: Sat, 3 Dec 2016 11:14:57 -0800
>>
>> > diff --git a/include/linux/tcp.h b/include/linux/tcp.h
>> > index d8be083ab0b0..fc5848dad7a4 100644
>> > --- a/include/linux/tcp.h
>> > +++ b/include/linux/tcp.h
>> > @@ -186,7 +186,6 @@ struct tcp_sock {
>> > u32 tsoffset; /* timestamp offset */
>> >
>> > struct list_head tsq_node; /* anchor in tsq_tasklet.head list */
>> > - unsigned long tsq_flags;
>> >
>> > /* Data for direct copy to user */
>> > struct {
>>
>> Hmmm, did you forget to "git add include/net/sock.h" before making
>> this commit?
>
> sk_tsq_flags was added in prior patch in the series ( 7/8 net:
> reorganize struct sock for better data locality)
>
> What is the problem with this part ?
Sorry, just noticed by visual inspection. I expected the
struct sock part to show up in the same patch as the one
that removed it from tcp_sock and adjusted the users.
I'll re-review this series, thanks.
* Re: [PATCH v2 net-next 8/8] tcp: tsq: move tsq_flags close to sk_wmem_alloc
2016-12-04 1:37 ` David Miller
@ 2016-12-05 2:45 ` Eric Dumazet
0 siblings, 0 replies; 16+ messages in thread
From: Eric Dumazet @ 2016-12-05 2:45 UTC (permalink / raw)
To: David Miller; +Cc: Eric Dumazet, netdev, Yuchung Cheng
On Sat, Dec 3, 2016 at 5:37 PM, David Miller <davem@davemloft.net> wrote:
>
> Sorry, just noticed by visual inspection. I expected the
> struct sock part to show up in the same patch as the one
> that removed it from tcp_sock and adjusted the users.
>
> I'll re-review this series, thanks.
Yes, I wanted to have, after patch 7, the final cache line layout
of struct sock.
(Quite critical for future bisections, if needed, or the performance tests
I mentioned.)
I could have used an 'unsigned long _temp_padding', but I just chose the
final name for the field.
Thanks.
* Re: [PATCH v2 net-next 0/8] tcp: tsq: performance series
2016-12-03 19:14 [PATCH v2 net-next 0/8] tcp: tsq: performance series Eric Dumazet
` (7 preceding siblings ...)
2016-12-03 19:14 ` [PATCH v2 net-next 8/8] tcp: tsq: move tsq_flags close to sk_wmem_alloc Eric Dumazet
@ 2016-12-05 19:06 ` David Miller
8 siblings, 0 replies; 16+ messages in thread
From: David Miller @ 2016-12-05 19:06 UTC (permalink / raw)
To: edumazet; +Cc: netdev, ycheng, eric.dumazet
From: Eric Dumazet <edumazet@google.com>
Date: Sat, 3 Dec 2016 11:14:49 -0800
> Under very high TX stress, the CPU handling NIC TX completions can spend
> a considerable amount of cycles handling TSQ (TCP Small Queues) logic.
>
> This patch series avoids some atomic operations, but the most notable
> patch is the 3rd one, which allows other cpus processing ACK packets and
> calling tcp_write_xmit() to grab TCP_TSQ_DEFERRED, so that
> tcp_tasklet_func() can skip already processed sockets.
>
> This avoids lots of lock acquisitions and cache line accesses,
> particularly under load.
>
> In v2, I added :
>
> - tcp_small_queue_check() change to allow 1st and 2nd packets
> in write queue to be sent, even in the case TX completion of
> already acknowledged packets did not happen yet.
> This helps when TX completion coalescing parameters are set
> even to insane values, and/or busy polling is used.
>
> - A reorganization of struct sock fields to
> lower false sharing and increase data locality.
>
> - Then I moved tsq_flags from tcp_sock to struct sock also
> to reduce cache line misses during TX completions.
>
> I measured an overall throughput gain of 22 % for heavy TCP use
> over a single TX queue.
Looks fantastic, series applied, thanks Eric.