* [PATCH net-next 0/2] net: avoid some atomic ops when FASYNC is not used
@ 2016-04-25 17:39 Eric Dumazet
2016-04-25 17:39 ` [PATCH net-next 1/2] net: SOCKWQ_ASYNC_NOSPACE optimizations Eric Dumazet
` (3 more replies)
0 siblings, 4 replies; 16+ messages in thread
From: Eric Dumazet @ 2016-04-25 17:39 UTC (permalink / raw)
To: David S . Miller; +Cc: netdev, Eric Dumazet, Eric Dumazet
We can avoid some atomic operations on sockets not using FASYNC.
Eric Dumazet (2):
net: SOCKWQ_ASYNC_NOSPACE optimizations
net: SOCKWQ_ASYNC_WAITDATA optimizations
include/net/sock.h | 8 ++++++++
1 file changed, 8 insertions(+)
--
2.8.0.rc3.226.g39d4020
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH net-next 1/2] net: SOCKWQ_ASYNC_NOSPACE optimizations
2016-04-25 17:39 [PATCH net-next 0/2] net: avoid some atomic ops when FASYNC is not used Eric Dumazet
@ 2016-04-25 17:39 ` Eric Dumazet
2016-05-02 16:16 ` Jiri Pirko
2016-04-25 17:39 ` [PATCH v2 net-next] tcp-tso: do not split TSO packets at retransmit time Eric Dumazet
` (2 subsequent siblings)
3 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2016-04-25 17:39 UTC (permalink / raw)
To: David S . Miller; +Cc: netdev, Eric Dumazet, Eric Dumazet
SOCKWQ_ASYNC_NOSPACE is tested in sock_wake_async()
so that a SIGIO signal is sent when needed.
tcp_sendmsg() clears the bit.
tcp_poll() sets the bit when stream is not writeable.
We can avoid two atomic operations by first checking if socket
is actually interested in the FASYNC business (most sockets in
real applications do not use AIO, but select()/poll()/epoll())
This also removes one cache line miss to access sk->sk_wq->flags
in tcp_sendmsg()
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/net/sock.h | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/include/net/sock.h b/include/net/sock.h
index d63b8494124e..0f48aad9f8e8 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1940,11 +1940,17 @@ static inline unsigned long sock_wspace(struct sock *sk)
*/
static inline void sk_set_bit(int nr, struct sock *sk)
{
+ if (nr == SOCKWQ_ASYNC_NOSPACE && !sock_flag(sk, SOCK_FASYNC))
+ return;
+
set_bit(nr, &sk->sk_wq_raw->flags);
}
static inline void sk_clear_bit(int nr, struct sock *sk)
{
+ if (nr == SOCKWQ_ASYNC_NOSPACE && !sock_flag(sk, SOCK_FASYNC))
+ return;
+
clear_bit(nr, &sk->sk_wq_raw->flags);
}
--
2.8.0.rc3.226.g39d4020
* [PATCH v2 net-next] tcp-tso: do not split TSO packets at retransmit time
2016-04-25 17:39 [PATCH net-next 0/2] net: avoid some atomic ops when FASYNC is not used Eric Dumazet
2016-04-25 17:39 ` [PATCH net-next 1/2] net: SOCKWQ_ASYNC_NOSPACE optimizations Eric Dumazet
@ 2016-04-25 17:39 ` Eric Dumazet
2016-04-25 17:43 ` Eric Dumazet
2016-04-25 17:39 ` [PATCH net-next 2/2] net: SOCKWQ_ASYNC_WAITDATA optimizations Eric Dumazet
2016-04-28 3:10 ` [PATCH net-next 0/2] net: avoid some atomic ops when FASYNC is not used David Miller
3 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2016-04-25 17:39 UTC (permalink / raw)
To: David S . Miller; +Cc: netdev, Eric Dumazet, Eric Dumazet
Linux TCP stack painfully segments all TSO/GSO packets before retransmits.
This was fine back in the days when TSO/GSO were emerging, with their
bugs, but we believe the dark age is over.
Keeping big packets in write queues, and throughout stack traversal,
has many benefits:
- Less memory overhead, because write queues hold fewer skbs.
- Less CPU overhead during ACK processing.
- Better SACK processing, as many studies have mentioned how
awful Linux was at this ;)
- Less CPU overhead when sending the retransmitted packets
(IP stack traversal, netfilter traversal, drivers...).
- Better latencies in the presence of losses.
- Smaller spikes in fq-like packet schedulers, as retransmits
are not constrained by TCP Small Queues.
1% packet loss is common today, and at 100 Gbit speeds this
translates to ~80,000 losses per second.
Losses are often correlated, and we see many retransmit events
leading to trains of 1-MSS packets, at a time when hosts are already
under stress.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
---
include/net/tcp.h | 4 ++--
net/ipv4/tcp_input.c | 2 +-
net/ipv4/tcp_output.c | 64 +++++++++++++++++++++++----------------------------
net/ipv4/tcp_timer.c | 4 ++--
4 files changed, 34 insertions(+), 40 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index fd40f8c64d5f..0dc272dcd772 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -538,8 +538,8 @@ __u32 cookie_v6_init_sequence(const struct sk_buff *skb, __u16 *mss);
void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,
int nonagle);
bool tcp_may_send_now(struct sock *sk);
-int __tcp_retransmit_skb(struct sock *, struct sk_buff *);
-int tcp_retransmit_skb(struct sock *, struct sk_buff *);
+int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs);
+int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs);
void tcp_retransmit_timer(struct sock *sk);
void tcp_xmit_retransmit_queue(struct sock *);
void tcp_simple_retransmit(struct sock *);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 90e0d9256b74..729e489b5608 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5543,7 +5543,7 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
if (data) { /* Retransmit unacked data in SYN */
tcp_for_write_queue_from(data, sk) {
if (data == tcp_send_head(sk) ||
- __tcp_retransmit_skb(sk, data))
+ __tcp_retransmit_skb(sk, data, 1))
break;
}
tcp_rearm_rto(sk);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 6451b83d81e9..4876b256a70a 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2266,7 +2266,7 @@ void tcp_send_loss_probe(struct sock *sk)
if (WARN_ON(!skb || !tcp_skb_pcount(skb)))
goto rearm_timer;
- if (__tcp_retransmit_skb(sk, skb))
+ if (__tcp_retransmit_skb(sk, skb, 1))
goto rearm_timer;
/* Record snd_nxt for loss detection. */
@@ -2551,17 +2551,17 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *to,
* state updates are done by the caller. Returns non-zero if an
* error occurred which prevented the send.
*/
-int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
+int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs)
{
- struct tcp_sock *tp = tcp_sk(sk);
struct inet_connection_sock *icsk = inet_csk(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
unsigned int cur_mss;
- int err;
+ int diff, len, err;
+
- /* Inconslusive MTU probe */
- if (icsk->icsk_mtup.probe_size) {
+ /* Inconclusive MTU probe */
+ if (icsk->icsk_mtup.probe_size)
icsk->icsk_mtup.probe_size = 0;
- }
/* Do not sent more than we queued. 1/4 is reserved for possible
* copying overhead: fragmentation, tunneling, mangling etc.
@@ -2594,30 +2594,27 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
TCP_SKB_CB(skb)->seq != tp->snd_una)
return -EAGAIN;
- if (skb->len > cur_mss) {
- if (tcp_fragment(sk, skb, cur_mss, cur_mss, GFP_ATOMIC))
+ len = cur_mss * segs;
+ if (skb->len > len) {
+ if (tcp_fragment(sk, skb, len, cur_mss, GFP_ATOMIC))
return -ENOMEM; /* We'll try again later. */
} else {
- int oldpcount = tcp_skb_pcount(skb);
+ if (skb_unclone(skb, GFP_ATOMIC))
+ return -ENOMEM;
- if (unlikely(oldpcount > 1)) {
- if (skb_unclone(skb, GFP_ATOMIC))
- return -ENOMEM;
- tcp_init_tso_segs(skb, cur_mss);
- tcp_adjust_pcount(sk, skb, oldpcount - tcp_skb_pcount(skb));
- }
+ diff = tcp_skb_pcount(skb);
+ tcp_set_skb_tso_segs(skb, cur_mss);
+ diff -= tcp_skb_pcount(skb);
+ if (diff)
+ tcp_adjust_pcount(sk, skb, diff);
+ if (skb->len < cur_mss)
+ tcp_retrans_try_collapse(sk, skb, cur_mss);
}
/* RFC3168, section 6.1.1.1. ECN fallback */
if ((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN_ECN) == TCPHDR_SYN_ECN)
tcp_ecn_clear_syn(sk, skb);
- tcp_retrans_try_collapse(sk, skb, cur_mss);
-
- /* Make a copy, if the first transmission SKB clone we made
- * is still in somebody's hands, else make a clone.
- */
-
/* make sure skb->data is aligned on arches that require it
* and check if ack-trimming & collapsing extended the headroom
* beyond what csum_start can cover.
@@ -2633,20 +2630,22 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
}
if (likely(!err)) {
+ segs = tcp_skb_pcount(skb);
+
TCP_SKB_CB(skb)->sacked |= TCPCB_EVER_RETRANS;
/* Update global TCP statistics. */
- TCP_INC_STATS(sock_net(sk), TCP_MIB_RETRANSSEGS);
+ TCP_ADD_STATS(sock_net(sk), TCP_MIB_RETRANSSEGS, segs);
if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN)
NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);
- tp->total_retrans++;
+ tp->total_retrans += segs;
}
return err;
}
-int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
+int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs)
{
struct tcp_sock *tp = tcp_sk(sk);
- int err = __tcp_retransmit_skb(sk, skb);
+ int err = __tcp_retransmit_skb(sk, skb, segs);
if (err == 0) {
#if FASTRETRANS_DEBUG > 0
@@ -2737,6 +2736,7 @@ void tcp_xmit_retransmit_queue(struct sock *sk)
tcp_for_write_queue_from(skb, sk) {
__u8 sacked = TCP_SKB_CB(skb)->sacked;
+ int segs;
if (skb == tcp_send_head(sk))
break;
@@ -2744,14 +2744,8 @@ void tcp_xmit_retransmit_queue(struct sock *sk)
if (!hole)
tp->retransmit_skb_hint = skb;
- /* Assume this retransmit will generate
- * only one packet for congestion window
- * calculation purposes. This works because
- * tcp_retransmit_skb() will chop up the
- * packet to be MSS sized and all the
- * packet counting works out.
- */
- if (tcp_packets_in_flight(tp) >= tp->snd_cwnd)
+ segs = tp->snd_cwnd - tcp_packets_in_flight(tp);
+ if (segs <= 0)
return;
if (fwd_rexmitting) {
@@ -2788,7 +2782,7 @@ begin_fwd:
if (sacked & (TCPCB_SACKED_ACKED|TCPCB_SACKED_RETRANS))
continue;
- if (tcp_retransmit_skb(sk, skb))
+ if (tcp_retransmit_skb(sk, skb, segs))
return;
NET_INC_STATS_BH(sock_net(sk), mib_idx);
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 49bc474f8e35..373b03e78aaa 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -404,7 +404,7 @@ void tcp_retransmit_timer(struct sock *sk)
goto out;
}
tcp_enter_loss(sk);
- tcp_retransmit_skb(sk, tcp_write_queue_head(sk));
+ tcp_retransmit_skb(sk, tcp_write_queue_head(sk), 1);
__sk_dst_reset(sk);
goto out_reset_timer;
}
@@ -436,7 +436,7 @@ void tcp_retransmit_timer(struct sock *sk)
tcp_enter_loss(sk);
- if (tcp_retransmit_skb(sk, tcp_write_queue_head(sk)) > 0) {
+ if (tcp_retransmit_skb(sk, tcp_write_queue_head(sk), 1) > 0) {
/* Retransmission failed because of local congestion,
* do not backoff.
*/
--
2.8.0.rc3.226.g39d4020
* [PATCH net-next 2/2] net: SOCKWQ_ASYNC_WAITDATA optimizations
2016-04-25 17:39 [PATCH net-next 0/2] net: avoid some atomic ops when FASYNC is not used Eric Dumazet
2016-04-25 17:39 ` [PATCH net-next 1/2] net: SOCKWQ_ASYNC_NOSPACE optimizations Eric Dumazet
2016-04-25 17:39 ` [PATCH v2 net-next] tcp-tso: do not split TSO packets at retransmit time Eric Dumazet
@ 2016-04-25 17:39 ` Eric Dumazet
2016-04-28 3:10 ` [PATCH net-next 0/2] net: avoid some atomic ops when FASYNC is not used David Miller
3 siblings, 0 replies; 16+ messages in thread
From: Eric Dumazet @ 2016-04-25 17:39 UTC (permalink / raw)
To: David S . Miller; +Cc: netdev, Eric Dumazet, Eric Dumazet
SOCKWQ_ASYNC_WAITDATA is set/cleared in sk_wait_data()
and equivalent functions, so that sock_wake_async() can send
a SIGIO only when necessary.
Since these atomic operations are not needed unless the socket
has expressed interest in FASYNC, we can omit them in most
cases.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/net/sock.h | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/include/net/sock.h b/include/net/sock.h
index 0f48aad9f8e8..3df778ccaa82 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1940,7 +1940,8 @@ static inline unsigned long sock_wspace(struct sock *sk)
*/
static inline void sk_set_bit(int nr, struct sock *sk)
{
- if (nr == SOCKWQ_ASYNC_NOSPACE && !sock_flag(sk, SOCK_FASYNC))
+ if ((nr == SOCKWQ_ASYNC_NOSPACE || nr == SOCKWQ_ASYNC_WAITDATA) &&
+ !sock_flag(sk, SOCK_FASYNC))
return;
set_bit(nr, &sk->sk_wq_raw->flags);
@@ -1948,7 +1949,8 @@ static inline void sk_set_bit(int nr, struct sock *sk)
static inline void sk_clear_bit(int nr, struct sock *sk)
{
- if (nr == SOCKWQ_ASYNC_NOSPACE && !sock_flag(sk, SOCK_FASYNC))
+ if ((nr == SOCKWQ_ASYNC_NOSPACE || nr == SOCKWQ_ASYNC_WAITDATA) &&
+ !sock_flag(sk, SOCK_FASYNC))
return;
clear_bit(nr, &sk->sk_wq_raw->flags);
--
2.8.0.rc3.226.g39d4020
* Re: [PATCH v2 net-next] tcp-tso: do not split TSO packets at retransmit time
2016-04-25 17:39 ` [PATCH v2 net-next] tcp-tso: do not split TSO packets at retransmit time Eric Dumazet
@ 2016-04-25 17:43 ` Eric Dumazet
0 siblings, 0 replies; 16+ messages in thread
From: Eric Dumazet @ 2016-04-25 17:43 UTC (permalink / raw)
To: Eric Dumazet, David S . Miller; +Cc: netdev
On Mon, 2016-04-25 at 10:39 -0700, Eric Dumazet wrote:
> Linux TCP stack painfully segments all TSO/GSO packets before retransmits.
>
humpf.
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Acked-by: Yuchung Cheng <ycheng@google.com>
> ---
> include/net/tcp.h | 4 ++--
> net/ipv4/tcp_input.c | 2 +-
> net/ipv4/tcp_output.c | 64 +++++++++++++++++++++++----------------------------
> net/ipv4/tcp_timer.c | 4 ++--
> 4 files changed, 34 insertions(+), 40 deletions(-)
Please ignore, I forgot to clean this one, it was already merged.
Sorry for the noise.
* Re: [PATCH net-next 0/2] net: avoid some atomic ops when FASYNC is not used
2016-04-25 17:39 [PATCH net-next 0/2] net: avoid some atomic ops when FASYNC is not used Eric Dumazet
` (2 preceding siblings ...)
2016-04-25 17:39 ` [PATCH net-next 2/2] net: SOCKWQ_ASYNC_WAITDATA optimizations Eric Dumazet
@ 2016-04-28 3:10 ` David Miller
3 siblings, 0 replies; 16+ messages in thread
From: David Miller @ 2016-04-28 3:10 UTC (permalink / raw)
To: edumazet; +Cc: netdev, eric.dumazet
From: Eric Dumazet <edumazet@google.com>
Date: Mon, 25 Apr 2016 10:39:31 -0700
> We can avoid some atomic operations on sockets not using FASYNC
I guess a user can do weird things: set/clear the FASYNC bit in the
middle of a SOCKWQ_ASYNC_* bit being set, then reset the FASYNC bit
later, leaving the SOCKWQ_* state stale.
However, that's probably not worth handling explicitly.
Series applied, thanks.
* Re: [PATCH net-next 1/2] net: SOCKWQ_ASYNC_NOSPACE optimizations
2016-04-25 17:39 ` [PATCH net-next 1/2] net: SOCKWQ_ASYNC_NOSPACE optimizations Eric Dumazet
@ 2016-05-02 16:16 ` Jiri Pirko
2016-05-02 16:22 ` Eric Dumazet
0 siblings, 1 reply; 16+ messages in thread
From: Jiri Pirko @ 2016-05-02 16:16 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David S . Miller, netdev, Eric Dumazet, eladr, idosch
Mon, Apr 25, 2016 at 07:39:32PM CEST, edumazet@google.com wrote:
>SOCKWQ_ASYNC_NOSPACE is tested in sock_wake_async()
>so that a SIGIO signal is sent when needed.
>
>tcp_sendmsg() clears the bit.
>tcp_poll() sets the bit when stream is not writeable.
>
>We can avoid two atomic operations by first checking if socket
>is actually interested in the FASYNC business (most sockets in
>real applications do not use AIO, but select()/poll()/epoll())
>
>This also removes one cache line miss to access sk->sk_wq->flags
>in tcp_sendmsg()
>
>Signed-off-by: Eric Dumazet <edumazet@google.com>
I just bisected down to this. This is causing a regression for me when
my nfs mount becomes stuck. I can easily reproduce this if you need to
test the fix.
Thanks.
* Re: [PATCH net-next 1/2] net: SOCKWQ_ASYNC_NOSPACE optimizations
2016-05-02 16:16 ` Jiri Pirko
@ 2016-05-02 16:22 ` Eric Dumazet
2016-05-02 19:12 ` Jiri Pirko
0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2016-05-02 16:22 UTC (permalink / raw)
To: Jiri Pirko; +Cc: Eric Dumazet, David S . Miller, netdev, eladr, idosch
On Mon, 2016-05-02 at 18:16 +0200, Jiri Pirko wrote:
> Mon, Apr 25, 2016 at 07:39:32PM CEST, edumazet@google.com wrote:
> >SOCKWQ_ASYNC_NOSPACE is tested in sock_wake_async()
> >so that a SIGIO signal is sent when needed.
> >
> >tcp_sendmsg() clears the bit.
> >tcp_poll() sets the bit when stream is not writeable.
> >
> >We can avoid two atomic operations by first checking if socket
> >is actually interested in the FASYNC business (most sockets in
> >real applications do not use AIO, but select()/poll()/epoll())
> >
> >This also removes one cache line miss to access sk->sk_wq->flags
> >in tcp_sendmsg()
> >
> >Signed-off-by: Eric Dumazet <edumazet@google.com>
>
> I just bisected down to this. This is causing a regression for me when
> my nfs mount becomes stuck. I can easily reproduce this if you need to
> test the fix.
What do you mean by 'when nfs mount becomes stuck' ?
Is this patch making nfs not functional , or does it make recovery from
some nfs error bad ?
Thanks.
* Re: [PATCH net-next 1/2] net: SOCKWQ_ASYNC_NOSPACE optimizations
2016-05-02 16:22 ` Eric Dumazet
@ 2016-05-02 19:12 ` Jiri Pirko
2016-05-02 20:23 ` Eric Dumazet
0 siblings, 1 reply; 16+ messages in thread
From: Jiri Pirko @ 2016-05-02 19:12 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Eric Dumazet, David S . Miller, netdev, eladr, idosch
Mon, May 02, 2016 at 06:22:18PM CEST, eric.dumazet@gmail.com wrote:
>On Mon, 2016-05-02 at 18:16 +0200, Jiri Pirko wrote:
>> Mon, Apr 25, 2016 at 07:39:32PM CEST, edumazet@google.com wrote:
>> >SOCKWQ_ASYNC_NOSPACE is tested in sock_wake_async()
>> >so that a SIGIO signal is sent when needed.
>> >
>> >tcp_sendmsg() clears the bit.
>> >tcp_poll() sets the bit when stream is not writeable.
>> >
>> >We can avoid two atomic operations by first checking if socket
>> >is actually interested in the FASYNC business (most sockets in
>> >real applications do not use AIO, but select()/poll()/epoll())
>> >
>> >This also removes one cache line miss to access sk->sk_wq->flags
>> >in tcp_sendmsg()
>> >
>> >Signed-off-by: Eric Dumazet <edumazet@google.com>
>>
>> I just bisected down to this. This is causing a regression for me when
>> my nfs mount becomes stuck. I can easily reproduce this if you need to
>> test the fix.
>
>What do you mean by 'when nfs mount becomes stuck' ?
>
>Is this patch making nfs not functional , or does it make recovery from
>some nfs error bad ?
I can mount nfs on the host. But when I do something (compile a kernel
module in my case), it gets stuck. Then I cannot even ssh to the machine.
No messages in dmesg. I didn't debug it any further. I just bisected and
verified that this patch caused this behaviour.
* Re: [PATCH net-next 1/2] net: SOCKWQ_ASYNC_NOSPACE optimizations
2016-05-02 19:12 ` Jiri Pirko
@ 2016-05-02 20:23 ` Eric Dumazet
2016-05-02 20:31 ` David Miller
` (2 more replies)
0 siblings, 3 replies; 16+ messages in thread
From: Eric Dumazet @ 2016-05-02 20:23 UTC (permalink / raw)
To: Jiri Pirko; +Cc: Eric Dumazet, David S . Miller, netdev, eladr, idosch
On Mon, 2016-05-02 at 21:12 +0200, Jiri Pirko wrote:
> Mon, May 02, 2016 at 06:22:18PM CEST, eric.dumazet@gmail.com wrote:
> >On Mon, 2016-05-02 at 18:16 +0200, Jiri Pirko wrote:
> >> Mon, Apr 25, 2016 at 07:39:32PM CEST, edumazet@google.com wrote:
> >> >SOCKWQ_ASYNC_NOSPACE is tested in sock_wake_async()
> >> >so that a SIGIO signal is sent when needed.
> >> >
> >> >tcp_sendmsg() clears the bit.
> >> >tcp_poll() sets the bit when stream is not writeable.
> >> >
> >> >We can avoid two atomic operations by first checking if socket
> >> >is actually interested in the FASYNC business (most sockets in
> >> >real applications do not use AIO, but select()/poll()/epoll())
> >> >
> >> >This also removes one cache line miss to access sk->sk_wq->flags
> >> >in tcp_sendmsg()
> >> >
> >> >Signed-off-by: Eric Dumazet <edumazet@google.com>
> >>
> >> I just bisected down to this. This is causing a regression for me when
> >> my nfs mount becomes stuck. I can easily reproduce this if you need to
> >> test the fix.
> >
> >What do you mean by 'when nfs mount becomes stuck' ?
> >
> >Is this patch making nfs not functional , or does it make recovery from
> >some nfs error bad ?
>
> I can mount nfs on the host. But when I do something (compile a kernel
> module in my case), it gets stuck. Then I cannot even ssh to the machine.
> No messages in dmesg. I didn't debug it any further. I just bisected and
> verified that this patch caused this behaviour.
Interesting.
It looks like net/sunrpc/xprtsock.c should set SOCK_FASYNC
even if it is not actually using fasync_list
Could you try this quick hack to check if this is the right way ?
Thanks !
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index a6c68dc086af83233ee315642638f4a1990ee622..b90c5397b5e137c6cc8accad6eebe2b876363d4e 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1950,6 +1950,7 @@ static int xs_local_finish_connecting(struct rpc_xprt *xprt,
sk->sk_user_data = xprt;
sk->sk_data_ready = xs_data_ready;
sk->sk_write_space = xs_udp_write_space;
+ sock_set_flag(sk, SOCK_FASYNC);
sk->sk_error_report = xs_error_report;
sk->sk_allocation = GFP_NOIO;
@@ -2136,6 +2137,7 @@ static void xs_udp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
sk->sk_user_data = xprt;
sk->sk_data_ready = xs_data_ready;
sk->sk_write_space = xs_udp_write_space;
+ sock_set_flag(sk, SOCK_FASYNC);
sk->sk_allocation = GFP_NOIO;
xprt_set_connected(xprt);
@@ -2237,6 +2239,7 @@ static int xs_tcp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
sk->sk_data_ready = xs_tcp_data_ready;
sk->sk_state_change = xs_tcp_state_change;
sk->sk_write_space = xs_tcp_write_space;
+ sock_set_flag(sk, SOCK_FASYNC);
sk->sk_error_report = xs_error_report;
sk->sk_allocation = GFP_NOIO;
* Re: [PATCH net-next 1/2] net: SOCKWQ_ASYNC_NOSPACE optimizations
2016-05-02 20:23 ` Eric Dumazet
@ 2016-05-02 20:31 ` David Miller
2016-05-02 20:55 ` Eric Dumazet
2016-05-02 20:45 ` Jiri Pirko
2016-05-13 4:41 ` [PATCH net-next] sunrpc: set SOCK_FASYNC Eric Dumazet
2 siblings, 1 reply; 16+ messages in thread
From: David Miller @ 2016-05-02 20:31 UTC (permalink / raw)
To: eric.dumazet; +Cc: jiri, edumazet, netdev, eladr, idosch
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 02 May 2016 13:23:27 -0700
> It looks like net/sunrpc/xprtsock.c should set SOCK_FASYNC
> even if it is not actually using fasync_list
>
> Could you try this quick hack to check if this is the right way ?
Indeed, it tests the ASYNC bit without enabling FASYNC.
There are three other places that do this: macvtap, tun, dlm lowcomms.
* Re: [PATCH net-next 1/2] net: SOCKWQ_ASYNC_NOSPACE optimizations
2016-05-02 20:23 ` Eric Dumazet
2016-05-02 20:31 ` David Miller
@ 2016-05-02 20:45 ` Jiri Pirko
2016-05-09 6:24 ` Jiri Pirko
2016-05-13 4:41 ` [PATCH net-next] sunrpc: set SOCK_FASYNC Eric Dumazet
2 siblings, 1 reply; 16+ messages in thread
From: Jiri Pirko @ 2016-05-02 20:45 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Eric Dumazet, David S . Miller, netdev, eladr, idosch
Mon, May 02, 2016 at 10:23:27PM CEST, eric.dumazet@gmail.com wrote:
>On Mon, 2016-05-02 at 21:12 +0200, Jiri Pirko wrote:
>> Mon, May 02, 2016 at 06:22:18PM CEST, eric.dumazet@gmail.com wrote:
>> >On Mon, 2016-05-02 at 18:16 +0200, Jiri Pirko wrote:
>> >> Mon, Apr 25, 2016 at 07:39:32PM CEST, edumazet@google.com wrote:
>> >> >SOCKWQ_ASYNC_NOSPACE is tested in sock_wake_async()
>> >> >so that a SIGIO signal is sent when needed.
>> >> >
>> >> >tcp_sendmsg() clears the bit.
>> >> >tcp_poll() sets the bit when stream is not writeable.
>> >> >
>> >> >We can avoid two atomic operations by first checking if socket
>> >> >is actually interested in the FASYNC business (most sockets in
>> >> >real applications do not use AIO, but select()/poll()/epoll())
>> >> >
>> >> >This also removes one cache line miss to access sk->sk_wq->flags
>> >> >in tcp_sendmsg()
>> >> >
>> >> >Signed-off-by: Eric Dumazet <edumazet@google.com>
>> >>
>> >> I just bisected down to this. This is causing a regression for me when
>> >> my nfs mount becomes stuck. I can easily reproduce this if you need to
>> >> test the fix.
>> >
>> >What do you mean by 'when nfs mount becomes stuck' ?
>> >
>> >Is this patch making nfs not functional , or does it make recovery from
>> >some nfs error bad ?
>>
>> I can mount nfs on the host. But when I do something (compile a kernel
>> module in my case), it gets stuck. Then I cannot even ssh to the machine.
>> No messages in dmesg. I didn't debug it any further. I just bisected and
>> verified that this patch caused this behaviour.
>
>Interesting.
>
>It looks like net/sunrpc/xprtsock.c should set SOCK_FASYNC
>even if it is not actually using fasync_list
>
>Could you try this quick hack to check if this is the right way ?
Yep, works, I do not see the issue with this patch anymore. Thanks.
>
>Thanks !
>
>diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
>index a6c68dc086af83233ee315642638f4a1990ee622..b90c5397b5e137c6cc8accad6eebe2b876363d4e 100644
>--- a/net/sunrpc/xprtsock.c
>+++ b/net/sunrpc/xprtsock.c
>@@ -1950,6 +1950,7 @@ static int xs_local_finish_connecting(struct rpc_xprt *xprt,
> sk->sk_user_data = xprt;
> sk->sk_data_ready = xs_data_ready;
> sk->sk_write_space = xs_udp_write_space;
>+ sock_set_flag(sk, SOCK_FASYNC);
> sk->sk_error_report = xs_error_report;
> sk->sk_allocation = GFP_NOIO;
>
>@@ -2136,6 +2137,7 @@ static void xs_udp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
> sk->sk_user_data = xprt;
> sk->sk_data_ready = xs_data_ready;
> sk->sk_write_space = xs_udp_write_space;
>+ sock_set_flag(sk, SOCK_FASYNC);
> sk->sk_allocation = GFP_NOIO;
>
> xprt_set_connected(xprt);
>@@ -2237,6 +2239,7 @@ static int xs_tcp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
> sk->sk_data_ready = xs_tcp_data_ready;
> sk->sk_state_change = xs_tcp_state_change;
> sk->sk_write_space = xs_tcp_write_space;
>+ sock_set_flag(sk, SOCK_FASYNC);
> sk->sk_error_report = xs_error_report;
> sk->sk_allocation = GFP_NOIO;
>
>
>
* Re: [PATCH net-next 1/2] net: SOCKWQ_ASYNC_NOSPACE optimizations
2016-05-02 20:31 ` David Miller
@ 2016-05-02 20:55 ` Eric Dumazet
0 siblings, 0 replies; 16+ messages in thread
From: Eric Dumazet @ 2016-05-02 20:55 UTC (permalink / raw)
To: David Miller; +Cc: jiri, edumazet, netdev, eladr, idosch
On Mon, 2016-05-02 at 16:31 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 02 May 2016 13:23:27 -0700
>
> > It looks like net/sunrpc/xprtsock.c should set SOCK_FASYNC
> > even if it is not actually using fasync_list
> >
> > Could you try this quick hack to check if this is the right way ?
>
> Indeed, it tests the ASYNC bit without enabling FASYNC.
>
> There are three other places that do this: macvtap, tun, dlm lowcomms.
Yes, although macvtap and tun have a private usage of this bit.
When the flag was moved (commit ceb5d58b217098a657f3850b7a2640f995032e62
"net: fix sock_wake_async() rcu protection"), I did not change the code
in these drivers. And apparently nobody complained (linux-4.4).
drivers/net/macvtap.c:501: !test_and_clear_bit(SOCKWQ_ASYNC_NOSPACE, &sk->sk_socket->flags))
drivers/net/macvtap.c:588: (!test_and_set_bit(SOCKWQ_ASYNC_NOSPACE, &q->sock.flags) &&
drivers/net/tun.c:1111: (!test_and_set_bit(SOCKWQ_ASYNC_NOSPACE, &sk->sk_socket->flags) &&
drivers/net/tun.c:1576: if (!test_and_clear_bit(SOCKWQ_ASYNC_NOSPACE, &sk->sk_socket->flags))
fs/dlm/lowcomms.c probably needs a fix.
* Re: [PATCH net-next 1/2] net: SOCKWQ_ASYNC_NOSPACE optimizations
2016-05-02 20:45 ` Jiri Pirko
@ 2016-05-09 6:24 ` Jiri Pirko
0 siblings, 0 replies; 16+ messages in thread
From: Jiri Pirko @ 2016-05-09 6:24 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Eric Dumazet, David S . Miller, netdev, eladr, idosch
Mon, May 02, 2016 at 10:45:44PM CEST, jiri@resnulli.us wrote:
>Mon, May 02, 2016 at 10:23:27PM CEST, eric.dumazet@gmail.com wrote:
>>On Mon, 2016-05-02 at 21:12 +0200, Jiri Pirko wrote:
>>> Mon, May 02, 2016 at 06:22:18PM CEST, eric.dumazet@gmail.com wrote:
>>> >On Mon, 2016-05-02 at 18:16 +0200, Jiri Pirko wrote:
>>> >> Mon, Apr 25, 2016 at 07:39:32PM CEST, edumazet@google.com wrote:
>>> >> >SOCKWQ_ASYNC_NOSPACE is tested in sock_wake_async()
>>> >> >so that a SIGIO signal is sent when needed.
>>> >> >
>>> >> >tcp_sendmsg() clears the bit.
>>> >> >tcp_poll() sets the bit when stream is not writeable.
>>> >> >
>>> >> >We can avoid two atomic operations by first checking if socket
>>> >> >is actually interested in the FASYNC business (most sockets in
>>> >> >real applications do not use AIO, but select()/poll()/epoll())
>>> >> >
>>> >> >This also removes one cache line miss to access sk->sk_wq->flags
>>> >> >in tcp_sendmsg()
>>> >> >
>>> >> >Signed-off-by: Eric Dumazet <edumazet@google.com>
>>> >>
>>> >> I just bisected down to this. This is causing a regression for me when
>>> >> my nfs mount becomes stuck. I can easily reproduce this if you need to
>>> >> test the fix.
>>> >
>>> >What do you mean by 'when nfs mount becomes stuck' ?
>>> >
>>> >Is this patch making nfs not functional , or does it make recovery from
>>> >some nfs error bad ?
>>>
>>> I can mount nfs on the host. But when I do something (compile a kernel
>>> module in my case), it gets stuck. Then I cannot even ssh to the machine.
>>> No messages in dmesg. I didn't debug it any further. I just bisected and
>>> verified that this patch caused this behaviour.
>>
>>Interesting.
>>
>>It looks like net/sunrpc/xprtsock.c should set SOCK_FASYNC
>>even if it is not actually using fasync_list
>>
>>Could you try this quick hack to check if this is the right way ?
>
>Yep, works, I do not see the issue with this patch anymore. Thanks.
Eric, any news with this issue?
Thanks.
* [PATCH net-next] sunrpc: set SOCK_FASYNC
2016-05-02 20:23 ` Eric Dumazet
2016-05-02 20:31 ` David Miller
2016-05-02 20:45 ` Jiri Pirko
@ 2016-05-13 4:41 ` Eric Dumazet
2016-05-13 5:46 ` David Miller
2 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2016-05-13 4:41 UTC (permalink / raw)
To: Jiri Pirko; +Cc: Eric Dumazet, David S . Miller, netdev, Huang Ying
From: Eric Dumazet <edumazet@google.com>
sunrpc is using SOCKWQ_ASYNC_NOSPACE without setting SOCK_FASYNC,
so the recent optimizations done in sk_set_bit() and sk_clear_bit()
broke it.
There is still the risk that a subsequent sock_fasync() call
would clear SOCK_FASYNC, but sunrpc does not use this yet.
Fixes: 9317bb69824e ("net: SOCKWQ_ASYNC_NOSPACE optimizations")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jiri Pirko <jiri@resnulli.us>
Reported-by: Huang, Ying <ying.huang@intel.com>
Tested-by: Jiri Pirko <jiri@resnulli.us>
Tested-by: Huang, Ying <ying.huang@intel.com>
---
net/sunrpc/xprtsock.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index a6c68dc086af83233ee315642638f4a1990ee622..b90c5397b5e137c6cc8accad6eebe2b876363d4e 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1950,6 +1950,7 @@ static int xs_local_finish_connecting(struct rpc_xprt *xprt,
sk->sk_user_data = xprt;
sk->sk_data_ready = xs_data_ready;
sk->sk_write_space = xs_udp_write_space;
+ sock_set_flag(sk, SOCK_FASYNC);
sk->sk_error_report = xs_error_report;
sk->sk_allocation = GFP_NOIO;
@@ -2136,6 +2137,7 @@ static void xs_udp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
sk->sk_user_data = xprt;
sk->sk_data_ready = xs_data_ready;
sk->sk_write_space = xs_udp_write_space;
+ sock_set_flag(sk, SOCK_FASYNC);
sk->sk_allocation = GFP_NOIO;
xprt_set_connected(xprt);
@@ -2237,6 +2239,7 @@ static int xs_tcp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
sk->sk_data_ready = xs_tcp_data_ready;
sk->sk_state_change = xs_tcp_state_change;
sk->sk_write_space = xs_tcp_write_space;
+ sock_set_flag(sk, SOCK_FASYNC);
sk->sk_error_report = xs_error_report;
sk->sk_allocation = GFP_NOIO;
* Re: [PATCH net-next] sunrpc: set SOCK_FASYNC
2016-05-13 4:41 ` [PATCH net-next] sunrpc: set SOCK_FASYNC Eric Dumazet
@ 2016-05-13 5:46 ` David Miller
0 siblings, 0 replies; 16+ messages in thread
From: David Miller @ 2016-05-13 5:46 UTC (permalink / raw)
To: eric.dumazet; +Cc: jiri, edumazet, netdev, ying.huang
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 12 May 2016 21:41:39 -0700
> From: Eric Dumazet <edumazet@google.com>
>
> sunrpc is using SOCKWQ_ASYNC_NOSPACE without setting SOCK_FASYNC,
> so the recent optimizations done in sk_set_bit() and sk_clear_bit()
> broke it.
>
> There is still the risk that a subsequent sock_fasync() call
> would clear SOCK_FASYNC, but sunrpc does not use this yet.
>
> Fixes: 9317bb69824e ("net: SOCKWQ_ASYNC_NOSPACE optimizations")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: Jiri Pirko <jiri@resnulli.us>
> Reported-by: Huang, Ying <ying.huang@intel.com>
> Tested-by: Jiri Pirko <jiri@resnulli.us>
> Tested-by: Huang, Ying <ying.huang@intel.com>
Applied, thanks.