From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
stable@vger.kernel.org, Eric Dumazet <edumazet@google.com>,
Doug Porter <dsp@fb.com>,
Soheil Hassas Yeganeh <soheil@google.com>,
Neal Cardwell <ncardwell@google.com>,
"David S. Miller" <davem@davemloft.net>,
Sasha Levin <sashal@kernel.org>
Subject: [PATCH 4.14 36/78] tcp: fix potential xmit stalls caused by TCP_NOTSENT_LOWAT
Date: Tue, 10 May 2022 15:07:22 +0200
Message-ID: <20220510130733.604544525@linuxfoundation.org>
In-Reply-To: <20220510130732.522479698@linuxfoundation.org>
From: Eric Dumazet <edumazet@google.com>
[ Upstream commit 4bfe744ff1644fbc0a991a2677dc874475dd6776 ]
I had this bug sitting for too long in my pile, it is time to fix it.
Thanks to Doug Porter for reminding me of it!
We had various attempts in the past, including commit
0cbe6a8f089e ("tcp: remove SOCK_QUEUE_SHRUNK"),
but the issue is that the TCP stack currently only generates
EPOLLOUT from the input path, when tp->snd_una has advanced
and skb(s) have been cleaned from the rtx queue.
If a flow has a big RTT, and/or receives SACKs, it is possible
that the notsent part (tp->write_seq - tp->snd_nxt) reaches 0
and no more data can be sent until tp->snd_una finally advances.
What is needed is to also check whether EPOLLOUT needs to be generated
whenever tp->snd_nxt is advanced, from the output path.
This bug triggers more often after an idle period, as
we do not receive an ACK for at least one RTT. tcp_notsent_lowat
could be a fraction of what CWND and pacing rate would allow us to
send during this RTT.
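The condition described above can be illustrated with a small userspace
model. This is only a sketch, not the kernel code: struct flow_model and
the helpers notsent_bytes(), wants_epollout() and model_transmit() are
hypothetical names, the strict less-than comparison is an assumption of
the model, and the separate sk->sk_sndbuf/2 check is left out. It shows
that transmitting data (advancing snd_nxt) can make the socket writable
again even though snd_una has not moved.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the sequence numbers discussed above. */
struct flow_model {
	uint32_t snd_una;       /* oldest unacknowledged byte */
	uint32_t snd_nxt;       /* next byte to transmit */
	uint32_t write_seq;     /* end of data queued by the application */
	uint32_t notsent_lowat; /* models the tcp_notsent_lowat threshold */
};

/* Bytes queued but not yet transmitted; unsigned math handles wraparound. */
static uint32_t notsent_bytes(const struct flow_model *f)
{
	return f->write_seq - f->snd_nxt;
}

/* True when the model would consider the socket writable again. */
static bool wants_epollout(const struct flow_model *f)
{
	return notsent_bytes(f) < f->notsent_lowat;
}

/* Output path: transmit up to len bytes. Only snd_nxt advances here;
 * snd_una moves later, when ACKs arrive on the input path.
 */
static void model_transmit(struct flow_model *f, uint32_t len)
{
	uint32_t pending = notsent_bytes(f);

	f->snd_nxt += (len < pending) ? len : pending;
}
```

With write_seq = 1000000, snd_nxt = 0 and a threshold of 500000,
wants_epollout() is false; after model_transmit() sends 600000 bytes it
becomes true while snd_una is still 0. A check made only when snd_una
advances would miss that transition.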
In a followup patch, I will remove the bogus call
to tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED)
from tcp_check_space(). The fact that we have decided to generate
an EPOLLOUT does not mean the application has immediately
refilled the transmit queue. This optimistic call
might have been the reason the bug seemed not too serious.
Tested:
200 ms rtt, 1% packet loss, 32 MB tcp_rmem[2] and tcp_wmem[2]
$ echo 500000 >/proc/sys/net/ipv4/tcp_notsent_lowat
$ cat bench_rr.sh
SUM=0
for i in {1..10}
do
V=`netperf -H remote_host -l30 -t TCP_RR -- -r 10000000,10000 -o LOCAL_BYTES_SENT | egrep -v "MIGRATED|Bytes"`
echo $V
SUM=$(($SUM + $V))
done
echo SUM=$SUM
Before patch:
$ bench_rr.sh
130000000
80000000
140000000
140000000
140000000
140000000
130000000
40000000
90000000
110000000
SUM=1140000000
After patch:
$ bench_rr.sh
430000000
590000000
530000000
450000000
450000000
350000000
450000000
490000000
480000000
460000000
SUM=4680000000 # This is about 410% of the value before the patch.
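The test above sets the threshold system-wide through
/proc/sys/net/ipv4/tcp_notsent_lowat. An application can also set it per
socket with the TCP_NOTSENT_LOWAT option named in the subject. A minimal
sketch, assuming Linux: set_notsent_lowat() and get_notsent_lowat() are
hypothetical helper names, and the fallback #define only covers libc
headers that predate the option.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

#ifndef TCP_NOTSENT_LOWAT
#define TCP_NOTSENT_LOWAT 25 /* Linux value, for older libc headers */
#endif

/* Set the per-socket not-sent low-water mark, in bytes.
 * Returns 0 on success, -1 on error with errno set.
 */
static int set_notsent_lowat(int fd, unsigned int bytes)
{
	return setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT,
			  &bytes, sizeof(bytes));
}

/* Read the current per-socket value back into *bytes.
 * Returns 0 on success, -1 on error with errno set.
 */
static int get_notsent_lowat(int fd, unsigned int *bytes)
{
	socklen_t len = sizeof(*bytes);

	return getsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT, bytes, &len);
}
```

Calling set_notsent_lowat(fd, 500000) on a TCP socket would mirror the
value used in the test for that one socket only.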
Fixes: c9bee3b7fdec ("tcp: TCP_NOTSENT_LOWAT socket option")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Doug Porter <dsp@fb.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
include/net/tcp.h | 1 +
net/ipv4/tcp_input.c | 12 +++++++++++-
net/ipv4/tcp_output.c | 1 +
3 files changed, 13 insertions(+), 1 deletion(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 4602959b58a1..181db7dab176 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -585,6 +585,7 @@ void tcp_synack_rtt_meas(struct sock *sk, struct request_sock *req);
void tcp_reset(struct sock *sk);
void tcp_skb_mark_lost_uncond_verify(struct tcp_sock *tp, struct sk_buff *skb);
void tcp_fin(struct sock *sk);
+void tcp_check_space(struct sock *sk);
/* tcp_timer.c */
void tcp_init_xmit_timers(struct sock *);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9382caeb721a..f5cc025003cd 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5114,7 +5114,17 @@ static void tcp_new_space(struct sock *sk)
sk->sk_write_space(sk);
}
-static void tcp_check_space(struct sock *sk)
+/* Caller made space either from:
+ * 1) Freeing skbs in rtx queues (after tp->snd_una has advanced)
+ * 2) Sent skbs from output queue (and thus advancing tp->snd_nxt)
+ *
+ * We might be able to generate EPOLLOUT to the application if:
+ * 1) Space consumed in output/rtx queues is below sk->sk_sndbuf/2
+ * 2) notsent amount (tp->write_seq - tp->snd_nxt) became
+ * small enough that tcp_stream_memory_free() decides it
+ * is time to generate EPOLLOUT.
+ */
+void tcp_check_space(struct sock *sk)
{
if (sock_flag(sk, SOCK_QUEUE_SHRUNK)) {
sock_reset_flag(sk, SOCK_QUEUE_SHRUNK);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 83c0e859bb33..1a5c42c67d42 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -81,6 +81,7 @@ static void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
NET_ADD_STATS(sock_net(sk), LINUX_MIB_TCPORIGDATASENT,
tcp_skb_pcount(skb));
+ tcp_check_space(sk);
}
/* SND.NXT, if window was not shrunk or the amount of shrunk was less than one
--
2.35.1