From: Alexandra Winter <wintera@linux.ibm.com>
To: Eric Dumazet <edumazet@google.com>,
David Miller <davem@davemloft.net>,
Jakub Kicinski <kuba@kernel.org>
Cc: Yuchung Cheng <ycheng@google.com>,
Soheil Hassas Yeganeh <soheil@google.com>,
Willem de Bruijn <willemb@google.com>,
Paolo Abeni <pabeni@redhat.com>,
Mat Martineau <mathew.j.martineau@linux.intel.com>,
Saeed Mahameed <saeedm@nvidia.com>,
Niklas Schnelle <schnelle@linux.ibm.com>,
Christian Borntraeger <borntraeger@linux.ibm.com>,
netdev@vger.kernel.org, linux-s390@vger.kernel.org,
Heiko Carstens <hca@linux.ibm.com>,
Alexandra Winter <wintera@linux.ibm.com>
Subject: [RFC net] tcp: Fix performance regression for request-response workloads
Date: Wed, 7 Sep 2022 14:25:06 +0200
Message-ID: <20220907122505.26953-1-wintera@linux.ibm.com>
Since linear payload in skb->head was removed even for single small
messages, an additional page must be allocated and mapped per skb,
and we measure a performance impact.
Commit 3613b3dbd1ad ("tcp: prepare skbs for better sack shifting")
explicitly allowed "payload in skb->head for first skb put in the queue,
to not impact RPC workloads."
Commit 472c2e07eef0 ("tcp: add one skb cache for tx") made that
heuristic obsolete and removed it. When commit d8b81175e412
("tcp: remove sk_{tr}x_skb_cache") reverted the skb cache, this
piece was not restored.
When running uperf with a request-response pattern, 1k payload and
250 parallel connections, we measure a 13% throughput degradation on
our PCI-based network interfaces since 472c2e07eef0 (our I/O MMU is
sensitive to the number of mapped pages).
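For reference, each of the 250 connections drives a loop along these
lines (a simplified stand-in for the uperf profile; server address,
port and iteration count are hypothetical):

/* Minimal sketch of one request-response connection. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	char buf[1024] = { 0 };	/* 1k payload, as in the measurement */
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port = htons(5000),	/* hypothetical port */
	};
	int fd, i;

	inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)))
		return 1;
	for (i = 0; i < 1000; i++) {
		/* Each request finds an empty write queue, so it takes
		 * the first_skb path touched by the patch below.
		 */
		if (write(fd, buf, sizeof(buf)) != sizeof(buf))
			break;
		if (read(fd, buf, sizeof(buf)) <= 0)
			break;
	}
	close(fd);
	return 0;
}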
Could you please consider allowing linear payload for the first
skb in the queue again? A patch proposal is appended below.
Kind regards
Alexandra
---------------------------------------------------------------
tcp: allow linear skb payload for first in queue
Allow payload in skb->head for the first skb in the queue;
RPC workloads will benefit.
Fixes: 472c2e07eef0 ("tcp: add one skb cache for tx")
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
---
net/ipv4/tcp.c | 39 +++++++++++++++++++++++++++++++++++++--
1 file changed, 37 insertions(+), 2 deletions(-)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e5011c136fdb..f7cbccd41d85 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1154,6 +1154,30 @@ int tcp_sendpage(struct sock *sk, struct page *page, int offset,
}
EXPORT_SYMBOL(tcp_sendpage);
+/* Do not bother using a page frag for very small frames.
+ * But use this heuristic only for the first skb in write queue.
+ *
+ * Having no payload in skb->head allows better SACK shifting
+ * in tcp_shift_skb_data(), reducing sack/rack overhead, because
+ * write queue has less skbs.
+ * Each skb can hold up to MAX_SKB_FRAGS * 32Kbytes, or ~0.5 MB.
+ * This also speeds up tso_fragment(), since it won't fallback
+ * to tcp_fragment().
+ */
+static int linear_payload_sz(bool first_skb)
+{
+ if (first_skb)
+ return SKB_WITH_OVERHEAD(2048 - MAX_TCP_HEADER);
+ return 0;
+}
+
+static int select_size(bool first_skb, bool zc)
+{
+ if (zc)
+ return 0;
+ return linear_payload_sz(first_skb);
+}
+
void tcp_free_fastopen_req(struct tcp_sock *tp)
{
if (tp->fastopen_req) {
@@ -1311,6 +1335,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
if (copy <= 0 || !tcp_skb_can_collapse_to(skb)) {
bool first_skb;
+ int linear;
new_segment:
if (!sk_stream_memory_free(sk))
@@ -1322,7 +1347,9 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
goto restart;
}
first_skb = tcp_rtx_and_write_queues_empty(sk);
- skb = tcp_stream_alloc_skb(sk, 0, sk->sk_allocation,
+ linear = select_size(first_skb, zc);
+ skb = tcp_stream_alloc_skb(sk, linear,
+ sk->sk_allocation,
first_skb);
if (!skb)
goto wait_for_space;
@@ -1344,7 +1371,15 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
if (copy > msg_data_left(msg))
copy = msg_data_left(msg);
- if (!zc) {
+ /* Where to copy to? */
+ if (skb_availroom(skb) > 0 && !zc) {
+ /* We have some space in skb head. Superb! */
+ copy = min_t(int, copy, skb_availroom(skb));
+ err = skb_add_data_nocache(sk, skb, &msg->msg_iter,
+ copy);
+ if (err)
+ goto do_error;
+ } else if (!zc) {
bool merge = true;
int i = skb_shinfo(skb)->nr_frags;
struct page_frag *pfrag = sk_page_frag(sk);
--
2.24.3 (Apple Git-128)
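To sanity-check the 2048-byte budget used by linear_payload_sz()
above, here is a userspace back-of-envelope; the skb_shared_info size
and MAX_HEADER value are assumptions (both depend on kernel config and
architecture), not authoritative numbers:

#include <stdio.h>

#define SMP_CACHE_BYTES	64
#define SKB_DATA_ALIGN(x) \
	(((x) + SMP_CACHE_BYTES - 1) & ~(SMP_CACHE_BYTES - 1))
#define SHINFO_SIZE	320	/* sizeof(struct skb_shared_info), assumed */
#define MAX_TCP_HEADER	(128 + 160)	/* 128 + MAX_HEADER, assumed */
#define SKB_WITH_OVERHEAD(x)	((x) - SKB_DATA_ALIGN(SHINFO_SIZE))

int main(void)
{
	/* 2048 - 288 - 320 = 1440 bytes of linear room: enough for the
	 * 1k request above without allocating and mapping a page frag.
	 */
	printf("linear budget: %d bytes\n",
	       SKB_WITH_OVERHEAD(2048 - MAX_TCP_HEADER));
	return 0;
}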