From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
To: netdev@vger.kernel.org, willemdebruijn.kernel@gmail.com
Cc: davem@davemloft.net, rds-devel@oss.oracle.com,
sowmini.varadhan@oracle.com, santosh.shilimkar@oracle.com
Subject: [PATCH RFC net-next 0/6] rds: zerocopy support
Date: Wed, 17 Jan 2018 04:19:58 -0800
Message-ID: <cover.1516147540.git.sowmini.varadhan@oracle.com>
This patch series provides support for MSG_ZEROCOPY
on a PF_RDS socket based on the APIs and infrastructure added
by f214f915e7db ("tcp: enable MSG_ZEROCOPY").
For single-threaded rds-stress testing using rds-tcp with the
ixgbe driver and 1M message sizes (-a 1M -q 1M), preliminary
results show a significant reduction in latency: about
90 usec with zerocopy, compared with 200 usec without zerocopy.
Additional testing/debugging is ongoing, but I am sharing
the current patchset to get some feedback on API design choices,
especially for the send-completion notification for multi-threaded
datagram socket applications.
Brief RDS Architectural overview: PF_RDS sockets implement
message-bounded datagram semantics over a reliable transport.
The RDS socket layer tracks message boundaries and uses
an underlying transport like TCP to segment/reassemble the
message into MTU sized frames. In addition to the reliable,
ordered delivery semantics provided by the transport, the
RDS layer also retains the datagram in its retransmit queue,
to be resent in case of transport failure/restart events.
This patchset modifies the above for zerocopy in the following manner.
If both of the following hold:
- the MSG_ZEROCOPY flag is specified with rds_sendmsg(), and
- the SO_ZEROCOPY socket option has been set on the PF_RDS socket,
then the application pages sent down with rds_sendmsg() are pinned. The
pinning uses the accounting infrastructure added by a91dbff551a6 ("sock:
ulimit on MSG_ZEROCOPY pages").
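As a rough user-space illustration of the two conditions above, the sketch
below sets SO_ZEROCOPY on a socket and passes MSG_ZEROCOPY on each send.
The helper names zc_enable() and zc_sendmsg() are hypothetical and not part
of this series; the fallback #define values are only there for toolchains
whose headers predate these constants.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Fallbacks for older userspace headers (values as in mainline Linux). */
#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

/*
 * Hypothetical helper: opt the socket in to zerocopy sends.
 * Returns 0 on success, -1 with errno set on failure.
 */
int zc_enable(int fd)
{
	int one = 1;

	return setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
}

/*
 * Hypothetical helper: send a caller-prepared message with MSG_ZEROCOPY.
 * The payload buffers referenced by msg->msg_iov must not be modified
 * until the kernel reports completion for this send.
 */
ssize_t zc_sendmsg(int fd, const struct msghdr *msg)
{
	assert(msg != NULL);
	return sendmsg(fd, msg, MSG_ZEROCOPY);
}
```

Both calls simply report the kernel's verdict, so on a kernel or socket
family without zerocopy support the caller sees an error rather than a
silent fallback.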
The message is unpinned after we get back an ACK (TCP ACK, in the
case of rds-tcp) indicating that the RDS module at the receiver
has received the datagram, and it is safe for the sender to free
the message from its (RDS) retransmit queue.
The payload bytes in the message may not be modified while
the message is pinned. A multi-threaded
application using this infrastructure thus needs to be notified
about send-completion, and that notification must uniquely
identify the message to the application so that the application
buffers may be freed/reused.
Unique identification of the message in the completion notification
is done in the following manner:
- application passes down a 32 bit cookie as ancillary data with
rds_sendmsg. The ancillary data in this case has cmsg_level == SOL_RDS
and cmsg_type == RDS_CMSG_ZCOPY_COOKIE.
- upon send-completion, the rds module passes up a batch of cookies
on the sk_error_queue associated with the PF_RDS socket. The message
thus received will have a batch of N cookies in the data, with the
number of cookies (N) specified in the ancillary data passed with
recvmsg(). The current patchset sets up the ancillary data as a
sock_extended_err with ee_origin == SO_EE_ORIGIN_ZEROCOPY and
ee_data == N, based on 52267790ef52 ("sock: add MSG_ZEROCOPY").
Alternate suggestions for designing this API are invited. The
important point here is that the notification needs to be able
to contain an arbitrary number of cookies, where each cookie
allows the application to uniquely identify a buffer used with
sendmsg().
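To make the send side of the scheme above concrete, here is a minimal
user-space sketch that attaches a 32-bit cookie as ancillary data with
cmsg_level == SOL_RDS and cmsg_type == RDS_CMSG_ZCOPY_COOKIE. The helper
name zc_put_cookie() is hypothetical, and the numeric fallback for
RDS_CMSG_ZCOPY_COOKIE is a placeholder assumption; the authoritative value
is whatever include/uapi/linux/rds.h defines.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

#ifndef SOL_RDS
#define SOL_RDS 276
#endif
/* Placeholder assumption: use the value from linux/rds.h when available. */
#ifndef RDS_CMSG_ZCOPY_COOKIE
#define RDS_CMSG_ZCOPY_COOKIE 12
#endif

/*
 * Hypothetical helper: fill msg's control area with a single cookie cmsg.
 * ctrl must be suitably aligned for struct cmsghdr and at least
 * CMSG_SPACE(sizeof(uint32_t)) bytes long.
 * Returns 0 on success, -1 if the control buffer is too small.
 */
int zc_put_cookie(struct msghdr *msg, void *ctrl, size_t ctrllen,
		  uint32_t cookie)
{
	struct cmsghdr *cmsg;

	if (ctrllen < CMSG_SPACE(sizeof(cookie)))
		return -1;

	memset(ctrl, 0, ctrllen);
	msg->msg_control = ctrl;
	msg->msg_controllen = CMSG_SPACE(sizeof(cookie));

	cmsg = CMSG_FIRSTHDR(msg);
	assert(cmsg != NULL);	/* guaranteed by the size check above */
	cmsg->cmsg_level = SOL_RDS;
	cmsg->cmsg_type = RDS_CMSG_ZCOPY_COOKIE;
	cmsg->cmsg_len = CMSG_LEN(sizeof(cookie));
	memcpy(CMSG_DATA(cmsg), &cookie, sizeof(cookie));
	return 0;
}
```

An application would typically pick the cookie so that it maps directly
back to the buffer being sent (for example, an index into its buffer pool),
then pass the prepared msghdr to sendmsg() with MSG_ZEROCOPY.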
Note that cookie-batching on send-completion notification means
that the application may not know the buffering requirements
a priori and the buffer sent down with recvmsg on the MSG_ERRQUEUE
may be smaller than the required size for the notifications to be
sent. To accommodate this case, sk_error_queue has been enhanced
to support MSG_PEEK semantics (so that the application
can retry with a larger buffer).
Work in progress
- additional testing: when we test this with rds-stress with 8 sockets
and a send depth of 64 (i.e. each socket can have at most 64 outstanding
requests), some data corruption is reported by rds-stress. We are
working on tracking down the root cause.
- optimizing the send-completion notification API: our use-cases are
multi-threaded, and we want to be able to reuse buffers as soon
as possible (instead of waiting for the req-resp transaction to
complete). Sub-optimal design of the completion notification can
actually cause a perf deterioration (system-call overhead to
reap notification, throughput can go down because application does
not send "fast enough", even though latency is small), so this area
needs to be optimized carefully.
- additional test results beyond the rds-stress micro-benchmarks.
Sowmini Varadhan (6):
sock: MSG_PEEK support for sk_error_queue
skbuff: export mm_[un]account_pinned_pages for other modules
rds: hold a sock ref from rds_message to the rds_sock
sock: permit SO_ZEROCOPY on PF_RDS socket
rds: support for zcopy completion notification
rds: zerocopy Tx support.
drivers/net/tun.c | 2 +-
include/linux/skbuff.h | 3 +
include/net/sock.h | 2 +-
include/uapi/linux/rds.h | 1 +
net/core/skbuff.c | 6 ++-
net/core/sock.c | 14 +++++-
net/packet/af_packet.c | 3 +-
net/rds/af_rds.c | 3 +
net/rds/message.c | 119 ++++++++++++++++++++++++++++++++++++++++++++-
net/rds/rds.h | 16 +++++-
net/rds/recv.c | 3 +
net/rds/send.c | 41 ++++++++++++----
12 files changed, 192 insertions(+), 21 deletions(-)