From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
To: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: Network Development <netdev@vger.kernel.org>,
David Miller <davem@davemloft.net>,
rds-devel@oss.oracle.com, santosh.shilimkar@oracle.com,
sowmini.varadhan@oracle.com
Subject: Re: [PATCH RFC net-next 5/6] rds: support for zcopy completion notification
Date: Thu, 18 Jan 2018 06:40:24 -0500 [thread overview]
Message-ID: <20180118114024.GB24920@oracle.com> (raw)
In-Reply-To: <CAF=yD-Jj8Ldb6xADdn2+oiXFK8nmDL5nKyoi8hdtbvozOzW-ag@mail.gmail.com>
On (01/17/18 19:23), Willem de Bruijn wrote:
> > +static void rds_rm_zerocopy_callback(struct rds_sock *rs)
:
> > + skb = alloc_skb(ncookies * sizeof(u32), GFP_ATOMIC);
>
> TCP zerocopy avoids this issue by allocating the notification skb when the
> zerocopy packet is created, which would be rds_message_copy_from_user.
right, I could allocate the skb when we set up the zcopy data strucutres.
> This does not add an allocation, if using the same trick of stashing
> the intermediate notification object (op_mmp_znotifier) in skb->cb. Though,
> alloc_skb is probably more expensive than that kzalloc. If nothing else,
> because of more zeroing.
I would have liked to reuse skb->cb, but had to fall back to the alloc_skb
because of the attempt to fall back to flexibly numbered (and sized) cookie
notifications.
If we choose the "max number of cookies" option discussed in the previous
thread, I think I should be able to do this with the existing 48 byte sized
cb and pack in 8 32 bit cookies after the sock_extended_err.. maybe
that's sufficient.
> > + serr = SKB_EXT_ERR(skb);
> > + serr->ee.ee_errno = 0;
> > + serr->ee.ee_origin = SO_EE_ORIGIN_ZEROCOPY;
> > + serr->ee.ee_data = ncookies;
>
> This changes the semantics of these fields. Please add a new SO_EE_CODE flag,
> even if the semantics can be derived from the packet family (for now).
sounds good, I'll take care of this (and other review comments)
for the next rev.
> Even better would be if we can avoid the cookies completely. I understand
> the issue with concurrent send threads racing on obtaining a zckey value. If
> the sender could learn which zckey was chosen for a call, would that suffice?
I'm not sure I understand the proposal- you want sendmsg to return
the cookie ("zckey") for you? How?
even if we mangled sendmsg() to return something other than
the existing POSIX semantics, how will the application be asynchronously
notified that send has completed (on a per-message/per-datagram) basis?
> I suspect that in even with concurrent senders, notifications arrive
> largely in
> order, in which case we could just maintain the existing semantics and even
> reuse that implementation.
not so. rds-stress [1] with -d8 -t8 quickly disproves this on my 10G ixgbe
connection.
When you have multiple threads writing to a socket, you cannot know
what was the "order of send", unless you bottleneck all the threads
to go through a single-point-of-send. rds-stress is an example of this
sort of usage (fairly typical in our applications)
[1] http://public-yum.oracle.com/repo/OracleLinux/OL6/ofed_UEK/x86_64//getPackageSource/rds-tools-2.0.7-1.12.el6.src.rpm
Consider 2 RDS sockets fd1 and fd2, each one sending to the
same pair of IP addresses: if fd1 gets bound to tcp sock1, fd2 to tcp sock2,
it's very possible that the send completion is not in the same order
as the order of send. The application cannot know which socket gets
bound to which TCP connection (or even whether/how-many tcp connections
are involved: mprds strives to make this transparent to the application)
Same problem exists for pure-datagram sockets like PF_PACKET, UDP etc.
application may send buf1 (datagram1) and buf2 (datagram2), and buf2
may make it to the destination before buf1. The "notifications
arrive largely in order" may be mostly true about a single-stream TCP
connection but not for the datagram models (or even threaded tcp apps
like iperf?)..
> > + tail = skb_peek_tail(q);
> > + if (!tail ||
> > + SKB_EXT_ERR(tail)->ee.ee_origin != SO_EE_ORIGIN_ZEROCOPY) {
> > + __skb_queue_tail(q, skb);
> > + skb = NULL;
> > + }
>
> This drops the packet if the branch is not taken. In the TCP case
> this condition
> means that we can try to coalesce packets, but that is not the case here.
good point, I'll check into this and fix (same applies for the comments
to patches 4/6 and 6/6)
> > } rdma;
> > struct rm_data_op {
> > unsigned int op_active:1;
> > - unsigned int op_notify:1;
> > + unsigned int op_notify:1,
> > + op_zcopy:1,
:
> > + struct rds_znotifier *op_mmp_znotifier;
>
> not necessary if op_mmp_znotifier is NULL unless set in
> rds_message_copy_from_user
To make sure I dont misunderstand, you are suggesting that we dont
need op_zcopy, but can just check for the null-ness of
op_mmp_znotifier (yes, true, I agree)? or something else?
--Sowmini
next prev parent reply other threads:[~2018-01-18 11:40 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-01-17 12:19 [PATCH RFC net-next 0/6] rds: zerocopy support Sowmini Varadhan
2018-01-17 12:19 ` [PATCH RFC net-next 1/6] sock: MSG_PEEK support for sk_error_queue Sowmini Varadhan
2018-01-17 23:50 ` Willem de Bruijn
2018-01-18 11:02 ` Sowmini Varadhan
2018-01-18 15:54 ` Eric Dumazet
2018-01-18 16:10 ` Sowmini Varadhan
2018-01-18 16:53 ` Eric Dumazet
2018-01-18 17:12 ` Sowmini Varadhan
2018-01-18 22:54 ` Willem de Bruijn
2018-01-18 23:03 ` Sowmini Varadhan
2018-01-18 23:09 ` Willem de Bruijn
2018-01-18 23:20 ` Sowmini Varadhan
2018-01-18 23:24 ` Willem de Bruijn
2018-01-18 15:51 ` Eric Dumazet
2018-01-17 12:20 ` [PATCH RFC net-next 2/6] skbuff: export mm_[un]account_pinned_pages for other modules Sowmini Varadhan
2018-01-17 12:20 ` [PATCH RFC net-next 3/6] rds: hold a sock ref from rds_message to the rds_sock Sowmini Varadhan
2018-01-17 12:20 ` [PATCH RFC net-next 4/6] sock: permit SO_ZEROCOPY on PF_RDS socket Sowmini Varadhan
2018-01-18 0:03 ` Willem de Bruijn
2018-01-17 12:20 ` [PATCH RFC net-next 5/6] rds: support for zcopy completion notification Sowmini Varadhan
2018-01-18 0:23 ` Willem de Bruijn
2018-01-18 11:40 ` Sowmini Varadhan [this message]
2018-01-18 22:46 ` Willem de Bruijn
2018-01-17 12:20 ` [PATCH RFC net-next 6/6] rds: zerocopy Tx support Sowmini Varadhan
2018-01-18 0:32 ` Willem de Bruijn
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20180118114024.GB24920@oracle.com \
--to=sowmini.varadhan@oracle.com \
--cc=davem@davemloft.net \
--cc=netdev@vger.kernel.org \
--cc=rds-devel@oss.oracle.com \
--cc=santosh.shilimkar@oracle.com \
--cc=willemdebruijn.kernel@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).