From: Pavel Begunkov <asml.silence@gmail.com>
To: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: io-uring@vger.kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, Jakub Kicinski <kuba@kernel.org>,
	Jonathan Lemon <jonathan.lemon@gmail.com>,
	"David S . Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>,
	David Ahern <dsahern@kernel.org>, Jens Axboe <axboe@kernel.dk>
Subject: Re: [RFC 00/12] io_uring zerocopy send
Date: Wed, 1 Dec 2021 19:59:00 +0000
Message-ID: <0d82f4e2-730f-4888-ec82-2354ffa9c2d8@gmail.com>
In-Reply-To: <CA+FuTSf-N08d6pcbie2=zFcQJf3_e2dBJRUZuop4pOhNfSANUA@mail.gmail.com>

On 12/1/21 18:10, Willem de Bruijn wrote:
>> # performance:
>>
>> The worst case for io_uring is (4), still 1.88 times faster than
>> msg_zerocopy (2), and there are a couple of "easy" optimisations left
>> out from the patchset. For a 4096-byte payload zc only slightly
>> outperforms the non-zc version; the larger the payload, the wider
>> the gap. I'll get more numbers next time.
> 
>> Comparing (3) and (4), and (5) vs (6), @flush doesn't affect it too
>> much. Notification posting is not a big problem for now, but I need
>> to compare the performance for when io_uring_tx_zerocopy_callback()
>> is called from IRQ context, and possibly rework it to use task_work.
>>
>> It supports both regular and fixed buffers, but there is a bunch of
>> optimisations exclusively for io_uring's fixed buffers. For comparison,
>> normal vs fixed buffers (@nr_reqs=8, @flush=0): 75677 vs 116079 MB/s
>>
>> 1) we pass a bvec, so no page table walks.
>> 2) zerocopy_sg_from_iter() is just slow; adding a bvec-optimised version
>>     that still does page get/put (see 4/12) shaved off 4-5%.
>> 3) avoiding get_page/put_page in 5/12
>> 4) completion events are posted into io_uring's CQ, so no
>>     extra recvmsg for getting events
>> 5) no poll(2) in the code because of io_uring
>> 6) a lot of time is spent in sock_omalloc()/free allocating ubuf_info.
>>     io_uring caches the structures, reducing it to nearly zero overhead.
> 
> Nice set of complementary optimizations.
> 
> We have looked at adding some of those as independent additions to
> msg_zerocopy before, such as long-term pinned regions. One issue with
> that is that the pages must remain until the request completes,
> regardless of whether the calling process is alive. So it cannot rely
> on a pinned range held by a process only.
> 
> If feasible, it would be preferable if the optimizations can be added
> to msg_zerocopy directly, rather than adding a dependency on io_uring
> to make use of them. But not sure how feasible that is. For some, like
> 4 and 5, the answer is clearly it isn't.  6, it probably is?

And for 3), io_uring has a complex infra for keeping pages alive;
the additional overhead is almost one percpu_ref_put() per
request/notification, or even better in common cases. Not sure it's
feasible/possible with the current msg_zerocopy. Also, io_uring's
ubufs are kept as part of a larger structure, which may complicate
things.
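
A minimal user-space sketch of what "kept as part of a larger structure"
usually means in practice: the callback receives only a pointer to the
embedded ubuf and recovers the enclosing object from it. All toy_* names
are hypothetical and only model the idea; this is not the io_uring code
from the series.

```c
#include <assert.h>
#include <stddef.h>

/* Same idea as the kernel's container_of(), spelled out for user space. */
#define toy_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Stand-in for struct ubuf_info: just a completion callback. */
struct toy_ubuf {
	void (*callback)(struct toy_ubuf *uarg);
};

/* Hypothetical notifier that embeds the ubuf next to its own state. */
struct toy_notifier {
	unsigned int seq;	/* sequence number of the notification */
	int posted;		/* set once the completion has been posted */
	struct toy_ubuf uarg;	/* embedded, not separately allocated */
};

/* The callback sees only the ubuf and walks back to the notifier. */
static void toy_notifier_done(struct toy_ubuf *uarg)
{
	struct toy_notifier *n =
		toy_container_of(uarg, struct toy_notifier, uarg);

	n->posted = 1;
}

static void toy_notifier_init(struct toy_notifier *n, unsigned int seq)
{
	assert(n != NULL);
	n->seq = seq;
	n->posted = 0;
	n->uarg.callback = toy_notifier_done;
}
```

Because the ubuf lives inside the notifier, it cannot be allocated and
freed independently of it, which is one reason a generic alloc/free
path may not fit directly.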


>> # discussion / questions
>>
>> I haven't got a grasp on many aspects of the net stack yet, so would
>> appreciate feedback in general, and there are a couple of questions
>> and thoughts.
>>
>> 1) What are the initialisation rules for adding a new field to
>> struct msghdr? E.g. many users (mainly LLD) hand-code initialisation
>> without filling all the fields.
>>
>> 2) I don't much like the ubuf_info propagation from udp_sendmsg() into
>> __ip_append_data() (see 3/12). Ideas on how to do it better?
> 
> Agreed that both of these are less than ideal.
> 
> I can't comment too much on the io_uring aspect of the patch series.
> But msg_zerocopy is probably used in a small fraction of traffic (even
> if a high fraction for users who care about its benefits). We have to
> try to minimize the cost incurred on the general hot path.

One thing, I can hide the initial ubuf check at the beginning of
__ip_append_data() under a common

if (sock_flag(sk, SOCK_ZEROCOPY)) {}

But as SOCK_ZEROCOPY is more of a design problem workaround,
tbh I'm not sure I like it from the API perspective. Thoughts? I hope
I can also shuffle some of the stuff in 5/12 out of the
hot path, need to dig a bit deeper.
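
A toy model of the gating idea above, assuming the goal is that sends
on sockets without the flag pay for a single flag test and never look
at the ubuf. The toy_* types and TOY_SOCK_ZEROCOPY are hypothetical
stand-ins, not the kernel's struct sock, struct msghdr or sock_flag().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { TOY_SOCK_ZEROCOPY = 1u << 0 };	/* stand-in for SOCK_ZEROCOPY */

struct toy_ubuf { int id; };
struct toy_sock { unsigned int flags; };
struct toy_msghdr { struct toy_ubuf *msg_ubuf; };

static bool toy_sock_flag(const struct toy_sock *sk, unsigned int flag)
{
	return (sk->flags & flag) != 0;
}

/*
 * Pick the ubuf to attach to outgoing data. The common case (flag not
 * set) returns after one test; the caller-supplied ubuf is inspected
 * only inside the gated branch.
 */
static struct toy_ubuf *toy_select_uarg(const struct toy_sock *sk,
					const struct toy_msghdr *msg)
{
	assert(sk != NULL && msg != NULL);

	if (toy_sock_flag(sk, TOY_SOCK_ZEROCOPY)) {
		if (msg->msg_ubuf)
			return msg->msg_ubuf;
	}
	return NULL;
}
```

The API concern is visible in the model: a caller that passes msg_ubuf
on a socket without the flag silently gets a copying send.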

> I was going to suggest using the standard msg_zerocopy ubuf_info
> alloc/free mechanism. But you explicitly mention seeing omalloc/ofree
> in the cycle profile.
> 
> It might still be possible to somehow signal to msg_zerocopy_alloc
> that this is being called from within an io_uring request, and
> therefore should use a pre-existing uarg with different
> uarg->callback. If nothing else, some info can be passed as a cmsg.
> But perhaps there is a more direct pointer path to follow from struct
> sk, say? Here my limited knowledge of io_uring forces me to hand wave.

One thing I consider important though is to be able to specify a
ubuf per request, rather than somehow registering it in a socket. It's
more flexible from the userspace API perspective. Registering would
also need constant register/unregister, and there are concerns with
referencing/cancellations; that's where it came from in the first
place.

IOW, I'd really prefer to pass it down on a per request basis.
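
To illustrate the per-request approach, a small user-space model in
which each send carries its own ubuf pointer, so two requests on the
same socket can complete into different notifiers with no
register/unregister step. All toy_* names are hypothetical, and the
"transmit" completes immediately, which a real stack does not do.

```c
#include <assert.h>
#include <stddef.h>

struct toy_ubuf {
	void (*callback)(struct toy_ubuf *uarg);
	int completions;	/* sends completed against this ubuf */
};

/* The ubuf travels with the message, not with the socket. */
struct toy_msghdr {
	const void *data;
	size_t len;
	struct toy_ubuf *msg_ubuf;	/* may be NULL for a plain send */
};

struct toy_sock {
	size_t sent_bytes;
};

static void toy_count_completion(struct toy_ubuf *uarg)
{
	uarg->completions++;
}

/* Pretend to transmit, then notify whichever ubuf this request named. */
static size_t toy_sendmsg(struct toy_sock *sk, const struct toy_msghdr *msg)
{
	assert(sk != NULL && msg != NULL);

	sk->sent_bytes += msg->len;
	if (msg->msg_ubuf && msg->msg_ubuf->callback)
		msg->msg_ubuf->callback(msg->msg_ubuf);
	return msg->len;
}
```

The socket holds no ubuf state in this model, so there is nothing to
tear down when a request is cancelled or the notifier is switched.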

> Probably also want to see how all this would integrate with TCP. In
> some ways, that might be easier, as it does not have the indirection
> through ip_make_skb, etc.

Worked well in general, but the patches I used should be broken for
some inputs after adding 5/12, so they need some work. Will send next time.

-- 
Pavel Begunkov

Thread overview: 43+ messages
2021-11-30 15:18 [RFC 00/12] io_uring zerocopy send Pavel Begunkov
2021-11-30 15:18 ` [RFC 01/12] skbuff: add SKBFL_DONT_ORPHAN flag Pavel Begunkov
2021-11-30 15:18 ` [RFC 02/12] skbuff: pass a struct ubuf_info in msghdr Pavel Begunkov
2021-11-30 15:18 ` [RFC 03/12] net/udp: add support msgdr::msg_ubuf Pavel Begunkov
2021-11-30 15:18 ` [RFC 04/12] net: add zerocopy_sg_from_iter for bvec Pavel Begunkov
2021-11-30 15:18 ` [RFC 05/12] net: optimise page get/free for bvec zc Pavel Begunkov
2021-12-01 19:20   ` Jonathan Lemon
2021-12-01 20:17     ` Pavel Begunkov
2021-11-30 15:18 ` [RFC 06/12] io_uring: add send notifiers registration Pavel Begunkov
2021-11-30 15:18 ` [RFC 07/12] io_uring: infrastructure for send zc notifications Pavel Begunkov
2021-11-30 15:18 ` [RFC 08/12] io_uring: wire send zc request type Pavel Begunkov
2021-11-30 15:18 ` [RFC 09/12] io_uring: add an option to flush zc notifications Pavel Begunkov
2021-11-30 15:18 ` [RFC 10/12] io_uring: opcode independent fixed buf import Pavel Begunkov
2021-11-30 15:18 ` [RFC 11/12] io_uring: sendzc with fixed buffers Pavel Begunkov
2021-11-30 23:22   ` kernel test robot
2021-12-01  9:18   ` kernel test robot
2021-11-30 15:19 ` [RFC 12/12] io_uring: cache struct ubuf_info Pavel Begunkov
2021-12-01  3:10 ` [RFC 00/12] io_uring zerocopy send David Ahern
2021-12-01 15:32   ` Pavel Begunkov
2021-12-01 17:57     ` David Ahern
2021-12-01 19:11       ` Pavel Begunkov
2021-12-01 19:20         ` David Ahern
2021-12-01 20:15           ` Pavel Begunkov
2021-12-01 21:51             ` Martin KaFai Lau
2021-12-01 22:35               ` David Ahern
2021-12-01 23:07                 ` Martin KaFai Lau
2021-12-01 23:18                   ` Pavel Begunkov
2021-12-02 15:48               ` Pavel Begunkov
2021-12-02 17:40                 ` Martin KaFai Lau
2021-12-01 20:42       ` Pavel Begunkov
2021-12-01 14:31 ` Pavel Begunkov
2021-12-01 17:49   ` David Ahern
2021-12-01 19:59     ` Pavel Begunkov
2021-12-01 18:10 ` Willem de Bruijn
2021-12-01 19:59   ` Pavel Begunkov [this message]
2021-12-01 20:29     ` Pavel Begunkov
2021-12-02  0:36       ` Willem de Bruijn
2021-12-02 16:25         ` Pavel Begunkov
2021-12-02  0:32     ` Willem de Bruijn
2021-12-02 16:45       ` Pavel Begunkov
2021-12-02 21:25         ` Willem de Bruijn
2021-12-03 16:19           ` Pavel Begunkov
2021-12-03 16:30             ` Willem de Bruijn
