From: John Fastabend <john.fastabend@gmail.com>
To: Andrii Nakryiko <andrii.nakryiko@gmail.com>,
	 john fastabend <john.fastabend@gmail.com>,
	 bpf <bpf@vger.kernel.org>,  Networking <netdev@vger.kernel.org>
Cc: "davidhwei@meta.com" <davidhwei@meta.com>
Subject: RE: Sockmap's parser/verdict programs and epoll notifications
Date: Sun, 16 Jul 2023 21:37:29 -0700
Message-ID: <64b4c5891096b_2b67208f@john.notmuch>
In-Reply-To: <CAEf4BzYMAAhwscTWWTenvyr-PQ7E5tMg_iqXsPj_dyZEMVCrKg@mail.gmail.com>

Andrii Nakryiko wrote:
> Hey John,

Sorry, I missed this while I was on PTO that week.

> 
> We've been recently experimenting with using BPF_SK_SKB_STREAM_PARSER
> and BPF_SK_SKB_STREAM_VERDICT with sockmap/sockhash to perform
> in-kernel parsing of RSocket frames. A very simple format ([0]) where
> the first 3 bytes specify the size of the frame payload. The idea was
> to collect the entire frame in the kernel before notifying user-space
> that data is available. This is meant to minimize unnecessary wakeups
> due to incomplete logical frames, saving CPU.

Nice.
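
For readers following along, the length computation described above can be
sketched in plain C. This is not the BPF program from [1]; it is a minimal
user-space illustration. The helper name rsocket_frame_total_len is
hypothetical, and it assumes the 3-byte prefix is a big-endian unsigned
integer that counts the payload only (excluding the prefix itself), which is
how I read the RSocket framing format in [0]. Returning 0 when fewer than
3 bytes are available mirrors strparser's convention that 0 means "need more
data".

```c
#include <stddef.h>
#include <stdint.h>

/* Size of the RSocket length prefix in bytes (from the framing format). */
#define RSOCKET_LEN_PREFIX 3

/*
 * Hypothetical helper, not taken from the original program.
 *
 * Returns the total number of bytes (prefix + payload) occupied by the
 * frame that starts at buf, or 0 if fewer than 3 bytes are available and
 * the length cannot be decoded yet.
 *
 * Assumption: the prefix is a 24-bit big-endian value that excludes the
 * prefix itself.
 */
static uint32_t rsocket_frame_total_len(const uint8_t *buf, size_t avail)
{
	uint32_t payload;

	if (avail < RSOCKET_LEN_PREFIX)
		return 0;

	payload = ((uint32_t)buf[0] << 16) |
		  ((uint32_t)buf[1] << 8) |
		  (uint32_t)buf[2];

	return RSOCKET_LEN_PREFIX + payload;
}
```

A stream parser would return a value like this so that the kernel waits for
the whole frame before handing it to the verdict program.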

> 
> You can find the BPF source code I've used at [1], it has lots of
> extra logging and stuff, but the idea is to read the first 3 bytes of
> each logical frame, and return the expected full frame size from the
> parser program. The verdict program always just returns SK_PASS.
> 
> This seems to work exactly as expected in manual simulations of
> various packet size distributions, and even for a bunch of
> ping/pong-like benchmarks (which are very sensitive to correct frame
> length determination, so I'm reasonably confident we don't screw that
> up much). And yet, when benchmarking sending multiple logical RPC
> streams over the same single socket (so many interleaving RSocket
> frames on a single socket, but in terms of logical frames nothing should
> change), we often see that while the full frame hasn't been accumulated in
> the socket receive buffer yet, epoll_wait() for that socket would return
> with success, notifying user space that there is data on the socket.
> A subsequent recvfrom() call would immediately return -EAGAIN and no
> data, and our benchmark would go into a loop of useless
> epoll_wait()+recvfrom() calls back to back, many times over.

Aha, yes, this sounds bad.
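
One way to quantify the symptom from user space is to count how often a
readable notification is followed by an empty read. The sketch below is only
an illustration under stated assumptions: read_after_wakeup is a hypothetical
helper, not code from the benchmark, and it assumes the caller invokes it once
per epoll_wait() wakeup on the affected socket.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical helper: try one non-blocking read after epoll reported the
 * socket as readable.
 *
 * Returns the number of bytes read (> 0), 0 if the wakeup was spurious
 * (recvfrom() failed with EAGAIN/EWOULDBLOCK, in which case *spurious is
 * incremented), or -1 on EOF or any other error.
 */
static ssize_t read_after_wakeup(int fd, void *buf, size_t cap,
				 unsigned long *spurious)
{
	ssize_t n = recvfrom(fd, buf, cap, MSG_DONTWAIT, NULL, NULL);

	if (n > 0)
		return n;

	if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
		(*spurious)++;
		return 0;
	}

	return -1;
}
```

Comparing the spurious counter against the total number of wakeups gives a
rough measure of how many epoll notifications arrived without a complete
frame to read.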

> 
> So I have a few questions:
>   - is the above use case something that was meant to be handled by
> sockmap+parser/verdict?

We shouldn't wake up user space if there is nothing to read. So
yes this seems like a valid use case to me.

>   - is it correct to assume that epoll won't wake up until the number of
> bytes requested by the parser program is accumulated (this seems to be the
> case from manually experimenting with various "packet delays")?

It seems there is a race that causes it to wake up user space.
I'm aware of a couple of bugs in the stream parser
that I wanted to fix. Not sure I can get to them this week,
but I should have time next week. We have a couple more fixes
to resolve a few HTTPS server compliance tests as well.

>   - is there some known bug or race in how sockmap and the strparser
> framework interact with the epoll subsystem that could cause this weird
> epoll_wait() behavior?

Yes, I know of some races in strparser. I'll elaborate later,
probably with patches, as I don't recall them readily at the
moment.

> 
> It does seem like some sort of timing issue, but I couldn't pin down
> exactly the conditions under which this happens. But it's quite
> reproducible with a pretty high frequency using our internal benchmark
> when multiple logical streams are involved.
> 
> Any thoughts or suggestions?

Seems like a bug; we should fix it. I'm aware of a couple of
issues with the stream parser that we plan to fix, so it could
be one of those or a new one I'm not aware of. I'll take
a closer look next week.

>   [0] https://rsocket.io/about/protocol/#framing-format
>   [1] https://github.com/anakryiko/libbpf-bootstrap/blob/thrift-coalesce-rcvlowat/examples/c/bootstrap.bpf.c
> 
> -- Andrii

Thread overview: 11+ messages
2023-07-07 18:30 Sockmap's parser/verdict programs and epoll notifications Andrii Nakryiko
2023-07-17  4:37 ` John Fastabend [this message]
2023-09-08 22:46   ` Andrii Nakryiko
2023-09-11 14:43     ` John Fastabend
2023-09-11 18:01       ` Andrii Nakryiko
2023-10-03  5:04         ` John Fastabend
2023-10-03  5:16           ` John Fastabend
2023-10-03 12:41             ` Jakub Kicinski
2023-10-03 22:22               ` Andrii Nakryiko
2023-10-04  5:40                 ` John Fastabend
2023-10-03 22:18             ` Andrii Nakryiko
