* Sockmap's parser/verdict programs and epoll notifications
@ 2023-07-07 18:30 Andrii Nakryiko
To: John Fastabend, bpf, Networking; Cc: davidhwei@meta.com

Hey John,

We've been recently experimenting with using BPF_SK_SKB_STREAM_PARSER
and BPF_SK_SKB_STREAM_VERDICT with sockmap/sockhash to perform
in-kernel parsing of RSocket frames. It's a very simple format ([0]) in
which the first 3 bytes specify the size of the frame payload. The idea
was to collect the entire frame in the kernel before notifying user
space that data is available. This is meant to minimize unnecessary
wakeups due to incomplete logical frames, saving CPU.

You can find the BPF source code I've used at [1]. It has lots of extra
logging and such, but the idea is to read the first 3 bytes of each
logical frame and return the expected full frame size from the parser
program. The verdict program always just returns SK_PASS.

This seems to work exactly as expected in manual simulations of various
packet size distributions, and even for a bunch of ping/pong-like
benchmarks (which are very sensitive to correct frame length
determination, so I'm reasonably confident we don't screw that up
much). And yet, when benchmarking sending multiple logical RPC streams
over the same single socket (so many interleaving RSocket frames on a
single socket, though in terms of logical frames nothing should
change), we often see that while the full frame hasn't been accumulated
in the socket receive buffer yet, epoll_wait() for that socket returns
with success, notifying user space that there is data on the socket. A
subsequent recvfrom() call immediately returns -EAGAIN and no data, and
our benchmark goes into a loop of useless back-to-back
epoll_wait()+recvfrom() calls, many times over.

So I have a few questions:
- is the above use case something that was meant to be handled by
  sockmap+parser/verdict?
- is it correct to assume that epoll won't wake up until the amount of
  bytes requested by the parser program has been accumulated (this
  seems to be the case from manual experiments with various "packet
  delays")?
- is there some known bug or race in how the sockmap and strparser
  frameworks interact with the epoll subsystem that could cause this
  weird epoll_wait() behavior?

It does seem like some sort of timing issue, but I couldn't pin down
exactly what conditions it happens under. But it's quite reproducible
with pretty high frequency using our internal benchmark when multiple
logical streams are involved.

Any thoughts or suggestions?

  [0] https://rsocket.io/about/protocol/#framing-format
  [1] https://github.com/anakryiko/libbpf-bootstrap/blob/thrift-coalesce-rcvlowat/examples/c/bootstrap.bpf.c

-- Andrii

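[For reference, a minimal sketch of the parser/verdict pair described
above. The real program in [1] carries extra logging; the map layout,
program names, and big-endian interpretation of the 3-byte length
prefix here are assumptions based on the description above and the
framing doc in [0], not code copied from the linked source.]

    // SPDX-License-Identifier: GPL-2.0
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
            __uint(type, BPF_MAP_TYPE_SOCKMAP);
            __uint(max_entries, 1);
            __type(key, int);
            __type(value, int);
    } sock_map SEC(".maps");

    /* Stream parser: return the full logical frame length so strparser
     * accumulates the whole frame before the verdict program runs.
     * Assumed RSocket-style framing: 3-byte big-endian payload length,
     * not counting the prefix itself.
     */
    SEC("sk_skb/stream_parser")
    int frame_parser(struct __sk_buff *skb)
    {
            __u8 hdr[3];

            /* Not even the 3-byte length prefix yet: ask for more data. */
            if (skb->len < 3)
                    return 0;
            if (bpf_skb_load_bytes(skb, 0, hdr, sizeof(hdr)))
                    return 0;

            /* 3-byte prefix + frame payload. */
            return 3 + ((hdr[0] << 16) | (hdr[1] << 8) | hdr[2]);
    }

    SEC("sk_skb/stream_verdict")
    int frame_verdict(struct __sk_buff *skb)
    {
            return SK_PASS;
    }

    char LICENSE[] SEC("license") = "GPL";
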
* RE: Sockmap's parser/verdict programs and epoll notifications
@ 2023-07-17  4:37 ` John Fastabend

Andrii Nakryiko wrote:
> Hey John,

Sorry, missed this while I was on PTO that week.

> We've been recently experimenting with using BPF_SK_SKB_STREAM_PARSER
> and BPF_SK_SKB_STREAM_VERDICT with sockmap/sockhash to perform
> in-kernel parsing of RSocket frames. [...] The idea was to collect the
> entire frame in the kernel before notifying user space that data is
> available. This is meant to minimize unnecessary wakeups due to
> incomplete logical frames, saving CPU.

Nice.

> [...] we often see that while the full frame hasn't been accumulated
> in the socket receive buffer yet, epoll_wait() for that socket returns
> with success, notifying user space that there is data on the socket. A
> subsequent recvfrom() call immediately returns -EAGAIN and no data,
> and our benchmark goes into a loop of useless back-to-back
> epoll_wait()+recvfrom() calls, many times over.

Aha, yes, this sounds bad.

> So I have a few questions:
> - is the above use case something that was meant to be handled by
>   sockmap+parser/verdict?

We shouldn't wake up user space if there is nothing to read. So yes,
this seems like a valid use case to me.

> - is it correct to assume that epoll won't wake up until the amount of
>   bytes requested by the parser program has been accumulated?

Seems there is some bug that races and causes it to wake up user space.
I'm aware of a couple of bugs in the stream parser that I wanted to
fix. Not sure I can get to them this week, but I should have time next
week. We have a couple more fixes to resolve a few HTTPS server
compliance tests as well.

> - is there some known bug or race in how the sockmap and strparser
>   frameworks interact with the epoll subsystem that could cause this
>   weird epoll_wait() behavior?

Yes, I know of some races in strparser. I'll elaborate later, probably
with patches, as I don't recall them readily at the moment.

> It does seem like some sort of timing issue, but I couldn't pin down
> exactly what conditions it happens under. But it's quite reproducible
> with pretty high frequency using our internal benchmark when multiple
> logical streams are involved.
>
> Any thoughts or suggestions?

Seems like a bug; we should fix it. I'm aware of a couple of issues
with the stream parser that we plan to fix, so it could be one of those
or a new one I'm not aware of. I'll take a closer look next week.

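[For concreteness, the failure mode reported above looks roughly like
this from user space. This is a schematic repro loop based on the
symptom described, not code from the actual benchmark; the function and
buffer names are illustrative.]

    #include <errno.h>
    #include <sys/epoll.h>
    #include <sys/socket.h>

    /* Schematic of the reported busy loop: epoll says the socket is
     * readable, but the stream parser is still holding a partial frame,
     * so the non-blocking read returns -EAGAIN and we spin.
     */
    void busy_loop(int epfd, int fd)
    {
            struct epoll_event ev;
            char buf[4096];

            for (;;) {
                    if (epoll_wait(epfd, &ev, 1, -1) < 1)
                            continue;
                    ssize_t n = recvfrom(fd, buf, sizeof(buf),
                                         MSG_DONTWAIT, NULL, NULL);
                    if (n < 0 && errno == EAGAIN)
                            continue; /* spurious wakeup, back to epoll_wait() */
                    /* ... process n bytes of a complete frame ... */
            }
    }
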
* Re: Sockmap's parser/verdict programs and epoll notifications
@ 2023-09-08 22:46 ` Andrii Nakryiko

On Sun, Jul 16, 2023 at 9:37 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Andrii Nakryiko wrote:
> > Hey John,
>
> Sorry, missed this while I was on PTO that week.

yeah, vacations tend to cause missing things :)

> [...]
>
> > - is there some known bug or race in how the sockmap and strparser
> >   frameworks interact with the epoll subsystem that could cause this
> >   weird epoll_wait() behavior?
>
> Yes, I know of some races in strparser. I'll elaborate later, probably
> with patches, as I don't recall them readily at the moment.

So I missed a good chunk of BPF mailing list traffic while I was on my
PTO. Did you end up getting to these bugs in the strparser logic?
Should I try running the latest bpf-next/net-next on our production
workload to see if this is still happening?

* Re: Sockmap's parser/verdict programs and epoll notifications
@ 2023-09-11 14:43 ` John Fastabend

Andrii Nakryiko wrote:
> On Sun, Jul 16, 2023 at 9:37 PM John Fastabend <john.fastabend@gmail.com> wrote:
> > [...]
> > Yes, I know of some races in strparser. I'll elaborate later,
> > probably with patches, as I don't recall them readily at the moment.
>
> So I missed a good chunk of BPF mailing list traffic while I was on my
> PTO. Did you end up getting to these bugs in the strparser logic?
> Should I try running the latest bpf-next/net-next on our production
> workload to see if this is still happening?

You will likely still hit the error; I haven't gotten it out of my
queue yet. I just knocked off a couple of things last week, so I could
probably take a look at flushing my queue this week. Then it would make
sense to retest to see if it's something new or not.

I'll at least send an RFC with the idea even if I don't get to testing
it yet.

Thanks,
John

* Re: Sockmap's parser/verdict programs and epoll notifications
@ 2023-09-11 18:01 ` Andrii Nakryiko

On Mon, Sep 11, 2023 at 7:43 AM John Fastabend <john.fastabend@gmail.com> wrote:
> [...]
> You will likely still hit the error; I haven't gotten it out of my
> queue yet. [...] Then it would make sense to retest to see if it's
> something new or not.
>
> I'll at least send an RFC with the idea even if I don't get to testing
> it yet.

Sounds good, thanks a lot!

* Re: Sockmap's parser/verdict programs and epoll notifications
@ 2023-10-03  5:04 ` John Fastabend

Andrii Nakryiko wrote:
> On Mon, Sep 11, 2023 at 7:43 AM John Fastabend <john.fastabend@gmail.com> wrote:
> > [...]
> > I'll at least send an RFC with the idea even if I don't get to
> > testing it yet.
>
> Sounds good, thanks a lot!

Hi Andrii,

Finally got around to thinking about this. Also, I believe we have the
verdict programs mostly fixed up to handle polling correctly now. The
problem was incorrect handling of the tcp_sock copied_seq var, which is
used by tcp_epollin_ready() to wake up the application. It's also used
to calculate responses to some ioctls we found servers using to decide
when to actually do a recv, e.g. they wait on the ioctl until enough
bytes are received.

The trick is to ensure we only update copied_seq when the bytes are in
fact actually ready to read from the socket queue. The sockmap verdict
program code was incrementing it before running the verdict prog, so we
raced with user space. It kind of works in many cases because we are
holding the sock lock in many of those paths, which blocks the user
space recvmsg().

Now to your problem as I understand it. You are trying to use the
parser program to hold some N bytes, where N is the message block, at
which point it gets pushed to a verdict prog and finally queued in the
msg receive queue so a recv*() syscall can actually read it. The parser
program, unlike a standalone verdict prog, causes the skb to run
through the stream parser to collect bytes and then run the verdict
program. The stream parser uses tcp_read_sock(), which increments
copied_seq immediately, even before the verdict prog is run, so I
expect the odd behavior you see is that race completing. It likely
mostly works because we hold the sock lock for lots of the code, making
the race window smaller than it might otherwise appear. I didn't do a
full analysis, but it might just be when we hit an ENOMEM condition and
need to back off, which might explain why you only see the issue when
you run with larger envs.

It feels a bit suboptimal in your case to run two BPF programs and
parser logic compared to a single verdict program. Could we just add a
BPF helper we can run from the verdict program to only wake up user
space after N bytes? To mirror the sk_msg programs, we might call it
bpf_skb_cork_bytes(skb, bytes, flags). We could use flags to decide
whether we need to call the prog again with the new full-sized skb or
can just process it directly without the extra BPF call.

This, together with the other piece we want from our side (allowing
verdict and sk_msg programs to run on sockets without having them in a
sockmap/sockhash), would seem like a better system to me. The idea to
drop the sockmap/sockhash is that we never remove progs once they are
added, and we add them from the sockops side. The filter to sockets is
almost always the port plus metadata related to the process or
environment. This avoids having to manage the sockmap/sockhash and
guess what size it should be. Sometimes we overrun these maps and have
to kill connections until we can get more space.

For your case I would expect it to be (a) simpler (just a single
program to manage instead of two programs and a map) and (b) more
efficient to call one prog in the datapath instead of two.

WDYT?

Thanks,
John

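[A rough sketch of how the proposed single-program flow might look on
the BPF side. bpf_skb_cork_bytes() is only a proposal in this thread,
not an existing kernel helper; its name, signature, flags, and
semantics below are guesses mirroring bpf_msg_cork_bytes() for sk_msg
programs, and the frame parsing repeats the RSocket assumptions from
the sketch earlier in the thread.]

    /* HYPOTHETICAL: bpf_skb_cork_bytes() does not exist in any kernel;
     * signature assumed from John's proposal above.
     */
    extern long bpf_skb_cork_bytes(struct __sk_buff *skb, __u32 bytes,
                                   __u64 flags) __ksym;

    SEC("sk_skb/stream_verdict")
    int rsocket_verdict(struct __sk_buff *skb)
    {
            __u8 hdr[3];
            __u32 frame_len;

            /* Wait for at least the 3-byte length prefix. */
            if (skb->len < 3) {
                    bpf_skb_cork_bytes(skb, 3, 0);
                    return SK_PASS;
            }
            if (bpf_skb_load_bytes(skb, 0, hdr, sizeof(hdr)))
                    return SK_DROP;

            /* Cork until the full logical frame (prefix + payload) is
             * queued, so user space only wakes for complete frames. */
            frame_len = 3 + ((hdr[0] << 16) | (hdr[1] << 8) | hdr[2]);
            if (skb->len < frame_len)
                    bpf_skb_cork_bytes(skb, frame_len, 0);

            return SK_PASS;
    }
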
* Re: Sockmap's parser/verdict programs and epoll notifications
@ 2023-10-03  5:16 ` John Fastabend

John Fastabend wrote:
> [...]
>
> It feels a bit suboptimal in your case to run two BPF programs and
> parser logic compared to a single verdict program. Could we just add a
> BPF helper we can run from the verdict program to only wake up user
> space after N bytes? To mirror the sk_msg programs, we might call it
> bpf_skb_cork_bytes(skb, bytes, flags). [...]
>
> For your case I would expect it to be (a) simpler (just a single
> program to manage instead of two programs and a map) and (b) more
> efficient to call one prog in the datapath instead of two.
>
> WDYT?

On second thought, I'll also fix the existing stream parser code here
shortly. It's a bit broken if I just leave it as is, but I still like
the idea of a new helper.

* Re: Sockmap's parser/verdict programs and epoll notifications
@ 2023-10-03 12:41 ` Jakub Kicinski

On Mon, 02 Oct 2023 22:16:13 -0700 John Fastabend wrote:
> > This, together with the other piece we want from our side (allowing
> > verdict and sk_msg programs to run on sockets without having them in
> > a sockmap/sockhash), would seem like a better system to me. The idea
> > to drop the sockmap/sockhash is that we never remove progs once they
> > are added, and we add them from the sockops side. The filter to
> > sockets is almost always the port plus metadata related to the
> > process or environment. This avoids having to manage the
> > sockmap/sockhash and guess what size it should be. Sometimes we
> > overrun these maps and have to kill connections until we can get
> > more space.

That's a step in the right direction for sure, but I still think that
Google's auto-lowat is the best approach. We just need a hook that
looks at incoming data and sets rcvlowat appropriately. That's it.
TCP looks at rcvlowat in a number of places to make protocol decisions,
not just the wake-up. Plus Google will no longer have to carry their
OOT patch..

* Re: Sockmap's parser/verdict programs and epoll notifications
@ 2023-10-03 22:22 ` Andrii Nakryiko

On Tue, Oct 3, 2023 at 5:42 AM Jakub Kicinski <kuba@kernel.org> wrote:
> [...]
> That's a step in the right direction for sure, but I still think that
> Google's auto-lowat is the best approach. We just need a hook that
> looks at incoming data and sets rcvlowat appropriately. That's it.
> TCP looks at rcvlowat in a number of places to make protocol
> decisions, not just the wake-up. Plus Google will no longer have to
> carry their OOT patch..

David can correct me, but when he tried the SO_RCVLOWAT approach to
solving this problem, he saw no improvement (it might actually have
been a regression in terms of behavior). I'd say that sounds a bit
suspicious, and we have plans to get back to SO_RCVLOWAT and try to
understand the behavior better.

I'll just say that the simpler the solution, the better. And if this
rcvlowat hook gets us the ability to delay the network notification to
user space until a full logical packet is assembled (where the packet
size is provided by a BPF program without user-space involvement, up to
some reasonable limits, of course), that would be great.

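[For context, the manual SO_RCVLOWAT approach under discussion boils
down to something like the following from user space today. This is a
minimal sketch; the per-frame round trip through user space to learn
the length is exactly the part the proposed in-kernel auto-lowat hook
would eliminate.]

    #include <stdint.h>
    #include <sys/socket.h>

    /* After parsing a frame's 3-byte length prefix, ask the kernel not
     * to report the socket readable until the rest of the frame has
     * arrived. TCP also consults rcvlowat in several protocol
     * decisions, not just wakeups, per Jakub's note above.
     */
    static int wait_for_frame(int fd, uint32_t frame_len)
    {
            int lowat = (int)frame_len;

            return setsockopt(fd, SOL_SOCKET, SO_RCVLOWAT,
                              &lowat, sizeof(lowat));
    }
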
* Re: Sockmap's parser/verdict programs and epoll notifications
@ 2023-10-04  5:40 ` John Fastabend

Andrii Nakryiko wrote:
> On Tue, Oct 3, 2023 at 5:42 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > [...]
>
> David can correct me, but when he tried the SO_RCVLOWAT approach to
> solving this problem, he saw no improvement (it might actually have
> been a regression in terms of behavior). I'd say that sounds a bit
> suspicious, and we have plans to get back to SO_RCVLOWAT and try to
> understand the behavior better.

Not sure how large your packets are, but you might need to bump your
sk_rcvbuf size as well; otherwise, even if you set SO_RCVLOWAT, you can
hit memory pressure, which will wake up the application regardless,
IIRC.

> I'll just say that the simpler the solution, the better. And if this
> rcvlowat hook gets us the ability to delay the network notification to
> user space until a full logical packet is assembled [...], that would
> be great.

When we created the sockmap/sockhash maps and verdict progs, etc., one
of the goals was to avoid touching the TCP code paths as much as
possible. We also wanted to work on top of kTLS. Maybe you wouldn't
need it, but if you need to read a header that spans multiple skbs,
that is hard without something to reconstruct them. Perhaps here you
could get away without needing this, though.

I'll still fix the parser program and start working on simplifying the
verdict programs so they can run without maps and so on, because it
helps other use cases. Maybe it will end up working for this case, or
you'll find a simpler mechanism.

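[The memory-pressure caveat John raises maps directly onto the epoll
wakeup check mentioned earlier in the thread. Below is a simplified
paraphrase of tcp_epollin_ready() from include/net/tcp.h around the
kernels of this thread; it is a sketch, not the exact source, which
also carries an additional receive-window condition.]

    /* Sketch: epoll reports EPOLLIN once the unread byte count reaches
     * the target derived from SO_RCVLOWAT, OR when the receive buffer
     * is under memory pressure. So a frame large relative to sk_rcvbuf
     * can wake the app early even with rcvlowat set, as John notes.
     */
    static bool epollin_ready_sketch(const struct sock *sk,
                                     const struct tcp_sock *tp, int target)
    {
            int avail = READ_ONCE(tp->rcv_nxt) - READ_ONCE(tp->copied_seq);

            if (avail <= 0)
                    return false;

            return avail >= target || tcp_rmem_pressure(sk);
    }
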
* Re: Sockmap's parser/verdict programs and epoll notifications
@ 2023-10-03 22:18 ` Andrii Nakryiko

On Mon, Oct 2, 2023 at 10:16 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> John Fastabend wrote:
> > [...]
> >
> > It feels a bit suboptimal in your case to run two BPF programs and
> > parser logic compared to a single verdict program. Could we just add
> > a BPF helper we can run from the verdict program to only wake up
> > user space after N bytes? [...]
> >
> > This, together with the other piece we want from our side (allowing
> > verdict and sk_msg programs to run on sockets without having them in
> > a sockmap/sockhash), would seem like a better system to me. [...]
> >
> > For your case I would expect it to be (a) simpler (just a single
> > program to manage instead of two programs and a map) and (b) more
> > efficient to call one prog in the datapath instead of two.
> >
> > WDYT?

Avoiding the need to maintain a sockmap/sockhash is a win for sure, and
you are right that normally, once you attach such a special
verdict/parser program (usually by port number, which typically
identifies a service, right?), you don't detach it until the socket is
closed. So yes, absolutely, this seems like a simplification.

> On second thought, I'll also fix the existing stream parser code here
> shortly. It's a bit broken if I just leave it as is, but I still like
> the idea of a new helper.

Yep, no matter what the new and better approach turns out to be, it
would be nice to have the existing stuff behave less erratically :)
Thanks for taking care of this!