From: Jakub Sitnicki <jakub@cloudflare.com>
To: John Fastabend <john.fastabend@gmail.com>
Cc: bpf@vger.kernel.org, netdev@vger.kernel.org,
daniel@iogearbox.net, joamaki@gmail.com,
xiyou.wangcong@gmail.com
Subject: Re: [PATCH bpf 2/4] bpf, sockmap: Fix race in ingress receive verdict with redirect to self
Date: Tue, 19 Oct 2021 11:16:29 +0200 [thread overview]
Message-ID: <87sfwxfk7m.fsf@cloudflare.com> (raw)
In-Reply-To: <20211011191647.418704-3-john.fastabend@gmail.com>
On Mon, Oct 11, 2021 at 09:16 PM CEST, John Fastabend wrote:
> A socket in a sockmap may have different combinations of programs
> attached depending on configuration. There can be no programs in which
> case the socket acts as a sink only. There can be a TX program in this
> case a BPF program is attached to sending side, but no RX program is
> attached. There can be an RX program only where sends have no BPF
> program attached, but receives are hooked with BPF. And finally,
> both TX and RX programs may be attached. Giving us the permutations,
>
> None, Tx, Rx, and TxRx
>
> To date most of our use cases have been TX case being used as a fast
> datapath to directly copy between local application and a userspace
> proxy. Or Rx cases and TxRX applications that are operating an in
> kernel based proxy. The traffic in the first case where we hook
> applications into a userspace application looks like this,
>
> AppA redirect AppB
> Tx <-----------> Rx
> | |
> + +
> TCP <--> lo <--> TCP
>
> In this case all traffic from AppA (after 3whs) is copied into the
> AppB ingress queue and no traffic is ever on the TCP recieive_queue.
>
> In the second case the application never receives, except in some
> rare error cases, traffic on the actual user space socket. Instead
> the send happens in the kernel.
>
> AppProxy socket pool
> sk0 ------------->{sk1,sk2, skn}
> ^ |
> | |
> | v
> ingress lb egress
> TCP TCP
>
> Here because traffic is never read off the socket with userspace
> recv() APIs there is only ever one reader on the sk receive_queue.
> Namely the BPF programs.
>
> However, we've started to introduce a third configuration where the
> BPF program on receive should process the data, but then the normal
> case is to push the data into the receive queue of AppB.
>
> AppB
> recv() (userspace)
> -----------------------
> tcp_bpf_recvmsg() (kernel)
> | |
> | |
> | |
> ingress_msgQ |
> | |
> RX_BPF |
> | |
> v v
> sk->receive_queue
>
>
> This is different from the App{A,B} redirect because traffic is
> first received on the sk->receive_queue.
>
> Now for the issue. The tcp_bpf_recvmsg() handler first checks the
> ingress_msg queue for any data handled by the BPF rx program and
> returned with PASS code so that it was enqueued on the ingress msg
> queue. Then if no data exists on that queue it checks the socket
> receive queue. Unfortunately, this is the same receive_queue the
> BPF program is reading data off of. So we get a race. Its possible
> for the recvmsg() hook to pull data off the receive_queue before
> the BPF hook has a chance to read it. It typically happens when
> an application is banging on recv() and getting EAGAINs. Until
> they manage to race with the RX BPF program.
>
> To fix this we note that before this patch at attach time when
> the socket is loaded into the map we check if it needs a TX
> program or just the base set of proto bpf hooks. Then it uses
> the above general RX hook regardless of if we have a BPF program
> attached at rx or not. This patch now extends this check to
> handle all cases enumerated above, TX, RX, TXRX, and none. And
> to fix above race when an RX program is attached we use a new
> hook that is nearly identical to the old one except now we
> do not let the recv() call skip the RX BPF program. Now only
> the BPF program pulls data from sk->receive_queue and recv()
> only pulls data from the ingress msgQ post BPF program handling.
>
> With this resolved our AppB from above has been up and running
> for many hours without detecting any errors. We do this by
> correlating counters in RX BPF events and the AppB to ensure
> data is never skipping the BPF program. Selftests, was not
> able to detect this because we only run them for a short
> period of time on well ordered send/recvs so we don't get any
> of the noise we see in real application environments.
>
> Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
next prev parent reply other threads:[~2021-10-19 9:16 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-10-11 19:16 [PATCH bpf 0/4] bpf, sockmap: fixes stress testing and regression John Fastabend
2021-10-11 19:16 ` [PATCH bpf 1/4] bpf, sockmap: Remove unhash handler for BPF sockmap usage John Fastabend
2021-10-19 7:17 ` Jakub Sitnicki
2021-10-20 5:28 ` John Fastabend
2021-10-20 15:11 ` Jakub Sitnicki
2021-10-20 15:51 ` John Fastabend
2021-10-20 16:35 ` Jakub Sitnicki
2021-10-21 19:24 ` John Fastabend
2021-10-11 19:16 ` [PATCH bpf 2/4] bpf, sockmap: Fix race in ingress receive verdict with redirect to self John Fastabend
2021-10-19 9:16 ` Jakub Sitnicki [this message]
2021-10-11 19:16 ` [PATCH bpf 3/4] bpf: sockmap, strparser, and tls are reusing qdisc_skb_cb and colliding John Fastabend
2021-10-19 15:39 ` Jakub Sitnicki
2021-10-11 19:16 ` [PATCH bpf 4/4] bpf, sockmap: sk_skb data_end access incorrect when src_reg = dst_reg John Fastabend
2021-10-23 13:05 ` Jakub Sitnicki
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87sfwxfk7m.fsf@cloudflare.com \
--to=jakub@cloudflare.com \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=joamaki@gmail.com \
--cc=john.fastabend@gmail.com \
--cc=netdev@vger.kernel.org \
--cc=xiyou.wangcong@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.