Re: [PATCH bpf-next] bpf, sockmap: Add sk_rmem_alloc check for tcp_bpf_ingress()

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Jakub Sitnicki <jakub@cloudflare.com>
To: Wang Yufen <wangyufen@huawei.com>
Cc: ast@kernel.org, john.fastabend@gmail.com, andrii@kernel.org,
	daniel@iogearbox.net, lmb@cloudflare.com, davem@davemloft.net,
	kafai@fb.com, dsahern@kernel.org, kuba@kernel.org,
	songliubraving@fb.com, yhs@fb.com, kpsingh@kernel.org,
	netdev@vger.kernel.org, bpf@vger.kernel.org
Subject: Re: [PATCH bpf-next] bpf, sockmap: Add sk_rmem_alloc check for tcp_bpf_ingress()
Date: Sat, 26 Mar 2022 21:37:18 +0100	[thread overview]
Message-ID: <87ee2oxvhy.fsf@cloudflare.com> (raw)
In-Reply-To: <20220318104222.1410625-1-wangyufen@huawei.com>

Hello,

On Fri, Mar 18, 2022 at 06:42 PM +08, Wang Yufen wrote:
> We use sk_msg to redirect with sock hash, like this:
>
>   skA   redirect    skB
>   Tx <----------->  Rx
>
> And construct a scenario where the packet sending speed is high, the
> packet receiving speed is slow, so the packets are stacked in the ingress
> queue on the receiving side. After a period of time, the memory is
> exhausted and the system ooms.
>
> To fix, we add sk_rmem_alloc while sk_msg queued in the ingress queue
> and subtract sk_rmem_alloc while sk_msg dequeued from the ingress queue
> and check sk_rmem_alloc at the beginning of bpf_tcp_ingress().
>
> Signed-off-by: Wang Yufen <wangyufen@huawei.com>
> ---
>  include/linux/skmsg.h | 9 ++++++---
>  net/core/skmsg.c      | 2 ++
>  net/ipv4/tcp_bpf.c    | 6 ++++++
>  3 files changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
> index c5a2d6f50f25..d2cfd5fa2274 100644
> --- a/include/linux/skmsg.h
> +++ b/include/linux/skmsg.h
> @@ -308,9 +308,10 @@ static inline void sk_psock_queue_msg(struct sk_psock *psock,
>  				      struct sk_msg *msg)
>  {
>  	spin_lock_bh(&psock->ingress_lock);
> -	if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED))
> +	if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
>  		list_add_tail(&msg->list, &psock->ingress_msg);
> -	else {
> +		atomic_add(msg->sg.size, &psock->sk->sk_rmem_alloc);
> +	} else {
>  		sk_msg_free(psock->sk, msg);
>  		kfree(msg);
>  	}
> @@ -323,8 +324,10 @@ static inline struct sk_msg *sk_psock_dequeue_msg(struct sk_psock *psock)
>  
>  	spin_lock_bh(&psock->ingress_lock);
>  	msg = list_first_entry_or_null(&psock->ingress_msg, struct sk_msg, list);
> -	if (msg)
> +	if (msg) {
>  		list_del(&msg->list);
> +		atomic_sub(msg->sg.size, &psock->sk->sk_rmem_alloc);
> +	}
>  	spin_unlock_bh(&psock->ingress_lock);
>  	return msg;
>  }
> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index cc381165ea08..b19a3c49564f 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -445,6 +445,7 @@ int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
>  				if (!msg_rx->skb)
>  					sk_mem_uncharge(sk, copy);
>  				msg_rx->sg.size -= copy;
> +				atomic_sub(copy, &sk->sk_rmem_alloc);
>  
>  				if (!sge->length) {
>  					sk_msg_iter_var_next(i);
> @@ -754,6 +755,7 @@ static void __sk_psock_purge_ingress_msg(struct sk_psock *psock)
>  
>  	list_for_each_entry_safe(msg, tmp, &psock->ingress_msg, list) {
>  		list_del(&msg->list);
> +		atomic_sub(msg->sg.size, &psock->sk->sk_rmem_alloc);
>  		sk_msg_free(psock->sk, msg);
>  		kfree(msg);
>  	}
> diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
> index 1cdcb4df0eb7..dd099875414c 100644
> --- a/net/ipv4/tcp_bpf.c
> +++ b/net/ipv4/tcp_bpf.c
> @@ -24,6 +24,12 @@ static int bpf_tcp_ingress(struct sock *sk, struct sk_psock *psock,
>  		return -ENOMEM;
>  
>  	lock_sock(sk);
> +	if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf) {
> +		release_sock(sk);
> +		kfree(tmp);
> +		return -EAGAIN;
> +	}
> +
>  	tmp->sg.start = msg->sg.start;
>  	i = msg->sg.start;
>  	do {

Thanks for the patch and sorry for the delay.

I have to dig deeper to understand what you are experiencing.

1/ We can't charge rmem in sk_psock_queue_msg(). This path is shared by
redirect from sendmsg and redirect from ingress. When redirecting
psock->ingress_skb, we already charge sk_rmem_alloc in
sk_psock_skb_ingress() by transferring the skb ownership. See
sk_set_owner_r().

2/ The psock->ingress_msg build up for a slow reader should not lead to
be unbounded if memcg limits are in place. We charge
sk->sk_forward_alloc for received bytes in bpf_tcp_ingress(). I'd expect
the receiver process to get killed due to OOM, if it's running under a
dedicated cgroup. Is it the so in your case?

I'm pretty sure, though, that we have an accounting bug in
bpf_tcp_ingress(), because we are asking there for more write buffers
call - sk_wmem_schedule() - instead of read buffers
(sk_rmem_schedule()).

3/ I'm a bit surprised that you are not getting -ENOMEM from
bpf_tcp_ingress() already today, when the receive side allocates more
than what the net.ipv4.tcp_wmem hard limit is. (_wmem instead of _rmem
because of the bug I've mentioned above.) I'd expect we would be failing
in __sk_mem_raise_allocated().

Could you please check if you are not hitting the sock_exceed_buf_limit
tracepoint to confirm?

$ sudo bpftrace -lv 'tracepoint:sock:sock_exceed_buf_limit'
tracepoint:sock:sock_exceed_buf_limit
    char name[32]
    long * sysctl_mem
    long allocated
    int sysctl_rmem
    int rmem_alloc
    int sysctl_wmem
    int wmem_alloc
    int wmem_queued
    int kind

Thanks,
Jakub

     prev parent reply	other threads:[~2022-03-26 21:19 UTC|newest]

Thread overview: 2+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-03-18 10:42 [PATCH bpf-next] bpf, sockmap: Add sk_rmem_alloc check for tcp_bpf_ingress() Wang Yufen
2022-03-26 20:37 ` Jakub Sitnicki [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87ee2oxvhy.fsf@cloudflare.com \
    --to=jakub@cloudflare.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=dsahern@kernel.org \
    --cc=john.fastabend@gmail.com \
    --cc=kafai@fb.com \
    --cc=kpsingh@kernel.org \
    --cc=kuba@kernel.org \
    --cc=lmb@cloudflare.com \
    --cc=netdev@vger.kernel.org \
    --cc=songliubraving@fb.com \
    --cc=wangyufen@huawei.com \
    --cc=yhs@fb.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.