From: Jakub Sitnicki <jakub@cloudflare.com>
To: John Fastabend <john.fastabend@gmail.com>
Cc: cong.wang@bytedance.com, daniel@iogearbox.net, lmb@isovalent.com,
edumazet@google.com, bpf@vger.kernel.org, netdev@vger.kernel.org,
ast@kernel.org, andrii@kernel.org, will@isovalent.com
Subject: Re: [PATCH bpf v2 03/12] bpf: sockmap, improved check for empty queue
Date: Wed, 29 Mar 2023 14:24:17 +0200
Message-ID: <87zg7vbu60.fsf@cloudflare.com>
In-Reply-To: <20230327175446.98151-4-john.fastabend@gmail.com>
On Mon, Mar 27, 2023 at 10:54 AM -07, John Fastabend wrote:
> We noticed some rare sk_buffs were stepping past the queue when system was
> under memory pressure. The general theory is to skip enqueueing
> sk_buffs when its not necessary which is the normal case with a system
> that is properly provisioned for the task, no memory pressure and enough
> cpu assigned.
>
> But, if we can't allocate memory due to an ENOMEM error when enqueueing
> the sk_buff into the sockmap receive queue we push it onto a delayed
> workqueue to retry later. When a new sk_buff is received we then check
> if that queue is empty. However, there is a problem with simply checking
> the queue length. When a sk_buff is being processed from the ingress queue
> but not yet on the sockmap msg receive queue its possible to also recv
> a sk_buff through normal path. It will check the ingress queue which is
> zero and then skip ahead of the pkt being processed.
>
> Previously we used sock lock from both contexts which made the problem
> harder to hit, but not impossible.
>
> To fix also check the 'state' variable where we would cache partially
> processed sk_buff. This catches the majority of cases. But, we also
> need to use the mutex lock around this check because we can't have both
> codes running and check sensibly. We could perhaps do this with atomic
> bit checks, but we are already here due to memory pressure so slowing
> things down a bit seems OK and simpler to just grab a lock.
>
> To reproduce issue we run NGINX compliance test with sockmap running and
> observe some flakes in our testing that we attributed to this issue.
>
> Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
> Tested-by: William Findlay <will@isovalent.com>
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---
I've got an idea to try, but it'd be a bigger change.
skb_dequeue is lock, skb_peek, skb_unlink, unlock, right?
What if we split up the skb_dequeue in sk_psock_backlog to publish the
change to the ingress_skb queue only once an skb has been processed?

  static void sk_psock_backlog(struct work_struct *work)
  {
          ...
          while ((skb = skb_peek_locked(&psock->ingress_skb))) {
                  ...
                  skb_unlink(skb, &psock->ingress_skb);
          }
          ...
  }

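The difference between the two orderings can be sketched in a small userspace model. This is not kernel code: `struct pkt`, `struct pkt_queue` and every `queue_*` helper are hypothetical stand-ins for `sk_buff`, `sk_buff_head` and the skb queue API, and a pthread mutex stands in for the queue spinlock. The sketch only shows what an emptiness check observes while a packet is being processed.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for sk_buff and sk_buff_head. */
struct pkt {
	struct pkt *next;
};

struct pkt_queue {
	pthread_mutex_t lock;
	struct pkt *head, *tail;
};

static void queue_init(struct pkt_queue *q)
{
	pthread_mutex_init(&q->lock, NULL);
	q->head = q->tail = NULL;
}

static void queue_tail(struct pkt_queue *q, struct pkt *p)
{
	pthread_mutex_lock(&q->lock);
	p->next = NULL;
	if (q->tail)
		q->tail->next = p;
	else
		q->head = p;
	q->tail = p;
	pthread_mutex_unlock(&q->lock);
}

/* Caller holds q->lock. Removes the head packet, if any. */
static void unlink_head_locked(struct pkt_queue *q)
{
	if (!q->head)
		return;
	q->head = q->head->next;
	if (!q->head)
		q->tail = NULL;
}

/* Models skb_dequeue(): lock, peek, unlink, unlock in one step. */
static struct pkt *queue_dequeue(struct pkt_queue *q)
{
	struct pkt *p;

	pthread_mutex_lock(&q->lock);
	p = q->head;
	unlink_head_locked(q);
	pthread_mutex_unlock(&q->lock);
	return p;
}

/* Peek only: the packet stays linked, so other contexts still see it. */
static struct pkt *queue_peek(struct pkt_queue *q)
{
	struct pkt *p;

	pthread_mutex_lock(&q->lock);
	p = q->head;
	pthread_mutex_unlock(&q->lock);
	return p;
}

/* Unlink the head once processing is done; this publishes the change. */
static void queue_unlink_head(struct pkt_queue *q)
{
	pthread_mutex_lock(&q->lock);
	unlink_head_locked(q);
	pthread_mutex_unlock(&q->lock);
}

/* The emptiness check a concurrent receive path would perform. */
static bool queue_empty(struct pkt_queue *q)
{
	bool empty;

	pthread_mutex_lock(&q->lock);
	empty = q->head == NULL;
	pthread_mutex_unlock(&q->lock);
	return empty;
}
```

With `queue_dequeue()` the queue already looks empty while the packet is still in flight, which is the window the patch description talks about. With `queue_peek()` followed by `queue_unlink_head()` the queue stays non-empty until processing has finished.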
Even more - if we hold off the unlinking until an skb has been fully
processed, that perhaps opens up the way to get rid of keeping state in
sk_psock_work_state. We could just skb_pull the processed data instead.
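
A rough userspace illustration of that second idea, under stated assumptions: `struct pkt`, `pkt_pull()`, `consume()` and `process_once()` are hypothetical names, with `pkt_pull()` loosely modeling `skb_pull()` (it clamps instead of failing on an oversized pull, which is a simplification). The point is only that the packet itself records how much is left, so no separate offset or length needs to be cached between retries.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an sk_buff's linear data area. */
struct pkt {
	const unsigned char *data;	/* first unprocessed byte */
	size_t len;			/* unprocessed bytes left */
};

/* Loosely models skb_pull(): drop n processed bytes from the front. */
static void pkt_pull(struct pkt *p, size_t n)
{
	if (n > p->len)
		n = p->len;	/* simplification: clamp instead of failing */
	p->data += n;
	p->len -= n;
}

/* Hypothetical consumer that accepts at most `budget` bytes per call,
 * standing in for a send that can make partial progress. */
static size_t consume(const struct pkt *p, size_t budget)
{
	return p->len < budget ? p->len : budget;
}

/* One backlog pass over the packet at the head of the queue.
 * Returns 1 when it is fully processed and may be unlinked,
 * 0 when it must stay queued for a later retry. */
static int process_once(struct pkt *p, size_t budget)
{
	pkt_pull(p, consume(p, budget));
	return p->len == 0;
}
```

A retry simply calls `process_once()` again on the same packet; the progress made so far lives in `data` and `len` rather than in side state.
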
It's just an idea and I don't want to block a tested fix that LGTM so
consider this:
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Thread overview: 24+ messages
2023-03-27 17:54 [PATCH bpf v2 00/11] bpf sockmap fixes John Fastabend
2023-03-27 17:54 ` [PATCH bpf v2 01/12] bpf: sockmap, pass skb ownership through read_skb John Fastabend
2023-03-28 10:42 ` Jakub Sitnicki
2023-03-27 17:54 ` [PATCH bpf v2 02/12] bpf: sockmap, convert schedule_work into delayed_work John Fastabend
2023-03-28 12:09 ` Jakub Sitnicki
2023-03-28 21:56 ` John Fastabend
2023-03-29 11:09 ` Jakub Sitnicki
2023-03-27 17:54 ` [PATCH bpf v2 03/12] bpf: sockmap, improved check for empty queue John Fastabend
2023-03-29 12:24 ` Jakub Sitnicki [this message]
2023-04-01 0:59 ` John Fastabend
2023-04-03 8:42 ` Jakub Sitnicki
2023-03-27 17:54 ` [PATCH bpf v2 04/12] bpf: sockmap, handle fin correctly John Fastabend
2023-04-03 11:11 ` Jakub Sitnicki
2023-04-03 21:05 ` John Fastabend
2023-04-04 10:11 ` Jakub Sitnicki
2023-03-27 17:54 ` [PATCH bpf v2 05/12] bpf: sockmap, TCP data stall on recv before accept John Fastabend
2023-03-27 17:54 ` [PATCH bpf v2 06/12] bpf: sockmap, wake up polling after data copy John Fastabend
2023-03-27 17:54 ` [PATCH bpf v2 07/12] bpf: sockmap incorrectly handling copied_seq John Fastabend
2023-03-27 17:54 ` [PATCH bpf v2 08/12] bpf: sockmap, pull socket helpers out of listen test for general use John Fastabend
2023-03-27 17:54 ` [PATCH bpf v2 09/12] bpf: sockmap, build helper to create connected socket pair John Fastabend
2023-03-27 17:54 ` [PATCH bpf v2 10/12] bpf: sockmap, test shutdown() correctly exits epoll and recv()=0 John Fastabend
2023-03-27 17:54 ` [PATCH bpf v2 11/12] bpf: sockmap, test FIONREAD returns correct bytes in rx buffer John Fastabend
2023-03-27 17:54 ` [PATCH bpf v2 12/12] bpf: sockmap, test FIONREAD returns correct bytes in rx buffer with drops John Fastabend
2023-03-27 18:17 ` [PATCH bpf v2 00/11] bpf sockmap fixes John Fastabend