bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jakub Sitnicki <jakub@cloudflare.com>
To: John Fastabend <john.fastabend@gmail.com>
Cc: cong.wang@bytedance.com, daniel@iogearbox.net, lmb@isovalent.com,
	edumazet@google.com, bpf@vger.kernel.org, netdev@vger.kernel.org,
	ast@kernel.org, andrii@kernel.org, will@isovalent.com
Subject: Re: [PATCH bpf v2 02/12] bpf: sockmap, convert schedule_work into delayed_work
Date: Wed, 29 Mar 2023 13:09:37 +0200	[thread overview]
Message-ID: <874jq3dcw1.fsf@cloudflare.com> (raw)
In-Reply-To: <642362a1403ee_286af20850@john.notmuch>

On Tue, Mar 28, 2023 at 02:56 PM -07, John Fastabend wrote:
> Jakub Sitnicki wrote:
>> On Mon, Mar 27, 2023 at 10:54 AM -07, John Fastabend wrote:
>> > Sk_buffs are fed into sockmap verdict programs either from a strparser
>> > (when the user might want to decide how framing of skb is done by attaching
>> > another parser program) or directly through tcp_read_sock. The
>> > tcp_read_sock is the preferred method for performance when the BPF logic is
>> > a stream parser.
>> >
>> > The flow for Cilium's common use case with a stream parser is,
>> >
>> >  tcp_read_sock()
>> >   sk_psock_verdict_recv
>> >     ret = bpf_prog_run_pin_on_cpu()
>> >     sk_psock_verdict_apply(sock, skb, ret)
>> >      // if system is under memory pressure or app is slow we may
>> >      // need to queue skb. Do this queuing through ingress_skb and
>> >      // then kick timer to wake up handler
>> >      skb_queue_tail(ingress_skb, skb)
>> >      schedule_work(work);
>> >
>> >
>> > The work queue is wired up to sk_psock_backlog(). This will then walk the
>> > ingress_skb skb list that holds our sk_buffs that could not be handled,
>> > but should be OK to run at some later point. However, its possible that
>> > the workqueue doing this work still hits an error when sending the skb.
>> > When this happens the skbuff is requeued on a temporary 'state' struct
>> > kept with the workqueue. This is necessary because its possible to
>> > partially send an skbuff before hitting an error and we need to know how
>> > and where to restart when the workqueue runs next.
>> >
>> > Now for the trouble, we don't rekick the workqueue. This can cause a
>> > stall where the skbuff we just cached on the state variable might never
>> > be sent. This happens when its the last packet in a flow and no further
>> > packets come along that would cause the system to kick the workqueue from
>> > that side.
>> >
>> > To fix we could do simple schedule_work(), but while under memory pressure
>> > it makes sense to back off some instead of continue to retry repeatedly. So
>> > instead to fix convert schedule_work to schedule_delayed_work and add
>> > backoff logic to reschedule from backlog queue on errors. Its not obvious
>> > though what a good backoff is so use '1'.
>> >
>> > To test we observed some flakes whil running NGINX compliance test with
>> > sockmap we attributed these failed test to this bug and subsequent issue.
>> >
>> > Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
>> > Tested-by: William Findlay <will@isovalent.com>
>> > Signed-off-by: John Fastabend <john.fastabend@gmail.com>
>> > ---
>
> [...]
>
>> > --- a/net/core/skmsg.c
>> > +++ b/net/core/skmsg.c
>> > @@ -481,7 +481,7 @@ int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
>> >  	}
>> >  out:
>> >  	if (psock->work_state.skb && copied > 0)
>> > -		schedule_work(&psock->work);
>> > +		schedule_delayed_work(&psock->work, 0);
>> >  	return copied;
>> >  }
>> >  EXPORT_SYMBOL_GPL(sk_msg_recvmsg);
>> > @@ -639,7 +639,8 @@ static void sk_psock_skb_state(struct sk_psock *psock,
>> >  
>> >  static void sk_psock_backlog(struct work_struct *work)
>> >  {
>> > -	struct sk_psock *psock = container_of(work, struct sk_psock, work);
>> > +	struct delayed_work *dwork = to_delayed_work(work);
>> > +	struct sk_psock *psock = container_of(dwork, struct sk_psock, work);
>> >  	struct sk_psock_work_state *state = &psock->work_state;
>> >  	struct sk_buff *skb = NULL;
>> >  	bool ingress;
>> > @@ -679,6 +680,10 @@ static void sk_psock_backlog(struct work_struct *work)
>> >  				if (ret == -EAGAIN) {
>> >  					sk_psock_skb_state(psock, state, skb,
>> >  							   len, off);
>> > +
>> > +					// Delay slightly to prioritize any
>> > +					// other work that might be here.
>> > +					schedule_delayed_work(&psock->work, 1);
>> 
>> Do IIUC that this means we can back out changes from commit bec217197b41
>> ("skmsg: Schedule psock work if the cached skb exists on the psock")?
>
> Yeah I think so this is a more direct way to get the same result. I'm also
> thinking this check,
>
>        if (psock->work_state.skb && copied > 0)
>                schedule_work(&psock->work)
>
> is not correct copied=0 which could happen on empty queue could be the
> result of a skb stuck from this eagain error in backlog.

I suspect the 'copied > 0' check is there to handle the 0-length read
scenario. But I think you're right, that the empty queue scenario is not
being handled properly.

> I think its OK to revert that patch in a separate patch. And ideally we
> could get some way to load up the stack to hit these corner cases without
> running long stress tests.
>
> WDYT?

Yeah, the revert can wait. I was just curious if my thinking was
right. There is plenty of material in this series as is :-)

  reply	other threads:[~2023-03-29 11:18 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-03-27 17:54 [PATCH bpf v2 00/11] bpf sockmap fixes John Fastabend
2023-03-27 17:54 ` [PATCH bpf v2 01/12] bpf: sockmap, pass skb ownership through read_skb John Fastabend
2023-03-28 10:42   ` Jakub Sitnicki
2023-03-27 17:54 ` [PATCH bpf v2 02/12] bpf: sockmap, convert schedule_work into delayed_work John Fastabend
2023-03-28 12:09   ` Jakub Sitnicki
2023-03-28 21:56     ` John Fastabend
2023-03-29 11:09       ` Jakub Sitnicki [this message]
2023-03-27 17:54 ` [PATCH bpf v2 03/12] bpf: sockmap, improved check for empty queue John Fastabend
2023-03-29 12:24   ` Jakub Sitnicki
2023-04-01  0:59     ` John Fastabend
2023-04-03  8:42       ` Jakub Sitnicki
2023-03-27 17:54 ` [PATCH bpf v2 04/12] bpf: sockmap, handle fin correctly John Fastabend
2023-04-03 11:11   ` Jakub Sitnicki
2023-04-03 21:05     ` John Fastabend
2023-04-04 10:11       ` Jakub Sitnicki
2023-03-27 17:54 ` [PATCH bpf v2 05/12] bpf: sockmap, TCP data stall on recv before accept John Fastabend
2023-03-27 17:54 ` [PATCH bpf v2 06/12] bpf: sockmap, wake up polling after data copy John Fastabend
2023-03-27 17:54 ` [PATCH bpf v2 07/12] bpf: sockmap incorrectly handling copied_seq John Fastabend
2023-03-27 17:54 ` [PATCH bpf v2 08/12] bpf: sockmap, pull socket helpers out of listen test for general use John Fastabend
2023-03-27 17:54 ` [PATCH bpf v2 09/12] bpf: sockmap, build helper to create connected socket pair John Fastabend
2023-03-27 17:54 ` [PATCH bpf v2 10/12] bpf: sockmap, test shutdown() correctly exits epoll and recv()=0 John Fastabend
2023-03-27 17:54 ` [PATCH bpf v2 11/12] bpf: sockmap, test FIONREAD returns correct bytes in rx buffer John Fastabend
2023-03-27 17:54 ` [PATCH bpf v2 12/12] bpf: sockmap, test FIONREAD returns correct bytes in rx buffer with drops John Fastabend
2023-03-27 18:17 ` [PATCH bpf v2 00/11] bpf sockmap fixes John Fastabend

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=874jq3dcw1.fsf@cloudflare.com \
    --to=jakub@cloudflare.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=cong.wang@bytedance.com \
    --cc=daniel@iogearbox.net \
    --cc=edumazet@google.com \
    --cc=john.fastabend@gmail.com \
    --cc=lmb@isovalent.com \
    --cc=netdev@vger.kernel.org \
    --cc=will@isovalent.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).