From: John Fastabend <john.fastabend@gmail.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>,
John Fastabend <john.fastabend@gmail.com>
Cc: Jakub Sitnicki <jakub@cloudflare.com>,
Daniel Borkmann <daniel@iogearbox.net>,
Lorenz Bauer <lmb@isovalent.com>,
Eric Dumazet <edumazet@google.com>, bpf <bpf@vger.kernel.org>,
Network Development <netdev@vger.kernel.org>,
Alexei Starovoitov <ast@kernel.org>,
Andrii Nakryiko <andrii@kernel.org>,
will@isovalent.com
Subject: Re: [PATCH bpf v6 03/12] bpf: sockmap, improved check for empty queue
Date: Thu, 13 Apr 2023 13:06:45 -0700 [thread overview]
Message-ID: <643860d554e5f_496ba2085d@john.notmuch> (raw)
In-Reply-To: <CAADnVQ+43oW3F3_R2cz76ivfeJNDt6Nf66jOdXADjsp=jJZxgQ@mail.gmail.com>
Alexei Starovoitov wrote:
> On Fri, Apr 7, 2023 at 10:17 AM John Fastabend <john.fastabend@gmail.com> wrote:
> >
> > We noticed some rare sk_buffs were stepping past the queue when system was
> > under memory pressure. The general theory is to skip enqueueing
> > sk_buffs when its not necessary which is the normal case with a system
> > that is properly provisioned for the task, no memory pressure and enough
> > cpu assigned.
> >
> > But, if we can't allocate memory due to an ENOMEM error when enqueueing
> > the sk_buff into the sockmap receive queue we push it onto a delayed
> > workqueue to retry later. When a new sk_buff is received we then check
> > if that queue is empty. However, there is a problem with simply checking
> > the queue length. When a sk_buff is being processed from the ingress queue
> > but not yet on the sockmap msg receive queue its possible to also recv
> > a sk_buff through normal path. It will check the ingress queue which is
> > zero and then skip ahead of the pkt being processed.
> >
> > Previously we used sock lock from both contexts which made the problem
> > harder to hit, but not impossible.
> >
> > To fix also check the 'state' variable where we would cache partially
> > processed sk_buff. This catches the majority of cases. But, we also
> > need to use the mutex lock around this check because we can't have both
> > codes running and check sensibly. We could perhaps do this with atomic
> > bit checks, but we are already here due to memory pressure so slowing
> > things down a bit seems OK and simpler to just grab a lock.
> >
> > To reproduce issue we run NGINX compliance test with sockmap running and
> > observe some flakes in our testing that we attributed to this issue.
> >
> > Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
> > Tested-by: William Findlay <will@isovalent.com>
> > Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> > ---
> > net/core/skmsg.c | 20 +++++++++++++++++++-
> > 1 file changed, 19 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> > index 198bed303c51..f8731818b5c3 100644
> > --- a/net/core/skmsg.c
> > +++ b/net/core/skmsg.c
> > @@ -987,6 +987,7 @@ EXPORT_SYMBOL_GPL(sk_psock_tls_strp_read);
> > static int sk_psock_verdict_apply(struct sk_psock *psock, struct sk_buff *skb,
> > int verdict)
> > {
> > + struct sk_psock_work_state *state;
> > struct sock *sk_other;
> > int err = 0;
> > u32 len, off;
> > @@ -1003,13 +1004,28 @@ static int sk_psock_verdict_apply(struct sk_psock *psock, struct sk_buff *skb,
> >
> > skb_bpf_set_ingress(skb);
> >
> > + /* We need to grab mutex here because in-flight skb is in one of
> > + * the following states: either on ingress_skb, in psock->state
> > + * or being processed by backlog and neither in state->skb and
> > + * ingress_skb may be also empty. The troublesome case is when
> > + * the skb has been dequeued from ingress_skb list or taken from
> > + * state->skb because we can not easily test this case. Maybe we
> > + * could be clever with flags and resolve this but being clever
> > + * got us here in the first place and we note this is done under
> > + * sock lock and backlog conditions mean we are already running
> > + * into ENOMEM or other performance hindering cases so lets do
> > + * the obvious thing and grab the mutex.
> > + */
> > + mutex_lock(&psock->work_mutex);
> > + state = &psock->work_state;
>
> This splat says that above is wrong:
>
> [ 98.732763] BUG: sleeping function called from invalid context at
> kernel/locking/mutex.c:580
> [ 98.733483] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid:
> 370, name: test_progs
> [ 98.734103] preempt_count: 102, expected: 0
> [ 98.734416] RCU nest depth: 4, expected: 0
> [ 98.734739] 6 locks held by test_progs/370:
> [ 98.735046] #0: ffff888106475530 (sk_lock-AF_INET){+.+.}-{0:0},
> at: inet_shutdown+0x43/0x150
> [ 98.735695] #1: ffffffff84250ba0 (rcu_read_lock){....}-{1:2}, at:
> __ip_queue_xmit+0x5/0xa00
> [ 98.736325] #2: ffffffff84250ba0 (rcu_read_lock){....}-{1:2}, at:
> process_backlog+0xc0/0x360
> [ 98.736971] #3: ffffffff84250ba0 (rcu_read_lock){....}-{1:2}, at:
> ip_local_deliver_finish+0xbb/0x220
> [ 98.737668] #4: ffff8881064748b0 (slock-AF_INET/1){+.-.}-{2:2},
> at: tcp_v4_rcv+0x1b72/0x1d80
> [ 98.738297] #5: ffffffff84250ba0 (rcu_read_lock){....}-{1:2}, at:
> sk_psock_verdict_recv+0x5/0x3a0
> [ 98.738973] Preemption disabled at:
> [ 98.738976] [<ffffffff8238bb41>] ip_finish_output2+0x171/0xfa0
> [ 98.739687] CPU: 1 PID: 370 Comm: test_progs Tainted: G O
> 6.3.0-rc5-00193-g9149a3b041d2 #942
> [ 98.740379] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
> [ 98.741164] Call Trace:
> [ 98.741338] <IRQ>
> [ 98.741483] dump_stack_lvl+0x60/0x70
> [ 98.741735] __might_resched+0x21c/0x340
> [ 98.742005] __mutex_lock+0xb4/0x12a0
> [ 98.743721] ? mutex_lock_io_nested+0x1070/0x1070
> [ 98.744053] ? lock_is_held_type+0xda/0x130
> [ 98.744337] ? preempt_count_sub+0x14/0xc0
> [ 98.744624] ? sk_psock_verdict_apply+0x1a3/0x2f0
> [ 98.744936] sk_psock_verdict_apply+0x1a3/0x2f0
> [ 98.745242] ? preempt_count_sub+0x14/0xc0
> [ 98.745530] sk_psock_verdict_recv+0x1e7/0x3a0
> [ 98.745858] ? preempt_count_sub+0x14/0xc0
> [ 98.746168] tcp_read_skb+0x19c/0x2d0
> [ 98.746447] ? sk_psock_strp_read+0x390/0x390
> [ 98.746774] ? tcp_alloc_md5sig_pool+0x230/0x230
> [ 98.747116] ? rcu_read_lock_held+0x91/0xa0
> [ 98.747427] ? rcu_read_lock_sched_held+0xc0/0xc0
> [ 98.747772] ? __rcu_read_unlock+0x6b/0x2a0
> [ 98.748087] sk_psock_verdict_data_ready+0x99/0x2d0
> [ 98.748446] tcp_data_queue+0xd39/0x19b0
> [ 98.748749] ? tcp_send_rcvq+0x280/0x280
> [ 98.749038] ? tcp_urg+0x7f/0x4c0
> [ 98.749298] ? tcp_ack_update_rtt.isra.55+0x910/0x910
> [ 98.749644] ? lockdep_hardirqs_on+0x79/0x100
> [ 98.749940] ? ktime_get+0x112/0x120
> [ 98.750225] ? ktime_get+0x86/0x120
> [ 98.750498] tcp_rcv_established+0x3fb/0xcc0
> [ 98.752087] tcp_v4_do_rcv+0x34a/0x4c0
> [ 98.752400] tcp_v4_rcv+0x1c9a/0x1d80
> [ 98.754137] ip_protocol_deliver_rcu+0x4f/0x4d0
> [ 98.754562] ip_local_deliver_finish+0x146/0x220
> [ 98.754928] ip_local_deliver+0x100/0x2e0
> [ 98.756055] ip_rcv+0xb6/0x2b0
> [ 98.757689] __netif_receive_skb_one_core+0xd2/0x110
> [ 98.759121] process_backlog+0x160/0x360
> [ 98.759446] __napi_poll+0x57/0x300
> [ 98.759725] net_rx_action+0x555/0x600
> [ 98.760029] ? napi_threaded_poll+0x2b0/0x2b0
> [ 98.760444] __do_softirq+0xeb/0x4e7
> [ 98.760728] ? ip_finish_output2+0x391/0xfa0
> [ 98.761059] do_softirq+0xa1/0xd0
>
> I'm afraid I have to revert this set from bpf tree.
OK thanks. We also got a syzkaller complaint so will fix these two things.
next prev parent reply other threads:[~2023-04-13 20:06 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-04-07 17:16 [PATCH bpf v6 00/12] bpf sockmap fixes John Fastabend
2023-04-07 17:16 ` [PATCH bpf v6 01/12] bpf: sockmap, pass skb ownership through read_skb John Fastabend
2023-04-07 17:16 ` [PATCH bpf v6 02/12] bpf: sockmap, convert schedule_work into delayed_work John Fastabend
2023-04-07 17:16 ` [PATCH bpf v6 03/12] bpf: sockmap, improved check for empty queue John Fastabend
2023-04-11 17:53 ` Jakub Sitnicki
2023-04-13 17:51 ` Alexei Starovoitov
2023-04-13 20:06 ` John Fastabend [this message]
2023-04-07 17:16 ` [PATCH bpf v6 04/12] bpf: sockmap, handle fin correctly John Fastabend
2023-04-11 18:02 ` Jakub Sitnicki
2023-04-07 17:16 ` [PATCH bpf v6 05/12] bpf: sockmap, TCP data stall on recv before accept John Fastabend
2023-04-12 10:26 ` Jakub Sitnicki
2023-04-07 17:16 ` [PATCH bpf v6 06/12] bpf: sockmap, wake up polling after data copy John Fastabend
2023-04-13 13:00 ` Jakub Sitnicki
2023-04-07 17:16 ` [PATCH bpf v6 07/12] bpf: sockmap, incorrectly handling copied_seq John Fastabend
2023-04-07 17:16 ` [PATCH bpf v6 08/12] bpf: sockmap, pull socket helpers out of listen test for general use John Fastabend
2023-04-07 17:16 ` [PATCH bpf v6 09/12] bpf: sockmap, build helper to create connected socket pair John Fastabend
2023-04-07 17:16 ` [PATCH bpf v6 10/12] bpf: sockmap, test shutdown() correctly exits epoll and recv()=0 John Fastabend
2023-04-07 17:16 ` [PATCH bpf v6 11/12] bpf: sockmap, test FIONREAD returns correct bytes in rx buffer John Fastabend
2023-04-07 17:16 ` [PATCH bpf v6 12/12] bpf: sockmap, test FIONREAD returns correct bytes in rx buffer with drops John Fastabend
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=643860d554e5f_496ba2085d@john.notmuch \
--to=john.fastabend@gmail.com \
--cc=alexei.starovoitov@gmail.com \
--cc=andrii@kernel.org \
--cc=ast@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=edumazet@google.com \
--cc=jakub@cloudflare.com \
--cc=lmb@isovalent.com \
--cc=netdev@vger.kernel.org \
--cc=will@isovalent.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.