From: Jakub Sitnicki
To: John Fastabend
Cc: daniel@iogearbox.net, lmb@isovalent.com, edumazet@google.com,
 bpf@vger.kernel.org, netdev@vger.kernel.org, ast@kernel.org,
 andrii@kernel.org, will@isovalent.com
Subject: Re: [PATCH bpf v7 04/13] bpf: sockmap, improved check for empty queue
Date: Thu, 04 May 2023 18:53:23 +0200
Message-ID: <877cto2fr2.fsf@cloudflare.com>
In-reply-to: <20230502155159.305437-5-john.fastabend@gmail.com>
References: <20230502155159.305437-1-john.fastabend@gmail.com>
 <20230502155159.305437-5-john.fastabend@gmail.com>

On Tue, May 02, 2023 at 08:51 AM -07, John Fastabend wrote:
> We noticed some rare sk_buffs were stepping past the queue when the
> system was under memory pressure. The general theory is to skip
> enqueueing sk_buffs when it's not necessary, which is the normal case
> for a system that is properly provisioned for the task: no memory
> pressure and enough CPU assigned.
>
> But if we can't allocate memory (ENOMEM) when enqueueing the sk_buff
> into the sockmap receive queue, we push it onto a delayed workqueue to
> retry later. When a new sk_buff is received we then check if that
> queue is empty. However, there is a problem with simply checking the
> queue length. When an sk_buff is being processed from the ingress
> queue but is not yet on the sockmap msg receive queue, it's possible
> to also receive an sk_buff through the normal path. That path checks
> the ingress queue, sees a length of zero, and skips ahead of the
> packet still being processed.
>
> Previously we used the sock lock from both contexts, which made the
> problem harder to hit, but not impossible.
>
> To fix this, instead of popping the skb off the queue entirely, we
> peek it and do the copy there. This ensures that queue length checks
> see a non-zero length while the skb is being processed. Then, finally,
> when the entire skb has been copied to the user space queue or to
> another socket, we pop it off the queue. This way the queue length
> check allows bypassing the queue only after the list has been
> completely processed.
>
> To reproduce the issue we ran the NGINX compliance test with sockmap
> enabled and observed some flakes in our testing that we attributed to
> this issue.
>
> Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
> Tested-by: William Findlay
> Suggested-by: Jakub Sitnicki
> Signed-off-by: John Fastabend
> ---
>  include/linux/skmsg.h |  1 -
>  net/core/skmsg.c      | 32 ++++++++------------------------
>  2 files changed, 8 insertions(+), 25 deletions(-)
>
> diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
> index 904ff9a32ad6..054d7911bfc9 100644
> --- a/include/linux/skmsg.h
> +++ b/include/linux/skmsg.h
> @@ -71,7 +71,6 @@ struct sk_psock_link {
>  };
>
>  struct sk_psock_work_state {
> -	struct sk_buff			*skb;
>  	u32				len;
>  	u32				off;
>  };
> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index 3f95c460c261..bc5ca973400c 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -622,16 +622,12 @@ static int sk_psock_handle_skb(struct sk_psock *psock, struct sk_buff *skb,
>
>  static void sk_psock_skb_state(struct sk_psock *psock,
>  			       struct sk_psock_work_state *state,
> -			       struct sk_buff *skb,
>  			       int len, int off)
>  {
>  	spin_lock_bh(&psock->ingress_lock);
>  	if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
> -		state->skb = skb;
>  		state->len = len;
>  		state->off = off;
> -	} else {
> -		sock_drop(psock->sk, skb);
>  	}
>  	spin_unlock_bh(&psock->ingress_lock);
>  }
> @@ -642,23 +638,17 @@ static void sk_psock_backlog(struct work_struct *work)
>  	struct sk_psock *psock = container_of(dwork, struct sk_psock, work);
>  	struct sk_psock_work_state *state = &psock->work_state;
>  	struct sk_buff *skb = NULL;
> +	u32 len = 0, off = 0;
>  	bool ingress;
> -	u32 len, off;
>  	int ret;
>
>  	mutex_lock(&psock->work_mutex);
> -	if (unlikely(state->skb)) {
> -		spin_lock_bh(&psock->ingress_lock);
> -		skb = state->skb;
> +	if (unlikely(state->len)) {
>  		len = state->len;
>  		off = state->off;
> -		state->skb = NULL;
> -		spin_unlock_bh(&psock->ingress_lock);
>  	}
> -	if (skb)
> -		goto start;
>
> -	while ((skb = skb_dequeue(&psock->ingress_skb))) {
> +	while ((skb = skb_peek(&psock->ingress_skb))) {
>  		len = skb->len;
>  		off = 0;
>  		if (skb_bpf_strparser(skb)) {
> @@ -667,7 +657,6 @@ static void sk_psock_backlog(struct work_struct *work)
>  			off = stm->offset;
>  			len = stm->full_len;
>  		}
> -start:
>  		ingress = skb_bpf_ingress(skb);
>  		skb_bpf_redirect_clear(skb);
>  		do {
> @@ -677,8 +666,7 @@ static void sk_psock_backlog(struct work_struct *work)
>  						  len, ingress);
>  			if (ret <= 0) {
>  				if (ret == -EAGAIN) {
> -					sk_psock_skb_state(psock, state, skb,
> -							   len, off);
> +					sk_psock_skb_state(psock, state, len, off);
>
>  					/* Delay slightly to prioritize any
>  					 * other work that might be here.

I've been staring at this bit and I think it doesn't matter if we update
psock->work_state when SK_PSOCK_TX_ENABLED has been cleared.

But what I think we shouldn't be doing here is scheduling sk_psock_backlog
again if SK_PSOCK_TX_ENABLED got cleared by sk_psock_stop.

> @@ -689,15 +677,16 @@ static void sk_psock_backlog(struct work_struct *work)
>  				/* Hard errors break pipe and stop xmit. */
>  				sk_psock_report_error(psock, ret ? -ret : EPIPE);
>  				sk_psock_clear_state(psock, SK_PSOCK_TX_ENABLED);
> -				sock_drop(psock->sk, skb);
>  				goto end;
>  			}
>  			off += ret;
>  			len -= ret;
>  		} while (len);
>
> -		if (!ingress)
> +		skb = skb_dequeue(&psock->ingress_skb);
> +		if (!ingress) {
>  			kfree_skb(skb);
> +		}
>  	}
>  end:
>  	mutex_unlock(&psock->work_mutex);
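
Just to spell out the invariant the peek/dequeue split relies on, as I
read it (a simplified sketch of the patched loop, not the literal code):

	/* Simplified shape of the patched sk_psock_backlog() loop: the skb
	 * stays on ingress_skb until it has been fully consumed, so a
	 * concurrent check of the queue length cannot slip past a packet
	 * that is still in flight.
	 */
	while ((skb = skb_peek(&psock->ingress_skb))) {
		/* ... copy the skb to the msg receive queue or the
		 * redirect target; on -EAGAIN save len/off and return
		 * with the skb still queued ...
		 */
		skb = skb_dequeue(&psock->ingress_skb); /* remove only when done */
		if (!ingress)
			kfree_skb(skb);
	}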
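
Coming back to the rescheduling concern, what I have in mind is roughly
this (untested sketch; I'm assuming the trimmed part of the -EAGAIN branch
re-arms the backlog with schedule_delayed_work(&psock->work, 1)):

	if (ret == -EAGAIN) {
		sk_psock_skb_state(psock, state, len, off);

		/* Delay slightly to prioritize any other work that
		 * might be here. Re-arm only if TX is still enabled,
		 * so a concurrent sk_psock_stop() really stops us.
		 */
		if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED))
			schedule_delayed_work(&psock->work, 1);
		goto end;
	}

That way a stopped psock doesn't keep re-arming its own backlog work just
to save state it will never consume.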