From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: stable@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
patches@lists.linux.dev, Jakub Sitnicki <jakub@cloudflare.com>,
John Fastabend <john.fastabend@gmail.com>,
Daniel Borkmann <daniel@iogearbox.net>,
William Findlay <will@isovalent.com>,
Sasha Levin <sashal@kernel.org>
Subject: [PATCH 6.3 32/45] bpf, sockmap: Improved check for empty queue
Date: Thu, 1 Jun 2023 14:21:28 +0100 [thread overview]
Message-ID: <20230601131940.134702774@linuxfoundation.org> (raw)
In-Reply-To: <20230601131938.702671708@linuxfoundation.org>
From: John Fastabend <john.fastabend@gmail.com>
[ Upstream commit 405df89dd52cbcd69a3cd7d9a10d64de38f854b2 ]
We noticed some rare sk_buffs were stepping past the queue when system was
under memory pressure. The general theory is to skip enqueueing
sk_buffs when its not necessary which is the normal case with a system
that is properly provisioned for the task, no memory pressure and enough
cpu assigned.
But, if we can't allocate memory due to an ENOMEM error when enqueueing
the sk_buff into the sockmap receive queue we push it onto a delayed
workqueue to retry later. When a new sk_buff is received we then check
if that queue is empty. However, there is a problem with simply checking
the queue length. When a sk_buff is being processed from the ingress queue
but not yet on the sockmap msg receive queue its possible to also recv
a sk_buff through normal path. It will check the ingress queue which is
zero and then skip ahead of the pkt being processed.
Previously we used sock lock from both contexts which made the problem
harder to hit, but not impossible.
To fix instead of popping the skb from the queue entirely we peek the
skb from the queue and do the copy there. This ensures checks to the
queue length are non-zero while skb is being processed. Then finally
when the entire skb has been copied to user space queue or another
socket we pop it off the queue. This way the queue length check allows
bypassing the queue only after the list has been completely processed.
To reproduce issue we run NGINX compliance test with sockmap running and
observe some flakes in our testing that we attributed to this issue.
Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: William Findlay <will@isovalent.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230523025618.113937-5-john.fastabend@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
include/linux/skmsg.h | 1 -
net/core/skmsg.c | 32 ++++++++------------------------
2 files changed, 8 insertions(+), 25 deletions(-)
diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 904ff9a32ad61..054d7911bfc9f 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -71,7 +71,6 @@ struct sk_psock_link {
};
struct sk_psock_work_state {
- struct sk_buff *skb;
u32 len;
u32 off;
};
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 76ff15f8bb06e..bcd45a99a3db3 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -622,16 +622,12 @@ static int sk_psock_handle_skb(struct sk_psock *psock, struct sk_buff *skb,
static void sk_psock_skb_state(struct sk_psock *psock,
struct sk_psock_work_state *state,
- struct sk_buff *skb,
int len, int off)
{
spin_lock_bh(&psock->ingress_lock);
if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
- state->skb = skb;
state->len = len;
state->off = off;
- } else {
- sock_drop(psock->sk, skb);
}
spin_unlock_bh(&psock->ingress_lock);
}
@@ -642,23 +638,17 @@ static void sk_psock_backlog(struct work_struct *work)
struct sk_psock *psock = container_of(dwork, struct sk_psock, work);
struct sk_psock_work_state *state = &psock->work_state;
struct sk_buff *skb = NULL;
+ u32 len = 0, off = 0;
bool ingress;
- u32 len, off;
int ret;
mutex_lock(&psock->work_mutex);
- if (unlikely(state->skb)) {
- spin_lock_bh(&psock->ingress_lock);
- skb = state->skb;
+ if (unlikely(state->len)) {
len = state->len;
off = state->off;
- state->skb = NULL;
- spin_unlock_bh(&psock->ingress_lock);
}
- if (skb)
- goto start;
- while ((skb = skb_dequeue(&psock->ingress_skb))) {
+ while ((skb = skb_peek(&psock->ingress_skb))) {
len = skb->len;
off = 0;
if (skb_bpf_strparser(skb)) {
@@ -667,7 +657,6 @@ static void sk_psock_backlog(struct work_struct *work)
off = stm->offset;
len = stm->full_len;
}
-start:
ingress = skb_bpf_ingress(skb);
skb_bpf_redirect_clear(skb);
do {
@@ -677,8 +666,7 @@ static void sk_psock_backlog(struct work_struct *work)
len, ingress);
if (ret <= 0) {
if (ret == -EAGAIN) {
- sk_psock_skb_state(psock, state, skb,
- len, off);
+ sk_psock_skb_state(psock, state, len, off);
/* Delay slightly to prioritize any
* other work that might be here.
@@ -690,15 +678,16 @@ static void sk_psock_backlog(struct work_struct *work)
/* Hard errors break pipe and stop xmit. */
sk_psock_report_error(psock, ret ? -ret : EPIPE);
sk_psock_clear_state(psock, SK_PSOCK_TX_ENABLED);
- sock_drop(psock->sk, skb);
goto end;
}
off += ret;
len -= ret;
} while (len);
- if (!ingress)
+ skb = skb_dequeue(&psock->ingress_skb);
+ if (!ingress) {
kfree_skb(skb);
+ }
}
end:
mutex_unlock(&psock->work_mutex);
@@ -791,11 +780,6 @@ static void __sk_psock_zap_ingress(struct sk_psock *psock)
skb_bpf_redirect_clear(skb);
sock_drop(psock->sk, skb);
}
- kfree_skb(psock->work_state.skb);
- /* We null the skb here to ensure that calls to sk_psock_backlog
- * do not pick up the free'd skb.
- */
- psock->work_state.skb = NULL;
__sk_psock_purge_ingress_msg(psock);
}
@@ -814,7 +798,6 @@ void sk_psock_stop(struct sk_psock *psock)
spin_lock_bh(&psock->ingress_lock);
sk_psock_clear_state(psock, SK_PSOCK_TX_ENABLED);
sk_psock_cork_free(psock);
- __sk_psock_zap_ingress(psock);
spin_unlock_bh(&psock->ingress_lock);
}
@@ -829,6 +812,7 @@ static void sk_psock_destroy(struct work_struct *work)
sk_psock_done_strp(psock);
cancel_delayed_work_sync(&psock->work);
+ __sk_psock_zap_ingress(psock);
mutex_destroy(&psock->work_mutex);
psock_progs_drop(&psock->progs);
--
2.39.2
next prev parent reply other threads:[~2023-06-01 13:27 UTC|newest]
Thread overview: 57+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-06-01 13:20 [PATCH 6.3 00/45] 6.3.6-rc1 review Greg Kroah-Hartman
2023-06-01 13:20 ` [PATCH 6.3 01/45] firmware: arm_scmi: Fix incorrect alloc_workqueue() invocation Greg Kroah-Hartman
2023-06-01 13:20 ` [PATCH 6.3 02/45] firmware: arm_ffa: Fix usage of partition info get count flag Greg Kroah-Hartman
2023-06-01 13:20 ` [PATCH 6.3 03/45] spi: spi-geni-qcom: Select FIFO mode for chip select Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.3 04/45] coresight: perf: Release Coresight path when alloc trace id failed Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.3 05/45] ARM: dts: imx6ull-dhcor: Set and limit the mode for PMIC buck 1, 2 and 3 Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.3 06/45] selftests/bpf: Fix pkg-config call building sign-file Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.3 07/45] power: supply: rt9467: Fix passing zero to dev_err_probe Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.3 08/45] platform/x86/amd/pmf: Fix CnQF and auto-mode after resume Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.3 09/45] bpf: netdev: init the offload table earlier Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.3 10/45] gpiolib: fix allocation of mixed dynamic/static GPIOs Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.3 11/45] tls: rx: device: fix checking decryption status Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.3 12/45] tls: rx: strp: set the skb->len of detached / CoWed skbs Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.3 13/45] tls: rx: strp: fix determining record length in copy mode Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.3 14/45] tls: rx: strp: force mixed decrypted records into " Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.3 15/45] tls: rx: strp: factor out copying skb data Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.3 16/45] tls: rx: strp: preserve decryption status of skbs when needed Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.3 17/45] tls: rx: strp: dont use GFP_KERNEL in softirq context Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.3 18/45] net: fec: add dma_wmb to ensure correct descriptor values Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.3 19/45] cxl/port: Fix NULL pointer access in devm_cxl_add_port() Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.3 20/45] ASoC: Intel: avs: Fix module lookup Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.3 21/45] drm/i915: Move shared DPLL disabling into CRTC disable hook Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.3 22/45] drm/i915: Disable DPLLs before disconnecting the TC PHY Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.3 23/45] drm/i915: Fix PIPEDMC disabling for a bigjoiner configuration Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.3 24/45] net/mlx5e: TC, Fix using eswitch mapping in nic mode Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.3 25/45] Revert "net/mlx5: Expose steering dropped packets counter" Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.3 26/45] Revert "net/mlx5: Expose vnic diagnostic counters for eswitch managed vports" Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.3 27/45] net/mlx5: E-switch, Devcom, sync devcom events and devcom comp register Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.3 28/45] gpio-f7188x: fix chip name and pin count on Nuvoton chip Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.3 29/45] bpf, sockmap: Pass skb ownership through read_skb Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.3 30/45] bpf, sockmap: Convert schedule_work into delayed_work Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.3 31/45] bpf, sockmap: Reschedule is now done through backlog Greg Kroah-Hartman
2023-06-01 13:21 ` Greg Kroah-Hartman [this message]
2023-06-01 13:21 ` [PATCH 6.3 33/45] bpf, sockmap: Handle fin correctly Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.3 34/45] bpf, sockmap: TCP data stall on recv before accept Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.3 35/45] bpf, sockmap: Wake up polling after data copy Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.3 36/45] bpf, sockmap: Incorrectly handling copied_seq Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.3 37/45] blk-wbt: fix that wbt cant be disabled by default Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.3 38/45] blk-mq: fix race condition in active queue accounting Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.3 39/45] vfio/type1: check pfn valid before converting to struct page Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.3 40/45] cpufreq: amd-pstate: Remove fast_switch_possible flag from active driver Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.3 41/45] net: phy: mscc: enable VSC8501/2 RGMII RX clock Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.3 42/45] bluetooth: Add cmd validity checks at the start of hci_sock_ioctl() Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.3 43/45] cpufreq: amd-pstate: Update policy->cur in amd_pstate_adjust_perf() Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.3 44/45] cpufreq: amd-pstate: Add ->fast_switch() callback Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.3 45/45] netfilter: ctnetlink: Support offloaded conntrack entry deletion Greg Kroah-Hartman
2023-06-01 20:27 ` [PATCH 6.3 00/45] 6.3.6-rc1 review Shuah Khan
2023-06-01 20:27 ` Florian Fainelli
2023-06-02 6:15 ` Ron Economos
2023-06-02 7:01 ` Conor Dooley
2023-06-02 8:45 ` Jon Hunter
2023-06-02 9:02 ` Bagas Sanjaya
2023-06-02 9:44 ` Naresh Kamboju
2023-06-02 13:29 ` Markus Reichelt
2023-06-02 16:56 ` Justin Forbes
2023-06-02 22:36 ` Guenter Roeck
2023-06-05 9:19 ` Chris Paterson
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20230601131940.134702774@linuxfoundation.org \
--to=gregkh@linuxfoundation.org \
--cc=daniel@iogearbox.net \
--cc=jakub@cloudflare.com \
--cc=john.fastabend@gmail.com \
--cc=patches@lists.linux.dev \
--cc=sashal@kernel.org \
--cc=stable@vger.kernel.org \
--cc=will@isovalent.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).