From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Jiayuan Chen <jiayuan.chen@linux.dev>,
Alexei Starovoitov <ast@kernel.org>,
Sasha Levin <sashal@kernel.org>,
john.fastabend@gmail.com, jakub@cloudflare.com,
netdev@vger.kernel.org, bpf@vger.kernel.org
Subject: [PATCH AUTOSEL 5.15 32/33] bpf, sockmap: Fix data lost during EAGAIN retries
Date: Tue, 3 Jun 2025 21:05:23 -0400 [thread overview]
Message-ID: <20250604010524.6091-32-sashal@kernel.org> (raw)
In-Reply-To: <20250604010524.6091-1-sashal@kernel.org>
From: Jiayuan Chen <jiayuan.chen@linux.dev>
[ Upstream commit 7683167196bd727ad5f3c3fc6a9ca70f54520a81 ]
We call skb_bpf_redirect_clear() to clean _sk_redir before handling skb in
backlog, but when sk_psock_handle_skb() return EAGAIN due to sk_rcvbuf
limit, the redirect info in _sk_redir is not recovered.
Fix skb redir loss during EAGAIN retries by restoring _sk_redir
information using skb_bpf_set_redir().
Before this patch:
'''
./bench sockmap -c 2 -p 1 -a --rx-verdict-ingress
Setting up benchmark 'sockmap'...
create socket fd c1:13 p1:14 c2:15 p2:16
Benchmark 'sockmap' started.
Send Speed 1343.172 MB/s, BPF Speed 1343.238 MB/s, Rcv Speed 65.271 MB/s
Send Speed 1352.022 MB/s, BPF Speed 1352.088 MB/s, Rcv Speed 0 MB/s
Send Speed 1354.105 MB/s, BPF Speed 1354.105 MB/s, Rcv Speed 0 MB/s
Send Speed 1355.018 MB/s, BPF Speed 1354.887 MB/s, Rcv Speed 0 MB/s
'''
Due to the high send rate, the RX processing path may frequently hit the
sk_rcvbuf limit. Once triggered, incorrect _sk_redir will cause the flow
to mistakenly enter the "!ingress" path, leading to send failures.
(The Rcv speed depends on tcp_rmem).
After this patch:
'''
./bench sockmap -c 2 -p 1 -a --rx-verdict-ingress
Setting up benchmark 'sockmap'...
create socket fd c1:13 p1:14 c2:15 p2:16
Benchmark 'sockmap' started.
Send Speed 1347.236 MB/s, BPF Speed 1347.367 MB/s, Rcv Speed 65.402 MB/s
Send Speed 1353.320 MB/s, BPF Speed 1353.320 MB/s, Rcv Speed 65.536 MB/s
Send Speed 1353.186 MB/s, BPF Speed 1353.121 MB/s, Rcv Speed 65.536 MB/s
'''
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://lore.kernel.org/r/20250407142234.47591-2-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
**YES** This commit should be backported to stable kernel trees. ##
Extensive Analysis ### **Critical Bug Fix Nature** This is a **data loss
bug** in the BPF sockmap subsystem that causes silent packet drops under
load conditions. The commit adds a single line that restores essential
redirect information during EAGAIN retry scenarios. ### **Code Change
Analysis** The fix adds exactly one line at `net/core/skmsg.c:694`: ```c
skb_bpf_set_redir(skb, psock->sk, ingress); ``` This line restores
redirect information that was previously cleared by
`skb_bpf_redirect_clear(skb)` at line 684. When `sk_psock_handle_skb()`
returns `-EAGAIN` due to memory pressure or socket buffer limits, the
work is rescheduled, but without this fix, the redirect information
(`skb->_sk_redir`) would be lost. ### **Impact Without the Fix** 1.
**Silent Data Loss**: When EAGAIN occurs, subsequent retry attempts find
no redirect target (`skb_bpf_redirect_fetch()` returns NULL), causing
packets to be dropped in `sk_psock_skb_redirect()` at lines 935-938. 2.
**BPF Policy Bypass**: Packets that should be redirected according to
BPF program logic get dropped instead, effectively bypassing network
policies. 3. **Production Impact**: The commit message shows concrete
performance degradation - receive speeds dropping from 65+ MB/s to 0
MB/s under high load conditions. ### **Backport Suitability Factors**
**✅ Minimal and Surgical:** - Single line addition - No API changes or
structural modifications - Uses existing stable function
`skb_bpf_set_redir()` **✅ Self-Contained:** - Fix only affects the
specific error path (`ret == -EAGAIN`) - No dependencies on other
concurrent changes - Uses well-established APIs present across kernel
versions **✅ Clear Bug Fix Semantics:** - Restores state that was
previously cleared - Follows the established pattern: clear → try →
restore on failure - The comment explicitly states "Restore redir info
we cleared before" **✅ Critical Subsystem:** - Affects BPF sockmap, a
core networking infrastructure component - Used by service meshes,
container networking, and load balancers - Failure causes silent data
loss that's difficult to debug ### **Comparison with Similar Commits**
Looking at the provided historical examples: - Similar to commit #2 and
#4 (both marked YES) which also fix sockmap data handling issues -
Unlike commit #1, #3, and #5 (marked NO) which involved more complex
architectural changes - This fix addresses a fundamental correctness
issue rather than optimizations ### **Risk Assessment** **Low Risk:** -
The fix is in an error recovery path, so it only executes when problems
already exist - Restoring redirect information cannot make the situation
worse - The function `skb_bpf_set_redir()` is a simple state restoration
operation ### **Stable Tree Criteria Compliance** 1. **Important
bugfix**: ✅ Fixes silent data loss 2. **Minimal risk**: ✅ Single line,
error path only 3. **No new features**: ✅ Pure bug fix 4. **Confined
scope**: ✅ Limited to sockmap redirect handling 5. **User-visible
impact**: ✅ Prevents packet loss under load This commit perfectly fits
the stable tree criteria for important, low-risk bug fixes that address
user-visible problems in critical subsystems.
net/core/skmsg.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index a5947aa559837..3ae1704a8a7c1 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -663,7 +663,8 @@ static void sk_psock_backlog(struct work_struct *work)
if (ret <= 0) {
if (ret == -EAGAIN) {
sk_psock_skb_state(psock, state, len, off);
-
+ /* Restore redir info we cleared before */
+ skb_bpf_set_redir(skb, psock->sk, ingress);
/* Delay slightly to prioritize any
* other work that might be here.
*/
--
2.39.5
next prev parent reply other threads:[~2025-06-04 1:06 UTC|newest]
Thread overview: 35+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-06-04 1:04 [PATCH AUTOSEL 5.15 01/33] net: macb: Check return value of dma_set_mask_and_coherent() Sasha Levin
2025-06-04 1:04 ` [PATCH AUTOSEL 5.15 02/33] tipc: use kfree_sensitive() for aead cleanup Sasha Levin
2025-06-04 1:04 ` [PATCH AUTOSEL 5.15 03/33] i2c: designware: Invoke runtime suspend on quick slave re-registration Sasha Levin
2025-06-04 1:04 ` [PATCH AUTOSEL 5.15 04/33] emulex/benet: correct command version selection in be_cmd_get_stats() Sasha Levin
2025-06-04 1:04 ` [PATCH AUTOSEL 5.15 05/33] wifi: mt76: mt76x2: Add support for LiteOn WN4516R,WN4519R Sasha Levin
2025-06-04 1:04 ` [PATCH AUTOSEL 5.15 06/33] sctp: Do not wake readers in __sctp_write_space() Sasha Levin
2025-06-04 1:04 ` [PATCH AUTOSEL 5.15 07/33] cpufreq: scmi: Skip SCMI devices that aren't used by the CPUs Sasha Levin
2025-06-04 1:04 ` [PATCH AUTOSEL 5.15 08/33] i2c: npcm: Add clock toggle recovery Sasha Levin
2025-06-04 1:05 ` [PATCH AUTOSEL 5.15 09/33] net: dlink: add synchronization for stats update Sasha Levin
2025-06-04 1:05 ` [PATCH AUTOSEL 5.15 10/33] tcp: always seek for minimal rtt in tcp_rcv_rtt_update() Sasha Levin
2025-06-04 1:05 ` [PATCH AUTOSEL 5.15 11/33] tcp: fix initial tp->rcvq_space.space value for passive TS enabled flows Sasha Levin
2025-06-04 1:05 ` [PATCH AUTOSEL 5.15 12/33] ipv4/route: Use this_cpu_inc() for stats on PREEMPT_RT Sasha Levin
2025-06-04 1:05 ` [PATCH AUTOSEL 5.15 13/33] openvswitch: Stricter validation for the userspace action Sasha Levin
2025-06-04 1:05 ` [PATCH AUTOSEL 5.15 14/33] net: atlantic: generate software timestamp just before the doorbell Sasha Levin
2025-06-04 1:05 ` [PATCH AUTOSEL 5.15 15/33] pinctrl: armada-37xx: propagate error from armada_37xx_pmx_set_by_name() Sasha Levin
2025-06-04 1:05 ` [PATCH AUTOSEL 5.15 16/33] pinctrl: armada-37xx: propagate error from armada_37xx_gpio_get_direction() Sasha Levin
2025-06-04 1:05 ` [PATCH AUTOSEL 5.15 17/33] pinctrl: armada-37xx: propagate error from armada_37xx_pmx_gpio_set_direction() Sasha Levin
2025-06-04 1:05 ` [PATCH AUTOSEL 5.15 18/33] pinctrl: armada-37xx: propagate error from armada_37xx_gpio_get() Sasha Levin
2025-06-04 1:05 ` [PATCH AUTOSEL 5.15 19/33] net: mlx4: add SOF_TIMESTAMPING_TX_SOFTWARE flag when getting ts info Sasha Levin
2025-06-04 1:05 ` [PATCH AUTOSEL 5.15 20/33] wifi: mac80211: do not offer a mesh path if forwarding is disabled Sasha Levin
2025-06-04 1:05 ` [PATCH AUTOSEL 5.15 21/33] clk: rockchip: rk3036: mark ddrphy as critical Sasha Levin
2025-06-04 1:05 ` Sasha Levin
2025-06-04 1:05 ` [PATCH AUTOSEL 5.15 22/33] libbpf: Add identical pointer detection to btf_dedup_is_equiv() Sasha Levin
2025-06-04 1:05 ` [PATCH AUTOSEL 5.15 23/33] scsi: lpfc: Fix lpfc_check_sli_ndlp() handling for GEN_REQUEST64 commands Sasha Levin
2025-06-04 1:05 ` [PATCH AUTOSEL 5.15 24/33] iommu/amd: Ensure GA log notifier callbacks finish running before module unload Sasha Levin
2025-06-04 1:05 ` [PATCH AUTOSEL 5.15 25/33] net: bridge: mcast: re-implement br_multicast_{enable, disable}_port functions Sasha Levin
2025-06-04 1:05 ` [PATCH AUTOSEL 5.15 26/33] vxlan: Do not treat dst cache initialization errors as fatal Sasha Levin
2025-06-04 1:05 ` [PATCH AUTOSEL 5.15 27/33] software node: Correct a OOB check in software_node_get_reference_args() Sasha Levin
2025-06-04 1:05 ` [PATCH AUTOSEL 5.15 28/33] pinctrl: mcp23s08: Reset all pins to input at probe Sasha Levin
2025-06-04 1:05 ` [PATCH AUTOSEL 5.15 29/33] scsi: lpfc: Use memcpy() for BIOS version Sasha Levin
2025-06-04 1:05 ` [PATCH AUTOSEL 5.15 30/33] sock: Correct error checking condition for (assign|release)_proto_idx() Sasha Levin
2025-06-04 1:05 ` [Intel-wired-lan] [PATCH AUTOSEL 5.15 31/33] i40e: fix MMIO write access to an invalid page in i40e_clear_hw Sasha Levin
2025-06-04 1:05 ` Sasha Levin
2025-06-04 1:05 ` Sasha Levin [this message]
2025-06-04 1:05 ` [PATCH AUTOSEL 5.15 33/33] octeontx2-pf: Add error log forcn10k_map_unmap_rq_policer() Sasha Levin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250604010524.6091-32-sashal@kernel.org \
--to=sashal@kernel.org \
--cc=ast@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=jakub@cloudflare.com \
--cc=jiayuan.chen@linux.dev \
--cc=john.fastabend@gmail.com \
--cc=netdev@vger.kernel.org \
--cc=patches@lists.linux.dev \
--cc=stable@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.