* [PATCH v12 01/12] x86/bhi: x86/vmscape: Move LFENCE out of clear_bhb_loop()
From: Pawan Gupta @ 2026-06-23 17:32 UTC
To: x86, Jon Kohler, Nikolay Borisov, H. Peter Anvin, Josh Poimboeuf,
David Kaplan, Sean Christopherson, Borislav Petkov, Dave Hansen,
Peter Zijlstra, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, KP Singh, Jiri Olsa, David S. Miller,
David Laight, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
David Ahern, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, John Fastabend, Stanislav Fomichev, Hao Luo,
Paolo Bonzini, Jonathan Corbet, Jason Baron, Alice Ryhl,
Steven Rostedt, Ard Biesheuvel, Shuah Khan
Cc: linux-kernel, kvm, Asit Mallick, Tao Zhang, bpf, netdev,
linux-doc
In-Reply-To: <20260622-vmscape-bhb-v12-0-76cbda0ae3e5@linux.intel.com>
Currently, the BHB clearing sequence is followed by an LFENCE to prevent
subsequent indirect branches from executing transiently before the BHB is
cleared. However, the LFENCE barrier could be unnecessary in certain cases,
for example when the kernel is using the BHI_DIS_S mitigation and BHB
clearing is only needed for userspace. In such cases, the LFENCE is redundant
because ring transitions would provide the necessary serialization.
Below is a quick recap of BHI mitigation options:
  On Alder Lake and newer:
    BHI_DIS_S: Hardware control to mitigate BHI in ring0. This has low
               performance overhead.
    Long loop: Alternatively, a longer version of the BHB clearing sequence
               can be used to mitigate BHI. It can also be used to mitigate
               the BHI variant of VMSCAPE. This is not yet implemented in
               Linux.
  On older CPUs:
    Short loop: Clears BHB at kernel entry and VMexit. The "Long loop" is
                effective on older CPUs as well, but should be avoided
                because of unnecessary overhead.
On Alder Lake and newer CPUs, eIBRS isolates the indirect targets between
guest and host. But when affected by the BHI variant of VMSCAPE, a guest's
branch history may still influence indirect branches in userspace. This
also means the big hammer IBPB could be replaced with a cheaper option that
clears the BHB at exit-to-userspace after a VMexit.
In preparation for adding support for the BHB sequence (without LFENCE)
on newer CPUs, move the LFENCE to the caller side after clear_bhb_loop() is
executed. Allow callers to decide whether they need the LFENCE or not. This
adds a few extra bytes to the call sites, but it obviates the need for
multiple variants of clear_bhb_loop().
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Jon Kohler <jon@nutanix.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
---
arch/x86/entry/entry_64.S | 5 ++++-
arch/x86/include/asm/nospec-branch.h | 4 ++--
arch/x86/net/bpf_jit_comp.c | 2 ++
3 files changed, 8 insertions(+), 3 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 42447b1e1dff..3a180a36ca0e 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1528,6 +1528,9 @@ SYM_CODE_END(rewind_stack_and_make_dead)
* refactored in the future if needed. The .skips are for safety, to ensure
* that all RETs are in the second half of a cacheline to mitigate Indirect
* Target Selection, rather than taking the slowpath via its_return_thunk.
+ *
+ * Note, callers should use a speculation barrier like LFENCE immediately after
+ * a call to this function to ensure BHB is cleared before indirect branches.
*/
SYM_FUNC_START(clear_bhb_loop)
ANNOTATE_NOENDBR
@@ -1562,7 +1565,7 @@ SYM_FUNC_START(clear_bhb_loop)
sub $1, %ecx
jnz 1b
.Lret2: RET
-5: lfence
+5:
pop %rbp
RET
SYM_FUNC_END(clear_bhb_loop)
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index 4f4b5e8a1574..70b377fcbc1c 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -331,11 +331,11 @@
#ifdef CONFIG_X86_64
.macro CLEAR_BRANCH_HISTORY
- ALTERNATIVE "", "call clear_bhb_loop", X86_FEATURE_CLEAR_BHB_LOOP
+ ALTERNATIVE "", "call clear_bhb_loop; lfence", X86_FEATURE_CLEAR_BHB_LOOP
.endm
.macro CLEAR_BRANCH_HISTORY_VMEXIT
- ALTERNATIVE "", "call clear_bhb_loop", X86_FEATURE_CLEAR_BHB_VMEXIT
+ ALTERNATIVE "", "call clear_bhb_loop; lfence", X86_FEATURE_CLEAR_BHB_VMEXIT
.endm
#else
#define CLEAR_BRANCH_HISTORY
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index ea9e707e8abf..f58ff2891d7d 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1624,6 +1624,8 @@ static int emit_spectre_bhb_barrier(u8 **pprog, u8 *ip,
if (emit_call(&prog, func, ip))
return -EINVAL;
+ /* Don't speculate past this until BHB is cleared */
+ EMIT_LFENCE();
EMIT1(0x59); /* pop rcx */
EMIT1(0x58); /* pop rax */
}
--
2.34.1
* [PATCH v12 00/12] VMSCAPE optimization for BHI variant
From: Pawan Gupta @ 2026-06-23 17:32 UTC
To: x86, Jon Kohler, Nikolay Borisov, H. Peter Anvin, Josh Poimboeuf,
David Kaplan, Sean Christopherson, Borislav Petkov, Dave Hansen,
Peter Zijlstra, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, KP Singh, Jiri Olsa, David S. Miller,
David Laight, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
David Ahern, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, John Fastabend, Stanislav Fomichev, Hao Luo,
Paolo Bonzini, Jonathan Corbet, Jason Baron, Alice Ryhl,
Steven Rostedt, Ard Biesheuvel, Shuah Khan
Cc: linux-kernel, kvm, Asit Mallick, Tao Zhang, bpf, netdev,
linux-doc
v12:
- Applied tags from Sean and Borisov.
- Rebased to v7.1
It would be nice to have some more review tags to help this get merged
sooner. If you have already reviewed v11, this version should be easy; it is
mostly a rebase.
v11: https://lore.kernel.org/r/20260422-vmscape-bhb-v11-0-b18e0cf32af4@linux.intel.com
- Use ifdef EXPORT_SYMBOL_FOR_KVM guard for EXPORT_STATIC_CALL_FOR_KVM()
definition. It is not practical to define one but not the other. (Sean)
- Collected tags.
v10: https://lore.kernel.org/r/20260414-vmscape-bhb-v10-0-efa924abae5f@linux.intel.com
- Add patches to define EXPORT_STATIC_CALL_FOR_MODULES() and
EXPORT_STATIC_CALL_FOR_KVM(), so that vmscape_predictor_flush static key
is only accessible to KVM and not to other kernel modules. (PeterZ)
(Borisov earlier objected to exporting the static key to all modules, but
now the static key is only exported to KVM. I guess that resolves the
concern.)
- Avoid an explicit call to vmscape_mitigation_enabled() and instead use
static_call_query() in VMexit hot path. (Sean)
- Drop vmscape_mitigation_enabled(), as it is no longer needed.
- Rebased to v7.0
v9: https://lore.kernel.org/r/20260402-vmscape-bhb-v9-0-94d16bc29774@linux.intel.com
- Use global variables for BHB loop counters instead of ALTERNATIVE-based
approach. (Dave & others)
- Use 32-bit registers (%eax/%ecx) for loop counters, loaded via movzbl
from 8-bit globals. 8-bit registers (e.g. %ah in the inner loop) caused a
performance regression on certain CPUs due to partial-register stalls. (David Laight)
- Let BPF save/restore %rax/%rcx as in the original implementation, since
it is the only caller that needs these registers preserved across the
BHB clearing sequence.
- Drop Reviewed-by from patch 2/10 as the implementation changed significantly.
- Apply Tested-by from Jon Kohler to the series (except patch 2/10).
- Fix commit message grammar. (Borislav)
- Rebased to v7.0-rc6.
v8: https://lore.kernel.org/r/20260324-vmscape-bhb-v8-0-68bb524b3ab9@linux.intel.com
- Use helper in KVM to convey the mitigation status. (PeterZ/Borisov)
- Fix the documentation for default vmscape mitigation. (BPF bot)
- Remove the stray lines in bug.c (BPF bot).
- Updated commit messages and comments.
- Rebased to v7.0-rc5.
v7: https://lore.kernel.org/r/20260319-vmscape-bhb-v7-0-b76a777a98af@linux.intel.com
- s/This allows/Allow/ and s/This does adds/This adds/ in patch 1/10 commit
message (Borislav).
- Minimize register usage in BHB clearing seq. (David Laight)
- Instead of separate ecx/eax counters, use al/ah.
- Adjust the alignment of RET due to register size change.
- save/restore rax in the seq itself.
- Remove the save/restore of rax/rcx for BPF callers.
- Rename clear_bhb_loop() to clear_bhb_loop_nofence() to make it
obvious that the LFENCE is not part of the sequence (Borislav).
- Fix Kconfig: s/select/depends on/ HAVE_STATIC_CALL (PeterZ).
- Rebased to v7.0-rc4.
v6: https://lore.kernel.org/r/20251201-vmscape-bhb-v6-0-d610dd515714@linux.intel.com
- Remove semicolon at the end of asm in ALTERNATIVE (Uros).
- Fix build warning in vmscape_select_mitigation() (LKP).
- Rebased to v6.18.
v5: https://lore.kernel.org/r/20251126-vmscape-bhb-v5-2-02d66e423b00@linux.intel.com
- For BHI seq, limit runtime-patching to loop counts only (Dave).
Dropped 2 patches that moved the BHB seq to a macro.
- Remove redundant switch cases in vmscape_select_mitigation() (Nikolay).
- Improve commit message (Nikolay).
- Collected tags.
v4: https://lore.kernel.org/r/20251119-vmscape-bhb-v4-0-1adad4e69ddc@linux.intel.com
- Move LFENCE to the callsite, out of clear_bhb_loop(). (Dave)
- Make clear_bhb_loop() work for larger BHB. (Dave)
This now uses hardware enumeration to determine the BHB size to clear.
- Use write_ibpb() instead of indirect_branch_prediction_barrier() when
IBPB is known to be available. (Dave)
- Use static_call() to simplify mitigation at exit-to-userspace. (Dave)
- Refactor vmscape_select_mitigation(). (Dave)
- Fix vmscape=on which was wrongly behaving as AUTO. (Dave)
- Split the patches. (Dave)
- Patch 1-4 prepares for making the sequence flexible for VMSCAPE use.
- Patch 5 trivial rename of variable.
- Patch 6-8 prepares for deploying BHB mitigation for VMSCAPE.
- Patch 9 deploys the mitigation.
- Patch 10-11 fixes ON Vs AUTO mode.
v3: https://lore.kernel.org/r/20251027-vmscape-bhb-v3-0-5793c2534e93@linux.intel.com
- s/x86_pred_flush_pending/x86_predictor_flush_exit_to_user/ (Sean).
- Removed IBPB & BHB-clear mutual exclusion at exit-to-userspace.
- Collected tags.
v2: https://lore.kernel.org/r/20251015-vmscape-bhb-v2-0-91cbdd9c3a96@linux.intel.com
- Added check for IBPB feature in vmscape_select_mitigation(). (David)
- s/vmscape=auto/vmscape=on/ (David)
- Added patch to remove LFENCE from VMSCAPE BHB-clear sequence.
- Rebased to v6.18-rc1.
v1: https://lore.kernel.org/r/20250924-vmscape-bhb-v1-0-da51f0e1934d@linux.intel.com
Hi All,
These patches aim to improve the performance of a recent mitigation for
the VMSCAPE[1] vulnerability. This improvement is relevant for the BHI
variant of VMSCAPE, which affects Alder Lake and newer processors.
The current mitigation approach uses IBPB on kvm-exit-to-userspace for the
entire range of affected CPUs. This is overkill for CPUs that are only
affected by the BHI variant. On such CPUs, clearing the branch history is
sufficient for VMSCAPE, and also more apt as the underlying issue is due to
poisoned branch history.
Below is the iPerf data for transfer between guest and host, comparing IBPB
and BHB-clear mitigation. BHB-clear shows a performance improvement over IBPB
in most cases.
Platform: Emerald Rapids
Baseline: vmscape=off
Target: IBPB at VMexit-to-userspace Vs the new BHB-clear at
VMexit-to-userspace mitigation (both compared against baseline).
(pN = N parallel connections)
| iPerf user-net | IBPB | BHB Clear |
|----------------|---------|-----------|
| UDP 1-vCPU_p1 | -12.5% | 1.3% |
| TCP 1-vCPU_p1 | -10.4% | -1.5% |
| TCP 1-vCPU_p1 | -7.5% | -3.0% |
| UDP 4-vCPU_p16 | -3.7% | -3.7% |
| TCP 4-vCPU_p4 | -2.9% | -1.4% |
| UDP 4-vCPU_p4 | -0.6% | 0.0% |
| TCP 4-vCPU_p4 | 3.5% | 0.0% |
| iPerf bridge-net | IBPB | BHB Clear |
|------------------|---------|-----------|
| UDP 1-vCPU_p1 | -9.4% | -0.4% |
| TCP 1-vCPU_p1 | -3.9% | -0.5% |
| UDP 4-vCPU_p16 | -2.2% | -3.8% |
| TCP 4-vCPU_p4 | -1.0% | -1.0% |
| TCP 4-vCPU_p4 | 0.5% | 0.5% |
| UDP 4-vCPU_p4 | 0.0% | 0.9% |
| TCP 1-vCPU_p1 | 0.0% | 0.9% |
| iPerf vhost-net | IBPB | BHB Clear |
|-----------------|---------|-----------|
| UDP 1-vCPU_p1 | -4.3% | 1.0% |
| TCP 1-vCPU_p1 | -3.8% | -0.5% |
| TCP 1-vCPU_p1 | -2.7% | -0.7% |
| UDP 4-vCPU_p16 | -0.7% | -2.2% |
| TCP 4-vCPU_p4 | -0.4% | 0.8% |
| UDP 4-vCPU_p4 | 0.4% | -0.7% |
| TCP 4-vCPU_p4 | 0.0% | 0.6% |
[1] https://comsec.ethz.ch/research/microarch/vmscape-exposing-and-exploiting-incomplete-branch-predictor-isolation-in-cloud-environments/
---
Pawan Gupta (12):
x86/bhi: x86/vmscape: Move LFENCE out of clear_bhb_loop()
x86/bhi: Make clear_bhb_loop() effective on newer CPUs
x86/bhi: Rename clear_bhb_loop() to clear_bhb_loop_nofence()
x86/vmscape: Rename x86_ibpb_exit_to_user to x86_predictor_flush_exit_to_user
x86/vmscape: Move mitigation selection to a switch()
x86/vmscape: Use write_ibpb() instead of indirect_branch_prediction_barrier()
static_call: Define EXPORT_STATIC_CALL_FOR_MODULES()
KVM: Define EXPORT_STATIC_CALL_FOR_KVM()
x86/vmscape: Use static_call() for predictor flush
x86/vmscape: Deploy BHB clearing mitigation
x86/vmscape: Resolve conflict between attack-vectors and vmscape=force
x86/vmscape: Add cmdline vmscape=on to override attack vector controls
Documentation/admin-guide/hw-vuln/vmscape.rst | 15 ++++-
Documentation/admin-guide/kernel-parameters.txt | 6 +-
arch/x86/Kconfig | 1 +
arch/x86/entry/entry_64.S | 21 ++++---
arch/x86/include/asm/cpufeatures.h | 2 +-
arch/x86/include/asm/entry-common.h | 13 ++--
arch/x86/include/asm/kvm_types.h | 1 +
arch/x86/include/asm/nospec-branch.h | 15 +++--
arch/x86/kernel/cpu/bugs.c | 84 +++++++++++++++++++++----
arch/x86/kvm/x86.c | 4 +-
arch/x86/net/bpf_jit_comp.c | 4 +-
include/linux/kvm_types.h | 10 ++-
include/linux/static_call.h | 8 +++
13 files changed, 147 insertions(+), 37 deletions(-)
---
base-commit: 8cd9520d35a6c38db6567e97dd93b1f11f185dc6
change-id: 20250916-vmscape-bhb-d7d469977f2f
Best regards,
--
Pawan
* [PATCH v2 net 2/2] tipc: avoid busy looping in tipc_exit_net()
From: Eric Dumazet @ 2026-06-23 17:30 UTC
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Kuniyuki Iwashima, Xin Long, Jon Maloy,
tipc-discussion, netdev, eric.dumazet, Eric Dumazet
In-Reply-To: <20260623173030.2925059-1-edumazet@google.com>
Blamed commit introduced a busy-wait loop in tipc_exit_net()
to wait for pending UDP bearer cleanup works to complete:
	while (atomic_read(&tn->wq_count))
		cond_resched();
This loop can busy-wait for a long time if cond_resched() is a NOP. This
typically happens if the netns exit is executed by a high priority task,
or under kernels configured without preemption (CONFIG_PREEMPT_NONE). In
such cases, it wastes CPU cycles and can lead to soft lockups.
Fix this by replacing the busy loop with wait_var_event(), allowing the
thread to sleep properly until the work queue count reaches zero.
Accordingly, update cleanup_bearer() to use atomic_dec_and_test() and
wake_up_var() to wake up the waiter when the count drops to zero.
This uses the global wait queue hash table, avoiding the need to bloat
struct tipc_net with a wait_queue_head_t. The atomic_dec_and_test()
provides the necessary memory barrier to ensure the wakeup is not missed.
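The wake-up condition described above can be illustrated outside the kernel. The following is a minimal userspace sketch in C11, not kernel code: dec_and_test() and count_wakeups() are hypothetical helpers that only mimic the return-value contract of atomic_dec_and_test(), namely that it reports true only for the decrement that takes the counter to zero. Sleeping and wake_up_var() themselves are not modeled.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the kernel's atomic_dec_and_test():
 * decrement *v and report whether the new value is zero.
 */
static bool dec_and_test(atomic_int *v)
{
	/* atomic_fetch_sub() returns the value held before the subtraction */
	return atomic_fetch_sub(v, 1) == 1;
}

/* Model 'pending' cleanup works finishing one by one and count how many
 * of them would have issued the wake-up.
 */
static int count_wakeups(int pending)
{
	atomic_int wq_count;
	int wakeups = 0;

	atomic_init(&wq_count, pending);
	for (int i = 0; i < pending; i++)
		if (dec_and_test(&wq_count))
			wakeups++;
	return wakeups;
}
```

In this model only the last of several finishing works triggers the wake-up, which is why the waiter does not need to be woken on every decrement.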
Fixes: 04c26faa51d1 ("tipc: wait and exit until all work queues are done")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Xin Long <lucien.xin@gmail.com>
Cc: Jon Maloy <jmaloy@redhat.com>
Cc: tipc-discussion@lists.sourceforge.net
---
net/tipc/core.c | 4 ++--
net/tipc/udp_media.c | 4 +++-
2 files changed, 5 insertions(+), 3 deletions(-)
diff --git a/net/tipc/core.c b/net/tipc/core.c
index 1ddecea1df6e9100334c47a28ff6c065292fb9ad..315975c3be8186784e9c44c9ff69d62c17ffd4b9 100644
--- a/net/tipc/core.c
+++ b/net/tipc/core.c
@@ -45,6 +45,7 @@
#include "crypto.h"
#include <linux/module.h>
+#include <linux/wait_bit.h>
/* configurable TIPC parameters */
unsigned int tipc_net_id __read_mostly;
@@ -118,8 +119,7 @@ static void __net_exit tipc_exit_net(struct net *net)
#ifdef CONFIG_TIPC_CRYPTO
tipc_crypto_stop(&tipc_net(net)->crypto_tx);
#endif
- while (atomic_read(&tn->wq_count))
- cond_resched();
+ wait_var_event(&tn->wq_count, atomic_read(&tn->wq_count) == 0);
}
static void __net_exit tipc_pernet_pre_exit(struct net *net)
diff --git a/net/tipc/udp_media.c b/net/tipc/udp_media.c
index 66f3cb87a0aaaac8f40e8f237ab9a44d539b1cd8..62ae7f5b58409c89798c915dee752ac42487581f 100644
--- a/net/tipc/udp_media.c
+++ b/net/tipc/udp_media.c
@@ -40,6 +40,7 @@
#include <linux/igmp.h>
#include <linux/kernel.h>
#include <linux/workqueue.h>
+#include <linux/wait_bit.h>
#include <linux/list.h>
#include <net/sock.h>
#include <net/ip.h>
@@ -830,7 +831,8 @@ static void cleanup_bearer(struct work_struct *work)
synchronize_net();
dst_cache_destroy(&ub->rcast.dst_cache);
- atomic_dec(&tn->wq_count);
+ if (atomic_dec_and_test(&tn->wq_count))
+ wake_up_var(&tn->wq_count);
kfree(ub);
}
--
2.55.0.rc0.799.gd6f94ed593-goog
* [PATCH v2 net 1/2] tipc: fix UAF in cleanup_bearer() due to premature dst_cache_destroy()
From: Eric Dumazet @ 2026-06-23 17:30 UTC
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Kuniyuki Iwashima, Xin Long, Jon Maloy,
tipc-discussion, netdev, eric.dumazet, Eric Dumazet,
syzbot+e14bc5d4942756023b77
In-Reply-To: <20260623173030.2925059-1-edumazet@google.com>
TIPC UDP media bearer teardown calls dst_cache_destroy() on its
replicast caches before calling synchronize_net() to wait for
concurrent RCU readers (transmitters) to finish:
	static void cleanup_bearer(struct work_struct *work)
	{
		...
		list_for_each_entry_safe(rcast, tmp, &ub->rcast.list, list) {
			dst_cache_destroy(&rcast->dst_cache);
			list_del_rcu(&rcast->list);
			kfree_rcu(rcast, rcu);
		}
		...
		dst_cache_destroy(&ub->rcast.dst_cache);
		udp_tunnel_sock_release(ub->sk);
		synchronize_net();
		...
	}
This is highly buggy because dst_cache_destroy() immediately frees the
per-CPU cache memory (free_percpu()) and releases the cached dst
entries without any synchronization.
If a concurrent transmitter (e.g., tipc_udp_xmit()) is running on another
CPU under RCU protection, it can call dst_cache_get() concurrently,
leading to:
1. Use-After-Free on the per-CPU cache pointer itself (crash).
2. "rcuref - imbalanced put()" warning if it attempts to release a
dst that was concurrently released by dst_cache_destroy().
Furthermore, calling kfree(ub) immediately after synchronize_net() without
closing the socket first (or waiting after closing it) leaves a window
where a concurrent receiver (tipc_udp_recv()) could start after
synchronize_net(), access ub, and suffer a UAF when kfree(ub) runs.
To fix this, we must defer dst_cache_destroy() and kfree(ub) until after
we have ensured that no more readers can see the bearer/socket and all
existing readers have finished:
1. Defer rcast entry destruction (both dst_cache_destroy() and kfree())
to an RCU callback using call_rcu_hurry().
Using call_rcu_hurry() ensures the dst entries are released quickly.
2. Release the bearer socket using udp_tunnel_sock_release() (stops
new receive readers).
3. Call synchronize_net() to wait for all outstanding RCU readers
(both transmit and receive) to finish.
4. Now that it is safe, call dst_cache_destroy() on the main bearer
cache, and free ub.
Note: 3) and 4) can be changed later in net-next to also use
call_rcu_hurry() and get rid of the synchronize_net() latency.
Fixes: e9c1a793210f ("tipc: add dst_cache support for udp media")
Reported-by: syzbot+e14bc5d4942756023b77@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/6a396a66.52ae72c2.136ac7.0003.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Xin Long <lucien.xin@gmail.com>
Cc: Jon Maloy <jmaloy@redhat.com>
Cc: tipc-discussion@lists.sourceforge.net
---
v2: addressed Xin Long feedback
v1: https://lore.kernel.org/netdev/CANn89i+dkbrSAwvaWXW7yWMfcwUebuTBLG5T7AGZaZcpVYGyfQ@mail.gmail.com/T/#m7bbeedffe3bedb69e33236410e3833c7ce809850
net/tipc/udp_media.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/net/tipc/udp_media.c b/net/tipc/udp_media.c
index 988b8a7f953ad6da860e6190f1f244650f121dce..66f3cb87a0aaaac8f40e8f237ab9a44d539b1cd8 100644
--- a/net/tipc/udp_media.c
+++ b/net/tipc/udp_media.c
@@ -803,6 +803,14 @@ static int tipc_udp_enable(struct net *net, struct tipc_bearer *b,
return err;
}
+static void rcast_free_rcu(struct rcu_head *rcu)
+{
+ struct udp_replicast *rcast = container_of(rcu, struct udp_replicast, rcu);
+
+ dst_cache_destroy(&rcast->dst_cache);
+ kfree(rcast);
+}
+
/* cleanup_bearer - break the socket/bearer association */
static void cleanup_bearer(struct work_struct *work)
{
@@ -811,18 +819,17 @@ static void cleanup_bearer(struct work_struct *work)
struct tipc_net *tn;
list_for_each_entry_safe(rcast, tmp, &ub->rcast.list, list) {
- dst_cache_destroy(&rcast->dst_cache);
list_del_rcu(&rcast->list);
- kfree_rcu(rcast, rcu);
+ call_rcu_hurry(&rcast->rcu, rcast_free_rcu);
}
tn = tipc_net(sock_net(ub->sk));
- dst_cache_destroy(&ub->rcast.dst_cache);
udp_tunnel_sock_release(ub->sk);
- /* Note: could use a call_rcu() to avoid another synchronize_net() */
synchronize_net();
+
+ dst_cache_destroy(&ub->rcast.dst_cache);
atomic_dec(&tn->wq_count);
kfree(ub);
}
--
2.55.0.rc0.799.gd6f94ed593-goog
* [PATCH v2 net 0/2] tipc: syzbot related fixes
From: Eric Dumazet @ 2026-06-23 17:30 UTC
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Kuniyuki Iwashima, Xin Long, Jon Maloy,
tipc-discussion, netdev, eric.dumazet, Eric Dumazet
First patch fixes a recent syzbot report.
Second patch is inspired by numerous syzbot soft lockup
reports with RTNL pressure.
Eric Dumazet (2):
tipc: fix UAF in cleanup_bearer() due to premature dst_cache_destroy()
tipc: avoid busy looping in tipc_exit_net()
net/tipc/core.c | 4 ++--
net/tipc/udp_media.c | 19 ++++++++++++++-----
2 files changed, 16 insertions(+), 7 deletions(-)
--
2.55.0.rc0.799.gd6f94ed593-goog
* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Andrei Vagin @ 2026-06-23 17:29 UTC
To: Askar Safin
Cc: akpm, alexander, axboe, bernd, brauner, criu, david, dhowells,
fuse-devel, hch, jack, joannelkoong, linux-api, linux-fsdevel,
linux-kernel, linux-mm, miklos, netdev, patches, pfalcato,
rostedt, torvalds, val, viro, willy
In-Reply-To: <20260623094211.1080873-1-safinaskar@gmail.com>
On Tue, Jun 23, 2026 at 2:42 AM Askar Safin <safinaskar@gmail.com> wrote:
>
> Andrei Vagin <avagin@gmail.com>:
> > Actually, this change introduces a performance and functional
> > regression for CRIU.
> >
> > Here is a brief overview of how CRIU currently dumps memory pages:
> >
> > CRIU injects a parasite code blob into the target process's address
> > space. The parasite invokes vmsplice() with the SPLICE_F_GIFT flag to
> > pin physical pages directly inside a pipe without copying them. The main
> > CRIU process then takes over from outside the target context, calling
> > splice() on the other end of the pipe to stream the data directly into
> > checkpoint image files or a remote network socket.
> >
> > I ran a simple test that creates an anonymous mapping and touches every
> > page within it:
> > Without this patch, CRIU takes 9 seconds to dump the test process.
> > With this patch, it takes 18 seconds...
> >
> > Plus, it obviously introduces some memory overhead.
> >
> > If these changes are merged, we will need to completely rework the
> > memory dumping mechanism in CRIU. Using vmsplice() in this proposed form
> > no longer makes any sense for our architecture...
>
> I have just read some docs for CRIU. I found this statement:
>
> > #### Why `splice` is Better:
> > * **Consistency via COW**: The `SPLICE_F_GIFT` flag ensures that if the process modifies a "gifted" page after resuming, the kernel performs a **Copy-on-Write (COW)**. The pipe buffer continues to hold the *original* version of the page as it existed at the moment of the `vmsplice()` call, ensuring a perfectly consistent snapshot of that page.
>
> This is wrong (with released kernels). I confirmed this by testing this on my current kernel (6.12.90).
>
> See the code at the end of this message.
>
> If you actually rely on mentioned consistency, then, it seems, CRIU is broken.
>
> So, in fact, my patch actually brings consistency to CRIU. :)
Askar, unfortunately, the statement about "Consistency via COW" is just
"AI imagination". The under-the-hood docs were recently transferred from
the criu.org wiki using some AI transformations, which introduced this
wrong statement. The original document can be found here:
https://criu.org/Memory_dumping_and_restoring.
To clarify, CRIU does not rely on page consistency for intermediate
pre-dumps. The pre-dump mechanism is designed to be iterative: During a
pre-dump, tasks are briefly frozen to vmsplice dirty pages into pipes.
Then the tasks are resumed, and CRIU drains the pipes. If the process
modifies a page after it has been spliced, the data in the pipe may
become inconsistent. However, any such modification is tracked by the
soft-dirty page tracker. In the next pre-dump iteration (or the final
dump), these modified pages will be identified as dirty again and
re-dumped. During restore, the images are applied sequentially, and the
final dump (taken while the process is fully frozen) ensures we
reconstruct a consistent final state.
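To make the argument above concrete, here is a toy model in C. It is only a sketch under stated assumptions and not CRIU code: one byte stands in for a page, a bool array stands in for the soft-dirty tracker, and all names (struct task, splice_dirty(), drain(), restore(), predump_scenario()) are hypothetical. It shows a pre-dump image picking up data that changed after the splice, and the final dump still yielding the correct state once the images are applied in order.

```c
#include <stdbool.h>
#include <string.h>

#define NPAGES 4

/* Toy task memory: one byte stands in for a whole page. */
struct task {
	unsigned char page[NPAGES];
	bool dirty[NPAGES];	/* models the per-page soft-dirty bit */
};

/* One dump image: which pages it covers and the data drained for them. */
struct image {
	bool present[NPAGES];
	unsigned char data[NPAGES];
};

/* The task writes a page; the tracker marks that page dirty. */
static void task_write(struct task *t, int idx, unsigned char val)
{
	t->page[idx] = val;
	t->dirty[idx] = true;
}

/* Frozen phase: queue the dirty pages (no copy) and reset the tracker. */
static void splice_dirty(struct task *t, struct image *img)
{
	for (int i = 0; i < NPAGES; i++) {
		img->present[i] = t->dirty[i];
		t->dirty[i] = false;
	}
}

/* Drain phase: copy out whatever the queued pages contain right now. */
static void drain(const struct task *t, struct image *img)
{
	for (int i = 0; i < NPAGES; i++)
		if (img->present[i])
			img->data[i] = t->page[i];
}

/* Restore: apply the images oldest first, so later images win. */
static void restore(const struct image *imgs, int n, unsigned char *out)
{
	memset(out, 0, NPAGES);
	for (int k = 0; k < n; k++)
		for (int i = 0; i < NPAGES; i++)
			if (imgs[k].present[i])
				out[i] = imgs[k].data[i];
}

/* Returns true if the restored memory matches the task's final memory. */
static bool predump_scenario(void)
{
	struct task t;
	struct image imgs[2];
	unsigned char out[NPAGES];

	memset(&t, 0, sizeof(t));
	memset(imgs, 0, sizeof(imgs));

	task_write(&t, 0, 1);
	task_write(&t, 1, 2);

	splice_dirty(&t, &imgs[0]);	/* pre-dump, task frozen */
	task_write(&t, 1, 9);		/* task resumed, races with drain */
	drain(&t, &imgs[0]);		/* image 0 now holds 9, not 2 */
	task_write(&t, 1, 7);
	task_write(&t, 2, 5);

	splice_dirty(&t, &imgs[1]);	/* final dump, task stays frozen */
	drain(&t, &imgs[1]);

	restore(imgs, 2, out);
	return memcmp(out, t.page, NPAGES) == 0;
}
```

In this model, page 1 in the first image is stale, but because the write re-marked it dirty, the final image overrides it during restore.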
But what really matters in this scheme is the vmsplice performance.
The proposed change significantly slows it down. In the case of CRIU,
vmsplice performance is critical because the target process is frozen
during these calls. Minimizing the freeze time is the primary goal of
pre-dumping to make migration almost invisible to the user workload.
Thanks,
Andrei
* Re: [PATCH net-next v3] vsock/virtio: rewrite MSG_ZEROCOPY flag handling
From: Michael S. Tsirkin @ 2026-06-23 17:26 UTC
To: Arseniy Krasnov
Cc: Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jason Wang,
Bobby Eshleman, Xuan Zhuo, Eugenio Pérez, Simon Horman, kvm,
virtualization, netdev, linux-kernel, oxffffaa, rulkc
In-Reply-To: <20260623153819.697635-1-avkrasnov@rulkc.org>
On Tue, Jun 23, 2026 at 06:38:19PM +0300, Arseniy Krasnov wrote:
> Logically it was based on TCP implementation, so to make further support
> easier, rewrite it in the TCP way (like in 'tcp_sendmsg_locked()'). This
> patch only rewrites flag handling (e.g. it doesn't change logic).
>
> Signed-off-by: Arseniy Krasnov <avkrasnov@rulkc.org>
It seems to change logic though:
> ---
> Changelog v1->v2:
> * Rebase on last 'net-next'. Don't need 'skb_zcopy_set()' now - it was
> already added.
> Changelog v2->v3:
> * Update commit message.
> * Remove one empty line.
>
> net/vmw_vsock/virtio_transport_common.c | 47 ++++++++++++-------------
> 1 file changed, 22 insertions(+), 25 deletions(-)
>
> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> index 09475007165b..41c2a0b82a8e 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -328,38 +328,35 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
> return pkt_len;
>
> - if (info->msg) {
> - /* If zerocopy is not enabled by 'setsockopt()', we behave as
> - * there is no MSG_ZEROCOPY flag set.
> + if (info->msg && (info->msg->msg_flags & MSG_ZEROCOPY)) {
> + /* If 'info->msg' is not NULL, this is only VIRTIO_VSOCK_OP_RW.
> + * 'MSG_ZEROCOPY' flag handling here is based on the same flag
> + * handling from 'tcp_sendmsg_locked()'.
> */
> - if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
> - info->msg->msg_flags &= ~MSG_ZEROCOPY;
So previously without SOCK_ZEROCOPY, MSG_ZEROCOPY was always ignored...
> + if (info->msg->msg_ubuf) {
> + uarg = info->msg->msg_ubuf;
> + can_zcopy = virtio_transport_can_zcopy(t_ops, info, pkt_len);
now it's not in this case?
Maybe the right call, but saying "does not change logic" seems wrong.
> + } else if (sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY)) {
> + uarg = msg_zerocopy_realloc(sk_vsock(vsk), pkt_len,
> + NULL, false);
> + if (!uarg) {
> + virtio_transport_put_credit(vvs, pkt_len);
> + return -ENOMEM;
> + }
>
> - if (info->msg->msg_flags & MSG_ZEROCOPY)
> can_zcopy = virtio_transport_can_zcopy(t_ops, info, pkt_len);
> + if (!can_zcopy)
> + uarg_to_msgzc(uarg)->zerocopy = 0;
>
> + have_uref = true;
> + }
> +
> + /* 'can_zcopy' means that this transmission will be
> + * in zerocopy way (e.g. using 'frags' array).
> + */
> if (can_zcopy)
> max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
> (MAX_SKB_FRAGS * PAGE_SIZE));
> -
> - if (info->msg->msg_flags & MSG_ZEROCOPY &&
> - info->op == VIRTIO_VSOCK_OP_RW) {
> - uarg = info->msg->msg_ubuf;
> -
> - if (!uarg) {
> - uarg = msg_zerocopy_realloc(sk_vsock(vsk),
> - pkt_len, NULL, false);
> - if (!uarg) {
> - virtio_transport_put_credit(vvs, pkt_len);
> - return -ENOMEM;
> - }
> -
> - if (!can_zcopy)
> - uarg_to_msgzc(uarg)->zerocopy = 0;
> -
> - have_uref = true;
> - }
> - }
> }
>
> rest_len = pkt_len;
> --
> 2.25.1
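The behaviour change questioned in the inline comments can be checked with a small truth-table model. This is a sketch derived only from the quoted hunk, assuming info->msg is non-NULL; the function names are hypothetical, and each function models just one thing: whether virtio_transport_can_zcopy() gets consulted for a given combination of SOCK_ZEROCOPY, MSG_ZEROCOPY and a caller-supplied msg_ubuf.

```c
#include <stdbool.h>

/* Before the patch: without SOCK_ZEROCOPY the MSG_ZEROCOPY flag is cleared
 * first, so msg_ubuf never matters for this decision.
 */
static bool old_checks_zcopy(bool sock_zerocopy, bool msg_zerocopy,
			     bool has_ubuf)
{
	(void)has_ubuf;
	return sock_zerocopy && msg_zerocopy;
}

/* After the patch: a caller-supplied msg_ubuf is honoured even when
 * SOCK_ZEROCOPY is not set on the socket.
 */
static bool new_checks_zcopy(bool sock_zerocopy, bool msg_zerocopy,
			     bool has_ubuf)
{
	return msg_zerocopy && (has_ubuf || sock_zerocopy);
}

/* Count the input combinations on which old and new code disagree. */
static int count_differences(void)
{
	int diffs = 0;

	for (int m = 0; m < 8; m++) {
		bool s = m & 1, z = m & 2, u = m & 4;

		if (old_checks_zcopy(s, z, u) != new_checks_zcopy(s, z, u))
			diffs++;
	}
	return diffs;
}
```

Under these assumptions the two versions differ in exactly one case: MSG_ZEROCOPY set, msg_ubuf present, SOCK_ZEROCOPY not set.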
* Re: Ethtool : PRBS feature
From: Lee Trager @ 2026-06-23 17:10 UTC
To: Andrew Lunn, Das, Shubham
Cc: Maxime Chevallier, Alexander H Duyck, netdev@vger.kernel.org,
mkubecek@suse.cz, D H, Siddaraju, Chintalapalle, Balaji,
Lindberg, Magnus, niklas.damberg@ericsson.com
In-Reply-To: <5f22c491-b816-421e-a531-bf87a07fea70@lunn.ch>
On 6/23/26 2:43 AM, Andrew Lunn wrote:
> Taking a quick look at this:
>
> You are missing a way to enumerate what test patterns the hardware
> supports. There is more than prbs7. You want to be able to report the
> contents of C45 1.1500, and other similar registers.
Not only is there more than PRBS7, there is also PRBS 8/10 encoding, which
is an option on any test. There may be other options; that was the only one
fbnic supported. I agree there does need to be a user interface which
displays supported tests and options.
> To avoid race conditions, maybe some of these commands need combining.
> ethtool --phy-test eth1 tx-prbs prbs7 rx-prbs prbs7 bert start
>
> The configuration is then atomic, with respect to the uAPI, so we
> don't get two users configuring it at the same time, ending up with a
> messed up configuration.
Testing consumes the link, so you really don't want anything done to the
netdev while testing is running. fbnic does the following:
1. Testing cannot start when the link is up
2. Once testing starts the driver removes the netdev to prevent use. The
netdev is only added back when testing stops. The upstream solution will
need something that can keep the netdev but lock everything down while
testing is running.
3. Once testing starts you cannot change the test, even on an individual
lane basis. You must stop testing first.
>
> Traditionally, Unix does not offer a way to clear statistic counters
> back to zero. So i'm not sure about clear-stats. We also need to think
> about hardware which does not support that. And there is locking
> issues, can the stats be cleared while a test is active?
fbnic actually has separate registers for PRBS test results. Results do
need to be cleared between runs, but I never created an explicit clear
interface. Firmware automatically reset the registers when a new test
was started. This also allows results to be viewed after testing has
stopped.
Reading results was a little tricky due to rollover between two 32-bit
registers. I was able to read results while testing was running without
pausing. Technically I could clear results while testing was running, but
I never saw a need to.
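For readers unfamiliar with the rollover problem mentioned above, here is a sketch of the common high/low/high re-read technique for a 64-bit counter split across two 32-bit registers. It is not taken from fbnic and makes no claim about how that driver does it; reg_read_fn, read_counter64() and the simulated counter are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Hypothetical register accessor type; a real driver would use its own
 * MMIO read helpers instead.
 */
typedef uint32_t (*reg_read_fn)(void *ctx);

/* Read high, then low, then high again. If the high half changed, the low
 * half rolled over in between, so retry with the new high value.
 */
static uint64_t read_counter64(reg_read_fn read_hi, reg_read_fn read_lo,
			       void *ctx)
{
	uint32_t hi, lo, hi2;

	hi = read_hi(ctx);
	for (;;) {
		lo = read_lo(ctx);
		hi2 = read_hi(ctx);
		if (hi2 == hi)
			break;
		hi = hi2;
	}
	return ((uint64_t)hi << 32) | lo;
}

/* Simulated free-running counter for illustration only: it advances by
 * 'step' every time the low half is read.
 */
struct sim_counter {
	uint64_t value;
	uint64_t step;
};

static uint32_t sim_read_hi(void *ctx)
{
	struct sim_counter *c = ctx;

	return (uint32_t)(c->value >> 32);
}

static uint32_t sim_read_lo(void *ctx)
{
	struct sim_counter *c = ctx;
	uint32_t lo = (uint32_t)c->value;

	c->value += c->step;
	return lo;
}

/* Read the simulated counter once, starting from 'start'. */
static uint64_t demo_read(uint64_t start)
{
	struct sim_counter c = { start, 1 };

	return read_counter64(sim_read_hi, sim_read_lo, &c);
}
```

Starting the simulated counter at 0xFFFFFFFF forces a rollover during the first pass; the retry then returns a value the counter actually held instead of mixing an old low half with a new high half.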
>
> You need to think about the units for inject errors. There is no
> floating point support. Also, is this corrupt packets? Or single bit
> flips in the stream? It needs to be well defined what it actually
> means. The driver can then convert it to whatever the hardware
> supports. How does 802.3 specify this?
>
> Also, 802.3 defines PRBS7 as a benign pattern. With a quick look, I
> did not find a definition of benign, but injecting errors does not
> seem benign to me.
>
> I'm assuming when 'start' is used, the networking core will change the
> interface status to IF_OPER_TESTING. It is not always obvious why an
> interface is in testing mode, rather than IF_OPER_UP. Cable testing
> could also be running, etc. So maybe there needs to be a way to report
> why it is in IF_OPER_TESTING?
>
> I also wonder if a timeout should be used with start, so that it will
> return to IF_OPER_UP after a time period?
When I spoke to hardware engineers at Meta they did not want a timeout.
Testing often occurred over days, so they wanted to be able to start it
and explicitly stop it. I'm not against a timeout but I do think it
should be optional.
Since PRBS testing is handled by firmware, one safety measure I added is
that if firmware lost contact with the host, testing was automatically
stopped and TX FIR values were reset to factory. This ensures that the NIC
won't get stuck in testing, and on initialization the driver doesn't have
to worry about testing state.
Lee
* Re: [PATCH net v2 2/2] net: stmmac: dwmac-spacemit: Fix wrong irq definition
From: Maxime Chevallier @ 2026-06-23 16:54 UTC (permalink / raw)
To: Inochi Amaoto, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Maxime Coquelin, Alexandre Torgue,
Yixun Lan, Russell King (Oracle)
Cc: netdev, linux-stm32, linux-arm-kernel, linux-riscv, spacemit,
linux-kernel, Yixun Lan, Longbin Li
In-Reply-To: <20260623074637.503864-3-inochiama@gmail.com>
Hi,
On 6/23/26 09:46, Inochi Amaoto wrote:
> The current irq definition of the wake irq and the lpi irq
> is wrong, replace them with the right number and name.
>
> Fixes: 30f0ba420ed3 ("net: stmmac: Add glue layer for Spacemit K3 SoC")
> Signed-off-by: Inochi Amaoto <inochiama@gmail.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Maxime
> ---
> drivers/net/ethernet/stmicro/stmmac/dwmac-spacemit.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-spacemit.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-spacemit.c
> index 3bfb6d49be6c..322bdf167a4a 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac-spacemit.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-spacemit.c
> @@ -22,8 +22,8 @@
> #define CTRL_PHY_INTF_RMII FIELD_PREP(CTRL_PHY_INTF_MODE, 0)
> #define CTRL_PHY_INTF_RGMII FIELD_PREP(CTRL_PHY_INTF_MODE, 1)
> #define CTRL_PHY_INTF_MII FIELD_PREP(CTRL_PHY_INTF_MODE, 3)
> -#define CTRL_WAKE_IRQ_EN BIT(9)
> -#define CTRL_PHY_IRQ_EN BIT(12)
> +#define CTRL_LPI_IRQ_EN BIT(9)
> +#define CTRL_WAKE_IRQ_EN BIT(12)
>
> /* dline register bits */
> #define RGMII_RX_DLINE_EN BIT(0)
* Re: [PATCH net v2 1/2] net: stmmac: dwmac-spacemit: Fix wrong phy interface definition
From: Maxime Chevallier @ 2026-06-23 16:53 UTC (permalink / raw)
To: Inochi Amaoto, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Maxime Coquelin, Alexandre Torgue,
Yixun Lan, Russell King (Oracle)
Cc: netdev, linux-stm32, linux-arm-kernel, linux-riscv, spacemit,
linux-kernel, Yixun Lan, Longbin Li
In-Reply-To: <20260623074637.503864-2-inochiama@gmail.com>
Hello,
On 6/23/26 09:46, Inochi Amaoto wrote:
> The current MII interface register definition from the vendor is wrong,
> use the right number for the macro. Also, correct the interface mask
> in spacemit_set_phy_intf_sel() so it can update the register with the
> right number
>
> Fixes: 30f0ba420ed3 ("net: stmmac: Add glue layer for Spacemit K3 SoC")
> Signed-off-by: Inochi Amaoto <inochiama@gmail.com>
> ---
> drivers/net/ethernet/stmicro/stmmac/dwmac-spacemit.c | 9 ++++++---
> 1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-spacemit.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-spacemit.c
> index 223754cc5c79..3bfb6d49be6c 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac-spacemit.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-spacemit.c
> @@ -18,8 +18,10 @@
> #include "stmmac_platform.h"
>
> /* ctrl register bits */
> -#define CTRL_PHY_INTF_RGMII BIT(3)
> -#define CTRL_PHY_INTF_MII BIT(4)
> +#define CTRL_PHY_INTF_MODE GENMASK(4, 3)
> +#define CTRL_PHY_INTF_RMII FIELD_PREP(CTRL_PHY_INTF_MODE, 0)
> +#define CTRL_PHY_INTF_RGMII FIELD_PREP(CTRL_PHY_INTF_MODE, 1)
> +#define CTRL_PHY_INTF_MII FIELD_PREP(CTRL_PHY_INTF_MODE, 3)
> #define CTRL_WAKE_IRQ_EN BIT(9)
> #define CTRL_PHY_IRQ_EN BIT(12)
>
> @@ -118,7 +120,7 @@ static void spacemit_get_interfaces(struct stmmac_priv *priv, void *bsp_priv,
>
> static int spacemit_set_phy_intf_sel(void *bsp_priv, u8 phy_intf_sel)
> {
> - unsigned int mask = CTRL_PHY_INTF_MII | CTRL_PHY_INTF_RGMII;
> + unsigned int mask = CTRL_PHY_INTF_MODE;
> struct spacmit_dwmac *dwmac = bsp_priv;
> unsigned int val = 0;
>
> @@ -128,6 +130,7 @@ static int spacemit_set_phy_intf_sel(void *bsp_priv, u8 phy_intf_sel)
> break;
>
> case PHY_INTF_SEL_RMII:
> + val = CTRL_PHY_INTF_RMII;
This isn't strictly speaking necessary, as this is 0 and val is already 0. Maybe
compilers can figure it out, and this leaves us with more self-documenting code.
So I'm ok with that personally,
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Maxime
> break;
>
> case PHY_INTF_SEL_RGMII:
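On the RMII remark above, a small userspace sketch can make the field values concrete. The SKETCH_* macros are simplified stand-ins written for this example, not the kernel's GENMASK() and FIELD_PREP() (which add compile-time checks), and phy_intf_bits() and update_field() are hypothetical helpers rather than code from dwmac-spacemit.c.

```c
#include <stdint.h>

/* Simplified stand-ins for the kernel's GENMASK() and FIELD_PREP(). */
#define SKETCH_GENMASK(h, l) \
	((~0u >> (31 - (h))) & (~0u << (l)))
/* (mask & (~mask + 1)) isolates the lowest set bit of the mask, so the
 * multiplication shifts val into the field position.
 */
#define SKETCH_FIELD_PREP(mask, val) \
	((((uint32_t)(val)) * ((mask) & (~(mask) + 1u))) & (mask))

#define PHY_INTF_MODE	SKETCH_GENMASK(4, 3)	/* bits 4:3, i.e. 0x18 */

/* Encode a 2-bit interface mode into register position. */
static uint32_t phy_intf_bits(uint32_t mode)
{
	return SKETCH_FIELD_PREP(PHY_INTF_MODE, mode);
}

/* Generic read-modify-write of one field: clear the field, then set it. */
static uint32_t update_field(uint32_t reg, uint32_t mask, uint32_t val)
{
	return (reg & ~mask) | (val & mask);
}
```

With these definitions, mode 0 (RMII) encodes to 0, mode 1 (RGMII) to 0x8 and mode 3 (MII) to 0x18, so the explicit RMII assignment changes no bits and serves only as documentation.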
* Re: [PATCH] crypto: af_alg - Document the deprecation of AF_ALG
From: Eric Biggers @ 2026-06-23 16:49 UTC (permalink / raw)
To: Bastien Nocera
Cc: linux-crypto, Herbert Xu, Marcel Holtmann, Luiz Augusto von Dentz,
linux-doc, linux-api, linux-kernel, netdev, Linus Torvalds,
linux-bluetooth, ell
In-Reply-To: <7d08a6df54279e9915f5df6bd4e5e5dde52b4fe1.camel@hadess.net>
On Tue, Jun 23, 2026 at 02:44:28PM +0200, Bastien Nocera wrote:
> Hey,
>
> Replying to this older patch.
>
> On Wed, 2026-04-29 at 18:15 -0700, Eric Biggers wrote:
> <snip>
> > This isn't intended to change anything overnight. After all, most Linux
> > distros won't be able to disable the kconfig options quite yet, mainly
> > because of iwd. But this should create a bit more impetus for these
> > userspace programs to be fixed, and the documentation update should also
> > help prevent more users from appearing.
>
> There are 2 other users that I know of: bluez, and the ell library
> (used by iwd and bluez).
>
> From what I could tell, bluetoothd uses AF_ALG for cryptography:
> https://git.kernel.org/pub/scm/bluetooth/bluez.git/tree/src/shared/crypto.c
> https://git.kernel.org/pub/scm/bluetooth/bluez.git/tree/tools/mesh-gatt/crypto.c
>
> It uses "ecb(aes)" and "cmac(aes)" as algorithms.
>
> Finally, it also uses them both again:
> https://git.kernel.org/pub/scm/bluetooth/bluez.git/tree/mesh/crypto.c
> through ell:
> https://git.kernel.org/pub/scm/libs/ell/ell.git/tree/ell/cipher.c
>
> Because that's a question that also came up, bluetoothd also uses the
> CAP_NET_ADMIN capability.
>
> I'll let Luiz and Marcel take it over from here.
>
We're aware of that and are taking it into account in the allowlist:
https://lore.kernel.org/linux-crypto/20260622234803.6982-1-ebiggers@kernel.org/
If you have any feedback on the allowlist, please respond to that patch.
- Eric
* Re: [RFC net-next 08/15] ipxlat: add translation engine and dispatch core
From: Ralf Lici @ 2026-06-23 16:36 UTC (permalink / raw)
To: Toke Høiland-Jørgensen
Cc: netdev, Daniel Gröber, Antonio Quartulli, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
linux-kernel, Pablo Neira Ayuso, Florian Westphal, Phil Sutter,
Beniamino Galvani
In-Reply-To: <87ik7aej6f.fsf@toke.dk>
On Mon, 22 Jun 2026 16:36:24 +0200, Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> >> > My second concern is that the SIIT boundary would be a property of
> >> > rule and hook placement. That gives flexibility, but it also means the
> >> > translation point has to be constrained and documented very carefully
> >> > to avoid ambiguous TTL/Hop Limit, PMTU/ICMP, and hook-order behavior.
> >> > For this use case I would rather have the route that matches the
> >> > translation prefix also be the object that says: leave this family
> >> > here and continue in the other one.
> >>
> >> Yeah, with flexibility comes the ability to shoot yourself in the foot.
> >> But that's not really different from much of the other functionality we
> >> have in the kernel today, is it? For netfilter in particular it's
> >> certainly possible to configure a broken NAT configuration that leads to
> >> packet drops (or just invalid packets being sent out on a network
> >> device).
> >>
> >
> > True, misconfiguration is always possible and that alone is not an
> > argument against the netfilter model. But what do we actually gain in
> > capability from that flexibility? I agree on the UX argument (an admin
> > would look in nft first), but in terms of what the feature can do, I
> > can't yet see what the nft model unlocks. More on this just below.
> >
> >> > After looking at the available kernel mechanisms again, I think the
> >> > better model is probably LWT: routes carry an ipxlat encap referencing a
> >> > named translator domain configured over netlink. That should represent
> >> > the stateless, prefix-based and symmetric nature of ipxlat.
> >>
> >> I think this description actually hits the nail on the head: What are we
> >> implementing here? Is it a product feature, or a building block for one?
> >> The properties you mention wrt consistency, symmetry etc are properties
> >> of the high-level feature (which is also generally the level things are
> >> specified in RFCs). Whereas other packet mangling features in the kernel
> >> are more in the "building block" category, where it's possible to
> >> configure things to implement a particular feature set / compliance with
> >> a particular RFC, but it's also possible to do things that are outside
> >> of that.
> >>
> >> I think this relates to the "mechanism, not policy" approach that we
> >> take to most things in the kernel: implement the building blocks to do
> >> something in the most general way we can, and then leave it up to
> >> userspace to configure things in a way that results in a consistent
> >> high-level system behaviour.
> >>
> >
> > That's a good point, and I agree that we should not bake a high-level
> > product policy into the kernel if what we need is a reusable mechanism
> > (the LWT idea was my attempt at exactly that). What I am still trying to
> > understand is whether there is a useful generic trigger for stateless
> > cross-family translation beyond the route/prefix/policy-routing cases.
> >
> > Routes and policy routing already cover the selectors I can make
> > coherent for a stateless, per-packet translator: destination/source
> > prefix, iif/oif/VRF, mark, TOS/DSCP, and so on. nft can of course match
> > much more than that, but the additional selectors that would materially
> > change the translation decision seem to be selectors such as L4 fields,
> > payload state, or conntrack state. Those are exactly the selectors I am
> > struggling to make correct for a stateless translator:
> >
> > - non-first fragments carry no L4 header at all, yet the translator must
> > rewrite every fragment (an nft ... tcp dport trigger cannot fire on
> > them);
> >
> > - ICMP errors must be translated too, but the flow identity lives in the
> > quoted inner header (reversed), not in anything an L4/ct match on the
> > error packet can see and there is no conntrack to associate them,
> > since this is stateless.
>
> True in principle, but if (say) you deploy this on a network that is
> configured so it will never fragment packets, this won't be an issue in
> practice.
>
> I.e., you're quite right that arbitrary matching criteria cannot be
> guaranteed to result in coherent translation. But I think that goes into
> the "use it wrong, get wrong results" bin. E.g., if you match on
> something that results in only a subset of the packets of a flow being
> translated, well, only that subset of the packets will make it to the
> destination. The SIIT translator itself should not try to fix this, but
> neither should it prevent it; that's what I mean by "building block" -
> it's up to the builder using the blocks to make sure the building
> doesn't collapse, that's out of scope for the block manufacturer to
> worry about :)
>
I agree with that framing. The translation core should not try to prove
that the surrounding policy describes a coherent SIIT deployment.
> > So an L4-conditional trigger does not look like a good primitive for
> > correct stateless SIIT unless the action also defragments/refragments or
> > uses conntrack-like state. Those may be valid mechanisms, but they move
> > the design away from the stateless per-packet SIIT boundary this RFC is
> > trying to model.
> >
> > So my first question is: is there a useful nft configuration this should
> > enable that is not naturally expressible as route selection, while still
> > remaining stateless SIIT rather than a NAT64-like stateful feature?
> > Maybe there is a real use case there, but I cannot construct one yet.
>
> So the poster child for "match on arbitrary criteria" is of course BPF.
> You can write BPF programs that match on arbitrary parts of the packet
> header, custom encapsulation headers, or even on out of band things like
> system state, phase of the moon, or what have you. And we should
> certainly allow a BPF program to make the decision on whether to perform
> the SIIT translation.
>
> Which... maybe is an argument to keep it as a device like you do in this
> RFC series? Redirecting to a device is trivially supported from TC-BPF,
> which also makes it possible to use the translation mechanism without
> going through the routing subsystem at all, saving a bit of overhead.
> Whereas making it a route action ties it very closely to the routing
> subsystem.
>
> WDYT?
>
I see the netdevice appeal for this, especially as a BPF redirect
target. But as we discussed earlier, the device model has some real
problems: the device selected by the first route is not the real
post-translation egress, so the model ends up doing translation and
reinjection rather than normal transmission. Concretely:
- it needs synthetic routing state purely to get things like MTU for
fragmentation, because the real post-translation nexthop is not known
at translation time;
- TTL/Hop Limit handling gets harder to reason about because the packet
has effectively gone through two routing decisions;
- rx/tx stats can't be made meaningful for a direction-agnostic device
whose ndo_start_xmit is really "translate and receive";
- and the setup is not very obvious: create an interface, route packets
to it, then have them come back translated.
None of these is fatal on its own, but together they make me think the
abstraction does not quite fit.
On the BPF point specifically: I agree a BPF program should be able to
decide whether to translate. What I am less sure about is whether
redirecting to a netdevice is the best way to expose that. A TC action
(yet another model, I know :)) gives you the same thing in-pipeline and
more directly:
tc filter add dev wwan0 egress \
bpf obj match.o action ipxlat4to6 domain clat0
Let BPF make the policy decision, with the native action doing the
translation work that the current BPF CLAT implementations have trouble
with: fragmentation, checksum corner cases, and ICMP error inner
headers (as explained by Beniamino).
So TC clsact looks like the natural in-kernel replacement for today's
TC-BPF CLAT programs: no extra netdev, you attach to the existing
uplink, direction is explicit, and on egress you sit on the real route
dst, so the synthetic-dst and double-routing problems above just don't
arise. The cost is more moving parts than a single bpf_redirect since
userspace has to manage clsact, filters, priorities and action
lifecycle/cleanup.
For a gateway translator, though, I still think a device-bound model is
less natural. There the translation point is more like a forwarding
decision across routes and nexthops, so a route/LWT attachment, or
possibly a netfilter attachment, seems easier to reason about. Also, as
you already pointed out while discussing LWT, an admin setting up NAT64
is more likely to reach for an nft rule than for a clsact filter on a
specific device.
Taking a step back, ipxlat is really a generic translation engine plus a
thin harness around it. So rather than pick one attachment, it might be
worth structuring the engine so different harnesses can drive it.
There's interesting precedent for this shape:
- ILA, again, is the closest sibling: stateless IPv6 address translation
with a shared core in ila_common.c, driven both by an LWT frontend in
ila_lwt.c and by an inline netfilter hook with a netlink-configured
mapping table in ila_xlat.c.
- act_ct is the precedent for the TC side specifically: a TC action that
reuses the netfilter conntrack engine rather than reimplementing it.
And act_nat is the cautionary counter-example: a standalone TC
reimplementation of stateless NAT that shares no code with nf_nat, and
carries a "would be nice to share code" comment :)
So I am wondering whether the right direction is to factor the
translation engine cleanly, land it with one harness first, and keep the
other attachment points as follow-up work once the core semantics are
settled.
Does that direction seem reasonable to you?
--
Ralf Lici
Mandelbit Srl
* Re: [PATCH net-next v3 0/2] net: phy: sfp/mdio-i2c: defer RollBall probe + fix mii_bus leak
From: Maxime Chevallier @ 2026-06-23 16:34 UTC (permalink / raw)
To: Petr Wozniak, linux, andrew, hkallweit1
Cc: kuba, davem, edumazet, pabeni, netdev, linux-kernel, linux-phy,
bjorn, olek2, kabel
In-Reply-To: <20260623080538.7646-1-petr.wozniak@gmail.com>
Hi Petr,
On 6/23/26 10:05, Petr Wozniak wrote:
> This series resends the RollBall bridge probe deferral (a fix for the
> regression in commit 8fe125892f40) and adds a related mii_bus leak fix.
These are bugfixes, you need to target the 'net' tree as explained here:
https://docs.kernel.org/process/maintainer-netdev.html
Thanks :)
Maxime
>
> Patch 1 fixes a pre-existing mii_bus leak in sfp_i2c_mdiobus_destroy()
> that has been present since the helper was introduced in 2022. Patch 2's
> new -ENODEV path destroys the MDIO bus via sfp_i2c_mdiobus_destroy(), so
> patch 1 is a prerequisite to avoid leaking the bus on that path.
>
> The v2 deferral patch was corrupted in transit and failed to apply; it is
> regenerated here against current net-next with no functional change.
>
> v3:
> - Resend: v2 defer patch was corrupted in transit and failed to apply
> (netdev/apply); regenerated against current net-next.
> - Fixed block comment style flagged by checkpatch. No functional change.
> - Added patch 1/2 (sfp: free mii_bus in sfp_i2c_mdiobus_destroy).
> v2 (defer):
> - Generalized scope: regression affects boot-inserted and hotplugged
> modules where bridge init exceeds 200 ms; Aleksander Bajkowski
> confirmed FLYPRO SFP-10GT-CS-30M / AQR113C broken when hotplugged.
> - Corrected state machine description (probe runs in SFP_S_INIT after
> SFP_S_WAIT) - Jan Hoffmann.
> - No code changes from v1.
> v1: initial submission.
>
> Petr Wozniak (2):
> net: phy: sfp: free mii_bus in sfp_i2c_mdiobus_destroy
> net: phy: mdio-i2c: defer RollBall bridge probe to PHY discovery
>
> drivers/net/mdio/mdio-i2c.c | 15 +++++++++------
> drivers/net/phy/sfp.c | 23 ++++++++++++++---------
> include/linux/mdio/mdio-i2c.h | 1 +
> 3 files changed, 24 insertions(+), 15 deletions(-)
>
>
> base-commit: b85966adbf5de0668a815c6e3527f87e0c387fb4
* Re: [PATCH iwl-net v2 2/2] ice: preserve uplink DFLT Rx rule on switchdev release
From: Simon Horman @ 2026-06-23 16:25 UTC (permalink / raw)
To: poros; +Cc: Simon Horman, netdev, aleksandr.loktionov
In-Reply-To: <20260622113428.2565255-3-poros@redhat.com>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
ice: preserve uplink DFLT Rx rule on switchdev release
This commit preserves the default Rx rule on the uplink VSI when releasing
the switchdev environment if the interface is in promiscuous mode. This
prevents the catch-all rule from being removed while the netdev flags still
request promiscuity.
> diff --git a/drivers/net/ethernet/intel/ice/ice_eswitch.c b/drivers/net/ethernet/intel/ice/ice_eswitch.c
> index 2e4f0969035f7..48273ef9f69dc 100644
> --- a/drivers/net/ethernet/intel/ice/ice_eswitch.c
> +++ b/drivers/net/ethernet/intel/ice/ice_eswitch.c
[ ... ]
> @@ -278,8 +280,16 @@ static void ice_eswitch_release_env(struct ice_pf *pf)
> vlan_ops->ena_rx_filtering(uplink_vsi);
[Severity: High]
Could this lockless read of netdev->flags cause a regression due to a TOCTOU
race?
The ena_rx_filtering() call eventually reaches ice_cfg_vlan_pruning(),
which reads uplink_vsi->netdev->flags & IFF_PROMISC locklessly to decide
whether to enable VLAN pruning.
> ice_cfg_dflt_vsi(uplink_vsi->port_info, uplink_vsi->idx, false,
> ICE_FLTR_TX);
> - ice_cfg_dflt_vsi(uplink_vsi->port_info, uplink_vsi->idx, false,
> - ICE_FLTR_RX);
> +
> + /* Keep the Rx DFLT rule if the uplink is promiscuous; it must outlive
> + * the session. Test the live netdev->flags, the same value
> + * ena_rx_filtering() -> ice_cfg_vlan_pruning() above keys its decision
> + * on, so the preserved DFLT rule and the pruning state stay consistent.
> + */
> + if (!(uplink_vsi->netdev->flags & IFF_PROMISC))
... and then the exact same flags are read locklessly again right here.
Because setup/teardown paths are protected by devl_lock rather than
rtnl_lock, could a concurrent dev_change_flags() holding rtnl_lock toggle
IFF_PROMISC between the two checks?
If IFF_PROMISC changes from 0 to 1 in this window, the code would incorrectly
enable VLAN pruning (seeing 0) and then incorrectly retain the DFLT_RX rule
(seeing 1), leaving the hardware with VLAN pruning enabled during promiscuous
mode.
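One way to picture the concern above is to derive both decisions from a single snapshot of the flags, so they cannot disagree. The sketch below is only an illustration in plain C, not a proposed ice change: struct teardown_plan and plan_from_snapshot() are hypothetical, SKETCH_IFF_PROMISC mirrors the IFF_PROMISC bit value, and in the driver ice_cfg_vlan_pruning() reads the flags itself, so applying this idea would need restructuring plus appropriate locking or READ_ONCE().

```c
#include <stdbool.h>

#define SKETCH_IFF_PROMISC 0x100u	/* same bit value as IFF_PROMISC */

/* Both teardown decisions, computed from one read of the flags. */
struct teardown_plan {
	bool enable_vlan_pruning;
	bool remove_dflt_rx_rule;
};

/* The caller reads the flags once and passes the snapshot in, so a
 * concurrent flag change can no longer land between the two decisions.
 */
static struct teardown_plan plan_from_snapshot(unsigned int flags_snapshot)
{
	bool promisc = (flags_snapshot & SKETCH_IFF_PROMISC) != 0;
	struct teardown_plan plan = {
		.enable_vlan_pruning = !promisc,
		.remove_dflt_rx_rule = !promisc,
	};

	return plan;
}
```

A snapshot only keeps the two decisions consistent with each other; whether they still match the live flags afterwards depends on the locking question raised in the review.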
[Severity: High]
Does preserving the Rx DFLT rule here cause a regression by permanently
breaking multicast promiscuity configuration if IFF_PROMISC was set?
During the switchdev session, ice_remove_vsi_fltr() removes the
ICE_SW_LKUP_PROMISC rules. By manually preserving the DFLT rule here, any
future synchronization via ice_vsi_sync_fltr() will evaluate
!ice_is_dflt_vsi_in_use(vsi->port_info) as false:
ice_main.c:ice_vsi_sync_fltr() {
...
if (!ice_is_dflt_vsi_in_use(vsi->port_info)) {
err = ice_set_dflt_vsi(vsi);
...
err = ice_set_promisc(vsi, ICE_MCAST_PROMISC_BITS);
}
...
}
Because it is skipped, ICE_MCAST_PROMISC_BITS is never configured, leaving
the VSI in unicast promiscuous mode but silently dropping multicast traffic
destined for other MAC addresses.
> + ice_cfg_dflt_vsi(uplink_vsi->port_info, uplink_vsi->idx,
> + false, ICE_FLTR_RX);
> +
> ice_fltr_add_mac_and_broadcast(uplink_vsi,
> uplink_vsi->port_info->mac.perm_addr,
> ICE_FWD_TO_VSI);
[Severity: High]
This is a pre-existing issue, but does this teardown function fail to
synchronize hardware state for previously joined multicast groups and
secondary unicast addresses?
When entering switchdev mode, __dev_uc_unsync() and __dev_mc_unsync() clear
the synchronization state of OS-level MAC addresses, and
ice_remove_vsi_fltr() removes the actual hardware filters. While switchdev
runs, calls to ndo_set_rx_mode return early and do not update hardware.
When returning to legacy mode, this function restores only the primary MAC
and broadcast addresses, but doesn't seem to invoke netif_set_rx_mode() or
equivalent to resynchronize the deferred Rx filters from the netdev to the
hardware. Could this cause the hardware to silently drop packets for
previously joined groups until the interface is bounced?
* Re: [PATCH v7 01/15] arm64: dts: qcom: kodiak: Add EL2 overlay
From: Mukesh Ojha @ 2026-06-23 16:31 UTC (permalink / raw)
To: Sumit Garg
Cc: andersson, linux-arm-msm, devicetree, dri-devel, freedreno,
linux-media, netdev, linux-wireless, ath12k, linux-remoteproc,
konradybcio, robh, krzk+dt, conor+dt, robin.clark, sean, akhilpo,
lumag, abhinav.kumar, jesszhan0024, marijn.suijten, airlied,
simona, vikash.garodia, dikshita.agarwal, bod, mchehab, elder,
andrew+netdev, davem, edumazet, kuba, pabeni, jjohnson,
mathieu.poirier, trilokkumar.soni, pavan.kondeti, jorge.ramirez,
tonyh, vignesh.viswanathan, srinivas.kandagatla, amirreza.zarrabi,
jens.wiklander, op-tee, apurupa, skare, linux-kernel, Sumit Garg
In-Reply-To: <20260522115936.201208-2-sumit.garg@kernel.org>
On Fri, May 22, 2026 at 05:29:22PM +0530, Sumit Garg wrote:
> From: Mukesh Ojha <mukesh.ojha@oss.qualcomm.com>
>
> All the existing variants of Kodiak boards are using Gunyah hypervisor
> which means that, so far, Linux-based OS could only boot in EL1 on those
> devices. However, it is possible for us to boot Linux at EL2 on these
> devices [1].
>
> When running under Gunyah, the remote processor firmware IOMMU
> streams are controlled by Gunyah. However, without Gunyah, the IOMMU is
> managed by the consumer of this DeviceTree. Therefore, describe the
> firmware streams for each remote processor.
>
> Add an EL2-specific DT overlay and apply it to Kodiak IOT variant
> devices to create -el2.dtb for each of them alongside "normal" dtb.
>
> Note that modem and media subsystems haven't been supported yet due
> to missing dependencies. For GPU to work, zap shader is disabled and
> in EL2 mode the kernel owns hardware watchdog which is enabled here.
>
> [1]
> https://docs.qualcomm.com/bundle/publicresource/topics/80-70020-4/boot-developer-touchpoints.html#uefi
>
> Signed-off-by: Mukesh Ojha <mukesh.ojha@oss.qualcomm.com>
> [SG: watchdog and modem fixup]
> Signed-off-by: Sumit Garg <sumit.garg@oss.qualcomm.com>
As discussed internally, I will be taking this patch separately and you
can drop this from series.
--
-Mukesh Ojha
* Re: [PATCH net-next v3 2/2] net: phy: mdio-i2c: defer RollBall bridge probe to PHY discovery
From: Maxime Chevallier @ 2026-06-23 16:28 UTC (permalink / raw)
To: Petr Wozniak, linux, andrew, hkallweit1
Cc: kuba, davem, edumazet, pabeni, netdev, linux-kernel, linux-phy,
bjorn, olek2, kabel
In-Reply-To: <20260623080538.7646-3-petr.wozniak@gmail.com>
Hi Petr,
On 6/23/26 10:05, Petr Wozniak wrote:
> commit 8fe125892f40 ("net: phy: sfp: probe for RollBall I2C-to-MDIO
> bridge in mdio-i2c") introduced a regression: the RollBall I2C-to-MDIO
> bridge is not yet ready to respond to CMD_READ/CMD_DONE cycles when
> sfp_sm_add_mdio_bus() runs in SFP_S_INIT. The 200 ms probe times out,
> i2c_mii_probe_rollball() returns -ENODEV, and sfp_sm_add_mdio_bus()
> sets mdio_protocol = MDIO_I2C_NONE. By the time sfp_sm_probe_for_phy()
> runs (up to ~17 s later on affected hardware), the bridge is fully
> initialized but PHY probing is skipped because the protocol has already
> been changed to NONE.
>
> This affects both modules inserted before boot and hotplugged modules on
> hardware where bridge initialization exceeds the 200 ms probe window
> (confirmed: FLYPRO SFP-10GT-CS-30M with Aquantia AQR113C, hotplugged).
>
> Move the probe from i2c_mii_init_rollball(), called at bus-creation time,
> to sfp_sm_probe_for_phy() in sfp.c, where it runs after the SFP state
> machine module initialization delays. Export the probe function as
> mdio_i2c_probe_rollball() so sfp.c can call it.
>
> For RTL8261BE-based modules the probe correctly returns -ENODEV at PHY
> discovery time, causing sfp_sm_probe_for_phy() to destroy the MDIO bus
> and set MDIO_I2C_NONE, eliminating the 5+ minute PHY probe retry loop.
>
> For genuine RollBall modules (e.g. FLYPRO SFP-10GT-CS-30M with Aquantia
> AQR113C) the probe now runs after initialization is complete and
> correctly returns 0, so PHY detection proceeds normally.
>
> Reported-by: Aleksander Bajkowski <olek2@wp.pl>
> Fixes: 8fe125892f40 ("net: phy: sfp: probe for RollBall I2C-to-MDIO bridge in mdio-i2c")
> Signed-off-by: Petr Wozniak <petr.wozniak@gmail.com>
I'm not currently at home so I can't test that on my side, but as you'll
have to resend to the net tree, can you CC me for the next round so that
I can test with the few odd-ball modules I have?
I expect to be able to test this on Friday :(
Maxime
> ---
> v3: regenerated against net-next (v2 failed to apply due to transit
> corruption); fixed block comment style (checkpatch); no functional
> change.
> v2: commit message only - generalized scope (Aleksander Bajkowski);
> corrected SM description (Jan Hoffmann); no code change from v1.
> v1: initial.
> drivers/net/mdio/mdio-i2c.c | 15 +++++++++------
> drivers/net/phy/sfp.c | 22 +++++++++++++---------
> include/linux/mdio/mdio-i2c.h | 1 +
> 3 files changed, 23 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/net/mdio/mdio-i2c.c b/drivers/net/mdio/mdio-i2c.c
> index b88f63234b4e..2a3a418c1369 100644
> --- a/drivers/net/mdio/mdio-i2c.c
> +++ b/drivers/net/mdio/mdio-i2c.c
> @@ -419,7 +419,7 @@ static int i2c_mii_write_rollball(struct mii_bus *bus, int phy_id, int devad,
> return 0;
> }
>
> -static int i2c_mii_probe_rollball(struct i2c_adapter *i2c)
> +int mdio_i2c_probe_rollball(struct i2c_adapter *i2c)
> {
> u8 data_buf[] = { ROLLBALL_DATA_ADDR, 0x01, 0x00, 0x00 };
> u8 cmd_buf[] = { ROLLBALL_CMD_ADDR, ROLLBALL_CMD_READ };
> @@ -462,9 +462,13 @@ static int i2c_mii_probe_rollball(struct i2c_adapter *i2c)
>
> return -ENODEV;
> }
> +EXPORT_SYMBOL_GPL(mdio_i2c_probe_rollball);
>
> static int i2c_mii_init_rollball(struct i2c_adapter *i2c)
> {
> + /* Send the RollBall unlock password; bridge presence is verified
> + * later, in sfp_sm_probe_for_phy(), after module initialization.
> + */
> struct i2c_msg msg;
> u8 pw[5];
> int ret;
> @@ -486,7 +490,7 @@ static int i2c_mii_init_rollball(struct i2c_adapter *i2c)
> if (ret != 1)
> return -EIO;
>
> - return i2c_mii_probe_rollball(i2c);
> + return 0;
> }
>
> static bool mdio_i2c_check_functionality(struct i2c_adapter *i2c,
> @@ -531,10 +535,9 @@ struct mii_bus *mdio_i2c_alloc(struct device *parent, struct i2c_adapter *i2c,
> case MDIO_I2C_ROLLBALL:
> ret = i2c_mii_init_rollball(i2c);
> if (ret < 0) {
> - if (ret != -ENODEV)
> - dev_err(parent,
> - "Cannot initialize RollBall MDIO I2C protocol: %d\n",
> - ret);
> + dev_err(parent,
> + "Cannot initialize RollBall MDIO I2C protocol: %d\n",
> + ret);
> mdiobus_free(mii);
> return ERR_PTR(ret);
> }
> diff --git a/drivers/net/phy/sfp.c b/drivers/net/phy/sfp.c
> index c4d274ab651e..bbfaa0450798 100644
> --- a/drivers/net/phy/sfp.c
> +++ b/drivers/net/phy/sfp.c
> @@ -2174,17 +2174,10 @@ static void sfp_sm_fault(struct sfp *sfp, unsigned int next_state, bool warn)
>
> static int sfp_sm_add_mdio_bus(struct sfp *sfp)
> {
> - int ret;
> -
> if (sfp->mdio_protocol == MDIO_I2C_NONE)
> return 0;
>
> - ret = sfp_i2c_mdiobus_create(sfp);
> - if (ret == -ENODEV) {
> - sfp->mdio_protocol = MDIO_I2C_NONE;
> - return 0;
> - }
> - return ret;
> + return sfp_i2c_mdiobus_create(sfp);
> }
>
> /* Probe a SFP for a PHY device if the module supports copper - the PHY
> @@ -2215,7 +2208,18 @@ static int sfp_sm_probe_for_phy(struct sfp *sfp)
> break;
>
> case MDIO_I2C_ROLLBALL:
> - err = sfp_sm_probe_phy(sfp, SFP_PHY_ADDR_ROLLBALL, true);
> + /* Probe here, after module initialization delays, so that
> + * genuine RollBall bridges have had time to start up.
> + * Modules without a bridge (e.g. RTL8261BE) return -ENODEV.
> + */
> + err = mdio_i2c_probe_rollball(sfp->i2c);
> + if (err == -ENODEV) {
> + sfp_i2c_mdiobus_destroy(sfp);
> + sfp->mdio_protocol = MDIO_I2C_NONE;
> + break;
> + }
> + if (!err)
> + err = sfp_sm_probe_phy(sfp, SFP_PHY_ADDR_ROLLBALL, true);
> break;
> }
>
> diff --git a/include/linux/mdio/mdio-i2c.h b/include/linux/mdio/mdio-i2c.h
> index 65b550a6fc32..5cf14f45c94b 100644
> --- a/include/linux/mdio/mdio-i2c.h
> +++ b/include/linux/mdio/mdio-i2c.h
> @@ -20,5 +20,6 @@ enum mdio_i2c_proto {
>
> struct mii_bus *mdio_i2c_alloc(struct device *parent, struct i2c_adapter *i2c,
> enum mdio_i2c_proto protocol);
> +int mdio_i2c_probe_rollball(struct i2c_adapter *i2c);
>
> #endif
^ permalink raw reply
* Re: [PATCH iwl-net v2 1/2] ice: skip per-VLAN promisc rules when default VSI Rx rule is set
From: Simon Horman @ 2026-06-23 16:25 UTC (permalink / raw)
To: poros; +Cc: Simon Horman, netdev, aleksandr.loktionov
In-Reply-To: <20260622113428.2565255-2-poros@redhat.com>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
Subject: ice: skip per-VLAN promisc rules when default VSI Rx rule is set
This patch optimizes the number of hardware switch rules used when an ice
port is part of a vlan-filtering bridge. It avoids installing per-VLAN
promiscuous rules if the broad default VSI Rx rule is already catching
all traffic.
> diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
> index e2fbe111f8496..efb1f536de6c7 100644
> --- a/drivers/net/ethernet/intel/ice/ice_main.c
> +++ b/drivers/net/ethernet/intel/ice/ice_main.c
[ ... ]
> @@ -442,17 +496,12 @@ static int ice_vsi_sync_fltr(struct ice_vsi *vsi)
[Severity: High]
This is a pre-existing issue, but does MAC filter overflow fail to force
promiscuous mode, resulting in silent packet drops?
When the hardware MAC filter limit is reached (LIBIE_AQ_RC_ENOSPC), the driver
attempts to fall back to promiscuous mode by setting promisc_forced_on = true.
However, the execution flow then reaches a check for IFF_PROMISC:
if (vsi->current_netdev_flags & IFF_PROMISC) {
/* Apply Rx filter rule to get traffic from wire */
...
} else {
/* Clear Rx filter to remove traffic from wire */
Since the user never explicitly requested IFF_PROMISC, this evaluates to false,
causing the driver to take the else branch. The else branch actively disables
the default VSI rule and clears promiscuous mode instead of enabling it. The
fallback feature appears to be non-functional, leading to lost traffic for
newly added MACs.
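The control flow described in this finding can be modelled in user space. The following is a minimal sketch, not the ice driver's code: `struct vsi_model`, `sync_promisc_reported()` and `sync_promisc_alt()` are hypothetical names, and the "alt" variant is only one possible way to honor the fallback, not a proposed fix. It assumes the flow is as the review states it.

```c
#include <assert.h>
#include <stdbool.h>

#define IFF_PROMISC 0x100 /* same value as the uapi interface flag */

/* Hypothetical stand-in for the state this finding talks about. */
struct vsi_model {
	unsigned int netdev_flags; /* flags requested by the user */
	bool promisc_forced_on;    /* set when the MAC filter table overflows */
	bool dflt_rx_rule;         /* true while the default VSI Rx rule is installed */
};

/* Flow as reported: only IFF_PROMISC is consulted. */
static void sync_promisc_reported(struct vsi_model *v)
{
	if (v->netdev_flags & IFF_PROMISC)
		v->dflt_rx_rule = true;
	else
		v->dflt_rx_rule = false; /* the forced fallback is lost here */
}

/* One possible alternative: treat the forced fallback like a request. */
static void sync_promisc_alt(struct vsi_model *v)
{
	v->dflt_rx_rule = (v->netdev_flags & IFF_PROMISC) || v->promisc_forced_on;
}

/* Run one variant after a simulated filter overflow and report the rule state. */
static bool rule_after_overflow(void (*sync)(struct vsi_model *))
{
	struct vsi_model v = {
		.netdev_flags = 0,         /* user never asked for promisc */
		.promisc_forced_on = true, /* overflow fallback requested */
		.dflt_rx_rule = false,
	};

	sync(&v);
	return v.dflt_rx_rule;
}
```

With the reported flow the rule stays off after an overflow; with the alternative it is installed.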
[Severity: High]
This is a pre-existing issue, but could this function prevent retries of
incomplete configurations and break multicast promiscuity if any default
VSI is already active?
In ice_vsi_sync_fltr(), the setup of multicast promiscuity and disabling of
Rx filtering is gated:
if (!ice_is_dflt_vsi_in_use(vsi->port_info)) {
err = ice_set_dflt_vsi(vsi);
...
err = ice_set_promisc(vsi, ICE_MCAST_PROMISC_BITS);
if (err)
goto out_promisc;
}
If ice_set_dflt_vsi() succeeds but the subsequent ice_set_promisc() fails,
the function jumps to out_promisc to trigger a retry. On the next cycle,
ice_is_dflt_vsi_in_use() returns true because this VSI is now the default,
causing the entire block to be skipped and leaving the VSI misconfigured.
Furthermore, if another VSI (such as a VF) is already the default VSI,
the PF VSI entirely skips setting multicast promiscuity, breaking multicast
reception in promiscuous mode.
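The retry problem can be reproduced with a small state model. This is a sketch under the assumption that the gating works as described in this finding; `struct port_model` and `sync_once()` are hypothetical and stand in for the port state and one pass of the sync function.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two pieces of state involved. */
struct port_model {
	bool dflt_in_use;   /* what ice_is_dflt_vsi_in_use() would report */
	bool mcast_promisc; /* whether multicast promisc was configured */
};

/*
 * One sync pass. The whole block is gated on "no default VSI yet", so a
 * failure after the first step cannot be retried on the next pass.
 */
static int sync_once(struct port_model *p, bool promisc_fails)
{
	if (!p->dflt_in_use) {
		p->dflt_in_use = true; /* first step succeeds */
		if (promisc_fails)
			return -1;     /* caller schedules a retry */
		p->mcast_promisc = true;
	}
	return 0;
}
```

A first pass that fails in the second step followed by a clean second pass leaves `mcast_promisc` false, which is the misconfiguration the finding describes.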
> } else {
> /* Clear Rx filter to remove traffic from wire */
> if (ice_is_vsi_dflt_vsi(vsi)) {
> - err = ice_clear_dflt_vsi(vsi);
> + err = ice_vsi_exit_dflt_promisc(vsi);
> if (err) {
> - netdev_err(netdev, "Error %d clearing default VSI %i Rx rule\n",
> - err, vsi->vsi_num);
> vsi->current_netdev_flags |=
> IFF_PROMISC;
> goto out_promisc;
> }
> - if (vsi->netdev->features &
> - NETIF_F_HW_VLAN_CTAG_FILTER)
> - vlan_ops->ena_rx_filtering(vsi);
> }
[ ... ]
> @@ -3675,10 +3724,9 @@ int ice_vlan_rx_add_vid(struct net_device *netdev, __be16 proto, u16 vid)
> while (test_and_set_bit(ICE_CFG_BUSY, vsi->state))
> usleep_range(1000, 2000);
[Severity: High]
This is a pre-existing issue, but isn't there a risk of TOCTOU races with
this ad-hoc lock pattern lacking memory barriers?
The driver spins on ICE_CFG_BUSY using test_and_set_bit() and usleep_range()
to guard VSI filter synchronization. The lock is subsequently dropped using
clear_bit(ICE_CFG_BUSY, vsi->state) at the exit label.
Because clear_bit() lacks a memory release barrier, the CPU or compiler can
reorder memory stores from the critical section to occur after the flag is
cleared, leading to data races and corruption of shared filter state on
weakly ordered architectures.
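For illustration, the acquire/release pairing that a flag-based lock needs can be written with C11 atomics in user space. This is a sketch only: `struct cfg_model` and the `cfg_*()` helpers are hypothetical, and the mapping to the kernel's test_and_set_bit_lock()/clear_bit_unlock() helpers is an analogy rather than a proposed patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model: one busy flag guarding one piece of shared state. */
struct cfg_model {
	atomic_flag busy;  /* plays the role of the ICE_CFG_BUSY bit */
	int filter_state;  /* data protected by the flag */
};

static bool cfg_trylock(struct cfg_model *c)
{
	/* acquire: accesses in the critical section cannot move above this */
	return !atomic_flag_test_and_set_explicit(&c->busy, memory_order_acquire);
}

static void cfg_unlock(struct cfg_model *c)
{
	/* release: stores in the critical section complete before the clear */
	atomic_flag_clear_explicit(&c->busy, memory_order_release);
}

/* Update the protected state, or fail if someone else holds the flag. */
static int cfg_update(struct cfg_model *c, int value)
{
	if (!cfg_trylock(c))
		return -1;
	c->filter_state = value;
	cfg_unlock(c);
	return 0;
}
```

The point of the sketch is the unlock side: a plain relaxed clear would permit the store to `filter_state` to become visible after the flag already reads as free.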
> - /* Add multicast promisc rule for the VLAN ID to be added if
> - * all-multicast is currently enabled.
> - */
> - if (vsi->current_netdev_flags & IFF_ALLMULTI) {
> + /* skip the per-VID rule when the DFLT Rx rule already covers this VID */
> + if ((vsi->current_netdev_flags & IFF_ALLMULTI) &&
> + !ice_is_vsi_dflt_vsi(vsi)) {
> ret = ice_fltr_set_vsi_promisc(&vsi->back->hw, vsi->idx,
> ICE_MCAST_VLAN_PROMISC_BITS,
> vid);
> @@ -3696,11 +3744,12 @@ int ice_vlan_rx_add_vid(struct net_device *netdev, __be16 proto, u16 vid)
> if (ret)
> goto finish;
[Severity: High]
This is a pre-existing issue, but what happens if vlan_ops->add_vlan()
fails here? Is the hardware switch rule leaked?
When adding a VLAN to an interface with IFF_ALLMULTI enabled, this function
first allocates a promiscuous switch rule in hardware via
ice_fltr_set_vsi_promisc().
If this succeeds but the subsequent call to vlan_ops->add_vlan() fails
(e.g., due to ENOSPC or memory allocation failure), the function returns
early without rolling back the promiscuous rule. Because the network stack
considers the VLAN addition to have failed, ice_vlan_rx_kill_vid() will
never be called for this VID, permanently leaking a Flow Lookup Unit (FLU).
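The usual remedy for this class of leak is an unwind label that removes the rule installed in the first step. A minimal sketch, assuming the sequence is as described in this finding; `struct hw_model` and `add_vid_with_rollback()` are hypothetical and only count resources instead of touching hardware.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical resource counters standing in for hardware state. */
struct hw_model {
	int promisc_rules;
	int vlans;
};

static int add_vid_with_rollback(struct hw_model *hw, bool allmulti,
				 bool add_vlan_fails)
{
	int ret;

	if (allmulti)
		hw->promisc_rules++; /* per-VID promisc rule installed */

	/* stand-in for the VLAN add step, which may fail */
	ret = add_vlan_fails ? -ENOSPC : 0;
	if (ret)
		goto undo_promisc;

	hw->vlans++;
	return 0;

undo_promisc:
	if (allmulti)
		hw->promisc_rules--; /* roll back so nothing is stranded */
	return ret;
}
```

After a failed add, both counters are back at their starting values, so no later kill_vid call is needed to clean up.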
> - /* If all-multicast is currently enabled and this VLAN ID is only one
> - * besides VLAN-0 we have to update look-up type of multicast promisc
> - * rule for VLAN-0 from ICE_SW_LKUP_PROMISC to ICE_SW_LKUP_PROMISC_VLAN.
> + /* On the first non-zero VLAN, promote the VLAN-0 multicast promisc
> + * rule from ICE_SW_LKUP_PROMISC to ICE_SW_LKUP_PROMISC_VLAN. Skip when
> + * the DFLT Rx rule is installed; it already covers every VID.
> */
> if ((vsi->current_netdev_flags & IFF_ALLMULTI) &&
> + !ice_is_vsi_dflt_vsi(vsi) &&
> ice_vsi_num_non_zero_vlans(vsi) == 1) {
[ ... ]
> @@ -3763,11 +3812,12 @@ int ice_vlan_rx_kill_vid(struct net_device *netdev, __be16 proto, u16 vid)
> ICE_MCAST_VLAN_PROMISC_BITS, vid);
>
> if (!ice_vsi_has_non_zero_vlans(vsi)) {
> - /* Update look-up type of multicast promisc rule for VLAN 0
> - * from ICE_SW_LKUP_PROMISC_VLAN to ICE_SW_LKUP_PROMISC when
> - * all-multicast is enabled and VLAN 0 is the only VLAN rule.
> + /* Last non-zero VLAN gone: demote the VLAN-0 multicast promisc
> + * rule back to ICE_SW_LKUP_PROMISC. Skip when the DFLT Rx rule
> + * is installed; no recipe swap is needed.
> */
> - if (vsi->current_netdev_flags & IFF_ALLMULTI) {
> + if ((vsi->current_netdev_flags & IFF_ALLMULTI) &&
> + !ice_is_vsi_dflt_vsi(vsi)) {
[Severity: High]
Will this modification cause a stale switch rule (ICE_SW_LKUP_PROMISC_VLAN for
vid=0) to be permanently leaked if the last VLAN is removed while the DFLT VSI
rule is active?
This patch skips demoting the vid=0 multicast promiscuous rule from
ICE_SW_LKUP_PROMISC_VLAN back to ICE_SW_LKUP_PROMISC when the last VLAN is
removed, if ice_is_vsi_dflt_vsi(vsi) is true.
However, if the interface later drops IFF_PROMISC (removing the DFLT VSI rule)
and then drops IFF_ALLMULTI, ice_clear_promisc() is called. Because the
interface now has zero VLANs, ice_clear_promisc() executes its else branch:
} else {
status = ice_fltr_clear_vsi_promisc(&vsi->back->hw, vsi->idx,
promisc_m, 0);
}
This only searches and clears rules from the ICE_SW_LKUP_PROMISC recipe. The
stranded rule in the ICE_SW_LKUP_PROMISC_VLAN recipe is never cleared,
resulting in a permanent hardware switch rule leak.
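The sequence can be traced with a two-recipe counter model. This is a sketch under the assumptions stated in this finding (the vid=0 rule was promoted earlier, and the zero-VLAN clear path searches only one recipe); the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two lookup recipes named in the review. */
enum recipe { LKUP_PROMISC, LKUP_PROMISC_VLAN, LKUP_COUNT };

struct switch_model {
	int rules[LKUP_COUNT];
};

/* With zero VLANs, the clear path only searches the non-VLAN recipe. */
static void clear_promisc_no_vlans(struct switch_model *s)
{
	s->rules[LKUP_PROMISC] = 0;
}

/* Count rules left behind after the last VLAN is removed and promisc is cleared. */
static int leaked_rules(bool demote_skipped)
{
	struct switch_model s = { { 0, 0 } };

	s.rules[LKUP_PROMISC_VLAN] = 1; /* vid=0 rule promoted on first VLAN */

	if (!demote_skipped) {
		/* demotion moves the rule back to the non-VLAN recipe */
		s.rules[LKUP_PROMISC_VLAN]--;
		s.rules[LKUP_PROMISC]++;
	}

	clear_promisc_no_vlans(&s);
	return s.rules[LKUP_PROMISC] + s.rules[LKUP_PROMISC_VLAN];
}
```

Skipping the demotion leaves one rule in the VLAN recipe that the clear path never visits.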
> ice_fltr_clear_vsi_promisc(&vsi->back->hw, vsi->idx,
> ICE_MCAST_VLAN_PROMISC_BITS,
> 0);
^ permalink raw reply
* Re: [PATCH net-next v3 1/2] net: phy: sfp: free mii_bus in sfp_i2c_mdiobus_destroy
From: Maxime Chevallier @ 2026-06-23 16:23 UTC (permalink / raw)
To: Petr Wozniak, linux, andrew, hkallweit1
Cc: kuba, davem, edumazet, pabeni, netdev, linux-kernel, linux-phy,
bjorn, olek2, kabel
In-Reply-To: <20260623080538.7646-2-petr.wozniak@gmail.com>
On 6/23/26 10:05, Petr Wozniak wrote:
> sfp_i2c_mdiobus_create() allocates the I2C MDIO bus with mdio_i2c_alloc(),
> a plain (non-devm) allocation, and registers it. sfp_i2c_mdiobus_destroy()
> only unregisters the bus and clears sfp->i2c_mii without calling
> mdiobus_free(). As the only reference to the bus is then cleared, the
> struct mii_bus is leaked.
>
> This is hit whenever a copper/RollBall SFP module that instantiated an MDIO
> bus is removed: sfp_sm_main() takes the global teardown path and calls
> sfp_i2c_mdiobus_destroy(). sfp_cleanup(), on driver unbind, frees
> sfp->i2c_mii directly, which is why the leak only triggered on module
> hot-removal and not on unbind.
What is worse, this can happen many times in a row :)
>
> Free the bus in sfp_i2c_mdiobus_destroy() to match the allocation done in
> sfp_i2c_mdiobus_create().
>
> Fixes: e85b1347ace6 ("net: sfp: create/destroy I2C mdiobus before PHY probe/after PHY release")
> Signed-off-by: Petr Wozniak <petr.wozniak@gmail.com>
With this patch sent towards the -net tree,
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Maxime
> ---
> drivers/net/phy/sfp.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/drivers/net/phy/sfp.c b/drivers/net/phy/sfp.c
> index 03bfd8640db9..c4d274ab651e 100644
> --- a/drivers/net/phy/sfp.c
> +++ b/drivers/net/phy/sfp.c
> @@ -963,6 +963,7 @@ static int sfp_i2c_mdiobus_create(struct sfp *sfp)
> static void sfp_i2c_mdiobus_destroy(struct sfp *sfp)
> {
> mdiobus_unregister(sfp->i2c_mii);
> + mdiobus_free(sfp->i2c_mii);
> sfp->i2c_mii = NULL;
> }
>
^ permalink raw reply
* Re: [PATCH bpf-next v2] bpf, unix: Guard sk_msg-dependent code behind CONFIG_NET_SOCK_MSG
From: Kuniyuki Iwashima @ 2026-06-23 16:08 UTC (permalink / raw)
To: Jakub Sitnicki
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Jakub Kicinski,
Jiayuan Chen, John Fastabend, netdev, kernel-team
In-Reply-To: <20260623-bpf-sk_msg-split-unix-v2-1-ca7a626a94a5@cloudflare.com>
On Tue, Jun 23, 2026 at 4:20 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> Prepare to decouple BPF_SYSCALL config option from NET_SOCK_MSG. When
> completed all code paths related to sockmap-based redirects should be
> guarded by BPF_SYSCALL && NET_SOCK_MSG to allow users to opt out by
> disabling NET_SOCK_MSG. The implementation of sockmap as a container for
> socket references would remain under BPF_SYSCALL.
>
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---
> Changes in v2:
> - Handle prot->recvmsg being NULL (Sashiko)
> - Elaborate on the end goal in description
> - Link to v1: https://patch.msgid.link/20260622-bpf-sk_msg-split-unix-v1-1-d7e0cb7bb03b@cloudflare.com
> ---
> net/unix/af_unix.c | 4 ++--
> net/unix/unix_bpf.c | 6 ++++++
> 2 files changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index f7a9d55eee8a..84c11c60c75f 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -2675,7 +2675,7 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg, size_t si
> #ifdef CONFIG_BPF_SYSCALL
> const struct proto *prot = READ_ONCE(sk->sk_prot);
>
> - if (prot != &unix_dgram_proto)
> + if (prot->recvmsg)
There is no reason to have this dead branch when
CONFIG_BPF_SYSCALL && !NET_SOCK_MSG.
Let's compile out all sockmap code when both configs
are not enabled.
Since AF_UNIX differs from TCP/UDP, it can take the
simpler approach.
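One reading of this suggestion is to build the override branch only when both options are set. The sketch below is a user-space illustration of that reading, not the requested af_unix change: `struct proto_model` and the helper functions are hypothetical, and the exact guard condition is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Leave both undefined to model a build without sockmap redirects. */
/* #define CONFIG_BPF_SYSCALL 1 */
/* #define CONFIG_NET_SOCK_MSG 1 */

/* Hypothetical stand-in for the protocol ops table. */
struct proto_model {
	int (*recvmsg)(int size);
};

static int default_recvmsg(int size)
{
	return size;
}

static int override_recvmsg(int size)
{
	return -size; /* distinguishable from the default path */
}

static int dgram_recvmsg(const struct proto_model *prot, int size)
{
#if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_NET_SOCK_MSG)
	/* Only built when an override can actually be installed. */
	if (prot->recvmsg)
		return prot->recvmsg(size);
#else
	(void)prot; /* branch compiled out: no dead check at run time */
#endif
	return default_recvmsg(size);
}
```

With neither macro defined, the default path is taken even if a `recvmsg` pointer happens to be set, and the compiler emits no test for it.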
> return prot->recvmsg(sk, msg, size, flags);
> #endif
> return __unix_dgram_recvmsg(sk, msg, size, flags);
> @@ -3152,7 +3152,7 @@ static int unix_stream_recvmsg(struct socket *sock, struct msghdr *msg,
> struct sock *sk = sock->sk;
> const struct proto *prot = READ_ONCE(sk->sk_prot);
>
> - if (prot != &unix_stream_proto)
> + if (prot->recvmsg)
> return prot->recvmsg(sk, msg, size, flags);
> #endif
> return unix_stream_read_generic(&state, true);
> diff --git a/net/unix/unix_bpf.c b/net/unix/unix_bpf.c
> index f86ff19e9764..5289a04b4993 100644
> --- a/net/unix/unix_bpf.c
> +++ b/net/unix/unix_bpf.c
> @@ -7,6 +7,7 @@
>
> #include "af_unix.h"
>
> +#ifdef CONFIG_NET_SOCK_MSG
> #define unix_sk_has_data(__sk, __psock) \
> ({ !skb_queue_empty(&__sk->sk_receive_queue) || \
> !skb_queue_empty(&__psock->ingress_skb) || \
> @@ -94,6 +95,7 @@ static int unix_bpf_recvmsg(struct sock *sk, struct msghdr *msg,
> sk_psock_put(sk, psock);
> return copied;
> }
> +#endif /* CONFIG_NET_SOCK_MSG */
>
> static struct proto *unix_dgram_prot_saved __read_mostly;
> static DEFINE_SPINLOCK(unix_dgram_prot_lock);
> @@ -107,8 +109,10 @@ static void unix_dgram_bpf_rebuild_protos(struct proto *prot, const struct proto
> {
> *prot = *base;
> prot->close = sock_map_close;
> +#ifdef CONFIG_NET_SOCK_MSG
> prot->recvmsg = unix_bpf_recvmsg;
> prot->sock_is_readable = sk_msg_is_readable;
> +#endif
> }
>
> static void unix_stream_bpf_rebuild_protos(struct proto *prot,
> @@ -116,8 +120,10 @@ static void unix_stream_bpf_rebuild_protos(struct proto *prot,
> {
> *prot = *base;
> prot->close = sock_map_close;
> +#ifdef CONFIG_NET_SOCK_MSG
> prot->recvmsg = unix_bpf_recvmsg;
> prot->sock_is_readable = sk_msg_is_readable;
> +#endif
> prot->unhash = sock_map_unhash;
> }
>
>
>
>
^ permalink raw reply
* Re: [PATCH 0/3] SM8450 IPA support
From: Alex Elder @ 2026-06-23 15:56 UTC (permalink / raw)
To: esteuwu, Bjorn Andersson, Konrad Dybcio, Rob Herring,
Krzysztof Kozlowski, Conor Dooley, Andrew Lunn, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Alex Elder
Cc: linux-arm-msm, devicetree, linux-kernel, netdev
In-Reply-To: <20260622-sm8450-ipa-v1-0-532f0299f96e@proton.me>
On 6/22/26 8:44 PM, Esteban Urrutia via B4 Relay wrote:
> This series adds support for the IPA subsystem found in the SM8450 SoC.
> While IPA v5.0 is very similar to IPA v5.1 (heck, it even managed to
> properly get the modem up and running), it wasn't perfect, since the
> modem would sometimes hang when rebooting or powering the AP off.
> After a thorough investigation, I managed to create the proper data file
> required for IPA v5.1.
>
> Regards,
> Esteban
I assume you have implemented this based on what you found in
some downstream code. And if so, could you please indicate
where to find that (so I can do some cross-referencing myself).
I no longer have access to any Qualcomm internal documentation.
Thanks.
-Alex
> Signed-off-by: Esteban Urrutia <esteuwu@proton.me>
> ---
> Esteban Urrutia (3):
> arm64: dts: qcom: sm8450: Add IPA support
> dt-bindings: net: qcom,ipa: Add SM8450 compatible string
> net: ipa: Add IPA v5.1 data
>
> .../devicetree/bindings/net/qcom,ipa.yaml | 1 +
> arch/arm64/boot/dts/qcom/sm8450.dtsi | 55 ++-
> drivers/net/ipa/Makefile | 2 +-
> drivers/net/ipa/data/ipa_data-v5.1.c | 477 +++++++++++++++++++++
> drivers/net/ipa/gsi_reg.c | 1 +
> drivers/net/ipa/ipa_data.h | 1 +
> drivers/net/ipa/ipa_main.c | 4 +
> drivers/net/ipa/ipa_reg.c | 1 +
> 8 files changed, 536 insertions(+), 6 deletions(-)
> ---
> base-commit: 948efecf22e49aa4bf55bb73ec79a0ddcfd38571
> change-id: 20260622-sm8450-ipa-5da81f67eb65
>
> Best regards,
> --
> Esteban Urrutia <esteuwu@proton.me>
>
>
^ permalink raw reply
* Re: [PATCH] net: ipa: fix SMEM state handle leaks in SMP2P init
From: Alex Elder @ 2026-06-23 15:53 UTC (permalink / raw)
To: Haoxiang Li, elder, andrew+netdev, davem, edumazet, kuba, pabeni
Cc: netdev, linux-kernel, stable
In-Reply-To: <20260623031831.1788454-1-haoxiang_li2024@163.com>
On 6/22/26 10:18 PM, Haoxiang Li wrote:
> ipa_smp2p_init() acquires two Qualcomm SMEM state handles with
> qcom_smem_state_get(). However, neither the init error paths
> nor ipa_smp2p_exit() release them.
>
> Use devm_qcom_smem_state_get() for both state handles so the
> references are released automatically when the platform device
> is removed.
>
> Fixes: 530f9216a953 ("soc: qcom: ipa: AP/modem communications")
> Cc: stable@vger.kernel.org
> Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com>
So I guess they were never "put" before?
This looks OK, but I'll just mention that the IPA code
doesn't use devm_*() (managed) interfaces. So it would
be more consistent to just call qcom_smem_state_put()
at the end of ipa_smp2p_exit() for both ipa->enabled_state
and ipa->valid_state.
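The explicit pairing suggested here can be sketched with a reference-count model. This is an illustration only: `state_get()`/`state_put()` stand in for the real SMEM state calls, `struct smp2p_model` is hypothetical, and the init error path is included because the commit message mentions it.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical refcounted handle. */
struct state_model {
	int refs;
};

static struct state_model *state_get(struct state_model *s)
{
	s->refs++;
	return s;
}

static void state_put(struct state_model *s)
{
	s->refs--;
}

struct smp2p_model {
	struct state_model *valid_state;
	struct state_model *enabled_state;
};

static int smp2p_init(struct smp2p_model *m, struct state_model *valid,
		      struct state_model *enabled, bool bit_out_of_range)
{
	m->valid_state = state_get(valid);
	m->enabled_state = state_get(enabled);

	if (bit_out_of_range) {
		/* error path: drop both references before returning */
		state_put(m->enabled_state);
		state_put(m->valid_state);
		m->enabled_state = NULL;
		m->valid_state = NULL;
		return -EINVAL;
	}
	return 0;
}

static void smp2p_exit(struct smp2p_model *m)
{
	/* release in reverse order of acquisition, at the end of exit */
	state_put(m->enabled_state);
	state_put(m->valid_state);
	m->enabled_state = NULL;
	m->valid_state = NULL;
}
```

Either way out of the model, through a failed init or through exit, both counters return to zero.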
-Alex
> ---
> drivers/net/ipa/ipa_smp2p.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/net/ipa/ipa_smp2p.c b/drivers/net/ipa/ipa_smp2p.c
> index 2f0ccdd937cc..d8fd56949082 100644
> --- a/drivers/net/ipa/ipa_smp2p.c
> +++ b/drivers/net/ipa/ipa_smp2p.c
> @@ -228,15 +228,15 @@ ipa_smp2p_init(struct ipa *ipa, struct platform_device *pdev, bool modem_init)
> u32 valid_bit;
> int ret;
>
> - valid_state = qcom_smem_state_get(dev, "ipa-clock-enabled-valid",
> - &valid_bit);
> + valid_state = devm_qcom_smem_state_get(dev, "ipa-clock-enabled-valid",
> + &valid_bit);
> if (IS_ERR(valid_state))
> return PTR_ERR(valid_state);
> if (valid_bit >= 32) /* BITS_PER_U32 */
> return -EINVAL;
>
> - enabled_state = qcom_smem_state_get(dev, "ipa-clock-enabled",
> - &enabled_bit);
> + enabled_state = devm_qcom_smem_state_get(dev, "ipa-clock-enabled",
> + &enabled_bit);
> if (IS_ERR(enabled_state))
> return PTR_ERR(enabled_state);
> if (enabled_bit >= 32) /* BITS_PER_U32 */
^ permalink raw reply
* Re: [PATCH 1/2] bug: Provide WARN_ON.*DEFERRED() macros for console deferred output
From: Petr Mladek @ 2026-06-23 15:49 UTC (permalink / raw)
To: Andrew Morton
Cc: Sebastian Andrzej Siewior, linux-arch, linux-kernel, sched-ext,
netdev, David S . Miller, Andrea Righi, Arnd Bergmann, Ben Segall,
Breno Leitao, Changwoo Min, David Vernet, Dietmar Eggemann,
Eric Dumazet, Ingo Molnar, Jakub Kicinski, John Ogness,
Juri Lelli, K Prateek Nayak, Paolo Abeni, Peter Zijlstra,
Sergey Senozhatsky, Simon Horman, Steven Rostedt, Tejun Heo,
Vincent Guittot, Vlad Poenaru
In-Reply-To: <20260623081258.580e034fdb5b98f4f8dba44a@linux-foundation.org>
On Tue 2026-06-23 08:12:58, Andrew Morton wrote:
> On Tue, 23 Jun 2026 16:26:49 +0200 Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
>
> > Provide a deferred version of the WARN_ON() macro. It will delay
> > flushing the console until a later context. It is needed in a context
> > where the caller holds locks which can lead to a deadlock if content is
> > flushed to the console driver.
> > An example would be a warning from within the scheduler resulting in a
> > wake-up of a task.
> >
> > Deferring the output works by using printk_deferred_enter()/exit() around
> > the printing output. This must be used in a context where the task can't
> > migrate to another CPU. This should be the case usually, since the
> > scheduler would acquire the rq lock with disabled interrupts, but to be
> > safe preemption is disabled to guarantee this.
> >
> > In order not to bloat the code on architectures which provide an
> > optimized __WARN_FLAGS() define BUGFLAG_DEFERRED which is handled by
> > __report_bug() and does not increase the code size.
> >
> > Provide the DEFERRED macros based on __WARN_FLAGS and __WARN_FLAGS
> > macros. Extend __report_bug() to handle the deferred case.
> >
> > ...
> >
> > --- a/include/asm-generic/bug.h
> > +++ b/include/asm-generic/bug.h
> > @@ -229,7 +230,10 @@ static enum bug_trap_type __report_bug(struct bug_entry *bug, unsigned long buga
> > */
> > bug->flags |= BUGFLAG_DONE;
> > }
> > -
> > + if (deferred) {
> > + preempt_disable_notrace();
> > + printk_deferred_enter();
> > + }
>
> For some reason the comment over printk_deferred_enter() says
> "Interrupts must be disabled for the deferred duration". Is that the
> case for all the printk_deferred_enter() calls which this patch adds?
Strictly speaking, "only" CPU migration must be disabled around
printk_deferred_enter()/exit() call because the state is stored
in a per-CPU variable.
It means that preempt_disable() would work.
I do not recall whether we mentioned interrupts by mistake or
on purpose. It is possible that we suggested to disable interrupts
because we did not want to defer messages from unrelated (interrupt)
context.
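Why migration matters can be shown with a toy per-CPU counter. This is a user-space sketch, not printk's implementation: the array, the two-CPU limit and the function names are all hypothetical, and the CPU is passed explicitly to simulate where the task happens to run.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NR_CPUS_MODEL 2

/* Hypothetical per-CPU "deferred" nesting depth. */
static int deferred_depth[NR_CPUS_MODEL];

static void deferred_enter(int cpu)
{
	deferred_depth[cpu]++;
}

static void deferred_exit(int cpu)
{
	deferred_depth[cpu]--;
}

/* Returns true if every CPU's depth is back to zero after the section. */
static bool run_section(int cpu_at_enter, int cpu_at_exit)
{
	memset(deferred_depth, 0, sizeof(deferred_depth));

	deferred_enter(cpu_at_enter);
	/* messages printed here are deferred only on cpu_at_enter */
	deferred_exit(cpu_at_exit);

	return deferred_depth[0] == 0 && deferred_depth[1] == 0;
}
```

If the task moves between enter and exit, one CPU stays in deferred mode and the other goes negative, which is why disabling preemption is sufficient for correctness of the state itself.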
Best Regards,
Petr
^ permalink raw reply
* [PATCH net-next v3] vsock/virtio: rewrite MSG_ZEROCOPY flag handling
From: Arseniy Krasnov @ 2026-06-23 15:38 UTC (permalink / raw)
To: Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin,
Jason Wang, Bobby Eshleman, Xuan Zhuo, Eugenio Pérez,
Simon Horman
Cc: kvm, virtualization, netdev, linux-kernel, oxffffaa, rulkc,
Arseniy Krasnov
The MSG_ZEROCOPY flag handling was logically based on the TCP implementation,
so to make further support easier, rewrite it the TCP way (as in
'tcp_sendmsg_locked()'). This patch only rewrites the flag handling (i.e. it
does not change the logic).
Signed-off-by: Arseniy Krasnov <avkrasnov@rulkc.org>
---
Changelog v1->v2:
* Rebase on latest 'net-next'. Don't need 'skb_zcopy_set()' now - it was
already added.
Changelog v2->v3:
* Update commit message.
* Remove one empty line.
net/vmw_vsock/virtio_transport_common.c | 47 ++++++++++++-------------
1 file changed, 22 insertions(+), 25 deletions(-)
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 09475007165b..41c2a0b82a8e 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -328,38 +328,35 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
return pkt_len;
- if (info->msg) {
- /* If zerocopy is not enabled by 'setsockopt()', we behave as
- * there is no MSG_ZEROCOPY flag set.
+ if (info->msg && (info->msg->msg_flags & MSG_ZEROCOPY)) {
+ /* If 'info->msg' is not NULL, this is only VIRTIO_VSOCK_OP_RW.
+ * 'MSG_ZEROCOPY' flag handling here is based on the same flag
+ * handling from 'tcp_sendmsg_locked()'.
*/
- if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
- info->msg->msg_flags &= ~MSG_ZEROCOPY;
+ if (info->msg->msg_ubuf) {
+ uarg = info->msg->msg_ubuf;
+ can_zcopy = virtio_transport_can_zcopy(t_ops, info, pkt_len);
+ } else if (sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY)) {
+ uarg = msg_zerocopy_realloc(sk_vsock(vsk), pkt_len,
+ NULL, false);
+ if (!uarg) {
+ virtio_transport_put_credit(vvs, pkt_len);
+ return -ENOMEM;
+ }
- if (info->msg->msg_flags & MSG_ZEROCOPY)
can_zcopy = virtio_transport_can_zcopy(t_ops, info, pkt_len);
+ if (!can_zcopy)
+ uarg_to_msgzc(uarg)->zerocopy = 0;
+ have_uref = true;
+ }
+
+ /* 'can_zcopy' means that this transmission will be
+ * in zerocopy way (e.g. using 'frags' array).
+ */
if (can_zcopy)
max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
(MAX_SKB_FRAGS * PAGE_SIZE));
-
- if (info->msg->msg_flags & MSG_ZEROCOPY &&
- info->op == VIRTIO_VSOCK_OP_RW) {
- uarg = info->msg->msg_ubuf;
-
- if (!uarg) {
- uarg = msg_zerocopy_realloc(sk_vsock(vsk),
- pkt_len, NULL, false);
- if (!uarg) {
- virtio_transport_put_credit(vvs, pkt_len);
- return -ENOMEM;
- }
-
- if (!can_zcopy)
- uarg_to_msgzc(uarg)->zerocopy = 0;
-
- have_uref = true;
- }
- }
}
rest_len = pkt_len;
--
2.25.1
^ permalink raw reply related
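The branch structure introduced by the patch above can be summarized as a pure decision function. This is a sketch for reading the diff, not part of the patch: `struct zc_result` and `zc_decide()` are hypothetical, `transport_ok` stands in for the result of virtio_transport_can_zcopy(), and the allocation is assumed to succeed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of what the new block decides. */
struct zc_result {
	bool have_uarg; /* a completion object is attached */
	bool have_uref; /* this call allocated it and owns a reference */
	bool can_zcopy; /* payload will be placed in the 'frags' array */
};

static struct zc_result zc_decide(bool has_msg, bool msg_zerocopy,
				  bool has_ubuf, bool sock_zerocopy,
				  bool transport_ok)
{
	struct zc_result r = { false, false, false };

	/* Nothing to do without a message carrying MSG_ZEROCOPY. */
	if (!has_msg || !msg_zerocopy)
		return r;

	if (has_ubuf) {
		/* caller supplied the completion object */
		r.have_uarg = true;
		r.can_zcopy = transport_ok;
	} else if (sock_zerocopy) {
		/* socket opted in: allocate one (assumed to succeed) */
		r.have_uarg = true;
		r.have_uref = true;
		r.can_zcopy = transport_ok;
	}
	return r;
}
```

A message with MSG_ZEROCOPY on a socket that never enabled SOCK_ZEROCOPY and has no caller-supplied buffer falls through both branches and is sent as a normal copy.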
* Re: [PATCH 1/2] bug: Provide WARN_ON.*DEFERRED() macros for console deferred output
From: Andrew Morton @ 2026-06-23 15:12 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-arch, linux-kernel, sched-ext, netdev, David S . Miller,
Andrea Righi, Arnd Bergmann, Ben Segall, Breno Leitao,
Changwoo Min, David Vernet, Dietmar Eggemann, Eric Dumazet,
Ingo Molnar, Jakub Kicinski, John Ogness, Juri Lelli,
K Prateek Nayak, Paolo Abeni, Peter Zijlstra, Petr Mladek,
Sergey Senozhatsky, Simon Horman, Steven Rostedt, Tejun Heo,
Vincent Guittot, Vlad Poenaru
In-Reply-To: <20260623142650.265721-2-bigeasy@linutronix.de>
On Tue, 23 Jun 2026 16:26:49 +0200 Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> Provide a deferred version of the WARN_ON() macro. It will delay
> flushing the console until a later context. It is needed in a context
> where the caller holds locks which can lead to a deadlock if content is
> flushed to the console driver.
> An example would be a warning from within the scheduler resulting in a
> wake-up of a task.
>
> Deferring the output works by using printk_deferred_enter()/exit() around
> the printing output. This must be used in a context where the task can't
> migrate to another CPU. This should be the case usually, since the
> scheduler would acquire the rq lock with disabled interrupts, but to be
> safe preemption is disabled to guarantee this.
>
> In order not to bloat the code on architectures which provide an
> optimized __WARN_FLAGS() define BUGFLAG_DEFERRED which is handled by
> __report_bug() and does not increase the code size.
>
> Provide the DEFERRED macros based on __WARN_FLAGS and __WARN_FLAGS
> macros. Extend __report_bug() to handle the deferred case.
>
> ...
>
> --- a/include/asm-generic/bug.h
> +++ b/include/asm-generic/bug.h
> @@ -229,7 +230,10 @@ static enum bug_trap_type __report_bug(struct bug_entry *bug, unsigned long buga
> */
> bug->flags |= BUGFLAG_DONE;
> }
> -
> + if (deferred) {
> + preempt_disable_notrace();
> + printk_deferred_enter();
> + }
For some reason the comment over printk_deferred_enter() says
"Interrupts must be disabled for the deferred duration". Is that the
case for all the printk_deferred_enter() calls which this patch adds?
^ permalink raw reply
* Re: [PATCH net-next v2] Documentation: net/smc: correct old value of smcr_max_recv_wr
From: Breno Leitao @ 2026-06-23 15:12 UTC (permalink / raw)
To: Mahanta Jambigi
Cc: andrew+netdev, davem, edumazet, kuba, pabeni, alibuda, dust.li,
sidraya, wenjia, wintera, pasic, horms, tonylu, guwen, netdev,
linux-s390
In-Reply-To: <20260424052336.3262350-1-mjambigi@linux.ibm.com>
On Fri, Apr 24, 2026 at 07:23:36AM +0200, Mahanta Jambigi wrote:
> The smc-sysctl.rst documentation incorrectly stated that the previous
> hardcoded maximum number of WR buffers on the receive path (smcr_max_recv_wr)
> was 16. The correct historical value used before the introduction of the sysctl
> control was 48. Update the documentation to reflect the accurate historical
> value. Also fix a couple of minor typos.
>
> Fixes: aef3cdb47bbb net/smc: make wr buffer count configurable
This Fixes tag is broken. You probably want:
Fixes: aef3cdb47bbb ("net/smc: make wr buffer count configurable")
Other than that, it looks good, the corrected value checks out.
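The difference between the two tag forms can be checked mechanically. The following is a rough sketch of such a check, assuming the expected shape is an abbreviated commit ID of at least 12 hex digits followed by the subject in parentheses and double quotes; `fixes_tag_ok()` is a hypothetical helper, not an existing tool.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Returns true for lines shaped like: Fixes: <12+ hex digits> ("subject") */
static bool fixes_tag_ok(const char *line)
{
	static const char prefix[] = "Fixes: ";
	size_t n, len;

	if (strncmp(line, prefix, sizeof(prefix) - 1) != 0)
		return false;
	line += sizeof(prefix) - 1;

	/* count the leading hex digits of the commit ID */
	for (n = 0; isxdigit((unsigned char)line[n]); n++)
		;
	if (n < 12)
		return false;
	line += n;

	/* the subject must open with space, parenthesis, double quote */
	if (strncmp(line, " (\"", 3) != 0)
		return false;
	line += 3;

	/* need at least one subject character before the closing quote and parenthesis */
	len = strlen(line);
	return len >= 3 && strcmp(line + len - 2, "\")") == 0;
}
```

The tag from the patch fails this check because the subject is not wrapped, while the suggested replacement passes.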
^ permalink raw reply