* Wireguard head of line blocking when CPUs saturate
From: Toke Høiland-Jørgensen @ 2026-06-19 15:56 UTC
To: wireguard; +Cc: netdev
Hey everyone
I'm running Wireguard on my main gateway, which is a not-super-high
powered ARM box with eight cores (based on the NXP LS1088A SoC). The box
does, however, also have eight hardware queues for its networking, which
means regular network traffic can be spread nicely across the cores.
However, the per-core performance is limited, making it pretty trivial
to saturate a single core by just running a fat TCP flow through it. And
when this happens, Wireguard traffic just... stalls. I.e., no traffic
gets through the Wireguard interface until the (unrelated) flow
saturating one of the cores subsides.
I suspect what happens is that Wireguard spreads out traffic to all
cores for encryption, but has to wait for the respective CPUs to finish
encrypting the packets in order before they can actually be transmitted.
And because one CPU is now suddenly saturated in softirq context, the
Wireguard work queue never gets a chance to run on that CPU, stalling TX
progress for the Wireguard device entirely.
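To make the suspected mechanism concrete, here is a toy Python model. It is not WireGuard code: the function name and the per-CPU completion times are invented for illustration. Packets are handed round-robin to CPUs for encryption but may only be transmitted in order, so a single CPU that never gets to run its worker holds back every packet queued behind it.

```python
import math


def transmit_times(n_packets, finish_time_per_cpu):
    """Toy model of in-order release after parallel encryption.

    Packet i is encrypted on CPU (i % ncpus) and is finished at
    finish_time_per_cpu[cpu]. A packet can only be transmitted once all
    earlier packets have been transmitted.
    """
    ncpus = len(finish_time_per_cpu)
    released = []
    earliest = 0
    for i in range(n_packets):
        done = finish_time_per_cpu[i % ncpus]
        earliest = max(earliest, done)  # cannot overtake earlier packets
        released.append(earliest)
    return released


# Eight CPUs; CPU 3 is saturated and never runs its worker, which is
# modelled here as an infinite completion time.
times = transmit_times(16, [1, 1, 1, math.inf, 1, 1, 1, 1])
sent = sum(1 for t in times if t != math.inf)
print(sent)  # only the packets ahead of the first one on CPU 3 get out
```

In this model only the three packets ahead of the first one assigned to CPU 3 are ever sent, even though seven of the eight CPUs are idle.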
I'm sending this message to (a) see if anyone else is seeing the same
kind of stalling, and (b) to get input on whether the explanation
outlined above seems plausible. And, in the case of affirmative answers
to both (a) and (b), to hopefully start a discussion on what to do about
this :)
-Toke
* Re: [RFC] Enabling CONFIG_NTP_PPS for NOHZ by adding ntp_error to system_time_snapshot
From: David Woodhouse @ 2026-06-19 15:34 UTC
To: Thomas Gleixner, John Stultz, Stephen Boyd, Miroslav Lichvar,
Richard Cochran, linux-kernel, netdev
Cc: Rodolfo Giometti, Alexander Gordeev
In-Reply-To: <87h5myd56x.ffs@fw13>
On Fri, 2026-06-19 at 15:34 +0200, Thomas Gleixner wrote:
> On Fri, Jun 19 2026 at 01:33, David Woodhouse wrote:
> > @@ -1285,6 +1286,45 @@ void ktime_get_snapshot_id(clockid_t clock_id, struct system_time_snapshot *syst
> >
> > nsec_sys = timekeeping_cycles_to_ns(&tk->tkr_mono, now);
> > nsec_raw = timekeeping_cycles_to_ns(&tk->tkr_raw, now);
> > +
> > + /*
> > + * For the NTP-disciplined mono-based clocks, report how far
> > + * @systime is from the ideal NTP time at @now, in signed ns,
> > + * so a caller can land on the ideal line by adding it. Four
> > + * terms, summed in ns << NTP_SCALE_SHIFT before converting:
> > + *
> > + * - tk->ntp_error, the deviation as of the last update;
> > + * - (cycle_delta * ntp_err_frac), the fractional-mult drift
> > + * accrued since then (cycle_delta is at most a tick on a
> > + * tickful kernel, but many ticks' worth under NO_HZ);
> > + * - (cycle_delta * ntp_err_mult), subtracting the applied +1
> > + * mult dither over the same span;
> > + * - the sub-ns fraction @systime dropped when the read was
> > + * truncated to whole ns (low @shift bits, exact despite the
> > + * multiply overflowing).
> > + *
> > + * RAW is undisciplined and AUX has its own discipline, so they
> > + * carry no ntp_error.
>
> AUX has ntp_error too. AUX clocks have a per-clock NTP instance, which
> works exactly like the main timekeeper's one. Only CLOCK_MONOTONIC_RAW
> needs to be excluded.
Ack.
> > + */
> > + if (clock_id == CLOCK_REALTIME || clock_id == CLOCK_MONOTONIC ||
> > + clock_id == CLOCK_BOOTTIME) {
> > + u32 nes = tk->ntp_error_shift;
> > + u64 cycle_delta = (now - tk->tkr_mono.cycle_last) &
> > + tk->tkr_mono.mask;
> > + s64 err = tk->ntp_error +
> > + (((s64)mul_u64_u64_shr(cycle_delta,
> > + tk->ntp_err_frac, 32) -
> > + (s64)(cycle_delta * tk->ntp_err_mult)) << nes);
> > +
> > + err += (s64)((cycle_delta * tk->tkr_mono.mult +
> > + tk->tkr_mono.xtime_nsec) &
> > + ((1ULL << tk->tkr_mono.shift) - 1)) << nes;
> > + systime_snapshot->ntp_error =
> > + (err + (1LL << (NTP_SCALE_SHIFT - 1))) >>
> > + NTP_SCALE_SHIFT;
>
> This formatting makes my brain hurt. Can you please split that out into
> a separate function?
Yep. There's also a potential error there: an *additional* discrepancy
comes from the enforced monotonicity that timekeeping_cycles_to_ns()
applies (the case where it just returns tkr->xtime_nsec >> tkr->shift).
I couldn't work out if I cared about the clocksource-is-non-monotonic
case, and even if I did, what I should do about it.
I also wasn't sure if this should be a new CLOCK_REALTIME_NONMONOTONIC
or something like that, such that e.g. PTP clients could *ask* for it.
It's all very well hard-coding it in pps_get_ts() and unconditionally
changing the behaviour... I *think* we could justify that. But the
example I actually used in the patch was PTP, and there the behavioural
change is slightly harder to justify.
* [PATCH net] net: au1000: move free_irq out of the close-time spinlocked section
From: Runyu Xiao @ 2026-06-19 15:18 UTC
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, linux-kernel, Runyu Xiao, stable
au1000_close() calls free_irq() while aup->lock is still held with
spin_lock_irqsave(). free_irq() can sleep because it takes the IRQ
descriptor request mutex, so it does not belong inside the close-time
spinlocked section.
This was found by our static analysis tool and then confirmed by manual
review of the in-tree au1000_close() .ndo_stop path. The reviewed path
keeps aup->lock held across the MAC reset, queue stop and
free_irq(dev->irq, dev).
A targeted runtime test exercised the same .ndo_stop path, with
free_irq(dev->irq, dev) still called under the driver lock. Lockdep reported
"BUG: sleeping function called from invalid context" and "Invalid wait
context" while free_irq() was taking desc->request_mutex, with
au1000_close() and free_irq() on the stack.
Drop aup->lock before freeing the IRQ. The protected close-time work still
stops the device and queue before IRQ teardown, but the sleepable IRQ core
path now runs outside the spinlocked section.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn>
---
drivers/net/ethernet/amd/au1000_eth.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/amd/au1000_eth.c b/drivers/net/ethernet/amd/au1000_eth.c
index 9d35ac348ebe..5a04056e38fa 100644
--- a/drivers/net/ethernet/amd/au1000_eth.c
+++ b/drivers/net/ethernet/amd/au1000_eth.c
@@ -943,9 +943,10 @@ static int au1000_close(struct net_device *dev)
/* stop the device */
netif_stop_queue(dev);
+ spin_unlock_irqrestore(&aup->lock, flags);
+
/* disable the interrupt */
free_irq(dev->irq, dev);
- spin_unlock_irqrestore(&aup->lock, flags);
return 0;
}
--
2.34.1
* [PATCH v3 2/2] selftests/tc-testing: Add DualPI2 GSO backlog accounting test
From: Xingquan Liu @ 2026-06-19 15:13 UTC
To: Jamal Hadi Salim
Cc: netdev, Jiri Pirko, Victor Nogueira, Chia-Yu Chang, Xingquan Liu,
stable
In-Reply-To: <20260619151447.223640-1-b1n@b1n.io>
Add a regression test for DualPI2 GSO backlog accounting when it is
used as a child qdisc of QFQ.
The test sends one UDP GSO datagram through a QFQ class with DualPI2 as
the leaf qdisc. DualPI2 splits the skb into two segments. After the
traffic drains, both QFQ and DualPI2 must report zero backlog and zero
qlen.
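As a quick sanity check of the numbers used by the test, the segment count follows from ceiling division of the total payload length by the GSO segment size. The helper below is only an illustration; its name is hypothetical and it is not part of the patch.

```python
def gso_segments(payload_len, gso_size):
    """Number of segments a UDP GSO payload is split into (ceiling division)."""
    if gso_size <= 0:
        raise ValueError("gso_size must be positive")
    return -(-payload_len // gso_size)


# The test sends a 2400-byte payload with a 1200-byte segment size.
print(gso_segments(2400, 1200))  # 2
```

This matches the two segments the commit message describes and the packet count of 2 that the test expects on both qdiscs.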
On kernels with the broken accounting, QFQ can keep a stale non-zero
qlen after all real packets have been dequeued.
Signed-off-by: Xingquan Liu <b1n@b1n.io>
---
.../tc-testing/tc-tests/qdiscs/dualpi2.json | 44 +++++++++++++++++++
tools/testing/selftests/tc-testing/tdc_gso.py | 43 ++++++++++++++++++
2 files changed, 87 insertions(+)
create mode 100755 tools/testing/selftests/tc-testing/tdc_gso.py
diff --git a/tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json b/tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json
index cd1f2ee8f354..ed6a900bb568 100644
--- a/tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json
+++ b/tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json
@@ -250,5 +250,49 @@
"teardown": [
"$TC qdisc del dev $DUMMY handle 1: root"
]
+ },
+ {
+ "id": "891f",
+ "name": "Verify DualPI2 GSO backlog accounting with QFQ parent",
+ "category": [
+ "qdisc",
+ "dualpi2",
+ "qfq",
+ "gso"
+ ],
+ "plugins": {
+ "requires": "nsPlugin"
+ },
+ "setup": [
+ "$IP link set dev $DUMMY up || true",
+ "$IP addr add 10.10.10.10/24 dev $DUMMY || true",
+ "$TC qdisc add dev $DUMMY root handle 1: qfq",
+ "$TC class add dev $DUMMY parent 1: classid 1:1 qfq weight 1 maxpkt 4096",
+ "$TC qdisc add dev $DUMMY parent 1:1 handle 2: dualpi2",
+ "$TC filter add dev $DUMMY parent 1: matchall classid 1:1"
+ ],
+ "cmdUnderTest": "./tdc_gso.py 10.10.10.10 10.10.10.1 9000 1200 2400",
+ "expExitCode": "0",
+ "verifyCmd": "$TC -j -s qdisc ls dev $DUMMY",
+ "matchJSON": [
+ {
+ "kind": "qfq",
+ "handle": "1:",
+ "packets": 2,
+ "backlog": 0,
+ "qlen": 0
+ },
+ {
+ "kind": "dualpi2",
+ "handle": "2:",
+ "packets": 2,
+ "backlog": 0,
+ "qlen": 0
+ }
+ ],
+ "teardown": [
+ "$TC qdisc del dev $DUMMY root",
+ "$IP addr del 10.10.10.10/24 dev $DUMMY || true"
+ ]
}
]
diff --git a/tools/testing/selftests/tc-testing/tdc_gso.py b/tools/testing/selftests/tc-testing/tdc_gso.py
new file mode 100755
index 000000000000..b66528ea4b68
--- /dev/null
+++ b/tools/testing/selftests/tc-testing/tdc_gso.py
@@ -0,0 +1,43 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+"""
+tdc_gso.py - send a UDP GSO datagram
+
+Copyright (C) 2026 Xingquan Liu <b1n@b1n.io>
+"""
+
+import argparse
+import socket
+import struct
+import sys
+
+UDP_MAX_SEGMENTS = 1 << 7
+
+
+parser = argparse.ArgumentParser(description="UDP GSO datagram sender")
+parser.add_argument("src", help="source IPv4 address")
+parser.add_argument("dst", help="destination IPv4 address")
+parser.add_argument("port", type=int, help="destination UDP port")
+parser.add_argument("gso_size", type=int, help="UDP GSO segment payload size")
+parser.add_argument("payload_len", type=int, help="total UDP payload length")
+args = parser.parse_args()
+
+if args.gso_size <= 0 or args.gso_size > 0xFFFF:
+    parser.error("gso_size must fit in an unsigned 16-bit integer")
+if args.payload_len <= args.gso_size:
+    parser.error("payload_len must be larger than gso_size")
+if args.payload_len > args.gso_size * UDP_MAX_SEGMENTS:
+    parser.error("payload_len exceeds UDP_MAX_SEGMENTS")
+
+SOL_UDP = getattr(socket, "SOL_UDP", socket.IPPROTO_UDP)
+UDP_SEGMENT = getattr(socket, "UDP_SEGMENT", 103)
+
+sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
+sock.bind((args.src, 0))
+
+payload = b"b" * args.payload_len
+cmsg = [(SOL_UDP, UDP_SEGMENT, struct.pack("=H", args.gso_size))]
+
+sent = sock.sendmsg([payload], cmsg, 0, (args.dst, args.port))
+sys.exit(sent != len(payload))
--
Xingquan Liu
* [PATCH v3 1/2] net/sched: dualpi2: fix GSO backlog accounting
From: Xingquan Liu @ 2026-06-19 15:13 UTC
To: Jamal Hadi Salim
Cc: netdev, Jiri Pirko, Victor Nogueira, Chia-Yu Chang, Xingquan Liu,
stable
When DualPI2 splits a GSO skb into N segments, it propagates N
additional packets to its parent before returning NET_XMIT_SUCCESS.
The parent then accounts for the original skb once more, leaving its
qlen one larger than the number of packets actually queued.
With QFQ as the parent, after all real packets are dequeued, QFQ still
has a non-zero qlen while its in-service aggregate has no active
classes. qfq_choose_next_agg() returns NULL and qfq_dequeue() passes
the result to qfq_peek_skb(), causing a NULL pointer dereference.
Follow the same pattern used by tbf_segment() and taprio: count only
successfully queued segments, propagate the difference between the
original skb and those segments, and return NET_XMIT_SUCCESS whenever
at least one segment was queued.
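The off-by-one can be followed with a small Python model of the packet counter only. This is a sketch, not kernel code: the function name is hypothetical, it assumes the enqueue loop increments cnt once per successfully queued segment, and it assumes the old code returned NET_XMIT_SUCCESS in the case modelled here.

```python
def parent_qlen_after_enqueue(n_queued, fixed):
    """Packets the parent qdisc believes it holds after DualPI2 segments
    one GSO skb and queues n_queued (>= 1) of the resulting segments."""
    if n_queued < 1:
        raise ValueError("this model only covers n_queued >= 1")

    parent_qlen = 0
    if fixed:
        cnt = n_queued            # new code: cnt starts at 0
        # qdisc_tree_reduce_backlog(sch, 1 - cnt, ...): reduce by (1 - cnt)
        parent_qlen -= 1 - cnt
    else:
        cnt = 1 + n_queued        # old code: cnt starts at 1
        if cnt > 1:
            cnt -= 1
        # qdisc_tree_reduce_backlog(sch, -cnt, ...): reduce by -cnt
        parent_qlen -= -cnt

    # On NET_XMIT_SUCCESS the parent accounts for the original skb itself.
    parent_qlen += 1
    return parent_qlen


for n in (1, 2, 4):
    print(n, parent_qlen_after_enqueue(n, fixed=False),
          parent_qlen_after_enqueue(n, fixed=True))
```

In this model the old accounting always leaves the parent one packet above the number of segments actually queued, while the fixed accounting matches it.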
Fixes: 8f9516daedd6 ("sched: Add enqueue/dequeue of dualpi2 qdisc")
Cc: stable@vger.kernel.org
Signed-off-by: Xingquan Liu <b1n@b1n.io>
---
v3:
- Move the UDP GSO sender into tdc_gso.py.
v2:
- Change patch commit message.
- Add tdc test.
net/sched/sch_dualpi2.c | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/net/sched/sch_dualpi2.c b/net/sched/sch_dualpi2.c
index d7c3254ef800..5434df6ca8ef 100644
--- a/net/sched/sch_dualpi2.c
+++ b/net/sched/sch_dualpi2.c
@@ -461,7 +461,7 @@ static int dualpi2_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
if (IS_ERR_OR_NULL(nskb))
return qdisc_drop(skb, sch, to_free);
- cnt = 1;
+ cnt = 0;
byte_len = 0;
orig_len = qdisc_pkt_len(skb);
skb_list_walk_safe(nskb, nskb, next) {
@@ -488,16 +488,15 @@ static int dualpi2_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
byte_len += nskb->len;
}
}
- if (cnt > 1) {
+ if (cnt > 0) {
/* The caller will add the original skb stats to its
* backlog, compensate this if any nskb is enqueued.
*/
- --cnt;
- byte_len -= orig_len;
+ qdisc_tree_reduce_backlog(sch, 1 - cnt,
+ orig_len - byte_len);
}
- qdisc_tree_reduce_backlog(sch, -cnt, -byte_len);
consume_skb(skb);
- return err;
+ return cnt > 0 ? NET_XMIT_SUCCESS : err;
}
return dualpi2_enqueue_skb(skb, sch, to_free);
}
base-commit: 96e7f9122aae0ed000ee321f324b812a447906d9
--
Xingquan Liu
* [PATCH net v2] net/smc: fix out-of-bounds read when sk_user_data holds a sk_psock
From: Sechang Lim @ 2026-06-19 15:03 UTC
To: D . Wythe, Dust Li, Sidraya Jayagond, Wenjia Zhang,
David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: Mahanta Jambigi, Tony Lu, Wen Gu, Simon Horman, Ursula Braun,
Karsten Graul, Guvenc Gulce, linux-rdma, linux-s390, netdev,
linux-kernel, bpf
SMC stores its smc_sock in the clcsock's sk_user_data tagged
SK_USER_DATA_NOCOPY and reads it back with smc_clcsock_user_data(), which
only strips that flag. sockmap stores a sk_psock in the same field tagged
SK_USER_DATA_NOCOPY | SK_USER_DATA_PSOCK. Nothing keeps both off one
socket, and SMC then casts the sk_psock to an smc_sock.
A passive-open child hits this. It inherits the listener's
smc_clcsock_data_ready(), but sk_clone_lock() clears its NOCOPY
sk_user_data, and a BPF sock_ops program then adds the child to a sockmap,
installing a sk_psock in that field. The inherited callback reads it as an
smc_sock and dereferences a clcsk_* pointer past the end of the sk_psock:
BUG: KASAN: slab-out-of-bounds in smc_clcsock_data_ready+0x84/0x200 net/smc/af_smc.c:2637
Read of size 8 at addr ffff8880013b8674 by task syz.6.12484/67930
<IRQ>
smc_clcsock_data_ready+0x84/0x200 net/smc/af_smc.c:2637
tcp_urg+0x24d/0x360 net/ipv4/tcp_input.c:6264
tcp_rcv_state_process+0x280d/0x4940 net/ipv4/tcp_input.c:7336
tcp_child_process+0x371/0xa50 net/ipv4/tcp_minisocks.c:1002
tcp_v4_rcv+0x1eaa/0x2a00 net/ipv4/tcp_ipv4.c:2186
[...]
</IRQ>
Allocated by task 67930:
sk_psock_init+0x142/0x740 net/core/skmsg.c:766
sock_hash_update_common+0xd3/0x990 net/core/sock_map.c:1010
bpf_sock_hash_update+0x114/0x170 net/core/sock_map.c:1229
__cgroup_bpf_run_filter_sock_ops+0x74/0xa0 kernel/bpf/cgroup.c:1727
tcp_init_transfer+0x1085/0x1100 net/ipv4/tcp_input.c:6693
[...]
sk_psock() already guards the other side, returning NULL unless
SK_USER_DATA_PSOCK is set. Make smc_clcsock_user_data() and its RCU
variant return the smc_sock only when sk_user_data carries SMC's tag
alone. A sk_psock then reads back as NULL, which the data_ready and
fallback callbacks already handle.
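The tag check can be illustrated with plain integers standing in for tagged pointers. This is a sketch under stated assumptions: the flag values below are assumed to mirror include/net/sock.h, the addresses are made up, and old_lookup/new_lookup are hypothetical names for the behaviour before and after the patch.

```python
# Assumed flag values for the low bits of sk_user_data.
SK_USER_DATA_NOCOPY = 1
SK_USER_DATA_BPF = 2
SK_USER_DATA_PSOCK = 4
FLAG_BITS = SK_USER_DATA_NOCOPY | SK_USER_DATA_BPF | SK_USER_DATA_PSOCK


def old_lookup(data):
    """Old behaviour: strip only NOCOPY, whatever else is tagged."""
    return data & ~SK_USER_DATA_NOCOPY


def new_lookup(data):
    """Patched behaviour: accept the value only if NOCOPY is the sole tag."""
    if (data & FLAG_BITS) != SK_USER_DATA_NOCOPY:
        return None
    return data & ~FLAG_BITS


smc_slot = 0x1000 | SK_USER_DATA_NOCOPY                         # SMC's own tag
psock_slot = 0x2000 | SK_USER_DATA_NOCOPY | SK_USER_DATA_PSOCK  # sockmap's tag

print(hex(old_lookup(psock_slot)))  # non-NULL: the psock would be misread
print(new_lookup(psock_slot))       # None
print(hex(new_lookup(smc_slot)))    # the untagged smc_sock address
```

A cleared slot (0) also reads back as None under the patched check, which is the case the callbacks already handle.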
Fixes: a60a2b1e0af1 ("net/smc: reduce active tcp_listen workers")
Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
---
net/smc/smc.h | 18 +++++++++++++++---
1 file changed, 15 insertions(+), 3 deletions(-)
diff --git a/net/smc/smc.h b/net/smc/smc.h
index 52145df83f6e..88dfb459b7cc 100644
--- a/net/smc/smc.h
+++ b/net/smc/smc.h
@@ -342,13 +342,25 @@ static inline void smc_init_saved_callbacks(struct smc_sock *smc)
static inline struct smc_sock *smc_clcsock_user_data(const struct sock *clcsk)
{
- return (struct smc_sock *)
- ((uintptr_t)clcsk->sk_user_data & ~SK_USER_DATA_NOCOPY);
+ uintptr_t data = (uintptr_t)clcsk->sk_user_data;
+
+ /*
+ * Return the smc_sock only if the slot carries SMC's tag alone.
+ * sockmap stores a sk_psock here tagged SK_USER_DATA_PSOCK; it is
+ * not an smc_sock and must not be dereferenced as one.
+ */
+ if ((data & ~SK_USER_DATA_PTRMASK) != SK_USER_DATA_NOCOPY)
+ return NULL;
+ return (struct smc_sock *)(data & SK_USER_DATA_PTRMASK);
}
static inline struct smc_sock *smc_clcsock_user_data_rcu(const struct sock *clcsk)
{
- return (struct smc_sock *)rcu_dereference_sk_user_data(clcsk);
+ uintptr_t data = (uintptr_t)rcu_dereference(__sk_user_data(clcsk));
+
+ if ((data & ~SK_USER_DATA_PTRMASK) != SK_USER_DATA_NOCOPY)
+ return NULL;
+ return (struct smc_sock *)(data & SK_USER_DATA_PTRMASK);
}
/* save target_cb in saved_cb, and replace target_cb with new_cb */
--
2.43.0
* Re: [PATCH net] net/smc: fix out-of-bounds read in smc_clcsock_data_ready()
From: Sechang Lim @ 2026-06-19 14:59 UTC
To: D. Wythe
Cc: Dust Li, Sidraya Jayagond, Wenjia Zhang, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, David S . Miller, Mahanta Jambigi,
Tony Lu, Wen Gu, Simon Horman, Ursula Braun, Karsten Graul,
Guvenc Gulce, netdev, linux-rdma, linux-s390, bpf, linux-kernel
In-Reply-To: <20260616071639.GA104390@j66a10360.sqa.eu95>
On Tue, Jun 16, 2026 at 03:16:39PM +0800, D. Wythe wrote:
>On Sun, Jun 14, 2026 at 12:09:30PM +0000, Sechang Lim wrote:
>> smc_clcsock_data_ready() is installed on the listen socket and reads its
>> sk_user_data as an smc_sock. A passive-open child inherits this callback,
>> but sk_clone_lock() clears the child's sk_user_data because it is tagged
>> SK_USER_DATA_NOCOPY. smc_tcp_syn_recv_sock() restores the child's af_ops,
>> but the inherited sk_data_ready() is left in place until accept.
>>
>> In that window the child is established. A cgroup sock_ops program can run
>> bpf_sock_hash_update() on it from tcp_init_transfer(); sk_psock_init()
>> stores a sk_psock in the NULL sk_user_data. The inherited callback then
>> reads sk_user_data via smc_clcsock_user_data(), which masks only
>> SK_USER_DATA_NOCOPY, mistakes the sk_psock for an smc_sock, and reads a
>> callback pointer past the end of the sk_psock:
>>
>> BUG: KASAN: slab-out-of-bounds in smc_clcsock_data_ready+0x84/0x200 net/smc/af_smc.c:2637
>> Read of size 8 at addr ffff8880013b8674 by task syz.6.12484/67930
>> <IRQ>
>> smc_clcsock_data_ready+0x84/0x200 net/smc/af_smc.c:2637
>> tcp_urg+0x24d/0x360 net/ipv4/tcp_input.c:6264
>> tcp_rcv_state_process+0x280d/0x4940 net/ipv4/tcp_input.c:7336
>> tcp_child_process+0x371/0xa50 net/ipv4/tcp_minisocks.c:1002
>> tcp_v4_rcv+0x1eaa/0x2a00 net/ipv4/tcp_ipv4.c:2186
>> ip_protocol_deliver_rcu+0x226/0x420 net/ipv4/ip_input.c:207
>> ip_local_deliver_finish+0x35a/0x5f0 net/ipv4/ip_input.c:241
>> __netif_receive_skb_one_core+0x1e5/0x210 net/core/dev.c:6216
>> process_backlog+0x631/0x1470 net/core/dev.c:6682
>> __napi_poll+0xb3/0x320 net/core/dev.c:7749
>> net_rx_action+0x4fa/0xcb0 net/core/dev.c:7969
>> handle_softirqs+0x236/0x800 kernel/softirq.c:622
>> </IRQ>
>>
>> Allocated by task 67930:
>> sk_psock_init+0x142/0x740 net/core/skmsg.c:766
>> sock_map_link+0x646/0xdf0 net/core/sock_map.c:279
>> sock_hash_update_common+0xd3/0x990 net/core/sock_map.c:1010
>> bpf_sock_hash_update+0x114/0x170 net/core/sock_map.c:1229
>> __cgroup_bpf_run_filter_sock_ops+0x74/0xa0 kernel/bpf/cgroup.c:1727
>> tcp_init_transfer+0x1085/0x1100 net/ipv4/tcp_input.c:6693
>> tcp_rcv_state_process+0x241e/0x4940 net/ipv4/tcp_input.c:7231
>> tcp_child_process+0x371/0xa50 net/ipv4/tcp_minisocks.c:1002
>>
>> Restore the inherited sk_data_ready() in smc_tcp_syn_recv_sock(), where the
>> child's sk_user_data is already cleared, rather than only at accept.
>>
>> Fixes: a60a2b1e0af1 ("net/smc: reduce active tcp_listen workers")
>> Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
>> ---
>> net/smc/af_smc.c | 6 ++++++
>> 1 file changed, 6 insertions(+)
>>
>> diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
>> index b5db69073e20..152971e8ad17 100644
>> --- a/net/smc/af_smc.c
>> +++ b/net/smc/af_smc.c
>> @@ -156,6 +156,12 @@ static struct sock *smc_tcp_syn_recv_sock(const struct sock *sk,
>> if (child) {
>> rcu_assign_sk_user_data(child, NULL);
>>
>> + /*
>> + * the child inherited the listen-specific sk_data_ready();
>> + * restore it here, as sk_user_data may be reused before accept
>> + */
>> + child->sk_data_ready = smc->clcsk_data_ready;
>
>One concern:
>
>smc_clcsock_user_data_rcu() together with refcount_inc_not_zero() only
>pins the smc_sock; it does not guarantee anything about the lifetime or
>consistency of smc->clcsk_data_ready. In the listen-close path,
>smc_clcsock_restore_cb() clears that field under sk_callback_lock,
>while smc_tcp_syn_recv_sock() reads it without any lock. These are
>independent protection domains. If close wins the race,
>child->sk_data_ready can end up NULL and the next data arrival will
>crash.
>
Will drop the syn_recv restore in v2. Thanks for your review.
>Also, I don't object to this fix, but I'd rather see the underlying cause
>addressed directly. The real issue seems to be the conflict between
>SMC's sk_user_data and sk_psock. Maybe there is a cleaner solution, e.g.
>always setting user_data.
>
Agreed.
Thanks, will send v2.
Best,
Sechang
* Re: AW: AW: AW: [PATCH net] net: usb: lan78xx: restore VLAN filter table after device reset
From: Nicolai Buchwitz @ 2026-06-19 14:01 UTC
To: Sven Schuchmann
Cc: Thangaraj Samynathan, Rengarajan Sundararajan, UNGLinuxDriver,
Woojung.Huh, Andrew Lunn, David S . Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, netdev, linux-usb, linux-kernel
In-Reply-To: <BEZP281MB224523ADACDB48D8E3974D4AD9E22@BEZP281MB2245.DEUP281.PROD.OUTLOOK.COM>
Hi Sven
On 19.6.2026 15:31, Sven Schuchmann wrote:
> Hello Nicolai,
>
> looks good from my point of view
> (Calling the lan78xx_write_vlan_table() from
> lan78xx_mac_link_up() and from lan78xx_reset()).
Thanks.
> But I investigated a little more, and it seems the hash table
> (which is right behind the VLAN table in the controller's memory)
> also gets cleared. I wrote some random data into this table and saw
> that it gets cleared as well. I think this needs to be fixed too.
Something like
static int lan78xx_write_mchash_table(struct lan78xx_net *dev)
{
    struct lan78xx_priv *pdata = (struct lan78xx_priv *)(dev->data[0]);

    /* call taken from lan78xx_deferred_multicast_write() */
    return lan78xx_dataport_write(dev, DP_SEL_RSEL_VLAN_DA_,
                                  DP_SEL_VHF_VLAN_LEN,
                                  DP_SEL_VHF_HASH_LEN,
                                  pdata->mchash_table);
}
with callers in lan78xx_deferred_multicast_write() and
lan78xx_mac_link_up(), should do the trick?
>
> In the Datasheet from the LAN7801 I can read:
> "After a reset event, the RFE will automatically initialize the
> contents of the VHF to 0h."
> Where VHF also refers to the hash table.
> But I still do not understand what reset is happening when I just
> unplug the network cable....
I suspect it is triggered from the PHY:
8.10 (MAC Reset Watchdog Timer):
"A portion of the MAC operates on clocks generated by the Ethernet PHY
[...] PHY Reset
(PHY_RST) results in resetting the portion of the MAC operating on the
PHY receive and
transmit clocks."
So which PHY are you using?
> [...]
Thanks,
Nicolai
* Re: [PATCH net-next v5 12/15] onsemi: s2500: Add driver support for TS2500 MAC-PHY
From: Uwe Kleine-König @ 2026-06-19 13:59 UTC
To: Selvamani.Rajagopal
Cc: Andrew Lunn, Piergiorgio Beruto, Heiner Kallweit, Russell King,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Andrew Lunn, Parthiban Veerasooran, Richard Cochran, Rob Herring,
Krzysztof Kozlowski, Conor Dooley, Simon Horman, Jonathan Corbet,
Shuah Khan, netdev, linux-kernel, devicetree, linux-doc,
Jerry Ray
In-Reply-To: <20260614-s2500-mac-phy-support-v5-12-89874b72f725@onsemi.com>
On Sun, Jun 14, 2026 at 10:00:28AM -0700, Selvamani Rajagopal via B4 Relay wrote:
> +static const struct of_device_id s2500_of_match[] = {
> + { .compatible = "onnn,s2500" },
> + {}
s/{}/{ }/
> +};
> +
> +static const struct spi_device_id s2500_ids[] = {
> + { "s2500" },
> + {}
> +};
Please make this:
static const struct spi_device_id s2500_ids[] = {
{ .name = "s2500" },
{ }
};
> +MODULE_DEVICE_TABLE(spi, s2500_ids);
> +
> +static struct spi_driver s2500_driver = {
> + .driver = {
> + .name = DRV_NAME,
> + .of_match_table = s2500_of_match,
> + },
> + .probe = s2500_probe,
> + .remove = s2500_remove,
> + .id_table = s2500_ids,
Tastes differ, but aligning the = signs usually gets broken by
follow-up patches. Here it's broken from the start. If you ask me: use a
single space before each =.
> +};
> +
> +module_spi_driver(s2500_driver);
Usually there is no empty line between the driver struct and the macro
registering it.
> +
> +MODULE_AUTHOR("Piergiorgio Beruto <pier.beruto@onsemi.com>");
> +MODULE_AUTHOR("Selva Rajagopal <selvamani.rajagopal@onsemi.com>");
> +MODULE_DESCRIPTION("onsemi MACPHY ethernet driver");
> +MODULE_LICENSE("GPL");
Best regards
Uwe
* Re: [PATCH v4 net] net: mana: Optimize irq affinity for low vcpu configs
From: Yury Norov @ 2026-06-19 13:55 UTC
To: Shradha Gupta
Cc: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
Dipayaan Roy, Shiraz Saleem, Michael Kelley, Long Li, Yury Norov,
linux-hyperv, linux-kernel, netdev, Paul Rosswurm, Shradha Gupta,
Saurabh Singh Sengar, stable
In-Reply-To: <20260619073338.481035-1-shradhagupta@linux.microsoft.com>
On Fri, Jun 19, 2026 at 12:33:35AM -0700, Shradha Gupta wrote:
> In the mana driver, the number of IRQs allocated is capped at
> min(num_cpu + 1, queue count). In cases where the IRQ count is greater
> than the vCPU count, we want to utilize all the vCPUs, irrespective of
> their NUMA/core bindings.
>
> This is important, especially in environments where the number of vCPUs
> is so small that the softIRQ handling overhead of two IRQs on the same
> vCPU is much higher than if they were spread across sibling vCPUs.
>
> This behaviour is more evident with dynamic IRQ allocation. Since MANA
> IRQs are assigned at a later stage compared to static allocation, other
> device IRQs may already be affinitized to the vCPUs. As a result, IRQ
> weights become imbalanced, causing multiple MANA IRQs to land on the
> same vCPU, while some vCPUs have none.
>
> In such cases when many parallel TCP connections are tested, the
> throughput drops significantly.
>
> We also studied the results of setting the affinity and hint to
> NULL in these cases. We observed that if IRQs other than MANA's are
> already allocated on the VM when the MANA IRQs are allocated, the MANA
> queue IRQs cluster again.
> These results can be seen through case 3 in the following data.
>
> Test envs:
> =======================================================
> Case 1: without this patch
> =======================================================
> 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
>
>         TYPE      effective vCPU aff
> =======================================================
> IRQ0:   HWC       0
> IRQ1:   mana_q1   0
> IRQ2:   mana_q2   2
> IRQ3:   mana_q3   0
> IRQ4:   mana_q4   3
>
> %soft on each vCPU(mpstat -P ALL 1) on receiver
> vCPU      0       1       2       3
> =======================================================
> pass 1:   38.85   0.03    24.89   24.65
> pass 2:   39.15   0.03    24.57   25.28
> pass 3:   40.36   0.03    23.20   23.17
>
> =======================================================
> Case 2: with this patch
> =======================================================
> 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
>
>         TYPE      effective vCPU aff
> =======================================================
> IRQ0:   HWC       0
> IRQ1:   mana_q1   0
> IRQ2:   mana_q2   1
> IRQ3:   mana_q3   2
> IRQ4:   mana_q4   3
>
> %soft on each vCPU(mpstat -P ALL 1) on receiver
> vCPU      0       1       2       3
> =======================================================
> pass 1:   15.42   15.85   14.99   14.51
> pass 2:   15.53   15.94   15.81   15.93
> pass 3:   16.41   16.35   16.40   16.36
>
> =======================================================
> Case 3: with affinity set to NULL
> =======================================================
> 4 vCPU(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
>
>         TYPE      effective vCPU aff
> =======================================================
> IRQ0:   HWC       0
> IRQ1:   mana_q1   2
> IRQ2:   mana_q2   3
> IRQ3:   mana_q3   2
> IRQ4:   mana_q4   3
>
> =======================================================
> Throughput Impact(in Gbps, same env)
> =======================================================
> TCP conn   with patch   w/o patch   aff NULL
> 20480      15.65        7.73        5.25
> 10240      15.63        8.93        5.77
> 8192       15.64        9.69        7.16
> 6144       15.64        13.16       9.33
> 4096       15.69        15.75       13.50
> 2048       15.69        15.83       13.61
> 1024       15.71        15.28       13.60
>
> Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
> Cc: stable@vger.kernel.org
> Co-developed-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Yury Norov <ynorov@nvidia.com>
> ---
> Changes in v4
> * Add mana prefix on irq_affinity_*() in mana driver
> * Corrected grammar, comment for mana_irq_setup_linear()
> * added new line as per guidelines
> * added case 3 in commit message for when affinity is NULL
> ---
> Changes in v3
> * Optimize the comments in mana_gd_setup_dyn_irqs()
> * add more details in the dev_dbg for extra IRQs
> ---
> Changes in v2
> * Removed the unused skip_first_cpu variable
> * fixed exit condition in irq_setup_linear() with len == 0
> * changed return type of irq_setup_linear() as it will always be 0
> * removed the unnecessary rcu_read_lock() in irq_setup_linear()
> * added appropriate comments to indicate expected behaviour when
> IRQs are more than or equal to num_online_cpus()
> ---
> .../net/ethernet/microsoft/mana/gdma_main.c | 78 +++++++++++++++----
> 1 file changed, 64 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> index a0fdd052d7f1..e8b7ffb47eb9 100644
> --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> @@ -210,6 +210,8 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
> } else {
> /* If dynamic allocation is enabled we have already allocated
> * hwc msi
> + * Also, we make sure in this case the following is always true
> + * (num_msix_usable - 1 HWC) <= num_online_cpus()
> */
> gc->num_msix_usable = min(resp.max_msix, num_online_cpus() + 1);
> }
> @@ -1909,8 +1911,8 @@ void mana_gd_free_res_map(struct gdma_resource *r)
> * do the same thing.
> */
>
> -static int irq_setup(unsigned int *irqs, unsigned int len, int node,
> - bool skip_first_cpu)
> +static int mana_irq_setup_numa_aware(unsigned int *irqs, unsigned int len,
> + int node, bool skip_first_cpu)
> {
> const struct cpumask *next, *prev = cpu_none_mask;
> cpumask_var_t cpus __free(free_cpumask_var);
> @@ -1946,11 +1948,24 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
> return 0;
> }
>
> +/* must be called with cpus_read_lock() held */
> +static void mana_irq_setup_linear(unsigned int *irqs, unsigned int len)
> +{
> + int cpu;
> +
> + for_each_online_cpu(cpu) {
> + if (len == 0)
> + break;
> +
> + irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
> + len--;
> + }
> +}
> +
> static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> {
> struct gdma_context *gc = pci_get_drvdata(pdev);
> struct gdma_irq_context *gic;
> - bool skip_first_cpu = false;
> int *irqs, err, i, msi;
>
> irqs = kmalloc_objs(int, nvec);
> @@ -1958,10 +1973,12 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> return -ENOMEM;
>
> /*
> + * In this function, num_msix_usable = HWC IRQ + Queue IRQ.
> + * nvec is only Queue IRQ (HWC already setup).
> * While processing the next pci irq vector, we start with index 1,
> * as IRQ vector at index 0 is already processed for HWC.
> * However, the population of irqs array starts with index 0, to be
> - * further used in irq_setup()
> + * further used in mana_irq_setup_numa_aware()
> */
> for (i = 1; i <= nvec; i++) {
> msi = i;
> @@ -1975,18 +1992,51 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> }
>
> /*
> - * When calling irq_setup() for dynamically added IRQs, if number of
> - * CPUs is more than or equal to allocated MSI-X, we need to skip the
> - * first CPU sibling group since they are already affinitized to HWC IRQ
> + * When calling mana_irq_setup_numa_aware() for dynamically added IRQs,
> + * if number of CPUs is more than or equal to allocated MSI-X, we need to
> + * skip the first CPU sibling group since they are already affinitized to
> + * HWC IRQ
> */
> cpus_read_lock();
> - if (gc->num_msix_usable <= num_online_cpus())
> - skip_first_cpu = true;
> + if (gc->num_msix_usable <= num_online_cpus()) {
> + err = mana_irq_setup_numa_aware(irqs, nvec, gc->numa_node,
> + true);
> + if (err) {
> + cpus_read_unlock();
> + goto free_irq;
> + }
> + } else {
> + /*
> + * When num_msix_usable are more than num_online_cpus, our
> + * queue IRQs should be equal to num of online vCPUs.
> + * We try to make sure queue IRQs spread across all vCPUs.
> + * In such a case NUMA or CPU core affinity does not matter.
> + * Note: in this case the total mana IRQ should always be
> + * num_online_cpus + 1. The first HWC IRQ is already handled
> + * in HWC setup calls
> + * However, if CPUs went offline since num_msix_usable was
> + * computed, queue IRQs will be more than num_online_cpus().
> + * In such cases remaining extra IRQs will retain their default
> + * affinity.
> + */
> + int first_unassigned = num_online_cpus();
>
> - err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
> - if (err) {
> - cpus_read_unlock();
> - goto free_irq;
> + if (nvec > first_unassigned) {
> + char buf[32];
> +
> + if (first_unassigned == nvec - 1)
> + snprintf(buf, sizeof(buf), "%d",
> + first_unassigned);
> + else
> + snprintf(buf, sizeof(buf), "%d-%d",
> + first_unassigned, nvec - 1);
> +
> + dev_dbg(&pdev->dev,
> + "MANA IRQ indices #%s will retain the default CPU affinity\n",
> + buf);
> + }
> +
> + mana_irq_setup_linear(irqs, nvec);
> }
>
> cpus_read_unlock();
> @@ -2041,7 +2091,7 @@ static int mana_gd_setup_irqs(struct pci_dev *pdev, int nvec)
> nvec -= 1;
> }
>
> - err = irq_setup(irqs, nvec, gc->numa_node, false);
> + err = mana_irq_setup_numa_aware(irqs, nvec, gc->numa_node, false);
> if (err) {
> cpus_read_unlock();
> goto free_irq;
>
> base-commit: 96e7f9122aae0ed000ee321f324b812a447906d9
> --
> 2.34.1
^ permalink raw reply
* Re: [RFC] Enabling CONFIG_NTP_PPS for NOHZ by adding ntp_error to system_time_snapshot
From: Thomas Gleixner @ 2026-06-19 13:34 UTC (permalink / raw)
To: David Woodhouse, John Stultz, Stephen Boyd, Miroslav Lichvar,
Richard Cochran, linux-kernel, netdev
Cc: Rodolfo Giometti, Alexander Gordeev
In-Reply-To: <3616fc9718614bf11915569599038a5bcb268c02.camel@infradead.org>
On Fri, Jun 19 2026 at 01:33, David Woodhouse wrote:
> @@ -1285,6 +1286,45 @@ void ktime_get_snapshot_id(clockid_t clock_id, struct system_time_snapshot *syst
>
> nsec_sys = timekeeping_cycles_to_ns(&tk->tkr_mono, now);
> nsec_raw = timekeeping_cycles_to_ns(&tk->tkr_raw, now);
> +
> + /*
> + * For the NTP-disciplined mono-based clocks, report how far
> + * @systime is from the ideal NTP time at @now, in signed ns,
> + * so a caller can land on the ideal line by adding it. Four
> + * terms, summed in ns << NTP_SCALE_SHIFT before converting:
> + *
> + * - tk->ntp_error, the deviation as of the last update;
> + * - (cycle_delta * ntp_err_frac), the fractional-mult drift
> + * accrued since then (cycle_delta is at most a tick on a
> + * tickful kernel, but many ticks' worth under NO_HZ);
> + * - (cycle_delta * ntp_err_mult), subtracting the applied +1
> + * mult dither over the same span;
> + * - the sub-ns fraction @systime dropped when the read was
> + * truncated to whole ns (low @shift bits, exact despite the
> + * multiply overflowing).
> + *
> + * RAW is undisciplined and AUX has its own discipline, so they
> + * carry no ntp_error.
AUX has ntp_error too. AUX clocks have a per-clock NTP instance, which
works exactly like the main timekeeper's one. Only CLOCK_MONOTONIC_RAW
needs to be excluded.
> + */
> + if (clock_id == CLOCK_REALTIME || clock_id == CLOCK_MONOTONIC ||
> + clock_id == CLOCK_BOOTTIME) {
> + u32 nes = tk->ntp_error_shift;
> + u64 cycle_delta = (now - tk->tkr_mono.cycle_last) &
> + tk->tkr_mono.mask;
> + s64 err = tk->ntp_error +
> + (((s64)mul_u64_u64_shr(cycle_delta,
> + tk->ntp_err_frac, 32) -
> + (s64)(cycle_delta * tk->ntp_err_mult)) << nes);
> +
> + err += (s64)((cycle_delta * tk->tkr_mono.mult +
> + tk->tkr_mono.xtime_nsec) &
> + ((1ULL << tk->tkr_mono.shift) - 1)) << nes;
> + systime_snapshot->ntp_error =
> + (err + (1LL << (NTP_SCALE_SHIFT - 1))) >>
> + NTP_SCALE_SHIFT;
This formatting makes my brain hurt. Can you please split that out into
a separate function?
/*
 * Big fat comment....
 */
static void snapshot_ntp_error(clockid_t clock_id, struct system_time_snapshot *snap,
			       struct timekeeper *tk, u64 now)
{
	if (clock_id == CLOCK_MONOTONIC_RAW) {
		snap->ntp_error = 0;
		return;
	}

	u64 cycle_delta = (now - tk->tkr_mono.cycle_last) & tk->tkr_mono.mask;
	u32 nes = tk->ntp_error_shift;
	s64 tmp, err = tk->ntp_error;

	err += ((s64)mul_u64_u64_shr(cycle_delta, tk->ntp_err_frac, 32) -
		(s64)(cycle_delta * tk->ntp_err_mult)) << nes;

	tmp = (s64)(cycle_delta * tk->tkr_mono.mult + tk->tkr_mono.xtime_nsec);
	tmp &= (1ULL << tk->tkr_mono.shift) - 1;
	err += tmp << nes;

	snap->ntp_error = (err + (1LL << (NTP_SCALE_SHIFT - 1))) >> NTP_SCALE_SHIFT;
}
or something readable like that.
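The last line of that helper converts the accumulated error from
ns << NTP_SCALE_SHIFT to whole nanoseconds with round-to-nearest. The
rounding step alone can be tried in user space. This is only a minimal
sketch: it assumes NTP_SCALE_SHIFT is 32, as in include/linux/timex.h, and
that the compiler implements >> on negative signed values as an arithmetic
shift (GCC and Clang do). scaled_to_ns_round() is a made-up name, not a
kernel function.

```c
#include <stdint.h>

/* Assumption: mirrors the kernel's NTP_SCALE_SHIFT (32). */
#define NTP_SCALE_SHIFT 32

/*
 * Hypothetical helper: convert a signed error held in
 * ns << NTP_SCALE_SHIFT to whole ns. Adding half a nanosecond before
 * the shift turns truncation into round-to-nearest. Relies on an
 * arithmetic right shift for negative values.
 */
static int64_t scaled_to_ns_round(int64_t err)
{
	return (err + (1LL << (NTP_SCALE_SHIFT - 1))) >> NTP_SCALE_SHIFT;
}
```

Under those assumptions an input of 3LL << 31 (1.5 ns) comes out as 2,
while (1LL << 31) - 1 (just under 0.5 ns) comes out as 0.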
^ permalink raw reply
* AW: AW: AW: [PATCH net] net: usb: lan78xx: restore VLAN filter table after device reset
From: Sven Schuchmann @ 2026-06-19 13:31 UTC (permalink / raw)
To: Nicolai Buchwitz
Cc: Thangaraj Samynathan, Rengarajan Sundararajan,
UNGLinuxDriver@microchip.com, Woojung.Huh@microchip.com,
Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, netdev@vger.kernel.org, linux-usb@vger.kernel.org,
linux-kernel@vger.kernel.org
In-Reply-To: <4abfc9b1e8860da93c03639863bd0232@tipi-net.de>
Hello Nicolai,
looks good from my point of view
(Calling the lan78xx_write_vlan_table() from
lan78xx_mac_link_up() and from lan78xx_reset()).
But I investigated a little more, and it seems the hash table
(which is right behind the VLAN table in the controller's memory)
also gets cleared. I wrote some random data into this table and
saw that it also gets cleared. I think this needs to be fixed too.
The LAN7801 datasheet says:
"After a reset event, the RFE will automatically initialize the contents of the VHF to 0h."
where VHF also refers to the hash table.
But I still do not understand which reset happens when I just unplug the network cable...
Regards,
Sven
On 19.6.2026 11:53, Nicolai Buchwitz wrote:
> Hi Sven
>
> On 19.6.2026 11:18, Sven Schuchmann wrote:
> > Hello Nicolai,
> >
> > my first observation is that calling lan78xx_write_vlan_table()
> > at the end of lan78xx_start_rx_path() fixes the problem. I was able
> > to do over 200 connect/disconnects without any problem.
>
> Thanks, that's the right direction. For the final patch I'd move it
> to lan78xx_mac_link_up(), which is IMHO a bit "cleaner":
>
> [...]
> static void lan78xx_rx_urb_submit_all(struct lan78xx_net *dev);
> +static int lan78xx_write_vlan_table(struct lan78xx_net *dev);
> [...]
> static void lan78xx_mac_link_up(struct phylink_config *config,
> [...]
> if (ret < 0)
> goto link_up_fail;
>
> + ret = lan78xx_write_vlan_table(dev);
> + if (ret < 0)
> + goto link_up_fail;
> +
> netif_start_queue(net);
> [...]
>
> Could you give this version a quick test and confirm? Then I'll add
> your Tested-by.
>
> > [...]
>
> Thanks
> Nicolai
^ permalink raw reply
* [Bug ?] Packet with End.X segment not correctly forwarded to nexthop
From: Anthony Doeraene @ 2026-06-19 13:25 UTC (permalink / raw)
To: andrea.mayer; +Cc: netdev
Hello,
I am currently experimenting with SRv6 and VRFs, and I found some weird
interactions between the two.
For context, I need routers to have multiple VRFs, with each VRF having
different routes to reach destinations. Our routers not only send packets
to a specific nexthop, but also specify the VRF that the nexthop should
use to forward these packets.
To achieve this goal, routes in these VRFs push two segments: a local
End.X segment and an End.DT46 segment. Due to some implementation
constraints, I want to have a single End.DT46 segment shared by all
routers in the network.
Once packets are encapsulated by the VRF, the packet is sent to the main
table to do a lookup for the nexthop. As the End.DT46 segment is shared
between routers and cannot be used to learn the nexthop, I decided to use
an End.X segment to specify it.
However, what I observe in this scenario is that the End.X segment
processing function is never called, resulting in the packet not being
sent to the correct nexthop.
I am wondering if this is expected behavior (i.e. a node should never
push a local segment), or if it is a real bug?
I am not well versed in the implementation details of SRv6 in the kernel,
but I suspect that this "bug" comes from the fact that seg6_output_core
calls dst_output, which does not allow an SRv6 segment function to be
called.
A minimal example is given below. It creates two namespaces (r1, r2) and
reproduces this behavior.
(tested on a kernel compiled on virtme-ng from commit
e771677c937da5808f7b6c1f0e4a97ec1a84f8a8)
Thank you in advance for the help and thanks for the SRv6 support on Linux,
Doeraene Anthony
File setup.sh
```
# Topology under test:
#
# fc00::1:1 fc00::1:2
# fc00::1 [ r1 ] ------------------------- [ r2 ] fc00::2
#
# Description:
# ============
#
# Each node has an additional VRF, which it can use to provide different
# routing decisions based on arbitrary rules (e.g. QoS aware forwarding)
# Routes in this VRF will encapsulate the packets and push segments to
# specify the nexthop (End.X) and the VRF the nexthop should use
# (End.DT46). The same End.DT46 segment is shared by all nodes
#
# Problem:
# ========
#
# Once segments are pushed, the End.X segment is never applied. As a
# result, the segment is not popped from the SL, and the packet is sent
# on an incorrect interface.
#
# Forwarding steps:
# =================
#
# - R1 sends the packet to fc00::2 in its VRF `myvrf`
# - This VRF encapsulates the packet and adds two segments:
# 1) End.X segment to force the transmission of the packet on r1-r2
# 2) End.DT46 segment allowing r2 to know which VRF it should use
# to forward the packet.
# - After encapsulation, r1 does a lookup in its main table for the
# End.X segment, but does not pop the segment. The packet is thus
# sent incorrectly on the dummy interface
#
# Running the example (with sudo):
# ====================
#
# 1) Start the topology
#
# bash setup.sh
#
# 2) Start pinging (leave in the background)
#
# ip netns exec r1 ping -I fc00::1 fc00::2
#
# 3) Check with tcpdump. We should see packets on r1-r2, and should not
# see any packet on dum0
#
# ip netns exec r1 tcpdump -i dum0 -n
# ip netns exec r1 tcpdump -i r1-r2 -n
if [ -z "$(lsmod | grep vrf)" ]; then
echo "Run modprobe vrf"
exit 1
fi
nodes="r1 r2"
vrftable=10
localsid=90
# Create nodes
for node in $nodes; do
ip netns add $node
ip -n $node link set lo up
done
# Create loopback addresses
ip -n r1 addr add fc00::1 dev lo
ip -n r2 addr add fc00::2 dev lo
# Create links
ip link add r1-r2 type veth peer name r2-r1
ip link set r1-r2 netns r1
ip link set r2-r1 netns r2
ip -n r1 link set r1-r2 up
ip -n r2 link set r2-r1 up
# Configure IPs
ip -n r1 addr add dev r1-r2 fc00::1:1/112
ip -n r2 addr add dev r2-r1 fc00::1:2/112
# Add default routes
ip -n r1 -6 route add default via fc00::1:2
ip -n r2 -6 route add default via fc00::1:1
# Configure sysctls
for node in $nodes; do
ip netns exec $node sysctl -w net.ipv6.conf.all.forwarding=1
ip netns exec $node sysctl -w net.ipv6.conf.all.seg6_enabled=1
ip netns exec $node sysctl -w net.vrf.strict_mode=1
for itf in $(ip netns exec $node ls /sys/class/net); do
ip netns exec $node sysctl net.ipv6.conf.$itf.seg6_enabled=1
done
done
for node in $nodes; do
# Create a dummy interface for End.X segments
ip -n $node link add dum0 type dummy
ip -n $node link set dum0 up
# Create VRF
ip -n $node link add myvrf type vrf table $vrftable
ip -n $node link set dev myvrf up
done
# Create SID table route
ip -n r1 -6 rule add to fc00:1::/32 lookup $localsid prio 998
ip -n r1 -6 rule add to fc00:ffff::/32 lookup $localsid prio 999
ip -n r2 -6 rule add to fc00:2::/32 lookup $localsid prio 998
ip -n r2 -6 rule add to fc00:ffff::/32 lookup $localsid prio 999
# Create the DT46 segment associated with the VRF
ip -n r1 route add table $localsid fc00:ffff:: \
encap seg6local \
action End.DT46 vrftable $vrftable dev myvrf
ip -n r2 route add table $localsid fc00:ffff:: \
encap seg6local \
action End.DT46 vrftable $vrftable dev myvrf
# Create the End.X segment
ip -n r1 route add table $localsid fc00:1:2:: \
encap seg6local action End.X nh6 fc00::1:2 oif r1-r2 dev dum0
ip -n r2 route add table $localsid fc00:2:1:: \
encap seg6local action End.X nh6 fc00::1:1 oif r2-r1 dev dum0
# Setup routes (main table)
ip -n r1 route add fc00::2 dev myvrf
# Setup routes (VRF). R1 pushes an End.X segment followed by an End.DT46 segment
ip -n r1 route add fc00::2 encap seg6 \
mode encap \
segs fc00:1:2::,fc00:ffff:: \
dev r1-r2 via fc00::1:2 \
table $vrftable
```
^ permalink raw reply
* Re: [PATCH v28 5/5] sfc: support pio mapping based on cxl
From: Edward Cree @ 2026-06-19 13:23 UTC (permalink / raw)
To: alejandro.lucero-palau, linux-cxl, netdev, djbw, edward.cree,
davem, kuba, pabeni, edumazet, dave.jiang
Cc: Alejandro Lucero
In-Reply-To: <20260618181806.118745-6-alejandro.lucero-palau@amd.com>
On 18/06/2026 19:18, alejandro.lucero-palau@amd.com wrote:
> From: Alejandro Lucero <alucerop@amd.com>
>
> A PIO buffer is a region of device memory to which the driver can write a
> packet for TX, with the device handling the transmit doorbell without
> requiring a DMA to fetch the packet data, which helps reduce latency
> in certain exchanges. With CXL mem protocol this latency can be lowered
> further.
>
> With a device supporting CXL and successfully initialised, use the cxl
> region to map the memory range and use this mapping for PIO buffers.
>
> Signed-off-by: Alejandro Lucero <alucerop@amd.com>
Acked-by: Edward Cree <ecree.xilinx@gmail.com>
One nit:
> diff --git a/drivers/net/ethernet/sfc/efx.h b/drivers/net/ethernet/sfc/efx.h
> index 45e191686625..057d30090894 100644
> --- a/drivers/net/ethernet/sfc/efx.h
> +++ b/drivers/net/ethernet/sfc/efx.h
> @@ -236,5 +236,4 @@ static inline bool efx_rwsem_assert_write_locked(struct rw_semaphore *sem)
>
> int efx_xdp_tx_buffers(struct efx_nic *efx, int n, struct xdp_frame **xdpfs,
> bool flush);
> -
> #endif /* EFX_EFX_H */
This looks like a stray changebar, clean it up if respinning.
-ed
^ permalink raw reply
* Re: [PATCH v28 4/5] sfc: obtain and map cxl range using devm_cxl_probe_mem
From: Edward Cree @ 2026-06-19 13:20 UTC (permalink / raw)
To: alejandro.lucero-palau, linux-cxl, netdev, djbw, edward.cree,
davem, kuba, pabeni, edumazet, dave.jiang
Cc: Alejandro Lucero
In-Reply-To: <20260618181806.118745-5-alejandro.lucero-palau@amd.com>
On 18/06/2026 19:18, alejandro.lucero-palau@amd.com wrote:
> From: Alejandro Lucero <alucerop@amd.com>
>
> Use the core API to safely obtain the CXL range linked to an HDM committed
> by the BIOS. Map that range for use as the ctpio buffer.
>
> A potential user space action through sysfs unbinding or core cxl
> modules remove will trigger sfc driver device detachment, with that case
> not racing with this mapping as this is done during driver probe and
> therefore protected with device lock against those user space actions.
>
> Signed-off-by: Alejandro Lucero <alucerop@amd.com>
Acked-by: Edward Cree <ecree.xilinx@gmail.com>
^ permalink raw reply
* Re: [PATCH v28 3/5] cxl/sfc: Initialize dpa without a mailbox
From: Edward Cree @ 2026-06-19 13:15 UTC (permalink / raw)
To: alejandro.lucero-palau, linux-cxl, netdev, djbw, edward.cree,
davem, kuba, pabeni, edumazet, dave.jiang
Cc: Alejandro Lucero, Dan Williams, Ben Cheatham, Jonathan Cameron
In-Reply-To: <20260618181806.118745-4-alejandro.lucero-palau@amd.com>
On 18/06/2026 19:18, alejandro.lucero-palau@amd.com wrote:
> From: Alejandro Lucero <alucerop@amd.com>
>
> Type3 relies on mailbox CXL_MBOX_OP_IDENTIFY command for initializing
> memdev state params which end up being used for DPA initialization.
>
> Allow a Type2 driver to initialize DPA simply by giving the size of its
> volatile hardware partition.
>
> Move related functions to memdev.
>
> Add sfc driver as the client.
>
> Signed-off-by: Alejandro Lucero <alucerop@amd.com>
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Ben Cheatham <benjamin.cheatham@amd.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Acked-by: Edward Cree <ecree.xilinx@gmail.com> # for sfc
^ permalink raw reply
* Re: [PATCH v28 2/5] cxl/sfc: Map cxl regs
From: Edward Cree @ 2026-06-19 13:14 UTC (permalink / raw)
To: alejandro.lucero-palau, linux-cxl, netdev, djbw, edward.cree,
davem, kuba, pabeni, edumazet, dave.jiang
Cc: Alejandro Lucero, Dan Williams, Jonathan Cameron, Ben Cheatham
In-Reply-To: <20260618181806.118745-3-alejandro.lucero-palau@amd.com>
On 18/06/2026 19:18, alejandro.lucero-palau@amd.com wrote:
> From: Alejandro Lucero <alucerop@amd.com>
>
> Export cxl core functions for a Type2 driver being able to discover and
> map the device registers.
>
> Use it in sfc driver cxl initialization.
>
> Signed-off-by: Alejandro Lucero <alucerop@amd.com>
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Ben Cheatham <benjamin.cheatham@amd.com>
Acked-by: Edward Cree <ecree.xilinx@gmail.com> # for sfc
^ permalink raw reply
* [PATCH v3 net] net: airoha: Fix TX scheduler queue mask loop upper bound
From: Wayen Yan @ 2026-06-19 13:12 UTC (permalink / raw)
To: netdev
Cc: lorenzo, horms, pabeni, kuba, edumazet, andrew+netdev,
angelogioacchino.delregno, matthias.bgg, linux-arm-kernel,
linux-mediatek
In airoha_qdma_set_chan_tx_sched(), the loop clearing queue mask was
using AIROHA_NUM_TX_RING (32) instead of AIROHA_NUM_QOS_QUEUES (8).
Each channel has 8 queues, and TXQ_DISABLE_CHAN_QUEUE_MASK(channel, i)
computes BIT(i + (channel * 8)). With i ranging 0..31, this causes:
- channel 0: clears bits 0..31 (all 4 channels) instead of 0..7
- channel 1: clears bits 8..31 (channels 1-3) instead of 8..15
- channel 2: clears bits 16..31 (channels 2-3) instead of 16..23
- channel 3: clears bits 24..31 (channel 3 only) - correct by accident
Although BIT(32) and above produce 64-bit values on arm64 that are
truncated to 0 in the u32 mask parameter, the loop still clears bits
beyond the channel's own eight queues.
Even though this is functionally harmless (the register resets to 0
and is only ever cleared, never set — so clearing extra bits is a
no-op), the loop bound is semantically wrong and should be fixed for
correctness and clarity.
Fix by using AIROHA_NUM_QOS_QUEUES (8) as the loop upper bound.
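The per-channel bit ranges listed above can be reproduced in user space.
This is a rough sketch, not driver code: it assumes the mask is
BIT(i + channel * 8) as described in this message, computed on a 64-bit
type and then truncated to u32. cleared_bits() and the two constants are
illustrative names.

```c
#include <stdint.h>

/* Loop bounds as quoted in the commit message. */
#define NUM_TX_RING     32
#define NUM_QOS_QUEUES  8

/*
 * Hypothetical helper: OR together every mask the loop would pass for
 * one channel, using BIT(i + channel * 8) on a 64-bit type and then
 * truncating to the 32-bit mask parameter.
 */
static uint32_t cleared_bits(unsigned int channel, unsigned int bound)
{
	uint32_t cleared = 0;
	unsigned int i;

	for (i = 0; i < bound; i++)
		cleared |= (uint32_t)(1ULL << (i + channel * 8));

	return cleared;
}
```

In this model, cleared_bits(1, NUM_TX_RING) yields 0xffffff00, which covers
channels 1-3, while cleared_bits(1, NUM_QOS_QUEUES) yields 0x0000ff00,
which covers only channel 1.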
Fixes: ef1ca9271313 ("net: airoha: Add sched HTB offload support")
Acked-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Wayen Yan <win847@gmail.com>
---
Changes in v3:
- Rebase on top of current net tree (Lorenzo pointed out v2 was
not based on latest net HEAD).
- No code changes from v2.
drivers/net/ethernet/airoha/airoha_eth.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c
index 64dde6464f..47fb32517a 100644
--- a/drivers/net/ethernet/airoha/airoha_eth.c
+++ b/drivers/net/ethernet/airoha/airoha_eth.c
@@ -2395,7 +2395,7 @@ static int airoha_qdma_set_chan_tx_sched(struct net_device *netdev,
struct airoha_gdm_dev *dev = netdev_priv(netdev);
int i;
- for (i = 0; i < AIROHA_NUM_TX_RING; i++)
+ for (i = 0; i < AIROHA_NUM_QOS_QUEUES; i++)
airoha_qdma_clear(dev->qdma, REG_QUEUE_CLOSE_CFG(channel),
TXQ_DISABLE_CHAN_QUEUE_MASK(channel, i));
--
2.51.0
^ permalink raw reply related
* Re: [PATCH v28 1/5] sfc: add cxl support
From: Edward Cree @ 2026-06-19 13:12 UTC (permalink / raw)
To: alejandro.lucero-palau, linux-cxl, netdev, djbw, edward.cree,
davem, kuba, pabeni, edumazet, dave.jiang
Cc: Alejandro Lucero, Jonathan Cameron, Alison Schofield,
Dan Williams
In-Reply-To: <20260618181806.118745-2-alejandro.lucero-palau@amd.com>
On 18/06/2026 19:18, alejandro.lucero-palau@amd.com wrote:
> From: Alejandro Lucero <alucerop@amd.com>
>
> Add CXL initialization based on new CXL API for accel drivers and make
> it dependent on kernel CXL configuration.
>
> Signed-off-by: Alejandro Lucero <alucerop@amd.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Acked-by: Edward Cree <ecree.xilinx@gmail.com>
> Reviewed-by: Alison Schofield <alison.schofield@intel.com>
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
...
> diff --git a/drivers/net/ethernet/sfc/net_driver.h b/drivers/net/ethernet/sfc/net_driver.h
> index b98c259f672d..de3fc9537662 100644
> --- a/drivers/net/ethernet/sfc/net_driver.h
> +++ b/drivers/net/ethernet/sfc/net_driver.h
> @@ -1197,14 +1197,23 @@ struct efx_nic {
> atomic_t n_rx_noskb_drops;
> };
>
> +#ifdef CONFIG_SFC_CXL
> +struct efx_cxl;
> +#endif
> +
> /**
> * struct efx_probe_data - State after hardware probe
> * @pci_dev: The PCI device
> * @efx: Efx NIC details
> + * @cxl: details of related cxl objects
> + * @cxl_pio_initialised: cxl initialization outcome.
> */
> struct efx_probe_data {
> struct pci_dev *pci_dev;
> struct efx_nic efx;
> +#ifdef CONFIG_SFC_CXL
> + struct efx_cxl *cxl;
> +#endif
> };
The documented cxl_pio_initialised member does not appear to exist.
Will this not cause a kerneldoc build error?
^ permalink raw reply
* Re: [PATCH v2 2/2] selftests/tc-testing: Add DualPI2 GSO backlog accounting test
From: Victor Nogueira @ 2026-06-19 13:10 UTC (permalink / raw)
To: Xingquan Liu; +Cc: Jamal Hadi Salim, netdev, Jiri Pirko, Chia-Yu Chang
In-Reply-To: <20260619073211.637928-2-b1n@b1n.io>
On Fri, Jun 19, 2026 at 4:32 AM Xingquan Liu <b1n@b1n.io> wrote:
>
> Add a regression test for DualPI2 GSO backlog accounting when it is
> used as a child qdisc of QFQ.
>
> The test sends one UDP GSO datagram through a QFQ class with DualPI2 as
> the leaf qdisc. DualPI2 splits the skb into two segments. After the
> traffic drains, both QFQ and DualPI2 must report zero backlog and zero
> qlen.
>
> On kernels with the broken accounting, QFQ can keep a stale non-zero
> qlen after all real packets have been dequeued.
>
> Signed-off-by: Xingquan Liu <b1n@b1n.io>
> ---
> .../tc-testing/tc-tests/qdiscs/dualpi2.json | 44 +++++++++++++++++++
> 1 file changed, 44 insertions(+)
>
> diff --git a/tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json b/tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json
> index cd1f2ee8f354..ffd6fd5ba8f7 100644
> --- a/tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json
> +++ b/tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json
> + {
> + "id": "891f",
> [...]
> + "cmdUnderTest": "python3 -c 'import socket,struct; SOL_UDP=getattr(socket,\"SOL_UDP\",socket.IPPROTO_UDP); UDP_SEGMENT=getattr(socket,\"UDP_SEGMENT\",103); s=socket.socket(socket.AF_INET,socket.SOCK_DGRAM); s.bind((\"10.10.10.10\",0)); p=b\"X\"*2400; n=s.sendmsg([p],[(SOL_UDP,UDP_SEGMENT,struct.pack(\"=H\",1200))],0,(\"10.10.10.1\",9000)); raise SystemExit(n != len(p))'",
Can you make this a separate Python script?
Something similar to what the flower tests did [1] with tdc_batch.py [2].
[1] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/tree/tools/testing/selftests/tc-testing/tc-tests/filters/flower.json#n205
[2] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/tree/tools/testing/selftests/tc-testing/tdc_batch.py
^ permalink raw reply
* Re: [PATCH net 3/6] ipv6: fix error handling in forwarding sysctl
From: Nicolas Dichtel @ 2026-06-19 13:04 UTC (permalink / raw)
To: Fernando Fernandez Mancera, netdev
Cc: shemminger, dforster, gospo, ddutt, brian.haley, horms, pabeni,
kuba, edumazet, davem, idosch, dsahern
In-Reply-To: <91d77512-f741-41d1-a799-5409690da5d7@suse.de>
On 19/06/2026 at 12:28, Fernando Fernandez Mancera wrote:
> On 6/19/26 11:34 AM, Nicolas Dichtel wrote:
>> On 18/06/2026 at 18:22, Fernando Fernandez Mancera wrote:
>>> When writing to the forwarding sysctl, if proc_dointvec() fails to parse
>>> the input, it returns a negative error code. The current implementation
>>> is overwriting that error for write operations.
>>>
>>> This results in a silent failure, it returns a successful write although
>>> the configuration was not modified at all. When modifying the "all"
>>> variant it can also modify the configuration of existing interfaces to
>>> the wrong value.
>>>
>>> Fix this by checking the return value of proc_dointvec() and returning
>>> early on failure.
>>>
>>> Fixes: b325fddb7f86 ("ipv6: Fix sysctl unregistration deadlock")
>> The bug existed before the git era.
>> Maybe
>> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
>>
>>
>
>
> Hm, not really, AFAICS b325fddb7f86 is the first commit overwriting the return
> value from proc_dointvec(). See:
>
> @@ -3983,7 +3986,7 @@ int addrconf_sysctl_forward(ctl_table *ctl, int write,
> struct file * filp,
> ret = proc_dointvec(ctl, write, filp, buffer, lenp, ppos);
>
> if (write)
> - addrconf_fixup_forwarding(ctl, valp, val);
> + ret = addrconf_fixup_forwarding(ctl, valp, val);
> return ret;
> }
Hmm, you're right.
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
^ permalink raw reply
* Re: [PATCH] net: fman: fix clock and device node leak in probe error paths
From: Simon Horman @ 2026-06-19 12:13 UTC (permalink / raw)
To: zhaojinming
Cc: Simon Horman, madalin.bucur, sean.anderson, andrew+netdev, davem,
edumazet, kuba, pabeni, netdev, linux-kernel
In-Reply-To: <20260618075435.1262533-1-zhaojinming@uniontech.com>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
Simon says: I think the first issue below does warrant being investigated
within the context of this patch.
The second one seems related to the intent of this patch,
but I don't feel strongly about it needing to be fixed
as part of this work.
And the last one seems yet more weakly related.
---
net: fman: fix clock and device node leak in probe error paths
This patch aims to fix resource leaks in the error paths of
read_dts_node(). It introduces a clk_put label to ensure the clock
reference is properly released on probe failures.
> diff --git a/drivers/net/ethernet/freescale/fman/fman.c b/drivers/net/ethernet/freescale/fman/fman.c
> index 013273a2de32f..734cbe8efd7e3 100644
> --- a/drivers/net/ethernet/freescale/fman/fman.c
> +++ b/drivers/net/ethernet/freescale/fman/fman.c
[ ... ]
> @@ -2757,7 +2757,7 @@ static struct fman *read_dts_node(struct platform_device *of_dev)
> err = -EINVAL;
> dev_err(&of_dev->dev, "%s: could not find MURAM node\n",
> __func__);
> - goto fman_free;
> + goto clk_put;
> }
[Severity: High]
Will this cause a double-free of the device node on error paths?
In read_dts_node(), the call to of_find_matching_node() drops the
reference on its "from" argument, which is fm_node:
muram_node = of_find_matching_node(fm_node, fman_muram_match);
By redirecting the error paths after this point to the clk_put label,
the code falls through to fman_node_put and calls of_node_put(fm_node).
Since of_find_matching_node() already consumed the reference, does this
second put lead to a reference count underflow and use-after-free?
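The reference flow in question can be modelled with a toy counter. This is
only a user-space sketch: it assumes, as the review states, that the search
drops the caller's reference on its "from" argument. All toy_* names are
made up and do not correspond to kernel functions.

```c
#include <stddef.h>

/* Toy stand-in for struct device_node; only the refcount is modelled. */
struct toy_node {
	int refcount;
};

static struct toy_node *toy_node_get(struct toy_node *n)
{
	if (n)
		n->refcount++;
	return n;
}

static void toy_node_put(struct toy_node *n)
{
	if (n)
		n->refcount--;
}

/*
 * Models the behaviour the review describes: the search drops the
 * caller's reference on @from and returns @match with a new reference.
 */
static struct toy_node *toy_find_matching(struct toy_node *from,
					  struct toy_node *match)
{
	toy_node_get(match);
	toy_node_put(from);
	return match;
}

/* Returns the refcount of @fm after a probe-like sequence. */
static int toy_probe_error_path(struct toy_node *fm, struct toy_node *muram,
				int put_fm_again)
{
	toy_node_get(fm);		/* reference taken early in probe */
	toy_find_matching(fm, muram);	/* consumes that reference */
	if (put_fm_again)
		toy_node_put(fm);	/* the redirected error path */
	return fm->refcount;
}
```

Starting from a node whose count is 1 (held by someone else), the sequence
without the extra put leaves the count at 1. With the extra put it drops to
0, so the function has released a reference it no longer owns.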
[ ... ]
> @@ -2816,9 +2816,10 @@ static struct fman *read_dts_node(struct platform_device *of_dev)
>
> return fman;
[Severity: High]
This isn't a bug introduced by this patch, but does this function leak the
clock reference when returning successfully?
While the patch correctly adds clk_put(clk) to the error paths, the
success path shown above returns without releasing the reference acquired
via of_clk_get(). Does the driver need to hold this reference throughout
its lifetime, or should it be released here to prevent a memory leak
during repeated driver binds and unbinds?
> +clk_put:
> + clk_put(clk);
> fman_node_put:
> of_node_put(fm_node);
> -fman_free:
> kfree(fman);
> return ERR_PTR(err);
> }
[Severity: High]
This is a pre-existing issue, but is it safe to manually free the fman
structure here when devm-managed shared interrupts are still registered?
The driver allocates fman using kzalloc_obj() and registers shared
interrupt handlers using devm_request_irq(), passing fman as the dev_id.
If initialization fails later in the function, the error path calls
kfree(fman). However, the devm-managed interrupts are not unregistered
until the probe function returns and devres cleanup runs.
Because the interrupt is IRQF_SHARED, could another device trigger an
interrupt during this window? If so, would the still-registered fman_irq
handler attempt to access the already-freed fman structure?
^ permalink raw reply
* [PATCH net 16/16] netfilter: nft_meta_bridge: fix NFT_META_BRI_IIFPVID stack leak
From: Pablo Neira Ayuso @ 2026-06-19 11:54 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms
In-Reply-To: <20260619115452.93949-1-pablo@netfilter.org>
From: Florian Westphal <fw@strlen.de>
This needs to test for nonzero retval.
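The pattern can be shown with a small user-space model. This is a sketch
under assumptions, not the netfilter code: toy_get_pvid() stands in for a
lookup that returns a negative errno and leaves its output untouched on
failure, and toy_eval() stands in for the caller. Both names are
hypothetical.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical lookup: on failure it returns a negative errno and
 * leaves *pvid untouched, so the caller's variable stays uninitialised.
 */
static int toy_get_pvid(bool port_has_vlan, uint16_t *pvid)
{
	if (!port_has_vlan)
		return -EINVAL;
	*pvid = 42;
	return 0;
}

/* Stores to *dest only when the lookup succeeded. */
static bool toy_eval(bool port_has_vlan, uint16_t *dest)
{
	uint16_t pvid;

	if (toy_get_pvid(port_has_vlan, &pvid))
		return false;	/* corresponds to "goto err" */

	*dest = pvid;
	return true;
}
```

Without the check, the failing case would copy whatever happened to be in
the local variable into the destination.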
Fixes: c54c7c685494 ("netfilter: nft_meta_bridge: add NFT_META_BRI_IIFPVID support")
Closes: https://sashiko.dev/#/patchset/20260618061631.21919-1-fw%40strlen.de
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
net/bridge/netfilter/nft_meta_bridge.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/net/bridge/netfilter/nft_meta_bridge.c b/net/bridge/netfilter/nft_meta_bridge.c
index 3d95f68e0906..e4c9aa1f64e2 100644
--- a/net/bridge/netfilter/nft_meta_bridge.c
+++ b/net/bridge/netfilter/nft_meta_bridge.c
@@ -44,7 +44,9 @@ static void nft_meta_bridge_get_eval(const struct nft_expr *expr,
if (!br_dev || !br_vlan_enabled(br_dev))
goto err;
- br_vlan_get_pvid_rcu(in, &p_pvid);
+ if (br_vlan_get_pvid_rcu(in, &p_pvid))
+ goto err;
+
nft_reg_store16(dest, p_pvid);
return;
}
--
2.47.3
^ permalink raw reply related
* [PATCH net 15/16] netfilter: nf_conntrack_expect: store master_tuple in expectation
From: Pablo Neira Ayuso @ 2026-06-19 11:54 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms
In-Reply-To: <20260619115452.93949-1-pablo@netfilter.org>
Store the master conntrack tuple in the expectation, since exp->master
might refer to a different conntrack when accessed from an RCU read-side
critical section due to typesafe RCU rules.
Fixes: 02a3231b6d82 ("netfilter: nf_conntrack_expect: store netns and zone in expectation")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
include/net/netfilter/nf_conntrack_expect.h | 1 +
net/netfilter/nf_conntrack_broadcast.c | 1 +
net/netfilter/nf_conntrack_expect.c | 2 ++
net/netfilter/nf_conntrack_netlink.c | 9 +++------
4 files changed, 7 insertions(+), 6 deletions(-)
diff --git a/include/net/netfilter/nf_conntrack_expect.h b/include/net/netfilter/nf_conntrack_expect.h
index be4a120d549e..c024345c9bd8 100644
--- a/include/net/netfilter/nf_conntrack_expect.h
+++ b/include/net/netfilter/nf_conntrack_expect.h
@@ -26,6 +26,7 @@ struct nf_conntrack_expect {
possible_net_t net;
/* We expect this tuple, with the following mask */
+ struct nf_conntrack_tuple master_tuple;
struct nf_conntrack_tuple tuple;
struct nf_conntrack_tuple_mask mask;
diff --git a/net/netfilter/nf_conntrack_broadcast.c b/net/netfilter/nf_conntrack_broadcast.c
index 400119b6320e..bf78828c7549 100644
--- a/net/netfilter/nf_conntrack_broadcast.c
+++ b/net/netfilter/nf_conntrack_broadcast.c
@@ -62,6 +62,7 @@ int nf_conntrack_broadcast_help(struct sk_buff *skb,
if (exp == NULL)
goto out;
+ exp->master_tuple = ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple;
exp->tuple = ct->tuplehash[IP_CT_DIR_REPLY].tuple;
helper = rcu_dereference(help->helper);
diff --git a/net/netfilter/nf_conntrack_expect.c b/net/netfilter/nf_conntrack_expect.c
index 49e18eda037e..9454913e1b33 100644
--- a/net/netfilter/nf_conntrack_expect.c
+++ b/net/netfilter/nf_conntrack_expect.c
@@ -355,6 +355,8 @@ void nf_ct_expect_init(struct nf_conntrack_expect *exp, unsigned int class,
exp->tuple.src.l3num = family;
exp->tuple.dst.protonum = proto;
+ exp->master_tuple = ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple;
+
if (saddr) {
memcpy(&exp->tuple.src.u3, saddr, len);
if (sizeof(exp->tuple.src.u3) > len)
diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index 4e78d2482989..22efcb8a29c1 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -3015,7 +3015,6 @@ ctnetlink_exp_dump_expect(struct sk_buff *skb,
const struct nf_conntrack_expect *exp)
{
__s32 timeout = (__s32)(READ_ONCE(exp->timeout) - nfct_time_stamp) / HZ;
- struct nf_conn *master = exp->master;
struct nf_conntrack_helper *helper;
#if IS_ENABLED(CONFIG_NF_NAT)
struct nlattr *nest_parms;
@@ -3030,9 +3029,7 @@ ctnetlink_exp_dump_expect(struct sk_buff *skb,
goto nla_put_failure;
if (ctnetlink_exp_dump_mask(skb, &exp->tuple, &exp->mask) < 0)
goto nla_put_failure;
- if (ctnetlink_exp_dump_tuple(skb,
- &master->tuplehash[IP_CT_DIR_ORIGINAL].tuple,
- CTA_EXPECT_MASTER) < 0)
+ if (ctnetlink_exp_dump_tuple(skb, &exp->master_tuple, CTA_EXPECT_MASTER) < 0)
goto nla_put_failure;
#if IS_ENABLED(CONFIG_NF_NAT)
@@ -3045,9 +3042,9 @@ ctnetlink_exp_dump_expect(struct sk_buff *skb,
if (nla_put_be32(skb, CTA_EXPECT_NAT_DIR, htonl(exp->dir)))
goto nla_put_failure;
- nat_tuple.src.l3num = nf_ct_l3num(master);
+ nat_tuple.src.l3num = exp->master_tuple.src.l3num;
nat_tuple.src.u3 = exp->saved_addr;
- nat_tuple.dst.protonum = nf_ct_protonum(master);
+ nat_tuple.dst.protonum = exp->master_tuple.dst.protonum;
nat_tuple.src.u = exp->saved_proto;
if (ctnetlink_exp_dump_tuple(skb, &nat_tuple,
--
2.47.3
^ permalink raw reply related
* [PATCH net 14/16] netfilter: nf_conntrack_expect: use conntrack GC to reap expectations
From: Pablo Neira Ayuso @ 2026-06-19 11:54 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms
In-Reply-To: <20260619115452.93949-1-pablo@netfilter.org>
Replace the timer API with a GC worker approach for expectations, as has
already been done in many other subsystems.
Use the existing conntrack GC worker to iterate over the local list of
expectations in the master conntrack to reap expired expectations.
Check IPS_HELPER_BIT to run GC for expectations, and set it for nft_ct
expectations, which never set it. Hold the expectation spinlock while
iterating over the master conntrack expectation list to synchronize with
nf_ct_remove_expectations(). This also performs runtime packet path
garbage collection through the expectation insertion and lookup
functions while walking over one of the chains of the global expectation
hashtables. Unconfirmed conntrack entries are skipped since ct->ext can
be reallocated, and dying entries are skipped since those will be gone
soon.
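The packet path garbage collection described above amounts to reaping
expired entries while walking a chain during lookup. A minimal userspace
sketch, assuming a singly linked list instead of the kernel hlist and no
locking (struct node and find_and_reap() are hypothetical names):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct node {
	struct node *next;
	uint32_t timeout;	/* tick count at which the entry expires */
	int key;
	int linked;		/* 1 while the entry is on the list */
};

/* Look up key; unlink every expired entry met before the match. */
static struct node *find_and_reap(struct node **head, int key, uint32_t now)
{
	struct node **pp = head;

	while (*pp) {
		struct node *n = *pp;

		if ((int32_t)(n->timeout - now) <= 0) {
			*pp = n->next;	/* unlink; pp keeps pointing at the same slot */
			n->next = NULL;
			n->linked = 0;
			continue;
		}
		if (n->key == key)
			return n;
		pp = &n->next;
	}
	return NULL;
}
```

As in the lookup path of the patch, the walk stops at the first match, so
expired entries further down the chain are left for a later walk or for
the GC worker.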
Set IPS_HELPER_BIT when the helper ct extension is added, so that the new
GC worker does not need to bump the ct refcount to check if the ct->ext
helper is available.
This removes the extra refcount bump for expectation timers, which
allows removing several nf_ct_expect_put() calls after the unlink. After
this update, the refcount remains at 1 while the expectation is on the
expectation hashes.
This patch implicitly addresses a race with the existing timer API that
allows an expectation to access a stale exp->master pointer which has
already been released when expectation removal loses a race with an
expiring timer, i.e. timer_delete() returning false.
Add a new NF_CT_EXPECT_DEAD flag to reap this expectation via GC. This
is needed by nf_ct_unexpect_related(), which is called in error paths to
invalidate newly created expectations that have been added into the
hashes. These expectations cannot be immediately released, as GC or
nf_ct_remove_expectations() could race to do so. On expectation
insert, the runtime GC reaps stale expectations before checking the
expectation limit set by policy.
Set the current timestamp in nf_ct_expect_alloc(), then add the
expectation policy timeout (or a custom timeout added on top of this) to
specify the expectation lifetime.
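The timestamp scheme relies on a signed 32-bit difference, so that the
expiry check keeps working when the tick counter wraps. A minimal
userspace sketch, assuming a caller-supplied tick value in place of
nfct_time_stamp; EXP_DEAD, struct expect, exp_deadline() and
exp_is_expired() are hypothetical names that loosely mirror the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EXP_DEAD 0x8u	/* same value as NF_CT_EXPECT_DEAD in the patch */

struct expect {
	uint32_t timeout;	/* tick count at which the entry expires */
	unsigned int flags;
};

/* Deadline = current tick + lifetime; unsigned arithmetic may wrap. */
static uint32_t exp_deadline(uint32_t now, uint32_t secs, uint32_t hz)
{
	return now + secs * hz;
}

static bool exp_is_expired(const struct expect *exp, uint32_t now)
{
	if (exp->flags & EXP_DEAD)
		return true;

	/*
	 * The cast to int32_t assumes the usual two's complement
	 * conversion; the result is <= 0 once now has reached timeout,
	 * even if the counter wrapped in between.
	 */
	return (int32_t)(exp->timeout - now) <= 0;
}
```

The comparison is only meaningful while deadlines lie less than 2^31
ticks in the future.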
Fixes: bffcaad9afdf ("netfilter: ctnetlink: ensure safe access to master conntrack")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
include/net/netfilter/nf_conntrack_expect.h | 16 +-
.../linux/netfilter/nf_conntrack_common.h | 1 +
net/netfilter/nf_conntrack_core.c | 33 +++-
net/netfilter/nf_conntrack_expect.c | 145 +++++++++---------
net/netfilter/nf_conntrack_h323_main.c | 4 +-
net/netfilter/nf_conntrack_helper.c | 10 +-
net/netfilter/nf_conntrack_netlink.c | 22 ++-
net/netfilter/nf_conntrack_sip.c | 13 +-
net/netfilter/nft_ct.c | 3 +-
9 files changed, 139 insertions(+), 108 deletions(-)
diff --git a/include/net/netfilter/nf_conntrack_expect.h b/include/net/netfilter/nf_conntrack_expect.h
index 80f50fd0f7ad..be4a120d549e 100644
--- a/include/net/netfilter/nf_conntrack_expect.h
+++ b/include/net/netfilter/nf_conntrack_expect.h
@@ -54,8 +54,8 @@ struct nf_conntrack_expect {
/* The conntrack of the master connection */
struct nf_conn *master;
- /* Timer function; deletes the expectation. */
- struct timer_list timeout;
+ /* jiffies32 when this expectation expires */
+ u32 timeout;
#if IS_ENABLED(CONFIG_NF_NAT)
union nf_inet_addr saved_addr;
@@ -69,6 +69,14 @@ struct nf_conntrack_expect {
struct rcu_head rcu;
};
+static inline bool nf_ct_exp_is_expired(const struct nf_conntrack_expect *exp)
+{
+ if (READ_ONCE(exp->flags) & NF_CT_EXPECT_DEAD)
+ return true;
+
+ return (__s32)(READ_ONCE(exp->timeout) - nfct_time_stamp) <= 0;
+}
+
static inline struct net *nf_ct_exp_net(struct nf_conntrack_expect *exp)
{
return read_pnet(&exp->net);
@@ -130,7 +138,6 @@ static inline void nf_ct_unlink_expect(struct nf_conntrack_expect *exp)
void nf_ct_remove_expectations(struct nf_conn *ct);
void nf_ct_unexpect_related(struct nf_conntrack_expect *exp);
-bool nf_ct_remove_expect(struct nf_conntrack_expect *exp);
void nf_ct_expect_iterate_destroy(bool (*iter)(struct nf_conntrack_expect *e, void *data), void *data);
void nf_ct_expect_iterate_net(struct net *net,
@@ -153,5 +160,8 @@ static inline int nf_ct_expect_related(struct nf_conntrack_expect *expect,
return nf_ct_expect_related_report(expect, 0, 0, flags);
}
+struct nf_conn_help;
+void nf_ct_expectation_gc(struct nf_conn_help *master_help);
+
#endif /*_NF_CONNTRACK_EXPECT_H*/
diff --git a/include/uapi/linux/netfilter/nf_conntrack_common.h b/include/uapi/linux/netfilter/nf_conntrack_common.h
index 56b6b60a814f..ee51045ae1d6 100644
--- a/include/uapi/linux/netfilter/nf_conntrack_common.h
+++ b/include/uapi/linux/netfilter/nf_conntrack_common.h
@@ -160,6 +160,7 @@ enum ip_conntrack_expect_events {
#define NF_CT_EXPECT_USERSPACE 0x4
#ifdef __KERNEL__
+#define NF_CT_EXPECT_DEAD 0x8
#define NF_CT_EXPECT_MASK (NF_CT_EXPECT_PERMANENT | NF_CT_EXPECT_INACTIVE | \
NF_CT_EXPECT_USERSPACE)
#endif
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 4fb3a2d18631..784bd1d7a9bf 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1471,6 +1471,31 @@ static bool gc_worker_can_early_drop(const struct nf_conn *ct)
return false;
}
+static void nf_ct_help_gc(struct nf_conn *ct)
+{
+ struct nf_conn_help *help;
+
+ if (!refcount_inc_not_zero(&ct->ct_general.use))
+ return;
+
+ /* load ->status after refcount increase */
+ smp_acquire__after_ctrl_dep();
+
+ if (!nf_ct_is_confirmed(ct) || nf_ct_is_dying(ct)) {
+ nf_ct_put(ct);
+ return;
+ }
+
+ /* re-check helper due to SLAB_TYPESAFE_BY_RCU */
+ if (test_bit(IPS_HELPER_BIT, &ct->status)) {
+ help = nfct_help(ct);
+ if (help)
+ nf_ct_expectation_gc(help);
+ }
+
+ nf_ct_put(ct);
+}
+
static void gc_worker(struct work_struct *work)
{
unsigned int i, hashsz, nf_conntrack_max95 = 0;
@@ -1543,7 +1568,13 @@ static void gc_worker(struct work_struct *work)
expires = (expires - (long)next_run) / ++count;
next_run += expires;
- if (nf_conntrack_max95 == 0 || gc_worker_skip_ct(tmp))
+ if (gc_worker_skip_ct(tmp))
+ continue;
+
+ if (test_bit(IPS_HELPER_BIT, &tmp->status))
+ nf_ct_help_gc(tmp);
+
+ if (nf_conntrack_max95 == 0)
continue;
net = nf_ct_net(tmp);
diff --git a/net/netfilter/nf_conntrack_expect.c b/net/netfilter/nf_conntrack_expect.c
index 5c9b17835c28..49e18eda037e 100644
--- a/net/netfilter/nf_conntrack_expect.c
+++ b/net/netfilter/nf_conntrack_expect.c
@@ -43,6 +43,24 @@ unsigned int nf_ct_expect_max __read_mostly;
static struct kmem_cache *nf_ct_expect_cachep __read_mostly;
static siphash_aligned_key_t nf_ct_expect_hashrnd;
+void nf_ct_expectation_gc(struct nf_conn_help *master_help)
+{
+ struct nf_conntrack_expect *exp;
+ struct hlist_node *next;
+
+ if (hlist_empty(&master_help->expectations))
+ return;
+
+ spin_lock_bh(&nf_conntrack_expect_lock);
+ hlist_for_each_entry_safe(exp, next, &master_help->expectations, lnode) {
+ if (!nf_ct_exp_is_expired(exp))
+ continue;
+
+ nf_ct_unlink_expect(exp);
+ }
+ spin_unlock_bh(&nf_conntrack_expect_lock);
+}
+
/* nf_conntrack_expect helper functions */
void nf_ct_unlink_expect_report(struct nf_conntrack_expect *exp,
u32 portid, int report)
@@ -52,7 +70,6 @@ void nf_ct_unlink_expect_report(struct nf_conntrack_expect *exp,
struct nf_conntrack_net *cnet;
lockdep_nfct_expect_lock_held();
- WARN_ON_ONCE(timer_pending(&exp->timeout));
hlist_del_rcu(&exp->hnode);
@@ -70,16 +87,6 @@ void nf_ct_unlink_expect_report(struct nf_conntrack_expect *exp,
}
EXPORT_SYMBOL_GPL(nf_ct_unlink_expect_report);
-static void nf_ct_expectation_timed_out(struct timer_list *t)
-{
- struct nf_conntrack_expect *exp = timer_container_of(exp, t, timeout);
-
- spin_lock_bh(&nf_conntrack_expect_lock);
- nf_ct_unlink_expect(exp);
- spin_unlock_bh(&nf_conntrack_expect_lock);
- nf_ct_expect_put(exp);
-}
-
static unsigned int nf_ct_expect_dst_hash(const struct net *n, const struct nf_conntrack_tuple *tuple)
{
struct {
@@ -117,19 +124,6 @@ nf_ct_exp_equal(const struct nf_conntrack_tuple *tuple,
nf_ct_exp_zone_equal_any(i, zone);
}
-bool nf_ct_remove_expect(struct nf_conntrack_expect *exp)
-{
- lockdep_nfct_expect_lock_held();
-
- if (timer_delete(&exp->timeout)) {
- nf_ct_unlink_expect(exp);
- nf_ct_expect_put(exp);
- return true;
- }
- return false;
-}
-EXPORT_SYMBOL_GPL(nf_ct_remove_expect);
-
struct nf_conntrack_expect *
__nf_ct_expect_find(struct net *net,
const struct nf_conntrack_zone *zone,
@@ -144,6 +138,8 @@ __nf_ct_expect_find(struct net *net,
h = nf_ct_expect_dst_hash(net, tuple);
hlist_for_each_entry_rcu(i, &nf_ct_expect_hash[h], hnode) {
+ if (nf_ct_exp_is_expired(i))
+ continue;
if (nf_ct_exp_equal(tuple, i, zone, net))
return i;
}
@@ -178,6 +174,7 @@ nf_ct_find_expectation(struct net *net,
{
struct nf_conntrack_net *cnet = nf_ct_pernet(net);
struct nf_conntrack_expect *i, *exp = NULL;
+ struct hlist_node *next;
unsigned int h;
lockdep_nfct_expect_lock_held();
@@ -186,7 +183,11 @@ nf_ct_find_expectation(struct net *net,
return NULL;
h = nf_ct_expect_dst_hash(net, tuple);
- hlist_for_each_entry(i, &nf_ct_expect_hash[h], hnode) {
+ hlist_for_each_entry_safe(i, next, &nf_ct_expect_hash[h], hnode) {
+ if (nf_ct_exp_is_expired(i)) {
+ nf_ct_unlink_expect(i);
+ continue;
+ }
if (!(i->flags & NF_CT_EXPECT_INACTIVE) &&
nf_ct_exp_equal(tuple, i, zone, net)) {
exp = i;
@@ -196,13 +197,16 @@ nf_ct_find_expectation(struct net *net,
if (!exp)
return NULL;
+ if (!refcount_inc_not_zero(&exp->use))
+ return NULL;
+
/* If master is not in hash table yet (ie. packet hasn't left
this machine yet), how can other end know about expected?
Hence these are not the droids you are looking for (if
master ct never got confirmed, we'd hold a reference to it
and weird things would happen to future packets). */
if (!nf_ct_is_confirmed(exp->master))
- return NULL;
+ goto err_release_exp;
/* Avoid race with other CPUs, that for exp->master ct, is
* about to invoke ->destroy(), or nf_ct_delete() via timeout
@@ -214,18 +218,17 @@ nf_ct_find_expectation(struct net *net,
*/
if (unlikely(nf_ct_is_dying(exp->master) ||
!refcount_inc_not_zero(&exp->master->ct_general.use)))
- return NULL;
+ goto err_release_exp;
- if (exp->flags & NF_CT_EXPECT_PERMANENT || !unlink) {
- refcount_inc(&exp->use);
- return exp;
- } else if (timer_delete(&exp->timeout)) {
- nf_ct_unlink_expect(exp);
+ if (exp->flags & NF_CT_EXPECT_PERMANENT || !unlink)
return exp;
- }
- /* Undo exp->master refcnt increase, if timer_delete() failed */
- nf_ct_put(exp->master);
+ nf_ct_unlink_expect(exp);
+
+ return exp;
+
+err_release_exp:
+ nf_ct_expect_put(exp);
return NULL;
}
@@ -241,9 +244,8 @@ void nf_ct_remove_expectations(struct nf_conn *ct)
return;
spin_lock_bh(&nf_conntrack_expect_lock);
- hlist_for_each_entry_safe(exp, next, &help->expectations, lnode) {
- nf_ct_remove_expect(exp);
- }
+ hlist_for_each_entry_safe(exp, next, &help->expectations, lnode)
+ nf_ct_unlink_expect(exp);
spin_unlock_bh(&nf_conntrack_expect_lock);
}
EXPORT_SYMBOL_GPL(nf_ct_remove_expectations);
@@ -292,7 +294,7 @@ static bool master_matches(const struct nf_conntrack_expect *a,
void nf_ct_unexpect_related(struct nf_conntrack_expect *exp)
{
spin_lock_bh(&nf_conntrack_expect_lock);
- nf_ct_remove_expect(exp);
+ WRITE_ONCE(exp->flags, exp->flags | NF_CT_EXPECT_DEAD);
spin_unlock_bh(&nf_conntrack_expect_lock);
}
EXPORT_SYMBOL_GPL(nf_ct_unexpect_related);
@@ -308,6 +310,7 @@ struct nf_conntrack_expect *nf_ct_expect_alloc(struct nf_conn *me)
if (!new)
return NULL;
+ new->timeout = nfct_time_stamp;
new->master = me;
refcount_set(&new->use, 1);
return new;
@@ -413,17 +416,12 @@ static void nf_ct_expect_insert(struct nf_conntrack_expect *exp,
struct net *net = nf_ct_exp_net(exp);
unsigned int h = nf_ct_expect_dst_hash(net, &exp->tuple);
- /* two references : one for hash insert, one for the timer */
- refcount_add(2, &exp->use);
+ refcount_inc(&exp->use);
- timer_setup(&exp->timeout, nf_ct_expectation_timed_out, 0);
helper = rcu_dereference_protected(master_help->helper,
lockdep_is_held(&nf_conntrack_expect_lock));
- if (helper) {
- exp->timeout.expires = jiffies +
- helper->expect_policy[exp->class].timeout * HZ;
- }
- add_timer(&exp->timeout);
+ if (helper)
+ exp->timeout += helper->expect_policy[exp->class].timeout * HZ;
hlist_add_head_rcu(&exp->lnode, &master_help->expectations);
master_help->expecting[exp->class]++;
@@ -435,19 +433,26 @@ static void nf_ct_expect_insert(struct nf_conntrack_expect *exp,
NF_CT_STAT_INC(net, expect_create);
}
-/* Race with expectations being used means we could have none to find; OK. */
static void evict_oldest_expect(struct nf_conn_help *master_help,
- struct nf_conntrack_expect *new)
+ struct nf_conntrack_expect *new,
+ const struct nf_conntrack_expect_policy *p)
{
struct nf_conntrack_expect *exp, *last = NULL;
+ struct hlist_node *next;
- hlist_for_each_entry(exp, &master_help->expectations, lnode) {
+ hlist_for_each_entry_safe(exp, next, &master_help->expectations, lnode) {
+ if (nf_ct_exp_is_expired(exp)) {
+ nf_ct_unlink_expect(exp);
+ continue;
+ }
if (exp->class == new->class)
last = exp;
}
- if (last)
- nf_ct_remove_expect(last);
+ /* Still worth to evict oldest expectation after garbage collection? */
+ if (last &&
+ master_help->expecting[last->class] >= p->max_expected)
+ nf_ct_unlink_expect(last);
}
static inline int __nf_ct_expect_check(struct nf_conntrack_expect *expect,
@@ -467,14 +472,18 @@ static inline int __nf_ct_expect_check(struct nf_conntrack_expect *expect,
h = nf_ct_expect_dst_hash(net, &expect->tuple);
hlist_for_each_entry_safe(i, next, &nf_ct_expect_hash[h], hnode) {
+ if (nf_ct_exp_is_expired(i)) {
+ nf_ct_unlink_expect(i);
+ continue;
+ }
if (master_matches(i, expect, flags) &&
expect_matches(i, expect)) {
if (i->class != expect->class ||
i->master != expect->master)
return -EALREADY;
- if (nf_ct_remove_expect(i))
- break;
+ nf_ct_unlink_expect(i);
+ break;
} else if (expect_clash(i, expect)) {
ret = -EBUSY;
goto out;
@@ -486,14 +495,8 @@ static inline int __nf_ct_expect_check(struct nf_conntrack_expect *expect,
if (helper) {
p = &helper->expect_policy[expect->class];
if (p->max_expected &&
- master_help->expecting[expect->class] >= p->max_expected) {
- evict_oldest_expect(master_help, expect);
- if (master_help->expecting[expect->class]
- >= p->max_expected) {
- ret = -EMFILE;
- goto out;
- }
- }
+ master_help->expecting[expect->class] >= p->max_expected)
+ evict_oldest_expect(master_help, expect, p);
}
cnet = nf_ct_pernet(net);
@@ -547,10 +550,8 @@ void nf_ct_expect_iterate_destroy(bool (*iter)(struct nf_conntrack_expect *e, vo
hlist_for_each_entry_safe(exp, next,
&nf_ct_expect_hash[i],
hnode) {
- if (iter(exp, data) && timer_delete(&exp->timeout)) {
+ if (iter(exp, data))
nf_ct_unlink_expect(exp);
- nf_ct_expect_put(exp);
- }
}
}
@@ -577,10 +578,8 @@ void nf_ct_expect_iterate_net(struct net *net,
if (!net_eq(nf_ct_exp_net(exp), net))
continue;
- if (iter(exp, data) && timer_delete(&exp->timeout)) {
+ if (iter(exp, data))
nf_ct_unlink_expect_report(exp, portid, report);
- nf_ct_expect_put(exp);
- }
}
}
@@ -657,17 +656,17 @@ static int exp_seq_show(struct seq_file *s, void *v)
struct net *net = seq_file_net(s);
struct hlist_node *n = v;
char *delim = "";
+ __s32 timeout;
expect = hlist_entry(n, struct nf_conntrack_expect, hnode);
if (!net_eq(nf_ct_exp_net(expect), net))
return 0;
+ if (nf_ct_exp_is_expired(expect))
+ return 0;
- if (expect->timeout.function)
- seq_printf(s, "%ld ", timer_pending(&expect->timeout)
- ? (long)(expect->timeout.expires - jiffies)/HZ : 0);
- else
- seq_puts(s, "- ");
+ timeout = (__s32)(READ_ONCE(expect->timeout) - nfct_time_stamp) / HZ;
+ seq_printf(s, "%d ", timeout > 0 ? timeout : 0);
seq_printf(s, "l3proto = %u proto=%u ",
expect->tuple.src.l3num,
expect->tuple.dst.protonum);
diff --git a/net/netfilter/nf_conntrack_h323_main.c b/net/netfilter/nf_conntrack_h323_main.c
index 7f189dceb3c4..24931e379985 100644
--- a/net/netfilter/nf_conntrack_h323_main.c
+++ b/net/netfilter/nf_conntrack_h323_main.c
@@ -1388,8 +1388,8 @@ static int process_rcf(struct sk_buff *skb, struct nf_conn *ct,
"timeout to %u seconds for",
info->timeout);
nf_ct_dump_tuple(&exp->tuple);
- mod_timer_pending(&exp->timeout,
- jiffies + info->timeout * HZ);
+ WRITE_ONCE(exp->timeout,
+ nfct_time_stamp + (info->timeout * HZ));
}
spin_unlock_bh(&nf_conntrack_expect_lock);
}
diff --git a/net/netfilter/nf_conntrack_helper.c b/net/netfilter/nf_conntrack_helper.c
index 2f35bdd0d7d7..8b94001c2430 100644
--- a/net/netfilter/nf_conntrack_helper.c
+++ b/net/netfilter/nf_conntrack_helper.c
@@ -181,10 +181,10 @@ nf_ct_helper_ext_add(struct nf_conn *ct, gfp_t gfp)
struct nf_conn_help *help;
help = nf_ct_ext_add(ct, NF_CT_EXT_HELPER, gfp);
- if (help)
+ if (help) {
+ __set_bit(IPS_HELPER_BIT, &ct->status);
INIT_HLIST_HEAD(&help->expectations);
- else
- pr_debug("failed to add helper extension area");
+ }
return help;
}
EXPORT_SYMBOL_GPL(nf_ct_helper_ext_add);
@@ -203,10 +203,8 @@ int __nf_ct_try_assign_helper(struct nf_conn *ct, struct nf_conn *tmpl,
return 0;
help = nfct_help(tmpl);
- if (help != NULL) {
+ if (help)
helper = rcu_dereference(help->helper);
- set_bit(IPS_HELPER_BIT, &ct->status);
- }
help = nfct_help(ct);
diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index b429e648f06c..4e78d2482989 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -3014,8 +3014,8 @@ static int
ctnetlink_exp_dump_expect(struct sk_buff *skb,
const struct nf_conntrack_expect *exp)
{
+ __s32 timeout = (__s32)(READ_ONCE(exp->timeout) - nfct_time_stamp) / HZ;
struct nf_conn *master = exp->master;
- long timeout = ((long)exp->timeout.expires - (long)jiffies) / HZ;
struct nf_conntrack_helper *helper;
#if IS_ENABLED(CONFIG_NF_NAT)
struct nlattr *nest_parms;
@@ -3178,6 +3178,9 @@ ctnetlink_exp_dump_table(struct sk_buff *skb, struct netlink_callback *cb)
restart:
hlist_for_each_entry_rcu(exp, &nf_ct_expect_hash[cb->args[0]],
hnode) {
+ if (nf_ct_exp_is_expired(exp))
+ continue;
+
if (l3proto && exp->tuple.src.l3num != l3proto)
continue;
@@ -3456,11 +3459,8 @@ static int ctnetlink_del_expect(struct sk_buff *skb,
}
/* after list removal, usage count == 1 */
- if (timer_delete(&exp->timeout)) {
- nf_ct_unlink_expect_report(exp, NETLINK_CB(skb).portid,
- nlmsg_report(info->nlh));
- nf_ct_expect_put(exp);
- }
+ nf_ct_unlink_expect_report(exp, NETLINK_CB(skb).portid,
+ nlmsg_report(info->nlh));
spin_unlock_bh(&nf_conntrack_expect_lock);
/* have to put what we 'get' above.
* after this line usage count == 0 */
@@ -3484,14 +3484,10 @@ static int
ctnetlink_change_expect(struct nf_conntrack_expect *x,
const struct nlattr * const cda[])
{
- if (cda[CTA_EXPECT_TIMEOUT]) {
- if (!timer_delete(&x->timeout))
- return -ETIME;
+ if (cda[CTA_EXPECT_TIMEOUT])
+ WRITE_ONCE(x->timeout, nfct_time_stamp +
+ ntohl(nla_get_be32(cda[CTA_EXPECT_TIMEOUT])) * HZ);
- x->timeout.expires = jiffies +
- ntohl(nla_get_be32(cda[CTA_EXPECT_TIMEOUT])) * HZ;
- add_timer(&x->timeout);
- }
return 0;
}
diff --git a/net/netfilter/nf_conntrack_sip.c b/net/netfilter/nf_conntrack_sip.c
index c606d1f60b58..5ec3a4a4bbd7 100644
--- a/net/netfilter/nf_conntrack_sip.c
+++ b/net/netfilter/nf_conntrack_sip.c
@@ -897,11 +897,10 @@ static int refresh_signalling_expectation(struct nf_conn *ct,
exp->tuple.dst.protonum != proto ||
exp->tuple.dst.u.udp.port != port)
continue;
- if (mod_timer_pending(&exp->timeout, jiffies + expires * HZ)) {
- exp->flags &= ~NF_CT_EXPECT_INACTIVE;
- found = 1;
- break;
- }
+ WRITE_ONCE(exp->timeout, nfct_time_stamp + (expires * HZ));
+ WRITE_ONCE(exp->flags, exp->flags & ~NF_CT_EXPECT_INACTIVE);
+ found = 1;
+ break;
}
spin_unlock_bh(&nf_conntrack_expect_lock);
return found;
@@ -920,8 +919,7 @@ static void flush_expectations(struct nf_conn *ct, bool media)
hlist_for_each_entry_safe(exp, next, &help->expectations, lnode) {
if ((exp->class != SIP_EXPECT_SIGNALLING) ^ media)
continue;
- if (!nf_ct_remove_expect(exp))
- continue;
+ nf_ct_unlink_expect(exp);
if (!media)
break;
}
@@ -1413,7 +1411,6 @@ static int process_register_request(struct sk_buff *skb, unsigned int protoff,
nf_ct_expect_init(exp, SIP_EXPECT_SIGNALLING, nf_ct_l3num(ct),
saddr, &daddr, proto, NULL, &port);
- exp->timeout.expires = sip_timeout * HZ;
rcu_assign_pointer(exp->assign_helper, helper);
exp->flags = NF_CT_EXPECT_PERMANENT | NF_CT_EXPECT_INACTIVE;
diff --git a/net/netfilter/nft_ct.c b/net/netfilter/nft_ct.c
index 25934c6f01fb..958054dd2e2e 100644
--- a/net/netfilter/nft_ct.c
+++ b/net/netfilter/nft_ct.c
@@ -1145,7 +1145,6 @@ static void nft_ct_helper_obj_eval(struct nft_object *obj,
help = nf_ct_helper_ext_add(ct, GFP_ATOMIC);
if (help && refcount_inc_not_zero(&to_assign->ct_refcnt)) {
rcu_assign_pointer(help->helper, to_assign);
- set_bit(IPS_HELPER_BIT, &ct->status);
if ((ct->status & IPS_NAT_MASK) && !nfct_seqadj(ct))
if (!nfct_seqadj_ext_add(ct))
@@ -1326,7 +1325,7 @@ static void nft_ct_expect_obj_eval(struct nft_object *obj,
&ct->tuplehash[!dir].tuple.src.u3,
&ct->tuplehash[!dir].tuple.dst.u3,
priv->l4proto, NULL, &priv->dport);
- exp->timeout.expires = jiffies + priv->timeout * HZ;
+ exp->timeout += priv->timeout * HZ;
if (nf_ct_expect_related(exp, 0) != 0)
regs->verdict.code = NF_DROP;
--
2.47.3
^ permalink raw reply related