* Re: [PATCH net-next v2 1/2] virtio_net: xsk: fix race in rx wake up
From: Menglong Dong @ 2026-06-16 1:48 UTC (permalink / raw)
To: menglong8.dong, Xuan Zhuo
Cc: mst, jasowang, andrew+netdev, davem, edumazet, kuba, pabeni,
minhquangbui99, kerneljasonxing, netdev, virtualization,
linux-kernel, eperezma
In-Reply-To: <1781491685.0613394-1-xuanzhuo@linux.alibaba.com>
On 2026/6/15 10:48 Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> On Thu, 11 Jun 2026 10:56:43 +0800, menglong8.dong@gmail.com wrote:
> > From: Menglong Dong <dongml2@chinatelecom.cn>
> >
> > During packet receiving in virtio-net, the rq can be empty, which means
> > "rq->vq->num_free == virtqueue_get_vring_size(rq->vq)", in
> > virtnet_add_recvbuf_xsk(), if we are using xsk. Meanwhile, the fill ring
> > can be empty too, which means we can't allocate anything from
> > xsk_buff_alloc_batch(). Then, we will set the XDP_RING_NEED_WAKEUP flag.
> >
[...]
> >
> > + need_wakeup = xsk_uses_need_wakeup(pool);
> > xsk_buffs = rq->xsk_buffs;
> >
> > + /* If both rq->vq and fill ring are empty, and then the user submit
> > + * all the chunks to the fill ring and check the wake up flag
> > + * after xsk_buff_alloc_batch() and before xsk_set_rx_need_wakeup(),
> > + * we will lose the chance to wake up the rx napi, so we have to
> > + * set the need_wakeup flag here.
> > + */
> > + if (need_wakeup && virtqueue_get_vring_size(rq->vq) == rq->vq->num_free)
> > + xsk_set_rx_need_wakeup(pool);
>
> Is Condition A here too strict? We should trigger the wakeup under a wider range
> of scenarios.
Hi, Xuan. Thanks for reviewing :)
The logic here is an addition to the original wake-up logic, which I intended
as a fix for a race condition. However, this race condition seems unlikely to
happen, as we discussed in this thread:
https://lore.kernel.org/netdev/rHZz5_ylT4WggoZ-Ic2Q4w@linux.dev/
So this patch is not necessary, and I'll send the 2nd patch as a standalone patch.
Thanks!
Menglong Dong
>
> > +
> > num = xsk_buff_alloc_batch(pool, xsk_buffs, rq->vq->num_free);
> > if (!num) {
> > - if (xsk_uses_need_wakeup(pool)) {
> > + if (need_wakeup) {
> > xsk_set_rx_need_wakeup(pool);
> > /* Return 0 instead of -ENOMEM so that NAPI is
> > * descheduled.
> > @@ -1341,8 +1352,6 @@ static int virtnet_add_recvbuf_xsk(struct virtnet_info *vi, struct receive_queue
> > }
> >
> > return -ENOMEM;
> > - } else {
> > - xsk_clear_rx_need_wakeup(pool);
> > }
> >
> > len = xsk_pool_get_rx_frame_size(pool) + vi->hdr_len;
> > @@ -1363,6 +1372,16 @@ static int virtnet_add_recvbuf_xsk(struct virtnet_info *vi, struct receive_queue
> > goto err;
> > }
> >
> > + if (need_wakeup) {
> > + if (rq->vq->num_free)
> > + /* We have free buffers, so we'd better wake up the
> > + * rx napi as soon as possible.
> > + */
> > + xsk_set_rx_need_wakeup(pool);
>
> Is the purpose of waking up RX NAPI to invoke try_fill_recv? However,
> virtnet_poll does not call try_fill_recv directly; it is done
> conditionally.
>
> Thanks.
>
>
> > + else
> > + xsk_clear_rx_need_wakeup(pool);
> > + }
> > +
> > return num;
> >
> > err:
> > --
> > 2.54.0
> >
>
>
* Re: [PATCH bpf 2/2] selftests/bpf: Cover partial copy of non-linear skb test_run output
From: sun jian @ 2026-06-16 1:44 UTC (permalink / raw)
To: Paul Chaignon
Cc: bpf, netdev, linux-kselftest, linux-kernel, ast, daniel, andrii,
martin.lau, eddyz87, memxor, song, yonghong.song, jolsa, shuah
In-Reply-To: <ajAIdJHvvIH9ZSrM@mail.gmail.com>
On Mon, Jun 15, 2026 at 10:13 PM Paul Chaignon <paul.chaignon@gmail.com> wrote:
>
> On Mon, Jun 15, 2026 at 03:38:56PM +0800, Sun Jian wrote:
> > Add a test case for BPF_PROG_TEST_RUN with a non-linear skb and a short
> > data_out buffer.
> >
> > The test verifies that test_run returns -ENOSPC, reports the full packet
> > length through data_size_out, and copies the packet prefix into data_out.
> > The test uses a 100-byte data_out buffer with a 64-byte linear head, so the
> > expected output spans both the skb head and the first fragment.
> >
> > Signed-off-by: Sun Jian <sun.jian.kdev@gmail.com>
> > ---
> > .../selftests/bpf/prog_tests/skb_load_bytes.c | 35 +++++++++++++++++++
> > 1 file changed, 35 insertions(+)
> >
> > diff --git a/tools/testing/selftests/bpf/prog_tests/skb_load_bytes.c b/tools/testing/selftests/bpf/prog_tests/skb_load_bytes.c
> > index d7f83c0a40a5..134be0ea8ed7 100644
> > --- a/tools/testing/selftests/bpf/prog_tests/skb_load_bytes.c
> > +++ b/tools/testing/selftests/bpf/prog_tests/skb_load_bytes.c
> > @@ -3,6 +3,39 @@
> > #include <network_helpers.h>
> > #include "skb_load_bytes.skel.h"
> >
> > +#define NONLINEAR_PKT_LEN 9000
> > +#define NONLINEAR_HEAD_LEN 64
> > +#define SHORT_OUT_LEN 100
> > +
> > +static void test_nonlinear_data_out_partial(int prog_fd)
> > +{
> > + LIBBPF_OPTS(bpf_test_run_opts, tattr);
> > + __u8 pkt[NONLINEAR_PKT_LEN];
> > + __u8 out[SHORT_OUT_LEN];
> > + struct __sk_buff skb = {};
> > + int err, i;
> > +
> > + for (i = 0; i < sizeof(pkt); i++)
> > + pkt[i] = i & 0xff;
> > +
> > + memset(out, 0xa5, sizeof(out));
> > +
> > + skb.data_end = NONLINEAR_HEAD_LEN;
> > +
> > + tattr.data_in = pkt;
> > + tattr.data_size_in = sizeof(pkt);
> > + tattr.data_out = out;
> > + tattr.data_size_out = sizeof(out);
> > + tattr.ctx_in = &skb;
> > + tattr.ctx_size_in = sizeof(skb);
> > +
> > + err = bpf_prog_test_run_opts(prog_fd, &tattr);
> > +
> > + ASSERT_EQ(err, -ENOSPC, "nonlinear_partial_err");
> > + ASSERT_EQ(tattr.data_size_out, sizeof(pkt), "nonlinear_partial_data_size_out");
> > + ASSERT_OK(memcmp(out, pkt, sizeof(out)), "nonlinear_partial_data_out");
> > +}
> > +
> > void test_skb_load_bytes(void)
> > {
> > struct skb_load_bytes *skel;
> > @@ -40,6 +73,8 @@ void test_skb_load_bytes(void)
> > if (!ASSERT_EQ(test_result, 0, "offset 10"))
> > goto out;
> >
> > + test_nonlinear_data_out_partial(prog_fd);
> > +
>
> Maybe prog_tests/prog_run_opts.c would be a better place to cover this?
> test_skb_load_bytes() is meant to cover the bpf_skb_load_bytes helper.
>
> > out:
> > skb_load_bytes__destroy(skel);
> > }
> > --
> > 2.43.0
> >
Hi Paul,
Thanks, agreed. The test is really about BPF_PROG_TEST_RUN copy-out
semantics, not the bpf_skb_load_bytes() helper.
I'll move it to prog_run_opts.c in v2.
Thanks,
Sun Jian
* Re: [PATCH bpf 1/2] bpf: Fix partial copy of non-linear skb test_run output
From: sun jian @ 2026-06-16 1:43 UTC (permalink / raw)
To: Paul Chaignon
Cc: bpf, netdev, linux-kselftest, linux-kernel, ast, daniel, andrii,
martin.lau, eddyz87, memxor, song, yonghong.song, jolsa, shuah
In-Reply-To: <ajAAkIv0R3ztfc5g@mail.gmail.com>
On Mon, Jun 15, 2026 at 9:39 PM Paul Chaignon <paul.chaignon@gmail.com> wrote:
>
> On Mon, Jun 15, 2026 at 03:38:55PM +0800, Sun Jian wrote:
> > For non-linear skbs, bpf_test_finish() derives the linear head copy
> > length from copy_size - frag_size. This only matches the skb head length
> > when copy_size is the full packet size.
> >
> > When userspace provides a short data_out buffer, copy_size is clamped to
> > that buffer size. If copy_size is smaller than frag_size, the computed
> > length becomes negative and bpf_test_finish() returns -ENOSPC before
> > copying the packet prefix or updating data_size_out.
>
> Thanks for fixing this!
>
> >
> > Compute the linear head length from the skb layout instead, and clamp the
> > head copy length to copy_size. This preserves the expected partial-copy
> > semantics: return -ENOSPC, copy the packet prefix that fits in data_out,
> > and report the full packet length through data_size_out.
> >
> > Fixes: 838baa351cee ("bpf: Craft non-linear skbs in BPF_PROG_TEST_RUN")
>
> Wouldn't this bug actually go back to 7855e0db150ad ("bpf: test_run:
> add xdp_shared_info pointer in bpf_test_finish signature") and also
> affect the XDP bpf_prog_test_run_xdp()? If so, could you also add a
> selftest that covers it for XDP?
>
> > Signed-off-by: Sun Jian <sun.jian.kdev@gmail.com>
> > ---
> > net/bpf/test_run.c | 11 ++++-------
> > 1 file changed, 4 insertions(+), 7 deletions(-)
> >
> > diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
> > index 2bc04feadfab..976e8fa31bc9 100644
> > --- a/net/bpf/test_run.c
> > +++ b/net/bpf/test_run.c
> > @@ -453,19 +453,16 @@ static int bpf_test_finish(const union bpf_attr *kattr,
> > }
> >
> > if (data_out) {
> > - int len = sinfo ? copy_size - frag_size : copy_size;
> > -
> > - if (len < 0) {
> > - err = -ENOSPC;
> > - goto out;
> > - }
> > + u32 head_len = size - frag_size;
> > + u32 len = min(copy_size, head_len);
> >
> > if (copy_to_user(data_out, data, len))
> > goto out;
> >
> > if (sinfo) {
> > - int i, offset = len;
> > + u32 offset = len;
> > u32 data_len;
> > + int i;
> >
> > for (i = 0; i < sinfo->nr_frags; i++) {
> > skb_frag_t *frag = &sinfo->frags[i];
> > --
> > 2.43.0
> >
Hi Paul,
Thanks for taking a look.
Yes, that makes sense. I'll re-check the Fixes tag against
7855e0db150ad and add XDP coverage in v2, since the issue is in the
shared bpf_test_finish() path.
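
For reference, the length arithmetic discussed above can be tried in
userspace. This is only a minimal sketch, assuming the layout from the
selftest in patch 2/2 (9000-byte packet, 64-byte linear head, hence an
assumed frag_size of 8936, and a 100-byte data_out buffer). The helper
names old_head_copy_len() and new_head_copy_len() are hypothetical and
do not exist in net/bpf/test_run.c.

```c
#include <assert.h>
#include <stdint.h>

/* Old computation: only matches the head length when copy_size is the
 * full packet size; goes negative for a short data_out buffer.
 */
static int old_head_copy_len(uint32_t copy_size, uint32_t frag_size)
{
	return (int)copy_size - (int)frag_size;
}

/* New computation: derive the head length from the packet layout, then
 * clamp it to what fits in the user buffer.
 */
static uint32_t new_head_copy_len(uint32_t size, uint32_t frag_size,
				  uint32_t copy_size)
{
	uint32_t head_len;

	assert(frag_size <= size);	/* frags cannot exceed the packet */
	head_len = size - frag_size;
	return copy_size < head_len ? copy_size : head_len;
}
```

With copy_size = 100 the old expression yields 100 - 8936, which is
negative and takes the early -ENOSPC exit, while the new one yields 64
and leaves the remaining 36 bytes to come from the first fragment.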
Thanks,
Sun Jian
* [PATCH] rocker: Fix memory leak in ofdpa_port_fdb()
From: Ziran Zhang @ 2026-06-16 1:32 UTC (permalink / raw)
To: Jiri Pirko, Andrew Lunn, David S . Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni
Cc: Ziran Zhang, netdev, linux-kernel
In ofdpa_port_fdb(), the hash_del() only unlinks the node from
hash table, but does not free it.
Fix this by adding kfree(found) after the !found == removing check,
where the pointer value is no longer needed.
Found by Coccinelle kfree script.
Signed-off-by: Ziran Zhang <zhangcoder@yeah.net>
---
drivers/net/ethernet/rocker/rocker_ofdpa.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/net/ethernet/rocker/rocker_ofdpa.c b/drivers/net/ethernet/rocker/rocker_ofdpa.c
index 66a8ae67c..15d19a8a1 100644
--- a/drivers/net/ethernet/rocker/rocker_ofdpa.c
+++ b/drivers/net/ethernet/rocker/rocker_ofdpa.c
@@ -1924,6 +1924,9 @@ static int ofdpa_port_fdb(struct ofdpa_port *ofdpa_port,
flags |= OFDPA_OP_FLAG_REFRESH;
}
+ if (found && removing)
+ kfree(found);
+
return ofdpa_port_fdb_learn(ofdpa_port, flags, addr, vlan_id);
}
--
2.43.0
* Re: [PATCH net-next v5 0/5] ionic: Expose more port stats to ethtool
From: patchwork-bot+netdevbpf @ 2026-06-16 1:30 UTC (permalink / raw)
To: Eric Joyner
Cc: netdev, brett.creeley, andrew+netdev, davem, edumazet, kuba,
pabeni, nikhil.rao, horms
In-Reply-To: <20260614205303.48088-1-eric.joyner@amd.com>
Hello:
This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Sun, 14 Jun 2026 13:52:58 -0700 you wrote:
> The primary aim of this patchset is to support the reporting of new port
> statistics (and one old one) that firmware sends to the driver; these
> include general FEC codeword stats and the FEC histogram. A scheme for
> these extra stats is introduced in order to prevent devices that don't
> support these new statistics from unconditionally setting them or
> reporting them in ethtool.
>
> [...]
Here is the summary with links:
- [net-next,v5,1/5] ionic: Fix check in ionic_get_link_ext_stats
https://git.kernel.org/netdev/net-next/c/7678e69079c1
- [net-next,v5,2/5] ionic: Update ionic_if.h with new extra port stats
https://git.kernel.org/netdev/net-next/c/433bc5008149
- [net-next,v5,3/5] ionic: Report "rx_bits_phy" stat to ethtool
https://git.kernel.org/netdev/net-next/c/cb683ff6ee07
- [net-next,v5,4/5] ionic: Get "link_down_count" ext link stat from firmware
https://git.kernel.org/netdev/net-next/c/3277e605ac01
- [net-next,v5,5/5] ionic: Add .get_fec_stats ethtool handler
(no matching commit)
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
* Re: [PATCH net-next] eth: ionic: delete the incorrect link_down_count reporting
From: Jakub Kicinski @ 2026-06-16 1:28 UTC (permalink / raw)
To: Creeley, Brett
Cc: Breno Leitao, davem, netdev, edumazet, pabeni, andrew+netdev,
horms, brett.creeley, eric.joyner
In-Reply-To: <3193fb6f-d786-4b8a-ace3-d63264c95529@amd.com>
On Mon, 15 Jun 2026 10:29:54 -0700 Creeley, Brett wrote:
> >> memcpy_fromio(p + offset, idev->dev_cmd_regs->words, size);
> >> }
> >>
> >> -static void ionic_get_link_ext_stats(struct net_device *netdev,
> >> - struct ethtool_link_ext_stats *stats)
> >> -{
> >> - struct ionic_lif *lif = netdev_priv(netdev);
> >> -
> >> - if (lif->ionic->pdev->is_physfn)
> >> - stats->link_down_events = lif->link_down_count;
> > It seems this is the only place where link_down_count is read. Maybe you
> > want to kill it as well?
>
> It still shows in debugfs, but we'd be fine with it being removed
> completely in favor of the correct implementation from Eric at:
> https://lore.kernel.org/netdev/20260614205303.48088-5-eric.joyner@amd.com/.
TBH I missed that Eric posted v5 already and assumed he won't make it
before net-next is closed. Sorry for the noise...
--
pw-bot: reject
* Re: [PATCH net-next v5 5/5] ionic: Add .get_fec_stats ethtool handler
From: Jakub Kicinski @ 2026-06-16 1:27 UTC (permalink / raw)
To: Eric Joyner
Cc: netdev, Brett Creeley, Andrew Lunn, David S. Miller, Eric Dumazet,
Paolo Abeni, Nikhil P . Rao, Simon Horman
In-Reply-To: <20260614205303.48088-6-eric.joyner@amd.com>
On Sun, 14 Jun 2026 13:53:03 -0700 Eric Joyner wrote:
> + if (fec_cw_err_bin != IONIC_STAT_INVALID)
> + hist->values[i].sum = le64_to_cpu(fec_cw_err_bin);
> + else
> + hist->values[i].sum = 0;
Setting the sum to zero is very much against the API contract with
ethtool, no? Since we are just digging ourselves out of a hole with
the link down events maybe let's be more strict going forward. :S
I think mlx5 had a similar problem of not knowing bucket count.
You can put the bucket table in some driver struct and populate it
at runtime.
Looking at this scheme, tho, I think the ethtool core is buggy for
mlx5 :/ We should make a copy of the hist table, because we write
the Netlink attrs after releasing the locks. Another call may start
already and make the driver write its table in parallel.
Please send a fix for that if you can, or LMK if not, I'll chase
one of the people involved in adding the fec stats.
I'll apply the first 4 patches here in the meantime.
* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Askar Safin @ 2026-06-16 1:15 UTC (permalink / raw)
To: joannelkoong
Cc: akpm, axboe, bernd, brauner, david, dhowells, fuse-devel, hch,
jack, linux-api, linux-fsdevel, linux-kernel, linux-mm, miklos,
netdev, patches, pfalcato, rostedt, safinaskar, torvalds, val,
viro, willy
In-Reply-To: <CAJnrk1Y9egYizkx1H9K0cqxSYuB+7vLvQbV7Tf4C5eHFqnnC-A@mail.gmail.com>
Joanne Koong <joannelkoong@gmail.com>:
> > speaking of fuse_dev_splice……_write actually, this series has broken
> > xdg-document-portal!
> >
> > https://github.com/flatpak/xdg-desktop-portal/issues/2026
> >
> > Specifically what happens is that the EINVAL is returned due to oh.len
> > != nbytes:
> >
> > fuse_dev_do_write: oh.len 16400 != nbytes 15526
> >
> > (where 16400 == 16384 (read len) + 16, 15526 == 15510 (file len) + 16)
> >
> > After reverting the series, there is no error because oh.len
> > becomes 15526 too.
>
> I think this is because of how libfuse handles eof / short reads. When
> it detects a short read, it fixes up the header length after the
> header was already vmspliced to the pipe because it assumes vmsplice
> mapped the header's page into the pipe by reference. It assumes that
> modifying the header length in place gets then reflected in what the
> pipe later splices out.
>
> The logic for this happens in fuse_send_data_iov() [1]:
> a) sets out->len = headerlen (16) + len (16384) = 16400 in the
> stack-allocated fuse_out_header
> b) vmsplices the header to the pipe
> c) splices the backing file to the pipe. if this hits EOF, it'll get
> back 15510 instead of 16384
> d) detects the short read [2], fixes up the stack out->len = 16 + 15510 = 15526
> e) splices the pipe to /dev/fuse
>
> After this patch, step b) is a straight copy which means step d)'s
> fixup doesn't modify what's in the pipe. This could be fixed up in
> libfuse to not depend on modify-after-vmsplice, but I don't think this
> helps for applications using already-released libfuse versions. I
> think this patch needs to be reverted.
>
> Thanks,
> Joanne
>
> [1] https://github.com/libfuse/libfuse/blob/master/lib/fuse_lowlevel.c#L846
> [2] https://github.com/libfuse/libfuse/blob/master/lib/fuse_lowlevel.c#L956
Uh, this is very unfortunate. But I still want to remove vmsplice.
Maybe we can somehow save my patchsets? For example, let's return EINVAL
for this particular combination (writable pipe + SPLICE_F_NONBLOCK).
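
To make the difference concrete, steps a) to e) quoted above can be
modelled in plain C. This is a toy sketch, not libfuse or kernel code:
struct out_header is a simplified stand-in for the 16-byte
fuse_out_header, and struct pipe_slot, slot_fill(), slot_len() and
send_with_short_read() are hypothetical names invented for the
illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for the 16-byte struct fuse_out_header. */
struct out_header {
	uint32_t len;
	int32_t error;
	uint64_t unique;
};

/* Toy pipe slot: either references the caller's memory or owns a copy. */
struct pipe_slot {
	const struct out_header *ref;	/* used in by-reference mode */
	struct out_header copy;		/* used in by-copy mode */
	int by_ref;
};

static void slot_fill(struct pipe_slot *slot, const struct out_header *hdr,
		      int by_ref)
{
	slot->by_ref = by_ref;
	slot->ref = hdr;
	memcpy(&slot->copy, hdr, sizeof(*hdr));
}

/* What the consumer (/dev/fuse in the real flow) would see as oh.len. */
static uint32_t slot_len(const struct pipe_slot *slot)
{
	return slot->by_ref ? slot->ref->len : slot->copy.len;
}

/* Steps a) to e), with the backing file hitting EOF at step c). */
static uint32_t send_with_short_read(int by_ref)
{
	const uint32_t hdr_len = sizeof(struct out_header);
	const uint32_t want = 16384, got = 15510;
	struct out_header out = { .len = hdr_len + want };	/* a) 16400 */
	struct pipe_slot slot;

	assert(hdr_len == 16);
	slot_fill(&slot, &out, by_ref);		/* b) header enters the pipe */
	if (got < want)				/* c) short read detected */
		out.len = hdr_len + got;	/* d) fix-up to 15526 */
	return slot_len(&slot);			/* e) what /dev/fuse sees */
}
```

In this model send_with_short_read(1) sees the fixed-up 15526, while
send_with_short_read(0) still sees the stale 16400, which matches the
"oh.len 16400 != nbytes 15526" error reported above.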
--
Askar Safin
* Re: [PATCH net] net: dsa: Fix skb ownership in taggers
From: Jakub Kicinski @ 2026-06-16 1:01 UTC (permalink / raw)
To: Linus Walleij
Cc: Andrew Lunn, Vladimir Oltean, David S. Miller, Eric Dumazet,
Paolo Abeni, Simon Horman, Florian Fainelli, Jonas Gorski,
Hauke Mehrtens, Kurt Kanzenbach, Woojung Huh, UNGLinuxDriver,
Chester A. Unal, Daniel Golle, Matthias Brugger,
AngeloGioacchino Del Regno, Wei Fang, Clark Wang,
Clément Léger, George McCollister, David Yang, netdev,
Sashiko AI Review
In-Reply-To: <20260616-dsa-fix-free-skb-v1-1-fd30b35dcf66@kernel.org>
On Tue, 16 Jun 2026 00:33:00 +0200 Linus Walleij wrote:
> 24 files changed, 243 insertions(+), 91 deletions(-)
Impressive. Thanks a lot for doing this.
patchwork says it doesn't apply to net. Is it on top of net or net-next?
Since the merge window started already net-next is probably better but
you need to designate in the subject correctly. Feel free to repost
without the 24h wait, maybe we can still slip this into our main PR.
--
pw-bot: cr
* Re: [PATCH net-next 0/2] appletalk: move the protocol out of tree
From: Jakub Kicinski @ 2026-06-16 0:55 UTC (permalink / raw)
To: John Paul Adrian Glaubitz
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms, geert,
chleroy, npiggin, mpe, maddy, linux-mips, linux-m68k,
linuxppc-dev
In-Reply-To: <c3789160609a10e232ba3e27c4b13adbb170956c.camel@physik.fu-berlin.de>
On Tue, 16 Jun 2026 01:34:06 +0200 John Paul Adrian Glaubitz wrote:
> On Mon, 2026-06-15 at 15:29 -0700, Jakub Kicinski wrote:
> > This tiny series moves appletalk out of tree, to:
> >
> > https://github.com/linux-netdev/mod-orphan
> >
> > Core maintainers are unable to keep up with the rate of security
> > bug reports and fixes. Nobody seems to care about appletalk enough
> > to review the patches.
>
> Why would fixing these vulnerabilities be relevant? No one is going to
> expose an Apple Talk server to an untrusted network, are they? The same
> applies to hamradio and AX.25, they are all used by hobbyists in DMZ
> networks, so no one really cares about vulnerabilities in these protocols.
>
> I find it sad that AI tools are basically used to shoot at the kernel
> to kill off features as some people are apparently getting scared by
> these AI reports and just nuke everything in a panic reaction as if it
> wouldn't just be possible to disable these protocols at compile time
> to reduce the attack surface.
>
> > As Eric pointed out Mac OS dropped AppleTalk over a decade ago.
>
> That's not the point though. No one is going to use AppleTalk to network
> a Linux box to a modern macOS machine. The usefulness lies in hooking up
> a Linux box to a vintage Mac or other retro computer.
>
> So far, one of the huge advantages of open source operating systems has
> always been that even niche use cases were supported and people could make
> use of old hardware by using open source operating systems over commercial
> offerings such as Windows or macOS.
>
> With the advent of AI security reports, these niche use cases are more and
> more being killed off with the argument that a vulnerability in the hamradio
> code could pose a threat to a large SAP database running on a Linux enterprise
> distribution. However, if your enterprise distribution is enabling kernel
> features their customers aren't using and therefore enlarging the attack surface,
> it's more a problem of said enterprise distribution and not of these old and
> obscure network protocols.
>
> I am trying my best to save as many classic features in the kernel as possible
> to enable retro computing but I am sometimes fearing that commercial interest
> in the kernel is taking over too much making my efforts harder every day.
We can complain about the AI slop till the cows come home.
I don't like it, you don't like it. What difference does it make?
If y'all have real solutions please share. Complaining about
"commercial interests" and "nuk[ing] everything in a panic reaction"
is not helpful.
* Re: [PATCH net-next 8/8] net: mdio: realtek-rtl9300: Add support for RTL839x
From: Jakub Kicinski @ 2026-06-16 0:37 UTC (permalink / raw)
To: markus.stockhausen
Cc: Jakub Kicinski, andrew, hkallweit1, linux, davem, edumazet,
pabeni, netdev, chris.packham, daniel, robh, krzk+dt, conor+dt,
devicetree
In-Reply-To: <20260613112946.1071411-9-markus.stockhausen@gmx.de>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
Subject: net: mdio: realtek-rtl9300: Add support for RTL839x
This patch adds RTL839x (cypress) support to the realtek-rtl9300 MDIO
driver, reusing the multi-device infrastructure that was prepared for
this purpose. It introduces a new RTL8390 register set, four c22/c45
read/write helpers, and a new otto_emdio_8390_info compatible table
entry.
> diff --git a/drivers/net/mdio/mdio-realtek-rtl9300.c b/drivers/net/mdio/mdio-realtek-rtl9300.c
> index d9ff0b0aecbb..2ab1aeb85eed 100644
> --- a/drivers/net/mdio/mdio-realtek-rtl9300.c
> +++ b/drivers/net/mdio/mdio-realtek-rtl9300.c
> @@ -139,6 +139,29 @@
> #define RTL8380_SMI_POLL_CTRL 0xa17c
> #define RTL8380_SMI_PORT0_5_ADDR_CTRL 0xa1c8
>
> +#define RTL8390_NUM_BUSES 2
> +#define RTL8390_NUM_PAGES 8192
> +#define RTL8390_NUM_PORTS 52
> +#define RTL8390_BCAST_PHYID_CTRL 0x03ec
> +#define RTL8390_PHYREG_ACCESS_CTRL 0x03dc
> +#define RTL8390_PHY_CTRL_REG_ADDR GENMASK(9, 5)
> +#define RTL8390_PHY_CTRL_PARK_PAGE GENMASK(27, 23)
[Low]
Is RTL8390_PHY_CTRL_PARK_PAGE used anywhere? It does not appear to
be referenced in the rest of the patch, while the analogous masks for
the 8380 and 9300 variants are populated with 0x1f in their c22 helpers.
For example, otto_emdio_9300_read_c22() does:
.c22_data = FIELD_PREP(RTL9300_PHY_CTRL_REG_ADDR, regnum) |
FIELD_PREP(RTL9300_PHY_CTRL_PARK_PAGE, 0x1f) |
FIELD_PREP(RTL9300_PHY_CTRL_MAIN_PAGE, priv->page[port]),
while otto_emdio_8390_read_c22() below only sets REG_ADDR and MAIN_PAGE,
leaving bits 27..23 of c22_data at 0:
> +#define RTL8390_PHY_CTRL_MAIN_PAGE GENMASK(22, 10)
> +#define RTL8390_PHY_CTRL_FAIL BIT(1)
> +#define RTL8390_PHY_CTRL_WRITE BIT(3)
> +#define RTL8390_PHY_CTRL_READ 0
> +#define RTL8390_PHY_CTRL_TYPE_C45 BIT(2)
> +#define RTL8390_PHY_CTRL_TYPE_C22 0
[ ... ]
> @@ -457,6 +480,62 @@ static int otto_emdio_8380_write_c45(struct mii_bus *bus, int port,
> return otto_emdio_write_cmd(bus, RTL8380_PHY_CTRL_TYPE_C45, &cmd_data);
> }
>
> +static int otto_emdio_8390_read_c22(struct mii_bus *bus, int port, int regnum, u32 *value)
> +{
> + struct otto_emdio_priv *priv = otto_emdio_bus_to_priv(bus);
> + struct otto_emdio_cmd_regs cmd_data = {
> + .c22_data = FIELD_PREP(RTL8390_PHY_CTRL_REG_ADDR, regnum) |
> + FIELD_PREP(RTL8390_PHY_CTRL_MAIN_PAGE, priv->page[port]),
> + .ext_page = FIELD_PREP(RTL8390_PHY_CTRL_EXT_PAGE, 0x1ff),
> + .io_data = FIELD_PREP(RTL8390_PHY_CTRL_INDATA, port),
> + };
> +
> + return otto_emdio_read_cmd(bus, RTL8390_PHY_CTRL_TYPE_C22, &cmd_data,
> + RTL8390_PHY_CTRL_DATA, value);
> +}
> +
> +static int otto_emdio_8390_write_c22(struct mii_bus *bus, int port, int regnum, u16 value)
> +{
> + struct otto_emdio_priv *priv = otto_emdio_bus_to_priv(bus);
> + struct otto_emdio_cmd_regs cmd_data = {
> + .c22_data = FIELD_PREP(RTL8390_PHY_CTRL_REG_ADDR, regnum) |
> + FIELD_PREP(RTL8390_PHY_CTRL_MAIN_PAGE, priv->page[port]),
Should these helpers also set RTL8390_PHY_CTRL_PARK_PAGE to 0x1f, matching
the 8380/9300 convention, or is the macro itself intended to be removed
because the 8390 hardware does not need a parking page?
> + .ext_page = FIELD_PREP(RTL8390_PHY_CTRL_EXT_PAGE, 0x1ff),
> + .io_data = FIELD_PREP(RTL8390_PHY_CTRL_INDATA, value),
> + .port_mask_high = (u32)(BIT_ULL(port) >> 32),
> + .port_mask_low = (u32)(BIT_ULL(port)),
> + };
> +
> + return otto_emdio_write_cmd(bus, RTL8390_PHY_CTRL_TYPE_C22, &cmd_data);
> +}
[ ... ]
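
The effect of leaving the park-page field unset can be checked with a
small userspace sketch. GENMASK32() and field_prep() below are
simplified stand-ins for the kernel's GENMASK() and FIELD_PREP(), and
the two c22_data_*() helpers are hypothetical. Only the bit positions
are taken from the RTL8390 defines in the patch. Whether the 8390
hardware actually needs a park page of 0x1f is not known here.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's GENMASK() and FIELD_PREP(). */
#define GENMASK32(h, l)	((uint32_t)((~0u >> (31 - (h))) & (~0u << (l))))

static uint32_t field_prep(uint32_t mask, uint32_t val)
{
	uint32_t shift = 0;

	assert(mask != 0);
	while (!((mask >> shift) & 1))	/* find the lowest set mask bit */
		shift++;
	return (val << shift) & mask;
}

/* Field layout copied from the RTL8390 defines in the patch. */
#define REG_ADDR	GENMASK32(9, 5)
#define MAIN_PAGE	GENMASK32(22, 10)
#define PARK_PAGE	GENMASK32(27, 23)

/* c22_data as the 8390 helpers in the patch build it: no park page. */
static uint32_t c22_data_without_park(uint32_t regnum, uint32_t page)
{
	return field_prep(REG_ADDR, regnum) | field_prep(MAIN_PAGE, page);
}

/* c22_data if the 8380/9300 convention (park page 0x1f) were followed. */
static uint32_t c22_data_with_park(uint32_t regnum, uint32_t page)
{
	return c22_data_without_park(regnum, page) |
	       field_prep(PARK_PAGE, 0x1f);
}
```

For register 2 on page 0 the first helper gives 0x00000040 and the
second 0x0f800040, so the two variants differ exactly in bits 27..23.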
* Re: [PATCH net-next 5/8] net: mdio: realtek-rtl9300: Add c45 over c22 mitigation
From: Jakub Kicinski @ 2026-06-16 0:37 UTC (permalink / raw)
To: markus.stockhausen
Cc: Jakub Kicinski, andrew, hkallweit1, linux, davem, edumazet,
pabeni, netdev, chris.packham, daniel, robh, krzk+dt, conor+dt,
devicetree
In-Reply-To: <20260613112946.1071411-6-markus.stockhausen@gmx.de>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
net: mdio: realtek-rtl9300: Add c45 over c22 mitigation
This patch adds a per-port state machine that disables hardware MAC polling
around a register 13/14/13/14 c45-over-c22 MMD access sequence so the
sequence cannot be interrupted by a parallel polling read. Polling is
disabled when the first MII_MMD_CTRL access is seen and re-enabled when
the four-step sequence completes or unexpected input is detected.
A few questions on the new state-machine corner cases below.
> diff --git a/drivers/net/mdio/mdio-realtek-rtl9300.c b/drivers/net/mdio/mdio-realtek-rtl9300.c
> index a7fd075947b6..e206ee3e2b1c 100644
> --- a/drivers/net/mdio/mdio-realtek-rtl9300.c
> +++ b/drivers/net/mdio/mdio-realtek-rtl9300.c
[ ... ]
> @@ -254,6 +257,43 @@ static int otto_emdio_set_port_polling(struct otto_emdio_priv *priv, int port, b
> BIT(port % 32), active);
> }
>
> +static int otto_emdio_mmd_prefix(struct otto_emdio_priv *priv, int port, int regnum)
> +{
> + u8 newstate, *state = &priv->mmd_state[port];
> + int expected, ret = 0;
> +
> + if (!priv->init_done)
> + return 0;
> + /*
> + * Disabled polling might produce link flapping and false notification interrupts on the
> + * MAC layer. In this case disable c45 over c22 MMD access because chances are high that
> + * the register 13/14/13/14 sequence is intercepted by a parallel hardware access. As
> + * a workaround the PHY must provide its own mmd read/write() callbacks and redirect to
> + * normal c22 registers. See rtlgen_read_mmd().
> + */
> + if (priv->info->link_flap)
> + return (regnum == MII_MMD_DATA || regnum == MII_MMD_CTRL) ? -EIO : 0;
> +
> + expected = (*state & 1) ? MII_MMD_DATA : MII_MMD_CTRL;
> + newstate = regnum == expected ? *state + 1 : 0;
> +
> + if (newstate == 1 || newstate < *state)
> + ret = otto_emdio_set_port_polling(priv, port, !newstate);
> + *state = newstate;
> +
> + return ret;
> +}
> +
> +static int otto_emdio_mmd_postfix(struct otto_emdio_priv *priv, int port, int regnum)
> +{
> + if (priv->mmd_state[port] != 4)
> + return 0;
> +
> + priv->mmd_state[port] = 0;
> +
> + return otto_emdio_set_port_polling(priv, port, true);
> +}
[High]
Can otto_emdio_mmd_prefix() and otto_emdio_mmd_postfix() get out of sync
on a transient MDIO failure?
The prefix transitions state 0->1 on the first MII_MMD_CTRL access and
calls otto_emdio_set_port_polling(port, false). In the c22 read/write
helpers below, if priv->info->read_c22() / write_c22() then returns an
error (otto_emdio_run_cmd() can return -ETIMEDOUT from
regmap_read_poll_timeout(), -ENXIO via cmd_fail, or any regmap_write()
error), the function returns immediately and never calls
otto_emdio_mmd_postfix().
Polling is left disabled, and *state stays non-zero.
On subsequent retries with regnums that still match the parity-based
expected value (the natural retry shape for an EEE 13/14/13/14
sequence), mmd_prefix keeps doing *state + 1 unbounded (state walks
through 5, 6, 7, ...). mmd_postfix only re-enables polling when
mmd_state == 4 exactly:
if (priv->mmd_state[port] != 4)
return 0;
so once the counter overshoots 4 it stays disabled until either an
unrelated regnum forces newstate < *state in mmd_prefix, or the u8
wraps at 255->0.
The commit message says polling is re-enabled "as soon as the sequence
is finished or any other unexpected input is detected", but a transient
MDIO failure mid-sequence is neither.
Should the error paths in otto_emdio_read_c22() / otto_emdio_write_c22()
reset mmd_state and re-enable polling on a failed transfer?
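
One way to reproduce the overshoot is a userspace model of the state
machine. The sketch below copies only the transition rules of
otto_emdio_mmd_prefix() and otto_emdio_mmd_postfix() as quoted above
and leaves out the init_done and link_flap checks. struct port_model
and the model_*() helpers are hypothetical, and the failure injection
is an assumption about how a transient MDIO error would surface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MII_MMD_CTRL	13
#define MII_MMD_DATA	14

/* Userspace model of one port; mmd_state mirrors the patch. */
struct port_model {
	uint8_t mmd_state;
	bool polling;
};

/* Same transition rule as otto_emdio_mmd_prefix(). */
static void model_prefix(struct port_model *p, int regnum)
{
	int expected = (p->mmd_state & 1) ? MII_MMD_DATA : MII_MMD_CTRL;
	uint8_t newstate = regnum == expected ? p->mmd_state + 1 : 0;

	if (newstate == 1 || newstate < p->mmd_state)
		p->polling = !newstate;
	p->mmd_state = newstate;
}

/* Same rule as otto_emdio_mmd_postfix(): only state 4 re-enables. */
static void model_postfix(struct port_model *p)
{
	if (p->mmd_state != 4)
		return;
	p->mmd_state = 0;
	p->polling = true;
}

/* One c22 access; a failed transfer returns before the postfix runs. */
static int model_access(struct port_model *p, int regnum, bool transfer_ok)
{
	assert(regnum >= 0 && regnum < 32);
	model_prefix(p, regnum);
	if (!transfer_ok)
		return -1;
	model_postfix(p);
	return 0;
}

/* Run 13/14/13/14; fail_step (1..4) marks the failing access, 0 = none.
 * The sequence stops at the first failed access.
 */
static void model_sequence(struct port_model *p, int fail_step)
{
	static const int regs[4] = { MII_MMD_CTRL, MII_MMD_DATA,
				     MII_MMD_CTRL, MII_MMD_DATA };
	int i;

	for (i = 0; i < 4; i++)
		if (model_access(p, regs[i], i + 1 != fail_step))
			return;
}
```

In this model a failure on the fourth access leaves mmd_state at 4 with
polling disabled. A following clean 13/14/13/14 sequence then walks the
counter through 5..8 without ever re-enabling polling, and only an
unrelated register access resets it.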
> @@ -463,7 +503,15 @@ static int otto_emdio_read_c22(struct mii_bus *bus, int phy_id, int regnum)
> if (regnum == 31)
> return priv->page[port];
>
> + ret = otto_emdio_mmd_prefix(priv, port, regnum);
> + if (ret)
> + return ret;
> +
> ret = priv->info->read_c22(bus, port, regnum, &value);
> + if (ret)
> + return ret;
> +
> + ret = otto_emdio_mmd_postfix(priv, port, regnum);
> }
>
> return ret ? ret : value;
[Medium]
The regnum == 31 page-select path returns from inside the scoped_guard
before reaching mmd_prefix() / mmd_postfix():
if (regnum == 31)
return priv->page[port];
If a page-register access occurs while mmd_state is mid-sequence (1, 2,
or 3), the counter is not reset and the polling-disable bit is not
restored, even though a page write between MMD steps is exactly the
"unexpected input" the commit message describes.
Recovery only happens on the next non-13/14 access that falls through
mmd_prefix and triggers the newstate < *state branch.
Should the regnum == 31 path also reconcile mmd_state and re-enable
polling when a sequence is in progress? The same question applies in
otto_emdio_write_c22().
[Low]
A separate observation about the read return path:
> return ret ? ret : value;
After a successful priv->info->read_c22() populates value, ret is
overwritten by mmd_postfix()'s return. If
otto_emdio_set_port_polling(priv, port, true) inside mmd_postfix()
fails, the function returns that error even though the data transfer
succeeded, so the PHY layer discards correctly read data. For
read-clear / latched status registers the hardware-side bits have
already been consumed.
Would it be preferable to treat a postfix failure as a polling-
bookkeeping issue (log and proceed) and still return the read value?
> @@ -487,7 +535,15 @@ static int otto_emdio_write_c22(struct mii_bus *bus, int phy_id, int regnum, u16
> return 0;
> }
>
> - return priv->info->write_c22(bus, port, regnum, value);
> + ret = otto_emdio_mmd_prefix(priv, port, regnum);
> + if (ret)
> + return ret;
> +
> + ret = priv->info->write_c22(bus, port, regnum, value);
> + if (ret)
> + return ret;
> +
> + return otto_emdio_mmd_postfix(priv, port, regnum);
> }
> }
The same shape exists for the write path: a postfix failure here turns
a successful write into a reported error and may cause an upper-layer
retry of an already-applied write.
* Re: [PATCH net-next 2/8] net: mdio: realtek-rtl9300: Add polling documentation
From: Jakub Kicinski @ 2026-06-16 0:37 UTC (permalink / raw)
To: markus.stockhausen
Cc: Jakub Kicinski, andrew, hkallweit1, linux, davem, edumazet,
pabeni, netdev, chris.packham, daniel, robh, krzk+dt, conor+dt,
devicetree
In-Reply-To: <20260613112946.1071411-3-markus.stockhausen@gmx.de>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
net: mdio: realtek-rtl9300: Add polling documentation
This commit adds a comment block to the rtl9300 MDIO driver describing
how the Realtek Otto switches' hardware polling unit interacts with the
MDIO bus and what this implies for kernel-initiated PHY accesses.
> diff --git a/drivers/net/mdio/mdio-realtek-rtl9300.c b/drivers/net/mdio/mdio-realtek-rtl9300.c
> index 892ed3780a65..da2864c94d2c 100644
> --- a/drivers/net/mdio/mdio-realtek-rtl9300.c
> +++ b/drivers/net/mdio/mdio-realtek-rtl9300.c
> @@ -35,6 +35,72 @@
[ ... ]
> + * phy_read(phy, 5); phy_read(phy, 14);
> + * phy_read(phy, 6);
> + *
> + * The c22 over c45 register 13/14 sequences read MDIO_AN_EEE_ADV and MDIO_AN_EEE_LPABLE. As soon
> + * as one PHY status is read, the polling engine goes over to the next PHY. Basically the bus is
> + * always busy and the MAC status is updated in realtime.
[ ... ]
> + * - A c45 over c22 kernel access sequence is most likely to fail because chances are high that
> + * the polling engine overwrites registers 13/14 in between.
[Low]
The two paragraphs above describe the same mechanism — using Clause 22
transactions on registers 13 (MII_MMD_CTRL) and 14 (MII_MMD_DATA) to
indirectly access Clause 45 MMD registers — but use opposite wordings:
"c22 over c45" in the first paragraph and "c45 over c22" in the second.
The example writes shown earlier in the same comment (phy_write(phy, 13,
0x7); phy_write(phy, 14, 0x3c); ...) are the standard C22 indirect access
to C45 MMD registers, conventionally referred to in the kernel as
"C45 over C22".
Since the stated goal of this comment is to simplify future development
and review, would it make sense to use the same phrasing in both places
(for instance "C45 over C22" in both)?
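For reference, the indirect access sequence under discussion can be modelled in a few lines of user-space C. This is a sketch under stated assumptions: struct mock_phy and the c22_read()/c22_write() helpers are a hypothetical model of a PHY, the MMD space is truncated to 64 registers per device, and post-increment modes are not modelled. The register numbers and the 0x4000 "data, no post increment" function code match the kernel's MII_MMD_CTRL, MII_MMD_DATA and MII_MMD_CTRL_NOINCR definitions.

```c
#include <stdint.h>

#define MII_MMD_CTRL        0x0d    /* register 13 */
#define MII_MMD_DATA        0x0e    /* register 14 */
#define MII_MMD_CTRL_NOINCR 0x4000  /* data function, no post increment */

/* Hypothetical PHY model, only large enough for the example. */
struct mock_phy {
	uint16_t ctrl;          /* last value written to register 13 */
	uint16_t addr[32];      /* latched register address per MMD device */
	uint16_t mmd[32][64];   /* mmd[devad][reg] */
};

static void c22_write(struct mock_phy *p, int reg, uint16_t val)
{
	int devad = p->ctrl & 0x1f;

	if (reg == MII_MMD_CTRL)
		p->ctrl = val;
	else if (reg == MII_MMD_DATA && (p->ctrl & 0xc000))
		p->mmd[devad][p->addr[devad] % 64] = val;  /* data function */
	else if (reg == MII_MMD_DATA)
		p->addr[devad] = val;                      /* address function */
}

static uint16_t c22_read(struct mock_phy *p, int reg)
{
	int devad = p->ctrl & 0x1f;

	if (reg == MII_MMD_CTRL)
		return p->ctrl;
	if (reg == MII_MMD_DATA && (p->ctrl & 0xc000))
		return p->mmd[devad][p->addr[devad] % 64];
	if (reg == MII_MMD_DATA)
		return p->addr[devad];
	return 0;
}

/* C45 over C22: reach an MMD register using only Clause 22 accesses. */
static uint16_t mmd_read_indirect(struct mock_phy *p, int devad, uint16_t reg)
{
	c22_write(p, MII_MMD_CTRL, devad);                        /* select MMD, address function */
	c22_write(p, MII_MMD_DATA, reg);                          /* latch the register address */
	c22_write(p, MII_MMD_CTRL, devad | MII_MMD_CTRL_NOINCR);  /* switch to data function */
	return c22_read(p, MII_MMD_DATA);                         /* read the MMD register */
}

/* Same sequence, but a hypothetical poller runs its own complete access
 * after the address has been latched, clobbering registers 13/14.
 */
static uint16_t mmd_read_interrupted(struct mock_phy *p, int devad,
				     uint16_t reg, uint16_t poll_reg)
{
	c22_write(p, MII_MMD_CTRL, devad);
	c22_write(p, MII_MMD_DATA, reg);
	(void)mmd_read_indirect(p, devad, poll_reg);
	c22_write(p, MII_MMD_CTRL, devad | MII_MMD_CTRL_NOINCR);
	return c22_read(p, MII_MMD_DATA);
}
```

In this model, an interrupted read of register 0x3c in MMD 7 returns the contents of whichever register the poller addressed last, which is the failure mode the patch comment warns about for kernel-initiated accesses.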
* Re: [PATCH net-next 1/8] dt-bindings: net: realtek,rtl9301-mdio: Add RTL83xx series
From: Jakub Kicinski @ 2026-06-16 0:37 UTC (permalink / raw)
To: markus.stockhausen
Cc: Jakub Kicinski, andrew, hkallweit1, linux, davem, edumazet,
pabeni, netdev, chris.packham, daniel, robh, krzk+dt, conor+dt,
devicetree
In-Reply-To: <20260613112946.1071411-2-markus.stockhausen@gmx.de>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
Subject: dt-bindings: net: realtek,rtl9301-mdio: Add RTL83xx series
This commit extends the realtek,rtl9301-mdio binding to cover the
lower-end Realtek Otto switch families: Maple (RTL8380/8381/8382) and
Cypress (RTL8391/8392/8393). New compatible strings are added for these
SoCs alongside the existing RTL9300/RTL9311 entries.
> diff --git a/Documentation/devicetree/bindings/net/realtek,rtl9301-mdio.yaml b/Documentation/devicetree/bindings/net/realtek,rtl9301-mdio.yaml
> index 271e05bae9c5..de33364b67ef 100644
> --- a/Documentation/devicetree/bindings/net/realtek,rtl9301-mdio.yaml
> +++ b/Documentation/devicetree/bindings/net/realtek,rtl9301-mdio.yaml
> @@ -12,6 +12,16 @@ maintainers:
> properties:
> compatible:
> oneOf:
> + - items:
> + - enum:
> + - realtek,rtl8381-mdio
> + - realtek,rtl8382-mdio
> + - const: realtek,rtl8380-mdio
> + - items:
> + - enum:
> + - realtek,rtl8392-mdio
> + - realtek,rtl8393-mdio
> + - const: realtek,rtl8391-mdio
[Low]
With this change the binding now describes three distinct SoC families
(RTL83xx Maple, RTL839x Cypress, and the existing RTL93xx), but the
file still carries:
title: Realtek RTL9300 MDIO Controller
Should the title (and the top-level description, if applicable) be
updated to reflect the broader scope now that RTL8380/8381/8382 and
RTL8391/8392/8393 are covered by the same schema?
> @@ -24,6 +34,8 @@ properties:
> - realtek,rtl9313-mdio
> - const: realtek,rtl9311-mdio
> - enum:
> + - realtek,rtl8380-mdio
> + - realtek,rtl8391-mdio
> - realtek,rtl9301-mdio
> - realtek,rtl9311-mdio
>
--
pw-bot: cr
* Re: [PATCH 2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Askar Safin @ 2026-06-16 0:36 UTC (permalink / raw)
To: agordeev
Cc: akpm, axboe, brauner, david, dhowells, hch, jack, linux-api,
linux-fsdevel, linux-kernel, linux-mm, linux-next, linux-s390,
miklos, netdev, patches, pfalcato, safinaskar, torvalds, viro,
willy
In-Reply-To: <20260608171917.3195488Afc-agordeev@linux.ibm.com>
Alexander Gordeev <agordeev@linux.ibm.com>:
> Hi All,
>
> This patch as commit e2c0b2368081b ("vmsplice: make vmsplice a trivial
> wrapper for preadv2/pwritev2") in linux-next on s390 causes the selftest
> tools/testing/selftests/mm/cow.c to hang:
>
> # [RUN] vmsplice() + unmap in child ... with PTE-mapped THP (128 kB)
>
> Recently there has been changes in THP area, so the problem is not
> necessary linked to this patch per se.
>
> Please, let me know if you need any additional information.
>
> Thanks!
As far as I understand, this test uses vmsplice to pin pages.
I.e., if my patch lands, this test should be rewritten to use
some other mechanism.
--
Askar Safin
* Re: [PATCH] swiotlb: avoid double copy with swiotlb on tx socket
From: Luigi Rizzo @ 2026-06-16 0:33 UTC (permalink / raw)
To: Jakub Kicinski
Cc: rizzo.unipi, m.szyprowski, robin.murphy, willemb, kuniyu, davem,
edumazet, pabeni, gregkh, rafael, akpm, david, netdev, linux-mm,
iommu, driver-core, linux-kernel
In-Reply-To: <20260615172535.080cf94f@kernel.org>
On Tue, Jun 16, 2026 at 2:25 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 15 Jun 2026 23:42:20 +0000 Luigi Rizzo wrote:
> > The use of swiotlb causes an extra data copy on I/O. For tx sockets,
> > especially with greedy senders, this has a high chance of happening in
> > the softirq handler for tx network interrupts, creating a significant
> > performance bottleneck.
>
> What's the use case? I associate swiotlb with debug / testing mostly,
> so it'd be useful for people like me to explain why you care.
Ah sorry, I forgot to mention.
swiotlb is used in guest kernels for confidential computing VMs.
Ordinary memory pages are encrypted and the host or devices
have no way to decrypt them, so the kernel must use
unencrypted bounce buffers to exchange data with I/O devices.
cheers
luigi
* [PATCH net 2/2] sctp: add INIT verification after cookie unpacking
From: Xin Long @ 2026-06-16 0:33 UTC (permalink / raw)
To: network dev, linux-sctp
Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Simon Horman,
Marcelo Ricardo Leitner
In-Reply-To: <cover.1781570014.git.lucien.xin@gmail.com>
In SCTP handshake, the INIT chunk is initially processed by the server
and embedded into the cookie carried in INIT-ACK. The client then
returns this cookie via COOKIE-ECHO, where the server unpacks it and
reconstructs the original INIT chunk.
When cookie authentication is enabled, the cookie contents are protected
against tampering, so reusing the unpacked INIT without re-verification
is safe.
However, when cookie authentication is disabled, the reconstructed INIT
can no longer be trusted. In this case, the INIT must be explicitly
validated after unpacking to avoid processing potentially tampered data.
Add sctp_verify_init() checks after cookie unpacking in COOKIE-ECHO
processing paths (sctp_sf_do_5_1D_ce() and sctp_sf_do_5_2_4_dupcook())
when cookie_auth_enable is disabled. On failure, the association is
freed and an ABORT is generated via sctp_abort_on_init_err().
Also update sctp_verify_init() to validate parameter bounds against the
actual peer_init length rather than chunk->chunk_end, since peer_init
may not span the full chunk buffer in COOKIE-ECHO.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
net/sctp/sm_make_chunk.c | 2 +-
net/sctp/sm_statefuns.c | 29 ++++++++++++++++++++++++++---
2 files changed, 27 insertions(+), 4 deletions(-)
diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
index 41958b8e59fd..21b9eb1c02e9 100644
--- a/net/sctp/sm_make_chunk.c
+++ b/net/sctp/sm_make_chunk.c
@@ -2298,7 +2298,7 @@ int sctp_verify_init(struct net *net, const struct sctp_endpoint *ep,
* VIOLATION error. We build the ERROR chunk here and let the normal
* error handling code build and send the packet.
*/
- if (param.v != (void *)chunk->chunk_end)
+ if (param.v != (void *)peer_init + ntohs(peer_init->chunk_hdr.length))
return sctp_process_inv_paramlength(asoc, param.p, chunk, errp);
/* The only missing mandatory param possible today is
diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
index 544f308ee527..a11a18678279 100644
--- a/net/sctp/sm_statefuns.c
+++ b/net/sctp/sm_statefuns.c
@@ -647,10 +647,10 @@ enum sctp_disposition sctp_sf_do_5_1D_ce(struct net *net,
struct sctp_cmd_seq *commands)
{
struct sctp_ulpevent *ev, *ai_ev = NULL, *auth_ev = NULL;
+ struct sctp_chunk *err_chk_p = NULL;
struct sctp_association *new_asoc;
struct sctp_init_chunk *peer_init;
struct sctp_chunk *chunk = arg;
- struct sctp_chunk *err_chk_p;
struct sctp_chunk *repl;
struct sock *sk;
int error = 0;
@@ -725,6 +725,17 @@ enum sctp_disposition sctp_sf_do_5_1D_ce(struct net *net,
}
}
+ peer_init = (struct sctp_init_chunk *)(chunk->subh.cookie_hdr + 1);
+ if (!sctp_sk(sk)->cookie_auth_enable &&
+ !sctp_verify_init(net, ep, asoc, peer_init->chunk_hdr.type,
+ peer_init, chunk, &err_chk_p)) {
+ sctp_association_free(new_asoc);
+ return sctp_abort_on_init_err(net, ep, asoc, chunk, arg,
+ commands, err_chk_p);
+ }
+ if (err_chk_p)
+ sctp_chunk_free(err_chk_p);
+
if (security_sctp_assoc_request(new_asoc, chunk->head_skb ?: chunk->skb)) {
sctp_association_free(new_asoc);
return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands);
@@ -738,7 +749,6 @@ enum sctp_disposition sctp_sf_do_5_1D_ce(struct net *net,
/* This is a brand-new association, so these are not yet side
* effects--it is safe to run them here.
*/
- peer_init = (struct sctp_init_chunk *)(chunk->subh.cookie_hdr + 1);
if (!sctp_process_init(new_asoc, chunk,
&chunk->subh.cookie_hdr->c.peer_addr,
peer_init, GFP_ATOMIC))
@@ -2128,10 +2138,11 @@ enum sctp_disposition sctp_sf_do_5_2_4_dupcook(
void *arg,
struct sctp_cmd_seq *commands)
{
+ struct sctp_chunk *err_chk_p = NULL;
struct sctp_association *new_asoc;
+ struct sctp_init_chunk *peer_init;
struct sctp_chunk *chunk = arg;
enum sctp_disposition retval;
- struct sctp_chunk *err_chk_p;
int error = 0;
char action;
@@ -2200,6 +2211,18 @@ enum sctp_disposition sctp_sf_do_5_2_4_dupcook(
switch (action) {
case 'A': /* Association restart. */
case 'B': /* Collision case B. */
+ peer_init = (struct sctp_init_chunk *)
+ (chunk->subh.cookie_hdr + 1);
+ if (!sctp_sk(ep->base.sk)->cookie_auth_enable &&
+ !sctp_verify_init(net, ep, asoc, peer_init->chunk_hdr.type,
+ peer_init, chunk, &err_chk_p)) {
+ sctp_association_free(new_asoc);
+ return sctp_abort_on_init_err(net, ep, asoc, chunk, arg,
+ commands, err_chk_p);
+ }
+ if (err_chk_p)
+ sctp_chunk_free(err_chk_p);
+ fallthrough;
case 'D': /* Collision case D. */
/* Update socket peer label if first association. */
if (security_sctp_assoc_request((struct sctp_association *)asoc,
--
2.47.1
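As an illustration of the bounds change in sctp_verify_init(), the sketch below walks TLV parameters inside a chunk and accepts the chunk only if the walk ends exactly at the length declared in the chunk's own header, not at the end of the enclosing buffer. It is a simplified user-space model, not the kernel's sctp_walk_params(): params_fill_chunk() and its arguments are hypothetical names, and the model assumes the declared chunk length is itself a multiple of 4.

```c
#include <stddef.h>
#include <stdint.h>

#define PAD4(x) (((x) + 3u) & ~(size_t)3u)

/* Read a big-endian 16-bit value. */
static uint16_t be16(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

/* buf:         start of the embedded chunk (type, flags, length, ...)
 * buf_len:     bytes actually available from buf onwards
 * first_param: offset of the first TLV parameter within the chunk
 *
 * Returns 1 if the parameters exactly fill the chunk up to the length
 * declared in its header, 0 otherwise.
 */
static int params_fill_chunk(const uint8_t *buf, size_t buf_len,
			     size_t first_param)
{
	size_t end, off;

	if (buf_len < 4)
		return 0;

	end = be16(buf + 2);    /* declared chunk length */
	if (end > buf_len || end < first_param)
		return 0;

	off = first_param;
	while (off + 4 <= end) {
		size_t plen = be16(buf + off + 2);  /* parameter length */

		/* A parameter shorter than its own header, or one that
		 * would cross the declared end, stops the walk early.
		 */
		if (plen < 4 || plen > end - off)
			break;
		off += PAD4(plen);
	}

	/* Trailing bytes in the enclosing buffer are deliberately ignored:
	 * only the chunk's own declared length is the bound.
	 */
	return off == end;
}
```

With a 28-byte INIT (20 bytes of headers plus one 8-byte parameter) embedded in a larger buffer, the walk succeeds; if the parameter claims 12 bytes or 2 bytes instead, the walk stops short of the declared end and the chunk is rejected.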
* [PATCH net 1/2] sctp: factor out INIT verification failure handling
From: Xin Long @ 2026-06-16 0:33 UTC (permalink / raw)
To: network dev, linux-sctp
Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Simon Horman,
Marcelo Ricardo Leitner
In-Reply-To: <cover.1781570014.git.lucien.xin@gmail.com>
Extract the duplicated INIT/INIT-ACK error handling logic into a new
helper, sctp_abort_on_init_err().
Several state functions open-code the same pattern after
sctp_verify_init() fails: construct an ABORT with error causes if
available, send it when allocation succeeds, or fall back to T-bit ABORT
handling when no error chunk is present. INIT-ACK handling also includes
additional teardown logic for malformed packets.
Move this logic into sctp_abort_on_init_err() to reduce duplication and
centralize INIT/INIT-ACK failure handling.
No functional change intended. The helper will be used in a subsequent
patch.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
net/sctp/sm_statefuns.c | 171 +++++++++++++++++-----------------------
1 file changed, 72 insertions(+), 99 deletions(-)
diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
index 9b23c11cbb9e..544f308ee527 100644
--- a/net/sctp/sm_statefuns.c
+++ b/net/sctp/sm_statefuns.c
@@ -68,6 +68,14 @@ static void sctp_send_stale_cookie_err(struct net *net,
const struct sctp_chunk *chunk,
struct sctp_cmd_seq *commands,
struct sctp_chunk *err_chunk);
+static enum sctp_disposition sctp_abort_on_init_err(
+ struct net *net,
+ const struct sctp_endpoint *ep,
+ const struct sctp_association *asoc,
+ const struct sctp_chunk *chunk,
+ void *arg,
+ struct sctp_cmd_seq *commands,
+ struct sctp_chunk *err_chunk);
static enum sctp_disposition sctp_sf_do_5_2_6_stale(
struct net *net,
const struct sctp_endpoint *ep,
@@ -325,7 +333,6 @@ enum sctp_disposition sctp_sf_do_5_1B_init(struct net *net,
struct sctp_chunk *chunk = arg, *repl, *err_chunk;
struct sctp_unrecognized_param *unk_param;
struct sctp_association *new_asoc;
- struct sctp_packet *packet;
int len;
/* 6.10 Bundling
@@ -375,32 +382,9 @@ enum sctp_disposition sctp_sf_do_5_1B_init(struct net *net,
err_chunk = NULL;
if (!sctp_verify_init(net, ep, asoc, chunk->chunk_hdr->type,
(struct sctp_init_chunk *)chunk->chunk_hdr, chunk,
- &err_chunk)) {
- /* This chunk contains fatal error. It is to be discarded.
- * Send an ABORT, with causes if there is any.
- */
- if (err_chunk) {
- packet = sctp_abort_pkt_new(net, ep, asoc, arg,
- (__u8 *)(err_chunk->chunk_hdr) +
- sizeof(struct sctp_chunkhdr),
- ntohs(err_chunk->chunk_hdr->length) -
- sizeof(struct sctp_chunkhdr));
-
- sctp_chunk_free(err_chunk);
-
- if (packet) {
- sctp_add_cmd_sf(commands, SCTP_CMD_SEND_PKT,
- SCTP_PACKET(packet));
- SCTP_INC_STATS(net, SCTP_MIB_OUTCTRLCHUNKS);
- return SCTP_DISPOSITION_CONSUME;
- } else {
- return SCTP_DISPOSITION_NOMEM;
- }
- } else {
- return sctp_sf_tabort_8_4_8(net, ep, asoc, type, arg,
- commands);
- }
- }
+ &err_chunk))
+ return sctp_abort_on_init_err(net, ep, asoc, chunk, arg,
+ commands, err_chunk);
/* Grab the INIT header. */
chunk->subh.init_hdr = (struct sctp_inithdr *)chunk->skb->data;
@@ -525,7 +509,6 @@ enum sctp_disposition sctp_sf_do_5_1C_ack(struct net *net,
struct sctp_init_chunk *initchunk;
struct sctp_chunk *chunk = arg;
struct sctp_chunk *err_chunk;
- struct sctp_packet *packet;
if (!sctp_vtag_verify(chunk, asoc))
return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands);
@@ -548,50 +531,9 @@ enum sctp_disposition sctp_sf_do_5_1C_ack(struct net *net,
err_chunk = NULL;
if (!sctp_verify_init(net, ep, asoc, chunk->chunk_hdr->type,
(struct sctp_init_chunk *)chunk->chunk_hdr, chunk,
- &err_chunk)) {
-
- enum sctp_error error = SCTP_ERROR_NO_RESOURCE;
-
- /* This chunk contains fatal error. It is to be discarded.
- * Send an ABORT, with causes. If there are no causes,
- * then there wasn't enough memory. Just terminate
- * the association.
- */
- if (err_chunk) {
- packet = sctp_abort_pkt_new(net, ep, asoc, arg,
- (__u8 *)(err_chunk->chunk_hdr) +
- sizeof(struct sctp_chunkhdr),
- ntohs(err_chunk->chunk_hdr->length) -
- sizeof(struct sctp_chunkhdr));
-
- sctp_chunk_free(err_chunk);
-
- if (packet) {
- sctp_add_cmd_sf(commands, SCTP_CMD_SEND_PKT,
- SCTP_PACKET(packet));
- SCTP_INC_STATS(net, SCTP_MIB_OUTCTRLCHUNKS);
- error = SCTP_ERROR_INV_PARAM;
- }
- }
-
- /* SCTP-AUTH, Section 6.3:
- * It should be noted that if the receiver wants to tear
- * down an association in an authenticated way only, the
- * handling of malformed packets should not result in
- * tearing down the association.
- *
- * This means that if we only want to abort associations
- * in an authenticated way (i.e AUTH+ABORT), then we
- * can't destroy this association just because the packet
- * was malformed.
- */
- if (sctp_auth_recv_cid(SCTP_CID_ABORT, asoc))
- return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands);
-
- SCTP_INC_STATS(net, SCTP_MIB_ABORTEDS);
- return sctp_stop_t1_and_abort(net, commands, error, ECONNREFUSED,
- asoc, chunk->transport);
- }
+ &err_chunk))
+ return sctp_abort_on_init_err(net, ep, asoc, chunk, arg,
+ commands, err_chunk);
/* Tag the variable length parameters. Note that we never
* convert the parameters in an INIT chunk.
@@ -1522,7 +1464,6 @@ static enum sctp_disposition sctp_sf_do_unexpected_init(
struct sctp_unrecognized_param *unk_param;
struct sctp_association *new_asoc;
enum sctp_disposition retval;
- struct sctp_packet *packet;
int len;
/* 6.10 Bundling
@@ -1566,31 +1507,9 @@ static enum sctp_disposition sctp_sf_do_unexpected_init(
err_chunk = NULL;
if (!sctp_verify_init(net, ep, asoc, chunk->chunk_hdr->type,
(struct sctp_init_chunk *)chunk->chunk_hdr, chunk,
- &err_chunk)) {
- /* This chunk contains fatal error. It is to be discarded.
- * Send an ABORT, with causes if there is any.
- */
- if (err_chunk) {
- packet = sctp_abort_pkt_new(net, ep, asoc, arg,
- (__u8 *)(err_chunk->chunk_hdr) +
- sizeof(struct sctp_chunkhdr),
- ntohs(err_chunk->chunk_hdr->length) -
- sizeof(struct sctp_chunkhdr));
-
- if (packet) {
- sctp_add_cmd_sf(commands, SCTP_CMD_SEND_PKT,
- SCTP_PACKET(packet));
- SCTP_INC_STATS(net, SCTP_MIB_OUTCTRLCHUNKS);
- retval = SCTP_DISPOSITION_CONSUME;
- } else {
- retval = SCTP_DISPOSITION_NOMEM;
- }
- goto cleanup;
- } else {
- return sctp_sf_tabort_8_4_8(net, ep, asoc, type, arg,
- commands);
- }
- }
+ &err_chunk))
+ return sctp_abort_on_init_err(net, ep, asoc, chunk, arg,
+ commands, err_chunk);
/*
* Other parameters for the endpoint SHOULD be copied from the
@@ -1691,7 +1610,6 @@ static enum sctp_disposition sctp_sf_do_unexpected_init(
nomem_retval:
if (new_asoc)
sctp_association_free(new_asoc);
-cleanup:
if (err_chunk)
sctp_chunk_free(err_chunk);
return retval;
@@ -6485,6 +6403,61 @@ static void sctp_send_stale_cookie_err(struct net *net,
}
}
+static enum sctp_disposition sctp_abort_on_init_err(
+ struct net *net,
+ const struct sctp_endpoint *ep,
+ const struct sctp_association *asoc,
+ const struct sctp_chunk *chunk,
+ void *arg,
+ struct sctp_cmd_seq *commands,
+ struct sctp_chunk *err_chunk)
+{
+ enum sctp_error error = SCTP_ERROR_NO_RESOURCE;
+ struct sctp_packet *packet;
+ struct sctp_chunkhdr *ch;
+
+ if (!err_chunk)
+ return sctp_sf_tabort_8_4_8(net, ep, asoc, SCTP_ST_CHUNK(0),
+ arg, commands);
+
+ ch = err_chunk->chunk_hdr;
+ packet = sctp_abort_pkt_new(net, ep, asoc, arg,
+ (__u8 *)ch + sizeof(*ch),
+ ntohs(ch->length) - sizeof(*ch));
+
+ sctp_chunk_free(err_chunk);
+
+ if (packet) {
+ sctp_add_cmd_sf(commands, SCTP_CMD_SEND_PKT,
+ SCTP_PACKET(packet));
+ SCTP_INC_STATS(net, SCTP_MIB_OUTCTRLCHUNKS);
+ error = SCTP_ERROR_INV_PARAM;
+ }
+
+ if (chunk->chunk_hdr->type != SCTP_CID_INIT_ACK) {
+ if (!packet)
+ return SCTP_DISPOSITION_NOMEM;
+ return SCTP_DISPOSITION_CONSUME;
+ }
+ /* SCTP-AUTH, Section 6.3:
+ * It should be noted that if the receiver wants to tear
+ * down an association in an authenticated way only, the
+ * handling of malformed packets should not result in
+ * tearing down the association.
+ *
+ * This means that if we only want to abort associations
+ * in an authenticated way (i.e AUTH+ABORT), then we
+ * can't destroy this association just because the packet
+ * was malformed.
+ */
+ if (sctp_auth_recv_cid(SCTP_CID_ABORT, asoc))
+ return sctp_sf_pdiscard(net, ep, asoc, SCTP_ST_CHUNK(0), arg,
+ commands);
+
+ SCTP_INC_STATS(net, SCTP_MIB_ABORTEDS);
+ return sctp_stop_t1_and_abort(net, commands, error, ECONNREFUSED,
+ asoc, chunk->transport);
+}
/* Process a data chunk */
static int sctp_eat_data(const struct sctp_association *asoc,
--
2.47.1
* [PATCH net 0/2] sctp: validate INIT in COOKIE-ECHO when auth disabled
From: Xin Long @ 2026-06-16 0:33 UTC (permalink / raw)
To: network dev, linux-sctp
Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Simon Horman,
Marcelo Ricardo Leitner
This series fixes a security gap in SCTP's COOKIE-ECHO handling when
cookie authentication is disabled.
Currently, INIT chunks embedded in cookies are not re-verified after
unpacking, creating a vulnerability when cookie_auth_enable=0. This
series first refactors error handling, then adds the missing validation.
Xin Long (2):
sctp: factor out INIT verification failure handling
sctp: add INIT verification after cookie unpacking
net/sctp/sm_make_chunk.c | 2 +-
net/sctp/sm_statefuns.c | 200 +++++++++++++++++++--------------------
2 files changed, 99 insertions(+), 103 deletions(-)
--
2.47.1
* Re: [PATCH net-next v10 0/2] net: mana: add ethtool private flag for full-page RX buffers
From: Jakub Kicinski @ 2026-06-16 0:33 UTC (permalink / raw)
To: Dipayaan Roy
Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov,
pavan.chebbi
In-Reply-To: <ajCXIpDVaVcUcQwd@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>
On Mon, 15 Jun 2026 17:21:54 -0700 Dipayaan Roy wrote:
> On Mon, Jun 15, 2026 at 01:42:47PM -0700, Jakub Kicinski wrote:
> > On Mon, 15 Jun 2026 12:25:53 -0700 Dipayaan Roy wrote:
> > > Just a gentle ping on this series. The approach was agreed upon, and it
> > > has picked up a few Reviewed-by tags as well.
> > >
> > > Please let me know if you need anything else from me, or if I should
> > > resend it to collect the tags.
> >
> > Don't recall now what the exact sequence was but pretty sure this
> > no longer applied after some other mana series was merged.
>
> I see, the net-next is closed now, I will rebase and resend this
> once it opens on June 29th.
Sorry for not flagging this sooner, IDK how it escaped the reply.
Maybe some mix of Jake's comments plus it not being applicable
later.
Not to deflect blame but y'all should coordinate better, the "no longer
applies" situation happens in mana a lot more often than with other
drivers :(
* [PATCH net] net: psample: fix info leak in PSAMPLE_ATTR_DATA
From: Jakub Kicinski @ 2026-06-16 0:30 UTC (permalink / raw)
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, Jakub Kicinski,
Weiming Shi, yotam.gi, jhs, jiri
psample open codes nla_put(), presumably to avoid wiping
the data with 0s just to overwrite it with packet data.
However, this open coding does not clear the padding:
each netlink attr is padded to 4B and data_len may
not be divisible by 4B.
Fixes: 6ae0a6286171 ("net: Introduce psample, a new genetlink channel for packet sampling")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
CC: yotam.gi@gmail.com
CC: jhs@mojatatu.com
CC: jiri@resnulli.us
---
net/psample/psample.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/net/psample/psample.c b/net/psample/psample.c
index 7763662036fb..c112e1f0ccac 100644
--- a/net/psample/psample.c
+++ b/net/psample/psample.c
@@ -476,15 +476,17 @@ void psample_sample_packet(struct psample_group *group,
goto error;
if (data_len) {
- int nla_len = nla_total_size(data_len);
+ int nla_len = nla_attr_size(data_len);
struct nlattr *nla;
nla = skb_put(nl_skb, nla_len);
nla->nla_type = PSAMPLE_ATTR_DATA;
- nla->nla_len = nla_attr_size(data_len);
+ nla->nla_len = nla_len;
if (skb_copy_bits(skb, 0, nla_data(nla), data_len))
goto error;
+
+ skb_put_zero(nl_skb, nla_padlen(data_len));
}
#ifdef CONFIG_INET
--
2.54.0
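The size arithmetic behind the psample fix can be checked in user space. The sketch below re-implements the three helpers with a sketch_ prefix (the prefixed names and sketch_put_attr() are hypothetical; the formulas follow the kernel's nla_attr_size(), nla_total_size() and nla_padlen(), with a 4-byte attribute header and 4-byte alignment) and then emulates the fixed sequence: header, payload, zeroed padding.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_NLA_HDRLEN 4  /* 16-bit length + 16-bit type */

/* Header plus payload, without trailing padding. */
static int sketch_attr_size(int payload)
{
	return SKETCH_NLA_HDRLEN + payload;
}

/* Header plus payload, rounded up to a multiple of 4. */
static int sketch_total_size(int payload)
{
	return (sketch_attr_size(payload) + 3) & ~3;
}

/* Number of padding bytes after the payload (0..3). */
static int sketch_padlen(int payload)
{
	return sketch_total_size(payload) - sketch_attr_size(payload);
}

/* Write one attribute into dst and return the bytes consumed.
 * dst must have room for sketch_total_size(len) bytes.
 */
static size_t sketch_put_attr(uint8_t *dst, uint16_t type,
			      const void *data, uint16_t len)
{
	uint16_t nla_len = (uint16_t)sketch_attr_size(len);

	memcpy(dst, &nla_len, sizeof(nla_len));      /* length excludes padding */
	memcpy(dst + 2, &type, sizeof(type));
	memcpy(dst + SKETCH_NLA_HDRLEN, data, len);  /* payload */
	memset(dst + nla_len, 0, (size_t)sketch_padlen(len));  /* clear the pad */

	return (size_t)sketch_total_size(len);
}
```

For a 5-byte payload the attribute length is 9, the total size is 12 and the pad is 3 bytes; without the final memset() those 3 bytes would keep whatever the destination buffer held before.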
* Re: [PATCH net-next v6 0/4] net: dsa: mxl862xx: SerDes ports
From: patchwork-bot+netdevbpf @ 2026-06-16 0:30 UTC (permalink / raw)
To: Daniel Golle
Cc: andrew, olteanv, davem, edumazet, kuba, pabeni, linux,
linux-kernel, netdev
In-Reply-To: <cover.1781319534.git.daniel@makrotopia.org>
Hello:
This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Sat, 13 Jun 2026 04:07:02 +0100 you wrote:
> Add support for the two SerDes PCS interfaces of the MxL862xx switch
> ICs, which can both either be used to connect PHYs or SFP cages, or as
> CPU port(s). 1000Base-X, 2500Base-X, 10GBase-R, 10GBase-KR, SGMII,
> QSGMII and USXGMII (single 10G or quad 2.5G) are supported.
>
> The firmware only added the API to directly control the PCS as of
> version 1.0.84, so the PCS features are gated behind a version check.
>
> [...]
Here is the summary with links:
- [net-next,v6,1/4] net: dsa: mxl862xx: store firmware version for feature gating
https://git.kernel.org/netdev/net-next/c/937137db7914
- [net-next,v6,2/4] net: dsa: mxl862xx: move phylink stubs to mxl862xx-phylink.c
https://git.kernel.org/netdev/net-next/c/c1034332003f
- [net-next,v6,3/4] net: dsa: mxl862xx: move API macros to mxl862xx-host.h
https://git.kernel.org/netdev/net-next/c/7dab839724ed
- [net-next,v6,4/4] net: dsa: mxl862xx: add support for SerDes ports
https://git.kernel.org/netdev/net-next/c/13ea4d4888c4
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
* Re: [PATCH v3] flow_dissector: fix uninit-value in __skb_flow_dissect() for ETH_ADDRS
From: Jakub Kicinski @ 2026-06-16 0:28 UTC (permalink / raw)
To: yun.zhou
Cc: Jakub Kicinski, davem, edumazet, pabeni, horms, qingfang.deng,
jiri, netdev, linux-kernel
In-Reply-To: <20260613113154.2550286-1-yun.zhou@windriver.com>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
flow_dissector: fix uninit-value in __skb_flow_dissect() for ETH_ADDRS
This patch addresses a KMSAN report by gating the read of eth_hdr(skb)
in __skb_flow_dissect() on skb->dev->hard_header_len being at least the
size of the ETH_ADDRS dissector key, zeroing the key otherwise. The
goal is to avoid reading uninitialized memory when a TUN device in
IFF_TUN (L3) mode passes a packet without a link-layer header to a
flower filter on a shared block.
Closes: https://syzkaller.appspot.com/bug?extid=fa2f5b1fb06147be5e16
> diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
> index 2a98f5fa74eb..de6ca9beee03 100644
> --- a/net/core/flow_dissector.c
> +++ b/net/core/flow_dissector.c
> @@ -1173,13 +1173,20 @@ bool __skb_flow_dissect(const struct net *net,
>
> if (dissector_uses_key(flow_dissector,
> FLOW_DISSECTOR_KEY_ETH_ADDRS)) {
> - struct ethhdr *eth = eth_hdr(skb);
> struct flow_dissector_key_eth_addrs *key_eth_addrs;
>
> key_eth_addrs = skb_flow_dissector_target(flow_dissector,
> FLOW_DISSECTOR_KEY_ETH_ADDRS,
> target_container);
> - memcpy(key_eth_addrs, eth, sizeof(*key_eth_addrs));
> + /* TC filter blocks can be shared across devices with
> + * different header lengths, so we cannot validate this
> + * when the filter is installed -- check at dissect time.
> + */
> + if (skb->dev &&
> + skb->dev->hard_header_len >= sizeof(*key_eth_addrs))
> + memcpy(key_eth_addrs, eth_hdr(skb), sizeof(*key_eth_addrs));
[Medium]
Is hard_header_len the right predicate to gate this read on?
The dissector key is ETH_ALEN+ETH_ALEN, so the gate accepts any device
with hard_header_len >= 12. IPoIB sets hard_header_len = IPOIB_HARD_LEN
with dev->type = ARPHRD_INFINIBAND, and ARPHRD_FDDI uses 21 bytes.
Wouldn't packets on those link types pass this check and have the first
12 bytes of a non-Ethernet L2 header copied into key_eth_addrs as if
they were the destination and source MAC addresses?
Would a check on dev->type == ARPHRD_ETHER (or dev->addr_len == ETH_ALEN)
be more appropriate, so that only true Ethernet framing is read into
the key?
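To make the difference between the two predicates concrete, here is a small user-space comparison. struct mock_dev and both gate functions are hypothetical; the MOCK_ARPHRD_* values mirror the kernel's ARPHRD_ETHER, ARPHRD_INFINIBAND and ARPHRD_NONE, and the header lengths used in the usage note are illustrative assumptions.

```c
#include <stdbool.h>
#include <stddef.h>

#define MOCK_ARPHRD_ETHER      1
#define MOCK_ARPHRD_INFINIBAND 32
#define MOCK_ARPHRD_NONE       0xFFFE

#define MOCK_ETH_ADDRS_LEN     12  /* destination MAC + source MAC */

/* Hypothetical subset of the fields the gate would look at. */
struct mock_dev {
	unsigned short type;
	unsigned short hard_header_len;
};

/* Gate as in the patch: any link layer with a long enough header. */
static bool gate_by_header_len(const struct mock_dev *dev)
{
	return dev && dev->hard_header_len >= MOCK_ETH_ADDRS_LEN;
}

/* Gate as suggested in the review: Ethernet framing only. */
static bool gate_by_type(const struct mock_dev *dev)
{
	return dev && dev->type == MOCK_ARPHRD_ETHER;
}
```

An Ethernet device (type 1, 14-byte header) passes both gates and a headerless tunnel device (type 0xFFFE, 0-byte header) fails both, but an InfiniBand-style device with a header of 12 bytes or more passes only the length-based gate.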
[Medium]
Does this gate behave correctly across act_mirred?
act_mirred reassigns skb->dev to the mirror/redirect target before the
target device's classifier path runs. Consider a packet originally
framed on an Ethernet device, with a valid mac_header in the linear
region, redirected to a headerless device such as TUN IFF_TUN that
shares a TC block with the source. The key_eth_addrs would now be
zeroed via the else branch even though the original Ethernet header
is still present in the skb. Is that intended for shared blocks?
[Medium]
Does this still leave a path where eth_hdr(skb) reads past the buffer?
eth_hdr(skb) expands to skb->head + skb->mac_header. When mac_header is
the unset sentinel (typeof(skb->mac_header))~0U set by skb allocation
or by skb_unset_mac_header(), the dereference resolves to skb->head +
0xFFFF, well past skb_end_pointer(skb). The new gate only inspects
skb->dev->hard_header_len, which is independent of skb->mac_header, so
any skb delivered to the dissector with mac_header unset on a device
whose hard_header_len >= 12 would still read out-of-bounds bytes into
key_eth_addrs.
Would skb_mac_header_was_set(skb) plus a bounds check that the 12 bytes
lie within [skb->head, skb_tail_pointer(skb)) be a more direct match
for the property the commit message describes:
This assumes the skb has a valid Ethernet header at mac_header,
which is not always the case.
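A check along the lines suggested here might look like the following user-space sketch. struct mock_skb, MOCK_MAC_UNSET and copy_eth_addrs() are hypothetical stand-ins: the real sk_buff tracks head, mac_header and tail differently depending on configuration, so this only models the idea of "mac_header was set and the 12 bytes lie before the tail".

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MOCK_MAC_UNSET ((uint16_t)~0U)  /* "mac_header not set" sentinel */
#define ETH_ADDRS_LEN  12               /* destination MAC + source MAC */

/* Hypothetical, heavily reduced view of an skb. */
struct mock_skb {
	const uint8_t *head;  /* start of the buffer */
	uint16_t mac_header;  /* offset of the L2 header from head */
	uint32_t tail;        /* offset of the end of valid data from head */
};

/* Copy the Ethernet addresses into key if they are provably inside the
 * valid data; otherwise zero the key. Returns true when a copy happened.
 */
static bool copy_eth_addrs(const struct mock_skb *skb,
			   uint8_t key[ETH_ADDRS_LEN])
{
	if (skb->mac_header != MOCK_MAC_UNSET &&
	    (uint32_t)skb->mac_header + ETH_ADDRS_LEN <= skb->tail) {
		memcpy(key, skb->head + skb->mac_header, ETH_ADDRS_LEN);
		return true;
	}

	memset(key, 0, ETH_ADDRS_LEN);
	return false;
}
```

Unlike a gate on the device's header length, this check depends only on the packet itself, so it does not change its answer when skb->dev is reassigned.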
> + else
> + memset(key_eth_addrs, 0, sizeof(*key_eth_addrs));
[Low]
The commit message says:
If the device does not have a link-layer header large enough to
contain the Ethernet addresses, zero the key so the filter will
not match.
Is that wording accurate for all filter shapes?
__fl_lookup() uses key_eth_addrs as part of the rhashtable lookup key.
A flower filter installed to match src_mac 00:00:00:00:00:00 and
dst_mac 00:00:00:00:00:00 (for example a catch-all on a shared block)
would, after this change, deterministically match every packet on a
headerless device, where previously the random skb memory made such a
filter effectively never match there.
Could the commit message note this corner case, or could a sentinel
value that cannot collide with a legal user-supplied key be used
instead of all zeros?
> }
>
> if (dissector_uses_key(flow_dissector,
--
pw-bot: cr
* Re: [PATCH net] netdev-genl: report NAPI thread PID in the caller's pid namespace
From: Samiullah Khawaja @ 2026-06-16 0:27 UTC (permalink / raw)
To: Maoyi Xie
Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Daniel Borkmann, Nikolay Aleksandrov, David Wei,
Stanislav Fomichev, Dragos Tatulea, netdev, linux-kernel, stable
In-Reply-To: <20260615171736.1709318-1-maoyixie.tju@gmail.com>
On Tue, Jun 16, 2026 at 01:17:36AM +0800, Maoyi Xie wrote:
>netdev_nl_napi_fill_one() reports the NAPI kthread PID in NETDEV_A_NAPI_PID
>using task_pid_nr(), which returns the PID in the initial pid namespace.
>
>NETDEV_CMD_NAPI_GET does not have GENL_ADMIN_PERM and the netdev genl family
>is netnsok, so a caller in a child pid namespace can issue it. That caller
>then sees the kthread's global PID, even though the kthread is not visible
>in its pid namespace, where the value should be 0.
>
>Translate the PID through the caller's pid namespace, the same way commit
>3799c2570982 ("io_uring/fdinfo: translate SqThread PID through caller's
>pid_ns") did for the io_uring SQPOLL thread. The doit and dumpit paths both
>run synchronously in the caller's context, so task_active_pid_ns(current) is
>the caller's pid namespace.
>
>Fixes: db4704f4e4df ("netdev-genl: Add PID for the NAPI thread")
>Cc: stable@vger.kernel.org
>Signed-off-by: Maoyi Xie <maoyixie.tju@gmail.com>
>---
> net/core/netdev-genl.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
>diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
>index b8f6076d8007..4c23e985cc01 100644
>--- a/net/core/netdev-genl.c
>+++ b/net/core/netdev-genl.c
>@@ -2,6 +2,7 @@
>
> #include <linux/netdevice.h>
> #include <linux/notifier.h>
>+#include <linux/pid_namespace.h>
> #include <linux/rtnetlink.h>
> #include <net/busy_poll.h>
> #include <net/net_namespace.h>
>@@ -189,7 +190,8 @@ netdev_nl_napi_fill_one(struct sk_buff *rsp, struct napi_struct *napi,
> goto nla_put_failure;
>
> if (napi->thread) {
>- pid = task_pid_nr(napi->thread);
>+ pid = task_pid_nr_ns(napi->thread,
>+ task_active_pid_ns(current));
> if (nla_put_u32(rsp, NETDEV_A_NAPI_PID, pid))
> goto nla_put_failure;
> }
>--
>2.43.0
>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
* Re: [PATCH] swiotlb: avoid double copy with swiotlb on tx socket
From: Jakub Kicinski @ 2026-06-16 0:25 UTC (permalink / raw)
To: Luigi Rizzo
Cc: rizzo.unipi, m.szyprowski, robin.murphy, willemb, kuniyu, davem,
edumazet, pabeni, gregkh, rafael, akpm, david, netdev, linux-mm,
iommu, driver-core, linux-kernel
In-Reply-To: <20260615234220.3946885-1-lrizzo@google.com>
On Mon, 15 Jun 2026 23:42:20 +0000 Luigi Rizzo wrote:
> The use of swiotlb causes an extra data copy on I/O. For tx sockets,
> especially with greedy senders, this has a high chance of happening in
> the softirq handler for tx network interrupts, creating a significant
> performance bottleneck.
What's the use case? I associate swiotlb with debug / testing mostly,
so it'd be useful for people like me to explain why you care.
BTW net-next is closed: https://netdev.bots.linux.dev/net-next.html