* Re: [PATCH net-next 3/5] selftests/bpf: remove sockmap + ktls tests
From: Jakub Sitnicki @ 2026-06-16 10:04 UTC (permalink / raw)
To: Jakub Kicinski
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms, bpf,
john.fastabend, sd
In-Reply-To: <20260614014102.461064-4-kuba@kernel.org>
On Sat, Jun 13, 2026 at 06:40 PM -07, Jakub Kicinski wrote:
> The combination of sockmap and TLS is no longer supported - installing
> the TLS ULP on a sockmap socket (and vice versa) is now rejected. Remove
> the tests that exercise the combination along with their BPF program;
> the file covered nothing but sockmap sockets holding kTLS contexts.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> .../selftests/bpf/prog_tests/sockmap_ktls.c | 355 ------------------
> .../selftests/bpf/progs/test_sockmap_ktls.c | 61 ---
> tools/testing/selftests/bpf/test_sockmap.c | 227 +----------
> 3 files changed, 1 insertion(+), 642 deletions(-)
> delete mode 100644 tools/testing/selftests/bpf/progs/test_sockmap_ktls.c
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_ktls.c b/tools/testing/selftests/bpf/prog_tests/sockmap_ktls.c
> index 6ed8e149e3d5..cda6b22cf759 100644
> --- a/tools/testing/selftests/bpf/prog_tests/sockmap_ktls.c
> +++ b/tools/testing/selftests/bpf/prog_tests/sockmap_ktls.c
[...]
> static void run_ktls_test(int family, int sotype)
> {
> if (test__start_subtest("tls simple offload"))
> test_sockmap_ktls_offload(family, sotype);
Nit: We probably don't need to keep this one test around.
It tests pure kTLS and overlaps with selftests/net/tls.c.
> - if (test__start_subtest("tls tx cork"))
> - test_sockmap_ktls_tx_cork(family, sotype, false);
> - if (test__start_subtest("tls tx cork with push"))
> - test_sockmap_ktls_tx_cork(family, sotype, true);
> - if (test__start_subtest("tls tx egress with no buf"))
> - test_sockmap_ktls_tx_no_buf(family, sotype, true);
> - if (test__start_subtest("tls tx with pop"))
> - test_sockmap_ktls_tx_pop(family, sotype);
> - if (test__start_subtest("tls verdict with tls rx"))
> - test_sockmap_ktls_verdict_with_tls_rx(family, sotype);
> }
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
* [PATCH net] net: dst_metadata: fix false-positive memcpy overflow in tun_dst_unclone
From: Ilya Maximets @ 2026-06-16 10:03 UTC (permalink / raw)
To: netdev
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Kees Cook, Gustavo A. R. Silva, Nathan Chancellor,
Nick Desaulniers, Bill Wendling, Justin Stitt, linux-kernel,
linux-hardening, llvm, Ilya Maximets, Johan Thomsen
kmalloc_flex() in metadata_dst_alloc() sets __counted_by for the
structure to the options_len, which is then initialized to zero.
Later, we're initializing the structure by copying the tunnel info
together with the options, and this triggers a warning for a potential
memcpy overflow, since the compiler estimates that the options can't
fit into the structure, even though the memory for them is actually
allocated.
memcpy: detected buffer overflow: 104 byte write of buffer size 96
WARNING: CPU: X PID: Y at lib/string_helpers.c:1036 __fortify_report
skb_tunnel_info_unclone+0x179/0x190
geneve_xmit+0x7fe/0xe00
The issue is triggered when built with clang and source fortification.
Fix that by doing the copy in two stages: first - the main data with
the options_len, then the options. This way the correct length should
be known at the time of the copy.
It would be better if the options_len never changed after allocation,
but the allocation code is a little separate from the initialization
and it would be awkward and potentially dangerous to return a struct
with options_len set to a non-zero value from the metadata_dst_alloc().
Another option would be to use ip_tunnel_info_opts_set(), but it is
doing too many unnecessary operations for the use case here.
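A minimal userspace sketch of the two-stage copy described above, for
illustration only. The struct and the helpers tun_info_alloc() and
tun_info_clone() are hypothetical stand-ins, not the kernel's types or API;
the only point is that the length field is copied before the bytes it counts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a tunnel-info structure: a fixed header
 * followed by a flexible array whose valid length is options_len.
 * In the kernel the array would carry __counted_by(options_len).
 */
struct tun_info_sketch {
	unsigned int key;
	unsigned int options_len;
	unsigned char options[];
};

/* Allocate room for opts_len option bytes but leave options_len at zero,
 * mirroring what the commit message says about metadata_dst_alloc().
 */
static struct tun_info_sketch *tun_info_alloc(size_t opts_len)
{
	return calloc(1, sizeof(struct tun_info_sketch) + opts_len);
}

static struct tun_info_sketch *tun_info_clone(const struct tun_info_sketch *src)
{
	struct tun_info_sketch *dst;

	assert(src != NULL);

	dst = tun_info_alloc(src->options_len);
	if (!dst)
		return NULL;

	/* Stage 1: structure assignment copies the fixed part only,
	 * which brings options_len along with it.
	 */
	*dst = *src;

	/* Stage 2: options_len is now correct in dst, so a bounds-checking
	 * memcpy would see a destination that is large enough.
	 */
	memcpy(dst->options, src->options, dst->options_len);

	return dst;
}
```

A single memcpy() of header plus options would write past what the
destination's still-zero length field advertises, which is the situation
the warning above reports.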
Fixes: 69050f8d6d07 ("treewide: Replace kmalloc with kmalloc_obj for non-scalar types")
Reported-by: Johan Thomsen <write@ownrisk.dk>
Closes: https://lore.kernel.org/netdev/CAKv6aAM8_EWgXScnKmKYm_4SwGDVBK++dzfP+Y6msUXbp99QUw@mail.gmail.com/
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
---
Johan, if you can test this one in your setup as well, that would
be great. Thanks.
include/net/dst_metadata.h | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h
index 1fc2fb03ce3f..f45d1e3163f0 100644
--- a/include/net/dst_metadata.h
+++ b/include/net/dst_metadata.h
@@ -164,8 +164,11 @@ static inline struct metadata_dst *tun_dst_unclone(struct sk_buff *skb)
if (!new_md)
return ERR_PTR(-ENOMEM);
- memcpy(&new_md->u.tun_info, &md_dst->u.tun_info,
- sizeof(struct ip_tunnel_info) + md_size);
+ /* Copy in two stages to keep the __counted_by happy. */
+ new_md->u.tun_info = md_dst->u.tun_info;
+ memcpy(ip_tunnel_info_opts(&new_md->u.tun_info),
+ ip_tunnel_info_opts(&md_dst->u.tun_info), md_size);
+
#ifdef CONFIG_DST_CACHE
/* Unclone the dst cache if there is one */
if (new_md->u.tun_info.dst_cache.cache) {
--
2.54.0
* Re: [PATCH 07/23] driver core: platform: provide platform_device_set_fwnode()
From: Bartosz Golaszewski @ 2026-06-16 9:51 UTC (permalink / raw)
To: Andy Shevchenko
Cc: Lee Jones, Mark Brown, Thierry Reding, Sebastian Hesselbarth,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Srinivas Kandagatla, Greg Kroah-Hartman, Vinod Koul,
Rafael J. Wysocki, Danilo Krummrich, Rob Herring, Saravana Kannan,
Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy (CS GROUP), Andi Shyti, Joerg Roedel,
Will Deacon, Robin Murphy, Doug Berger, Florian Fainelli,
Broadcom internal kernel review list, Ulf Hansson, Frank Li,
Sascha Hauer, Pengutronix Kernel Team, Fabio Estevam,
Matthew Brost, Thomas Hellström, Rodrigo Vivi, David Airlie,
Simona Vetter, Peter Chen, Paul Cercueil, Bin Liu, Philipp Zabel,
Maximilian Luz, Hans de Goede, Ilpo Järvinen,
Krzysztof Kozlowski, Benjamin Herrenschmidt, linux-kernel, netdev,
linux-arm-msm, linux-sound, driver-core, devicetree, linuxppc-dev,
linux-i2c, iommu, linux-pm, imx, linux-arm-kernel, intel-xe,
dri-devel, linux-usb, linux-mips, platform-driver-x86,
Bartosz Golaszewski, Bartosz Golaszewski
In-Reply-To: <ajEcDq0S067wMFaK@black.igk.intel.com>
On Tue, 16 Jun 2026 11:49:02 +0200, Andy Shevchenko
<andriy.shevchenko@linux.intel.com> said:
> On Thu, Jun 04, 2026 at 05:32:27AM -0700, Bartosz Golaszewski wrote:
>> On Tue, 2 Jun 2026 23:41:53 +0200, Andy Shevchenko
>> <andriy.shevchenko@linux.intel.com> said:
>> > On Thu, May 21, 2026 at 10:36:30AM +0200, Bartosz Golaszewski wrote:
>> >> Provide a helper function encapsulating the logic of assigning firmware
>> >> nodes to platform devices created with platform_device_alloc(). Make the
>> >> kerneldoc state that this is the proper interface for assigning firmware
>> >> nodes to dynamically allocated platform devices. This will allow us to
>> >> switch to counting the references of the device's firmware nodes in the
>> >> future, not only the OF nodes.
>> >
>> > But why different for of_node and fwnode to begin with?!
>>
>> I'm not following. What are you suggesting?
>
> After re-reading this thread, I think I'm suggesting the same thing you plan
> to do in the future, as you put it: "This will allow us to switch to
> counting the references of the device's firmware nodes in the future, not only
> the OF nodes."
>
> // Offtopic
> I haven't heard from you for more than a month on this:
> https://lore.kernel.org/r/af18zdP5HF3_P9Vo@black.igk.intel.com
> Is there anything I should do? Please answer in that thread.
>
Eek, sorry, must have flown under the radar.
I'll pull it now, I will do a second PR for this merge window anyway.
Bart
* Re: [PATCH RFC 8/9] arm64: dts: qcom: shikra-cqs-evk: Enable ethernet0
From: Konrad Dybcio @ 2026-06-16 9:50 UTC (permalink / raw)
To: Mohd Ayaan Anwar, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Rob Herring, Krzysztof Kozlowski,
Conor Dooley, Richard Cochran, Bjorn Andersson, Konrad Dybcio,
Maxime Coquelin, Alexandre Torgue, Russell King
Cc: linux-arm-msm, netdev, devicetree, linux-kernel, linux-stm32,
linux-arm-kernel
In-Reply-To: <20260612-shikra_ethernet-v1-8-f0f4a1d19929@oss.qualcomm.com>
On 6/11/26 8:37 PM, Mohd Ayaan Anwar wrote:
> Enable the first Gigabit Ethernet controller. The board layout is
> identical to the CQM EVK.
>
> Signed-off-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com>
> ---
> arch/arm64/boot/dts/qcom/shikra-cqs-evk.dts | 119 ++++++++++++++++++++++++++++
> 1 file changed, 119 insertions(+)
>
> diff --git a/arch/arm64/boot/dts/qcom/shikra-cqs-evk.dts b/arch/arm64/boot/dts/qcom/shikra-cqs-evk.dts
> index 26ff8007a819e46bbc9ffa3dddc6fee6530a4a7a..1f2e4f6dd7cca436f62ba9f09cd328e5a2079095 100644
> --- a/arch/arm64/boot/dts/qcom/shikra-cqs-evk.dts
> +++ b/arch/arm64/boot/dts/qcom/shikra-cqs-evk.dts
> @@ -7,6 +7,7 @@
>
> #include "shikra-cqm-som.dtsi"
> #include "shikra-evk.dtsi"
> +#include <dt-bindings/net/ti-dp83867.h>
>
> / {
> model = "Qualcomm Technologies, Inc. Shikra CQS EVK";
> @@ -60,6 +61,92 @@ vreg_pmu_ch1: ldo4 {
> };
> };
>
> +&ethernet0 {
> + status = "okay";
'status' should go last, with a \n before it
> + phy-handle = <&ethphy0>;
> + phy-mode = "rgmii-id";
> +
> + pinctrl-names = "default";
> + pinctrl-0 = <&ethernet0_defaults>;
property-n
property-names
in this order, please
[...]
> +&tlmm {
> + ethernet0_defaults: ethernet0-defaults-state {
s/defaults/default
Please move this definition to shikra.dtsi
> + rgmii-rx-pins {
> + pins = "gpio121", "gpio122", "gpio123",
> + "gpio124", "gpio125", "gpio126";
> + function = "rgmii";
> + bias-disable;
> + drive-strength = <16>;
Let's move drive-strength before bias (that's the order used in other
places)
> + };
> + rgmii-tx-pins {
Please separate subsequent subnodes with \n
> + pins = "gpio127", "gpio128", "gpio129",
> + "gpio130", "gpio131", "gpio132";
> + function = "rgmii";
> + bias-pull-up;
> + drive-strength = <16>;
> + };
> + rgmii-mdio-pins {
> + pins = "gpio133", "gpio134";
> + function = "rgmii";
> + bias-pull-up;
> + drive-strength = <16>;
> + };
> + };
> +
> + emac0_phy_en_hog: emac0-phy-en-hog {
> + gpio-hog;
> + gpios = <149 GPIO_ACTIVE_HIGH>;
> + output-high;
> + line-name = "emac0-phy-en";
> + };
This looks like a hack - what does this pin actually do?
Konrad
* Re: [PATCH] swiotlb: avoid double copy with swiotlb on tx socket
From: Luigi Rizzo @ 2026-06-16 9:48 UTC (permalink / raw)
To: Pedro Falcato
Cc: rizzo.unipi, m.szyprowski, robin.murphy, willemb, kuniyu, davem,
edumazet, kuba, pabeni, gregkh, rafael, akpm, david, netdev,
linux-mm, iommu, driver-core, linux-kernel,
Jesper Dangaard Brouer, Ilias Apalodimas
In-Reply-To: <ajESl4osXP7roz5q@pedro-suse.lan>
On Tue, Jun 16, 2026 at 11:20 AM Pedro Falcato <pfalcato@suse.de> wrote:
>
> (+cc page pool maintainers)
> On Mon, Jun 15, 2026 at 11:42:20PM +0000, Luigi Rizzo wrote:
> > The use of swiotlb causes an extra data copy on I/O. For tx sockets,
> > especially with greedy senders, this has a high chance of happening in
> > the softirq handler for tx network interrupts, creating a significant
> > performance bottleneck.
> >
> > Allow tx sockets to allocate socket buffers directly from the bounce
> > buffers. This avoids the second copy and removes the above bottleneck.
> > The fraction of swiotlb buffers allowed for this feature is set with
> > /sys/module/swiotlb/parameters/zerocopy_tx_percent
> > (0 means disabled, 90 is the maximum, to avoid persistent I/O failures).
> >
> > Implementation:
> > - define a new page type to unambiguously identify bounce buffers used
> > as backing storage for socket buffers
> > - modify skb_page_frag_refill to perform the modified allocation
> > - modify the destructors __free_frozen_pages(), free_unref_folio() to
> > handle those pages and return them to the pool.
> >
> > The savings are especially visible with fewer queues. In synthetic
> > benchmarks, senders with 1-2 queues would cap around 50Gbps with
> > conventional swiotlb, and reach over 170Gbps with the feature enabled.
>
> I could be wrong, but I genuinely think that the way to go about this is
> using page_pool for regular TX as well. page_pool pages are all dma-mapped
> (so whatever swiotlb optimization you want can be done there), and the net
> stack already has awareness of these special pages and special skbs, so it
> won't Just Return Them back to the page allocator.
I am not sure I follow your comment above, can you expand/clarify?
The problem I am dealing with is that the copy from the socket buffer
to the bounce buffer is done in the device xmit function. Under high
load it is almost always done by the tx softirq.
This means that even if we move the copy outside the HARD_TX_LOCK(),
it would still be almost completely serialized.
Hence the proposed method to make skb_page_frag_refill() allocate
a bounce buffer directly (under specific conditions), so there is a single copy
done directly to the dma-able buffer, and it is done in the user threads/CPUs
and is not serialized in the softirq thread.
I am not sure how page_pool on tx could help here.
cheers
luigi
* Re: [PATCH 07/23] driver core: platform: provide platform_device_set_fwnode()
From: Andy Shevchenko @ 2026-06-16 9:49 UTC (permalink / raw)
To: Bartosz Golaszewski
Cc: Lee Jones, Mark Brown, Thierry Reding, Sebastian Hesselbarth,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Srinivas Kandagatla, Greg Kroah-Hartman, Vinod Koul,
Rafael J. Wysocki, Danilo Krummrich, Rob Herring, Saravana Kannan,
Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy (CS GROUP), Andi Shyti, Joerg Roedel,
Will Deacon, Robin Murphy, Doug Berger, Florian Fainelli,
Broadcom internal kernel review list, Ulf Hansson, Frank Li,
Sascha Hauer, Pengutronix Kernel Team, Fabio Estevam,
Matthew Brost, Thomas Hellström, Rodrigo Vivi, David Airlie,
Simona Vetter, Peter Chen, Paul Cercueil, Bin Liu, Philipp Zabel,
Maximilian Luz, Hans de Goede, Ilpo Järvinen,
Krzysztof Kozlowski, Benjamin Herrenschmidt, linux-kernel, netdev,
linux-arm-msm, linux-sound, driver-core, devicetree, linuxppc-dev,
linux-i2c, iommu, linux-pm, imx, linux-arm-kernel, intel-xe,
dri-devel, linux-usb, linux-mips, platform-driver-x86,
Bartosz Golaszewski
In-Reply-To: <CAMRc=McLN9Ovoqo3om-3uC=q+=rcKCoiWMctC=yvwiaHacU0PQ@mail.gmail.com>
On Thu, Jun 04, 2026 at 05:32:27AM -0700, Bartosz Golaszewski wrote:
> On Tue, 2 Jun 2026 23:41:53 +0200, Andy Shevchenko
> <andriy.shevchenko@linux.intel.com> said:
> > On Thu, May 21, 2026 at 10:36:30AM +0200, Bartosz Golaszewski wrote:
> >> Provide a helper function encapsulating the logic of assigning firmware
> >> nodes to platform devices created with platform_device_alloc(). Make the
> >> kerneldoc state that this is the proper interface for assigning firmware
> >> nodes to dynamically allocated platform devices. This will allow us to
> >> switch to counting the references of the device's firmware nodes in the
> >> future, not only the OF nodes.
> >
> > But why different for of_node and fwnode to begin with?!
>
> I'm not following. What are you suggesting?
After re-reading this thread, I think I'm suggesting the same thing you plan
to do in the future, as you put it: "This will allow us to switch to
counting the references of the device's firmware nodes in the future, not only
the OF nodes."
// Offtopic
I haven't heard from you for more than a month on this:
https://lore.kernel.org/r/af18zdP5HF3_P9Vo@black.igk.intel.com
Is there anything I should do? Please answer in that thread.
--
With Best Regards,
Andy Shevchenko
* Re: [PATCH 08/23] driver core: platform: provide platform_device_set_of_node_from_dev()
From: Andy Shevchenko @ 2026-06-16 9:41 UTC (permalink / raw)
To: Johan Hovold
Cc: Bartosz Golaszewski, Lee Jones, Mark Brown, Thierry Reding,
Sebastian Hesselbarth, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Srinivas Kandagatla,
Greg Kroah-Hartman, Vinod Koul, Rafael J. Wysocki,
Danilo Krummrich, Rob Herring, Saravana Kannan,
Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy (CS GROUP), Andi Shyti, Joerg Roedel,
Will Deacon, Robin Murphy, Doug Berger, Florian Fainelli,
Broadcom internal kernel review list, Ulf Hansson, Frank Li,
Sascha Hauer, Pengutronix Kernel Team, Fabio Estevam,
Matthew Brost, Thomas Hellström, Rodrigo Vivi, David Airlie,
Simona Vetter, Peter Chen, Paul Cercueil, Bin Liu, Philipp Zabel,
Maximilian Luz, Hans de Goede, Ilpo Järvinen,
Krzysztof Kozlowski, Benjamin Herrenschmidt, brgl, linux-kernel,
netdev, linux-arm-msm, linux-sound, driver-core, devicetree,
linuxppc-dev, linux-i2c, iommu, linux-pm, imx, linux-arm-kernel,
intel-xe, dri-devel, linux-usb, linux-mips, platform-driver-x86
In-Reply-To: <aiZpJkQBXg2pcczy@hovoldconsulting.com>
On Mon, Jun 08, 2026 at 09:03:02AM +0200, Johan Hovold wrote:
> On Fri, Jun 05, 2026 at 05:53:04PM +0300, Andy Shevchenko wrote:
> > On Fri, Jun 05, 2026 at 02:16:17PM +0200, Johan Hovold wrote:
> > > On Wed, Jun 03, 2026 at 12:44:55AM +0300, Andy Shevchenko wrote:
> > > > On Thu, May 21, 2026 at 10:36:31AM +0200, Bartosz Golaszewski wrote:
> > > > > Provide a platform-specific variant of device_set_of_node_from_dev(). In
> > > > > addition to bumping the reference count of the OF node being assigned,
> > > > > it also assigns the fwnode of the platform device.
> > > >
> > > > Can we rather investigate the way how to make that of node reuse thingy
> > > > (which is used solely by pin control) differently and then drop this confusing
> > > > device_set_of_node_from_dev() call altogether?
> > >
> > > No, that call is needed. See commit 4e75e1d7dac9 ("driver core: add
> > > helper to reuse a device-tree node") for details.
> >
> > Bart fixes the problem with the platform driver. At the result this will be
> > the only device_set_node() + 'reused = true'. As for 'reused' flag, the need
> > is only for pinmux/pin control stuff.
>
> And any other resource which may (eventually) be claimed by driver core
> or bus code.
>
> > The question here is if there is a better
> > way to make that 'reused' be done automatically without need of setting some
> > flag explicitly.
>
> That's not really relevant to the series at hand.
It's not, but it's relevant in the long term for understanding how we can get
this done in a better way.
> If this is something we want to merge then you need to continue setting
> the flag in order not to cause regressions.
Yes, that's how it is now.
--
With Best Regards,
Andy Shevchenko
* [PATCH net-next v2] net: dsa: Fix skb ownership in taggers
From: Linus Walleij @ 2026-06-16 9:36 UTC (permalink / raw)
To: Andrew Lunn, Vladimir Oltean, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, Florian Fainelli,
Jonas Gorski, Hauke Mehrtens, Kurt Kanzenbach, Woojung Huh,
UNGLinuxDriver, Chester A. Unal, Daniel Golle, Matthias Brugger,
AngeloGioacchino Del Regno, Wei Fang, Clark Wang,
Clément Léger, George McCollister, David Yang
Cc: netdev, Sashiko AI Review, Linus Walleij
The tag_8021q.c tagger calls vlan_insert_tag() in dsa_8021q_xmit().
vlan_insert_tag() will consume the skb with kfree_skb() on failure
and return NULL.
When NULL is returned from ->xmit() to dsa_user_xmit(), the caller
frees the same skb again, leading to a double free.
The idea of dsa_user_xmit() and dsa_switch_rcv() dropping the skb
they held before the call to ->xmit() and ->rcv() is conceptually
wrong: the pattern elsewhere in the networking code is that consumers
drop their skb:s on failure.
Modify the ->xmit() and ->rcv() call sites to not drop the SKB if
the taggers return NULL from any of these calls. Move those drops into
the taggers so every callback error path that retains ownership consumes
the skb before returning NULL.
Keep the existing helper ownership rules: VLAN insertion helpers already
free on failure (this is the case in tag_8021q.c), while deferred
transmit paths either transfer the skb reference to worker context or
hold a worker reference with skb_get() and drop the caller's reference.
For SJA1105 meta RX, transfer the buffered stampable skb under the meta
lock and return NULL while the skb is waiting for its meta frame: the
skb is not dropped in this case.
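The ownership rule described above can be sketched in userspace C. This is
an illustration under stated assumptions, not kernel code: struct buf_sketch,
buf_alloc(), buf_put(), tagger_xmit() and user_xmit() are hypothetical
stand-ins for a refcounted skb, kfree_skb(), a tagger ->xmit() callback and
its call site.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a reference-counted skb. */
struct buf_sketch {
	int users;	/* reference count */
	int len;
};

/* Number of live buffers, so leaks and double frees become visible. */
static int live_bufs;

static struct buf_sketch *buf_alloc(int len)
{
	struct buf_sketch *b = malloc(sizeof(*b));

	if (!b)
		abort();
	b->users = 1;
	b->len = len;
	live_bufs++;
	return b;
}

/* Drop one reference and free on the last one (kfree_skb() analogue). */
static void buf_put(struct buf_sketch *b)
{
	assert(b->users > 0);
	if (--b->users == 0) {
		live_bufs--;
		free(b);
	}
}

/* Tagger-style callback: on failure it consumes the buffer itself and
 * returns NULL, so the caller has nothing left to free.
 */
static struct buf_sketch *tagger_xmit(struct buf_sketch *b, int min_len)
{
	if (b->len < min_len) {
		buf_put(b);
		return NULL;
	}
	return b;
}

/* Call site: a NULL return means the buffer is already gone. */
static int user_xmit(struct buf_sketch *b, int min_len)
{
	struct buf_sketch *n = tagger_xmit(b, min_len);

	if (!n)
		return -1;	/* dropped by the callee, do not free again */

	buf_put(n);	/* stands in for the lower layer consuming the buffer */
	return 0;
}
```

If user_xmit() also called buf_put() on the NULL path, the assert in
buf_put() would not even be reached before touching freed memory, which is
the double free this patch removes.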
Reported-by: Sashiko AI Review <sashiko-bot@kernel.org>
Closes: https://lore.kernel.org/r/20260610153952.1685895-1-kuba@kernel.org/
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Assisted-by: Codex:gpt-5-5
Acked-by: David Yang <mmyangfl@gmail.com> # yt921x
Acked-by: Kurt Kanzenbach <kurt@linutronix.de> # hellcreek
Signed-off-by: Linus Walleij <linusw@kernel.org>
---
Changes in v2:
- In some instances __skb_pad() and __skb_put_padto() followed by a
kfree_skb() could be simplified to just call skb_pad() and
skb_put_padto() which will free the skb on failure.
- Use a label and goto for the kfree_skb(); return NULL; in
the netc_rcv() callback in tag_netc.c as requested.
- Collect ACKs.
- Retag for net-next.
- Link to v1: https://patch.msgid.link/20260616-dsa-fix-free-skb-v1-1-fd30b35dcf66@kernel.org
---
net/dsa/tag.c | 4 +---
net/dsa/tag_ar9331.c | 10 ++++++++--
net/dsa/tag_brcm.c | 39 ++++++++++++++++++++++++---------------
net/dsa/tag_dsa.c | 15 ++++++++++++---
net/dsa/tag_gswip.c | 8 ++++++--
net/dsa/tag_hellcreek.c | 9 +++++++--
net/dsa/tag_ksz.c | 44 +++++++++++++++++++++++++++++++-------------
net/dsa/tag_lan9303.c | 2 ++
net/dsa/tag_mtk.c | 8 ++++++--
net/dsa/tag_mxl-gsw1xx.c | 3 +++
net/dsa/tag_mxl862xx.c | 3 +++
net/dsa/tag_netc.c | 18 ++++++++++--------
net/dsa/tag_ocelot.c | 4 +++-
net/dsa/tag_ocelot_8021q.c | 20 +++++++++++++-------
net/dsa/tag_qca.c | 14 +++++++++++---
net/dsa/tag_rtl4_a.c | 10 ++++++++--
net/dsa/tag_rtl8_4.c | 24 ++++++++++++++++++------
net/dsa/tag_rzn1_a5psw.c | 8 ++++++--
net/dsa/tag_sja1105.c | 42 +++++++++++++++++++++++++++---------------
net/dsa/tag_trailer.c | 16 ++++++++++++----
net/dsa/tag_vsc73xx_8021q.c | 1 +
net/dsa/tag_xrs700x.c | 12 +++++++++---
net/dsa/tag_yt921x.c | 7 ++++++-
net/dsa/user.c | 7 +++----
24 files changed, 230 insertions(+), 98 deletions(-)
diff --git a/net/dsa/tag.c b/net/dsa/tag.c
index 79ad105902d9..cfc8f5a0cbd9 100644
--- a/net/dsa/tag.c
+++ b/net/dsa/tag.c
@@ -84,10 +84,8 @@ static int dsa_switch_rcv(struct sk_buff *skb, struct net_device *dev,
nskb = cpu_dp->rcv(skb, dev);
}
- if (!nskb) {
- kfree_skb(skb);
+ if (!nskb)
return 0;
- }
skb = nskb;
skb_push(skb, ETH_HLEN);
diff --git a/net/dsa/tag_ar9331.c b/net/dsa/tag_ar9331.c
index cbb588ca73aa..2e2388143b02 100644
--- a/net/dsa/tag_ar9331.c
+++ b/net/dsa/tag_ar9331.c
@@ -51,8 +51,10 @@ static struct sk_buff *ar9331_tag_rcv(struct sk_buff *skb,
u8 ver, port;
u16 hdr;
- if (unlikely(!pskb_may_pull(skb, AR9331_HDR_LEN)))
+ if (unlikely(!pskb_may_pull(skb, AR9331_HDR_LEN))) {
+ kfree_skb(skb);
return NULL;
+ }
hdr = le16_to_cpu(*(__le16 *)skb_mac_header(skb));
@@ -60,12 +62,14 @@ static struct sk_buff *ar9331_tag_rcv(struct sk_buff *skb,
if (unlikely(ver != AR9331_HDR_VERSION)) {
netdev_warn_once(ndev, "%s:%i wrong header version 0x%2x\n",
__func__, __LINE__, hdr);
+ kfree_skb(skb);
return NULL;
}
if (unlikely(hdr & AR9331_HDR_FROM_CPU)) {
netdev_warn_once(ndev, "%s:%i packet should not be from cpu 0x%2x\n",
__func__, __LINE__, hdr);
+ kfree_skb(skb);
return NULL;
}
@@ -75,8 +79,10 @@ static struct sk_buff *ar9331_tag_rcv(struct sk_buff *skb,
port = FIELD_GET(AR9331_HDR_PORT_NUM_MASK, hdr);
skb->dev = dsa_conduit_find_user(ndev, 0, port);
- if (!skb->dev)
+ if (!skb->dev) {
+ kfree_skb(skb);
return NULL;
+ }
return skb;
}
diff --git a/net/dsa/tag_brcm.c b/net/dsa/tag_brcm.c
index cf9420439054..411e3b57d16a 100644
--- a/net/dsa/tag_brcm.c
+++ b/net/dsa/tag_brcm.c
@@ -102,9 +102,9 @@ static struct sk_buff *brcm_tag_xmit_ll(struct sk_buff *skb,
* (including FCS and tag) because the length verification is done after
* the Broadcom tag is stripped off the ingress packet.
*
- * Let dsa_user_xmit() free the SKB
+ * Free the SKB on error.
*/
- if (__skb_put_padto(skb, ETH_ZLEN + BRCM_TAG_LEN, false))
+ if (skb_put_padto(skb, ETH_ZLEN + BRCM_TAG_LEN))
return NULL;
skb_push(skb, BRCM_TAG_LEN);
@@ -151,27 +151,35 @@ static struct sk_buff *brcm_tag_rcv_ll(struct sk_buff *skb,
int source_port;
u8 *brcm_tag;
- if (unlikely(!pskb_may_pull(skb, BRCM_TAG_LEN)))
+ if (unlikely(!pskb_may_pull(skb, BRCM_TAG_LEN))) {
+ kfree_skb(skb);
return NULL;
+ }
brcm_tag = skb->data - offset;
/* The opcode should never be different than 0b000 */
- if (unlikely((brcm_tag[0] >> BRCM_OPCODE_SHIFT) & BRCM_OPCODE_MASK))
+ if (unlikely((brcm_tag[0] >> BRCM_OPCODE_SHIFT) & BRCM_OPCODE_MASK)) {
+ kfree_skb(skb);
return NULL;
+ }
/* We should never see a reserved reason code without knowing how to
* handle it
*/
- if (unlikely(brcm_tag[2] & BRCM_EG_RC_RSVD))
+ if (unlikely(brcm_tag[2] & BRCM_EG_RC_RSVD)) {
+ kfree_skb(skb);
return NULL;
+ }
/* Locate which port this is coming from */
source_port = brcm_tag[3] & BRCM_EG_PID_MASK;
skb->dev = dsa_conduit_find_user(dev, 0, source_port);
- if (!skb->dev)
+ if (!skb->dev) {
+ kfree_skb(skb);
return NULL;
+ }
/* Remove Broadcom tag and update checksum */
skb_pull_rcsum(skb, BRCM_TAG_LEN);
@@ -228,8 +236,10 @@ static struct sk_buff *brcm_leg_tag_rcv(struct sk_buff *skb,
__be16 *proto;
u8 *brcm_tag;
- if (unlikely(!pskb_may_pull(skb, BRCM_LEG_TAG_LEN + VLAN_HLEN)))
+ if (unlikely(!pskb_may_pull(skb, BRCM_LEG_TAG_LEN + VLAN_HLEN))) {
+ kfree_skb(skb);
return NULL;
+ }
brcm_tag = dsa_etype_header_pos_rx(skb);
proto = (__be16 *)(brcm_tag + BRCM_LEG_TAG_LEN);
@@ -237,8 +247,10 @@ static struct sk_buff *brcm_leg_tag_rcv(struct sk_buff *skb,
source_port = brcm_tag[5] & BRCM_LEG_PORT_ID;
skb->dev = dsa_conduit_find_user(dev, 0, source_port);
- if (!skb->dev)
+ if (!skb->dev) {
+ kfree_skb(skb);
return NULL;
+ }
/* The internal switch in BCM63XX SoCs always tags on egress on the CPU
* port. We use VID 0 internally for untagged traffic, so strip the tag
@@ -273,10 +285,8 @@ static struct sk_buff *brcm_leg_tag_xmit(struct sk_buff *skb,
* need to make sure that packets are at least 70 bytes
* (including FCS and tag) because the length verification is done after
* the Broadcom tag is stripped off the ingress packet.
- *
- * Let dsa_user_xmit() free the SKB
*/
- if (__skb_put_padto(skb, ETH_ZLEN + BRCM_LEG_TAG_LEN, false))
+ if (skb_put_padto(skb, ETH_ZLEN + BRCM_LEG_TAG_LEN))
return NULL;
skb_push(skb, BRCM_LEG_TAG_LEN);
@@ -325,10 +335,8 @@ static struct sk_buff *brcm_leg_fcs_tag_xmit(struct sk_buff *skb,
* need to make sure that packets are at least 70 bytes (including FCS
* and tag) because the length verification is done after the Broadcom
* tag is stripped off the ingress packet.
- *
- * Let dsa_user_xmit() free the SKB.
*/
- if (__skb_put_padto(skb, ETH_ZLEN + BRCM_LEG_TAG_LEN, false))
+ if (skb_put_padto(skb, ETH_ZLEN + BRCM_LEG_TAG_LEN))
return NULL;
fcs_len = skb->len;
@@ -351,8 +359,9 @@ static struct sk_buff *brcm_leg_fcs_tag_xmit(struct sk_buff *skb,
brcm_tag[5] = dp->index & BRCM_LEG_PORT_ID;
/* Original FCS value */
- if (__skb_pad(skb, ETH_FCS_LEN, false))
+ if (skb_pad(skb, ETH_FCS_LEN))
return NULL;
+
skb_put_data(skb, &fcs_val, ETH_FCS_LEN);
return skb;
diff --git a/net/dsa/tag_dsa.c b/net/dsa/tag_dsa.c
index 2a2c4fb61a65..d5ffee35fbb5 100644
--- a/net/dsa/tag_dsa.c
+++ b/net/dsa/tag_dsa.c
@@ -224,6 +224,7 @@ static struct sk_buff *dsa_rcv_ll(struct sk_buff *skb, struct net_device *dev,
/* Remote management is not implemented yet,
* drop.
*/
+ kfree_skb(skb);
return NULL;
case DSA_CODE_ARP_MIRROR:
case DSA_CODE_POLICY_MIRROR:
@@ -244,12 +245,14 @@ static struct sk_buff *dsa_rcv_ll(struct sk_buff *skb, struct net_device *dev,
/* Reserved code, this could be anything. Drop
* seems like the safest option.
*/
+ kfree_skb(skb);
return NULL;
}
break;
default:
+ kfree_skb(skb);
return NULL;
}
@@ -271,8 +274,10 @@ static struct sk_buff *dsa_rcv_ll(struct sk_buff *skb, struct net_device *dev,
source_port);
}
- if (!skb->dev)
+ if (!skb->dev) {
+ kfree_skb(skb);
return NULL;
+ }
/* When using LAG offload, skb->dev is not a DSA user interface,
* so we cannot call dsa_default_offload_fwd_mark and we need to
@@ -335,8 +340,10 @@ static struct sk_buff *dsa_xmit(struct sk_buff *skb, struct net_device *dev)
static struct sk_buff *dsa_rcv(struct sk_buff *skb, struct net_device *dev)
{
- if (unlikely(!pskb_may_pull(skb, DSA_HLEN)))
+ if (unlikely(!pskb_may_pull(skb, DSA_HLEN))) {
+ kfree_skb(skb);
return NULL;
+ }
return dsa_rcv_ll(skb, dev, 0);
}
@@ -375,8 +382,10 @@ static struct sk_buff *edsa_xmit(struct sk_buff *skb, struct net_device *dev)
static struct sk_buff *edsa_rcv(struct sk_buff *skb, struct net_device *dev)
{
- if (unlikely(!pskb_may_pull(skb, EDSA_HLEN)))
+ if (unlikely(!pskb_may_pull(skb, EDSA_HLEN))) {
+ kfree_skb(skb);
return NULL;
+ }
skb_pull_rcsum(skb, EDSA_HLEN - DSA_HLEN);
diff --git a/net/dsa/tag_gswip.c b/net/dsa/tag_gswip.c
index 5fa436121087..5c407d448c9f 100644
--- a/net/dsa/tag_gswip.c
+++ b/net/dsa/tag_gswip.c
@@ -80,16 +80,20 @@ static struct sk_buff *gswip_tag_rcv(struct sk_buff *skb,
int port;
u8 *gswip_tag;
- if (unlikely(!pskb_may_pull(skb, GSWIP_RX_HEADER_LEN)))
+ if (unlikely(!pskb_may_pull(skb, GSWIP_RX_HEADER_LEN))) {
+ kfree_skb(skb);
return NULL;
+ }
gswip_tag = skb->data - ETH_HLEN;
/* Get source port information */
port = (gswip_tag[7] & GSWIP_RX_SPPID_MASK) >> GSWIP_RX_SPPID_SHIFT;
skb->dev = dsa_conduit_find_user(dev, 0, port);
- if (!skb->dev)
+ if (!skb->dev) {
+ kfree_skb(skb);
return NULL;
+ }
/* remove GSWIP tag */
skb_pull_rcsum(skb, GSWIP_RX_HEADER_LEN);
diff --git a/net/dsa/tag_hellcreek.c b/net/dsa/tag_hellcreek.c
index 544ab15685a2..dd9f328f3182 100644
--- a/net/dsa/tag_hellcreek.c
+++ b/net/dsa/tag_hellcreek.c
@@ -27,8 +27,10 @@ static struct sk_buff *hellcreek_xmit(struct sk_buff *skb,
* checksums after the switch strips the tag.
*/
if (skb->ip_summed == CHECKSUM_PARTIAL &&
- skb_checksum_help(skb))
+ skb_checksum_help(skb)) {
+ kfree_skb(skb);
return NULL;
+ }
/* Tag encoding */
tag = skb_put(skb, HELLCREEK_TAG_LEN);
@@ -47,11 +49,14 @@ static struct sk_buff *hellcreek_rcv(struct sk_buff *skb,
skb->dev = dsa_conduit_find_user(dev, 0, port);
if (!skb->dev) {
netdev_warn_once(dev, "Failed to get source port: %d\n", port);
+ kfree_skb(skb);
return NULL;
}
- if (pskb_trim_rcsum(skb, skb->len - HELLCREEK_TAG_LEN))
+ if (pskb_trim_rcsum(skb, skb->len - HELLCREEK_TAG_LEN)) {
+ kfree_skb(skb);
return NULL;
+ }
dsa_default_offload_fwd_mark(skb);
diff --git a/net/dsa/tag_ksz.c b/net/dsa/tag_ksz.c
index d2475c3bbb7d..67fa89f102e0 100644
--- a/net/dsa/tag_ksz.c
+++ b/net/dsa/tag_ksz.c
@@ -88,11 +88,15 @@ static struct sk_buff *ksz_common_rcv(struct sk_buff *skb,
unsigned int port, unsigned int len)
{
skb->dev = dsa_conduit_find_user(dev, 0, port);
- if (!skb->dev)
+ if (!skb->dev) {
+ kfree_skb(skb);
return NULL;
+ }
- if (pskb_trim_rcsum(skb, skb->len - len))
+ if (pskb_trim_rcsum(skb, skb->len - len)) {
+ kfree_skb(skb);
return NULL;
+ }
dsa_default_offload_fwd_mark(skb);
@@ -123,8 +127,10 @@ static struct sk_buff *ksz8795_xmit(struct sk_buff *skb, struct net_device *dev)
struct ethhdr *hdr;
u8 *tag;
- if (skb->ip_summed == CHECKSUM_PARTIAL && skb_checksum_help(skb))
+ if (skb->ip_summed == CHECKSUM_PARTIAL && skb_checksum_help(skb)) {
+ kfree_skb(skb);
return NULL;
+ }
/* Tag encoding */
tag = skb_put(skb, KSZ_INGRESS_TAG_LEN);
@@ -141,8 +147,10 @@ static struct sk_buff *ksz8795_rcv(struct sk_buff *skb, struct net_device *dev)
{
u8 *tag;
- if (skb_linearize(skb))
+ if (skb_linearize(skb)) {
+ kfree_skb(skb);
return NULL;
+ }
tag = skb_tail_pointer(skb) - KSZ_EGRESS_TAG_LEN;
@@ -255,22 +263,24 @@ static struct sk_buff *ksz_defer_xmit(struct dsa_port *dp, struct sk_buff *skb)
xmit_work_fn = tagger_data->xmit_work_fn;
xmit_worker = priv->xmit_worker;
- if (!xmit_work_fn || !xmit_worker)
+ if (!xmit_work_fn || !xmit_worker) {
+ kfree_skb(skb);
return NULL;
+ }
xmit_work = kzalloc_obj(*xmit_work, GFP_ATOMIC);
- if (!xmit_work)
+ if (!xmit_work) {
+ kfree_skb(skb);
return NULL;
+ }
kthread_init_work(&xmit_work->work, xmit_work_fn);
- /* Increase refcount so the kfree_skb in dsa_user_xmit
- * won't really free the packet.
- */
xmit_work->dp = dp;
xmit_work->skb = skb_get(skb);
kthread_queue_work(xmit_worker, &xmit_work->work);
+ kfree_skb(skb);
return NULL;
}
@@ -284,8 +294,10 @@ static struct sk_buff *ksz9477_xmit(struct sk_buff *skb,
__be16 *tag;
u16 val;
- if (skb->ip_summed == CHECKSUM_PARTIAL && skb_checksum_help(skb))
+ if (skb->ip_summed == CHECKSUM_PARTIAL && skb_checksum_help(skb)) {
+ kfree_skb(skb);
return NULL;
+ }
/* Tag encoding */
ksz_xmit_timestamp(dp, skb);
@@ -310,8 +322,10 @@ static struct sk_buff *ksz9477_rcv(struct sk_buff *skb, struct net_device *dev)
unsigned int port;
u8 *tag;
- if (skb_linearize(skb))
+ if (skb_linearize(skb)) {
+ kfree_skb(skb);
return NULL;
+ }
/* Tag decoding */
tag = skb_tail_pointer(skb) - KSZ_EGRESS_TAG_LEN;
@@ -352,8 +366,10 @@ static struct sk_buff *ksz9893_xmit(struct sk_buff *skb,
struct ethhdr *hdr;
u8 *tag;
- if (skb->ip_summed == CHECKSUM_PARTIAL && skb_checksum_help(skb))
+ if (skb->ip_summed == CHECKSUM_PARTIAL && skb_checksum_help(skb)) {
+ kfree_skb(skb);
return NULL;
+ }
/* Tag encoding */
ksz_xmit_timestamp(dp, skb);
@@ -418,8 +434,10 @@ static struct sk_buff *lan937x_xmit(struct sk_buff *skb,
__be16 *tag;
u16 val;
- if (skb->ip_summed == CHECKSUM_PARTIAL && skb_checksum_help(skb))
+ if (skb->ip_summed == CHECKSUM_PARTIAL && skb_checksum_help(skb)) {
+ kfree_skb(skb);
return NULL;
+ }
ksz_xmit_timestamp(dp, skb);
diff --git a/net/dsa/tag_lan9303.c b/net/dsa/tag_lan9303.c
index 258e5d7dc5ef..d1194696499a 100644
--- a/net/dsa/tag_lan9303.c
+++ b/net/dsa/tag_lan9303.c
@@ -85,6 +85,7 @@ static struct sk_buff *lan9303_rcv(struct sk_buff *skb, struct net_device *dev)
if (unlikely(!pskb_may_pull(skb, LAN9303_TAG_LEN))) {
dev_warn_ratelimited(&dev->dev,
"Dropping packet, cannot pull\n");
+ kfree_skb(skb);
return NULL;
}
@@ -102,6 +103,7 @@ static struct sk_buff *lan9303_rcv(struct sk_buff *skb, struct net_device *dev)
skb->dev = dsa_conduit_find_user(dev, 0, source_port);
if (!skb->dev) {
dev_warn_ratelimited(&dev->dev, "Dropping packet due to invalid source port\n");
+ kfree_skb(skb);
return NULL;
}
diff --git a/net/dsa/tag_mtk.c b/net/dsa/tag_mtk.c
index dea3eecaf093..c7dc7731675e 100644
--- a/net/dsa/tag_mtk.c
+++ b/net/dsa/tag_mtk.c
@@ -72,8 +72,10 @@ static struct sk_buff *mtk_tag_rcv(struct sk_buff *skb, struct net_device *dev)
int port;
__be16 *phdr;
- if (unlikely(!pskb_may_pull(skb, MTK_HDR_LEN)))
+ if (unlikely(!pskb_may_pull(skb, MTK_HDR_LEN))) {
+ kfree_skb(skb);
return NULL;
+ }
phdr = dsa_etype_header_pos_rx(skb);
hdr = ntohs(*phdr);
@@ -87,8 +89,10 @@ static struct sk_buff *mtk_tag_rcv(struct sk_buff *skb, struct net_device *dev)
port = (hdr & MTK_HDR_RECV_SOURCE_PORT_MASK);
skb->dev = dsa_conduit_find_user(dev, 0, port);
- if (!skb->dev)
+ if (!skb->dev) {
+ kfree_skb(skb);
return NULL;
+ }
dsa_default_offload_fwd_mark(skb);
diff --git a/net/dsa/tag_mxl-gsw1xx.c b/net/dsa/tag_mxl-gsw1xx.c
index 60f7c445e656..4b1b6ef94196 100644
--- a/net/dsa/tag_mxl-gsw1xx.c
+++ b/net/dsa/tag_mxl-gsw1xx.c
@@ -73,6 +73,7 @@ static struct sk_buff *gsw1xx_tag_rcv(struct sk_buff *skb,
if (unlikely(!pskb_may_pull(skb, GSW1XX_HEADER_LEN))) {
dev_warn_ratelimited(&dev->dev, "Dropping packet, cannot pull SKB\n");
+ kfree_skb(skb);
return NULL;
}
@@ -81,6 +82,7 @@ static struct sk_buff *gsw1xx_tag_rcv(struct sk_buff *skb,
if (unlikely(ntohs(gsw1xx_tag[0]) != ETH_P_MXLGSW)) {
dev_warn_ratelimited(&dev->dev, "Dropping packet due to invalid special tag\n");
dev_warn_ratelimited(&dev->dev, "Tag: %8ph\n", gsw1xx_tag);
+ kfree_skb(skb);
return NULL;
}
@@ -90,6 +92,7 @@ static struct sk_buff *gsw1xx_tag_rcv(struct sk_buff *skb,
if (!skb->dev) {
dev_warn_ratelimited(&dev->dev, "Dropping packet due to invalid source port\n");
dev_warn_ratelimited(&dev->dev, "Tag: %8ph\n", gsw1xx_tag);
+ kfree_skb(skb);
return NULL;
}
diff --git a/net/dsa/tag_mxl862xx.c b/net/dsa/tag_mxl862xx.c
index 8daefeb8d49d..87b80ddf0946 100644
--- a/net/dsa/tag_mxl862xx.c
+++ b/net/dsa/tag_mxl862xx.c
@@ -64,6 +64,7 @@ static struct sk_buff *mxl862_tag_rcv(struct sk_buff *skb,
if (unlikely(!pskb_may_pull(skb, MXL862_HEADER_LEN))) {
dev_warn_ratelimited(&dev->dev, "Cannot pull SKB, packet dropped\n");
+ kfree_skb(skb);
return NULL;
}
@@ -73,6 +74,7 @@ static struct sk_buff *mxl862_tag_rcv(struct sk_buff *skb,
dev_warn_ratelimited(&dev->dev,
"Invalid special tag marker, packet dropped, tag: %8ph\n",
mxl862_tag);
+ kfree_skb(skb);
return NULL;
}
@@ -83,6 +85,7 @@ static struct sk_buff *mxl862_tag_rcv(struct sk_buff *skb,
dev_warn_ratelimited(&dev->dev,
"Invalid source port, packet dropped, tag: %8ph\n",
mxl862_tag);
+ kfree_skb(skb);
return NULL;
}
diff --git a/net/dsa/tag_netc.c b/net/dsa/tag_netc.c
index ccedfe3a80b6..df72a61796ad 100644
--- a/net/dsa/tag_netc.c
+++ b/net/dsa/tag_netc.c
@@ -131,14 +131,13 @@ static struct sk_buff *netc_rcv(struct sk_buff *skb,
int type, subtype;
if (unlikely(!pskb_may_pull(skb, NETC_TAG_MAX_LEN)))
- return NULL;
+ goto err_free_skb;
tag_cmn = dsa_etype_header_pos_rx(skb);
if (ntohs(tag_cmn->tpid) != ETH_P_NXP_NETC) {
dev_warn_ratelimited(&ndev->dev, "Unknown TPID 0x%04x\n",
ntohs(tag_cmn->tpid));
-
- return NULL;
+ goto err_free_skb;
}
if (tag_cmn->qos & NETC_TAG_QV)
@@ -149,14 +148,13 @@ static struct sk_buff *netc_rcv(struct sk_buff *skb,
if (!sw_id) {
dev_warn_ratelimited(&ndev->dev,
"VEPA switch ID is not supported yet\n");
-
- return NULL;
+ goto err_free_skb;
}
port = FIELD_GET(NETC_TAG_PORT, tag_cmn->switch_port);
skb->dev = dsa_conduit_find_user(ndev, sw_id, port);
if (!skb->dev)
- return NULL;
+ goto err_free_skb;
type = FIELD_GET(NETC_TAG_TYPE, tag_cmn->type);
subtype = FIELD_GET(NETC_TAG_SUBTYPE, tag_cmn->type);
@@ -165,11 +163,11 @@ static struct sk_buff *netc_rcv(struct sk_buff *skb,
} else if (type == NETC_TAG_TO_HOST) {
/* Currently only subtype0 supported */
if (subtype != NETC_TAG_TH_SUBTYPE0)
- return NULL;
+ goto err_free_skb;
} else {
dev_warn_ratelimited(&ndev->dev,
"Unexpected tag type %d\n", type);
- return NULL;
+ goto err_free_skb;
}
/* Remove Switch tag from the frame */
@@ -178,6 +176,10 @@ static struct sk_buff *netc_rcv(struct sk_buff *skb,
dsa_strip_etype_header(skb, tag_len);
return skb;
+
+err_free_skb:
+ kfree_skb(skb);
+ return NULL;
}
static void netc_flow_dissect(const struct sk_buff *skb, __be16 *proto,
diff --git a/net/dsa/tag_ocelot.c b/net/dsa/tag_ocelot.c
index 3405def79c2d..d208c7322cd6 100644
--- a/net/dsa/tag_ocelot.c
+++ b/net/dsa/tag_ocelot.c
@@ -107,14 +107,16 @@ static struct sk_buff *ocelot_rcv(struct sk_buff *skb,
ocelot_xfh_get_rew_val(extraction, &rew_val);
skb->dev = dsa_conduit_find_user(netdev, 0, src_port);
- if (!skb->dev)
+ if (!skb->dev) {
/* The switch will reflect back some frames sent through
* sockets opened on the bare DSA conduit. These will come back
* with src_port equal to the index of the CPU port, for which
* there is no user registered. So don't print any error
* message here (ignore and drop those frames).
*/
+ kfree_skb(skb);
return NULL;
+ }
dsa_default_offload_fwd_mark(skb);
skb->priority = qos_class;
diff --git a/net/dsa/tag_ocelot_8021q.c b/net/dsa/tag_ocelot_8021q.c
index e89d9254e90a..f50f1cd83f16 100644
--- a/net/dsa/tag_ocelot_8021q.c
+++ b/net/dsa/tag_ocelot_8021q.c
@@ -33,30 +33,34 @@ static struct sk_buff *ocelot_defer_xmit(struct dsa_port *dp,
xmit_work_fn = data->xmit_work_fn;
xmit_worker = priv->xmit_worker;
- if (!xmit_work_fn || !xmit_worker)
+ if (!xmit_work_fn || !xmit_worker) {
+ kfree_skb(skb);
return NULL;
+ }
/* PTP over IP packets need UDP checksumming. We may have inherited
* NETIF_F_HW_CSUM from the DSA conduit, but these packets are not sent
* through the DSA conduit, so calculate the checksum here.
*/
- if (skb->ip_summed == CHECKSUM_PARTIAL && skb_checksum_help(skb))
+ if (skb->ip_summed == CHECKSUM_PARTIAL && skb_checksum_help(skb)) {
+ kfree_skb(skb);
return NULL;
+ }
xmit_work = kzalloc_obj(*xmit_work, GFP_ATOMIC);
- if (!xmit_work)
+ if (!xmit_work) {
+ kfree_skb(skb);
return NULL;
+ }
/* Calls felix_port_deferred_xmit in felix.c */
kthread_init_work(&xmit_work->work, xmit_work_fn);
- /* Increase refcount so the kfree_skb in dsa_user_xmit
- * won't really free the packet.
- */
xmit_work->dp = dp;
xmit_work->skb = skb_get(skb);
kthread_queue_work(xmit_worker, &xmit_work->work);
+ kfree_skb(skb);
return NULL;
}
@@ -84,8 +88,10 @@ static struct sk_buff *ocelot_rcv(struct sk_buff *skb,
dsa_8021q_rcv(skb, &src_port, &switch_id, NULL, NULL);
skb->dev = dsa_conduit_find_user(netdev, switch_id, src_port);
- if (!skb->dev)
+ if (!skb->dev) {
+ kfree_skb(skb);
return NULL;
+ }
dsa_default_offload_fwd_mark(skb);
diff --git a/net/dsa/tag_qca.c b/net/dsa/tag_qca.c
index 9e3b429e8b36..510792fbfa92 100644
--- a/net/dsa/tag_qca.c
+++ b/net/dsa/tag_qca.c
@@ -46,16 +46,20 @@ static struct sk_buff *qca_tag_rcv(struct sk_buff *skb, struct net_device *dev)
tagger_data = ds->tagger_data;
- if (unlikely(!pskb_may_pull(skb, QCA_HDR_LEN)))
+ if (unlikely(!pskb_may_pull(skb, QCA_HDR_LEN))) {
+ kfree_skb(skb);
return NULL;
+ }
phdr = dsa_etype_header_pos_rx(skb);
hdr = ntohs(*phdr);
/* Make sure the version is correct */
ver = FIELD_GET(QCA_HDR_RECV_VERSION, hdr);
- if (unlikely(ver != QCA_HDR_VERSION))
+ if (unlikely(ver != QCA_HDR_VERSION)) {
+ kfree_skb(skb);
return NULL;
+ }
/* Get pk type */
pk_type = FIELD_GET(QCA_HDR_RECV_TYPE, hdr);
@@ -64,6 +68,7 @@ static struct sk_buff *qca_tag_rcv(struct sk_buff *skb, struct net_device *dev)
if (pk_type == QCA_HDR_RECV_TYPE_RW_REG_ACK) {
if (likely(tagger_data->rw_reg_ack_handler))
tagger_data->rw_reg_ack_handler(ds, skb);
+ kfree_skb(skb);
return NULL;
}
@@ -71,6 +76,7 @@ static struct sk_buff *qca_tag_rcv(struct sk_buff *skb, struct net_device *dev)
if (pk_type == QCA_HDR_RECV_TYPE_MIB) {
if (likely(tagger_data->mib_autocast_handler))
tagger_data->mib_autocast_handler(ds, skb);
+ kfree_skb(skb);
return NULL;
}
@@ -78,8 +84,10 @@ static struct sk_buff *qca_tag_rcv(struct sk_buff *skb, struct net_device *dev)
port = FIELD_GET(QCA_HDR_RECV_SOURCE_PORT, hdr);
skb->dev = dsa_conduit_find_user(dev, 0, port);
- if (!skb->dev)
+ if (!skb->dev) {
+ kfree_skb(skb);
return NULL;
+ }
/* Remove QCA tag and recalculate checksum */
skb_pull_rcsum(skb, QCA_HDR_LEN);
diff --git a/net/dsa/tag_rtl4_a.c b/net/dsa/tag_rtl4_a.c
index 3cc63eacfa03..9805c56025de 100644
--- a/net/dsa/tag_rtl4_a.c
+++ b/net/dsa/tag_rtl4_a.c
@@ -41,8 +41,10 @@ static struct sk_buff *rtl4a_tag_xmit(struct sk_buff *skb,
u16 out;
/* Pad out to at least 60 bytes */
- if (unlikely(__skb_put_padto(skb, ETH_ZLEN, false)))
+ if (unlikely(__skb_put_padto(skb, ETH_ZLEN, false))) {
+ kfree_skb(skb);
return NULL;
+ }
netdev_dbg(dev, "add realtek tag to package to port %d\n",
dp->index);
@@ -75,8 +77,10 @@ static struct sk_buff *rtl4a_tag_rcv(struct sk_buff *skb,
u8 prot;
u8 port;
- if (unlikely(!pskb_may_pull(skb, RTL4_A_HDR_LEN)))
+ if (unlikely(!pskb_may_pull(skb, RTL4_A_HDR_LEN))) {
+ kfree_skb(skb);
return NULL;
+ }
tag = dsa_etype_header_pos_rx(skb);
p = (__be16 *)tag;
@@ -92,6 +96,7 @@ static struct sk_buff *rtl4a_tag_rcv(struct sk_buff *skb,
prot = (protport >> RTL4_A_PROTOCOL_SHIFT) & 0x0f;
if (prot != RTL4_A_PROTOCOL_RTL8366RB) {
netdev_err(dev, "unknown realtek protocol 0x%01x\n", prot);
+ kfree_skb(skb);
return NULL;
}
port = protport & 0xff;
@@ -99,6 +104,7 @@ static struct sk_buff *rtl4a_tag_rcv(struct sk_buff *skb,
skb->dev = dsa_conduit_find_user(dev, 0, port);
if (!skb->dev) {
netdev_dbg(dev, "could not find user for port %d\n", port);
+ kfree_skb(skb);
return NULL;
}
diff --git a/net/dsa/tag_rtl8_4.c b/net/dsa/tag_rtl8_4.c
index 852c6b88079a..4da3beebef75 100644
--- a/net/dsa/tag_rtl8_4.c
+++ b/net/dsa/tag_rtl8_4.c
@@ -143,8 +143,10 @@ static struct sk_buff *rtl8_4t_tag_xmit(struct sk_buff *skb,
/* Calculate the checksum here if not done yet as trailing tags will
* break either software or hardware based checksum
*/
- if (skb->ip_summed == CHECKSUM_PARTIAL && skb_checksum_help(skb))
+ if (skb->ip_summed == CHECKSUM_PARTIAL && skb_checksum_help(skb)) {
+ kfree_skb(skb);
return NULL;
+ }
rtl8_4_write_tag(skb, dev, skb_put(skb, RTL8_4_TAG_LEN));
@@ -201,11 +203,15 @@ static int rtl8_4_read_tag(struct sk_buff *skb, struct net_device *dev,
static struct sk_buff *rtl8_4_tag_rcv(struct sk_buff *skb,
struct net_device *dev)
{
- if (unlikely(!pskb_may_pull(skb, RTL8_4_TAG_LEN)))
+ if (unlikely(!pskb_may_pull(skb, RTL8_4_TAG_LEN))) {
+ kfree_skb(skb);
return NULL;
+ }
- if (unlikely(rtl8_4_read_tag(skb, dev, dsa_etype_header_pos_rx(skb))))
+ if (unlikely(rtl8_4_read_tag(skb, dev, dsa_etype_header_pos_rx(skb)))) {
+ kfree_skb(skb);
return NULL;
+ }
/* Remove tag and recalculate checksum */
skb_pull_rcsum(skb, RTL8_4_TAG_LEN);
@@ -218,14 +224,20 @@ static struct sk_buff *rtl8_4_tag_rcv(struct sk_buff *skb,
static struct sk_buff *rtl8_4t_tag_rcv(struct sk_buff *skb,
struct net_device *dev)
{
- if (skb_linearize(skb))
+ if (skb_linearize(skb)) {
+ kfree_skb(skb);
return NULL;
+ }
- if (unlikely(rtl8_4_read_tag(skb, dev, skb_tail_pointer(skb) - RTL8_4_TAG_LEN)))
+ if (unlikely(rtl8_4_read_tag(skb, dev, skb_tail_pointer(skb) - RTL8_4_TAG_LEN))) {
+ kfree_skb(skb);
return NULL;
+ }
- if (pskb_trim_rcsum(skb, skb->len - RTL8_4_TAG_LEN))
+ if (pskb_trim_rcsum(skb, skb->len - RTL8_4_TAG_LEN)) {
+ kfree_skb(skb);
return NULL;
+ }
return skb;
}
diff --git a/net/dsa/tag_rzn1_a5psw.c b/net/dsa/tag_rzn1_a5psw.c
index 10994b3470f6..df0098513f3e 100644
--- a/net/dsa/tag_rzn1_a5psw.c
+++ b/net/dsa/tag_rzn1_a5psw.c
@@ -48,7 +48,7 @@ static struct sk_buff *a5psw_tag_xmit(struct sk_buff *skb, struct net_device *de
* least 60 bytes otherwise they will be discarded when they enter the
* switch port logic.
*/
- if (__skb_put_padto(skb, ETH_ZLEN, false))
+ if (skb_put_padto(skb, ETH_ZLEN))
return NULL;
/* provide 'A5PSW_TAG_LEN' bytes additional space */
@@ -77,6 +77,7 @@ static struct sk_buff *a5psw_tag_rcv(struct sk_buff *skb,
if (unlikely(!pskb_may_pull(skb, A5PSW_TAG_LEN))) {
dev_warn_ratelimited(&dev->dev,
"Dropping packet, cannot pull\n");
+ kfree_skb(skb);
return NULL;
}
@@ -84,14 +85,17 @@ static struct sk_buff *a5psw_tag_rcv(struct sk_buff *skb,
if (tag->ctrl_tag != htons(ETH_P_DSA_A5PSW)) {
dev_warn_ratelimited(&dev->dev, "Dropping packet due to invalid TAG marker\n");
+ kfree_skb(skb);
return NULL;
}
port = FIELD_GET(A5PSW_CTRL_DATA_PORT, ntohs(tag->ctrl_data));
skb->dev = dsa_conduit_find_user(dev, 0, port);
- if (!skb->dev)
+ if (!skb->dev) {
+ kfree_skb(skb);
return NULL;
+ }
skb_pull_rcsum(skb, A5PSW_TAG_LEN);
dsa_strip_etype_header(skb, A5PSW_TAG_LEN);
diff --git a/net/dsa/tag_sja1105.c b/net/dsa/tag_sja1105.c
index de6d4ce8668b..bfe1f746f55b 100644
--- a/net/dsa/tag_sja1105.c
+++ b/net/dsa/tag_sja1105.c
@@ -149,19 +149,20 @@ static struct sk_buff *sja1105_defer_xmit(struct dsa_port *dp,
xmit_work_fn = tagger_data->xmit_work_fn;
xmit_worker = priv->xmit_worker;
- if (!xmit_work_fn || !xmit_worker)
+ if (!xmit_work_fn || !xmit_worker) {
+ kfree_skb(skb);
return NULL;
+ }
xmit_work = kzalloc_obj(*xmit_work, GFP_ATOMIC);
- if (!xmit_work)
+ if (!xmit_work) {
+ kfree_skb(skb);
return NULL;
+ }
kthread_init_work(&xmit_work->work, xmit_work_fn);
- /* Increase refcount so the kfree_skb in dsa_user_xmit
- * won't really free the packet.
- */
xmit_work->dp = dp;
- xmit_work->skb = skb_get(skb);
+ xmit_work->skb = skb;
kthread_queue_work(xmit_worker, &xmit_work->work);
@@ -401,10 +402,7 @@ static struct sk_buff
kfree_skb(priv->stampable_skb);
}
- /* Hold a reference to avoid dsa_switch_rcv
- * from freeing the skb.
- */
- priv->stampable_skb = skb_get(skb);
+ priv->stampable_skb = skb;
spin_unlock(&priv->meta_lock);
/* Tell DSA we got nothing */
@@ -436,6 +434,7 @@ static struct sk_buff
dev_err_ratelimited(ds->dev,
"Unexpected meta frame\n");
spin_unlock(&priv->meta_lock);
+ kfree_skb(skb);
return NULL;
}
@@ -443,6 +442,7 @@ static struct sk_buff
dev_err_ratelimited(ds->dev,
"Meta frame on wrong port\n");
spin_unlock(&priv->meta_lock);
+ kfree_skb(skb);
return NULL;
}
@@ -501,18 +501,21 @@ static struct sk_buff *sja1105_rcv(struct sk_buff *skb,
/* Normal data plane traffic and link-local frames are tagged with
* a tag_8021q VLAN which we have to strip
*/
- if (sja1105_skb_has_tag_8021q(skb))
+ if (sja1105_skb_has_tag_8021q(skb)) {
dsa_8021q_rcv(skb, &source_port, &switch_id, &vbid, &vid);
- else if (source_port == -1 && switch_id == -1)
+ } else if (source_port == -1 && switch_id == -1) {
/* Packets with no source information have no chance of
* getting accepted, drop them straight away.
*/
+ kfree_skb(skb);
return NULL;
+ }
skb->dev = dsa_tag_8021q_find_user(netdev, source_port, switch_id,
vid, vbid);
if (!skb->dev) {
netdev_warn(netdev, "Couldn't decode source port\n");
+ kfree_skb(skb);
return NULL;
}
@@ -539,12 +542,15 @@ static struct sk_buff *sja1110_rcv_meta(struct sk_buff *skb, u16 rx_header)
if (!ds) {
net_err_ratelimited("%s: cannot find switch id %d\n",
conduit->name, switch_id);
+ kfree_skb(skb);
return NULL;
}
tagger_data = sja1105_tagger_data(ds);
- if (!tagger_data->meta_tstamp_handler)
+ if (!tagger_data->meta_tstamp_handler) {
+ kfree_skb(skb);
return NULL;
+ }
for (i = 0; i <= n_ts; i++) {
u8 ts_id, source_port, dir;
@@ -562,6 +568,7 @@ static struct sk_buff *sja1110_rcv_meta(struct sk_buff *skb, u16 rx_header)
}
/* Discard the meta frame, we've consumed the timestamps it contained */
+ kfree_skb(skb);
return NULL;
}
@@ -572,8 +579,10 @@ static struct sk_buff *sja1110_rcv_inband_control_extension(struct sk_buff *skb,
{
u16 rx_header;
- if (unlikely(!pskb_may_pull(skb, SJA1110_HEADER_LEN)))
+ if (unlikely(!pskb_may_pull(skb, SJA1110_HEADER_LEN))) {
+ kfree_skb(skb);
return NULL;
+ }
/* skb->data points to skb_mac_header(skb) + ETH_HLEN, which is exactly
* what we need because the caller has checked the EtherType (which is
@@ -609,8 +618,10 @@ static struct sk_buff *sja1110_rcv_inband_control_extension(struct sk_buff *skb,
* padding and trailer we need to account for the fact that
* skb->data points to skb_mac_header(skb) + ETH_HLEN.
*/
- if (pskb_trim_rcsum(skb, start_of_padding - ETH_HLEN))
+ if (pskb_trim_rcsum(skb, start_of_padding - ETH_HLEN)) {
+ kfree_skb(skb);
return NULL;
+ }
/* Trap-to-host frame, no timestamp trailer */
} else {
*source_port = SJA1110_RX_HEADER_SRC_PORT(rx_header);
@@ -653,6 +664,7 @@ static struct sk_buff *sja1110_rcv(struct sk_buff *skb,
if (!skb->dev) {
netdev_warn(netdev, "Couldn't decode source port\n");
+ kfree_skb(skb);
return NULL;
}
diff --git a/net/dsa/tag_trailer.c b/net/dsa/tag_trailer.c
index 4dce24cfe6a7..49c802c10ca6 100644
--- a/net/dsa/tag_trailer.c
+++ b/net/dsa/tag_trailer.c
@@ -30,22 +30,30 @@ static struct sk_buff *trailer_rcv(struct sk_buff *skb, struct net_device *dev)
u8 *trailer;
int source_port;
- if (skb_linearize(skb))
+ if (skb_linearize(skb)) {
+ kfree_skb(skb);
return NULL;
+ }
trailer = skb_tail_pointer(skb) - 4;
if (trailer[0] != 0x80 || (trailer[1] & 0xf8) != 0x00 ||
- (trailer[2] & 0xef) != 0x00 || trailer[3] != 0x00)
+ (trailer[2] & 0xef) != 0x00 || trailer[3] != 0x00) {
+ kfree_skb(skb);
return NULL;
+ }
source_port = trailer[1] & 7;
skb->dev = dsa_conduit_find_user(dev, 0, source_port);
- if (!skb->dev)
+ if (!skb->dev) {
+ kfree_skb(skb);
return NULL;
+ }
- if (pskb_trim_rcsum(skb, skb->len - 4))
+ if (pskb_trim_rcsum(skb, skb->len - 4)) {
+ kfree_skb(skb);
return NULL;
+ }
return skb;
}
diff --git a/net/dsa/tag_vsc73xx_8021q.c b/net/dsa/tag_vsc73xx_8021q.c
index af121a9aff7f..f4736a1a7a0f 100644
--- a/net/dsa/tag_vsc73xx_8021q.c
+++ b/net/dsa/tag_vsc73xx_8021q.c
@@ -44,6 +44,7 @@ vsc73xx_rcv(struct sk_buff *skb, struct net_device *netdev)
if (!skb->dev) {
dev_warn_ratelimited(&netdev->dev,
"Couldn't decode source port\n");
+ kfree_skb(skb);
return NULL;
}
diff --git a/net/dsa/tag_xrs700x.c b/net/dsa/tag_xrs700x.c
index a05219f702c6..bb268020ee86 100644
--- a/net/dsa/tag_xrs700x.c
+++ b/net/dsa/tag_xrs700x.c
@@ -30,15 +30,21 @@ static struct sk_buff *xrs700x_rcv(struct sk_buff *skb, struct net_device *dev)
source_port = ffs((int)trailer[0]) - 1;
- if (source_port < 0)
+ if (source_port < 0) {
+ kfree_skb(skb);
return NULL;
+ }
skb->dev = dsa_conduit_find_user(dev, 0, source_port);
- if (!skb->dev)
+ if (!skb->dev) {
+ kfree_skb(skb);
return NULL;
+ }
- if (pskb_trim_rcsum(skb, skb->len - 1))
+ if (pskb_trim_rcsum(skb, skb->len - 1)) {
+ kfree_skb(skb);
return NULL;
+ }
/* Frame is forwarded by hardware, don't forward in software. */
dsa_default_offload_fwd_mark(skb);
diff --git a/net/dsa/tag_yt921x.c b/net/dsa/tag_yt921x.c
index f3ced99b1c85..294784ab6694 100644
--- a/net/dsa/tag_yt921x.c
+++ b/net/dsa/tag_yt921x.c
@@ -87,8 +87,10 @@ yt921x_tag_rcv(struct sk_buff *skb, struct net_device *netdev)
__be16 *tag;
u16 rx;
- if (unlikely(!pskb_may_pull(skb, YT921X_TAG_LEN)))
+ if (unlikely(!pskb_may_pull(skb, YT921X_TAG_LEN))) {
+ kfree_skb(skb);
return NULL;
+ }
tag = dsa_etype_header_pos_rx(skb);
@@ -96,6 +98,7 @@ yt921x_tag_rcv(struct sk_buff *skb, struct net_device *netdev)
dev_warn_ratelimited(&netdev->dev,
"Unexpected EtherType 0x%04x\n",
ntohs(tag[0]));
+ kfree_skb(skb);
return NULL;
}
@@ -104,6 +107,7 @@ yt921x_tag_rcv(struct sk_buff *skb, struct net_device *netdev)
if (unlikely((rx & YT921X_TAG_PORT_EN) == 0)) {
dev_warn_ratelimited(&netdev->dev,
"Unexpected rx tag 0x%04x\n", rx);
+ kfree_skb(skb);
return NULL;
}
@@ -112,6 +116,7 @@ yt921x_tag_rcv(struct sk_buff *skb, struct net_device *netdev)
if (unlikely(!skb->dev)) {
dev_warn_ratelimited(&netdev->dev,
"Couldn't decode source port %u\n", port);
+ kfree_skb(skb);
return NULL;
}
diff --git a/net/dsa/user.c b/net/dsa/user.c
index 8704c1a3a5b7..072fa76972cc 100644
--- a/net/dsa/user.c
+++ b/net/dsa/user.c
@@ -935,13 +935,12 @@ static netdev_tx_t dsa_user_xmit(struct sk_buff *skb, struct net_device *dev)
eth_skb_pad(skb);
/* Transmit function may have to reallocate the original SKB,
- * in which case it must have freed it. Only free it here on error.
+ * in which case it must have freed it. Taggers will drop the
+ * passed skb on error.
*/
nskb = p->xmit(skb, dev);
- if (!nskb) {
- kfree_skb(skb);
+ if (!nskb)
return NETDEV_TX_OK;
- }
return dsa_enqueue_skb(nskb, dev);
}
---
base-commit: f34c6b3a3c3d98f34918e1d2ea846a5acccac6d1
change-id: 20260616-dsa-fix-free-skb-bb028ce90802
Best regards,
--
Linus Walleij <linusw@kernel.org>
* Re: [BUG] netdevsim: KASAN slab-use-after-free in ref_tracker_free
From: saeed bishara @ 2026-06-16 9:34 UTC (permalink / raw)
To: Shuangpeng Bai
Cc: netdev, Jakub Kicinski, Andrew Lunn, David S. Miller,
Eric Dumazet, Paolo Abeni, Simon Horman, linux-kernel
In-Reply-To: <178144969601.60470.14764529841344817811@gmail.com>
I tried Gemini; here is its analysis and a suggested fix:
This is a subtle bug. Walking through the Linux network device
refcounting shows why this use-after-free occurs.
The root cause is a classic "Reverse-Teardown Fallacy" colliding with
the kernel's dual-refcount lifecycle for network devices (pcpu_refcnt
vs. kobject refcount).
The Dual-Lifecycle Trap
In modern kernel networking, a struct net_device is kept alive by two
distinct mechanisms that must be carefully orchestrated:
- dev->pcpu_refcnt (Operational Lifecycle): Managed by dev_hold() and
  dev_put(). This tracks active operational references. The
  unregister_netdevice sequence blocks in netdev_run_todo() waiting for
  this to hit zero before it invokes free_netdev().
- dev->dev.kobj (Memory Lifecycle): The underlying memory structure
  manager. free_netdev() drops a reference to this. When the kobject
  refcount hits 0, device_release() is synchronously triggered, and the
  kmalloc-cg-8k slab is actually freed.
The Exact Race Sequence
Look at the teardown order in the error path of sysfs_rtnl_lock():
unbreak:
sysfs_unbreak_active_protection(kn); // LINE 121
dev_put(ndev); // LINE 122
Normally, developers write error paths in the exact reverse order of
initialization. However, in this specific case, adhering to that
convention is fatal.
1. The Setup: sysfs_rtnl_lock starts by calling dev_hold(ndev),
   incrementing the pcpu_refcnt. It then calls
   sysfs_break_active_protection(), which explicitly calls kobject_get(),
   elevating the kobj memory refcount.
2. The Unregister Race: In parallel, another CPU initiates device
   teardown (e.g., echo 1 > del_device on netdevsim). The sysfs entries
   are deleted, and device_del() drops its primary kobj reference. The
   teardown thread hits netdev_run_todo() and blocks, waiting for
   sysfs_rtnl_lock to release its dev_hold.
3. The Lock Fails: sysfs_rtnl_lock fails to get the rtnl lock and jumps
   to unbreak.
4. The Fatal Drop (Line 121): sysfs_unbreak_active_protection(kn)
   executes. It calls kobject_put(). Because the unregistration path
   already dropped the standard kobject references, this is the absolute
   last kobject reference. device_release() is synchronously invoked, and
   the net_device memory is instantly freed to the slab.
5. The KASAN Trigger (Line 122): dev_put(ndev) executes. Under the hood,
   it attempts to call netdev_tracker_free(&ndev->ref_tracker, ...). It
   reads from the ndev structure that was just destroyed microseconds
   earlier. KASAN screams.
To fix this, the reverse-order teardown convention must be deliberately
violated: call dev_put() first, while the kobject reference still pins
the memory, and only then call sysfs_unbreak_active_protection().
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index a1b2c3d4e5f6..7f8e9d0c1b2a 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -118,8 +118,8 @@ static int sysfs_rtnl_lock(struct kobject *kobj, struct attribute *attr,
return 0;
unbreak:
- sysfs_unbreak_active_protection(kn);
dev_put(ndev);
+ sysfs_unbreak_active_protection(kn);
return ret;
}
On Mon, Jun 15, 2026 at 4:18 AM Shuangpeng Bai
<shuangpeng.kernel@gmail.com> wrote:
>
> Hi netdev maintainers,
>
> I hit the following KASAN report while testing an upstream kernel.
>
> The issue was reproduced with netdevsim. I have not confirmed whether this is
> specific to netdevsim or whether other net devices can trigger a similar issue.
>
> The KASAN report shows a slab-use-after-free in ref_tracker_free(), reached from
> sysfs_rtnl_lock() while reading phys_port_name.
>
> I reproduced this on commit: e8c2f9fdadee7cbc75134dc463c1e0d856d6e5c7 (May 25 2026)
>
> To help trigger the bug more reliably, we applied a minimal diagnostic patch
> that only adds delays and print statements.
>
> The reproducer and .config files are here.
> https://gist.github.com/shuangpengbai/b49765d646ec4610917015371aa1c3ca
>
> I'm happy to test debug patches or provide additional information.
>
> Reported-by: Shuangpeng Bai <shuangpeng.kernel@gmail.com>
>
> [ 3145.449971][T17497] BUG: KASAN: slab-use-after-free in ref_tracker_free (lib/ref_tracker.c:295)
> [ 3145.452089][T17497] Read of size 1 at addr ffff888107678598 by task cat/17497
> [ 3145.454439][T17497]
> [ 3145.454977][T17497] Tainted: [W]=WARN
> [ 3145.454980][T17497] Hardware name: QEMU Ubuntu 24.04 PC v2 (i440FX + PIIX, arch_caps fix, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> [ 3145.454985][T17497] Call Trace:
> [ 3145.454991][T17497] <TASK>
> [ 3145.454994][T17497] dump_stack_lvl (lib/dump_stack.c:94 lib/dump_stack.c:120)
> [ 3145.455002][T17497] print_report (mm/kasan/report.c:378 mm/kasan/report.c:482)
> [ 3145.455028][T17497] kasan_report (mm/kasan/report.c:595)
> [ 3145.455046][T17497] ref_tracker_free (lib/ref_tracker.c:295)
> [ 3145.455083][T17497] sysfs_rtnl_lock (include/linux/netdevice.h:4491 include/linux/netdevice.h:4508 include/linux/netdevice.h:4534 net/core/net-sysfs.c:122)
> [ 3145.455091][T17497] phys_port_name_show (net/core/net-sysfs.c:665)
> [ 3145.455118][T17497] dev_attr_show (drivers/base/core.c:2421)
> [ 3145.455128][T17497] sysfs_kf_seq_show (fs/sysfs/file.c:65)
> [ 3145.455135][T17497] seq_read_iter (fs/seq_file.c:231)
> [ 3145.455144][T17497] vfs_read (fs/read_write.c:493 fs/read_write.c:574)
> [ 3145.455169][T17497] ksys_read (fs/read_write.c:717)
> [ 3145.455181][T17497] do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94)
> [ 3145.455188][T17497] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:121)
> [ 3145.455193][T17497] RIP: 0033:0x7fcf098c43ce
> [ 3145.455200][T17497] Code: c0 e9 b6 fe ff ff 50 48 8d 3d 6e 08 0b 00 e8 69 01 02 00 66 0f 1f 84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 5a c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 28
> [ 3145.455204][T17497] RSP: 002b:00007ffd05e76b98 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> [ 3145.455211][T17497] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007fcf098c43ce
> [ 3145.455214][T17497] RDX: 0000000000020000 RSI: 00007fcf095e4000 RDI: 0000000000000003
> [ 3145.455217][T17497] RBP: 00007fcf095e4000 R08: 00007fcf095e3010 R09: 0000000000000000
> [ 3145.455219][T17497] R10: fffffffffffffbc5 R11: 0000000000000246 R12: 0000000000000000
> [ 3145.455222][T17497] R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000
> [ 3145.455227][T17497] </TASK>
> [ 3145.455229][T17497]
> [ 3145.479014][T17497] Freed by task 17497 on cpu 0 at 3145.447575s:
> [ 3145.479559][T17497] kasan_save_track (mm/kasan/common.c:57 mm/kasan/common.c:78)
> [ 3145.479963][T17497] kasan_save_free_info (mm/kasan/generic.c:584)
> [ 3145.480411][T17497] __kasan_slab_free (mm/kasan/common.c:253 mm/kasan/common.c:285)
> [ 3145.480813][T17497] kfree (include/linux/kasan.h:235 mm/slub.c:2689 mm/slub.c:6251 mm/slub.c:6566)
> [ 3145.481148][T17497] device_release (drivers/base/core.c:2542)
> [ 3145.481567][T17497] kobject_put (lib/kobject.c:689 lib/kobject.c:720 include/linux/kref.h:65 lib/kobject.c:737)
> [ 3145.481951][T17497] sysfs_rtnl_lock (net/core/net-sysfs.c:121)
> [ 3145.482351][T17497] phys_port_name_show (net/core/net-sysfs.c:665)
> [ 3145.482782][T17497] dev_attr_show (drivers/base/core.c:2421)
> [ 3145.483154][T17497] sysfs_kf_seq_show (fs/sysfs/file.c:65)
> [ 3145.483586][T17497] seq_read_iter (fs/seq_file.c:231)
> [ 3145.483975][T17497] vfs_read (fs/read_write.c:493 fs/read_write.c:574)
> [ 3145.484334][T17497] ksys_read (fs/read_write.c:717)
> [ 3145.484701][T17497] do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94)
> [ 3145.485092][T17497] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:121)
> [ 3145.485592][T17497]
> [ 3145.485794][T17497] The buggy address belongs to the object at ffff888107678000
> [ 3145.485794][T17497] which belongs to the cache kmalloc-cg-8k of size 8192
> [ 3145.486991][T17497] The buggy address is located 1432 bytes inside of
> [ 3145.486991][T17497] freed 8192-byte region [ffff888107678000, ffff88810767a000)
> [ 3145.488159][T17497]
> [ 3145.488367][T17497] The buggy address belongs to the physical page:
>
>
> Best,
> Shuangpeng
>
* [PATCH bpf v2 2/2] selftests/bpf: Cover partial copy of non-linear test_run output
From: Sun Jian @ 2026-06-16 9:31 UTC (permalink / raw)
To: bpf
Cc: netdev, linux-kselftest, linux-kernel, ast, daniel, andrii,
martin.lau, eddyz87, memxor, song, yonghong.song, jolsa, davem,
edumazet, kuba, pabeni, horms, shuah, hawk, john.fastabend, sdf,
toke, lorenzo, paul.chaignon
In-Reply-To: <20260616093103.471444-1-sun.jian.kdev@gmail.com>
prog_run_opts already verifies that BPF_PROG_TEST_RUN returns -ENOSPC
for a short data_out buffer while still reporting the full output size
through data_size_out.
Add the same coverage for non-linear test_run output. Use pass-through
TC and XDP programs with a 9000-byte packet, a 64-byte linear data area,
and a 100-byte data_out buffer. The expected output spans both the linear
data and the first fragment.
Verify that test_run returns -ENOSPC, reports the full packet length
through data_size_out, and copies the packet prefix into data_out for
both non-linear skb and XDP frags paths.
Signed-off-by: Sun Jian <sun.jian.kdev@gmail.com>
---
.../selftests/bpf/prog_tests/prog_run_opts.c | 72 +++++++++++++++++++
.../selftests/bpf/progs/test_pkt_access.c | 12 ++++
2 files changed, 84 insertions(+)
diff --git a/tools/testing/selftests/bpf/prog_tests/prog_run_opts.c b/tools/testing/selftests/bpf/prog_tests/prog_run_opts.c
index 01f1d1b6715a..71af1ff02023 100644
--- a/tools/testing/selftests/bpf/prog_tests/prog_run_opts.c
+++ b/tools/testing/selftests/bpf/prog_tests/prog_run_opts.c
@@ -4,6 +4,10 @@
#include "test_pkt_access.skel.h"
+#define NONLINEAR_PKT_LEN 9000
+#define NONLINEAR_LINEAR_DATA_LEN 64
+#define SHORT_OUT_LEN 100
+
static const __u32 duration;
static void check_run_cnt(int prog_fd, __u64 run_cnt)
@@ -20,6 +24,71 @@ static void check_run_cnt(int prog_fd, __u64 run_cnt)
"incorrect number of repetitions, want %llu have %llu\n", run_cnt, info.run_cnt);
}
+static void init_pkt(__u8 *pkt, size_t len)
+{
+ size_t i;
+
+ for (i = 0; i < len; i++)
+ pkt[i] = i & 0xff;
+}
+
+static void test_skb_nonlinear_data_out_partial(struct test_pkt_access *skel)
+{
+ LIBBPF_OPTS(bpf_test_run_opts, topts);
+ __u8 pkt[NONLINEAR_PKT_LEN];
+ __u8 out[SHORT_OUT_LEN];
+ struct __sk_buff skb = {};
+ int prog_fd, err;
+
+ init_pkt(pkt, sizeof(pkt));
+ memset(out, 0xa5, sizeof(out));
+
+ skb.data_end = NONLINEAR_LINEAR_DATA_LEN;
+
+ topts.data_in = pkt;
+ topts.data_size_in = sizeof(pkt);
+ topts.data_out = out;
+ topts.data_size_out = sizeof(out);
+ topts.ctx_in = &skb;
+ topts.ctx_size_in = sizeof(skb);
+
+ prog_fd = bpf_program__fd(skel->progs.tc_pass_prog);
+ err = bpf_prog_test_run_opts(prog_fd, &topts);
+
+ ASSERT_EQ(err, -ENOSPC, "skb_nonlinear_partial_err");
+ ASSERT_EQ(topts.data_size_out, sizeof(pkt), "skb_nonlinear_partial_data_size_out");
+ ASSERT_OK(memcmp(out, pkt, sizeof(out)), "skb_nonlinear_partial_data_out");
+}
+
+static void test_xdp_nonlinear_data_out_partial(struct test_pkt_access *skel)
+{
+ LIBBPF_OPTS(bpf_test_run_opts, topts);
+ __u8 pkt[NONLINEAR_PKT_LEN];
+ __u8 out[SHORT_OUT_LEN];
+ struct xdp_md ctx = {};
+ int prog_fd, err;
+
+ init_pkt(pkt, sizeof(pkt));
+ memset(out, 0xa5, sizeof(out));
+
+ ctx.data = 0;
+ ctx.data_end = NONLINEAR_LINEAR_DATA_LEN;
+
+ topts.data_in = pkt;
+ topts.data_size_in = sizeof(pkt);
+ topts.data_out = out;
+ topts.data_size_out = sizeof(out);
+ topts.ctx_in = &ctx;
+ topts.ctx_size_in = sizeof(ctx);
+
+ prog_fd = bpf_program__fd(skel->progs.xdp_frags_pass_prog);
+ err = bpf_prog_test_run_opts(prog_fd, &topts);
+
+ ASSERT_EQ(err, -ENOSPC, "xdp_nonlinear_partial_err");
+ ASSERT_EQ(topts.data_size_out, sizeof(pkt), "xdp_nonlinear_partial_data_size_out");
+ ASSERT_OK(memcmp(out, pkt, sizeof(out)), "xdp_nonlinear_partial_data_out");
+}
+
void test_prog_run_opts(void)
{
struct test_pkt_access *skel;
@@ -69,6 +138,9 @@ void test_prog_run_opts(void)
run_cnt += topts.repeat;
check_run_cnt(prog_fd, run_cnt);
+ test_skb_nonlinear_data_out_partial(skel);
+ test_xdp_nonlinear_data_out_partial(skel);
+
cleanup:
if (skel)
test_pkt_access__destroy(skel);
diff --git a/tools/testing/selftests/bpf/progs/test_pkt_access.c b/tools/testing/selftests/bpf/progs/test_pkt_access.c
index bce7173152c6..cd284401eebd 100644
--- a/tools/testing/selftests/bpf/progs/test_pkt_access.c
+++ b/tools/testing/selftests/bpf/progs/test_pkt_access.c
@@ -150,3 +150,15 @@ int test_pkt_access(struct __sk_buff *skb)
return TC_ACT_UNSPEC;
}
+
+SEC("tc")
+int tc_pass_prog(struct __sk_buff *skb)
+{
+ return TC_ACT_OK;
+}
+
+SEC("xdp.frags")
+int xdp_frags_pass_prog(struct xdp_md *ctx)
+{
+ return XDP_PASS;
+}
--
2.43.0
* [PATCH bpf v2 1/2] bpf: Fix partial copy of non-linear test_run output
From: Sun Jian @ 2026-06-16 9:31 UTC
To: bpf
Cc: netdev, linux-kselftest, linux-kernel, ast, daniel, andrii,
martin.lau, eddyz87, memxor, song, yonghong.song, jolsa, davem,
edumazet, kuba, pabeni, horms, shuah, hawk, john.fastabend, sdf,
toke, lorenzo, paul.chaignon
In-Reply-To: <20260616093103.471444-1-sun.jian.kdev@gmail.com>
For non-linear test_run output, bpf_test_finish() derives the linear
data copy length from copy_size - frag_size. This only matches the
linear data length when copy_size is the full packet size.
When userspace provides a short data_out buffer, copy_size is clamped to
that buffer size. If copy_size is smaller than frag_size, the computed
length becomes negative and bpf_test_finish() returns -ENOSPC before
copying the packet prefix or updating data_size_out.
Compute the linear data length from the packet layout instead, and clamp
the linear copy length to copy_size. This preserves the expected
partial-copy semantics: return -ENOSPC, copy the packet prefix that fits
in data_out, and report the full packet length through data_size_out.
Fixes: 7855e0db150ad ("bpf: test_run: add xdp_shared_info pointer in bpf_test_finish signature")
Signed-off-by: Sun Jian <sun.jian.kdev@gmail.com>
---
net/bpf/test_run.c | 11 ++++-------
1 file changed, 4 insertions(+), 7 deletions(-)
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index 2bc04feadfab..976e8fa31bc9 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -453,19 +453,16 @@ static int bpf_test_finish(const union bpf_attr *kattr,
}
if (data_out) {
- int len = sinfo ? copy_size - frag_size : copy_size;
-
- if (len < 0) {
- err = -ENOSPC;
- goto out;
- }
+ u32 head_len = size - frag_size;
+ u32 len = min(copy_size, head_len);
if (copy_to_user(data_out, data, len))
goto out;
if (sinfo) {
- int i, offset = len;
+ u32 offset = len;
u32 data_len;
+ int i;
for (i = 0; i < sinfo->nr_frags; i++) {
skb_frag_t *frag = &sinfo->frags[i];
--
2.43.0
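The length arithmetic described in the commit message can be sketched as a small userspace model. This is only an illustration under stated assumptions: struct copy_plan, plan_old() and plan_new() are hypothetical names that do not exist in net/bpf/test_run.c, the model covers only the non-linear (sinfo != NULL) case, and the sample values (9000-byte packet, 64-byte linear area, 100-byte data_out) are taken from the selftest in patch 2/2.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative userspace model only; not the kernel code. */
struct copy_plan {
	int err_early;      /* 1 if the logic bails out before copying anything */
	uint32_t head_copy; /* bytes copied from the linear data area */
	uint32_t frag_copy; /* bytes copied from the fragments */
};

static uint32_t min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

/* Old behaviour: linear length derived from the already-clamped copy_size. */
static struct copy_plan plan_old(uint32_t size, uint32_t frag_size,
				 uint32_t copy_size)
{
	struct copy_plan p = {0};
	int len = (int)copy_size - (int)frag_size;

	(void)size; /* the old computation never looked at the packet layout */
	if (len < 0) {
		p.err_early = 1; /* -ENOSPC before any copy or size report */
		return p;
	}
	p.head_copy = (uint32_t)len;
	p.frag_copy = copy_size - p.head_copy;
	return p;
}

/* New behaviour: linear length from the packet layout, clamped to copy_size. */
static struct copy_plan plan_new(uint32_t size, uint32_t frag_size,
				 uint32_t copy_size)
{
	struct copy_plan p = {0};
	uint32_t head_len;

	assert(frag_size <= size);
	assert(copy_size <= size);

	head_len = size - frag_size;
	p.head_copy = min_u32(copy_size, head_len);
	p.frag_copy = copy_size - p.head_copy; /* remainder comes from frags */
	return p;
}
```

With size = 9000, frag_size = 8936 and copy_size = 100, the old computation goes negative and bails out, while the new one copies 64 bytes of linear data followed by 36 bytes from the first fragment.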
* [PATCH bpf v2 0/2] Fix partial copy of non-linear test_run output
From: Sun Jian @ 2026-06-16 9:31 UTC
To: bpf
Cc: netdev, linux-kselftest, linux-kernel, ast, daniel, andrii,
martin.lau, eddyz87, memxor, song, yonghong.song, jolsa, davem,
edumazet, kuba, pabeni, horms, shuah, hawk, john.fastabend, sdf,
toke, lorenzo, paul.chaignon
When BPF_PROG_TEST_RUN returns non-linear output and userspace provides a
short data_out buffer, bpf_test_finish() can return -ENOSPC before copying
the packet prefix or updating data_size_out.
Fix this by deriving the linear copy length from the packet layout rather
than from the already-clamped copy_size. Add selftest coverage for both
non-linear skb and XDP frags paths.
---
Changes in v2:
* Fix the Fixes tag to point to the commit that introduced the shared
non-linear copy-out logic.
* Drop skb-specific wording from the fix commit.
* Move the selftest from skb_load_bytes.c to prog_run_opts.c.
* Add XDP frags coverage in addition to non-linear skb coverage.
v1:
https://lore.kernel.org/bpf/20260615073856.152479-1-sun.jian.kdev@gmail.com/
Tested with:
./test_progs -t prog_run_opts -v
./test_progs -t skb_load_bytes -v
./test_progs -t xdp_pull_data -v
Sun Jian (2):
bpf: Fix partial copy of non-linear test_run output
selftests/bpf: Cover partial copy of non-linear test_run output
net/bpf/test_run.c | 11 ++-
.../selftests/bpf/prog_tests/prog_run_opts.c | 72 +++++++++++++++++++
.../selftests/bpf/progs/test_pkt_access.c | 12 ++++
3 files changed, 88 insertions(+), 7 deletions(-)
Range-diff:
1: 3691b07aa440 ! 1: e5a0c426d4cb bpf: Fix partial copy of non-linear skb test_run output
@@ Metadata
Author: Sun Jian <sun.jian.kdev@gmail.com>
## Commit message ##
- bpf: Fix partial copy of non-linear skb test_run output
+ bpf: Fix partial copy of non-linear test_run output
- For non-linear skbs, bpf_test_finish() derives the linear head copy
- length from copy_size - frag_size. This only matches the skb head length
- when copy_size is the full packet size.
+ For non-linear test_run output, bpf_test_finish() derives the linear
+ data copy length from copy_size - frag_size. This only matches the
+ linear data length when copy_size is the full packet size.
When userspace provides a short data_out buffer, copy_size is clamped to
that buffer size. If copy_size is smaller than frag_size, the computed
length becomes negative and bpf_test_finish() returns -ENOSPC before
copying the packet prefix or updating data_size_out.
- Compute the linear head length from the skb layout instead, and clamp the
- head copy length to copy_size. This preserves the expected partial-copy
- semantics: return -ENOSPC, copy the packet prefix that fits in data_out,
- and report the full packet length through data_size_out.
+ Compute the linear data length from the packet layout instead, and clamp
+ the linear copy length to copy_size. This preserves the expected
+ partial-copy semantics: return -ENOSPC, copy the packet prefix that fits
+ in data_out, and report the full packet length through data_size_out.
- Fixes: 838baa351cee ("bpf: Craft non-linear skbs in BPF_PROG_TEST_RUN")
+ Fixes: 7855e0db150ad ("bpf: test_run: add xdp_shared_info pointer in bpf_test_finish signature")
Signed-off-by: Sun Jian <sun.jian.kdev@gmail.com>
## net/bpf/test_run.c ##
2: 663847520f0b < -: ------------ selftests/bpf: Cover partial copy of non-linear skb test_run output
-: ------------ > 2: 680506532d97 selftests/bpf: Cover partial copy of non-linear test_run output
--
2.43.0
* Re: [PATCH net 1/1] net: smc: fix splice entry lifetime imbalance in smc_rx_splice
From: Dust Li @ 2026-06-16 9:30 UTC
To: Ren Wei, linux-rdma, linux-s390, netdev
Cc: alibuda, sidraya, wenjia, mjambigi, tonylu, guwen, ubraun,
stefan.raspl, davem, yuantan098, zcliangcn, bird, lx24,
d4n.for.sec
In-Reply-To: <192d1b44ed358ca143f44ef167d14153bccc51e9.1781097957.git.d4n.for.sec@gmail.com>
On 2026-06-11 01:54:11, Ren Wei wrote:
>From: Daming Li <d4n.for.sec@gmail.com>
>
>smc_rx_splice() hands candidate pages to splice_to_pipe() without taking
>references for the lifetime of each splice entry first. That breaks the
>splice ownership contract in the VM-backed RMB path.
>
>splice_to_pipe() drops unqueued entries through spd_release(), while
>queued entries are later dropped through the pipe buffer release
>callback. The current code only tries to take page references after the
>splice succeeds, and it derives the number of queued VM pages from a
>mutated offset value. This can underflow page refcounts and trigger a
>use-after-free. It also leaves the socket lifetime imbalanced in the
>multi-page VM case, where one sock_hold() can be followed by multiple
>sock_put() calls.
>
>Fix this by taking the page and socket references for every candidate
>splice entry before calling splice_to_pipe(), and by releasing the
>matching private state, page reference, and socket reference from
>smc_rx_spd_release() for entries that never get queued. This makes the
>SMC splice path follow the normal splice lifetime rules and removes the
>broken post-splice VM page counting entirely.
>
>Fixes: 9014db202cb7 ("smc: add support for splice()")
>Cc: stable@vger.kernel.org
>Reported-by: Yuan Tan <yuantan098@gmail.com>
>Reported-by: Zhengchuan Liang <zcliangcn@gmail.com>
>Reported-by: Xin Liu <bird@lzu.edu.cn>
>Assisted-by: Codex:GPT-5.4
>Co-developed-by: Liu Xiao <lx24@stu.ynu.edu.cn>
>Signed-off-by: Liu Xiao <lx24@stu.ynu.edu.cn>
>Signed-off-by: Daming Li <d4n.for.sec@gmail.com>
>Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
The patch looks good to me; one minor nit below.
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
>---
> net/smc/smc_rx.c | 21 +++++++++++----------
> 1 file changed, 11 insertions(+), 10 deletions(-)
>
>diff --git a/net/smc/smc_rx.c b/net/smc/smc_rx.c
>index c1d9b923938d..88aee0d93597 100644
>--- a/net/smc/smc_rx.c
>+++ b/net/smc/smc_rx.c
>@@ -150,18 +150,23 @@ static const struct pipe_buf_operations smc_pipe_ops = {
> static void smc_rx_spd_release(struct splice_pipe_desc *spd,
> unsigned int i)
> {
>+ struct smc_spd_priv *priv = (struct smc_spd_priv *)spd->partial[i].private;
>+ struct sock *sk = &priv->smc->sk;
>+
>+ kfree(priv);
> put_page(spd->pages[i]);
>+ sock_put(sk);
> }
>
> static int smc_rx_splice(struct pipe_inode_info *pipe, char *src, size_t len,
> struct smc_sock *smc)
> {
> struct smc_link_group *lgr = smc->conn.lgr;
>- int offset = offset_in_page(src);
> struct partial_page *partial;
> struct splice_pipe_desc spd;
> struct smc_spd_priv **priv;
> struct page **pages;
>+ int offset = offset_in_page(src);
Minor nit:
moving int offset = offset_in_page(src) down breaks the existing
reverse-xmas-tree declaration ordering. We keep this style in SMC.
Best regards,
Dust
* Re: net: thunderbolt: tbnet_poll() can overflow skb_shinfo()->frags[]
From: Mika Westerberg @ 2026-06-16 9:25 UTC
To: Maoyi Xie
Cc: Mika Westerberg, Yehezkel Bernat, Andrew Lunn, Jakub Kicinski,
Paolo Abeni, netdev, linux-kernel
In-Reply-To: <178159529251.2170936.1136950368069628844@maoyixie.com>
Hi,
On Tue, Jun 16, 2026 at 03:34:52PM +0800, Maoyi Xie wrote:
> Hi all,
>
> After the recent skb frags[] overflow fixes (t7xx, cdc-phonet, f_phonet), I
> went looking for the same pattern. I think tbnet_poll() in
> drivers/net/thunderbolt/main.c has it too. I would appreciate it if you could
> take a look.
>
> tbnet_poll() reassembles a ThunderboltIP packet that spans several frames into
> one skb. It adds one rx fragment per frame.
>
> skb = net->skb;
> if (!skb) {
> skb = build_skb(...);
> ...
> net->skb = skb;
> } else {
> skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
> page, hdr_size, frame_size,
> TBNET_RX_PAGE_SIZE - hdr_size);
> }
>
> Nothing checks skb_shinfo(skb)->nr_frags against MAX_SKB_FRAGS here. The frame
> count comes from the peer, in the frame header. tbnet_check_frame() only bounds
> it at the start of a packet.
>
> if (frame_count == 0 || frame_count > TBNET_RING_SIZE / 4) {
> net->stats.rx_length_errors++;
> return false;
> }
>
> TBNET_RING_SIZE is 256, so frame_count can be as large as 64. MAX_SKB_FRAGS is 17
> by default. Frame 0 builds the skb and every frame after it adds a fragment, so
> nr_frags can reach 63. Once nr_frags hits MAX_SKB_FRAGS, skb_add_rx_frag() writes
> one entry past skb_shinfo()->frags[]. The frame_size and MTU checks do not stop
> this. With small frames, 64 fragments stay well under TBNET_MAX_MTU.
>
> So a malicious or buggy peer can send a packet with frame_count between 19 and
> 64. The frames only need to increment the way tbnet_check_frame() wants. That
> drives nr_frags past frags[] and overruns skb_shared_info.
I agree this can happen.
> The fix I had in mind mirrors f0813bcd2d9d ("net: wwan: t7xx: fix potential
> skb->frags overflow in RX path") and 600dc40554dc ("net: usb: cdc-phonet: fix
> skb frags[] overflow in rx_complete()"). Add the fragment only while there is
> room, and drop the packet otherwise.
>
> - } else {
> + } else if (skb_shinfo(skb)->nr_frags < MAX_SKB_FRAGS) {
> skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
> page, hdr_size, frame_size,
> TBNET_RX_PAGE_SIZE - hdr_size);
> + } else {
> + net->stats.rx_length_errors++;
> + __free_pages(page, TBNET_RX_PAGE_ORDER);
> + dev_kfree_skb_any(net->skb);
> + net->skb = NULL;
> + continue;
> }
>
> I do not have two Thunderbolt hosts, so this is from reading the code. I can put
> together a focused reproducer if that helps.
>
> Does this look like a real overflow? And is the MAX_SKB_FRAGS guard the right
> place, or would you rather tighten the frame_count bound in tbnet_check_frame()?
> It has been there since the driver was added (e69b6c02b4c3), so it is a stable
> candidate. Happy to send a proper patch once you confirm.
I would prefer to do this in tbnet_check_frame(). Thanks!
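A tighter frame_count bound along those lines could look roughly like the sketch below. It is a userspace model under stated assumptions, not a proposed patch: frame_count_ok() is a hypothetical helper, and the constants simply repeat the values quoted in the report (TBNET_RING_SIZE of 256, MAX_SKB_FRAGS of 17 by default; the latter is configurable in the kernel).

```c
#include <stdbool.h>
#include <stdint.h>

#define TBNET_RING_SIZE 256 /* value quoted in the report */
#define MAX_SKB_FRAGS   17  /* default quoted in the report */

/* Hypothetical helper modelling the start-of-packet frame_count check. */
static bool frame_count_ok(uint32_t frame_count)
{
	uint32_t limit = TBNET_RING_SIZE / 4; /* existing bound */

	/*
	 * Frame 0 becomes the skb head and every later frame adds one
	 * fragment, so a packet of N frames needs N - 1 frags[] slots.
	 */
	if (limit > MAX_SKB_FRAGS + 1)
		limit = MAX_SKB_FRAGS + 1;

	return frame_count != 0 && frame_count <= limit;
}
```

Under these assumptions a packet of up to 18 frames is accepted and anything from 19 frames up is rejected at the start of the packet, before any fragment is added.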
* Re: [PATCH] swiotlb: avoid double copy with swiotlb on tx socket
From: Pedro Falcato @ 2026-06-16 9:20 UTC
To: Luigi Rizzo
Cc: rizzo.unipi, m.szyprowski, robin.murphy, willemb, kuniyu, davem,
edumazet, kuba, pabeni, gregkh, rafael, akpm, david, netdev,
linux-mm, iommu, driver-core, linux-kernel,
Jesper Dangaard Brouer, Ilias Apalodimas
In-Reply-To: <20260615234220.3946885-1-lrizzo@google.com>
(+cc page pool maintainers)
On Mon, Jun 15, 2026 at 11:42:20PM +0000, Luigi Rizzo wrote:
> The use of swiotlb causes an extra data copy on I/O. For tx sockets,
> especially with greedy senders, this has a high chance of happening in
> the softirq handler for tx network interrupts, creating a significant
> performance bottleneck.
>
> Allow tx sockets to allocate socket buffers directly from the bounce
> buffers. This avoids the second copy and removes the above bottleneck.
> The fraction of swiotlb buffers allowed for this feature is set with
> /sys/module/swiotlb/parameters/zerocopy_tx_percent
> (0 means disabled, 90 is the maximum, to avoid persistent I/O failures).
>
> Implementation:
> - define a new page type to unambiguously identify bounce buffers used
> as backing storage for socket buffers
> - modify skb_page_frag_refill to perform the modified allocation
> - modify the destructors __free_frozen_pages(), free_unref_folio() to
> handle those pages and return them to the pool.
>
> The savings are especially visible with fewer queues. In synthetic
> benchmarks, senders with 1-2 queues would cap around 50Gbps with
> conventional swiotlb, and reach over 170Gbps with the feature enabled.
I could be wrong, but I genuinely think that the way to go about this is
using page_pool for regular TX as well. page_pool pages are all dma-mapped
(so whatever swiotlb optimization you want can be done there), and the net
stack already has awareness of these special pages and special skbs, so it
won't Just Return Them back to the page allocator.
Otherwise you can easily go all over the place, and that's just not great.
Also this could possibly benefit setups that use IOMMU as well.
--
Pedro
* Re: [PATCH net-next v5 0/3] airoha: add the capability to configure GDM3/GDM4 as WAN/LAN on demand
From: Lorenzo Bianconi @ 2026-06-16 9:12 UTC
To: Jakub Kicinski
Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Paolo Abeni,
linux-arm-kernel, linux-mediatek, netdev, Madhur Agrawal,
Alexander Lobakin
In-Reply-To: <20260615163713.665271a2@kernel.org>
> On Thu, 11 Jun 2026 23:55:50 +0200 Lorenzo Bianconi wrote:
> > net: airoha: use int instead of atomic_t for qdma users counter
> > net: airoha: refactor QDMA start/stop into reusable helpers
> > net: airoha: defer GDM3/GDM4 WAN mode and GDM2 loopback to QoS offload
>
> only the first patch applies cleanly right now
ack, I will repost missing ones as soon as net-next is open again.
Regards,
Lorenzo
* [PATCH bpf] bpf, sockmap: fix lock inversion between stab->lock and sk_callback_lock
From: Sechang Lim @ 2026-06-16 9:11 UTC
To: John Fastabend, Jakub Sitnicki
Cc: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn,
David S . Miller, Jakub Kicinski, Simon Horman, netdev, bpf,
linux-kernel
sock_map_update_common() and __sock_map_delete() hold stab->lock and call
sock_map_unref() -> sock_map_del_link() under it. sock_map_del_link() takes
sk_callback_lock for write to stop the strparser and verdict, giving the
lock order stab->lock -> sk_callback_lock.
The opposite order comes from an SK_SKB stream parser. On RX,
sk_psock_strp_data_ready() holds sk_callback_lock for read while running
the parser. The verdict redirects the skb to egress, where a sched_cls
program calls bpf_map_delete_elem() on a sockmap, which takes stab->lock:
WARNING: possible circular locking dependency detected
7.1.0-rc6 Not tainted
------------------------------------------------------
syz.9.8824 is trying to acquire lock:
(&stab->lock){+.-.}-{3:3}, at: __sock_map_delete net/core/sock_map.c:421
but task is already holding lock:
(clock-AF_INET){++.-}-{3:3}, at: sk_psock_strp_data_ready net/core/skmsg.c:1173
-> #1 (clock-AF_INET){++.-}-{3:3}:
_raw_write_lock_bh
sock_map_del_link net/core/sock_map.c:167
sock_map_unref net/core/sock_map.c:184
sock_map_update_common net/core/sock_map.c:509
sock_map_update_elem_sys net/core/sock_map.c:588
map_update_elem kernel/bpf/syscall.c:1805
-> #0 (&stab->lock){+.-.}-{3:3}:
_raw_spin_lock_bh
__sock_map_delete net/core/sock_map.c:421
sock_map_delete_elem net/core/sock_map.c:452
bpf_prog_06044d24140080b6
tcx_run net/core/dev.c:4451
sch_handle_egress net/core/dev.c:4541
__dev_queue_xmit net/core/dev.c:4808
...
tcp_bpf_strp_read_sock net/ipv4/tcp_bpf.c:701
strp_data_ready net/strparser/strparser.c:402
sk_psock_strp_data_ready net/core/skmsg.c:1174
tcp_data_queue net/ipv4/tcp_input.c:5661
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
rlock(clock-AF_INET);
lock(&stab->lock);
lock(clock-AF_INET);
lock(&stab->lock);
*** DEADLOCK ***
sk_callback_lock is an rwlock and the established side takes it for write,
so the read side cannot re-enter once a writer is queued.
sock_map_del_link() uses psock->link_lock and sk_callback_lock, not
stab->lock. The socket is removed from the slot with xchg() under
stab->lock, which leaves a single deleter owning it, and its reference is
dropped only by sk_psock_put() in sock_map_unref(). Release stab->lock
right after the xchg() and run sock_map_unref() outside it. Do the same
for the replaced socket in sock_map_update_common(). sock_map_free()
already unrefs without stab->lock.
Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
---
net/core/sock_map.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 99e3789492a0..390bd5ee46d4 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -421,13 +421,13 @@ static int __sock_map_delete(struct bpf_stab *stab, struct sock *sk_test,
spin_lock_bh(&stab->lock);
if (!sk_test || sk_test == *psk)
sk = xchg(psk, NULL);
+ spin_unlock_bh(&stab->lock);
if (likely(sk))
sock_map_unref(sk, psk);
else
err = -EINVAL;
- spin_unlock_bh(&stab->lock);
return err;
}
@@ -505,9 +505,10 @@ static int sock_map_update_common(struct bpf_map *map, u32 idx,
sock_map_add_link(psock, link, map, &stab->sks[idx]);
stab->sks[idx] = sk;
+ spin_unlock_bh(&stab->lock);
+
if (osk)
sock_map_unref(osk, &stab->sks[idx]);
- spin_unlock_bh(&stab->lock);
return 0;
out_unlock:
spin_unlock_bh(&stab->lock);
--
2.43.0
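The "detach under the lock, unref outside it" pattern used by the patch can be illustrated with a single-threaded model. This is a sketch under stated assumptions: struct model_map and the model_*() functions are hypothetical stand-ins for stab->lock, one stab->sks[] slot and sock_map_unref(); the model only records whether the unref step ever runs with the lock held and does not exercise real concurrency.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative model only; none of these names exist in net/core/sock_map.c. */
struct model_map {
	int lock_held;        /* stands in for stab->lock */
	void *slot;           /* stands in for one stab->sks[] entry */
	int unref_under_lock; /* set if unref ever runs with the lock held */
	int unref_calls;
};

static void model_lock(struct model_map *m)
{
	assert(!m->lock_held);
	m->lock_held = 1;
}

static void model_unlock(struct model_map *m)
{
	assert(m->lock_held);
	m->lock_held = 0;
}

/* Stands in for sock_map_unref(), which may take sk_callback_lock. */
static void model_unref(struct model_map *m, void *sk)
{
	(void)sk;
	if (m->lock_held)
		m->unref_under_lock = 1;
	m->unref_calls++;
}

/* Returns 0 on success, -1 if there was nothing to delete. */
static int model_delete(struct model_map *m, void *sk_test)
{
	void *sk = NULL;

	model_lock(m);
	if (!sk_test || sk_test == m->slot) {
		sk = m->slot; /* the kernel uses xchg(psk, NULL) here */
		m->slot = NULL;
	}
	model_unlock(m); /* drop the map lock before touching the socket */

	if (!sk)
		return -1;
	model_unref(m, sk); /* a single deleter owns sk at this point */
	return 0;
}
```

Because the slot is cleared while the lock is held, a second deleter finds it empty and returns early, so only one caller ever reaches the unref step for a given socket.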
* Re: [PATCH] vhost/net: fix clear_user start address in VHOST_GET_FEATURES_ARRAY
From: rom.wang @ 2026-06-16 9:01 UTC
To: r4o5m6e8o
Cc: eperezma, jasowang, kvm, linux-kernel, mst, netdev, pabeni,
virtualization, wangyufeng
In-Reply-To: <20260526080336.61296-1-r4o5m6e8o@163.com>
Gentle ping. Any comments on this patch?
Thanks
Yufeng Wang
* Re: [PATCH net] net: dsa: Fix skb ownership in taggers
From: Linus Walleij @ 2026-06-16 9:01 UTC
To: Jakub Kicinski
Cc: Andrew Lunn, Vladimir Oltean, David S. Miller, Eric Dumazet,
Paolo Abeni, Simon Horman, Florian Fainelli, Jonas Gorski,
Hauke Mehrtens, Kurt Kanzenbach, Woojung Huh, UNGLinuxDriver,
Chester A. Unal, Daniel Golle, Matthias Brugger,
AngeloGioacchino Del Regno, Wei Fang, Clark Wang,
Clément Léger, George McCollister, David Yang, netdev,
Sashiko AI Review
In-Reply-To: <20260615180113.13fca89f@kernel.org>
On Tue, Jun 16, 2026 at 3:01 AM Jakub Kicinski <kuba@kernel.org> wrote:
> Impressive. Thanks a lot for doing this.
The grunt work was AI assisted actually, I just took a deep breath
and jumped in on the deep end and vibe coded it. If AI finds the bug
AI should help fixing it...
> patchwork says it doesn't apply to net. Is it on top of net or net-next?
It's on net-next so that it covers the new tagger. Since the merge window
had opened, I guessed it needed to be tagged "net" at this point.
> Since the merge window started already net-next is probably better but
> you need to designate in the subject correctly. Feel free to repost
> without the 24h wait, maybe we can still slip this into our main PR.
OK, I will fix the issues pointed out by Wei and Quingfang and
repost tagged for net-next.
Yours,
Linus Walleij
* Re: [PATCH] nfc: fdp: reject an oversized device-reported packet length
From: Simon Horman @ 2026-06-16 9:00 UTC
To: hexlabsecurity
Cc: David Heidelberg, linux-kernel, Robert Dolca, netdev,
oe-linux-nfc, Samuel Ortiz, Kang Chen
In-Reply-To: <20260615-b4-disp-f42dce2d-v1-1-186ff3dcbf37@proton.me>
On Mon, Jun 15, 2026 at 03:04:02AM -0500, Bryam Vargas via B4 Relay wrote:
> From: Bryam Vargas <hexlabsecurity@proton.me>
>
> fdp_nci_i2c_read() reads the length of the next packet from the device
> into phy->next_read_size and uses it as the i2c_master_recv() byte count
> into a fixed on-stack buffer:
>
> u8 tmp[FDP_NCI_I2C_MAX_PAYLOAD]; /* 261 bytes */
> ...
> len = phy->next_read_size;
> r = i2c_master_recv(client, tmp, len);
>
> When a "length packet" arrives (tmp[0] == 0 && tmp[1] == 0), the next
> length is taken verbatim from two device-supplied bytes:
>
> phy->next_read_size = (tmp[2] << 8) + tmp[3] + 3;
>
> next_read_size is a u16, so this can be driven as high as 65535 - far
> larger than the 261-byte tmp[] buffer - and it is never bounded before
> the next iteration's i2c_master_recv(). A malfunctioning, malicious or
> counterfeit FDP NFC controller (or an attacker tampering with the I2C
> bus) that sends such a length packet makes i2c_master_recv() write up to
> about 64 KB into the 261-byte on-stack buffer: a stack out-of-bounds
> write that clobbers the stack canary, saved registers and the return
> address.
>
> Reject a next_read_size larger than the receive buffer the same way a
> corrupted packet is already handled - drop it and force resynchronization
> - so a device can never drive an over-length read.
>
> Fixes: a06347c04c13 ("NFC: Add Intel Fields Peak NFC solution driver")
> Cc: stable@vger.kernel.org
> Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
> ---
> I reproduced the out-of-bounds write with an in-kernel test that drives
> the fdp_nci_i2c_read() buffer geometry verbatim under KASAN
> (CONFIG_KASAN_STACK=y), modelling i2c_master_recv() delivering
> next_read_size device bytes into the 261-byte tmp[] buffer:
>
> next_read_size = 281, no bound:
> BUG: KASAN: stack-out-of-bounds in i2c_master_recv...
> Write of size 281 ... [48, 309) 'tmp' (the 261-byte buffer)
> with the device length bounded to <= FDP_NCI_I2C_MAX_PAYLOAD (what this
> patch enforces): no KASAN report.
> a well-formed packet (length <= 261) is unaffected, no KASAN report.
>
> The full device range - next_read_size = 65535 (tmp[2] = 0xff,
> tmp[3] = 0xfc; the u16 field truncates the + 3), a 65535-byte write =
> 65274 bytes past the buffer, smashing the stack canary and the return
> address - reproduces the same way under userspace AddressSanitizer on
> both -m32 and -m64.
> ---
> drivers/nfc/fdp/i2c.c | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
> diff --git a/drivers/nfc/fdp/i2c.c b/drivers/nfc/fdp/i2c.c
> index c1896a1d978c..0392bb49bb4b 100644
> --- a/drivers/nfc/fdp/i2c.c
> +++ b/drivers/nfc/fdp/i2c.c
> @@ -166,6 +166,20 @@ static int fdp_nci_i2c_read(struct fdp_i2c_phy *phy, struct sk_buff **skb)
> /* Packet that contains a length */
> if (tmp[0] == 0 && tmp[1] == 0) {
> phy->next_read_size = (tmp[2] << 8) + tmp[3] + 3;
Thanks Bryam,
I agree with your analysis regarding overrunning tmp and that the
fix for that is correct.
However, I am concerned that the code also expects next_read_size
to always be at least FDP_NCI_I2C_MIN_PAYLOAD (5), and smaller
values can be produced if either:
* tmp[2] is 0 and tmp[3] is < 2.
* the addition above overflows 16 bits, e.g. both tmp[2] and tmp[3] are 255.
So I wonder if the check you are adding below should also guard
against phy->next_read_size < FDP_NCI_I2C_MIN_PAYLOAD.
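The two cases above can be checked with a small userspace model of the length computation. This is a sketch under stated assumptions: fdp_next_read_size() and fdp_len_ok() are hypothetical helpers rather than driver functions, and the constants repeat the values mentioned in this thread (5 and 261).

```c
#include <stdbool.h>
#include <stdint.h>

#define FDP_NCI_I2C_MIN_PAYLOAD 5   /* value mentioned in this thread */
#define FDP_NCI_I2C_MAX_PAYLOAD 261 /* size of tmp[] per the patch description */

/* Models the u16 assignment: (tmp[2] << 8) + tmp[3] + 3, truncated to 16 bits. */
static uint16_t fdp_next_read_size(uint8_t hi, uint8_t lo)
{
	return (uint16_t)((hi << 8) + lo + 3);
}

/* Hypothetical combined check covering both the upper and the lower bound. */
static bool fdp_len_ok(uint16_t next_read_size)
{
	return next_read_size >= FDP_NCI_I2C_MIN_PAYLOAD &&
	       next_read_size <= FDP_NCI_I2C_MAX_PAYLOAD;
}
```

In this model tmp[2] = 0, tmp[3] = 1 yields 4, and tmp[2] = tmp[3] = 255 wraps to 2; both pass an upper-bound-only check but fail the combined one.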
> +
> + /*
> + * next_read_size is taken from the device and is used
> + * as the i2c_master_recv() count on the next iteration.
> + * A value larger than the receive buffer would overflow
> + * tmp[]; treat it like a corrupted packet and force
> + * resynchronization.
> + */
> + if (phy->next_read_size > FDP_NCI_I2C_MAX_PAYLOAD) {
> + dev_dbg(&client->dev, "%s: corrupted packet\n",
> + __func__);
> + phy->next_read_size = FDP_NCI_I2C_MIN_PAYLOAD;
> + goto flush;
> + }
> } else {
> phy->next_read_size = FDP_NCI_I2C_MIN_PAYLOAD;
>
>
> ---
> base-commit: 8e65320d91cdc3b241d4b94855c88459b91abf66
> change-id: 20260615-b4-disp-f42dce2d-055035ea37ba
>
> Best regards,
> --
> Bryam Vargas <hexlabsecurity@proton.me>
>
>
* RE: [Intel-wired-lan] [PATCH net-next] i40e: add devlink parameter for Flow Director ATR sample rate
From: Kwapulinski, Piotr @ 2026-06-16 8:53 UTC
To: mheib@redhat.com, intel-wired-lan@lists.osuosl.org
Cc: netdev@vger.kernel.org, jiri@resnulli.us, davem@davemloft.net,
edumazet@google.com, kuba@kernel.org, pabeni@redhat.com,
horms@kernel.org, corbet@lwn.net, Nguyen, Anthony L,
Kitszel, Przemyslaw, andrew+netdev@lunn.ch
In-Reply-To: <20260614161131.192068-1-mheib@redhat.com>
>-----Original Message-----
>From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On Behalf Of mheib@redhat.com
>Sent: Sunday, June 14, 2026 6:12 PM
>To: intel-wired-lan@lists.osuosl.org
>Cc: netdev@vger.kernel.org; jiri@resnulli.us; davem@davemloft.net; edumazet@google.com; kuba@kernel.org; pabeni@redhat.com; horms@kernel.org; corbet@lwn.net; Nguyen, Anthony L <anthony.l.nguyen@intel.com>; Kitszel, Przemyslaw <przemyslaw.kitszel@intel.com>; andrew+netdev@lunn.ch; Mohammad Heib <mheib@redhat.com>
>Subject: [Intel-wired-lan] [PATCH net-next] i40e: add devlink parameter for Flow Director ATR sample rate
>
>From: Mohammad Heib <mheib@redhat.com>
>
>The i40e driver uses Flow Director ATR to periodically update flow steering information for active TCP flows. The update frequency is currently controlled by I40E_DEFAULT_ATR_SAMPLE_RATE and is fixed at driver build time.
>
>On systems with a large number of queues and high-rate TCP workloads, the default sampling interval can result in frequent Flow Director reprogramming for long-lived flows.
>
>The amount of TCP packet reordering observed on some systems is sensitive to the ATR sampling interval. Increasing the interval reduces Flow Director programming activity and can significantly reduce the associated reordering.
>
>Since the optimal sampling interval depends on the workload and system configuration, a single fixed value is not suitable for all deployments.
>
>Add a devlink parameter to allow administrators to tune the ATR sample rate at runtime without rebuilding the driver or disabling ATR functionality entirely.
>
>Signed-off-by: Mohammad Heib <mheib@redhat.com>
>---
> Documentation/networking/devlink/i40e.rst | 19 ++++++
> drivers/net/ethernet/intel/i40e/i40e.h | 1 +
> .../net/ethernet/intel/i40e/i40e_devlink.c | 65 +++++++++++++++++++
> drivers/net/ethernet/intel/i40e/i40e_main.c | 4 +-
> drivers/net/ethernet/intel/i40e/i40e_txrx.h | 4 +-
> 5 files changed, 90 insertions(+), 3 deletions(-)
>
>diff --git a/Documentation/networking/devlink/i40e.rst b/Documentation/networking/devlink/i40e.rst
>index 51c887f0dc83..704469aa9acf 100644
>--- a/Documentation/networking/devlink/i40e.rst
>+++ b/Documentation/networking/devlink/i40e.rst
>@@ -40,6 +40,25 @@ Parameters
>
> The default value is ``0`` (internal calculation is used).
>
>+.. list-table:: Driver specific parameters implemented
>+ :widths: 5 5 90
>+
>+ * - Name
>+ - Mode
>+ - Description
>+ * - ``atr_sample_rate``
>+ - runtime
>+ - Controls how frequently Flow Director ATR updates flow steering
>+ information for active TCP flows.
>+
>+ ATR programs Flow Director entries based on sampled transmitted
>+ packets. The sampling interval is specified as the number of
>+ transmitted packets between ATR updates.
>+
>+ Lower values increase Flow Director programming activity, while
>+ higher values reduce the update frequency.
>+
>+ The default value is ``20``.
>
> Info versions
> =============
>diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
>index 1b6a8fbaa648..88eb40ee45f0 100644
>--- a/drivers/net/ethernet/intel/i40e/i40e.h
>+++ b/drivers/net/ethernet/intel/i40e/i40e.h
>@@ -487,6 +487,7 @@ struct i40e_pf {
> u16 rss_size_max; /* HW defined max RSS queues */
> u16 fdir_pf_filter_count; /* num of guaranteed filters for this PF */
> u16 num_alloc_vsi; /* num VSIs this driver supports */
>+ u32 atr_sample_rate;
> bool wol_en;
>
> struct hlist_head fdir_filter_list;
>diff --git a/drivers/net/ethernet/intel/i40e/i40e_devlink.c b/drivers/net/ethernet/intel/i40e/i40e_devlink.c
>index 229179ccc131..16e51762db45 100644
>--- a/drivers/net/ethernet/intel/i40e/i40e_devlink.c
>+++ b/drivers/net/ethernet/intel/i40e/i40e_devlink.c
>@@ -33,12 +33,77 @@ static int i40e_max_mac_per_vf_get(struct devlink *devlink,
> return 0;
> }
>
>+static int i40e_atr_sample_rate_set(struct devlink *devlink,
>+ u32 id,
>+ struct devlink_param_gset_ctx *ctx,
>+ struct netlink_ext_ack *extack) {
>+ struct i40e_pf *pf = devlink_priv(devlink);
>+ struct i40e_vsi *vsi;
>+ u32 sample_rate = ctx->val.vu32;
>+ int i;
Please keep the declarations in reverse Christmas tree (RCT) order and declare 'i' inside the for loop.
Thank you.
Piotr
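
For readers unfamiliar with the convention, here is a minimal userspace sketch of what the request amounts to: local declarations ordered longest line first, and the loop counter declared in the for statement. The names ring_model, vsi_model and model_set_sample_rate are hypothetical stand-ins, not i40e code, and this is not the revised patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the per-ring and per-VSI driver state. */
struct ring_model {
	uint32_t atr_sample_rate;
	uint32_t atr_count;
};

struct vsi_model {
	int num_queue_pairs;
	struct ring_model **tx_rings;
};

/* Apply a new sample rate to every allocated ring and restart its counter. */
static void model_set_sample_rate(struct vsi_model *vsi, uint32_t sample_rate)
{
	/* Reverse Christmas tree: longest declaration line first. */
	struct ring_model **rings = vsi->tx_rings;
	int num = vsi->num_queue_pairs;

	/* The counter is scoped to the loop, so it needs no separate declaration. */
	for (int i = 0; i < num; i++) {
		if (!rings[i])
			continue;
		rings[i]->atr_sample_rate = sample_rate;
		rings[i]->atr_count = 0;
	}
}
```

Applied to the quoted function, the same idea would presumably mean moving the 'u32 sample_rate' line above 'struct i40e_vsi *vsi' and dropping the standalone 'int i'.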
>+
>+ pf->atr_sample_rate = sample_rate;
>+
>+ if (!test_bit(I40E_FLAG_FD_ATR_ENA, pf->flags))
>+ return 0;
>+
>+ vsi = i40e_pf_get_main_vsi(pf);
>+ if (!vsi)
>+ return 0;
>+
>+ for (i = 0; i < vsi->num_queue_pairs; i++) {
>+ if (!vsi->tx_rings[i])
>+ continue;
>+ vsi->tx_rings[i]->atr_sample_rate = sample_rate;
>+ vsi->tx_rings[i]->atr_count = 0;
>+ }
>+
>+ return 0;
>+}
>+
>+static int i40e_atr_sample_rate_get(struct devlink *devlink,
>+ u32 id,
>+ struct devlink_param_gset_ctx *ctx,
>+ struct netlink_ext_ack *extack) {
>+ struct i40e_pf *pf = devlink_priv(devlink);
>+
>+ ctx->val.vu32 = pf->atr_sample_rate;
>+
>+ return 0;
>+}
>+
>+static int i40e_atr_sample_rate_validate(struct devlink *devlink, u32 id,
>+ union devlink_param_value val,
>+ struct netlink_ext_ack *extack)
>+{
>+ if (!val.vu32) {
>+ NL_SET_ERR_MSG_MOD(extack,
>+ "ATR sample rate must be greater than 0");
>+ return -EINVAL;
>+ }
>+ return 0;
>+}
>+
>+enum i40e_dl_param_id {
>+ I40E_DEVLINK_PARAM_ID_BASE = DEVLINK_PARAM_GENERIC_ID_MAX,
>+ I40E_DEVLINK_PARAM_ID_ATR_SAMPLE_RATE,
>+};
>+
> static const struct devlink_param i40e_dl_params[] = {
> DEVLINK_PARAM_GENERIC(MAX_MAC_PER_VF,
> BIT(DEVLINK_PARAM_CMODE_RUNTIME),
> i40e_max_mac_per_vf_get,
> i40e_max_mac_per_vf_set,
> NULL),
>+ DEVLINK_PARAM_DRIVER(I40E_DEVLINK_PARAM_ID_ATR_SAMPLE_RATE,
>+ "atr_sample_rate",
>+ DEVLINK_PARAM_TYPE_U32,
>+ BIT(DEVLINK_PARAM_CMODE_RUNTIME),
>+ i40e_atr_sample_rate_get,
>+ i40e_atr_sample_rate_set,
>+ i40e_atr_sample_rate_validate),
> };
>
> static void i40e_info_get_dsn(struct i40e_pf *pf, char *buf, size_t len)
>diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
>index d59750c490f4..9c8144970a34 100644
>--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
>+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
>@@ -3458,7 +3458,7 @@ static int i40e_configure_tx_ring(struct i40e_ring *ring)
>
> /* some ATR related tx ring init */
> if (test_bit(I40E_FLAG_FD_ATR_ENA, vsi->back->flags)) {
>- ring->atr_sample_rate = I40E_DEFAULT_ATR_SAMPLE_RATE;
>+ ring->atr_sample_rate = vsi->back->atr_sample_rate;
> ring->atr_count = 0;
> } else {
> ring->atr_sample_rate = 0;
>@@ -12745,6 +12745,8 @@ static int i40e_sw_init(struct i40e_pf *pf)
> }
> }
>
>+ pf->atr_sample_rate = I40E_DEFAULT_ATR_SAMPLE_RATE;
>+
> if ((pf->hw.func_caps.fd_filters_guaranteed > 0) ||
> (pf->hw.func_caps.fd_filters_best_effort > 0)) {
> set_bit(I40E_FLAG_FD_ATR_ENA, pf->flags);
>diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
>index bb741ff3e5f2..7e29e9244c3a 100644
>--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
>+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
>@@ -372,8 +372,8 @@ struct i40e_ring {
> u16 next_to_clean;
> u16 xdp_tx_active;
>
>- u8 atr_sample_rate;
>- u8 atr_count;
>+ u32 atr_sample_rate;
>+ u32 atr_count;
>
> bool ring_active; /* is ring online or not */
> bool arm_wb; /* do something to arm write back */
>--
>2.53.0
>
^ permalink raw reply
* Re: [PATCH net] net: psample: fix info leak in PSAMPLE_ATTR_DATA
From: Jiri Pirko @ 2026-06-16 8:44 UTC (permalink / raw)
To: Jakub Kicinski
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
Weiming Shi, yotam.gi, jhs
In-Reply-To: <20260616003046.1099490-1-kuba@kernel.org>
Tue, Jun 16, 2026 at 02:30:46AM +0200, kuba@kernel.org wrote:
>psample open codes nla_put() presumably to avoid wiping
>the data with 0s just to override it with packet data.
>This open coding is missing clearing the pad, however,
>each netlink attr is padded to 4B and data_len may
>not be divisible by 4B.
>
>Fixes: 6ae0a6286171 ("net: Introduce psample, a new genetlink channel for packet sampling")
>Reported-by: Weiming Shi <bestswngs@gmail.com>
>Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
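
To illustrate the padding issue described in the commit message, here is a minimal userspace sketch, not the actual fix. It assumes a netlink-style attribute layout (a 4-byte header of u16 length plus u16 type, with each attribute padded to a 4-byte boundary). The names attr_hdr, ATTR_ALIGN and put_attr_zero_pad are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ATTR_ALIGNTO	4u
#define ATTR_ALIGN(len)	(((len) + ATTR_ALIGNTO - 1) & ~(ATTR_ALIGNTO - 1))
#define ATTR_HDRLEN	4u	/* u16 length + u16 type */

struct attr_hdr {
	uint16_t len;	/* header + payload, excluding the pad */
	uint16_t type;
};

/*
 * Write one attribute without pre-zeroing the payload area, then clear only
 * the trailing pad bytes. Returns the bytes consumed, pad included. The
 * caller must ensure the header plus payload length fits in 16 bits.
 */
static size_t put_attr_zero_pad(unsigned char *buf, uint16_t type,
				const void *data, uint16_t data_len)
{
	struct attr_hdr hdr = {
		.len = (uint16_t)(ATTR_HDRLEN + data_len),
		.type = type,
	};
	size_t total = ATTR_ALIGN((size_t)hdr.len);

	memcpy(buf, &hdr, sizeof(hdr));
	memcpy(buf + ATTR_HDRLEN, data, data_len);
	/* Without this step, up to 3 stale bytes would reach the receiver. */
	memset(buf + hdr.len, 0, total - hdr.len);
	return total;
}
```

With a 5-byte payload the attribute occupies 12 bytes, and the last 3 are the pad that an open-coded copy can leave uninitialized.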
^ permalink raw reply
* Re: [PATCH] swiotlb: avoid double copy with swiotlb on tx socket
From: David Hildenbrand (Arm) @ 2026-06-16 8:36 UTC (permalink / raw)
To: Luigi Rizzo, rizzo.unipi, m.szyprowski, robin.murphy, willemb,
kuniyu, davem, edumazet, kuba, pabeni
Cc: gregkh, rafael, akpm, netdev, linux-mm, iommu, driver-core,
linux-kernel
In-Reply-To: <20260615234220.3946885-1-lrizzo@google.com>
On 6/16/26 01:42, Luigi Rizzo wrote:
> The use of swiotlb causes an extra data copy on I/O. For tx sockets,
> especially with greedy senders, this has a high chance of happening in
> the softirq handler for tx network interrupts, creating a significant
> performance bottleneck.
>
> Allow tx sockets to allocate socket buffers directly from the bounce
> buffers. This avoids the second copy and removes the above bottleneck.
> The fraction of swiotlb buffers allowed for this feature is set with
> /sys/module/swiotlb/parameters/zerocopy_tx_percent
> (0 means disabled, 90 is the maximum, to avoid persistent I/O failures).
>
> Implementation:
> - define a new page type to unambiguously identify bounce buffers used
> as backing storage for socket buffers
> - modify skb_page_frag_refill to perform the modified allocation
> - modify the destructors __free_frozen_pages(), free_unref_folio() to
> handle those pages and return them to the pool.
>
> The savings are especially visible with fewer queues. In synthetic
> benchmarks, senders with 1-2 queues would cap around 50Gbps with
> conventional swiotlb, and reach over 170Gbps with the feature enabled.
>
> Signed-off-by: Luigi Rizzo <lrizzo@google.com>
> ---
> drivers/base/core.c | 1 +
> include/linux/netdevice.h | 22 ++++
> include/linux/page-flags.h | 4 +
> include/linux/skbuff.h | 7 +-
> include/linux/swiotlb.h | 74 ++++++++++++
> include/net/sock.h | 29 +++++
> kernel/dma/swiotlb.c | 227 +++++++++++++++++++++++++++++++++++++
> mm/page_alloc.c | 32 ++++++
> net/core/sock.c | 98 ++++++++++++++--
> 9 files changed, 485 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/base/core.c b/drivers/base/core.c
> index bd2ddf2aab505..e1257dea37ba0 100644
> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -3855,6 +3855,7 @@ void device_del(struct device *dev)
> unsigned int noio_flag;
>
> device_lock(dev);
> + swiotlb_device_deleted();
> kill_device(dev);
> device_unlock(dev);
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 0e1e581efc5ac..d7e5929e73c92 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -5368,13 +5368,35 @@ static inline netdev_tx_t __netdev_start_xmit(const struct net_device_ops *ops,
> return ops->ndo_start_xmit(skb, dev);
> }
>
> +struct sock;
> +
> +#ifdef CONFIG_SWIOTLB
> +/* Per-CPU pointer to the socket currently performing transmission.
> + * Used to bridge the networking and DMA layers, allowing the dma_map_page()
> + * path to identify the socket originating the packet and apply SWIOTLB optimizations.
> + */
> +DECLARE_PER_CPU(struct sock *, current_tx_socket);
> +static inline struct sock *__set_current_tx_socket(struct sock *sk)
> +{
> + struct sock *old_sk = this_cpu_read(current_tx_socket);
> +
> + this_cpu_write(current_tx_socket, sk);
> + return old_sk;
> +}
> +#else
> +static inline struct sock *__set_current_tx_socket(struct sock *sk) { return NULL; }
> +#endif
> +
> static inline netdev_tx_t netdev_start_xmit(struct sk_buff *skb, struct net_device *dev,
> struct netdev_queue *txq, bool more)
> {
> const struct net_device_ops *ops = dev->netdev_ops;
> + struct sock *old_sk;
> netdev_tx_t rc;
>
> + old_sk = __set_current_tx_socket(skb->sk);
> rc = __netdev_start_xmit(ops, skb, dev, more);
> + __set_current_tx_socket(old_sk);
> if (rc == NETDEV_TX_OK)
> txq_trans_update(dev, txq);
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 7223f6f4e2b40..0ecbb404038a0 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -923,6 +923,7 @@ enum pagetype {
> PGTY_zsmalloc = 0xf6,
> PGTY_unaccepted = 0xf7,
> PGTY_large_kmalloc = 0xf8,
> + PGTY_zcswiotlb = 0xf9,
>
> PGTY_mapcount_underflow = 0xff
> };
> @@ -1055,6 +1056,9 @@ PAGE_TYPE_OPS(Zsmalloc, zsmalloc, zsmalloc)
> PAGE_TYPE_OPS(Unaccepted, unaccepted, unaccepted)
> PAGE_TYPE_OPS(LargeKmalloc, large_kmalloc, large_kmalloc)
>
> +/* Pages in socket buffers from the swiotlb pool. */
> +PAGE_TYPE_OPS(ZCSwiotlb, zcswiotlb, zcswiotlb)
> +
> /**
> * PageHuge - Determine if the page belongs to hugetlbfs
> * @page: The page to test.
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 3f06254ab1b72..62340909409e5 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -3787,7 +3787,12 @@ static inline void skb_frag_page_copy(skb_frag_t *fragto,
> fragto->netmem = fragfrom->netmem;
> }
>
> -bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio);
> +/* zerocopy swiotlb uses an additional non-null struct sock pointer. */
> +bool __skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio, struct sock *sk);
> +static inline bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio)
> +{
> + return __skb_page_frag_refill(sz, pfrag, prio, NULL);
> +}
>
> /**
> * __skb_frag_dma_map - maps a paged fragment via the DMA API
> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> index 3dae0f592063e..bd2d0e160a9d8 100644
> --- a/include/linux/swiotlb.h
> +++ b/include/linux/swiotlb.h
> @@ -7,8 +7,10 @@
> #include <linux/init.h>
> #include <linux/types.h>
> #include <linux/limits.h>
> +#include <linux/percpu.h>
> #include <linux/spinlock.h>
> #include <linux/workqueue.h>
> +#include <linux/atomic.h>
>
> struct device;
> struct page;
> @@ -122,6 +124,9 @@ struct io_tlb_mem {
> atomic_long_t total_used;
> atomic_long_t used_hiwater;
> atomic_long_t transient_nslabs;
> +#else
> + unsigned long last_used_slots;
> + unsigned long last_used_jiffies;
> #endif
> };
>
> @@ -185,6 +190,69 @@ bool is_swiotlb_active(struct device *dev);
> void __init swiotlb_adjust_size(unsigned long size);
> phys_addr_t default_swiotlb_base(void);
> phys_addr_t default_swiotlb_limit(void);
> +
> +/* Helpers for zerocopy swiotlb. */
> +/* Control allocation fraction. */
> +extern unsigned int swiotlb_zc_tx_percent;
> +
> +/* Track freshness of the leaf device info. */
> +extern atomic_t global_device_serial;
> +
> +static inline u32 swiotlb_get_device_serial(void)
> +{
> + return atomic_read(&global_device_serial);
> +}
> +
> +static inline void swiotlb_device_deleted(void)
> +{
> + atomic_inc(&global_device_serial);
> +}
> +
> +struct page *swiotlb_alloc_pages(struct device *dev, unsigned int order);
> +bool swiotlb_free_pages(struct page *page, bool where_debug_only);
> +void swiotlb_safe_put_device(struct device *dev);
> +
> +static inline void swiotlb_set_page_dev(struct page *page, struct device *dev)
> +{
> + page->private = (unsigned long)dev;
> +}
> +
> +static inline struct device *swiotlb_page_to_dev(struct page *page)
> +{
> + return (struct device *)compound_head(page)->private;
> +}
> +
> +static inline bool is_zerocopy_swiotlb_folio(struct page *page)
> +{
> + struct folio *folio = page_folio(page);
> +
> + return folio_test_zcswiotlb(folio) && folio->private != 0;
> +}
> +
> +/* These two are in mm/page_alloc.c */
> +void swiotlb_prep_compound_page(struct page *page, unsigned int order);
> +void swiotlb_destroy_compound_page(struct page *page, unsigned int order);
> +
> +#if defined(CONFIG_NET)
> +/*
> + * Track the socket for the currently transmitted packet, so the dma mapping
> + * function can record there the leaf device if it needs bounce buffers.
> + */
> +struct sock;
> +DECLARE_PER_CPU(struct sock *, current_tx_socket);
> +void sk_set_bounce_device(struct sock *sk, struct device *dev);
> +static inline void dma_learn_bounce_device(struct device *dev)
> +{
> + struct sock *sk = this_cpu_read(current_tx_socket);
> +
> + if (sk)
> + sk_set_bounce_device(sk, dev);
> +}
> +#else
> +static inline void dma_learn_bounce_device(struct device *dev) {}
> +#endif
> +/* End helpers for zerocopy swiotlb. */
> +
> #else
> static inline void swiotlb_init(bool addressing_limited, unsigned int flags)
> {
> @@ -234,6 +302,12 @@ static inline phys_addr_t default_swiotlb_limit(void)
> {
> return 0;
> }
> +
> +/* zerocopy swiotlb stubs */
> +static inline bool swiotlb_free_pages(struct page *page, int reason) { return false; }
> +static inline u32 swiotlb_get_device_serial(void) { return 0; }
> +static inline void swiotlb_device_deleted(void) {}
> +
> #endif /* CONFIG_SWIOTLB */
>
> phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, phys_addr_t phys,
> diff --git a/include/net/sock.h b/include/net/sock.h
> index dccd3738c3687..1e6caf4bd1366 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -47,6 +47,7 @@
> #include <linux/skbuff.h> /* struct sk_buff */
> #include <linux/mm.h>
> #include <linux/security.h>
> +#include <linux/swiotlb.h>
> #include <linux/slab.h>
> #include <linux/uaccess.h>
> #include <linux/page_counter.h>
> @@ -70,6 +71,14 @@
> #include <net/l3mdev.h>
> #include <uapi/linux/socket.h>
>
> +#ifdef CONFIG_SWIOTLB
> +struct sk_swiotlb_info {
> + struct device *dev;
> + u32 serial;
> + unsigned long jiffies;
> +};
> +#endif
> +
> /*
> * This structure really needs to be cleaned up.
> * Most of it is for TCP, and not used by any of
> @@ -602,8 +611,28 @@ struct sock {
> #if IS_ENABLED(CONFIG_PROVE_LOCKING) && IS_ENABLED(CONFIG_MODULES)
> struct module *sk_owner;
> #endif
> +#ifdef CONFIG_SWIOTLB
> + struct sk_swiotlb_info sk_swiotlb;
> +#endif
> };
>
> +#ifdef CONFIG_SWIOTLB
> +static inline void sk_init_bounce_device(struct sock *sk)
> +{
> + sk->sk_swiotlb.dev = NULL;
> +}
> +static inline void sk_cleanup_bounce_device(struct sock *sk)
> +{
> + if (sk->sk_swiotlb.dev) {
> + swiotlb_safe_put_device(sk->sk_swiotlb.dev);
> + sk->sk_swiotlb.dev = NULL;
> + }
> +}
> +#else
> +static inline void sk_init_bounce_device(struct sock *sk) {}
> +static inline void sk_cleanup_bounce_device(struct sock *sk) {}
> +#endif
> +
> struct sock_bh_locked {
> struct sock *sock;
> local_lock_t bh_lock;
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index 1abd3e6146f45..e27f23d03c482 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -37,12 +37,16 @@
> #include <linux/mm.h>
> #include <linux/pfn.h>
> #include <linux/rculist.h>
> +#include <linux/refcount.h>
> #include <linux/scatterlist.h>
> #include <linux/set_memory.h>
> #include <linux/spinlock.h>
> #include <linux/string.h>
> #include <linux/swiotlb.h>
> +#include <linux/moduleparam.h>
> +#include <linux/percpu.h>
> #include <linux/types.h>
> +#include <linux/atomic.h>
> #ifdef CONFIG_DMA_RESTRICTED_POOL
> #include <linux/of.h>
> #include <linux/of_fdt.h>
> @@ -81,6 +85,17 @@ struct io_tlb_slot {
> static bool swiotlb_force_bounce;
> static bool swiotlb_force_disable;
>
> +/**
> + * global_device_serial - Global sequence number for device deletions
> + *
> + * Incremented every time a device is unregistered (in device_del()).
> + * Used by subsystems (like SWIOTLB zero-copy sockets) as a fast, lockless
> + * O(1) cache invalidation serial to detect when a cached device pointer
> + * might have been deleted and needs to be expired to prevent Use-After-Free.
> + */
> +atomic_t global_device_serial = ATOMIC_INIT(0);
> +EXPORT_SYMBOL(global_device_serial);
> +
> #ifdef CONFIG_SWIOTLB_DYNAMIC
>
> static void swiotlb_dyn_alloc(struct work_struct *work);
> @@ -1442,6 +1457,8 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
> offset &= (IO_TLB_SIZE - 1);
> index += pad_slots;
> pool->slots[index].pad_slots = pad_slots;
> + /* Fix an upstream bug with alloc_align_mask = 0xffff */
> + pool->slots[index].alloc_size = mapping_size;
> for (i = 0; i < (nr_slots(size) - pad_slots); i++)
> pool->slots[index + i].orig_addr = slot_addr(orig_addr, i);
> tlb_addr = slot_addr(pool->start, index) + offset;
> @@ -1554,6 +1571,13 @@ void __swiotlb_tbl_unmap_single(struct device *dev, phys_addr_t tlb_addr,
> size_t mapping_size, enum dma_data_direction dir,
> unsigned long attrs, struct io_tlb_pool *pool)
> {
> + /*
> + * Recognize and avoid unmapping pages allocated for Zero-Copy SWIOTLB Page Bypass.
> + * They will be eventually released when the page reference count drops to 0.
> + */
> + if (is_zerocopy_swiotlb_folio(pfn_to_page(PHYS_PFN(tlb_addr))))
> + return;
> +
> /*
> * First, sync the memory before unmapping the entry
> */
> @@ -1597,6 +1621,21 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t paddr, size_t size,
> phys_addr_t swiotlb_addr;
> dma_addr_t dma_addr;
>
> + dma_learn_bounce_device(dev);
> +
> + /*
> + * If the page was allocated via Zero-Copy SWIOTLB Page Bypass, it is likely
> + * already good for DMA so we can return its dma address.
> + */
> + if (is_zerocopy_swiotlb_folio(pfn_to_page(PHYS_PFN(paddr)))) {
> + dma_addr = phys_to_dma_unencrypted(dev, paddr);
> + if (likely(dma_capable(dev, dma_addr, size, true))) {
> + if (!dev_is_dma_coherent(dev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
> + arch_sync_dma_for_device(paddr, size, dir);
> + return dma_addr;
> + }
> + }
> +
> trace_swiotlb_bounced(dev, phys_to_dma(dev, paddr), size);
>
> swiotlb_addr = swiotlb_tbl_map_single(dev, paddr, size, 0, dir, attrs);
> @@ -1899,3 +1938,191 @@ static const struct reserved_mem_ops rmem_swiotlb_ops = {
>
> RESERVEDMEM_OF_DECLARE(dma, "restricted-dma-pool", &rmem_swiotlb_ops);
> #endif /* CONFIG_DMA_RESTRICTED_POOL */
> +
> +/*
> + * Asynchronous/Deferred Device Release.
> + * put_device() can trigger the final release path of a device which may sleep.
> + * Since SWIOTLB pages can be freed in atomic or interrupt context (e.g. TX completion),
> + * we must defer the put_device() call to task context using a workqueue.
> + */
> +struct swiotlb_deferred_put {
> + struct work_struct work;
> + struct device *dev;
> +};
> +
> +static void swiotlb_deferred_put_work(struct work_struct *work)
> +{
> + struct swiotlb_deferred_put *dp = container_of(work, struct swiotlb_deferred_put, work);
> +
> + put_device(dp->dev);
> + kfree(dp);
> +}
> +
> +/**
> + * swiotlb_safe_put_device() - Safely release device reference from atomic/interrupt context
> + * @dev: The device structure to release.
> + *
> + * Enqueues a deferred put_device() call on a workqueue using GFP_ATOMIC.
> + * If memory allocation fails, the reference is leaked to avoid an immediate crash.
> + */
> +void swiotlb_safe_put_device(struct device *dev)
> +{
> + struct swiotlb_deferred_put *dp;
> +
> + if (!dev)
> + return;
> +
> + /*
> + * FAST PATH (O(1) lockless): If this is not the last reference,
> + * we can decrement it atomically and safely in any context
> + * without allocating memory or scheduling work!
> + */
> + if (refcount_dec_not_one(&dev->kobj.kref.refcount))
> + return;
> +
> + /*
> + * SLOW PATH: It is the last reference (refcount == 1). We must
> + * defer the final put_device() to task context because it will
> + * trigger device_release() which can sleep.
> + */
> + dp = kmalloc_obj(*dp, GFP_ATOMIC);
> + if (dp) {
> + INIT_WORK(&dp->work, swiotlb_deferred_put_work);
> + dp->dev = dev;
> + schedule_work(&dp->work);
> + } else {
> + pr_warn_ratelimited("swiotlb: failed to allocate deferred put, leaking device ref\n");
> + }
> +}
> +EXPORT_SYMBOL_GPL(swiotlb_safe_put_device);
> +
> +unsigned int swiotlb_zc_tx_percent;
> +module_param_named(zerocopy_tx_percent, swiotlb_zc_tx_percent, uint, 0644);
> +
> +static unsigned long fast_mem_used(struct io_tlb_mem *mem)
> +{
> +#ifdef CONFIG_DEBUG_FS
> + return mem_used(mem);
> +#else
> + unsigned long last_j = READ_ONCE(mem->last_used_jiffies);
> + unsigned long now = jiffies;
> +
> + if (time_after(now, last_j + HZ / 100) &&
> + try_cmpxchg(&mem->last_used_jiffies, &last_j, now)) {
> + WRITE_ONCE(mem->last_used_slots, mem_used(mem));
> + }
> + return READ_ONCE(mem->last_used_slots);
> +#endif
> +}
> +
> +/**
> + * swiotlb_alloc_pages() - Allocate long-lived contiguous pages from SWIOTLB pool
> + * @dev: Device which requires the SWIOTLB bounce buffers.
> + * @order: Allocation order (log2 of number of pages).
> + */
> +struct page *swiotlb_alloc_pages(struct device *dev, unsigned int order)
> +{
> + struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
> + struct io_tlb_pool *pool;
> + int npages = 1 << order;
> + unsigned int max_pct;
> + phys_addr_t tlb_addr;
> + struct page *page;
> + int index;
> +
> + if (!mem || !mem->nslabs)
> + return NULL;
> +
> + max_pct = clamp(READ_ONCE(swiotlb_zc_tx_percent), 0u, 90u);
> + if (max_pct == 0 || max_pct * mem->nslabs <= fast_mem_used(mem) * 100)
> + return NULL;
> +
> + /*
> + * Enforce natural alignment for compound pages. The mask-based
> + * compound_head() optimization (used when HVO is enabled and struct page
> + * size is a power of 2) assumes that compound pages are naturally aligned
> + * to their size. Without this, compound_head() on tail pages can return
> + * a wrong head page pointer, leading to refcount corruption.
> + */
> + index = swiotlb_find_slots(dev, 0, PAGE_SIZE * npages, ~(PAGE_MASK << order), &pool);
> + if (index == -1)
> + return NULL;
> +
> + tlb_addr = slot_addr(pool->start, index);
> +
> + pool->slots[index].pad_slots = 0;
> + pool->slots[index].alloc_size = PAGE_SIZE * npages;
> +
> + page = pfn_to_page(PHYS_PFN(tlb_addr));
> +
> + set_page_count(page, 1);
> +
> + /* Strictly tag page[0] to prevent clobbering folio tail overlays */
> + __SetPageZCSwiotlb(page);
> +
> + swiotlb_set_page_dev(page, dev);
> + get_device(dev);
> + swiotlb_prep_compound_page(page, order);
> + return page;
> +}
> +EXPORT_SYMBOL_GPL(swiotlb_alloc_pages);
> +
> +/*
> + * Debugging to track how swiotlb_free_pages() was called.
> + * b2: 0 from __free_frozen_pages(), 1 from free_unref_folios()
> + * b1: pool found b0: dev present,
> + */
> +static unsigned long zc_debug[8];
> +static int ctrs_num = 8;
> +module_param_array(zc_debug, ulong, &ctrs_num, 0644);
> +static void __zc_debug_stats(bool where, bool has_dev, bool has_pool)
> +{
> + zc_debug[has_dev + has_pool * 2 + where * 4]++;
> +}
> +
> +/**
> + * swiotlb_free_pages() - Free pages allocated via swiotlb_alloc_pages()
> + * @page: The starting struct page to release.
> + */
> +bool swiotlb_free_pages(struct page *page, bool where_debug_only)
> +{
> + struct page *head = compound_head(page);
> + struct device *dev = swiotlb_page_to_dev(head);
> + phys_addr_t head_tlb_addr = page_to_phys(head);
> + struct io_tlb_pool *pool;
> + int index, npages, i;
> +
> + if (!folio_test_zcswiotlb(page_folio(head)))
> + return false;
> +
> + pool = dev ? swiotlb_find_pool(dev, head_tlb_addr) : NULL;
> + __zc_debug_stats(where_debug_only, !!dev, !!pool);
> +
> + /* Check for any false positives. */
> + if (!pool)
> + return false;
> +
> + /* Read alloc_size first, it is reset by swiotlb_release_slots(). */
> + index = (head_tlb_addr - pool->start) >> IO_TLB_SHIFT;
> + npages = pool->slots[index].alloc_size >> PAGE_SHIFT;
> +
> + WARN_ON_ONCE(!is_power_of_2(npages));
> +
> + /* Step 1: Sever compound links (clobbers compound_info / lru.next) */
> + swiotlb_destroy_compound_page(head, ilog2(npages));
> +
> + /* Step 2: Re-init LRU, drop refcounts, and strip flag across all constituent pages */
> + for (i = 0; i < npages; i++) {
> + INIT_LIST_HEAD(&head[i].lru);
> + set_page_count(&head[i], 0);
> + head[i].private = 0;
> + __ClearPageZCSwiotlb(&head[i]);
> + }
> +
> + /* Step 3: Safely release slots back to the pool */
> + swiotlb_release_slots(dev, head_tlb_addr, pool);
> + swiotlb_del_transient(dev, head_tlb_addr, pool);
> + swiotlb_safe_put_device(dev);
> + return true;
> +}
> +EXPORT_SYMBOL_GPL(swiotlb_free_pages);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d49c254174da7..eaba683b5b2a8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -16,6 +16,7 @@
>
> #include <linux/stddef.h>
> #include <linux/mm.h>
> +#include <linux/swiotlb.h>
> #include <linux/highmem.h>
> #include <linux/interrupt.h>
> #include <linux/jiffies.h>
> @@ -705,6 +706,31 @@ void prep_compound_page(struct page *page, unsigned int order)
> prep_compound_head(page, order);
> }
>
> +#ifdef CONFIG_SWIOTLB
> +void swiotlb_prep_compound_page(struct page *page, unsigned int order)
> +{
> + if (order > 0)
> + prep_compound_page(page, order);
> +}
Gah.
> +
> +void swiotlb_destroy_compound_page(struct page *page, unsigned int order)
> +{
> + if (order > 0) {
> + struct folio *folio = (struct folio *)page;
> +
> + __ClearPageHead(page);
> + page[1].flags.f &= ~PAGE_FLAGS_SECOND;
> +#ifdef NR_PAGES_IN_LARGE_FOLIO
> + folio->_nr_pages = 0;
> +#endif
> + for (int i = 1; i < (1 << order); i++) {
> + page[i].mapping = NULL;
> + clear_compound_head(&page[i]);
> + }
> + }
> +}
Gah.
> +#endif /* CONFIG_SWIOTLB */
> +
> static inline void set_buddy_order(struct page *page, unsigned int order)
> {
> set_page_private(page, order);
> @@ -2930,6 +2956,9 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
> unsigned long pfn = page_to_pfn(page);
> int migratetype;
>
> + if (unlikely(swiotlb_free_pages(page, false)))
> + return;
> +
Oh my.
We shouldn't be handling random swiotlb stuff in the page allocator like that.
IIUC, you are writing your own pool+allocator and roughly mimicking what hugetlb +
ZONE_DEVICE do.
The creation+destruction of compound pages should very likely be factored out
from other code in a type-unspecific fashion, if really required.
You should probably look into
https://lore.kernel.org/all/20250318161823.4005529-2-tabba@google.com/
to see how to possibly hook into the page freeing path in a cleaner way.
--
Cheers,
David
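
For readers following the quoted patch: the zerocopy_tx_percent budget from the commit message reduces, in swiotlb_alloc_pages(), to an integer comparison of overall pool usage against a clamped percentage of the pool size. Below is a minimal userspace sketch of that check. The function name zc_tx_budget_available and the macro ZC_TX_PERCENT_MAX are hypothetical, and the 64-bit arithmetic is an assumption added to keep the cross-multiplication from overflowing.

```c
#include <stdbool.h>
#include <stdint.h>

#define ZC_TX_PERCENT_MAX 90u	/* cap taken from the commit message */

/*
 * Return true if a zero-copy allocation may proceed, i.e. if
 * used_slots / total_slots < percent / 100 after clamping percent.
 * Cross-multiplied to stay in integer arithmetic.
 */
static bool zc_tx_budget_available(unsigned int percent, uint64_t total_slots,
				   uint64_t used_slots)
{
	if (percent > ZC_TX_PERCENT_MAX)
		percent = ZC_TX_PERCENT_MAX;
	if (percent == 0 || total_slots == 0)
		return false;	/* feature disabled or no pool */

	return used_slots * 100 < (uint64_t)percent * total_slots;
}
```

Note that in the quoted code the usage figure comes from fast_mem_used(), which reports usage of the whole pool rather than the zero-copy share alone.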
^ permalink raw reply
* Re: [PATCH net-next] virtio-net: support xsk wake up
From: Eugenio Perez Martin @ 2026-06-16 8:35 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Menglong Dong, xuanzhuo, mst, jasowang, andrew+netdev, davem,
edumazet, pabeni, netdev, virtualization, linux-kernel
In-Reply-To: <20260613144612.0c5b7ba4@kernel.org>
On Sat, Jun 13, 2026 at 11:46 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 10 Jun 2026 10:27:28 +0200 Eugenio Perez Martin wrote:
> > And the From and Signed-off-by emails don't match, which I'm not sure is valid.
>
> It's clearly the same person. Please focus on the code, not trivial
> process issues.
>
> Quoting documentation:
>
> Reviewer guidance
> -----------------
>
> [...]
>
> Reviewers are highly encouraged to do more in-depth review of submissions
> and not focus exclusively on process issues, trivial or subjective
> matters like code formatting, tags etc.
>
> See: https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#reviewer-guidance
>
Ack'd, it was just a nitpick since the fixes tag was already needed.
Thanks for the doc pointer, I agree with that so I'll try to avoid
these nits in the future!
^ permalink raw reply
* RE: [Intel-wired-lan] [PATCH net] ice: Fix use-after-scope in ice_sched_add_nodes_to_layer()
From: Kwapulinski, Piotr @ 2026-06-16 8:27 UTC (permalink / raw)
To: NeKon69, Nguyen, Anthony L, Kitszel, Przemyslaw
Cc: andrew+netdev@lunn.ch, davem@davemloft.net, edumazet@google.com,
kuba@kernel.org, pabeni@redhat.com, victor.raj@intel.com,
intel-wired-lan@lists.osuosl.org, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org
In-Reply-To: <20260613101440.80190-1-nobodqwe@gmail.com>
>-----Original Message-----
>From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On Behalf Of NeKon69
>Sent: Saturday, June 13, 2026 12:15 PM
>To: Nguyen, Anthony L <anthony.l.nguyen@intel.com>; Kitszel, Przemyslaw <przemyslaw.kitszel@intel.com>
>Cc: andrew+netdev@lunn.ch; davem@davemloft.net; edumazet@google.com; kuba@kernel.org; pabeni@redhat.com; victor.raj@intel.com; intel-wired-lan@lists.osuosl.org; netdev@vger.kernel.org; linux-kernel@vger.kernel.org; NeKon69 <nobodqwe@gmail.com>
>Subject: [Intel-wired-lan] [PATCH net] ice: Fix use-after-scope in ice_sched_add_nodes_to_layer()
>
>Commit 7fb09a737536 ("ice: Modify recursive way of adding nodes") changed ice_sched_add_nodes_to_layer() from recursive control flow to an iterative loop.
>
>Inside the loop, first_teid_ptr may be set to the address of a block-local variable:
>
> u32 temp;
> ...
> if (num_added)
> first_teid_ptr = &temp;
>
>On the next loop iteration, first_teid_ptr may be passed to ice_sched_add_nodes_to_hw_layer(), after temp from the previous iteration has gone out of scope.
>
>Move temp outside the loop so the pointer remains valid for the lifetime of ice_sched_add_nodes_to_layer().
>
>This was found by Clang with LifetimeSafety enabled while testing C language support on a Linux allmodconfig build.
>
>Fixes: 7fb09a737536 ("ice: Modify recursive way of adding nodes")
>Link: https://github.com/llvm/llvm-project/pull/203270
>Signed-off-by: NeKon69 <nobodqwe@gmail.com>
>---
> drivers/net/ethernet/intel/ice/ice_sched.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/drivers/net/ethernet/intel/ice/ice_sched.c b/drivers/net/ethernet/intel/ice/ice_sched.c
>index fff0c1afdb41..089ad3967be5 100644
>--- a/drivers/net/ethernet/intel/ice/ice_sched.c
>+++ b/drivers/net/ethernet/intel/ice/ice_sched.c
>@@ -1074,11 +1074,11 @@ ice_sched_add_nodes_to_layer(struct ice_port_info *pi,
> u32 *first_teid_ptr = first_node_teid;
> u16 new_num_nodes = num_nodes;
> int status = 0;
>+ u32 temp;
>
> *num_nodes_added = 0;
> while (*num_nodes_added < num_nodes) {
> u16 max_child_nodes, num_added = 0;
>- u32 temp;
>
> status = ice_sched_add_nodes_to_hw_layer(pi, tc_node, parent,
> layer, new_num_nodes,
>--
>2.54.0
>
Reviewed-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
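
The pattern being fixed can be shown in a small userspace sketch, assuming a simplified model of the loop: a pointer that initially targets the caller's output is redirected to a scratch variable after the first batch, so that variable must outlive every loop iteration. The names id_alloc, add_batch, add_all and BATCH_MAX are hypothetical and do not come from the ice driver.

```c
#include <stdint.h>

#define BATCH_MAX 2u	/* hypothetical per-call limit */

struct id_alloc {
	uint32_t next;	/* next identifier to hand out */
};

/* Add up to BATCH_MAX items; report the first new id through *first_id. */
static unsigned int add_batch(struct id_alloc *ids, unsigned int want,
			      uint32_t *first_id)
{
	unsigned int n = want < BATCH_MAX ? want : BATCH_MAX;

	*first_id = ids->next;
	ids->next += n;
	return n;
}

/* Add num items in batches; *first_id receives the id of the very first one. */
static unsigned int add_all(struct id_alloc *ids, unsigned int num,
			    uint32_t *first_id)
{
	uint32_t *first_id_ptr = first_id;
	unsigned int added = 0;
	uint32_t temp;	/* function scope: valid across all iterations */

	while (added < num) {
		added += add_batch(ids, num - added, first_id_ptr);
		/*
		 * Later batches must not overwrite the caller's value. If
		 * temp were declared inside the loop body, this pointer
		 * would dangle on the next iteration.
		 */
		first_id_ptr = &temp;
	}
	return added;
}
```

Declaring temp inside the while body would compile and usually appear to work, which is why this class of bug tends to be found by lifetime analysis rather than by testing.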
^ permalink raw reply