* Re: [PATCH net] net: airoha: Add missing PPE configurations in airoha_ppe_hw_init()
From: patchwork-bot+netdevbpf @ 2026-04-14 13:20 UTC (permalink / raw)
To: Lorenzo Bianconi
Cc: andrew+netdev, davem, edumazet, kuba, pabeni, linux-arm-kernel,
linux-mediatek, netdev
In-Reply-To: <20260412-airoha_ppe_hw_init-missing-bits-v1-1-06ac670819e3@kernel.org>
Hello:
This patch was applied to netdev/net.git (main)
by Paolo Abeni <pabeni@redhat.com>:
On Sun, 12 Apr 2026 10:43:26 +0200 you wrote:
> Add the following PPE configuration in airoha_ppe_hw_init routine:
> - 6RD hw offloading is currently not supported by Netfilter flowtable.
> Disable explicitly PPE 6RD offloading in order to prevent PPE to learn
> 6RD flows and eventually interrupt the traffic.
> - Add missing PPE bind rate configuration for L3 and L2 traffic.
> PPE bind rate configuration specifies the pps threshold to move a PPE
> entry state from UNBIND to BIND. Without this configuration this value
> is random.
> - Set ageing thresholds to the values used in the vendor SDK in order to
> improve connection stability under load and avoid packet loss caused by
> fast aging.
>
> [...]
Here is the summary with links:
- [net] net: airoha: Add missing PPE configurations in airoha_ppe_hw_init()
https://git.kernel.org/netdev/net/c/b9d8b856689d
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* Re: [PATCH iwl-next v2 5/8] ixgbe: remove ixgbe_ping_all_vfs() from link state change handlers
From: Simon Horman @ 2026-04-14 13:23 UTC (permalink / raw)
To: aleksandr.loktionov
Cc: 'Simon Horman', intel-wired-lan, anthony.l.nguyen, netdev
In-Reply-To: <20260408131216.2662245-6-aleksandr.loktionov@intel.com>
From: 'Simon Horman' <horms@kernel.org>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
ixgbe: remove ixgbe_ping_all_vfs() from link state change handlers
This commit removes ixgbe_ping_all_vfs() from the link watchdog handlers
to prevent a race condition during VF initialization where a gratuitous
ping collides with the VF mailbox state machine.
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index 8110bcf006a30..3071302901499 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -8164,9 +8164,6 @@ static void ixgbe_watchdog_link_is_up(struct ixgbe_adapter *adapter)
>
> /* update the default user priority for VFs */
> ixgbe_update_default_up(adapter);
> -
> - /* ping all the active vfs to let them know link has changed */
> - ixgbe_ping_all_vfs(adapter);
> }
How will legacy VFs (such as 82599, X540, and X550) receive link state
changes without these pings?
The commit message states that link-state changes are already communicated
to VFs through the normal mailbox protocol. However, older VFs appear to
rely on receiving an IXGBE_VT_MSGTYPE_CTS message from the PF to clear
mac->get_link_status and complete the link up sequence in
ixgbevf_check_mac_link_vf().
Without the ping, will these legacy VFs fail to read the mailbox and stay
stuck reporting the link as down?
Is the mailbox corruption race still triggerable from other paths?
If ixgbe_ping_vf() is called locklessly from process-context NDO callbacks
like ixgbe_ndo_set_vf_link_state() or ixgbe_ndo_set_vf_trust(), could it be
preempted by a hardirq triggering ixgbe_msg_task() (which correctly acquires
vfs_lock before writing to the mailbox)?
Since ixgbe_write_mbx_pf() only arbitrates between the PF and VF via the
IXGBE_PFMAILBOX_PFU hardware lock, it provides no mutual exclusion between
concurrent host threads. Could concurrent writes corrupt the IXGBE_PFMBMEM
buffer and cause the exact same VF initialization failure this patch intended
to fix?
Would wrapping the pings in spin_lock_irqsave(&adapter->vfs_lock, flags)
prevent the mailbox corruption instead of removing the pings entirely?
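
The locking alternative asked about above can be illustrated with a small userspace analogy. This is only a sketch under stated assumptions: struct demo_mbx, demo_mbx_write() and demo_ping() are hypothetical names, a C11 atomic_flag spinlock stands in for adapter->vfs_lock, and the message value is a placeholder. It does not model the PFU hardware arbitration or interrupt masking.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MBX_WORDS 16        /* illustrative size, not the hardware value */
#define DEMO_PING_MSG  0x0100u   /* placeholder, not the real ping encoding */

/* Hypothetical stand-in for the PF-side state of one VF mailbox. */
struct demo_mbx {
	atomic_flag lock;                 /* plays the role of adapter->vfs_lock */
	uint32_t mem[DEMO_MBX_WORDS];     /* plays the role of the mailbox memory */
	unsigned int writes;              /* completed writes, for the demo only */
};

static void demo_mbx_init(struct demo_mbx *m)
{
	atomic_flag_clear(&m->lock);
	memset(m->mem, 0, sizeof(m->mem));
	m->writes = 0;
}

static void demo_lock(struct demo_mbx *m)
{
	while (atomic_flag_test_and_set_explicit(&m->lock, memory_order_acquire))
		; /* spin until the current holder releases the lock */
}

static void demo_unlock(struct demo_mbx *m)
{
	atomic_flag_clear_explicit(&m->lock, memory_order_release);
}

/* The single write path: every caller copies its message under the lock. */
static int demo_mbx_write(struct demo_mbx *m, const uint32_t *msg, size_t words)
{
	if (words > DEMO_MBX_WORDS)
		return -1;

	demo_lock(m);
	memcpy(m->mem, msg, words * sizeof(*msg));
	m->writes++;
	demo_unlock(m);
	return 0;
}

/* A ping is just another message, so it goes through the same locked path. */
static int demo_ping(struct demo_mbx *m)
{
	const uint32_t ping = DEMO_PING_MSG;

	return demo_mbx_write(m, &ping, 1);
}
```

In the kernel, spin_lock_irqsave() would additionally disable local interrupts while the lock is held, which is what keeps an interrupt-context writer from spinning against a process-context holder on the same CPU. The userspace sketch cannot show that part.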
[ ... ]
> @@ -8197,9 +8194,6 @@ static void ixgbe_watchdog_link_is_down(struct ixgbe_adapter *adapter)
>
> e_info(drv, "NIC Link is Down\n");
> netif_carrier_off(netdev);
> -
> - /* ping all the active vfs to let them know link has changed */
> - ixgbe_ping_all_vfs(adapter);
> }
Without this ping, will older VFs miss the mailbox interrupt needed to set
mac->get_link_status = 1, causing them to remain stuck reporting the link as
up indefinitely and silently drop traffic when the cable is unplugged?
This isn't a bug, but removing the calls from the link handlers appears to
leave ixgbe_ping_all_vfs() as dead code without any remaining callers in
the kernel. Should the function definition in ixgbe_sriov.c and its
declaration in ixgbe_sriov.h be removed as well?
* Re: [PATCH RFC bpf-next 3/8] bpf: add BPF_JIT_KASAN for KASAN instrumentation of JITed programs
From: Alexis Lothoré @ 2026-04-14 13:24 UTC (permalink / raw)
To: Andrey Konovalov, Alexis Lothoré (eBPF Foundation)
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
Song Liu, Yonghong Song, Jiri Olsa, John Fastabend,
David S. Miller, David Ahern, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Shuah Khan,
Maxime Coquelin, Alexandre Torgue, Andrey Ryabinin,
Alexander Potapenko, Dmitry Vyukov, Vincenzo Frascino,
Andrew Morton, ebpf, Bastien Curutchet, Thomas Petazzoni,
Xu Kuohai, bpf, linux-kernel, netdev, linux-kselftest,
linux-stm32, linux-arm-kernel, kasan-dev, linux-mm
In-Reply-To: <CA+fCnZf-o8tiv_tX9YB5eBUGx17OpztKZsEB6Awjw3WAqBAiUw@mail.gmail.com>
On Tue Apr 14, 2026 at 12:20 AM CEST, Andrey Konovalov wrote:
> On Mon, Apr 13, 2026 at 8:29 PM Alexis Lothoré (eBPF Foundation)
> <alexis.lothore@bootlin.com> wrote:
>>
>> Add a new Kconfig option CONFIG_BPF_JIT_KASAN that automatically enables
>> KASAN (Kernel Address Sanitizer) memory access checks for JIT-compiled
>> BPF programs, when both KASAN and JIT compiler are enabled. When
>> enabled, the JIT compiler will emit shadow memory checks before memory
>> loads and stores to detect use-after-free, out-of-bounds, and other
>> memory safety bugs at runtime. The option is gated behind
>> HAVE_EBPF_JIT_KASAN, as it needs proper arch-specific implementation.
>>
>> Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com>
>> ---
>> kernel/bpf/Kconfig | 9 +++++++++
>> 1 file changed, 9 insertions(+)
>>
>> diff --git a/kernel/bpf/Kconfig b/kernel/bpf/Kconfig
>> index eb3de35734f0..28392adb3d7e 100644
>> --- a/kernel/bpf/Kconfig
>> +++ b/kernel/bpf/Kconfig
>> @@ -17,6 +17,10 @@ config HAVE_CBPF_JIT
>> config HAVE_EBPF_JIT
>> bool
>>
>> +# KASAN support for JIT compiler
>> +config HAVE_EBPF_JIT_KASAN
>> + bool
>> +
>> # Used by archs to tell that they want the BPF JIT compiler enabled by
>> # default for kernels that were compiled with BPF JIT support.
>> config ARCH_WANT_DEFAULT_BPF_JIT
>> @@ -101,4 +105,9 @@ config BPF_LSM
>>
>> If you are unsure how to answer this question, answer N.
>>
>> +config BPF_JIT_KASAN
>> + bool
>> + depends on HAVE_EBPF_JIT_KASAN
>> + default y if BPF_JIT && KASAN_GENERIC
>
> Should this be "depends on KASAN && KASAN_GENERIC"?
Meaning, making it an explicit user-selectable option?
If so, the current design choice is deliberate and based on the feedback
received on the original RFC, where it was suggested that I
automatically enable the KASAN instrumentation in BPF programs if KASAN
support is enabled in the kernel ([1]). But if a user-selectable toggle
eventually turns out to be a better solution, I'm fine with changing it.
[1] https://lore.kernel.org/bpf/CAADnVQLX7RSnOqQuU32Cgq-e0MVqyeNrtCQSBbk0W2xGkE-ZNw@mail.gmail.com/
>
>
>> +
>> endmenu # "BPF subsystem"
>>
>> --
>> 2.53.0
>>
--
Alexis Lothoré, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com
* Re: [PATCH v2] wireguard: device: use exit_rtnl callback instead of manual rtnl_lock in pre_exit
From: Jason A. Donenfeld @ 2026-04-14 13:28 UTC (permalink / raw)
To: Shardul Bankar
Cc: kuniyu, andrew+netdev, davem, edumazet, kuba, pabeni, wireguard,
netdev, linux-kernel, janak, kalpan.jani, shardulsb08,
syzbot+f2fbf7478a35a94c8b7c
In-Reply-To: <20260413151232.1004611-1-shardul.b@mpiricsoftware.com>
Hi Shardul,
On Mon, Apr 13, 2026 at 5:13 PM Shardul Bankar
<shardul.b@mpiricsoftware.com> wrote:
>
> wg_netns_pre_exit() manually acquires rtnl_lock() inside the
> pernet .pre_exit callback. This causes a hung task when another
> thread holds rtnl_mutex - the cleanup_net workqueue (or the
> setup_net failure rollback path) blocks indefinitely in
> wg_netns_pre_exit() waiting to acquire the lock.
>
> Convert to .exit_rtnl, introduced in commit 7a60d91c690b ("net:
> Add ->exit_rtnl() hook to struct pernet_operations."), where the
> framework already holds RTNL and batches all callbacks under a
> single rtnl_lock()/rtnl_unlock() pair, eliminating the contention
> window.
>
> The rcu_assign_pointer(wg->creating_net, NULL) is safe to move
> from .pre_exit to .exit_rtnl (which runs after synchronize_rcu())
> because all RCU readers of creating_net either use maybe_get_net()
> - which returns NULL for a dying namespace with zero refcount - or
> access net->user_ns which remains valid throughout the entire
> ops_undo_list sequence.
>
> Reported-by: syzbot+f2fbf7478a35a94c8b7c@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?id=cb64c22a492202ca929e18262fdb8cb89e635c70
> Signed-off-by: Shardul Bankar <shardul.b@mpiricsoftware.com>
Thanks. Applied to the wireguard tree, and also added the missing
__net_exit and __read_mostly annotations in the process.
Jason
* Re: [PATCH iwl-next v2 7/8] ixgbe: use GFP_KERNEL in ixgbe_fcoe_ddp_setup()
From: Simon Horman @ 2026-04-14 13:29 UTC (permalink / raw)
To: aleksandr.loktionov
Cc: Simon Horman, anthony.l.nguyen, marcin.szycik, intel-wired-lan,
jacob.e.keller, netdev
In-Reply-To: <20260408131216.2662245-8-aleksandr.loktionov@intel.com>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
For local reproduction steps see https://netdev-ai.bots.linux.dev/ai-local.html
---
ixgbe: use GFP_KERNEL in ixgbe_fcoe_ddp_setup()
This commit restructures ixgbe_fcoe_ddp_setup() to use GFP_KERNEL
instead of GFP_ATOMIC by splitting get_cpu()/put_cpu() usage into two
narrow critical sections, moving DMA mapping and pool allocation between
them where preemption is enabled.
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c
> index 011fda9c6193..e338ff0e6522 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c
[ ... ]
> @@ -179,29 +180,43 @@ static int ixgbe_fcoe_ddp_setup(struct net_device *netdev, u16 xid,
> return 0;
> }
>
> + /* Pin to current CPU only to read the per-CPU pool pointer; drop
> + * the pin before any allocations that may sleep under direct reclaim.
> + */
> ddp_pool = per_cpu_ptr(fcoe->ddp_pool, get_cpu());
> if (!ddp_pool->pool) {
> e_warn(drv, "xid=0x%x no ddp pool for fcoe\n", xid);
> - goto out_noddp;
> + put_cpu();
> + return 0;
> }
> + pool = ddp_pool->pool;
> + put_cpu();
>
> /* setup dma from scsi command sgl */
> dmacount = dma_map_sg(&adapter->pdev->dev, sgl, sgc, DMA_FROM_DEVICE);
> if (dmacount == 0) {
> e_err(drv, "xid 0x%x DMA map error\n", xid);
> - goto out_noddp;
> + return 0;
> }
>
> - /* alloc the udl from per cpu ddp pool */
> - ddp->udl = dma_pool_alloc(ddp_pool->pool, GFP_ATOMIC, &ddp->udp);
> + /* Allocate from per-CPU pool; GFP_KERNEL is safe: preemption is
> + * re-enabled after the put_cpu() above. Per-CPU DMA pools are only
> + * destroyed under RTNL during interface teardown, so the saved pool
> + * pointer remains valid.
> + */
> + ddp->udl = dma_pool_alloc(pool, GFP_KERNEL, &ddp->udp);
Is this safe given the calling context?
Looking at the call path from the FCoE stack:
fc_exch_seq_send()
fc_exch_alloc()
fc_exch_em_alloc()
spin_lock_bh(&pool->lock)
# allocate exchange
# set ep->ex_lock = held
return with ep->ex_lock still held
fc_fcp_ddp_setup()
lport->tt.ddp_setup()
.ndo_fcoe_ddp_setup() == ixgbe_fcoe_ddp_get()
ixgbe_fcoe_ddp_setup() # <-- we are here
# ... continues with ep->ex_lock held
spin_unlock_bh(&ep->ex_lock)
fc_exch_em_alloc() acquires ep->ex_lock (a spinlock) and returns with
it held. The lock remains held through fc_fcp_ddp_setup() and into
ixgbe_fcoe_ddp_setup(). Additionally, spin_lock_bh() disables bottom
halves.
Can GFP_KERNEL be used while holding ep->ex_lock with BH disabled, or
does this require GFP_ATOMIC?
* Re: [PATCH RFC bpf-next 8/8] selftests/bpf: add tests to validate KASAN on JIT programs
From: Alexis Lothoré @ 2026-04-14 13:43 UTC (permalink / raw)
To: Andrey Konovalov, Alexis Lothoré (eBPF Foundation)
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
Song Liu, Yonghong Song, Jiri Olsa, John Fastabend,
David S. Miller, David Ahern, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Shuah Khan,
Maxime Coquelin, Alexandre Torgue, Andrey Ryabinin,
Alexander Potapenko, Dmitry Vyukov, Vincenzo Frascino,
Andrew Morton, ebpf, Bastien Curutchet, Thomas Petazzoni,
Xu Kuohai, bpf, linux-kernel, netdev, linux-kselftest,
linux-stm32, linux-arm-kernel, kasan-dev, linux-mm
In-Reply-To: <CA+fCnZekgcEgsZnRrOB=+HoG=neRg=oLTt2jStyrPJ6mYf2ctQ@mail.gmail.com>
On Tue Apr 14, 2026 at 12:20 AM CEST, Andrey Konovalov wrote:
> On Mon, Apr 13, 2026 at 8:29 PM Alexis Lothoré (eBPF Foundation)
> <alexis.lothore@bootlin.com> wrote:
>>
>> Add a basic KASAN test runner that loads and test-run programs that can
>> trigger memory management bugs. The test captures kernel logs and ensure
>> that the expected KASAN splat is emitted by searching for the
>> corresponding first lines in the report.
>>
>> This version implements two faulty programs triggering either a
>> user-after-free, or an out-of-bounds memory usage. The bugs are
>> triggered thanks to some dedicated kfuncs in bpf_testmod.c, but two
>> different techniques are used, as some cases can be quite hard to
>> trigger in a pure "black box" approach:
>> - for reads, we can make the used kfuncs return some faulty pointers
>> that ebpf programs will manipulate, they will generate legitimate
>> kasan reports as a consequence
>> - applying the same trick for faulty writes is harder, as ebpf programs
>> can't write kernel data freely. So ebpf programs can call another
>> specific testing kfunc that will alter the shadow memory matching the
>> passed memory (eg: a map). When the program will try to write to the
>> corresponding memory, it will trigger a report as well.
>>
>> Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com>
>> ---
>> The way of bringing kasan_poison into bpf_testmod is definitely not
>> ideal. But I would like to validate the testing approach (triggering
>> real faulty accesses, which is hard on some cases, VS manually poisoning
>> BPF-manipulated memory) before eventually making clean bridges between
>> KASAN APIs and bpf_testmod.c, if the latter approach is the valid one.
>
> Would it make sense to put these tests into KASAN KUnit tests in
> mm/kasan/kasan_test_c.c? I assume there is a kernel API to JIT BPF
> programs from the kernel itself?
Possibly, but I think one important use case is being able to run those
tests in BPF CI ([1]), which depends on the tests being integrated into
the test_progs framework (tools/testing/selftests/bpf).
[1] https://github.com/kernel-patches/bpf/pulls
> There, you can just call kasan_poison(), some tests already do this.
> And you can also extend the KASAN KUnit test framework to find out
> whether the bad access is a read or write, if you want to check this.
Alexis
--
Alexis Lothoré, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com
* Re: [PATCH net] slip: reject VJ frames when no receive slots are allocated
From: Simon Horman @ 2026-04-14 13:41 UTC (permalink / raw)
To: bestswngs
Cc: 'Simon Horman', andrew+netdev, davem, edumazet, kuba,
pabeni, kees, netdev, xmei5
In-Reply-To: <20260412154252.2060940-2-bestswngs@gmail.com>
From: 'Simon Horman' <horms@kernel.org>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
slip: reject VJ frames when no receive slots are allocated
This commit prevents a null pointer dereference by adding checks for
a NULL comp->rstate in slhc_uncompress() and slhc_remember(). It ensures
frames are rejected when no receive slots are allocated.
> diff --git a/drivers/net/slip/slhc.c b/drivers/net/slip/slhc.c
> index e3c785da3eef3..e67052bcab57b 100644
> --- a/drivers/net/slip/slhc.c
> +++ b/drivers/net/slip/slhc.c
> @@ -502,6 +502,10 @@ slhc_uncompress(struct slcompress *comp, unsigned char *icp, int isize)
>
> /* We've got a compressed packet; read the change byte */
> comp->sls_i_compressed++;
> + if (!comp->rstate) {
> + comp->sls_i_error++;
> + return 0;
> + }
While reviewing this code, a few pre-existing issues were noticed in and
around slhc_uncompress().
Can slhc_uncompress() cause a heap buffer overflow?
It expands compressed packets by prepending reconstructed IP and TCP
headers, which can total up to 120 bytes if options are maximized. The
caller in slip.c assumes a maximum expansion of 80 bytes:
if (count + 80 > sl->buffsize)
If a compressed packet is sized at sl->buffsize - 80, slhc_uncompress()
shifts the payload and copies the 120-byte headers, writing up to 37 bytes
past the dynamically allocated sl->rbuff chunk. Does this corrupt memory?
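
The 37-byte figure above can be reproduced with a short worked calculation. This is a sketch, not driver code: vj_worst_case_overrun() is a hypothetical helper, and the 3-byte minimum compressed header (change byte plus 2-byte checksum, per RFC 1144) is an assumption the review does not state explicitly.

```c
#include <stddef.h>

#define MAX_REBUILT_HDR    120  /* 60-byte IP header + 60-byte TCP header, full options */
#define MIN_COMPRESSED_HDR 3    /* assumption: change byte + 2-byte TCP checksum */

/*
 * Bytes written past a buffer of buffsize bytes when a compressed frame of
 * count bytes expands in the worst case; 0 if the expanded frame fits.
 */
static size_t vj_worst_case_overrun(size_t count, size_t buffsize)
{
	size_t payload, expanded;

	if (count < MIN_COMPRESSED_HDR)
		return 0;

	payload = count - MIN_COMPRESSED_HDR;     /* data left after the compressed header */
	expanded = payload + MAX_REBUILT_HDR;     /* data plus the rebuilt headers */
	return expanded > buffsize ? expanded - buffsize : 0;
}
```

With a buffer of 1000 bytes (an arbitrary example size) and a frame of 1000 - 80 = 920 bytes, which still passes the caller's check, the expanded frame is 917 + 120 = 1037 bytes, so the overrun is 37 bytes.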
Also, does the decode() function read variable-length deltas without
bounds checking against the packet length isize?
A malformed packet with multiple change flags set could cause decode()
to read up to 15 bytes past the end of the packet data. Because the packet
buffer sl->rbuff is allocated via kmalloc without zero-initialization,
could this read uninitialized kernel heap memory and inject it into the
reconstructed TCP/IP headers, leaking it back to the network?
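
One way to picture the missing check is a delta decoder that is told where the packet ends. The sketch below is an illustration only: decode_bounded() is a hypothetical name and signature, not the existing decode() in slhc.c. It assumes the usual VJ encoding, where a non-zero byte is the delta itself and a zero byte is followed by a 16-bit big-endian value.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical bounded variant of a VJ delta decoder.
 * Reads one delta from [*cpp, end). On success stores the value in *out,
 * advances *cpp and returns 0. Returns -1 without advancing if the packet
 * is too short to hold the encoded value.
 */
static int decode_bounded(const unsigned char **cpp, const unsigned char *end,
			  uint16_t *out)
{
	const unsigned char *cp = *cpp;

	if (cp >= end)
		return -1;              /* no byte left to read */

	if (*cp != 0) {
		*out = *cp;             /* one-byte delta */
		*cpp = cp + 1;
		return 0;
	}

	if (end - cp < 3)
		return -1;              /* zero marker without its two value bytes */

	*out = (uint16_t)((cp[1] << 8) | cp[2]);
	*cpp = cp + 3;
	return 0;
}
```

A caller would pass icp + isize as end and treat -1 as a bad packet, so a frame with several change flags set but too few trailing bytes is rejected instead of being read past its end.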
Finally, do slhc_compress() and slhc_uncompress() read and write the
16-bit TCP checksum using direct pointer casts?
drivers/net/slip/slhc.c:slhc_uncompress() {
...
thp->check = *(__sum16 *)cp;
...
}
Since the compressed header varies dynamically in size, the cp pointer
frequently lands on odd byte boundaries. On architectures with strict
alignment requirements, could this direct dereference trigger an alignment
fault and kernel panic? Should the code use get_unaligned() and
put_unaligned() instead?
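
For readers less familiar with the alignment concern, here is a userspace sketch of what an unaligned-safe 16-bit access looks like. The helper names are hypothetical; inside the kernel the equivalent would be get_unaligned() and put_unaligned(), as suggested above.

```c
#include <stdint.h>
#include <string.h>

/* Load a 16-bit value from an address that may be odd; no pointer cast needed. */
static uint16_t load_u16_unaligned(const void *p)
{
	uint16_t v;

	memcpy(&v, p, sizeof(v));   /* byte-wise copy, safe at any alignment */
	return v;
}

/* Store a 16-bit value to an address that may be odd. */
static void store_u16_unaligned(void *p, uint16_t v)
{
	memcpy(p, &v, sizeof(v));
}
```

The value keeps whatever byte order it had in memory, which matches the direct cast it replaces. Only the alignment assumption is removed.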
[ ... ]
> @@ -651,8 +655,9 @@ slhc_remember(struct slcompress *comp, unsigned char *icp, int isize)
>
> /* The packet is shorter than a legal IP header.
> * Also make sure isize is positive.
> + * Reject if no receive slots are configured (rstate is NULL).
> */
> - if (isize < (int)sizeof(struct iphdr)) {
> + if (!comp->rstate || isize < (int)sizeof(struct iphdr)) {
> runt:
> comp->sls_i_runt++;
> return slhc_toss(comp);
When dropping a packet due to a lack of configured receive slots
(rslots == 0), should this increment comp->sls_i_error instead of
comp->sls_i_runt?
Dropping a packet because rstate is missing reflects an internal
configuration state, not a runt packet.
* Re: [PATCH v3 1/3] net: dsa: microchip: implement KSZ87xx Module 3 low-loss cable errata
From: Fidelio LAWSON @ 2026-04-14 13:48 UTC (permalink / raw)
To: Andrew Lunn, Marek Vasut
Cc: Woojung Huh, UNGLinuxDriver, Vladimir Oltean, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Marek Vasut,
Maxime Chevallier, Simon Horman, Heiner Kallweit, Russell King,
netdev, linux-kernel, Fidelio Lawson
In-Reply-To: <d9b161dd-f698-4d7e-8ccb-9ec12411bf87@lunn.ch>
On 4/14/26 14:40, Andrew Lunn wrote:
> On Tue, Apr 14, 2026 at 01:05:49PM +0200, Marek Vasut wrote:
>> On 4/14/26 11:12 AM, Fidelio Lawson wrote:
>>> Implement the "Module 3: Equalizer fix for short cables" erratum from
>>> Microchip document DS80000687C for KSZ87xx switches.
>>>
>>> The issue affects short or low-loss cable links (e.g. CAT5e/CAT6),
>>> where the PHY receiver equalizer may amplify high-amplitude signals
>>> excessively, resulting in internal distortion and link establishment
>>> failures.
>>>
>>> KSZ87xx devices require a workaround for the Module 3 low-loss cable
>>> condition, controlled through the switch TABLE_LINK_MD_V indirect
>>> registers.
>>>
>>> The affected registers are part of the switch address space and are not
>>> directly accessible from the PHY driver. To keep the PHY-facing API
>>> clean and avoid leaking switch-specific details, model this errata
>>> control as vendor-specific Clause 22 PHY registers.
>>>
>>> A vendor-specific Clause 22 PHY register is introduced as a mode
>>> selector in PHY_REG_LOW_LOSS_CTRL, and ksz8_r_phy() / ksz8_w_phy()
>>> translate accesses to these bits into the appropriate indirect
>>> TABLE_LINK_MD_V accesses.
>>>
>>> The control register defines the following modes:
>>> 0: disabled (default behavior)
>>> 1: EQ training workaround
>>> 2: LPF 90 MHz
>>> 3: LPF 62 MHz
>>> 4: LPF 55 MHz
>>> 5: LPF 44 MHz
>> I may not fully understand this, but aren't the EQ and LPF settings
>> orthogonal ?
>
> What is the real life experience using this feature? Is it needed for
> 1cm cables, but most > 1m cables are O.K with the defaults? Do we need
> all these configuration options? How is a user supposed to discover
> the different options? Can we simplify it down to a Boolean?
>
> Ethernet is just supposed to work with any valid length of cable,
> KISS. So maybe we should try to keep this feature KISS. Just tell the
> driver it is a short cable, pick different defaults which should work
> with any short cable?
>
> A boolean should also help with making this tunable reusable with
> other devices. It is unlikely any other devices have these same
> configuration options, unless it is from the same vendor.
>
> Andrew
The issue has been observed with very short or low-loss cables,
typically in industrial or embedded setups where the cable is shorter
than 3 m or the link is board-to-board.
In our own setup, the issue occurs with a very short CAT-6e cable
(~20 cm). We were seeing random link dropouts with the default settings.
Since enabling workaround 2, the link has remained stable and we have
not observed any further issues.
We don't need all these configuration options.
According to the Microchip erratum, the user should try workaround 1 (EQ
training), and if that does not resolve the random link dropouts,
fall back to workaround 2 by reducing the LPF bandwidth to 62 MHz.
This procedure for determining which workaround is effective is
inherently experimental and requires observation in real deployments.
That is why I originally chose to expose the selection of the workaround
to the user, at least allowing them to choose between workaround 1 and
workaround 2.
regards
Fidelio
* Re: [PATCH net] net: ethernet: ravb: Do not check URAM suspension when WoL is active
From: Simon Horman @ 2026-04-14 13:56 UTC (permalink / raw)
To: Niklas Söderlund
Cc: Paul Barker, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Yoshihiro Shimoda,
Geert Uytterhoeven, netdev, linux-renesas-soc
In-Reply-To: <20260412173213.3179426-1-niklas.soderlund+renesas@ragnatech.se>
On Sun, Apr 12, 2026 at 07:32:13PM +0200, Niklas Söderlund wrote:
> When updating the driver to match latest datasheet to suspend access to
> URAM when suspending DMA transfers a corner-case was missed, URAM access
> will not be suspended if WoL is enabled. This lead to the error message
> (correctly) being triggered as URAM access is not suspended even tho
> it's requested as part of stopping DMA.
>
> Avoid checking if URAM access is suspended and printing the error
> message if WoL is enabled when we suspend the system, as we know it will
> not be.
>
> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
> Closes: https://lore.kernel.org/all/CAMuHMdWnjV%3DHGE1o08zLhUfTgOSene5fYx1J5GG10mB%2BToq8qg@mail.gmail.com/
> Fixes: 353d8e7989b6 ("net: ethernet: ravb: Suspend and resume the transmission flow")
> Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Hi Niklas,
This is a bit awkward.
1. This patch doesn't apply cleanly to net (yet), because the cited
commit, which is a dependency, has not propagated there.
2. OTOH, net-next is closed for the merge window.
Regardless of the 2nd point, I suspect that the best option is to
repost this targeting net-next.
...
--
pw-bot: changes-requested
* Re: [PATCH bpf-next 1/2] bpf: tcp: Reject TCP_NODELAY from BPF hdr opt callbacks
From: KaFai Wan @ 2026-04-14 13:56 UTC (permalink / raw)
To: edumazet, ncardwell, kuniyu, davem, dsahern, kuba, pabeni, horms,
ast, daniel, andrii, martin.lau, eddyz87, memxor, song,
yonghong.song, jolsa, shuah, sdf, netdev, linux-kernel, bpf,
linux-kselftest
Cc: Quan Sun, Yinhao Hu, Kaiyan Mei
In-Reply-To: <20260414112310.1285783-2-kafai.wan@linux.dev>
On Tue, 2026-04-14 at 19:23 +0800, KaFai Wan wrote:
The AI is right and I'm late to the issue. Please ignore this. Sorry for the noise.
> A BPF_SOCK_OPS program can enable
> BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG and then call
> bpf_setsockopt(TCP_NODELAY) from BPF_SOCK_OPS_HDR_OPT_LEN_CB.
>
> That reaches __tcp_sock_set_nodelay(), which may call
> tcp_push_pending_frames(). The transmit path then computes TCP
> options again, re-enters bpf_skops_hdr_opt_len(), and invokes the
> same BPF callback recursively. This can loop until the kernel
> stack overflows.
>
> TCP_NODELAY is not safe from the header option callback context.
> Reject it with -EOPNOTSUPP when TCP header option callbacks are
> enabled on the socket, so the callback cannot recurse back into
> tcp_push_pending_frames() through do_tcp_setsockopt().
>
> Reported-by: Quan Sun <2022090917019@std.uestc.edu.cn>
> Reported-by: Yinhao Hu <dddddd@hust.edu.cn>
> Reported-by: Kaiyan Mei <M202472210@hust.edu.cn>
> Closes: https://lore.kernel.org/bpf/d1d523c9-6901-4454-a183-94462b8f3e4e@std.uestc.edu.cn/
> Fixes: 7e41df5dbba2 ("bpf: Add a few optnames to bpf_setsockopt")
> Signed-off-by: KaFai Wan <kafai.wan@linux.dev>
> ---
> net/ipv4/tcp.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 202a4e57a218..7ac4c98be19d 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -4004,7 +4004,10 @@ int do_tcp_setsockopt(struct sock *sk, int level, int optname,
>
> switch (optname) {
> case TCP_NODELAY:
> - __tcp_sock_set_nodelay(sk, val);
> + if (val && BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG))
> + err = -EOPNOTSUPP;
> + else
> + __tcp_sock_set_nodelay(sk, val);
> break;
>
> case TCP_THIN_LINEAR_TIMEOUTS:
--
Thanks,
KaFai
* Re: [PATCH v2] Bluetooth: Add Broadcom channel priority commands
From: Luiz Augusto von Dentz @ 2026-04-14 14:00 UTC (permalink / raw)
To: fnkl.kernel
Cc: Sven Peter, Janne Grunau, Neal Gompa, Marcel Holtmann,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, linux-kernel, asahi, linux-arm-kernel,
linux-bluetooth, netdev
In-Reply-To: <20260407-brcm-prio-v2-1-3f745edf49af@gmail.com>
Hi Sasha,
On Tue, Apr 7, 2026 at 1:46 PM Sasha Finkelstein via B4 Relay
<devnull+fnkl.kernel.gmail.com@kernel.org> wrote:
>
> From: Sasha Finkelstein <fnkl.kernel@gmail.com>
>
> Certain Broadcom bluetooth chips (bcm4377/bcm4378/bcm438) need ACL
> streams carrying audio to be set as "high priority" using a vendor
> specific command to prevent 10-ish second-long dropouts whenever
> something does a device scan. This patch sends the command when the
> socket priority is set to TC_PRIO_INTERACTIVE, as BlueZ does for audio.
>
> Signed-off-by: Sasha Finkelstein <fnkl.kernel@gmail.com>
> ---
> Changes in v2:
> - new ioctl got nack-ed, so let's use sk_priority as the trigger
> - Link to v1: https://lore.kernel.org/r/20260407-brcm-prio-v1-1-f38b17376640@gmail.com
> ---
> MAINTAINERS | 2 ++
> drivers/bluetooth/hci_bcm4377.c | 2 ++
> include/net/bluetooth/bluetooth.h | 4 ++++
> include/net/bluetooth/hci_core.h | 11 +++++++++++
> net/bluetooth/Kconfig | 7 +++++++
> net/bluetooth/Makefile | 1 +
> net/bluetooth/brcm.c | 29 +++++++++++++++++++++++++++++
> net/bluetooth/brcm.h | 17 +++++++++++++++++
> net/bluetooth/hci_conn.c | 28 ++++++++++++++++++++++++++++
> net/bluetooth/l2cap_sock.c | 13 +++++++++++++
> 10 files changed, 114 insertions(+)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index c3fe46d7c4bc..81be021367ec 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2562,6 +2562,8 @@ F: include/dt-bindings/pinctrl/apple.h
> F: include/linux/mfd/macsmc.h
> F: include/linux/soc/apple/*
> F: include/uapi/drm/asahi_drm.h
> +F: net/bluetooth/brcm.c
> +F: net/bluetooth/brcm.h
>
> ARM/ARTPEC MACHINE SUPPORT
> M: Jesper Nilsson <jesper.nilsson@axis.com>
> diff --git a/drivers/bluetooth/hci_bcm4377.c b/drivers/bluetooth/hci_bcm4377.c
> index 925d0a635945..5f79920c0306 100644
> --- a/drivers/bluetooth/hci_bcm4377.c
> +++ b/drivers/bluetooth/hci_bcm4377.c
> @@ -2397,6 +2397,8 @@ static int bcm4377_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> if (bcm4377->hw->broken_le_ext_adv_report_phy)
> hci_set_quirk(hdev, HCI_QUIRK_FIXUP_LE_EXT_ADV_REPORT_PHY);
>
> + hci_set_brcm_capable(hdev);
> +
> pci_set_drvdata(pdev, bcm4377);
> hci_set_drvdata(hdev, bcm4377);
> SET_HCIDEV_DEV(hdev, &pdev->dev);
> diff --git a/include/net/bluetooth/bluetooth.h b/include/net/bluetooth/bluetooth.h
> index 69eed69f7f26..07a250673950 100644
> --- a/include/net/bluetooth/bluetooth.h
> +++ b/include/net/bluetooth/bluetooth.h
> @@ -457,6 +457,7 @@ struct l2cap_ctrl {
> };
>
> struct hci_dev;
> +struct hci_conn;
>
> typedef void (*hci_req_complete_t)(struct hci_dev *hdev, u8 status, u16 opcode);
> typedef void (*hci_req_complete_skb_t)(struct hci_dev *hdev, u8 status,
> @@ -469,6 +470,9 @@ void hci_req_cmd_complete(struct hci_dev *hdev, u16 opcode, u8 status,
> int hci_ethtool_ts_info(unsigned int index, int sk_proto,
> struct kernel_ethtool_ts_info *ts_info);
>
> +int hci_conn_setsockopt(struct hci_conn *conn, struct sock *sk, int level,
> + int optname, sockptr_t optval, unsigned int optlen);
> +
> #define HCI_REQ_START BIT(0)
> #define HCI_REQ_SKB BIT(1)
>
> diff --git a/include/net/bluetooth/hci_core.h b/include/net/bluetooth/hci_core.h
> index a7bffb908c1e..947e7c2b08dd 100644
> --- a/include/net/bluetooth/hci_core.h
> +++ b/include/net/bluetooth/hci_core.h
> @@ -642,6 +642,10 @@ struct hci_dev {
> bool aosp_quality_report;
> #endif
>
> +#if IS_ENABLED(CONFIG_BT_BRCMEXT)
> + bool brcm_capable;
> +#endif
> +
> int (*open)(struct hci_dev *hdev);
> int (*close)(struct hci_dev *hdev);
> int (*flush)(struct hci_dev *hdev);
> @@ -1791,6 +1795,13 @@ static inline void hci_set_aosp_capable(struct hci_dev *hdev)
> #endif
> }
>
> +static inline void hci_set_brcm_capable(struct hci_dev *hdev)
> +{
> +#if IS_ENABLED(CONFIG_BT_BRCMEXT)
> + hdev->brcm_capable = true;
> +#endif
> +}
> +
> static inline void hci_devcd_setup(struct hci_dev *hdev)
> {
> #ifdef CONFIG_DEV_COREDUMP
> diff --git a/net/bluetooth/Kconfig b/net/bluetooth/Kconfig
> index 6b2b65a66700..0f2a5fbcafc5 100644
> --- a/net/bluetooth/Kconfig
> +++ b/net/bluetooth/Kconfig
> @@ -110,6 +110,13 @@ config BT_AOSPEXT
> This options enables support for the Android Open Source
> Project defined HCI vendor extensions.
>
> +config BT_BRCMEXT
> + bool "Enable Broadcom extensions"
> + depends on BT
> + help
> + This option enables support for the Broadcom defined HCI
> + vendor extensions.
> +
> config BT_DEBUGFS
> bool "Export Bluetooth internals in debugfs"
> depends on BT && DEBUG_FS
> diff --git a/net/bluetooth/Makefile b/net/bluetooth/Makefile
> index a7eede7616d8..b4c9013a46ce 100644
> --- a/net/bluetooth/Makefile
> +++ b/net/bluetooth/Makefile
> @@ -24,5 +24,6 @@ bluetooth-$(CONFIG_BT_LE) += iso.o
> bluetooth-$(CONFIG_BT_LEDS) += leds.o
> bluetooth-$(CONFIG_BT_MSFTEXT) += msft.o
> bluetooth-$(CONFIG_BT_AOSPEXT) += aosp.o
> +bluetooth-$(CONFIG_BT_BRCMEXT) += brcm.o
> bluetooth-$(CONFIG_BT_DEBUGFS) += hci_debugfs.o
> bluetooth-$(CONFIG_BT_SELFTEST) += selftest.o
> diff --git a/net/bluetooth/brcm.c b/net/bluetooth/brcm.c
> new file mode 100644
> index 000000000000..9aa0a265ab3d
> --- /dev/null
> +++ b/net/bluetooth/brcm.c
> @@ -0,0 +1,29 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2026 The Asahi Linux Contributors
> + */
> +
> +#include <net/bluetooth/bluetooth.h>
> +#include <net/bluetooth/hci_core.h>
> +
> +#include "brcm.h"
> +
> +int brcm_set_high_priority(struct hci_dev *hdev, u16 handle, bool enable)
> +{
> + struct sk_buff *skb;
> + u8 cmd[3];
> +
> + if (!hdev->brcm_capable)
> + return 0;
> +
> + cmd[0] = handle;
> + cmd[1] = handle >> 8;
Adding a packed struct and then using something like cpu_to_le16 is
probably preferable to the above.
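A minimal userspace sketch of that suggestion follows. The struct name, field names and the `sketch_cpu_to_le16()` stand-in are hypothetical and only illustrate the idea; in kernel code the handle would be a `__le16` filled with `cpu_to_le16()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical payload layout: 16-bit handle followed by an enable byte. */
struct brcm_cp_set_priority {
	uint16_t handle;	/* stored little-endian; __le16 in kernel code */
	uint8_t  enable;
} __attribute__((packed));

static_assert(sizeof(struct brcm_cp_set_priority) == 3,
	      "command payload must be 3 bytes");

/* Userspace stand-in for cpu_to_le16(): result is little-endian in memory. */
static uint16_t sketch_cpu_to_le16(uint16_t v)
{
	const uint8_t bytes[2] = { (uint8_t)(v & 0xff), (uint8_t)(v >> 8) };
	uint16_t out;

	memcpy(&out, bytes, sizeof(out));
	return out;
}

static void brcm_fill_set_priority(struct brcm_cp_set_priority *cp,
				   uint16_t handle, bool enable)
{
	cp->handle = sketch_cpu_to_le16(handle);
	cp->enable = enable;
}
```

The struct documents the wire format in one place, and the byte order is handled by a single conversion rather than by open-coded shifts.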
> + cmd[2] = !!enable;
> +
> + skb = hci_cmd_sync(hdev, 0xfc57, sizeof(cmd), cmd, HCI_CMD_TIMEOUT);
> + if (IS_ERR(skb))
> + return PTR_ERR(skb);
> +
> + kfree_skb(skb);
> + return 0;
> +}
> diff --git a/net/bluetooth/brcm.h b/net/bluetooth/brcm.h
> new file mode 100644
> index 000000000000..fdaee63bd1d2
> --- /dev/null
> +++ b/net/bluetooth/brcm.h
> @@ -0,0 +1,17 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (C) 2026 The Asahi Linux Contributors
> + */
> +
> +#if IS_ENABLED(CONFIG_BT_BRCMEXT)
> +
> +int brcm_set_high_priority(struct hci_dev *hdev, u16 handle, bool enable);
> +
> +#else
> +
> +static inline int brcm_set_high_priority(struct hci_dev *hdev, u16 handle, bool enable)
> +{
> + return 0;
> +}
> +
> +#endif
> diff --git a/net/bluetooth/hci_conn.c b/net/bluetooth/hci_conn.c
> index 11d3ad8d2551..096163840f62 100644
> --- a/net/bluetooth/hci_conn.c
> +++ b/net/bluetooth/hci_conn.c
> @@ -35,6 +35,7 @@
> #include <net/bluetooth/iso.h>
> #include <net/bluetooth/mgmt.h>
>
> +#include "brcm.h"
> #include "smp.h"
> #include "eir.h"
>
> @@ -3070,6 +3071,33 @@ int hci_conn_set_phy(struct hci_conn *conn, u32 phys)
> }
> }
>
> +int hci_conn_setsockopt(struct hci_conn *conn, struct sock *sk, int level,
> + int optname, sockptr_t optval, unsigned int optlen)
> +{
> + int val;
> + bool old_high, new_high, changed;
> +
> + if (level != SOL_SOCKET)
> + return 0;
> +
> + if (optname != SO_PRIORITY)
> + return 0;
> +
> + if (optlen < sizeof(int))
> + return -EINVAL;
> +
> + if (copy_from_sockptr(&val, optval, sizeof(val)))
> + return -EFAULT;
> +
> + old_high = sk->sk_priority >= TC_PRIO_INTERACTIVE;
> + new_high = val >= TC_PRIO_INTERACTIVE;
> + changed = old_high != new_high;
> + if (!changed)
> + return 0;
> +
> + return brcm_set_high_priority(conn->hdev, conn->handle, new_high);
The skb carries the priority (skb->priority), so I'm not sure why you
need to capture sk_priority instead. Doing so ignores the load
balancing that hci_core performs to avoid starving connections.
> +}
> +
> static int abort_conn_sync(struct hci_dev *hdev, void *data)
> {
> struct hci_conn *conn = data;
> diff --git a/net/bluetooth/l2cap_sock.c b/net/bluetooth/l2cap_sock.c
> index 71e8c1b45bce..d5eef87accc4 100644
> --- a/net/bluetooth/l2cap_sock.c
> +++ b/net/bluetooth/l2cap_sock.c
> @@ -891,6 +891,16 @@ static int l2cap_sock_setsockopt(struct socket *sock, int level, int optname,
>
> BT_DBG("sk %p", sk);
>
> + if (level == SOL_SOCKET) {
> + conn = chan->conn;
> + if (conn)
> + err = hci_conn_setsockopt(conn->hcon, sock->sk, level,
> + optname, optval, optlen);
> + if (err)
> + return err;
> + return sock_setsockopt(sock, level, optname, optval, optlen);
> + }
> +
> if (level == SOL_L2CAP)
> return l2cap_sock_setsockopt_old(sock, optname, optval, optlen);
>
> @@ -1931,6 +1941,9 @@ static struct sock *l2cap_sock_alloc(struct net *net, struct socket *sock,
>
> INIT_LIST_HEAD(&l2cap_pi(sk)->rx_busy);
>
> + if (sock)
> + set_bit(SOCK_CUSTOM_SOCKOPT, &sock->flags);
This is more complicated than it needs to be. I'd just add a new
callback, `hdev->set_priority(handle, skb->priority)`, so the driver
is called whenever it needs to elevate a connection's priority. That
said, there could be cases where a connection needs its priority set
momentarily to transmit A2DP, followed by OBEX packets that are best
effort. Therefore, `hci_conn` will probably need to track the priority
so it can detect when it needs changing on a per-skb basis.
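Roughly what I have in mind, as a userspace sketch only. Every name below (`sketch_conn`, the `set_priority` hook, the stub driver) is hypothetical, and the threshold is assumed to mirror TC_PRIO_INTERACTIVE:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PRIO_INTERACTIVE 6	/* assumed to mirror TC_PRIO_INTERACTIVE */

/* Hypothetical per-connection state; not the real struct hci_conn. */
struct sketch_conn {
	uint16_t handle;
	bool high_prio;		/* last priority class pushed to the controller */
	int (*set_priority)(uint16_t handle, bool high);	/* hypothetical driver hook */
};

/* Stub driver hook that only counts how often it is invoked. */
static int sketch_calls;

static int sketch_driver_set_priority(uint16_t handle, bool high)
{
	(void)handle;
	(void)high;
	sketch_calls++;
	return 0;
}

/*
 * Call for each outgoing packet. The hook runs only when the priority
 * class differs from the one last programmed for this connection.
 */
static int sketch_conn_update_priority(struct sketch_conn *conn,
				       uint32_t skb_priority)
{
	bool want_high = skb_priority >= SKETCH_PRIO_INTERACTIVE;
	int err;

	assert(conn != NULL);
	if (!conn->set_priority || want_high == conn->high_prio)
		return 0;

	err = conn->set_priority(conn->handle, want_high);
	if (!err)
		conn->high_prio = want_high;
	return err;
}
```

With the state kept per connection, a burst of A2DP packets triggers one call to raise the priority and the first best-effort packet afterwards triggers one call to lower it.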
> chan = l2cap_chan_create();
> if (!chan) {
> sk_free(sk);
>
> ---
> base-commit: bfe62a454542cfad3379f6ef5680b125f41e20f4
> change-id: 20260407-brcm-prio-b630e6cc3834
>
> Best regards,
> --
> Sasha Finkelstein <fnkl.kernel@gmail.com>
>
>
--
Luiz Augusto von Dentz
* Re: [PATCH net-next v4 1/2] net: pse-pd: add devm_pse_poll_helper()
From: Kory Maincent @ 2026-04-14 14:05 UTC (permalink / raw)
To: Jakub Kicinski
Cc: github, o.rempel, andrew+netdev, davem, edumazet, pabeni, krzk,
krzk+dt, conor+dt, robh, netdev, linux-kernel, linux-leds, lkp
In-Reply-To: <20260413225021.3634733-1-kuba@kernel.org>
On Mon, 13 Apr 2026 15:50:21 -0700
Jakub Kicinski <kuba@kernel.org> wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
> net: pse-pd: add devm_pse_poll_helper()
>
> Extract the common event handling loop from pse_isr() into a shared
> pse_handle_events() function, and add a generic poll-based alternative
> to the IRQ path for PSE controllers that lack interrupt support.
>
> > Work cancellation on teardown is handled via devm_add_action_or_reset()
> > to ensure the delayed work is cancelled before poll_notifs is freed
> > by devres, avoiding a use-after-free when devm_pse_poll_helper() is
> > called after devm_pse_controller_register() (devres LIFO ordering).
>
> Is it safe to rely on this devm registration order to prevent a
> use-after-free?
>
> Standard Linux driver initialization typically sets up interrupts and
> pollers before device registration to prevent race conditions where an
> unready device is exposed to userspace.
>
> If a driver follows this standard ordering by calling the helper before
> registration, devres will unregister the controller first. This will
> tear down resources in pse_controller_unregister() while the poll work
> remains active.
>
> If the worker executes during this race window, could it access the
> already-freed pcdev->pi array or push to the destroyed ntf_fifo? Would
> it be safer to explicitly cancel asynchronous event sources within
> pse_controller_unregister() instead?
>
> > --- a/drivers/net/pse-pd/pse_core.c
> > +++ b/drivers/net/pse-pd/pse_core.c
> [ ... ]
> > +static void pse_handle_events(struct pse_controller_dev *pcdev,
> > + unsigned long *notifs,
> > + unsigned long notifs_mask)
> > {
> > - struct pse_controller_dev *pcdev;
> > - unsigned long notifs_mask = 0;
> > - struct pse_irq_desc *desc;
> > - struct pse_irq *h = data;
> > - int ret, i;
> > -
> > - desc = &h->desc;
> > - pcdev = h->pcdev;
> > -
> > - /* Clear notifs mask */
> > - memset(h->notifs, 0, pcdev->nr_lines * sizeof(*h->notifs));
> > - mutex_lock(&pcdev->lock);
> > - ret = desc->map_event(irq, pcdev, h->notifs, &notifs_mask);
> > - if (ret || !notifs_mask) {
> > - mutex_unlock(&pcdev->lock);
> > - return IRQ_NONE;
> > - }
> > + int i;
> >
> > for_each_set_bit(i, &notifs_mask, pcdev->nr_lines) {
>
> Can this lead to an out-of-bounds read on the stack if a controller
> supports more than BITS_PER_LONG ports?
>
> The notifs_mask variable is passed by value to pse_handle_events(), but
> for_each_set_bit() takes its address. If pcdev->nr_lines is greater than
> BITS_PER_LONG (e.g., a 48-port switch on a 32-bit architecture), the
> macro will read past the function argument on the stack into uninitialized
> memory.
It seems there is a possible out-of-bounds issue in my code :/ Oops.
Carlo, could you take a look and propose a fix? Otherwise, I'll handle it.
Regards
--
Köry Maincent, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com
* RE: [EXTERNAL] [PATCH net] netvsc: transfer lower device max tso size during VF transition
From: Haiyang Zhang @ 2026-04-14 14:08 UTC (permalink / raw)
To: Li Tian, netdev@vger.kernel.org, linux-hyperv@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, Wei Liu, Dexuan Cui, Long Li,
Andrew Lunn, Eric Dumazet, Vitaly Kuznetsov, Paolo Abeni,
Jakub Kicinski, Jason Wang
In-Reply-To: <20260325045006.18607-1-litian@redhat.com>
> -----Original Message-----
> From: Li Tian <litian@redhat.com>
> Sent: Wednesday, March 25, 2026 12:50 AM
> To: netdev@vger.kernel.org; linux-hyperv@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org; Haiyang Zhang <haiyangz@microsoft.com>;
> Wei Liu <wei.liu@kernel.org>; Dexuan Cui <DECUI@microsoft.com>; Long Li
> <longli@microsoft.com>; Andrew Lunn <andrew+netdev@lunn.ch>; Eric Dumazet
> <edumazet@google.com>; Vitaly Kuznetsov <vkuznets@redhat.com>; Paolo Abeni
> <pabeni@redhat.com>; Jakub Kicinski <kuba@kernel.org>; Jason Wang
> <jasowang@redhat.com>; Li Tian <litian@redhat.com>
> Subject: [EXTERNAL] [PATCH net] netvsc: transfer lower device max tso size
> during VF transition
>
> When netvsc is accelerated by the lower device, we can advertise the
> lower device max tso size in order to get better performance.
> While a long-term migration to user-space bonding is planned, current
> users on RHEL 10 / Azure are experiencing significant performance
> regressions in 802.3ad environments. This patch provides a localized,
> safe fix within netvsc without introducing new core networking helpers.
>
> Signed-off-by: Li Tian <litian@redhat.com>
> ---
> drivers/net/hyperv/netvsc_drv.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/hyperv/netvsc_drv.c
> b/drivers/net/hyperv/netvsc_drv.c
> index ee5ab5ceb2be..971607c7406f 100644
> --- a/drivers/net/hyperv/netvsc_drv.c
> +++ b/drivers/net/hyperv/netvsc_drv.c
> @@ -2428,10 +2428,14 @@ static int netvsc_vf_changed(struct net_device
> *vf_netdev, unsigned long event)
> * This value is only increased for netvsc NIC when datapath
> is
> * switched over to the VF
> */
> - if (vf_is_up)
> + if (vf_is_up) {
> netif_set_tso_max_size(ndev, vf_netdev->tso_max_size);
> - else
> + WRITE_ONCE(ndev->gso_max_size, READ_ONCE(vf_netdev-
> >gso_max_size));
> + WRITE_ONCE(ndev->gso_ipv4_max_size,
> + READ_ONCE(vf_netdev->gso_ipv4_max_size));
> + } else {
> netif_set_tso_max_size(ndev, netvsc_dev-
> >netvsc_gso_max_size);
> + }
> }
>
> return NOTIFY_OK;
Thanks.
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
* [PATCH net v2] net: airoha: Wait for NPU PPE configuration to complete in airoha_ppe_offload_setup()
From: Lorenzo Bianconi @ 2026-04-14 14:08 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Lorenzo Bianconi
Cc: linux-arm-kernel, linux-mediatek, netdev
In order to properly enable flowtable hw offloading, poll
REG_PPE_FLOW_CFG register in airoha_ppe_offload_setup routine and
wait for NPU PPE configuration triggered by ppe_init callback to complete
before running airoha_ppe_hw_init().
Fixes: 00a7678310fe3 ("net: airoha: Introduce flowtable offload support")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
Changes in v2:
- Check for both REG_PPE_PPE_FLOW_CFG(0) and REG_PPE_PPE_FLOW_CFG(1) to
complete.
- Link to v1: https://lore.kernel.org/r/20260412-airoha-wait-for-npu-config-offload-setup-v1-1-f4e0aa2a5d85@kernel.org
---
drivers/net/ethernet/airoha/airoha_ppe.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
diff --git a/drivers/net/ethernet/airoha/airoha_ppe.c b/drivers/net/ethernet/airoha/airoha_ppe.c
index 62cfffb4f0e5..684c8ae9576f 100644
--- a/drivers/net/ethernet/airoha/airoha_ppe.c
+++ b/drivers/net/ethernet/airoha/airoha_ppe.c
@@ -1335,6 +1335,29 @@ static struct airoha_npu *airoha_ppe_npu_get(struct airoha_eth *eth)
return npu;
}
+static int airoha_ppe_wait_for_npu_init(struct airoha_eth *eth)
+{
+ int err;
+ u32 val;
+
+ /* PPE_FLOW_CFG default register value is 0. Since we reset FE
+ * during the device probe we can just check the configured value
+ * is not 0 here.
+ */
+ err = read_poll_timeout(airoha_fe_rr, val, val, USEC_PER_MSEC,
+ 100 * USEC_PER_MSEC, false, eth,
+ REG_PPE_PPE_FLOW_CFG(0));
+ if (err)
+ return err;
+
+ if (airoha_ppe_is_enabled(eth, 1))
+ err = read_poll_timeout(airoha_fe_rr, val, val, USEC_PER_MSEC,
+ 100 * USEC_PER_MSEC, false, eth,
+ REG_PPE_PPE_FLOW_CFG(1));
+
+ return err;
+}
+
static int airoha_ppe_offload_setup(struct airoha_eth *eth)
{
struct airoha_npu *npu = airoha_ppe_npu_get(eth);
@@ -1348,6 +1371,11 @@ static int airoha_ppe_offload_setup(struct airoha_eth *eth)
if (err)
goto error_npu_put;
+ /* Wait for NPU PPE configuration to complete */
+ err = airoha_ppe_wait_for_npu_init(eth);
+ if (err)
+ goto error_npu_put;
+
ppe_num_stats_entries = airoha_ppe_get_total_num_stats_entries(ppe);
if (ppe_num_stats_entries > 0) {
err = npu->ops.ppe_init_stats(npu, ppe->foe_stats_dma,
---
base-commit: b9d8b856689d2b968495d79fe653d87fcb8ad98c
change-id: 20260412-airoha-wait-for-npu-config-offload-setup-19d04522412d
Best regards,
--
Lorenzo Bianconi <lorenzo@kernel.org>
* Re: linux-next: manual merge of the bpf-next tree with the origin tree
From: Alexei Starovoitov @ 2026-04-14 14:09 UTC (permalink / raw)
To: Mark Brown
Cc: Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko, bpf,
Networking, Joel Fernandes, Kumar Kartikeya Dwivedi,
Linux Kernel Mailing List, Linux Next Mailing List,
Paul E. McKenney
In-Reply-To: <ad4whCJuB-viVAae@sirena.org.uk>
On Tue, Apr 14, 2026 at 5:18 AM Mark Brown <broonie@kernel.org> wrote:
>
> Hi all,
>
> Today's linux-next merge of the bpf-next tree got a conflict in:
>
> include/linux/rcupdate.h
>
> between commit:
>
> ad6ef775cbeff ("rcu-tasks: Document that RCU Tasks Trace grace periods now imply RCU grace periods")
>
> from the origin tree and commit:
>
> 57b23c0f612dc ("bpf: Retire rcu_trace_implies_rcu_gp()")
>
> from the bpf-next tree.
>
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging. You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.
>
> diff --combined include/linux/rcupdate.h
> index 18a85c30fd4f3,bfa765132de85..0000000000000
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@@ -205,15 -205,6 +205,6 @@@ static inline void exit_tasks_rcu_start
> static inline void exit_tasks_rcu_finish(void) { }
> #endif /* #else #ifdef CONFIG_TASKS_RCU_GENERIC */
>
> - /**
> - * rcu_trace_implies_rcu_gp - does an RCU Tasks Trace grace period imply an RCU grace period?
> - *
> - * Now that RCU Tasks Trace is implemented in terms of SRCU-fast, a
> - * call to synchronize_rcu_tasks_trace() is guaranteed to imply at least
> - * one call to synchronize_rcu().
> - */
> - static inline bool rcu_trace_implies_rcu_gp(void) { return true; }
> -
Right. I mentioned it in my bpf-next PR.
But how come you're saying it was discovered "today" ?
Paul's commit ad6ef775cbeff was committed to rcu tree on Mar 30,
while Kumar's 57b23c0f612dc was committed to bpf-next on Apr 7.
"today" is April 14.
My only explanation is that rcu tree was not in linux-next until today?!
* Re: [PATCH net-next v4 1/2] net: pse-pd: add devm_pse_poll_helper()
From: Kory Maincent @ 2026-04-14 14:11 UTC (permalink / raw)
To: Jakub Kicinski
Cc: github, o.rempel, andrew+netdev, davem, edumazet, pabeni, krzk,
krzk+dt, conor+dt, robh, netdev, linux-kernel, linux-leds, lkp
In-Reply-To: <20260414160506.4ce4d543@kmaincent-XPS-13-7390>
On Tue, 14 Apr 2026 16:05:06 +0200
Kory Maincent <kory.maincent@bootlin.com> wrote:
> On Mon, 13 Apr 2026 15:50:21 -0700
> Jakub Kicinski <kuba@kernel.org> wrote:
>
> > This is an AI-generated review of your patch. The human sending this
> > email has considered the AI review valid, or at least plausible.
> > ---
> > net: pse-pd: add devm_pse_poll_helper()
> >
> > Extract the common event handling loop from pse_isr() into a shared
> > pse_handle_events() function, and add a generic poll-based alternative
> > to the IRQ path for PSE controllers that lack interrupt support.
> >
> > > Work cancellation on teardown is handled via devm_add_action_or_reset()
> > > to ensure the delayed work is cancelled before poll_notifs is freed
> > > by devres, avoiding a use-after-free when devm_pse_poll_helper() is
> > > called after devm_pse_controller_register() (devres LIFO ordering).
> >
> > Is it safe to rely on this devm registration order to prevent a
> > use-after-free?
> >
> > Standard Linux driver initialization typically sets up interrupts and
> > pollers before device registration to prevent race conditions where an
> > unready device is exposed to userspace.
> >
> > If a driver follows this standard ordering by calling the helper before
> > registration, devres will unregister the controller first. This will
> > tear down resources in pse_controller_unregister() while the poll work
> > remains active.
> >
> > If the worker executes during this race window, could it access the
> > already-freed pcdev->pi array or push to the destroyed ntf_fifo? Would
> > it be safer to explicitly cancel asynchronous event sources within
> > pse_controller_unregister() instead?
> >
> > > --- a/drivers/net/pse-pd/pse_core.c
> > > +++ b/drivers/net/pse-pd/pse_core.c
> > [ ... ]
> > > +static void pse_handle_events(struct pse_controller_dev *pcdev,
> > > + unsigned long *notifs,
> > > + unsigned long notifs_mask)
> > > {
> > > - struct pse_controller_dev *pcdev;
> > > - unsigned long notifs_mask = 0;
> > > - struct pse_irq_desc *desc;
> > > - struct pse_irq *h = data;
> > > - int ret, i;
> > > -
> > > - desc = &h->desc;
> > > - pcdev = h->pcdev;
> > > -
> > > - /* Clear notifs mask */
> > > - memset(h->notifs, 0, pcdev->nr_lines * sizeof(*h->notifs));
> > > - mutex_lock(&pcdev->lock);
> > > - ret = desc->map_event(irq, pcdev, h->notifs, &notifs_mask);
> > > - if (ret || !notifs_mask) {
> > > - mutex_unlock(&pcdev->lock);
> > > - return IRQ_NONE;
> > > - }
> > > + int i;
> > >
> > > for_each_set_bit(i, &notifs_mask, pcdev->nr_lines) {
> >
> > Can this lead to an out-of-bounds read on the stack if a controller
> > supports more than BITS_PER_LONG ports?
> >
> > The notifs_mask variable is passed by value to pse_handle_events(), but
> > for_each_set_bit() takes its address. If pcdev->nr_lines is greater than
> > BITS_PER_LONG (e.g., a 48-port switch on a 32-bit architecture), the
> > macro will read past the function argument on the stack into uninitialized
> > memory.
>
> It seems there is a possible out-of-bounds issue in my code :/ Oops.
> Carlo, could you take a look and propose a fix? Otherwise, I'll handle it.
But currently it can't be reached, as the only driver that supports interrupts is
the TPS23881 with 8 ports.
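For reference, a minimal userspace sketch of one possible guard, assuming the mask stays a single unsigned long: the iteration is clamped to the width of the word so it can never read past it. The names are illustrative and this is not the actual pse_core.c code.

```c
#include <assert.h>
#include <limits.h>

#define SKETCH_BITS_PER_LONG ((unsigned int)(sizeof(unsigned long) * CHAR_BIT))

static_assert(SKETCH_BITS_PER_LONG >= 32, "unsigned long is at least 32 bits");

/*
 * Walk the set bits of a single-word mask. The bound is the smaller of
 * nr_lines and the word width, so a controller reporting more lines
 * than BITS_PER_LONG cannot make the loop look beyond the mask.
 */
static unsigned int sketch_count_events(unsigned long notifs_mask,
					unsigned int nr_lines)
{
	unsigned int limit = nr_lines < SKETCH_BITS_PER_LONG ?
			     nr_lines : SKETCH_BITS_PER_LONG;
	unsigned int i, handled = 0;

	for (i = 0; i < limit; i++) {
		if (notifs_mask & (1UL << i))
			handled++;	/* the real loop would handle line i here */
	}
	return handled;
}
```

Clamping only hides lines above the word width; supporting them properly would need a real bitmap sized for nr_lines.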
Regards,
--
Köry Maincent, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com
* Re: [PATCH net-next] net: stmmac: enable RPS and RBU interrupts
From: Russell King (Oracle) @ 2026-04-14 14:13 UTC (permalink / raw)
To: Sam Edwards
Cc: Jakub Kicinski, Andrew Lunn, Alexandre Torgue, Andrew Lunn,
David S. Miller, Eric Dumazet,
moderated list:BROADCOM BCM2711/BCM2835 ARM ARCHITECTURE,
linux-stm32, Linux Network Development Mailing List, Paolo Abeni
In-Reply-To: <CAH5Ym4i7VV53hQGY3AjAUW3B8g_ffgmw69kPhPrk2CmcRbguuQ@mail.gmail.com>
Hi Sam,
Most of this email was written this morning, but I didn't have a chance
to finish or send it due to how busy I am.
I had also written a separate reply last night with detailed results of
what I was seeing but didn't/haven't got around to sending it. Not
currently sure whether I saved it as draft or got rid of it yet.
On Mon, Apr 13, 2026 at 02:54:30PM -0700, Sam Edwards wrote:
> On Mon, Apr 13, 2026, 11:49 Russell King (Oracle) <linux@armlinux.org.uk> wrote:
> >
> > On Mon, Apr 13, 2026 at 11:02:22AM -0700, Jakub Kicinski wrote:
> > > On Fri, 10 Apr 2026 14:07:51 +0100 Russell King (Oracle) wrote:
> > > > Since we are seeing receive buffer exhaustion on several platforms,
> > > > let's enable the interrupts so the statistics we publish via ethtool -S
> > > > actually work to aid diagnosis. I've been in two minds about whether
> > > > to send this patch, but given the problems with stmmac at the moment,
> > > > I think it should be merged.
> > >
> > > Sorry for an under-researched response, but wasn't there a person trying
> > > to fix the OOM starvation issue? Who was supposed to add a timer?
> > > Is your problem also OOM related or do you suspect something else?
> >
> > It is not OOM related. I have this patch applied:
> >
> > diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > index 131ea887bedc..614d0e10e3e6 100644
> > --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > @@ -5095,14 +5095,18 @@ static inline void stmmac_rx_refill(struct stmmac_priv *priv, u32 queue)
> >
> > if (!buf->page) {
> > buf->page = page_pool_alloc_pages(rx_q->page_pool, gfp);
> > - if (!buf->page)
> > + if (!buf->page) {
> > + netdev_err(priv->dev, "q%u: no buffer 1\n", queue);
> > break;
> > + }
> > }
> >
> > if (priv->sph_active && !buf->sec_page) {
> > buf->sec_page = page_pool_alloc_pages(rx_q->page_pool, gfp);
> > - if (!buf->sec_page)
> > + if (!buf->sec_page) {
> > + netdev_err(priv->dev, "q%u: no buffer 2\n", queue);
> > break;
> > + }
> >
> > buf->sec_addr = page_pool_get_dma_addr(buf->sec_page);
> > }
> >
> > and it is silent, so we are not suffering starvation of buffers.
> >
> > However, the hardware hangs during iperf3, and because it triggers the
> > MAC to stream PAUSE frames, and my network uses Netgear GS108 and GS116
> > unmanaged switches that always use flow-control between them (there's no
> > way not to) it takes down the entire network - as we've discussed
> > before. So, this problem is pretty fatal to the *entire* network.
> >
> > With this patch, the existing statistical counters for this condition
> > are incremented, and thus users can use ethtool -S to see what happened
> > and report whether they are seeing the same issue.
> >
> > Without this patch applied, there are no diagnostics from stmmac that
> > report what the state is. ethtool -d doesn't list the appropriate
> > registers (as I suspect part of the problem is the number of queues
> > is somewhat dynamic - userspace can change that configuration through
> > ethtool).
> >
> > Thus, one has to resort to using devmem2 to find out what's happened.
> > That's not user friendly.
> >
> > For me, devmem2 shows:
> >
> > Channel 0 status register:
> > Value at address 0x02491160: 0x00000484
> > bit 10: ETI early transmit interrupt - set
> > bit 9 : RWT receive watchdog - clear
> > bit 8 : RPS receive process stopped - clear
> > bit 7 : RBU receive buffer unavailable - set
> > bit 6 : RI receive interrupt - clear
> > bit 2 : TBU transmit buffer unavailable - set
> > bit 1 : TPS transmit process stopped - clear
> > bit 0 : TI transmit interrupt - clear
>
> Should that reset trigger be RPS, not RBU? My understanding of these
> status bits is RBU is just "RxDMA has failed to take a frame from the
> RxFIFO" while RPS is "the RxFIFO is full." That would make RBU our
> critical threshold to start proactively refilling, and RPS the "too
> late, we lose" threshold.
That's a fine theory, but look at the channel 0 status register above,
noting that any interrupts that are raised but not enabled remain set.
RPS is not set, so RPS is not being raised, only RBU when this
condition occurs.
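To make that decode easy to repeat, here is a minimal sketch using the bit positions from the dump above. The macro and function names are illustrative and not the driver's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as listed in the channel 0 status dump. */
#define SKETCH_STATUS_ETI	(1u << 10)	/* early transmit interrupt */
#define SKETCH_STATUS_RWT	(1u << 9)	/* receive watchdog */
#define SKETCH_STATUS_RPS	(1u << 8)	/* receive process stopped */
#define SKETCH_STATUS_RBU	(1u << 7)	/* receive buffer unavailable */
#define SKETCH_STATUS_RI	(1u << 6)	/* receive interrupt */
#define SKETCH_STATUS_TBU	(1u << 2)	/* transmit buffer unavailable */
#define SKETCH_STATUS_TPS	(1u << 1)	/* transmit process stopped */
#define SKETCH_STATUS_TI	(1u << 0)	/* transmit interrupt */

/* The dumped value decomposes into exactly these three flags. */
static_assert(0x00000484u == (SKETCH_STATUS_ETI | SKETCH_STATUS_RBU |
			      SKETCH_STATUS_TBU),
	      "0x484 decodes to ETI | RBU | TBU");

static bool sketch_status_test(uint32_t status, uint32_t bit)
{
	return (status & bit) != 0;
}
```

Checking 0x00000484 against SKETCH_STATUS_RBU and SKETCH_STATUS_RPS gives set and clear respectively, which is the observation the argument rests on.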
> Thinking aloud: Do you suppose the RxDMA waits for a wakeup signal
> sent whenever a frame is added to RxFIFO? That might explain why the
> former never recovers once the latter is full: a manual wakeup needs
> to be sent whenever we resolve RBU. Does the .enable_dma_reception()
> op need to be implemented for dwmac5, or have you tried that already?
I've not found anything in the closest documentation I have. The Xavier
is Synopsys IP v5.0, whereas i.MX8M is v5.1 - and v5.1 compared to
previous versions reads the same for statements concerning recovering
from a RBU condition:
"In ring mode, the application should advance the Receive Descriptor
Tail Pointer register of a channel. This bit is set only when the DMA
owns the previous Rx descriptor."
I've tried expanding what happens when RBU fires, dumping some of the
receive state and the receive ring:
[ 55.766199] dwc-eth-dwmac 2490000.ethernet eth0: q0: receive buffer unavailable: cur_rx=309 dirty_rx=309 last_cur_rx=245 last_cur_rx_post=309 last_dirty_rx=245 count=64 budget=64
cur_rx == dirty_rx _should_ mean that we fully refilled the ring. These
are their values at the point the RBU interrupt fires.
last_cur_rx and last_dirty_rx are the values of cur_rx/dirty_rx when
stmmac_rx() was last entered.
last_cur_rx_post is the value of cur_rx when stmmac_rx() finished
looping but before we have refilled the ring.
count is the value of count just before stmmac_rx() returns, budget is
the limit at that point.
The patch that prints errors should we fail to allocate a buffer is in
place, none of those errors fire, so we are fully repopulating the ring
each time stmmac_rx() runs.
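The index arithmetic behind those numbers, as a small sketch. It assumes a 512-entry ring (the dump below runs from 000 to 511) and uses an illustrative helper name rather than the driver's own function.

```c
#include <assert.h>

/*
 * Number of ring entries consumed by software but not yet refilled,
 * with wrap-around at ring_size.
 */
static unsigned int sketch_rx_dirty(unsigned int cur_rx,
				    unsigned int dirty_rx,
				    unsigned int ring_size)
{
	assert(ring_size > 0 && cur_rx < ring_size && dirty_rx < ring_size);

	if (cur_rx >= dirty_rx)
		return cur_rx - dirty_rx;
	return ring_size - dirty_rx + cur_rx;
}
```

With last_cur_rx_post=309 and last_dirty_rx=245 this gives 64 entries to refill, matching count=64, and with cur_rx == dirty_rx == 309 it gives 0, i.e. nothing left outstanding.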
[ 55.766785] RX descriptor ring:
[ 55.766802] 000 [0x0000007fffffe000]: 0x0 0x12 0x0 0x340105ee
[ 55.766826] 001 [0x0000007fffffe010]: 0x0 0x12 0x0 0x340105ee
[ 55.766843] 002 [0x0000007fffffe020]: 0x0 0x12 0x0 0x340105ee
[ 55.766860] 003 [0x0000007fffffe030]: 0x0 0x12 0x0 0x340105ee
...
[ 55.772205] 308 [0x0000007ffffff340]: 0x0 0x12 0x0 0x340105ee
[ 55.772221] 309 [0x0000007ffffff350]: 0x0 0x12 0x0 0x340105ee
[ 55.772237] 310 [0x0000007ffffff360]: 0x0 0x12 0x0 0x340105ee
[ 55.772253] 311 [0x0000007ffffff370]: 0x0 0x12 0x0 0x340105ee
[ 55.772268] 312 [0x0000007ffffff380]: 0x0 0x12 0x0 0x340105ee
[ 55.772284] 313 [0x0000007ffffff390]: 0x0 0x12 0x0 0x340105ee
[ 55.772300] 314 [0x0000007ffffff3a0]: 0x0 0x12 0x0 0x340105ee
[ 55.772315] 315 [0x0000007ffffff3b0]: 0x0 0x12 0x0 0x340105ee
...
[ 55.775539] 511 [0x0000007ffffffff0]: 0x0 0x12 0x0 0x340105ee
Every ring entry contains the same RDES3 value, so it really is
completely full at the point RBU fires (bit 31 clear means software
owns the descriptor, and it's basically saying first/last segment,
RDES1 valid, buffer 1 length of 1518).
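A sketch of that RDES3 decode. Only the meaning of bit 31 is stated above; the positions used for the first/last/RDES1-valid flags and the 15-bit length mask are assumed here from the write-back descriptor layout, and all names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_RDES3_OWN	(1u << 31)	/* set: hardware owns the descriptor */
#define SKETCH_RDES3_FD		(1u << 29)	/* assumed: first segment */
#define SKETCH_RDES3_LD		(1u << 28)	/* assumed: last segment */
#define SKETCH_RDES3_RS1V	(1u << 26)	/* assumed: RDES1 status valid */
#define SKETCH_RDES3_LEN_MASK	0x7fffu		/* assumed: length in bits 14:0 */

/* Every entry in the dump has the ownership bit clear. */
static_assert((0x340105eeu & SKETCH_RDES3_OWN) == 0,
	      "dumped descriptors are software-owned");

struct sketch_rdes3 {
	bool hw_owned;
	bool first;
	bool last;
	bool rdes1_valid;
	unsigned int len;
};

static struct sketch_rdes3 sketch_decode_rdes3(uint32_t v)
{
	struct sketch_rdes3 d = {
		.hw_owned = (v & SKETCH_RDES3_OWN) != 0,
		.first = (v & SKETCH_RDES3_FD) != 0,
		.last = (v & SKETCH_RDES3_LD) != 0,
		.rdes1_valid = (v & SKETCH_RDES3_RS1V) != 0,
		.len = v & SKETCH_RDES3_LEN_MASK,
	};

	return d;
}
```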
The Rx tail pointer register contains 0xfffff3a0 which is entry 314.
The current receive descriptor address is also 0xfffff3a0. Note that
these values were obtained some time after the RBU interrupt fired
(due to the time taken for devmem2 to access every stmmac register -
I have a script that dumps the entire stmmac register state via
devmem2.)
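The mapping from a descriptor address to a ring entry can be checked with a few lines. This sketch assumes 16-byte descriptors (the spacing of the addresses in the dump) and compares only the low 32 bits, since that is what the tail pointer register holds; the helper name is illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_DESC_SIZE 16u	/* address spacing between ring entries in the dump */

/* Convert the low 32 bits of a descriptor address into a ring index. */
static unsigned int sketch_desc_index(uint32_t ring_base_lo,
				      uint32_t desc_addr_lo)
{
	assert(desc_addr_lo >= ring_base_lo);
	assert((desc_addr_lo - ring_base_lo) % SKETCH_DESC_SIZE == 0);

	return (desc_addr_lo - ring_base_lo) / SKETCH_DESC_SIZE;
}
```

With the ring base at 0xffffe000 (entry 000 in the dump), 0xfffff3a0 maps to entry 314 and 0xfffff350 maps to entry 309.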
The other thing to note is that when looking at debugfs
stmmaceth/eth0/descriptor* (or whatever it's called, I don't have the
NX powered to look at the moment, and I didn't take a copy of it last
night) all the descriptor entries are fully repopulated with buffers
and owned by the hardware.
I've tried using devmem2 to write to the rx tail pointer to kick it
back into action, but that changes nothing. I've tried writing the
next descriptor value and previous descriptor value, but that appears
to have no effect, it steadfastly remains stuck - and as that is the
documented recovery from RBU and there's no "receive demand" register
listed in dwmac v4 or v5 documentation, there seems to be no other
documented way.
The debug registers that I provided in my previous email suggest that
the MAC is waiting for a packet, and MTL's descriptor reader is idle
(I'm guessing it would only briefly change when the tail pointer is
updated.)
Note that I have augmented the driver with more dma_rmb() + dma_wmb()
in stmmac_rx(), dwmac4_wrback_get_rx_status(), and stmmac_rx_refill()
to ensure that reads and writes to the descriptor ring are correctly
ordered. While this generally allows iperf3 to run for a few more
seconds, it doesn't solve the problem - it is very rare for iperf3
to actually complete before stmmac has taken down my entire network.
I have noticed that on some occasions I see a small number of RBU
interrupts before it falls over.
I'm not going to have much time to look at this today due to further
appointments (I also didn't yesterday - only an hour in the morning
and a bit more time late in the evening/night.) I should have more
time during the rest of the week... but that may change.
From the above, it looks like NAPI/stmmac driver isn't keeping up with
the packet flow coming from an i.MX6 platform (which is limited to
around 470Mbps due to internal SoC bus limitations.)
I'll also mention that stmmac falls apart even more if I run iperf3 -c
-R against an x86 machine that is capable of saturating the network,
so much so that the arm-smmu IOMMU throws errors even after the stmmac
hardware has been soft-reset for addresses that were in the ring
*prior* to the soft-reset occurring (stmmac is soft-reset each time the
netdev is brought up.) The only recovery from that is to reboot -
down/up the interface just spews more IOMMU errors. I don't have the
details of that to hand and I don't have enough time to re-run that
test this morning. From what I remember, the transmit side also stops
processing descriptors (one can see them accumulate in the debugfs
file,) which eventually leads to the netdev watchdog firing.
It currently looks like the stmmac v5 EQoS IP works fine only under
light packet loads. If one puts any stress on it, then the hardware
totally falls apart. This may point to an issue with the AXI bus
configuration that is specific to this platform, but that requires
further investigation.
I'll mention again, in case anyone's forgotten, that these problems
pre-date any of my cleanups I've made to stmmac. From what I remember
they are reproducible with the kernels that are supplied as part of
the nVidia BSP. Again, as I don't have access to the nVidia platform
at the moment, I can't include the details in this email.
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
* [PATCH bpf v4 0/5] bpf, sockmap: Fix af_unix null-ptr-deref in proto update
From: Michal Luczaj @ 2026-04-14 14:13 UTC (permalink / raw)
To: John Fastabend, Jakub Sitnicki, Eric Dumazet, Kuniyuki Iwashima,
Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski,
Simon Horman, Yonghong Song, Andrii Nakryiko, Alexei Starovoitov,
Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Shuah Khan, Cong Wang
Cc: netdev, bpf, linux-kernel, linux-kselftest, Michal Luczaj,
Jiayuan Chen, 钱一铭
Updating sockmap/sockhash using a unix sock races unix_stream_connect():
when sock_map_sk_state_allowed() passes (sk_state == TCP_ESTABLISHED),
unix_peer(sk) in unix_stream_bpf_update_proto() may still return NULL.
Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
Changes in v4:
- Circle back to v1 approach
- More details in commit messages [Martin]
- Make unix iter take the state lock [Kuniyuki]
- Link to v3: https://lore.kernel.org/r/20260306-unix-proto-update-null-ptr-deref-v3-0-2f0c7410c523@rbox.co
Changes in v3:
- Drop sparse annotations [Martin]
- Keep lock_sock() along the unix_state_lock() [Kuniyuki]
- Unify BPF iter af_unix locking [Kuniyuki, Martin]
- Link to v2: https://lore.kernel.org/r/20260207-unix-proto-update-null-ptr-deref-v2-0-9f091330e7cd@rbox.co
Changes in v2:
- Instead of probing for unix peer, make sockmap take the right lock [Martin]
- Annotate data races [Kuniyuki, Martin]
- Extend bpf unix iter selftest to attempt a deadlock
- Link to v1: https://lore.kernel.org/r/20260129-unix-proto-update-null-ptr-deref-v1-1-e1daeb7012fd@rbox.co
To: John Fastabend <john.fastabend@gmail.com>
To: Jakub Sitnicki <jakub@cloudflare.com>
To: Eric Dumazet <edumazet@google.com>
To: Kuniyuki Iwashima <kuniyu@google.com>
To: Paolo Abeni <pabeni@redhat.com>
To: Willem de Bruijn <willemb@google.com>
To: "David S. Miller" <davem@davemloft.net>
To: Jakub Kicinski <kuba@kernel.org>
To: Simon Horman <horms@kernel.org>
To: Yonghong Song <yhs@fb.com>
To: Andrii Nakryiko <andrii@kernel.org>
To: Eduard Zingerman <eddyz87@gmail.com>
To: Alexei Starovoitov <ast@kernel.org>
To: Daniel Borkmann <daniel@iogearbox.net>
To: Martin KaFai Lau <martin.lau@linux.dev>
To: Song Liu <song@kernel.org>
To: Yonghong Song <yonghong.song@linux.dev>
To: KP Singh <kpsingh@kernel.org>
To: Stanislav Fomichev <sdf@fomichev.me>
To: Hao Luo <haoluo@google.com>
To: Jiri Olsa <jolsa@kernel.org>
To: Shuah Khan <shuah@kernel.org>
To: Cong Wang <cong.wang@bytedance.com>
Cc: netdev@vger.kernel.org
Cc: bpf@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org
---
Michal Luczaj (5):
bpf, sockmap: Annotate af_unix sock::sk_state data-races
bpf, sockmap: Fix af_unix iter deadlock
selftests/bpf: Extend bpf_iter_unix to attempt deadlocking
bpf, sockmap: Fix af_unix null-ptr-deref in proto update
bpf, sockmap: Take state lock for af_unix iter
net/core/sock_map.c | 4 ++--
net/unix/af_unix.c | 9 +++++----
net/unix/unix_bpf.c | 3 +++
tools/testing/selftests/bpf/progs/bpf_iter_unix.c | 10 ++++++++++
4 files changed, 20 insertions(+), 6 deletions(-)
---
base-commit: 0f00132132937ca01a99feaf8985109a9087c9ff
change-id: 20260129-unix-proto-update-null-ptr-deref-6a2733bcbbf8
Best regards,
--
Michal Luczaj <mhal@rbox.co>
^ permalink raw reply
* [PATCH bpf v4 3/5] selftests/bpf: Extend bpf_iter_unix to attempt deadlocking
From: Michal Luczaj @ 2026-04-14 14:13 UTC (permalink / raw)
To: John Fastabend, Jakub Sitnicki, Eric Dumazet, Kuniyuki Iwashima,
Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski,
Simon Horman, Yonghong Song, Andrii Nakryiko, Alexei Starovoitov,
Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Shuah Khan, Cong Wang
Cc: netdev, bpf, linux-kernel, linux-kselftest, Michal Luczaj,
Jiayuan Chen
In-Reply-To: <20260414-unix-proto-update-null-ptr-deref-v4-0-2af6fe97918e@rbox.co>
Updating a sockmap from a unix iterator prog may lead to a deadlock.
Piggyback on the original selftest.
Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
tools/testing/selftests/bpf/progs/bpf_iter_unix.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_unix.c b/tools/testing/selftests/bpf/progs/bpf_iter_unix.c
index fea275df9e22..a2652c8c3616 100644
--- a/tools/testing/selftests/bpf/progs/bpf_iter_unix.c
+++ b/tools/testing/selftests/bpf/progs/bpf_iter_unix.c
@@ -7,6 +7,13 @@
char _license[] SEC("license") = "GPL";
+SEC(".maps") struct {
+ __uint(type, BPF_MAP_TYPE_SOCKMAP);
+ __uint(max_entries, 1);
+ __type(key, __u32);
+ __type(value, __u64);
+} sockmap;
+
static long sock_i_ino(const struct sock *sk)
{
const struct socket *sk_socket = sk->sk_socket;
@@ -76,5 +83,8 @@ int dump_unix(struct bpf_iter__unix *ctx)
BPF_SEQ_PRINTF(seq, "\n");
+ /* Test for deadlock. */
+ bpf_map_update_elem(&sockmap, &(int){0}, sk, 0);
+
return 0;
}
--
2.53.0
^ permalink raw reply related
* [PATCH bpf v4 1/5] bpf, sockmap: Annotate af_unix sock::sk_state data-races
From: Michal Luczaj @ 2026-04-14 14:13 UTC (permalink / raw)
To: John Fastabend, Jakub Sitnicki, Eric Dumazet, Kuniyuki Iwashima,
Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski,
Simon Horman, Yonghong Song, Andrii Nakryiko, Alexei Starovoitov,
Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Shuah Khan, Cong Wang
Cc: netdev, bpf, linux-kernel, linux-kselftest, Michal Luczaj,
Jiayuan Chen
In-Reply-To: <20260414-unix-proto-update-null-ptr-deref-v4-0-2af6fe97918e@rbox.co>
sock_map_sk_state_allowed() and sock_map_redirect_allowed() read af_unix
socket sk_state locklessly.
Use READ_ONCE(). Note that the sock_map_redirect_allowed() change affects
not only af_unix, but all non-TCP sockets (UDP, af_vsock).
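As a rough user-space illustration of the annotation, the sketch below approximates READ_ONCE() with a single volatile load and applies it to a lockless state test. The macro body, struct fake_sock and state_allowed() are assumptions made for this sketch; the kernel's READ_ONCE() is more elaborate.

```c
#include <stdbool.h>

/* User-space approximation of READ_ONCE(): force one volatile load. */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

/* Same numeric values as the kernel's TCP state enum, redefined locally. */
enum { TCP_ESTABLISHED = 1, TCP_LISTEN = 10 };
#define TCPF_ESTABLISHED (1 << TCP_ESTABLISHED)

/* Hypothetical stand-in for struct sock. */
struct fake_sock {
	unsigned char sk_state;
};

/* Lockless check: the state may change concurrently, so read it once. */
static bool state_allowed(const struct fake_sock *sk)
{
	return (1 << READ_ONCE(sk->sk_state)) & TCPF_ESTABLISHED;
}
```

The volatile access keeps the compiler from tearing or repeating the load; it does not make the result any less stale by the time it is used.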
Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
net/core/sock_map.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index b0e96337a269..02a68be3002a 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -530,7 +530,7 @@ static bool sock_map_redirect_allowed(const struct sock *sk)
if (sk_is_tcp(sk))
return sk->sk_state != TCP_LISTEN;
else
- return sk->sk_state == TCP_ESTABLISHED;
+ return READ_ONCE(sk->sk_state) == TCP_ESTABLISHED;
}
static bool sock_map_sk_is_suitable(const struct sock *sk)
@@ -543,7 +543,7 @@ static bool sock_map_sk_state_allowed(const struct sock *sk)
if (sk_is_tcp(sk))
return (1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_LISTEN);
if (sk_is_stream_unix(sk))
- return (1 << sk->sk_state) & TCPF_ESTABLISHED;
+ return (1 << READ_ONCE(sk->sk_state)) & TCPF_ESTABLISHED;
if (sk_is_vsock(sk) &&
(sk->sk_type == SOCK_STREAM || sk->sk_type == SOCK_SEQPACKET))
return (1 << sk->sk_state) & TCPF_ESTABLISHED;
--
2.53.0
^ permalink raw reply related
* Re: [syzbot] [lvs?] BUG: sleeping function called from invalid context in ip_vs_conn_expire
From: Julian Anastasov @ 2026-04-14 14:18 UTC (permalink / raw)
To: Jiayuan Chen
Cc: syzbot, coreteam, davem, edumazet, fw, horms, kuba, linux-kernel,
lvs-devel, netdev, netfilter-devel, pabeni, pablo, phil,
syzkaller-bugs
In-Reply-To: <927be094-315b-48ab-8e89-45bbe9183d5b@linux.dev>
Hello,
On Tue, 14 Apr 2026, Jiayuan Chen wrote:
>
> On 4/14/26 6:30 PM, syzbot wrote:
>
> [...]
>
> > if you fix the issue, please add the following tag to the commit:
> > Reported-by: syzbot+504e778ddaecd36fdd17@syzkaller.appspotmail.com
> >
> > BUG: sleeping function called from invalid context at
> > kernel/locking/spinlock_rt.c:48
>
>
>
> The problem occurs under PREEMPT_RT. Pairing conn_tab_lock with spin_lock
> causes the problem:
>
> conn_tab_lock(...) -> hlist_bl_lock -> preempt_disable() ==> disables
> preemption
> spin_lock(&cp->lock) -> rt_mutex ==> sleepable under RT, but preemption
> is already disabled by conn_tab_lock
I guess spin_lock(&cp->lock), which sleeps under
PREEMPT_RT, should not be called under a bit spinlock.
I'll check it soon...
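To make the nesting problem concrete, here is a small user-space model. It only counts how often a lock that sleeps on PREEMPT_RT is taken while preemption is disabled. All fake_* names and the counter are hypothetical, and the second ordering is shown only for contrast, not as a proposed fix.

```c
static int preempt_count;          /* models the preemption-disable depth */
static int might_sleep_violations; /* counts "sleeping in atomic" events */

/* Bit spinlock model: always runs with preemption off, even on RT. */
static void fake_bit_spin_lock(void)   { preempt_count++; }
static void fake_bit_spin_unlock(void) { preempt_count--; }

/* On PREEMPT_RT, spin_lock() may sleep; model the might-sleep check. */
static void fake_rt_spin_lock(void)
{
	if (preempt_count > 0)
		might_sleep_violations++;
}

/* Sleeping lock taken under the bit spinlock: the reported pattern. */
static int nested_under_bit_lock(void)
{
	preempt_count = 0;
	might_sleep_violations = 0;
	fake_bit_spin_lock();
	fake_rt_spin_lock();
	fake_bit_spin_unlock();
	return might_sleep_violations;
}

/* Sleeping lock taken first, bit spinlock inside it. */
static int sleeping_lock_first(void)
{
	preempt_count = 0;
	might_sleep_violations = 0;
	fake_rt_spin_lock();
	fake_bit_spin_lock();
	fake_bit_spin_unlock();
	return might_sleep_violations;
}
```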
> > in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 16, name: ktimers/0
> > preempt_count: 2, expected: 0
> > RCU nest depth: 3, expected: 3
> > 8 locks held by ktimers/0/16:
> > #0: ffffffff8de5f260 (local_bh){.+.+}-{1:3}, at:
> > __local_bh_disable_ip+0x3c/0x420 kernel/softirq.c:163
> > #1: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at:
> > __local_bh_disable_ip+0x3c/0x420 kernel/softirq.c:163
> > #2: ffff8880b8826360 (&base->expiry_lock){+...}-{3:3}, at: spin_lock
> > include/linux/spinlock_rt.h:45 [inline]
> > #2: ffff8880b8826360 (&base->expiry_lock){+...}-{3:3}, at:
> > timer_base_lock_expiry kernel/time/timer.c:1502 [inline]
> > #2: ffff8880b8826360 (&base->expiry_lock){+...}-{3:3}, at:
> > __run_timer_base+0x120/0x9f0 kernel/time/timer.c:2384
> > #3: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire
> > include/linux/rcupdate.h:300 [inline]
> > #3: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: rcu_read_lock
> > include/linux/rcupdate.h:838 [inline]
> > #3: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: __rt_spin_lock
> > kernel/locking/spinlock_rt.c:50 [inline]
> > #3: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at:
> > rt_spin_lock+0x1e0/0x400 kernel/locking/spinlock_rt.c:57
> > #4: ffffc90000157a80 ((&cp->timer)){+...}-{0:0}, at:
> > call_timer_fn+0xd4/0x5e0 kernel/time/timer.c:1745
> > #5: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire
> > include/linux/rcupdate.h:300 [inline]
> > #5: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: rcu_read_lock
> > include/linux/rcupdate.h:838 [inline]
> > #5: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: ip_vs_conn_unlink
> > net/netfilter/ipvs/ip_vs_conn.c:315 [inline]
> > #5: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at:
> > ip_vs_conn_expire+0x257/0x2390 net/netfilter/ipvs/ip_vs_conn.c:1260
> > #6: ffffffff8de5f260 (local_bh){.+.+}-{1:3}, at:
> > __local_bh_disable_ip+0x3c/0x420 kernel/softirq.c:163
> > #7: ffff888068d4c3f0 (&cp->lock#2){+...}-{3:3}, at: spin_lock
> > include/linux/spinlock_rt.h:45 [inline]
> > #7: ffff888068d4c3f0 (&cp->lock#2){+...}-{3:3}, at: ip_vs_conn_unlink
> > net/netfilter/ipvs/ip_vs_conn.c:324 [inline]
> > #7: ffff888068d4c3f0 (&cp->lock#2){+...}-{3:3}, at:
> > ip_vs_conn_expire+0xd4a/0x2390 net/netfilter/ipvs/ip_vs_conn.c:1260
> > Preemption disabled at:
> > [<ffffffff898a6358>] bit_spin_lock include/linux/bit_spinlock.h:38 [inline]
> > [<ffffffff898a6358>] hlist_bl_lock+0x18/0x110 include/linux/list_bl.h:149
> > CPU: 0 UID: 0 PID: 16 Comm: ktimers/0 Tainted: G W L
> > syzkaller #0 PREEMPT_{RT,(full)}
> > Tainted: [W]=WARN, [L]=SOFTLOCKUP
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > Google 03/18/2026
> > Call Trace:
> > <TASK>
> > dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
> > __might_resched+0x329/0x480 kernel/sched/core.c:9162
> > __rt_spin_lock kernel/locking/spinlock_rt.c:48 [inline]
> > rt_spin_lock+0xc2/0x400 kernel/locking/spinlock_rt.c:57
> > spin_lock include/linux/spinlock_rt.h:45 [inline]
> > ip_vs_conn_unlink net/netfilter/ipvs/ip_vs_conn.c:324 [inline]
> > ip_vs_conn_expire+0xd4a/0x2390 net/netfilter/ipvs/ip_vs_conn.c:1260
> > call_timer_fn+0x192/0x5e0 kernel/time/timer.c:1748
> > expire_timers kernel/time/timer.c:1799 [inline]
> > __run_timers kernel/time/timer.c:2374 [inline]
> > __run_timer_base+0x6a3/0x9f0 kernel/time/timer.c:2386
> > run_timer_base kernel/time/timer.c:2395 [inline]
> > run_timer_softirq+0xb7/0x170 kernel/time/timer.c:2405
> > handle_softirqs+0x1de/0x6d0 kernel/softirq.c:622
> > __do_softirq kernel/softirq.c:656 [inline]
> > run_ktimerd+0x69/0x100 kernel/softirq.c:1151
> > smpboot_thread_fn+0x541/0xa50 kernel/smpboot.c:160
> > kthread+0x388/0x470 kernel/kthread.c:436
> > ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
> > ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
> > </TASK>
Regards
--
Julian Anastasov <ja@ssi.bg>
^ permalink raw reply
* Re: [PATCH bpf v3 5/5] bpf, sockmap: Adapt for af_unix-specific lock
From: Michal Luczaj @ 2026-04-14 14:19 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: Jiayuan Chen, John Fastabend, Jakub Sitnicki, Eric Dumazet,
Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn, David S. Miller,
Jakub Kicinski, Simon Horman, Yonghong Song, Andrii Nakryiko,
Alexei Starovoitov, Daniel Borkmann, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Shuah Khan, Cong Wang, netdev, bpf, linux-kernel, linux-kselftest
In-Reply-To: <ac2z6IqYyIxSZFPS@devbig1721.ftw5.facebook.com>
On 4/2/26 03:34, Martin KaFai Lau wrote:
> On Wed, 01 Apr 2026 00:43:58 +0200, Michal Luczaj wrote:
>> On 3/31/26 02:20, Martin KaFai Lau wrote:
>>> On 3/30/26 4:03 PM, Michal Luczaj wrote:
>>>> On 3/26/26 07:26, Martin KaFai Lau wrote:
>>>>> On 3/15/26 4:58 PM, Michal Luczaj wrote:
>>>>>>> Besides, from looking at the may_update_sockmap(), I don't know if it is
>>>>>>> even doable (or useful) to bpf_map_update_elem(unix_sk) in
>>>>>>> tc/flow_dissector/xdp. One possible path is the SOCK_FILTER when looking
>>>>>>> at unix_dgram_sendmsg() => sk_filter(). It was not the original use case
>>>>>>> when the bpf_map_update_elem(sockmap) support was added iirc.
>>>>>>
>>>>>> What about a situation when unix_sk is stored in a sockmap, then tc prog
>>>>>> looks it up and invokes bpf_map_update_elem(unix_sk)? I'm not sure it's
>>>>>> useful, but seems doable.
>>>>>
>>>>> [ Sorry for the late reply ]
>>>>>
>>>>> It is a bummer that the bpf_map_update_elem(unix_sk) path is possible
>>>>> from tc :(
>>>>>
>>>>> Then unix_state_lock() in its current form cannot be safely acquired in
>>>>> sock_map_update_elem(). It is currently a spin_lock() instead of
>>>>> spin_lock_bh().
>>>>
>>>> Is there a specific deadlock you have in your mind?
>>>
>>> e.g. unix_stream_connect() is taking unix_state_lock(). Can a tc's
>>> ingress bpf prog call unix_state_lock()?
>>
>> Ah, right, that's the problem, thanks for explaining.
>>
>> But, as I've asked in the parallel thread, do we really need to take the
>> unix_state_lock() in sock_map_update_elem()? Taking it in
>> sock_map_update_elem_sys() fixes the null-ptr-deref and does not lead to a
>> deadlock. Taking unix_state_lock() in sock_map_update_elem() seems
>> unnecessary. Well, at least under the assumption progs can only access
>> unix_sk via the sockmap lookup.
>
> right, sock_map_update_elem_sys() should be safe to take
> unix_state_lock().
>
> If it is fixed by testing unix_peer(), is the TCPF_ESTABLISHED test
> in sock_map_sk_state_allowed() still useful and needed?
I don't think it's necessary. Although removing it may slightly mask the
fact that we're interested in TCP_ESTABLISHED sockets (we watch the sock's
life cycle and invoke sock_map_close() as it transitions to TCP_CLOSE).
Removing this check will also mean listening socks will be rejected not
early in sock_map_sk_state_allowed(), but deeper in
unix_stream_bpf_update_proto() (and with a different error code?).
> Also,
> please explain in detail in the commit message why testing for NULL
> without unix_state_lock() is enough.
OK, will do.
> For example, for the BPF iterator on
> sock_map, my understanding is that unix_release_sock() can still happen
> while the BPF iterator is iterating over a unix_sock. I guess a future
> unix_state_lock() in the iterator's seq_show() should be useful.
That's right. That's also why, I think, Kuniyuki was asking for
"lock_sock() + unix_state_lock() + SOCK_DEAD check" in a parallel thread.
> It will also be useful to mention what was discovered about TC + lookup
> + update_elem(&sock_map, ...) and why it is not safe to take
> unix_state_lock() in that path. Thanks.
The softirq vs. process context? Sure, I'll mention that.
Took a while (sorry), but here's v4:
https://lore.kernel.org/netdev/20260414-unix-proto-update-null-ptr-deref-v4-0-2af6fe97918e@rbox.co/
^ permalink raw reply
* Re: [net,PATCH v3 1/2] net: ks8851: Reinstate disabling of BHs around IRQ handler
From: Marek Vasut @ 2026-04-14 14:20 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: netdev, stable, David S. Miller, Andrew Lunn, Eric Dumazet,
Jakub Kicinski, Nicolai Buchwitz, Paolo Abeni, Ronald Wahl,
Yicong Hui, linux-kernel
In-Reply-To: <20260414125753.Im6GAIHn@linutronix.de>
On 4/14/26 2:57 PM, Sebastian Andrzej Siewior wrote:
> On 2026-04-14 12:32:52 [+0200], Marek Vasut wrote:
>> If CONFIG_PREEMPT_RT=y is set AND the driver executes ks8851_irq() AND
>> KSZ_ISR register bit IRQ_RXI is set AND ks8851_rx_pkts() detects that
>> there are packets in the RX FIFO, then netdev_alloc_skb_ip_align() is
>> called to allocate SKBs. If netdev_alloc_skb_ip_align() is called with
>> BH enabled, local_bh_enable() at the end of netdev_alloc_skb_ip_align()
>> will call __local_bh_enable_ip(), which will call __do_softirq(), which
>> may trigger net_tx_action() softirq, which may ultimately call the xmit
>> callback ks8851_start_xmit_par(). The ks8851_start_xmit_par() will try
>> to lock struct ks8851_net_par .lock spinlock, which is already locked
>> by ks8851_irq() from which ks8851_start_xmit_par() was called. This
>> leads to a deadlock, which is reported by the kernel, including a trace
>> listed below.
>
> #1 [received RX packet and a] TX packet has been sent
> #2 Driver enables TX queue via netif_wake_queue() which schedules TX
> softirq to queue packets for this device.
> #3 After spin_unlock_bh(&ks->statelock) the pending softirqs will be
> processed
> #4 This deadlocks because of recursive locking via ks8851_net::lock in
> ks8851_irq() and ks8851_start_xmit_par().
>
> This is what happens since commit 0913ec336a6c0 ("net: ks8851: Fix
> deadlock with the SPI chip variant"). Before that commit the softirq
> execution will be picked up by netdev_alloc_skb_ip_align() and requires
> PREEMPT_RT and a RX packet in #1 to trigger the deadlock.
Do you want me to add this into the V4 commit message ?
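For readers following along, the recursion described in the quoted text can be modelled in user space as below. This is a sketch under stated assumptions: every fake_* function and variable is hypothetical, and a counter stands in for the real deadlock.

```c
#include <stdbool.h>

static bool dev_lock_held;          /* models the lock held by the IRQ handler */
static bool tx_softirq_pending;     /* models a raised TX softirq */
static int bh_disable_depth;        /* models the BH-disable nesting depth */
static int recursive_lock_attempts; /* would-be deadlocks */

/* Models the xmit callback, which needs the same device lock. */
static void fake_start_xmit(void)
{
	if (dev_lock_held)
		recursive_lock_attempts++;
}

static void fake_local_bh_disable(void) { bh_disable_depth++; }

/* Pending softirqs run when the outermost BH-disabled section ends. */
static void fake_local_bh_enable(void)
{
	if (--bh_disable_depth == 0 && tx_softirq_pending) {
		tx_softirq_pending = false;
		fake_start_xmit();
	}
}

/* Models an allocation helper that briefly toggles BH internally. */
static void fake_alloc_skb(void)
{
	fake_local_bh_disable();
	fake_local_bh_enable();
}

/* Returns how many times xmit tried to take the already-held lock. */
static int fake_irq(bool bh_disabled_around_handler)
{
	bh_disable_depth = 0;
	recursive_lock_attempts = 0;
	if (bh_disabled_around_handler)
		fake_local_bh_disable();
	dev_lock_held = true;
	tx_softirq_pending = true; /* e.g. the TX queue was woken */
	fake_alloc_skb();
	dev_lock_held = false;
	if (bh_disabled_around_handler)
		fake_local_bh_enable(); /* softirq runs here, lock released */
	return recursive_lock_attempts;
}
```

With BH disabled around the handler, the inner enable in the allocation helper no longer reaches depth zero, so the TX softirq is deferred until after the lock is dropped.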
>> Fix the problem by disabling BH around critical sections, including the
>> IRQ handler, thus preventing the net_tx_action() softirq from triggering
>> during these critical sections. The net_tx_action() softirq is triggered
>> at the end of the IRQ handler, once all the other IRQ handler actions have
>> been completed.
>>
>> __schedule from schedule_rtlock+0x1c/0x34
>> schedule_rtlock from rtlock_slowlock_locked+0x548/0x904
>> rtlock_slowlock_locked from rt_spin_lock+0x60/0x9c
>> rt_spin_lock from ks8851_start_xmit_par+0x74/0x1a8
>> ks8851_start_xmit_par from netdev_start_xmit+0x20/0x44
>> netdev_start_xmit from dev_hard_start_xmit+0xd0/0x188
>> dev_hard_start_xmit from sch_direct_xmit+0xb8/0x25c
>> sch_direct_xmit from __qdisc_run+0x1f8/0x4ec
>> __qdisc_run from qdisc_run+0x1c/0x28
>> qdisc_run from net_tx_action+0x1f0/0x268
>> net_tx_action from handle_softirqs+0x1a4/0x270
>> handle_softirqs from __local_bh_enable_ip+0xcc/0xe0
>> __local_bh_enable_ip from __alloc_skb+0xd8/0x128
>> __alloc_skb from __netdev_alloc_skb+0x3c/0x19c
>> __netdev_alloc_skb from ks8851_irq+0x388/0x4d4
>> ks8851_irq from irq_thread_fn+0x24/0x64
>> irq_thread_fn from irq_thread+0x178/0x28c
>> irq_thread from kthread+0x12c/0x138
>> kthread from ret_from_fork+0x14/0x28
>
> The backtrace here and the description is based on an older kernel.
> However
I actually did update the backtrace in V3 with the one from current
next-20260413.
^ permalink raw reply
* Re: [PATCH v3 net] vsock: fix buffer size clamping order
From: Michal Luczaj @ 2026-04-14 14:22 UTC (permalink / raw)
To: Norbert Szetei, Stefano Garzarella
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, virtualization, netdev, linux-kernel
In-Reply-To: <180118C5-8BCF-4A63-A305-4EE53A34AB9C@doyensec.com>
On 4/9/26 18:34, Norbert Szetei wrote:
> In vsock_update_buffer_size(), the buffer size was being clamped to the
> maximum first, and then to the minimum. If a user sets a minimum buffer
> size larger than the maximum, the minimum check overrides the maximum
> check, inverting the constraint.
>
> This breaks the intended socket memory boundaries by allowing the
> vsk->buffer_size to grow beyond the configured vsk->buffer_max_size.
>
> Fix this by checking the minimum first, and then the maximum. This
> ensures the buffer size never exceeds the buffer_max_size.
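The ordering issue in the quoted description can be sketched with two plain helpers. The function names and the example values are made up for this sketch; only the order of the two comparisons mirrors the description.

```c
#include <stdint.h>

/* Old order per the description: clamp to max first, then to min. */
static uint64_t clamp_max_then_min(uint64_t val, uint64_t min, uint64_t max)
{
	if (val > max)
		val = max;
	if (val < min)
		val = min; /* with min > max this overrides the max clamp */
	return val;
}

/* Fixed order per the description: clamp to min first, then to max. */
static uint64_t clamp_min_then_max(uint64_t val, uint64_t min, uint64_t max)
{
	if (val < min)
		val = min;
	if (val > max)
		val = max; /* max has the final say */
	return val;
}
```

When min <= max both orders agree; they differ only for an inverted pair, where the last comparison wins.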
Something may be missing. After adding another setsockopt() to your
reproducer, I still see crashes.
SYSCHK(setsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_BUFFER_MIN_SIZE, &min,
sizeof(min)));
+ SYSCHK(setsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_BUFFER_MAX_SIZE, &min,
+ sizeof(min)));
}
[*] Setting buffer_min_size to 0x400000000.
[socket][0] sending...
refcount_t: saturated; leaking memory.
WARNING: lib/refcount.c:22 at refcount_warn_saturate+0x7d/0xb0, CPU#2:
a.out/1478
...
refcount_t: underflow; use-after-free.
WARNING: lib/refcount.c:28 at refcount_warn_saturate+0x50/0xb0, CPU#12:
kworker/12:0/80
Workqueue: vsock-loopback vsock_loopback_work
...
^ permalink raw reply
* Re: [RFC PATCH v4 00/19] Support socket access-control
From: Mickaël Salaün @ 2026-04-14 14:27 UTC (permalink / raw)
To: Mikhail Ivanov
Cc: gnoack, willemdebruijn.kernel, matthieu, linux-security-module,
netdev, netfilter-devel, yusongping, artem.kuzin,
konstantin.meskhidze
In-Reply-To: <ca9b74f3-ce72-1d7f-c922-be1b276b69a8@huawei-partners.com>
On Mon, Apr 13, 2026 at 08:11:48PM +0300, Mikhail Ivanov wrote:
> On 4/8/2026 1:26 PM, Mickaël Salaün wrote:
> > Hi Mikhail,
>
> Hi!
>
> >
> > On Tue, Nov 18, 2025 at 09:46:20PM +0800, Mikhail Ivanov wrote:
> > > Hello! This is v4 RFC patch dedicated to socket protocols restriction.
> > >
> > > It is based on the landlock's mic-next branch on top of Linux 6.16-rc2
> > > kernel version.
> > >
> > > Objective
> > > =========
> > > Extend Landlock with a mechanism to restrict any set of protocols in
> > > a sandboxed process.
> > >
> > > Closes: https://github.com/landlock-lsm/linux/issues/6
> > >
> > > Motivation
> > > ==========
> > > Landlock implements the `LANDLOCK_RULE_NET_PORT` rule type, which provides
> > > fine-grained control of actions for a specific protocol. Any action or
> > > protocol that is not supported by this rule can not be controlled. As a
> > > result, protocols for which fine-grained control is not supported can be
> > > used in a sandboxed system and lead to vulnerabilities or unexpected
> > > behavior.
> > >
> > > Controlling the protocols used will allow using only those that are
> > > necessary for the system and/or which have fine-grained Landlock control
> > > through other types of rules (e.g. TCP bind/connect control with
> > > `LANDLOCK_RULE_NET_PORT`, UNIX bind control with
> > > `LANDLOCK_RULE_PATH_BENEATH`).
> > >
> > > Consider following examples:
> > > * Server may want to use only TCP sockets for which there is fine-grained
> > > control of bind(2) and connect(2) actions [1].
> > > * System that does not need a network or that may want to disable network
> > > for security reasons (e.g. [2]) can achieve this by restricting the use
> > > of all possible protocols.
> > >
> > > [1] https://lore.kernel.org/all/ZJvy2SViorgc+cZI@google.com/
> > > [2] https://cr.yp.to/unix/disablenetwork.html
> > >
> > > Implementation
> > > ==============
> > > This patchset adds control over the protocols used by implementing a
> > > restriction of socket creation. This is possible thanks to the new type
> > > of rule - `LANDLOCK_RULE_SOCKET`, that allows to restrict actions on
> > > sockets, and a new access right - `LANDLOCK_ACCESS_SOCKET_CREATE`, that
> > > corresponds to user space sockets creation. The key in this rule
> > > corresponds to communication protocol signature from socket(2) syscall.
> >
> > FYI, I sent a new patch series that adds a handled_perm field to
> > rulesets:
> > https://lore.kernel.org/all/20260312100444.2609563-6-mic@digikod.net/
> > See also the rationale:
> > https://lore.kernel.org/all/20260312100444.2609563-12-mic@digikod.net/
> >
> > I think that would work well with the socket creation permission. WDYT?
>
> Agreed. AFAICS restricting the protocols used for communication (e.g. TCP)
> will complement restricting the network namespace to which a sandboxed
> process is pinned by the LANDLOCK_PERM_NAMESPACE_ENTER permission.
I mean that the socket creation restriction should use the same handled_perm
interface, e.g. add a LANDLOCK_PERM_SOCKET_CREATE right with the related
LANDLOCK_RULE_SOCKET rule type.
With the first RFC for handled_perm, the related rules (e.g. struct
landlock_socket_attr) don't have an allowed_access field but an
allowed_perm one instead. The related permission would then be
LANDLOCK_PERM_SOCKET_CREATE. WDYT?
>
> >
> > Do you think you'll be able to continue this work or would you like me
> > or Günther to complete the remaining last bits (while of course keeping
> > you as the main author)?
>
> Sorry for the delay. I will finish and send patch series ASAP.
This new version should then be on top of the Landlock namespace and
capability patchset to reuse the handled_perm interface. I plan to send
a new version by the end of the month, but this should not change the
handled_perm interface.
>
> >
> >
> > >
> > > The right to create a socket is checked in the LSM hook which is called
> > > in the __sock_create method. The following user space operations are
> > > subject to this check: socket(2), socketpair(2), io_uring(7).
> > >
> > > `LANDLOCK_ACCESS_SOCKET_CREATE` does not restrict socket creation
> > > performed by accept(2), because created socket is used for messaging
> > > between already existing endpoints.
> > >
> > > Design discussion
> > > ===================
> > > 1. Should `SCTP_SOCKOPT_PEELOFF` and socketpair(2) be restricted?
> > >
> > > An SCTP socket can be connected to multiple endpoints (one-to-many
> > > relation). Calling setsockopt(2) on such socket with option
> > > `SCTP_SOCKOPT_PEELOFF` detaches one of existing connections to a separate
> > > UDP socket. This detach is currently restrictable.
> > >
> > > The same applies to the socketpair(2) syscall. It was noted that denying
> > > usage of socketpair(2) in a sandboxed environment may not be meaningful [1].
> > >
> > > Currently both operations use general socket interface to create sockets.
> > > Therefore it's not possible to distinguish between socket(2) and those
> > > operations inside security_socket_create LSM hook which is currently
> > > used for protocols restriction. Providing such separation may require
> > > changes in socket layer (eg. in __sock_create) interface which may not be
> > > acceptable.
> > >
> > > [1] https://lore.kernel.org/all/ZurZ7nuRRl0Zf2iM@google.com/
> > >
> > > Code coverage
> > > =============
> > > Code coverage(gcov) report with the launch of all the landlock selftests:
> > > * security/landlock:
> > > lines......: 94.0% (1200 of 1276 lines)
> > > functions..: 95.0% (134 of 141 functions)
> > >
> > > * security/landlock/socket.c:
> > > lines......: 100.0% (56 of 56 lines)
> > > functions..: 100.0% (5 of 5 functions)
> > >
> > > Currently landlock-test-tools fails on mini.kernel_socket test due to lack
> > > of SMC protocol support.
> > >
> > > General changes v3->v4
> > > ======================
> > > * Implementation
> > > * Adds protocol field to landlock_socket_attr.
> > > * Adds protocol masks support via wildcards values in
> > > landlock_socket_attr.
> > > * Changes LSM hook used from socket_post_create to socket_create.
> > > * Changes protocol ranges acceptable by socket rules.
> > > * Adds audit support.
> > > * Changes ABI version to 8.
> > > * Tests
> > > * Adds 5 new tests:
> > > * mini.rule_with_wildcard, protocol_wildcard.access,
> > > mini.ruleset_with_wildcards_overlap:
> > > verify rulesets containing rules with wildcard values.
> > > * tcp_protocol.alias_restriction: verify that Landlock doesn't
> > > perform protocol mappings.
> > > * audit.socket_create: tests audit denial logging.
> > > * Squashes tests corresponding to Landlock rule adding to a single commit.
> > > * Documentation
> > > * Refactors Documentation/userspace-api/landlock.rst.
> > > * Commits
> > > * Rebases on mic-next.
> > > * Refactors commits.
> > >
> > > Previous versions
> > > =================
> > > v3: https://lore.kernel.org/all/20240904104824.1844082-1-ivanov.mikhail1@huawei-partners.com/
> > > v2: https://lore.kernel.org/all/20240524093015.2402952-1-ivanov.mikhail1@huawei-partners.com/
> > > v1: https://lore.kernel.org/all/20240408093927.1759381-1-ivanov.mikhail1@huawei-partners.com/
> > >
> > > Mikhail Ivanov (19):
> > > landlock: Support socket access-control
> > > selftests/landlock: Test creating a ruleset with unknown access
> > > selftests/landlock: Test adding a socket rule
> > > selftests/landlock: Testing adding rule with wildcard value
> > > selftests/landlock: Test acceptable ranges of socket rule key
> > > landlock: Add hook on socket creation
> > > selftests/landlock: Test basic socket restriction
> > > selftests/landlock: Test network stack error code consistency
> > > selftests/landlock: Test overlapped rulesets with rules of protocol
> > > ranges
> > > selftests/landlock: Test that kernel space sockets are not restricted
> > > selftests/landlock: Test protocol mappings
> > > selftests/landlock: Test socketpair(2) restriction
> > > selftests/landlock: Test SCTP peeloff restriction
> > > selftests/landlock: Test that accept(2) is not restricted
> > > lsm: Support logging socket common data
> > > landlock: Log socket creation denials
> > > selftests/landlock: Test socket creation denial log for audit
> > > samples/landlock: Support socket protocol restrictions
> > > landlock: Document socket rule type support
> > >
> > > Documentation/userspace-api/landlock.rst | 48 +-
> > > include/linux/lsm_audit.h | 8 +
> > > include/uapi/linux/landlock.h | 60 +-
> > > samples/landlock/sandboxer.c | 118 +-
> > > security/landlock/Makefile | 2 +-
> > > security/landlock/access.h | 3 +
> > > security/landlock/audit.c | 12 +
> > > security/landlock/audit.h | 1 +
> > > security/landlock/limits.h | 4 +
> > > security/landlock/ruleset.c | 37 +-
> > > security/landlock/ruleset.h | 46 +-
> > > security/landlock/setup.c | 2 +
> > > security/landlock/socket.c | 198 +++
> > > security/landlock/socket.h | 20 +
> > > security/landlock/syscalls.c | 61 +-
> > > security/lsm_audit.c | 4 +
> > > tools/testing/selftests/landlock/base_test.c | 2 +-
> > > tools/testing/selftests/landlock/common.h | 14 +
> > > tools/testing/selftests/landlock/config | 47 +
> > > tools/testing/selftests/landlock/net_test.c | 11 -
> > > .../selftests/landlock/protocols_define.h | 169 +++
> > > .../testing/selftests/landlock/socket_test.c | 1169 +++++++++++++++++
> > > 22 files changed, 1990 insertions(+), 46 deletions(-)
> > > create mode 100644 security/landlock/socket.c
> > > create mode 100644 security/landlock/socket.h
> > > create mode 100644 tools/testing/selftests/landlock/protocols_define.h
> > > create mode 100644 tools/testing/selftests/landlock/socket_test.c
> > >
> > >
> > > base-commit: 6dde339a3df80a57ac3d780d8cfc14d9262e2acd
> > > --
> > > 2.34.1
> > >
> > >
>
^ permalink raw reply