Netdev List


* RE: [Intel-wired-lan] [PATCH] ice: fix ICE_AQ_LINK_SPEED_M for 200G
From: Mekala, SunithaX D @ 2026-04-13 19:55 UTC
  To: Loktionov, Aleksandr, intel-wired-lan@lists.osuosl.org,
	Nguyen, Anthony L, Loktionov, Aleksandr
  Cc: netdev@vger.kernel.org, Greenwalt, Paul
In-Reply-To: <20260320050537.422528-1-aleksandr.loktionov@intel.com>

> -----Original Message-----
> From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On Behalf Of Aleksandr Loktionov
> Sent: Thursday, March 19, 2026 10:06 PM
> To: intel-wired-lan@lists.osuosl.org; Nguyen, Anthony L <anthony.l.nguyen@intel.com>; Loktionov, Aleksandr <aleksandr.loktionov@intel.com>
> Cc: netdev@vger.kernel.org; Greenwalt, Paul <paul.greenwalt@intel.com>
> Subject: [Intel-wired-lan] [PATCH] ice: fix ICE_AQ_LINK_SPEED_M for 200G
>
> From: Paul Greenwalt <paul.greenwalt@intel.com>
>
> When setting PHY configuration during driver initialization, 200G link
> speed is not being advertised even when the PHY is capable. This is
> because the get PHY capabilities link speed response is being masked by
> ICE_AQ_LINK_SPEED_M, which does not include the 200G link speed bit.
>
> ICE_AQ_LINK_SPEED_200GB is defined as BIT(11), but the mask 0x7FF only
> covers bits 0-10. Fix ICE_AQ_LINK_SPEED_M to use GENMASK(11, 0) so
> that it covers all defined link speed bits including 200G.
>
> Fixes: 24407a01e57c ("ice: Add 200G speed/phy type use")
> Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> ---
>  drivers/net/ethernet/intel/ice/ice_adminq_cmd.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com> (A Contingent worker at Intel)


* [syzbot ci] Re: veth: add Byte Queue Limits (BQL) support
From: syzbot ci @ 2026-04-13 19:49 UTC
  To: andrew, ast, bpf, corbet, daniel, davem, edumazet, frederic, hawk,
	horms, j.koeppeler, jhs, jiri, john.fastabend, kernel-team,
	krikku, kuba, kuniyu, linux-doc, linux-kernel, linux-kselftest,
	netdev, pabeni, sdf, shuah, skhan, yajun.deng
  Cc: syzbot, syzkaller-bugs
In-Reply-To: <20260413094442.1376022-1-hawk@kernel.org>

syzbot ci has tested the following series

[v2] veth: add Byte Queue Limits (BQL) support
https://lore.kernel.org/all/20260413094442.1376022-1-hawk@kernel.org
* [PATCH net-next v2 1/5] net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices
* [PATCH net-next v2 2/5] veth: implement Byte Queue Limits (BQL) for latency reduction
* [PATCH net-next v2 3/5] veth: add tx_timeout watchdog as BQL safety net
* [PATCH net-next v2 4/5] net: sched: add timeout count to NETDEV WATCHDOG message
* [PATCH net-next v2 5/5] selftests: net: add veth BQL stress test

and found the following issue:
WARNING in veth_napi_del_range

Full report is available here:
https://ci.syzbot.org/series/ee732006-8545-4abd-a105-b4b1592a7baf

***

WARNING in veth_napi_del_range

tree:      net-next
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/netdev/net-next.git
base:      8806d502e0a7e7d895b74afbd24e8550a65a2b17
arch:      amd64
compiler:  Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config:    https://ci.syzbot.org/builds/90743a26-f003-44cf-abcc-5991c47588b2/config
syz repro: https://ci.syzbot.org/findings/d068bfb2-9f8b-466a-95b4-cd7e7b00006c/syz_repro

------------[ cut here ]------------
index >= dev->num_tx_queues
WARNING: ./include/linux/netdevice.h:2672 at netdev_get_tx_queue include/linux/netdevice.h:2672 [inline], CPU#0: syz.1.27/6002
WARNING: ./include/linux/netdevice.h:2672 at veth_napi_del_range+0x3b7/0x4e0 drivers/net/veth.c:1142, CPU#0: syz.1.27/6002
Modules linked in:
CPU: 0 UID: 0 PID: 6002 Comm: syz.1.27 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:netdev_get_tx_queue include/linux/netdevice.h:2672 [inline]
RIP: 0010:veth_napi_del_range+0x3b7/0x4e0 drivers/net/veth.c:1142
Code: 00 e8 ad 96 69 fe 44 39 6c 24 10 74 5e e8 41 61 44 fb 41 ff c5 49 bc 00 00 00 00 00 fc ff df e9 6d ff ff ff e8 2a 61 44 fb 90 <0f> 0b 90 42 80 3c 23 00 75 8e eb 94 48 8b 0c 24 80 e1 07 80 c1 03
RSP: 0018:ffffc90003adf918 EFLAGS: 00010293
RAX: ffffffff86814ec6 RBX: 1ffff110227a6c03 RCX: ffff888103a857c0
RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000002
RBP: 1ffff110227a6c9a R08: ffff888113f01ab7 R09: 0000000000000000
R10: ffff888113f01a98 R11: ffffed10227e0357 R12: dffffc0000000000
R13: 0000000000000002 R14: 0000000000000002 R15: ffff888113d36018
FS:  000055555ea16500(0000) GS:ffff88818de4a000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007efc287456b8 CR3: 000000010cdd0000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 veth_napi_del drivers/net/veth.c:1153 [inline]
 veth_disable_xdp+0x1b0/0x310 drivers/net/veth.c:1255
 veth_xdp_set drivers/net/veth.c:1693 [inline]
 veth_xdp+0x48e/0x730 drivers/net/veth.c:1717
 dev_xdp_propagate+0x125/0x260 net/core/dev_api.c:348
 bond_xdp_set drivers/net/bonding/bond_main.c:5715 [inline]
 bond_xdp+0x3ca/0x830 drivers/net/bonding/bond_main.c:5761
 dev_xdp_install+0x42c/0x600 net/core/dev.c:10387
 dev_xdp_detach_link net/core/dev.c:10579 [inline]
 bpf_xdp_link_release+0x362/0x540 net/core/dev.c:10595
 bpf_link_free+0x103/0x480 kernel/bpf/syscall.c:3292
 bpf_link_put_direct kernel/bpf/syscall.c:3344 [inline]
 bpf_link_release+0x6b/0x80 kernel/bpf/syscall.c:3351
 __fput+0x44f/0xa70 fs/file_table.c:469
 task_work_run+0x1d9/0x270 kernel/task_work.c:233
 resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
 __exit_to_user_mode_loop kernel/entry/common.c:67 [inline]
 exit_to_user_mode_loop+0xed/0x480 kernel/entry/common.c:98
 __exit_to_user_mode_prepare include/linux/irq-entry-common.h:226 [inline]
 syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:256 [inline]
 syscall_exit_to_user_mode include/linux/entry-common.h:325 [inline]
 do_syscall_64+0x32d/0xf80 arch/x86/entry/syscall_64.c:100
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f5bda39c819
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffdca2969e8 EFLAGS: 00000246 ORIG_RAX: 00000000000001b4
RAX: 0000000000000000 RBX: 00007f5bda617da0 RCX: 00007f5bda39c819
RDX: 0000000000000000 RSI: 000000000000001e RDI: 0000000000000003
RBP: 00007f5bda617da0 R08: 00007f5bda616128 R09: 0000000000000000
R10: 000000000003fd78 R11: 0000000000000246 R12: 0000000000010fb8
R13: 00007f5bda61609c R14: 0000000000010cdd R15: 00007ffdca296af0
 </TASK>


***

If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
  Tested-by: syzbot@syzkaller.appspotmail.com

---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.

To test a patch for this bug, please reply with `#syz test`
(should be on a separate line).

The patch should be attached to the email.
Note: arguments like custom git repos and branches are not supported.


* [ANNOUNCE] iproute2 7.0 release
From: Stephen Hemminger @ 2026-04-13 19:44 UTC
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 1300 bytes --]

This is the regular release of iproute2 corresponding to the 7.0 kernel.

The main addition is `cake_mq` support for tc-cake, enabling CAKE on
multiqueue devices. The `dpll` command gained mode setting and
fractional frequency offset display in parts-per-trillion.

Devlink now supports displaying and resetting parameters to defaults.

The `ss` tool saw several cleanups: trailing whitespace is removed from
non-TTY output, netlink errors are suppressed for unsupported protocols,
and command names are now properly escaped.

The JSON writer gained control character escaping.

Eric Biggers replaced the AF_ALG SHA-1 in legacy BPF with a
userspace implementation.

Chen Linxuan eliminated redundant mounts in
`ip netns`, and leftover `/usr/lib/route2` references were removed.

Matthieu Baerts added interface name display and colored output to
MPTCP monitor.

A large batch of man page fixes addressed grammar and
style across `dcb`, `devlink`, `netshaper`, `tipc`, `vdpa`, `rdma`,
and `ss`.

Download:
    https://www.kernel.org/pub/linux/utils/net/iproute2/iproute2-7.0.0.tar.gz

Repository for current release:
    https://github.com/shemminger/iproute2.git
    git://git.kernel.org/pub/scm/network/iproute2/iproute2.git

And future release (net-next):
    git://git.kernel.org/pub/scm/network/iproute2/iproute2-next.git



[-- Attachment #2: changes-iproute2-7.0.0.txt --]
[-- Type: text/plain, Size: 2511 bytes --]

Andrea Claudi (1):
      dpll: Fix missing notifications in monitor mode

Chen Linxuan (1):
      ip/netns: avoid redundant mounts

Daniel Schulte (1):
      ss: Remove trailing whitespace when output is not a TTY

Daniel Zahka (2):
      devlink: Pull the value printing logic out of pr_out_param_value()
      devlink: support displaying and resetting to default params

David Ahern (2):
      Update kernel headers
      Update kernel headers

Eric Biggers (1):
      lib/bpf_legacy: Use userspace SHA-1 code instead of AF_ALG

Ivan Vecera (1):
      dpll: add support for fractional frequency offset in ppt

Jonas Köppeler (1):
      tc: cake: add cake_mq support

Matthieu Baerts (NGI0) (2):
      mptcp: monitor: also show iface name
      mptcp: display addr & ifname in color

Petr Oros (2):
      dpll: add mode setting support
      dpll: fix pin id-get type filter parsing

Sergei Trofimovich (1):
      include/json_print.h: add includes for `__u32` and `timeval` declarations

Stephen Hemminger (17):
      uapi: update mptcp and rdma headers
      utils: do not be restrictive about alternate network device names
      dcb: fix grammar and style issues in man pages
      devlink: fix grammar and style issues in man pages
      fix grammar and style issues in man pages for stat related pages
      netshaper: fix grammar and style issues in man page
      tipc: fix grammar and style issues in man pages
      vdpa: fix grammar, titles, and formatting in man pages
      rdma: fix grammar, formatting, and style in man pages
      ss: fix grammar, articles, and phrasing in man page
      uapi: headers update from 7.0-rc0
      ss: suppress netlink errors for unsupported protocols
      remove leftover references to /usr/lib/route2
      ss: escape characters in command name
      json_writer: support control character escaping
      json_writer: fix builtin test code
      vv7.0.0

Toke Høiland-Jørgensen (1):
      man: Add cake_mq documentation to the tc-cake man page

Vincent Mailhol (7):
      iplink_can: print_usage: fix the text indentation
      iplink_can: print_usage: change unit for minimum time quanta to mtq
      iplink_can: print_usage: describe the CAN bittiming units
      iplink_can: add RESTRICTED operation mode support
      iplink_can: add initial CAN XL support
      iplink_can: add CAN XL transceiver mode setting (TMS) support
      iplink_can: add CAN XL TMS PWM configuration support



* Re: [patch 15/38] ptp: ptp_vmclock: Replace get_cycles() usage
From: Arnd Bergmann @ 2026-04-13 19:30 UTC
  To: David Woodhouse, Thomas Gleixner, LKML
  Cc: x86, Baolu Lu, iommu, Michael Grzeschik, Netdev, linux-wireless,
	Herbert Xu, linux-crypto, Vlastimil Babka (SUSE), linux-mm,
	Bernie Thompson, linux-fbdev, Theodore Ts'o, linux-ext4,
	Andrew Morton, Uladzislau Rezki (Sony), Marco Elver,
	Dmitry Vyukov, kasan-dev, Andrey Ryabinin, Thomas Sailer,
	linux-hams, Jason A . Donenfeld, Richard Henderson, linux-alpha,
	Russell King, linux-arm-kernel, Catalin Marinas, Huacai Chen,
	loongarch, Geert Uytterhoeven, linux-m68k, Dinh Nguyen,
	Jonas Bonn, linux-openrisc@vger.kernel.org, Helge Deller,
	linux-parisc, Michael Ellerman, linuxppc-dev, Paul Walmsley,
	linux-riscv, Heiko Carstens, linux-s390, David S . Miller,
	sparclinux
In-Reply-To: <7a48b636cb3146f4f7134c6d4fe42070ac2edb43.camel@infradead.org>

On Mon, Apr 13, 2026, at 17:33, David Woodhouse wrote:
> On Fri, 2026-04-10 at 14:19 +0200, Thomas Gleixner wrote:
>
> ... depend on TSC_RELIABLE¹, since if the guest doesn't believe that it
> is, then the guest shouldn't be trying to use it as the basis for
> precise timing.
>
> ¹ (Or... one of the other zoo of TSC flags for the gradually reducing
> brokenness over the years...)

It looks like this is sufficiently handled in the caller:

static int vmclock_get_crosststamp(struct vmclock_state *st,
                                   struct ptp_system_timestamp *sts,
                                   struct system_counterval_t *system_counter,
                                   struct timespec64 *tspec)
{
....
#ifdef CONFIG_X86
        /*
         * We'd expect the hypervisor to know this and to report the clock
         * status as VMCLOCK_STATUS_UNRELIABLE. But be paranoid.
         */
        if (check_tsc_unstable())
                return -EINVAL;
#endif

With 486 and ELAN out of the way, Winchip6 seems to be the only
one without X86_FEATURE_TSC, so I think the next logical step would
be to turn off Winchip6 as well and remove all X86_FEATURE_TSC
and CONFIG_X86_TSC checks.

      Arnd


* Re: [RFC v2 1/2] vfio: add callback to get tph info for dmabuf
From: Leon Romanovsky @ 2026-04-13 19:23 UTC
  To: Zhiping Zhang
  Cc: Keith Busch, Jason Gunthorpe, Bjorn Helgaas, linux-rdma,
	linux-pci, netdev, dri-devel, Yochai Cohen, Yishai Hadas,
	Bjorn Helgaas
In-Reply-To: <CAH3zFs0hx_-3LetSUaPRMg=0jaL=GD7Mop3pEUhJ3O3qkaJrQg@mail.gmail.com>

On Mon, Apr 13, 2026 at 11:32:48AM -0700, Zhiping Zhang wrote:
> On Thu, Apr 9, 2026 at 5:04 AM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > >
> > On Tue, Mar 31, 2026 at 01:44:02PM -0600, Keith Busch wrote:
> > > On Tue, Mar 31, 2026 at 10:02:20PM +0300, Leon Romanovsky wrote:
> > > >
> > > > Right, what about adding TPH fields to struct vfio_region_dma_range
> > > > instead of struct vfio_device_feature_dma_buf?
> > >
> > > You might have to show me with code what you're talking about because I
> > > can't see any way we can add fields to any struct here without breaking
> > > backward compatibility.
> > >
> > > If we can't claim bits out of the unused "flags" field for this feature,
> > > then my initial reply is the only sane approach: we can introduce a new
> > > feature and struct for it that closely mirrors the existing one, but
> > > with the extra hint fields.
> >
> > Something like that, on top of this proposal:
> >
> > diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
> > index 3961afa640391..70d5ee1e3ef7b 100644
> > --- a/drivers/vfio/pci/vfio_pci_dmabuf.c
> > +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
> > @@ -241,9 +241,7 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
> >                 return -EFAULT;
> >
> >         if (!get_dma_buf.nr_ranges ||
> > -           (get_dma_buf.flags & ~(VFIO_DMABUF_FL_TPH |
> > -                                  VFIO_DMABUF_TPH_PH_MASK |
> > -                                  VFIO_DMABUF_TPH_ST_MASK)))
> > +           (get_dma_buf.flags & ~VFIO_DMABUF_FLAG_TPH))
> >                 return -EINVAL;
> >
> >         /*
> > @@ -300,13 +298,10 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
> >                 ret = PTR_ERR(priv->dmabuf);
> >                 goto err_dev_put;
> >         }
> > -       if (get_dma_buf.flags & VFIO_DMABUF_FL_TPH) {
> > -               priv->steering_tag = (get_dma_buf.flags &
> > -                                     VFIO_DMABUF_TPH_ST_MASK) >>
> > -                                    VFIO_DMABUF_TPH_ST_SHIFT;
> > -               priv->ph = (get_dma_buf.flags &
> > -                           VFIO_DMABUF_TPH_PH_MASK) >>
> > -                          VFIO_DMABUF_TPH_PH_SHIFT;
> > +       if (get_dma_buf.flags & VFIO_DMABUF_FLAG_TPH) {
> > +               priv->steering_tag =
> > +                       dma_ranges[get_dma_buf.nr_ranges + 1].tph.tag;
> > +               priv->ph = dma_ranges[get_dma_buf.nr_ranges + 1].tph.ph;
> >         }
> >         /* dma_buf_put() now frees priv */
> >         INIT_LIST_HEAD(&priv->dmabufs_elm);
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index e2a8962641d2c..a8b8d8b1a3278 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -1497,20 +1497,30 @@ struct vfio_device_feature_bus_master {
> >   */
> >  #define VFIO_DEVICE_FEATURE_DMA_BUF 11
> >
> > +struct vfio_region_dma_tph {
> > +       u16 tag;
> > +       u8 ph;
> > +};
> > +
> >  struct vfio_region_dma_range {
> > -       __u64 offset;
> > -       __u64 length;
> > +       union {
> > +               __u64 offset;
> > +               struct vfio_region_dma_tph tph;
> > +       };
> > +       union {
> > +               __u64 length;
> > +               __u64 reserved;
> > +       };
> > +};
> > +
> > +enum {
> > +       VFIO_DMABUF_FLAG_TPH = 1 << 0,
> >  };
> >
> >  struct vfio_device_feature_dma_buf {
> >         __u32   region_index;
> >         __u32   open_flags;
> >         __u32   flags;
> > -#define VFIO_DMABUF_FL_TPH             (1U << 0) /* TPH info is present */
> > -#define VFIO_DMABUF_TPH_PH_SHIFT       1         /* bits 1-2: PH (2-bit) */
> > -#define VFIO_DMABUF_TPH_PH_MASK        0x6U
> > -#define VFIO_DMABUF_TPH_ST_SHIFT       16        /* bits 16-31: steering tag */
> > -#define VFIO_DMABUF_TPH_ST_MASK                0xffff0000U
> >         __u32   nr_ranges;
> >         struct vfio_region_dma_range dma_ranges[] __counted_by(nr_ranges);
> >  };
> 
> Sounds good, thanks! We will follow up and move this RFC to a formal patch.

Great. Also, please rename "struct vfio_region_dma_range dma_ranges" to
something that makes it clear this is a storage object, not something
limited to a DMA range.

Thanks

> 
> Zhiping
> 


* Re: [RFC] Proposal: Add sysfs interface for PCIe TPH Steering Tag retrieval and configuration
From: Leon Romanovsky @ 2026-04-13 19:19 UTC
  To: fengchengwen
  Cc: Jason Gunthorpe, Bjorn Helgaas, linux-rdma, linux-pci, netdev,
	dri-devel, Keith Busch, Yochai Cohen, Yishai Hadas, Zhiping Zhang
In-Reply-To: <c3a6c6ca-3b71-476c-947a-5f2393d046bd@huawei.com>

On Mon, Apr 13, 2026 at 08:04:10PM +0800, fengchengwen wrote:
> On 4/13/2026 6:01 PM, Leon Romanovsky wrote:
> > On Fri, Apr 10, 2026 at 10:30:52PM +0800, fengchengwen wrote:
> >> Hi all,
> >>
> >> I'm writing to propose adding a sysfs interface to expose and configure
> >> the PCIe TPH Steering Tag for PCIe devices, which is retrieved inside the
> >> kernel.
> >>
> >>
> >> Background: The TPH Steering Tag is tightly coupled with both a PCIe
> >> device (identified by its BDF) and a CPU core. It can only be obtained in
> >> kernel mode. To allow user-space applications to fetch and set this value
> >> securely and conveniently, we need a standard kernel-to-user interface.
> >>
> >>
> >> Proposed Solution: Add several sysfs attributes under each PCIe device's
> >> sysfs directory:
> >> 1. /sys/bus/pci/devices/<BDF>/tph_mode to query the TPH mode (interrupt or
> >>    device specific)
> >> 2. /sys/bus/pci/devices/<BDF>/tph_enable to control the TPH feature
> >> 3. /sys/bus/pci/devices/<BDF>/tph_st to support both read and write
> >>    operations, e.g.:
> >>    Read operation:
> >>      echo "cpu=3" > /sys/bus/pci/devices/0000:01:00.0/tph_st
> >>      cat /sys/bus/pci/devices/0000:01:00.0/tph_st
> >>    Write operation:
> >>      echo "index=10 st=123" > /sys/bus/pci/devices/0000:01:00.0/tph_st
> >>
> >>
> >> The design strictly follows PCI subsystem sysfs standards and has the
> >> following key properties:
> >>
> >> 1. Dynamic Visibility: The sysfs attributes will only be present for PCIe
> >>    devices that support TPH Steering Tag. Devices without TPH capability
> >>    will not show these nodes, avoiding unnecessary user confusion.
> >>
> >> 2. Permission Control: The attributes will use 0600 file permissions,
> >>    ensuring only privileged root users can read or write them, which
> >>    satisfies security requirements for hardware configuration interfaces.
> >>
> >> 3. Standard Implementation Location: The interface will be implemented in
> >>    drivers/pci/pci-sysfs.c, the canonical location for all PCI device
> >>    sysfs attributes, ensuring consistency and maintainability within the
> >>    PCI subsystem.
> >>
> >>
> >> Why sysfs instead of alternatives like VFIO-PCI ioctl:
> >>
> >> - Universality: sysfs does not require binding the device to a special
> >>   driver such as vfio-pci. It is available to any privileged user-space
> >>   component, including system utilities, daemons, and monitoring tools.
> >>
> >> - Simplicity: Both user-space usage (cat/echo) and kernel implementation
> >>   are straightforward, reducing code complexity and long-term maintenance
> >>   cost.
> >>
> >> - Design Alignment: TPH Steering Tag is a generic PCIe device feature, not
> >>   specific to user-space drivers like DPDK or VFIO. Exposing it via sysfs
> >>   matches the kernel's standard pattern for hardware capabilities.
> >>
> >>
> >> I look forward to your comments about this design before submitting the
> >> final patch.
> > 
> > You need to explain more clearly why this write functionality is useful
> > and necessary outside the VFIO/RDMA context:
> > https://lore.kernel.org/all/20260324234615.3731237-1-zhipingz@meta.com/
> > 
> > AFAIK, for non-VFIO TPH callers, kernel has enough knowledge to set
> > right ST values.
> > 
> > There are several comments regarding the implementation, but those can wait
> > until the rationale behind the proposal is fully clarified.
> 
> Thanks for your review and comments.
> 
> Let me clarify the rationale behind this user-space sysfs interface:
> 
> 1. VFIO is just one of the user-space device access frameworks.
>    There are many other in-kernel frameworks that expose devices
>    to user space, such as UIO, UACCE, etc., which may also require
>    TPH Steering Tag support.
> 
> 2. The kernel can automatically program Steering Tags only when
>    the device provides a standard ST table in MSI-X or config space.
>    However, many devices implement vendor-specific or platform-specific
>    Steering Tag programming methods that cannot be fully handled
>    by the generic kernel code.
> 
> 3. For such devices, user-space applications or framework drivers
>    need to retrieve and configure TPH Steering Tags directly.
>    A unified sysfs interface allows all user-space frameworks
>    (not just VFIO) to use a common, standard way to manage
>    TPH Steering Tags, rather than implementing duplicated logic
>    in each subsystem.
> 
> This interface provides a uniform method for any user-space
> device access solution to work with TPH, which is why I believe
> it is useful and necessary beyond the VFIO/RDMA case.

I understand the rationale for providing a read interface, for example for
debugging, but I do not see any justification for a write interface.

TPH is defined by the PCI specification. If a device intends to support it,
then it should conform to the specification.

Thanks


> 
> Thanks
> 
> > 
> > Thanks
> > 
> >>
> >> Best regards,
> >> Chengwen Feng
> >>
> 
> 


* [PATCH iwl-net] ice: fix infinite recursion in ice_cfg_tx_topo via ice_init_dev_hw
From: Petr Oros @ 2026-04-13 19:14 UTC
  To: netdev
  Cc: Petr Oros, Tony Nguyen, Przemek Kitszel, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Aleksandr Loktionov, Nikolay Aleksandrov, Daniel Zahka,
	Paul Greenwalt, Dave Ertman, Michal Swiatkowski, jacob.e.keller,
	intel-wired-lan, linux-kernel

On certain E810 configurations where firmware supports Tx scheduler
topology switching (tx_sched_topo_comp_mode_en), ice_cfg_tx_topo()
may need to apply a new 5-layer or 9-layer topology from the DDP
package. If the AQ command to set the topology fails (e.g. due to
invalid DDP data or firmware limitations), the global configuration
lock must still be cleared via a CORER reset.

Commit 86aae43f21cf ("ice: don't leave device non-functional if Tx
scheduler config fails") correctly fixed this by refactoring
ice_cfg_tx_topo() to always trigger CORER after acquiring the global
lock and re-initialize hardware via ice_init_hw() afterwards.

However, commit 8a37f9e2ff40 ("ice: move ice_deinit_dev() to the end
of deinit paths") later moved ice_init_dev_hw() into ice_init_hw(),
breaking the reinit path introduced by 86aae43f21cf. This creates an
infinite recursive call chain:

  ice_init_hw()
    ice_init_dev_hw()
      ice_cfg_tx_topo()         # topology change needed
        ice_deinit_hw()
        ice_init_hw()           # reinit after CORER
          ice_init_dev_hw()     # recurse
            ice_cfg_tx_topo()
              ...               # stack overflow

Fix by moving ice_init_dev_hw() back out of ice_init_hw() and calling
it explicitly from ice_probe() and ice_devlink_reinit_up(). The third
caller, ice_cfg_tx_topo(), intentionally does not need ice_init_dev_hw()
during its reinit; it only needs the core HW reinitialization. This
breaks the recursion cleanly without adding flags or guards.

The deinit ordering changes from commit 8a37f9e2ff40 ("ice: move
ice_deinit_dev() to the end of deinit paths"), which fixed slow rmmod,
are preserved; only the init-side placement of ice_init_dev_hw() is
reverted.

Fixes: 8a37f9e2ff40 ("ice: move ice_deinit_dev() to the end of deinit paths")
Signed-off-by: Petr Oros <poros@redhat.com>
---
 drivers/net/ethernet/intel/ice/devlink/devlink.c | 2 ++
 drivers/net/ethernet/intel/ice/ice_common.c      | 2 --
 drivers/net/ethernet/intel/ice/ice_main.c        | 2 ++
 3 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/devlink/devlink.c b/drivers/net/ethernet/intel/ice/devlink/devlink.c
index 6144cee8034d77..641d6e289d5ce6 100644
--- a/drivers/net/ethernet/intel/ice/devlink/devlink.c
+++ b/drivers/net/ethernet/intel/ice/devlink/devlink.c
@@ -1245,6 +1245,8 @@ static int ice_devlink_reinit_up(struct ice_pf *pf)
 		return err;
 	}

+	ice_init_dev_hw(pf);
+
 	/* load MSI-X values */
 	ice_set_min_max_msix(pf);

diff --git a/drivers/net/ethernet/intel/ice/ice_common.c b/drivers/net/ethernet/intel/ice/ice_common.c
index ce11fea122d03e..b617a6bff89134 100644
--- a/drivers/net/ethernet/intel/ice/ice_common.c
+++ b/drivers/net/ethernet/intel/ice/ice_common.c
@@ -1126,8 +1126,6 @@ int ice_init_hw(struct ice_hw *hw)
 	if (status)
 		goto err_unroll_fltr_mgmt_struct;

-	ice_init_dev_hw(hw->back);
-
 	mutex_init(&hw->tnl_lock);
 	ice_init_chk_recipe_reuse_support(hw);

diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index e2a5534819d194..a27be29f9bbbfc 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -5314,6 +5314,8 @@ ice_probe(struct pci_dev *pdev, const struct pci_device_id __always_unused *ent)
 		return err;
 	}

+	ice_init_dev_hw(pf);
+
 	adapter = ice_adapter_get(pdev);
 	if (IS_ERR(adapter)) {
 		err = PTR_ERR(adapter);
-- 
2.52.0


* Re: [PATCH iproute 1/2] json_writer: support control character escaping
From: patchwork-bot+netdevbpf @ 2026-04-13 19:10 UTC
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20260410224745.93416-1-stephen@networkplumber.org>

Hello:

This series was applied to iproute2/iproute2.git (main)
by Stephen Hemminger <stephen@networkplumber.org>:

On Fri, 10 Apr 2026 15:47:44 -0700 you wrote:
> Iproute2 never handled control characters in strings correctly.
> There are some cases where the string is under user control,
> such as paths in the ss command. Make iproute2 JSON output conform
> to RFC 8259.
> 
> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> 
> [...]

Here is the summary with links:
  - [iproute,1/2] json_writer: support control character escaping
    https://git.kernel.org/pub/scm/network/iproute2/iproute2.git/commit/?id=9e99260759b7
  - [iproute,2/2] json_writer: fix builtin test code
    https://git.kernel.org/pub/scm/network/iproute2/iproute2.git/commit/?id=fe811be6564c

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* Re: [PATCH net-next] net: stmmac: enable RPS and RBU interrupts
From: Russell King (Oracle) @ 2026-04-13 18:49 UTC
  To: Jakub Kicinski
  Cc: Andrew Lunn, Alexandre Torgue, Andrew Lunn, David S. Miller,
	Eric Dumazet, linux-arm-kernel, linux-stm32, netdev, Paolo Abeni,
	Sam Edwards
In-Reply-To: <20260413110222.49fc3759@kernel.org>

On Mon, Apr 13, 2026 at 11:02:22AM -0700, Jakub Kicinski wrote:
> On Fri, 10 Apr 2026 14:07:51 +0100 Russell King (Oracle) wrote:
> > Since we are seeing receive buffer exhaustion on several platforms,
> > let's enable the interrupts so the statistics we publish via ethtool -S
> > actually work to aid diagnosis. I've been in two minds about whether
> > to send this patch, but given the problems with stmmac at the moment,
> > I think it should be merged.
> 
> Sorry for an under-researched response, but wasn't there a person trying
> to fix the OOM starvation issue? Who was supposed to add a timer?
> Is your problem also OOM related or do you suspect something else?

It is not OOM related. I have this patch applied:

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 131ea887bedc..614d0e10e3e6 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -5095,14 +5095,18 @@ static inline void stmmac_rx_refill(struct stmmac_priv *priv, u32 queue)

 		if (!buf->page) {
 			buf->page = page_pool_alloc_pages(rx_q->page_pool, gfp);
-			if (!buf->page)
+			if (!buf->page) {
+				netdev_err(priv->dev, "q%u: no buffer 1\n", queue);
 				break;
+			}
 		}

 		if (priv->sph_active && !buf->sec_page) {
 			buf->sec_page = page_pool_alloc_pages(rx_q->page_pool, gfp);
-			if (!buf->sec_page)
+			if (!buf->sec_page) {
+				netdev_err(priv->dev, "q%u: no buffer 2\n", queue);
 				break;
+			}

 			buf->sec_addr = page_pool_get_dma_addr(buf->sec_page);
 		}

and it is silent, so we are not suffering starvation of buffers.

However, the hardware hangs during iperf3, and because it triggers the
MAC to stream PAUSE frames, and my network uses Netgear GS108 and GS116
unmanaged switches that always use flow-control between them (there's no
way not to) it takes down the entire network - as we've discussed
before. So, this problem is pretty fatal to the *entire* network.

With this patch, the existing statistical counters for this condition
are incremented, and thus users can use ethtool -S to see what happened
and report whether they are seeing the same issue.

Without this patch applied, there are no diagnostics from stmmac that
report what the state is. ethtool -d doesn't list the appropriate
registers (as I suspect part of the problem is the number of queues
is somewhat dynamic - userspace can change that configuration through
ethtool).

Thus, one has to resort to using devmem2 to find out what's happened.
That's not user friendly.

For me, devmem2 shows:

Channel 0 status register:
Value at address 0x02491160: 0x00000484
bit 10: ETI early transmit interrupt - set
bit 9 : RWT receive watchdog - clear
bit 8 : RPS receive process stopped - clear
bit 7 : RBU receive buffer unavailable - set
bit 6 : RI  receive interrupt - clear
bit 2 : TBU transmit buffer unavailable - set
bit 1 : TPS transmit process stopped - clear
bit 0 : TI  transmit interrupt - clear

Debug status register:
Value at address 0x0249100c: 0x00006300
TPS[3:0] = 6 = Suspended, Tx descriptor unavailable or Tx buffer
		underflow
RPS[3:0] = 3 = Running, waiting for Rx packet

MTL Queue 0 debug register:
Value at address 0x02490d38: 0x002e0020
PRXQ[13:0] = 0x2e = 46 packets in receive queue
RXQSTS[1:0] = 2 = Rx queue fill-level above flow-control activate
		threshold
RRCSTS[1:0] = 0 = Rx Queue Read Controller State = Idle
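
For anyone wanting to repeat this decode on their own devmem2 output, here
is a minimal sketch in plain userspace C. The channel status bit numbers
are the ones listed above. The field offsets within the two debug registers
(TPS at bits 15:12, RPS at 11:8, PRXQ at 29:16, RXQSTS at 5:4, RRCSTS at
2:1) are assumptions: they are not stated in the dump, but they are
consistent with the raw values and decoded fields quoted above. The helper
names are made up for illustration and are not stmmac symbols.

```c
#include <stdint.h>
#include <stdio.h>

/* Extract `width` bits (width must be < 32) starting at bit `shift`. */
static uint32_t reg_field(uint32_t reg, unsigned int shift, unsigned int width)
{
	return (reg >> shift) & ((1u << width) - 1u);
}

/* Channel status bits, numbered as in the dump above. */
#define CH_STATUS_ETI	(1u << 10)
#define CH_STATUS_RBU	(1u << 7)
#define CH_STATUS_TBU	(1u << 2)

/* Assumed field offsets: not given in the dump, chosen to match it. */
#define DBG_TPS_SHIFT		12
#define DBG_RPS_SHIFT		8
#define MTL_PRXQ_SHIFT		16
#define MTL_RXQSTS_SHIFT	4
#define MTL_RRCSTS_SHIFT	1

/* Print the fields of interest from the three raw register values. */
static void decode_dump(uint32_t ch_status, uint32_t dbg_status,
			uint32_t mtl_rxq_dbg)
{
	printf("ETI=%d RBU=%d TBU=%d\n",
	       !!(ch_status & CH_STATUS_ETI),
	       !!(ch_status & CH_STATUS_RBU),
	       !!(ch_status & CH_STATUS_TBU));
	printf("TPS=%u RPS=%u\n",
	       (unsigned int)reg_field(dbg_status, DBG_TPS_SHIFT, 4),
	       (unsigned int)reg_field(dbg_status, DBG_RPS_SHIFT, 4));
	printf("PRXQ=%u RXQSTS=%u RRCSTS=%u\n",
	       (unsigned int)reg_field(mtl_rxq_dbg, MTL_PRXQ_SHIFT, 14),
	       (unsigned int)reg_field(mtl_rxq_dbg, MTL_RXQSTS_SHIFT, 2),
	       (unsigned int)reg_field(mtl_rxq_dbg, MTL_RRCSTS_SHIFT, 2));
}
```

Calling decode_dump(0x00000484, 0x00006300, 0x002e0020) from a small
main() reproduces the fields listed above from the raw values.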

> Firing interrupts when Rx fill ring runs dry (which IIUC this patch
> does?) is not a good idea.

Well, I'm thinking that at least on some platforms, such as the Jetson
Xavier NX, unless a different solution can be found, we need the RBU
interrupt to fire off a reset of the stmmac IP when this happens to
reduce the PAUSE frame flood on the network.

If we can't do that, then I think stmmac on these platforms needs to be
marked with CONFIG_BROKEN because right now there doesn't seem to be any
other viable solution.

My intention with this patch is merely to start collecting the already
existing statistics so other users can start seeing whether they are
hitting the same or similar problem. If we're not prepared to do that,
then we should delete the useless statistics from ethtool -S, but I
suspect they're now part of the UAPI, even though without this patch
they will remain steadfastly stuck at zero.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply related

* Re: [PATCH net-next v7 00/15] net: sleepable ndo_set_rx_mode
From: Jakub Kicinski @ 2026-04-13 18:45 UTC (permalink / raw)
  To: netdev; +Cc: Stanislav Fomichev, davem, edumazet, pabeni
In-Reply-To: <20260413171131.550126-1-sdf@fomichev.me>

On Mon, 13 Apr 2026 10:11:16 -0700 Stanislav Fomichev wrote:
> This series adds a new ndo_set_rx_mode_async callback that enables
> drivers to handle address list updates in a sleepable context. The
> current ndo_set_rx_mode is called under the netif_addr_lock spinlock
> with BHs disabled, which prevents drivers from sleeping. This is
> problematic for ops-locked drivers that need to sleep.

A note to other reviewers - I asked Stanislav (off-list) to keep working
on this even tho he's targeting net-next, because the patch set
addresses a syzbot report. Chances of hitting the issue are low but
this is a fix, really.

^ permalink raw reply

* Re: [PATCH net] NFC: digital: bound SENSF response copy into nfc_target
From: Jakub Kicinski @ 2026-04-13 18:41 UTC (permalink / raw)
  To: Michael Bommarito
  Cc: netdev, David S. Miller, Eric Dumazet, Paolo Abeni, Simon Horman,
	Kees Cook, stable, linux-kernel
In-Reply-To: <20260413174715.197640-1-michael.bommarito@gmail.com>

On Mon, 13 Apr 2026 13:47:15 -0400 Michael Bommarito wrote:
> Assisted-by: Claude:claude-opus-4-6
> Assisted-by: Codex:gpt-5-4

Could you do some experimentation and figure out what we can do to the
kernel to make the bots check the submission history? It's the 4th time
we received this (incorrect) patch.

^ permalink raw reply

* Re: [PATCH net v3 4/5] bonding: 3ad: fix stuck negotiation on recovery
From: Jay Vosburgh @ 2026-04-13 18:39 UTC (permalink / raw)
  To: Louis Scalbert
  Cc: netdev, andrew+netdev, edumazet, kuba, pabeni, fbl, andy,
	shemminger, maheshb
In-Reply-To: <20260408152353.276204-5-louis.scalbert@6wind.com>

Louis Scalbert <louis.scalbert@6wind.com> wrote:

>The previous commit introduced a side effect caused by clearing the
>SELECTED flag on disabled ports. After all ports in an aggregator go
>down, if only a subset of ports comes back up, those ports can no
>longer renegotiate LACP unless all aggregator ports come back up.
>
>1. All aggregator ports go down
>  - The SELECTED flag is cleared on all of them.
>2. One port comes back up
>  - Its SELECTED flag is set again.
>  - It enters the WAITING state and gets its READY_N flag.
>  - The remaining ports stay UNSELECTED. Because of that, they cannot
>  enter the WAITING state and therefore never get READY_N.

	This is the part where I think we may be doing something else
incorrectly.  If the port is UNSELECTED, then that means that no
aggregator is currently selected for that port, and therefore it
shouldn't be assigned to an aggregator with other ports (per
802.1AX-2014 6.4.8, "Selected").

	I'm not seeing anything in the 6.4.14 Selection Logic that makes
me think a port that is down (port_enabled == FALSE) is disallowed from
being SELECTED.

	Looking at the Receive machine state diagram (Figure 6-18), I
tend to think that in this case the port would transition to
PORT_DISABLED state, as we're not asserting a BEGIN (reinitialization of
the LACP protocol entity), so the port variables can remain unchanged.
There's even some language that suggests this is intentional:

	"If the Aggregation Port becomes inoperable and the BEGIN
	variable is not asserted, the state machine enters the
	PORT_DISABLED state. [...] This state allows the current
	Selection state to remain undisturbed, so that, in the event
	that the Aggregation Port is still connected to the same Partner
	and Partner Aggregation Port when it becomes operable again,
	there will be no disturbance caused to higher layers by
	unnecessary re-configuration."

	So, perhaps the actual bug is that these ports are attached to
the aggregator but not SELECTED.

	-J


>  - __agg_ports_are_ready() returns 0 because it finds a port without
>  READY_N.
>  - As a result, __set_agg_ports_ready() keeps the READY flag cleared on
>  all ports.
>  - The port that came back up is therefore not marked READY and cannot
>  transition to ATTACHED.
>  - LACP negotiation becomes stuck, and the port cannot be used.
>3. All aggregator ports come back up
>  - They all regain SELECTED and READY_N.
>  - __agg_ports_are_ready() now returns 1.
>  - __set_agg_ports_ready() sets READY on all ports.
>  - They can then transition to ATTACHED.
>  - Negotiation resumes and the aggregator becomes operational again.
>
>Consider only ports currently in the WAITING mux state for READY_N so
>that __agg_ports_are_ready() does not return 0 because of a disabled
>port. That matches 802.3ad, which states: "The Selection Logic asserts
>Ready TRUE when the values of Ready_N for all ports that are waiting to
>attach to a given Aggregator are TRUE.".
>
>Fixes: 655f8919d549 ("bonding: add min links parameter to 802.3ad")
>Signed-off-by: Louis Scalbert <louis.scalbert@6wind.com>
>---
> drivers/net/bonding/bond_3ad.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
>index 3a94fbcbf721..3f56d892b101 100644
>--- a/drivers/net/bonding/bond_3ad.c
>+++ b/drivers/net/bonding/bond_3ad.c
>@@ -700,7 +700,8 @@ static void __update_ntt(struct lacpdu *lacpdu, struct port *port)
> }
> 
> /**
>- * __agg_ports_are_ready - check if all ports in an aggregator are ready
>+ * __agg_ports_are_ready - check if all ports in an aggregator that are in
>+ * the WAITING state are ready
>  * @aggregator: the aggregator we're looking at
>  *
>  */
>@@ -716,6 +717,8 @@ static int __agg_ports_are_ready(struct aggregator *aggregator)
> 		for (port = aggregator->lag_ports;
> 		     port;
> 		     port = port->next_port_in_aggregator) {
>+			if (port->sm_mux_state != AD_MUX_WAITING)
>+				continue;
> 			if (!(port->sm_vars & AD_PORT_READY_N)) {
> 				retval = 0;
> 				break;
>-- 
>2.39.2
>
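
The behavioural difference between the existing check and the one in the
quoted hunk can be modelled in a few lines of userspace C. This is only a
sketch: struct model_port, enum model_mux and both function names are
simplified stand-ins invented for illustration, not the bonding driver's
types, and the model says nothing about which behaviour the standard
requires.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the mux machine states; illustrative only. */
enum model_mux { MODEL_MUX_DETACHED, MODEL_MUX_WAITING, MODEL_MUX_ATTACHED };

/* Simplified stand-in for a port linked into an aggregator's port list. */
struct model_port {
	enum model_mux mux;
	bool ready_n;
	struct model_port *next;
};

/* Existing rule: any port in the aggregator without READY_N blocks READY. */
static bool agg_ready_all_ports(const struct model_port *head)
{
	const struct model_port *p;

	for (p = head; p; p = p->next)
		if (!p->ready_n)
			return false;
	return true;
}

/* Rule in the quoted hunk: ports not in WAITING are skipped. */
static bool agg_ready_waiting_only(const struct model_port *head)
{
	const struct model_port *p;

	for (p = head; p; p = p->next) {
		if (p->mux != MODEL_MUX_WAITING)
			continue;
		if (!p->ready_n)
			return false;
	}
	return true;
}
```

With one port in WAITING that has READY_N and one port that is still down
(not in WAITING, no READY_N), the first function returns false and the
second returns true, which is the step 2 scenario from the commit message.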

---
	-Jay Vosburgh, jv@jvosburgh.net

^ permalink raw reply

* Re: [PATCH net-next v2] net: check qdisc_pkt_len_segs_init() return value on ingress
From: Daniel Borkmann @ 2026-04-13 18:38 UTC (permalink / raw)
  To: David Carlier, Jakub Kicinski, David S . Miller, Eric Dumazet,
	Paolo Abeni
  Cc: Simon Horman, Stanislav Fomichev, Kuniyuki Iwashima,
	Samiullah Khawaja, Hangbin Liu, Krishna Kumar, netdev,
	linux-kernel
In-Reply-To: <20260413182225.10683-1-devnexen@gmail.com>

On 4/13/26 8:22 PM, David Carlier wrote:
> Commit 7fb4c1967011 ("net: pull headers in qdisc_pkt_len_segs_init()")
> changed qdisc_pkt_len_segs_init() to return an skb drop reason when
> it detects malicious GSO packets. The egress path in __dev_queue_xmit()
> checks this return value and drops bad packets, but the ingress path in
> sch_handle_ingress() ignores it.
> 
> This means malformed GSO packets entering via TC ingress are not dropped
> and could be redirected to another interface or cause incorrect qdisc
> accounting.

Why do we need to do this on both sides (and what's the perf impact)? If TC
ingress redirects it to some other device, then don't we hit the same via
__dev_queue_xmit() where the 7fb4c1967011 added the qdisc_pkt_len_segs_init()?

> Check the return value and drop the packet when a bad GSO is detected.
> 
> Fixes: 7fb4c1967011 ("net: pull headers in qdisc_pkt_len_segs_init()")
> Signed-off-by: David Carlier <devnexen@gmail.com>
> ---
> 
> v1 -> v2: reorder variable declarations for reverse xmas tree
> v1: https://lore.kernel.org/netdev/20260408172307.46498-1-devnexen@gmail.com/
>   net/core/dev.c | 12 ++++++++++--
>   1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 5a31f9d2128c..d11c22cafca9 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4459,8 +4459,8 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
>   		   struct net_device *orig_dev, bool *another)
>   {
>   	struct bpf_mprog_entry *entry = rcu_dereference_bh(skb->dev->tcx_ingress);
> -	enum skb_drop_reason drop_reason = SKB_DROP_REASON_TC_INGRESS;
>   	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
> +	enum skb_drop_reason drop_reason;
>   	int sch_ret;
>   
>   	if (!entry)
> @@ -4472,7 +4472,15 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
>   		*pt_prev = NULL;
>   	}
>   
> -	qdisc_pkt_len_segs_init(skb);
> +	drop_reason = qdisc_pkt_len_segs_init(skb);
> +	if (unlikely(drop_reason)) {
> +		kfree_skb_reason(skb, drop_reason);
> +		*ret = NET_RX_DROP;
> +		bpf_net_ctx_clear(bpf_net_ctx);
> +		return NULL;
> +	}
> +
> +	drop_reason = SKB_DROP_REASON_TC_INGRESS;
>   	tcx_set_ingress(skb, true);
>   
>   	if (static_branch_unlikely(&tcx_needed_key)) {


^ permalink raw reply

* Re: [RFC v2 1/2] vfio: add callback to get tph info for dmabuf
From: Zhiping Zhang @ 2026-04-13 18:32 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Keith Busch, Jason Gunthorpe, Bjorn Helgaas, linux-rdma,
	linux-pci, netdev, dri-devel, Yochai Cohen, Yishai Hadas,
	Bjorn Helgaas
In-Reply-To: <20260409120415.GF86584@unreal>

On Thu, Apr 9, 2026 at 5:04 AM Leon Romanovsky <leon@kernel.org> wrote:
>
> >
> On Tue, Mar 31, 2026 at 01:44:02PM -0600, Keith Busch wrote:
> > On Tue, Mar 31, 2026 at 10:02:20PM +0300, Leon Romanovsky wrote:
> > >
> > > Right, what about adding TPH fields to struct vfio_region_dma_range
> > > instead of struct vfio_device_feature_dma_buf?
> >
> > You might have to show me with code what you're talking about because I
> > can't see any way we can add fields to any struct here without breaking
> > backward compatibility.
> >
> > If we can't claim bits out of the unused "flags" field for this feature,
> > then my initial reply is the only sane approach: we can introduce a new
> > feature and struct for it that closely mirrors the existing one, but
> > with the extra hint fields.
>
> Something like that, on top of this proposal:
>
> diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
> index 3961afa640391..70d5ee1e3ef7b 100644
> --- a/drivers/vfio/pci/vfio_pci_dmabuf.c
> +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
> @@ -241,9 +241,7 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
>                 return -EFAULT;
>
>         if (!get_dma_buf.nr_ranges ||
> -           (get_dma_buf.flags & ~(VFIO_DMABUF_FL_TPH |
> -                                  VFIO_DMABUF_TPH_PH_MASK |
> -                                  VFIO_DMABUF_TPH_ST_MASK)))
> +           (get_dma_buf.flags & ~VFIO_DMABUF_FLAG_TPH))
>                 return -EINVAL;
>
>         /*
> @@ -300,13 +298,10 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
>                 ret = PTR_ERR(priv->dmabuf);
>                 goto err_dev_put;
>         }
> -       if (get_dma_buf.flags & VFIO_DMABUF_FL_TPH) {
> -               priv->steering_tag = (get_dma_buf.flags &
> -                                     VFIO_DMABUF_TPH_ST_MASK) >>
> -                                    VFIO_DMABUF_TPH_ST_SHIFT;
> -               priv->ph = (get_dma_buf.flags &
> -                           VFIO_DMABUF_TPH_PH_MASK) >>
> -                          VFIO_DMABUF_TPH_PH_SHIFT;
> +       if (get_dma_buf.flags & VFIO_DMABUF_FLAG_TPH) {
> +               priv->steering_tag =
> +                       dma_ranges[get_dma_buf.nr_ranges + 1].tph.tag;
> +               priv->ph = dma_ranges[get_dma_buf.nr_ranges + 1].tph.ph;
>         }
>         /* dma_buf_put() now frees priv */
>         INIT_LIST_HEAD(&priv->dmabufs_elm);
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index e2a8962641d2c..a8b8d8b1a3278 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -1497,20 +1497,30 @@ struct vfio_device_feature_bus_master {
>   */
>  #define VFIO_DEVICE_FEATURE_DMA_BUF 11
>
> +struct vfio_region_dma_tph {
> +       u16 tag;
> +       u8 ph;
> +};
> +
>  struct vfio_region_dma_range {
> -       __u64 offset;
> -       __u64 length;
> +       union {
> +               __u64 offset;
> +               struct vfio_region_dma_tph tph;
> +       };
> +       union {
> +               __u64 length;
> +               __u64 reserved;
> +       };
> +};
> +
> +enum {
> +       VFIO_DMABUF_FLAG_TPH = 1 << 0,
>  };
>
>  struct vfio_device_feature_dma_buf {
>         __u32   region_index;
>         __u32   open_flags;
>         __u32   flags;
> -#define VFIO_DMABUF_FL_TPH             (1U << 0) /* TPH info is present */
> -#define VFIO_DMABUF_TPH_PH_SHIFT       1         /* bits 1-2: PH (2-bit) */
> -#define VFIO_DMABUF_TPH_PH_MASK        0x6U
> -#define VFIO_DMABUF_TPH_ST_SHIFT       16        /* bits 16-31: steering tag */
> -#define VFIO_DMABUF_TPH_ST_MASK                0xffff0000U
>         __u32   nr_ranges;
>         struct vfio_region_dma_range dma_ranges[] __counted_by(nr_ranges);
>  };
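
For reference, the flags-based encoding that the diff above removes packs
both hints into the single 32-bit flags word. A minimal userspace sketch of
that packing, using only the macro values from the removed lines; the three
helper functions are hypothetical and exist only to show the round trip:

```c
#include <stdint.h>

/* Values copied from the removed lines of the RFC above. */
#define VFIO_DMABUF_FL_TPH		(1U << 0)	/* TPH info is present */
#define VFIO_DMABUF_TPH_PH_SHIFT	1		/* bits 1-2: PH (2-bit) */
#define VFIO_DMABUF_TPH_PH_MASK		0x6U
#define VFIO_DMABUF_TPH_ST_SHIFT	16		/* bits 16-31: steering tag */
#define VFIO_DMABUF_TPH_ST_MASK		0xffff0000U

/* Hypothetical helper: build a flags word carrying a steering tag and PH. */
static uint32_t tph_pack_flags(uint16_t steering_tag, uint8_t ph)
{
	return VFIO_DMABUF_FL_TPH |
	       (((uint32_t)ph << VFIO_DMABUF_TPH_PH_SHIFT) &
		VFIO_DMABUF_TPH_PH_MASK) |
	       (((uint32_t)steering_tag << VFIO_DMABUF_TPH_ST_SHIFT) &
		VFIO_DMABUF_TPH_ST_MASK);
}

/* Hypothetical helper: recover the steering tag, as the removed code did. */
static uint16_t tph_flags_to_st(uint32_t flags)
{
	return (uint16_t)((flags & VFIO_DMABUF_TPH_ST_MASK) >>
			  VFIO_DMABUF_TPH_ST_SHIFT);
}

/* Hypothetical helper: recover the 2-bit processing hint. */
static uint8_t tph_flags_to_ph(uint32_t flags)
{
	return (uint8_t)((flags & VFIO_DMABUF_TPH_PH_MASK) >>
			 VFIO_DMABUF_TPH_PH_SHIFT);
}
```

This layout consumes 19 of the 32 flag bits for one feature, which is the
cost the struct-based variant above avoids by spending a single flag bit.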

Sounds good, thanks! We will follow up and move this RFC to a formal patch.

Zhiping

^ permalink raw reply

* [PATCH RFC bpf-next 8/8] selftests/bpf: add tests to validate KASAN on JIT programs
From: Alexis Lothoré (eBPF Foundation) @ 2026-04-13 18:28 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Song Liu, Yonghong Song, Jiri Olsa, John Fastabend,
	David S. Miller, David Ahern, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Shuah Khan,
	Maxime Coquelin, Alexandre Torgue, Andrey Ryabinin,
	Alexander Potapenko, Andrey Konovalov, Dmitry Vyukov,
	Vincenzo Frascino, Andrew Morton
  Cc: ebpf, Bastien Curutchet, Thomas Petazzoni, Xu Kuohai, bpf,
	linux-kernel, netdev, linux-kselftest, linux-stm32,
	linux-arm-kernel, kasan-dev, linux-mm,
	Alexis Lothoré (eBPF Foundation)
In-Reply-To: <20260413-kasan-v1-0-1a5831230821@bootlin.com>

Add a basic KASAN test runner that loads and test-run programs that can
trigger memory management bugs. The test captures kernel logs and ensures
that the expected KASAN splat is emitted by searching for the
corresponding first lines in the report.

This version implements two faulty programs triggering either a
use-after-free, or an out-of-bounds memory usage. The bugs are
triggered thanks to some dedicated kfuncs in bpf_testmod.c, but two
different techniques are used, as some cases can be quite hard to
trigger in a pure "black box" approach:
- for reads, we can make the used kfuncs return some faulty pointers
  that ebpf programs will manipulate; they will generate legitimate
  kasan reports as a consequence
- applying the same trick for faulty writes is harder, as ebpf programs
  can't write kernel data freely. So ebpf programs can call another
  specific testing kfunc that will alter the shadow memory matching the
  passed memory (eg: a map). When the program then tries to write to the
  corresponding memory, it will trigger a report as well.

Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com>
---
The way of bringing kasan_poison into bpf_testmod is definitely not
ideal.  But I would like to validate the testing approach (triggering
real faulty accesses, which is hard in some cases, VS manually poisoning
BPF-manipulated memory) before eventually making clean bridges between
KASAN APIs and bpf_testmod.c, if the latter approach is the valid one.
---
 tools/testing/selftests/bpf/prog_tests/kasan.c     | 165 +++++++++++++++++++++
 tools/testing/selftests/bpf/progs/kasan.c          | 146 ++++++++++++++++++
 .../testing/selftests/bpf/test_kmods/bpf_testmod.c |  79 ++++++++++
 3 files changed, 390 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/kasan.c b/tools/testing/selftests/bpf/prog_tests/kasan.c
new file mode 100644
index 000000000000..fd628aaa8005
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/kasan.c
@@ -0,0 +1,165 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
+#include <bpf/bpf.h>
+#include <fcntl.h>
+#include <linux/if_ether.h>
+#include <sys/klog.h>
+#include <test_progs.h>
+#include <unpriv_helpers.h>
+#include "kasan.skel.h"
+
+#define SUBTEST_NAME_MAX_LEN	64
+#define SYSLOG_ACTION_READ_ALL	3
+#define SYSLOG_ACTION_CLEAR	5
+
+#define MAX_LOG_SIZE		(8*1024)
+#define READ_CHUNK_SIZE		128
+
+#define KASAN_PATTERN_SLAB_UAF "BUG: KASAN: slab-use-after-free in bpf_prog_"
+#define KASAN_PATTERN_GLOBAL_OOB "BUG: KASAN: global-out-of-bounds in bpf_prog_"
+
+static char klog_buffer[MAX_LOG_SIZE];
+
+static int read_kernel_logs(char *buf, size_t max_len)
+{
+	return klogctl(SYSLOG_ACTION_READ_ALL, buf, max_len);
+}
+
+static int clear_kernel_logs(void)
+{
+	return klogctl(SYSLOG_ACTION_CLEAR, NULL, 0);
+}
+
+static int kernel_logs_have_matching_kasan_report(char *buf, char *pattern,
+						  bool is_write, int size)
+{
+	char *access_desc_start, *access_desc_end, *tmp;
+	char access_log[READ_CHUNK_SIZE];
+	char *kasan_report_start;
+	int hsize, nsize;
+	/* Searched kasan report is valid if
+	 * - it contains the expected kasan pattern
+	 * - the next line is the description of the faulty access
+	 * - faulty access properties match the tested type and size
+	 */
+	kasan_report_start = strstr(buf, pattern);
+
+	if (!kasan_report_start)
+		return 1;
+
+	/* Find next line */
+	access_desc_start = strchr(kasan_report_start, '\n');
+	if (!access_desc_start)
+		return 1;
+	access_desc_start++;
+
+	access_desc_end = strchr(access_desc_start, '\n');
+	if (!access_desc_end)
+		return 1;
+
+	nsize = snprintf(access_log, READ_CHUNK_SIZE, "%s of size %d at addr",
+		 is_write ? "Write" : "Read", size);
+
+	hsize = access_desc_end - access_desc_start;
+	tmp = memmem(access_desc_start, hsize, access_log, nsize);
+
+	if (!tmp)
+		return 1;
+
+	return 0;
+}
+
+struct test_spec {
+	char *prog_name;
+	char *expected_report_pattern;
+};
+
+static struct test_spec tests[] = {
+	{
+		.prog_name = "bpf_kasan_uaf",
+		.expected_report_pattern = KASAN_PATTERN_SLAB_UAF
+	},
+	{
+		.prog_name = "bpf_kasan_oob",
+		.expected_report_pattern = KASAN_PATTERN_GLOBAL_OOB
+	}
+};
+
+static void run_test_with_type_and_size(struct kasan *skel,
+					struct test_spec *test, bool is_write,
+					int access_size)
+{
+	char subtest_name[SUBTEST_NAME_MAX_LEN];
+	struct bpf_program *prog;
+	uint8_t buf[ETH_HLEN];
+	int ret;
+
+	prog = bpf_object__find_program_by_name(skel->obj, test->prog_name);
+	if (!ASSERT_OK_PTR(prog, "find test prog"))
+		return;
+
+	snprintf(subtest_name, SUBTEST_NAME_MAX_LEN, "%s_%s_%d",
+		 test->prog_name, is_write ? "write" : "read", access_size);
+
+	if (!test__start_subtest(subtest_name))
+		return;
+
+	ret = clear_kernel_logs();
+	if (!ASSERT_OK(ret, "reset log buffer"))
+		return;
+
+	LIBBPF_OPTS(bpf_test_run_opts, topts);
+	topts.sz = sizeof(struct bpf_test_run_opts);
+	topts.data_size_in = ETH_HLEN;
+	topts.data_in = buf;
+	skel->bss->is_write = is_write;
+	skel->bss->access_size = access_size;
+	ret = bpf_prog_test_run_opts(bpf_program__fd(prog), &topts);
+	if (!ASSERT_OK(ret, "run prog"))
+		return;
+
+	ret = read_kernel_logs(klog_buffer, MAX_LOG_SIZE);
+	if (ASSERT_GE(ret, 0, "read kernel logs"))
+		ASSERT_OK(kernel_logs_have_matching_kasan_report(
+				  klog_buffer, test->expected_report_pattern,
+				  is_write, access_size),
+			  test->prog_name);
+}
+
+static void run_test_with_type(struct kasan *skel, struct test_spec *test,
+			       bool is_write)
+{
+	run_test_with_type_and_size(skel, test, is_write, 1);
+	run_test_with_type_and_size(skel, test, is_write, 2);
+	run_test_with_type_and_size(skel, test, is_write, 4);
+	run_test_with_type_and_size(skel, test, is_write, 8);
+}
+
+static void run_test(struct kasan *skel, struct test_spec *test)
+{
+	run_test_with_type(skel, test, false);
+	run_test_with_type(skel, test, true);
+}
+
+void test_kasan(void)
+{
+	struct test_spec *test;
+	struct kasan *skel;
+	int i;
+
+	if (!is_jit_enabled() || !get_kasan_jit_enabled()) {
+		test__skip();
+		return;
+	}
+
+	skel = kasan__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "open and load prog"))
+		return;
+
+	for (i = 0; i < ARRAY_SIZE(tests); i++) {
+		test = &tests[i];
+
+		run_test(skel, test);
+	}
+
+	kasan__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/kasan.c b/tools/testing/selftests/bpf/progs/kasan.c
new file mode 100644
index 000000000000..f713c9b7c9ce
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/kasan.c
@@ -0,0 +1,146 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
+
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+#define KASAN_SLAB_FREE 0xFB
+#define KASAN_GLOBAL_REDZONE 0xF9
+
+extern __u8 *bpf_kfunc_kasan_uaf_1(void) __ksym;
+extern __u16 *bpf_kfunc_kasan_uaf_2(void) __ksym;
+extern __u32 *bpf_kfunc_kasan_uaf_4(void) __ksym;
+extern __u64 *bpf_kfunc_kasan_uaf_8(void) __ksym;
+extern __u8 *bpf_kfunc_kasan_oob_1(void) __ksym;
+extern __u16 *bpf_kfunc_kasan_oob_2(void) __ksym;
+extern __u32 *bpf_kfunc_kasan_oob_4(void) __ksym;
+extern __u64 *bpf_kfunc_kasan_oob_8(void) __ksym;
+extern void bpf_kfunc_kasan_poison(void *mem, __u32 mem__sz, __u8 byte) __ksym;
+
+int access_size;
+int is_write;
+
+struct kasan_write_val {
+	__u8 data_1;
+	__u16 data_2;
+	__u32 data_4;
+	__u64 data_8;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 1);
+	__type(key, __u32);
+	__type(value, struct kasan_write_val);
+} test_map SEC(".maps");
+
+static void bpf_kasan_faulty_write(int size, __u8 poison_byte)
+{
+	struct kasan_write_val *val;
+	__u32 key = 0;
+
+	val = bpf_map_lookup_elem(&test_map, &key);
+	if (!val)
+		return;
+
+	bpf_kfunc_kasan_poison(val, sizeof(struct kasan_write_val),
+			       poison_byte);
+	switch (size) {
+	case 1:
+		val->data_1 = 0xAA;
+		break;
+	case 2:
+		val->data_2 = 0xAA;
+		break;
+	case 4:
+		val->data_4 = 0xAA;
+		break;
+	case 8:
+		val->data_8 = 0xAA;
+		break;
+	}
+	bpf_kfunc_kasan_poison(val, sizeof(struct kasan_write_val), 0x00);
+}
+
+
+static int bpf_kasan_uaf_read(int size)
+{
+	__u8 *result_1;
+	__u16 *result_2;
+	__u32 *result_4;
+	__u64 *result_8;
+	int ret = 0;
+
+	switch (size) {
+	case 1:
+		result_1 = bpf_kfunc_kasan_uaf_1();
+		ret = result_1[0] ? 1 : 0;
+		break;
+	case 2:
+		result_2 = bpf_kfunc_kasan_uaf_2();
+		ret = result_2[0] ? 1 : 0;
+		break;
+	case 4:
+		result_4 = bpf_kfunc_kasan_uaf_4();
+		ret = result_4[0] ? 1 : 0;
+		break;
+	case 8:
+		result_8 = bpf_kfunc_kasan_uaf_8();
+		ret = result_8[0] ? 1 : 0;
+		break;
+	}
+	return ret;
+}
+
+SEC("tcx/ingress")
+int bpf_kasan_uaf(struct __sk_buff *skb)
+{
+	if (is_write) {
+		bpf_kasan_faulty_write(access_size, KASAN_SLAB_FREE);
+		return 0;
+	}
+
+	return bpf_kasan_uaf_read(access_size);
+}
+
+static int bpf_kasan_oob_read(int size)
+{
+	__u8 *result_1;
+	__u16 *result_2;
+	__u32 *result_4;
+	__u64 *result_8;
+	int ret = 0;
+
+	switch (size) {
+	case 1:
+		result_1 = bpf_kfunc_kasan_oob_1();
+		ret = result_1[0] ? 1 : 0;
+		break;
+	case 2:
+		result_2 = bpf_kfunc_kasan_oob_2();
+		ret = result_2[0] ? 1 : 0;
+		break;
+	case 4:
+		result_4 = bpf_kfunc_kasan_oob_4();
+		ret = result_4[0] ? 1 : 0;
+		break;
+	case 8:
+		result_8 = bpf_kfunc_kasan_oob_8();
+		ret = result_8[0] ? 1 : 0;
+		break;
+	}
+	return ret;
+}
+
+SEC("tcx/ingress")
+int bpf_kasan_oob(struct __sk_buff *skb)
+{
+	if (is_write) {
+		bpf_kasan_faulty_write(access_size, KASAN_GLOBAL_REDZONE);
+		return 0;
+	}
+
+	return bpf_kasan_oob_read(access_size);
+}
+
+char LICENSE[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_kmods/bpf_testmod.c b/tools/testing/selftests/bpf/test_kmods/bpf_testmod.c
index d876314a4d67..01554bcbbbb0 100644
--- a/tools/testing/selftests/bpf/test_kmods/bpf_testmod.c
+++ b/tools/testing/selftests/bpf/test_kmods/bpf_testmod.c
@@ -271,6 +271,76 @@ __bpf_kfunc void bpf_kfunc_put_default_trusted_ptr_test(struct prog_test_member
 	 */
 }
 
+static void *kasan_uaf(void)
+{
+	void *p = kmalloc(64, GFP_ATOMIC);
+
+	if (!p)
+		return NULL;
+	memset(p, 0xAA, 64);
+	kfree(p);
+
+	return p;
+}
+
+#ifdef CONFIG_KASAN_GENERIC
+extern void kasan_poison(const void *addr, size_t size, u8 value, bool init);
+
+__bpf_kfunc void bpf_kfunc_kasan_poison(void *mem, u32 mem__sz, u8 byte)
+{
+	kasan_poison(mem, mem__sz, byte, false);
+}
+#else
+__bpf_kfunc void bpf_kfunc_kasan_poison(void *mem, u32 mem__sz, u8 byte) { }
+#endif
+
+__bpf_kfunc u8 *bpf_kfunc_kasan_uaf_1(void)
+{
+	return kasan_uaf();
+}
+
+__bpf_kfunc u16 *bpf_kfunc_kasan_uaf_2(void)
+{
+	return kasan_uaf();
+}
+
+__bpf_kfunc u32 *bpf_kfunc_kasan_uaf_4(void)
+{
+	return kasan_uaf();
+}
+
+__bpf_kfunc u64 *bpf_kfunc_kasan_uaf_8(void)
+{
+	return kasan_uaf();
+}
+
+static u8 test_oob_buffer[64];
+
+static void *bpf_kfunc_kasan_oob(void)
+{
+	return test_oob_buffer+64;
+}
+
+__bpf_kfunc u8 *bpf_kfunc_kasan_oob_1(void)
+{
+	return bpf_kfunc_kasan_oob();
+}
+
+__bpf_kfunc u16 *bpf_kfunc_kasan_oob_2(void)
+{
+	return bpf_kfunc_kasan_oob();
+}
+
+__bpf_kfunc u32 *bpf_kfunc_kasan_oob_4(void)
+{
+	return bpf_kfunc_kasan_oob();
+}
+
+__bpf_kfunc u64 *bpf_kfunc_kasan_oob_8(void)
+{
+	return bpf_kfunc_kasan_oob();
+}
+
 __bpf_kfunc struct bpf_testmod_ctx *
 bpf_testmod_ctx_create(int *err)
 {
@@ -740,6 +810,15 @@ BTF_ID_FLAGS(func, bpf_testmod_ops3_call_test_1)
 BTF_ID_FLAGS(func, bpf_testmod_ops3_call_test_2)
 BTF_ID_FLAGS(func, bpf_kfunc_get_default_trusted_ptr_test);
 BTF_ID_FLAGS(func, bpf_kfunc_put_default_trusted_ptr_test);
+BTF_ID_FLAGS(func, bpf_kfunc_kasan_poison)
+BTF_ID_FLAGS(func, bpf_kfunc_kasan_uaf_1)
+BTF_ID_FLAGS(func, bpf_kfunc_kasan_uaf_2)
+BTF_ID_FLAGS(func, bpf_kfunc_kasan_uaf_4)
+BTF_ID_FLAGS(func, bpf_kfunc_kasan_uaf_8)
+BTF_ID_FLAGS(func, bpf_kfunc_kasan_oob_1)
+BTF_ID_FLAGS(func, bpf_kfunc_kasan_oob_2)
+BTF_ID_FLAGS(func, bpf_kfunc_kasan_oob_4)
+BTF_ID_FLAGS(func, bpf_kfunc_kasan_oob_8)
 BTF_KFUNCS_END(bpf_testmod_common_kfunc_ids)
 
 BTF_ID_LIST(bpf_testmod_dtor_ids)

-- 
2.53.0


^ permalink raw reply related

* [PATCH RFC bpf-next 7/8] bpf, x86: enable KASAN for JITed programs on x86
From: Alexis Lothoré (eBPF Foundation) @ 2026-04-13 18:28 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Song Liu, Yonghong Song, Jiri Olsa, John Fastabend,
	David S. Miller, David Ahern, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Shuah Khan,
	Maxime Coquelin, Alexandre Torgue, Andrey Ryabinin,
	Alexander Potapenko, Andrey Konovalov, Dmitry Vyukov,
	Vincenzo Frascino, Andrew Morton
  Cc: ebpf, Bastien Curutchet, Thomas Petazzoni, Xu Kuohai, bpf,
	linux-kernel, netdev, linux-kselftest, linux-stm32,
	linux-arm-kernel, kasan-dev, linux-mm,
	Alexis Lothoré (eBPF Foundation)
In-Reply-To: <20260413-kasan-v1-0-1a5831230821@bootlin.com>

Mark x86 as supporting KASAN checks in JITed programs so that the
corresponding JIT compiler inserts checks on the translated
instructions.

Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com>
---
 arch/x86/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e2df1b147184..a50aa9a0b93c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -234,6 +234,7 @@ config X86
 	select HAVE_SAMPLE_FTRACE_DIRECT	if X86_64
 	select HAVE_SAMPLE_FTRACE_DIRECT_MULTI	if X86_64
 	select HAVE_EBPF_JIT
+	select HAVE_EBPF_JIT_KASAN		if X86_64
 	select HAVE_EFFICIENT_UNALIGNED_ACCESS
 	select HAVE_EISA			if X86_32
 	select HAVE_EXIT_THREAD

-- 
2.53.0


^ permalink raw reply related

* [PATCH RFC bpf-next 6/8] selftests/bpf: do not run verifier JIT tests when BPF_JIT_KASAN is enabled
From: Alexis Lothoré (eBPF Foundation) @ 2026-04-13 18:28 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Song Liu, Yonghong Song, Jiri Olsa, John Fastabend,
	David S. Miller, David Ahern, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Shuah Khan,
	Maxime Coquelin, Alexandre Torgue, Andrey Ryabinin,
	Alexander Potapenko, Andrey Konovalov, Dmitry Vyukov,
	Vincenzo Frascino, Andrew Morton
  Cc: ebpf, Bastien Curutchet, Thomas Petazzoni, Xu Kuohai, bpf,
	linux-kernel, netdev, linux-kselftest, linux-stm32,
	linux-arm-kernel, kasan-dev, linux-mm,
	Alexis Lothoré (eBPF Foundation)
In-Reply-To: <20260413-kasan-v1-0-1a5831230821@bootlin.com>

Multiple verifier tests validate the exact list of JITed instructions.
Even if the test offers some flexibility in its checks (eg: not
enforcing the first instruction to be verified right at the beginning of
jited code, but rather searching where the expected JIT instructions
could be located), it is confused by the new KASAN instrumentation JITed
in programs: this instrumentation can be inserted anywhere in-between
searched instructions, leading to test failures despite the correct
instructions being generated.

Prevent those failures by skipping tests involving JITed instructions
checks when the kernel is built with KASAN _and_ JIT is enabled, as those
two conditions lead the JITed code to contain KASAN checks.

Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com>
---
 tools/testing/selftests/bpf/test_loader.c    | 5 +++++
 tools/testing/selftests/bpf/unpriv_helpers.c | 5 +++++
 tools/testing/selftests/bpf/unpriv_helpers.h | 1 +
 3 files changed, 11 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_loader.c b/tools/testing/selftests/bpf/test_loader.c
index c4c34cae6102..d2c0062ef31a 100644
--- a/tools/testing/selftests/bpf/test_loader.c
+++ b/tools/testing/selftests/bpf/test_loader.c
@@ -1175,6 +1175,11 @@ void run_subtest(struct test_loader *tester,
 		return;
 	}
 
+	if (is_jit_enabled() && subspec->jited.cnt && get_kasan_jit_enabled()) {
+		test__skip();
+		return;
+	}
+
 	if (unpriv) {
 		if (!can_execute_unpriv(tester, spec)) {
 			test__skip();
diff --git a/tools/testing/selftests/bpf/unpriv_helpers.c b/tools/testing/selftests/bpf/unpriv_helpers.c
index f997d7ec8fd0..25bd08648f5f 100644
--- a/tools/testing/selftests/bpf/unpriv_helpers.c
+++ b/tools/testing/selftests/bpf/unpriv_helpers.c
@@ -142,3 +142,8 @@ bool get_unpriv_disabled(void)
 	}
 	return mitigations_off;
 }
+
+bool get_kasan_jit_enabled(void)
+{
+	return config_contains("CONFIG_BPF_JIT_KASAN=y");
+}
diff --git a/tools/testing/selftests/bpf/unpriv_helpers.h b/tools/testing/selftests/bpf/unpriv_helpers.h
index 151f67329665..bc5f4c953c9d 100644
--- a/tools/testing/selftests/bpf/unpriv_helpers.h
+++ b/tools/testing/selftests/bpf/unpriv_helpers.h
@@ -5,3 +5,4 @@
 #define UNPRIV_SYSCTL "kernel/unprivileged_bpf_disabled"
 
 bool get_unpriv_disabled(void);
+bool get_kasan_jit_enabled(void);

-- 
2.53.0



* [PATCH RFC bpf-next 5/8] bpf, x86: emit KASAN checks into x86 JITed programs
From: Alexis Lothoré (eBPF Foundation) @ 2026-04-13 18:28 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Song Liu, Yonghong Song, Jiri Olsa, John Fastabend,
	David S. Miller, David Ahern, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Shuah Khan,
	Maxime Coquelin, Alexandre Torgue, Andrey Ryabinin,
	Alexander Potapenko, Andrey Konovalov, Dmitry Vyukov,
	Vincenzo Frascino, Andrew Morton
  Cc: ebpf, Bastien Curutchet, Thomas Petazzoni, Xu Kuohai, bpf,
	linux-kernel, netdev, linux-kselftest, linux-stm32,
	linux-arm-kernel, kasan-dev, linux-mm,
	Alexis Lothoré (eBPF Foundation)
In-Reply-To: <20260413-kasan-v1-0-1a5831230821@bootlin.com>

Insert KASAN shadow memory checks before memory load and store
operations in JIT-compiled BPF programs. This helps detect memory safety
bugs such as use-after-free and out-of-bounds accesses at runtime.

The main instructions being targeted are BPF_LDX and BPF_STX, but not
all of them are instrumented:
- if the load/store instruction actually accesses the program stack,
  emit_kasan_check() silently skips the instrumentation, as we already
  have page guards to monitor stack accesses. Stack accesses _could_ be
  monitored more finely by adding KASAN checks, but that would require
  the JIT compiler to insert red zones around every variable on the
  stack, and we likely do not have enough information in the JIT
  compiler to do so.
- if the load/store instruction is a BPF_PROBE_MEM or a BPF_PROBE_ATOMIC
  instruction, we do not instrument it, as the passed address can fault
  (hence the custom fault management with BPF_PROBE_XXX instructions),
  and so the corresponding KASAN check could fault as well.
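
As a rough illustration of the selection rules above, here is a minimal
user-space sketch in C. The enum and function names are hypothetical and
only model the decision; they are not the JIT's actual types or opcodes.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of a BPF memory instruction. */
enum access_class { ACCESS_LOAD, ACCESS_STORE, ACCESS_OTHER };
enum access_mode  { MODE_MEM, MODE_MEMSX, MODE_PROBE, MODE_ATOMIC };

/*
 * Return true when a KASAN check would be emitted before the access,
 * following the rules described in the commit message.
 */
static bool wants_kasan_check(enum access_class cls, enum access_mode mode,
			      bool accesses_stack)
{
	/* Only plain loads and stores are targeted. */
	if (cls != ACCESS_LOAD && cls != ACCESS_STORE)
		return false;
	/* Stack accesses are left to the existing page guards. */
	if (accesses_stack)
		return false;
	/* Possibly faulting accesses are skipped; atomics are not handled yet. */
	if (mode == MODE_PROBE || mode == MODE_ATOMIC)
		return false;
	return true;
}
```

In this model, a regular load from a map value is checked, while the same
load through a probing instruction or into the stack is not.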

Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com>
---
This RFC also ignores atomic operations for now, because it is not yet
clear to me how they are JITed, and so how much KASAN instrumentation
is legitimate here.
---
 arch/x86/net/bpf_jit_comp.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index b90103bd0080..111fe1d55121 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1811,6 +1811,7 @@ static int do_jit(struct bpf_verifier_env *env, struct bpf_prog *bpf_prog, int *
 		const s32 imm32 = insn->imm;
 		u32 dst_reg = insn->dst_reg;
 		u32 src_reg = insn->src_reg;
+		bool accesses_stack;
 		u8 b2 = 0, b3 = 0;
 		u8 *start_of_ldx;
 		s64 jmp_offset;
@@ -1831,6 +1832,7 @@ static int do_jit(struct bpf_verifier_env *env, struct bpf_prog *bpf_prog, int *
 			EMIT_ENDBR();
 
 		ip = image + addrs[i - 1] + (prog - temp);
+		accesses_stack = bpf_insn_accesses_stack(env, bpf_prog, i - 1);
 
 		switch (insn->code) {
 			/* ALU */
@@ -2242,6 +2244,11 @@ st:			if (is_imm8(insn->off))
 		case BPF_STX | BPF_MEM | BPF_H:
 		case BPF_STX | BPF_MEM | BPF_W:
 		case BPF_STX | BPF_MEM | BPF_DW:
+			err = emit_kasan_check(&prog, dst_reg, insn,
+					       image + addrs[i - 1],
+					       accesses_stack);
+			if (err)
+				return err;
 			emit_stx(&prog, BPF_SIZE(insn->code), dst_reg, src_reg, insn->off);
 			break;
 
@@ -2390,6 +2397,12 @@ st:			if (is_imm8(insn->off))
 				/* populate jmp_offset for JAE above to jump to start_of_ldx */
 				start_of_ldx = prog;
 				end_of_jmp[-1] = start_of_ldx - end_of_jmp;
+			} else {
+				err = emit_kasan_check(&prog, src_reg, insn,
+						       image + addrs[i - 1],
+						       accesses_stack);
+				if (err)
+					return err;
 			}
 			if (BPF_MODE(insn->code) == BPF_PROBE_MEMSX ||
 			    BPF_MODE(insn->code) == BPF_MEMSX)

-- 
2.53.0



* [PATCH RFC bpf-next 4/8] bpf, x86: add helper to emit kasan checks in x86 JITed programs
From: Alexis Lothoré (eBPF Foundation) @ 2026-04-13 18:28 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Song Liu, Yonghong Song, Jiri Olsa, John Fastabend,
	David S. Miller, David Ahern, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Shuah Khan,
	Maxime Coquelin, Alexandre Torgue, Andrey Ryabinin,
	Alexander Potapenko, Andrey Konovalov, Dmitry Vyukov,
	Vincenzo Frascino, Andrew Morton
  Cc: ebpf, Bastien Curutchet, Thomas Petazzoni, Xu Kuohai, bpf,
	linux-kernel, netdev, linux-kselftest, linux-stm32,
	linux-arm-kernel, kasan-dev, linux-mm,
	Alexis Lothoré (eBPF Foundation)
In-Reply-To: <20260413-kasan-v1-0-1a5831230821@bootlin.com>

Add the emit_kasan_check() function that emits KASAN shadow memory
checks before memory accesses in JIT-compiled BPF programs. The
implementation relies on the existing __asan_{load,store}X functions
from the KASAN subsystem. The helper:
- ensures that the KASAN instrumentation is actually needed: if the
  instruction being processed accesses the program stack, we skip the
  instrumentation, as those accesses are already protected with page
  guards
- saves registers. This includes caller-saved registers, but also
  temporary registers, as those were possibly used by the
  affected program
- computes the accessed address and stores it in %rdi
- calls the relevant function, depending on whether the instruction is
  a load or a store, and on the size of the access
- restores registers

The special care needed when inserting this instrumentation comes at the
cost of a non-negligible increase in JITed code size. For example, a
bare

  mov     0x0(%rsi),%rbx # Load into rbx the content at the address stored in rsi

becomes

  push    %rax
  push    %rcx
  push    %rdx
  push    %rsi
  push    %rdi
  push    %r8
  push    %r9
  push    %r10
  push    %r11
  sub     $0x8,%rsp
  mov     %rsi,%rdi
  call    0xffffffff81da0a60 <__asan_load8>
  add     $0x8,%rsp
  pop     %r11
  pop     %r10
  pop     %r9
  pop     %r8
  pop     %rdi
  pop     %rsi
  pop     %rdx
  pop     %rcx
  pop     %rax
  mov     0x0(%rsi),%rbx
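
The `sub $0x8,%rsp` in the sequence above follows from simple
arithmetic. The sketch below is a hypothetical helper, not part of the
patch, and assumes %rsp is 16-byte aligned before the first push.

```c
#include <stddef.h>

/*
 * Number of padding bytes to subtract from the stack pointer so that
 * pushed_bytes + padding is a multiple of 16. Assumes the stack pointer
 * was 16-byte aligned before anything was pushed.
 */
static size_t stack_realign_padding(size_t pushed_bytes)
{
	return (16 - (pushed_bytes % 16)) % 16;
}
```

With nine 8-byte pushes (72 bytes), the helper returns 8, which matches
the adjustment emitted before the call and undone right after it.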

Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com>
---
 arch/x86/net/bpf_jit_comp.c | 93 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 93 insertions(+)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index ea9e707e8abf..b90103bd0080 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -20,6 +20,10 @@
 #include <asm/unwind.h>
 #include <asm/cfi.h>
 
+#ifdef CONFIG_BPF_JIT_KASAN
+#include <linux/kasan.h>
+#endif
+
 static bool all_callee_regs_used[4] = {true, true, true, true};
 
 static u8 *emit_code(u8 *ptr, u32 bytes, unsigned int len)
@@ -1301,6 +1305,95 @@ static void emit_store_stack_imm64(u8 **pprog, int reg, int stack_off, u64 imm64
 	emit_stx(pprog, BPF_DW, BPF_REG_FP, reg, stack_off);
 }
 
+static int emit_kasan_check(u8 **pprog, u32 addr_reg, struct bpf_insn *insn,
+			    u8 *ip, bool accesses_stack)
+{
+#ifdef CONFIG_BPF_JIT_KASAN
+	bool is_write = BPF_CLASS(insn->code) == BPF_STX;
+	u32 bpf_size = BPF_SIZE(insn->code);
+	s32 off = insn->off;
+	u8 *prog = *pprog;
+	void *kasan_func;
+
+	if (accesses_stack)
+		return 0;
+
+	/* Derive KASAN check function from access type and size */
+	switch (bpf_size) {
+	case BPF_B:
+		kasan_func = is_write ? __asan_store1 : __asan_load1;
+		break;
+	case BPF_H:
+		kasan_func = is_write ? __asan_store2 : __asan_load2;
+		break;
+	case BPF_W:
+		kasan_func = is_write ? __asan_store4 : __asan_load4;
+		break;
+	case BPF_DW:
+		kasan_func = is_write ? __asan_store8 : __asan_load8;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	/* Save rax */
+	EMIT1(0x50);
+	/* Save rcx */
+	EMIT1(0x51);
+	/* Save rdx */
+	EMIT1(0x52);
+	/* Save rsi */
+	EMIT1(0x56);
+	/* Save rdi */
+	EMIT1(0x57);
+	/* Save r8 */
+	EMIT2(0x41, 0x50);
+	/* Save r9 */
+	EMIT2(0x41, 0x51);
+	/* Save r10 */
+	EMIT2(0x41, 0x52);
+	/* Save r11 */
+	EMIT2(0x41, 0x53);
+	/* We have pushed 72 bytes, realign stack to 16 bytes: sub rsp, 8 */
+	EMIT4(0x48, 0x83, 0xEC, 8);
+
+	/* mov rdi, addr_reg */
+	EMIT_mov(BPF_REG_1, addr_reg);
+
+	/* add rdi, off (if offset is non-zero) */
+	if (off) {
+		if (is_imm8(off)) {
+			/* add rdi, imm8 */
+			EMIT4(0x48, 0x83, 0xC7, (u8)off);
+		} else {
+			/* add rdi, imm32 */
+			EMIT3_off32(0x48, 0x81, 0xC7, off);
+		}
+	}
+
+	/* Adjust ip to account for the instrumentation generated so far */
+	ip += (prog - *pprog);
+	/* call kasan_func */
+	if (emit_call(&prog, kasan_func, ip))
+		return -ERANGE;
+
+	/* Restore registers */
+	EMIT4(0x48, 0x83, 0xC4, 8);
+	EMIT2(0x41, 0x5B);
+	EMIT2(0x41, 0x5A);
+	EMIT2(0x41, 0x59);
+	EMIT2(0x41, 0x58);
+	EMIT1(0x5F);
+	EMIT1(0x5E);
+	EMIT1(0x5A);
+	EMIT1(0x59);
+	EMIT1(0x58);
+
+	*pprog = prog;
+#endif /* CONFIG_BPF_JIT_KASAN */
+	return 0;
+}
+
 static int emit_atomic_rmw(u8 **pprog, u32 atomic_op,
 			   u32 dst_reg, u32 src_reg, s16 off, u8 bpf_size)
 {

-- 
2.53.0



* [PATCH RFC bpf-next 3/8] bpf: add BPF_JIT_KASAN for KASAN instrumentation of JITed programs
From: Alexis Lothoré (eBPF Foundation) @ 2026-04-13 18:28 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Song Liu, Yonghong Song, Jiri Olsa, John Fastabend,
	David S. Miller, David Ahern, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Shuah Khan,
	Maxime Coquelin, Alexandre Torgue, Andrey Ryabinin,
	Alexander Potapenko, Andrey Konovalov, Dmitry Vyukov,
	Vincenzo Frascino, Andrew Morton
  Cc: ebpf, Bastien Curutchet, Thomas Petazzoni, Xu Kuohai, bpf,
	linux-kernel, netdev, linux-kselftest, linux-stm32,
	linux-arm-kernel, kasan-dev, linux-mm,
	Alexis Lothoré (eBPF Foundation)
In-Reply-To: <20260413-kasan-v1-0-1a5831230821@bootlin.com>

Add a new Kconfig option CONFIG_BPF_JIT_KASAN that automatically enables
KASAN (Kernel Address Sanitizer) memory access checks for JIT-compiled
BPF programs when both KASAN and the JIT compiler are enabled. When
enabled, the JIT compiler emits shadow memory checks before memory
loads and stores to detect use-after-free, out-of-bounds, and other
memory safety bugs at runtime. The option is gated behind
HAVE_EBPF_JIT_KASAN, as it needs a proper arch-specific implementation.

Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com>
---
 kernel/bpf/Kconfig | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/kernel/bpf/Kconfig b/kernel/bpf/Kconfig
index eb3de35734f0..28392adb3d7e 100644
--- a/kernel/bpf/Kconfig
+++ b/kernel/bpf/Kconfig
@@ -17,6 +17,10 @@ config HAVE_CBPF_JIT
 config HAVE_EBPF_JIT
 	bool
 
+# KASAN support for JIT compiler
+config HAVE_EBPF_JIT_KASAN
+	bool
+
 # Used by archs to tell that they want the BPF JIT compiler enabled by
 # default for kernels that were compiled with BPF JIT support.
 config ARCH_WANT_DEFAULT_BPF_JIT
@@ -101,4 +105,9 @@ config BPF_LSM
 
 	  If you are unsure how to answer this question, answer N.
 
+config BPF_JIT_KASAN
+	bool
+	depends on HAVE_EBPF_JIT_KASAN
+	default y if BPF_JIT && KASAN_GENERIC
+
 endmenu # "BPF subsystem"

-- 
2.53.0



* [PATCH RFC bpf-next 2/8] bpf: mark instructions accessing program stack
From: Alexis Lothoré (eBPF Foundation) @ 2026-04-13 18:28 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Song Liu, Yonghong Song, Jiri Olsa, John Fastabend,
	David S. Miller, David Ahern, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Shuah Khan,
	Maxime Coquelin, Alexandre Torgue, Andrey Ryabinin,
	Alexander Potapenko, Andrey Konovalov, Dmitry Vyukov,
	Vincenzo Frascino, Andrew Morton
  Cc: ebpf, Bastien Curutchet, Thomas Petazzoni, Xu Kuohai, bpf,
	linux-kernel, netdev, linux-kselftest, linux-stm32,
	linux-arm-kernel, kasan-dev, linux-mm,
	Alexis Lothoré (eBPF Foundation)
In-Reply-To: <20260413-kasan-v1-0-1a5831230821@bootlin.com>

In order to prepare to emit KASAN checks in JITed programs, JIT
compilers need to know whether a given load/store instruction targets
the BPF program stack, as those accesses should not be monitored (we
already have guard pages for that, and it is difficult anyway to
correctly monitor any kind of data passed on the stack).

To support this need, make the BPF verifier mark the instructions that
access the program stack:
- add a setter that allows the verifier to mark instructions accessing
  the program stack
- add a getter that allows JIT compilers to check whether instructions
  being JITed are accessing the stack
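
The setter/getter pair can be pictured with the following user-space
sketch. The struct and function names are simplified stand-ins (not the
kernel's definitions), and the bounds checks are an addition made for
the example only.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the per-instruction verifier data. */
struct insn_aux {
	bool accesses_stack;
};

/* Simplified stand-in for the verifier environment. */
struct verifier_env {
	struct insn_aux *aux;
	size_t insn_cnt;
};

/* Setter: called by the verifier with an index into the whole program. */
static void mark_accesses_stack(struct verifier_env *env, size_t idx)
{
	if (idx < env->insn_cnt)
		env->aux[idx].accesses_stack = true;
}

/*
 * Getter: called by the JIT with an index relative to the subprogram
 * being compiled, hence the subprog_start offset.
 */
static bool insn_accesses_stack(const struct verifier_env *env,
				size_t subprog_start, size_t insn_idx)
{
	size_t idx = subprog_start + insn_idx;

	if (!env || idx >= env->insn_cnt)
		return false;
	return env->aux[idx].accesses_stack;
}
```

The point of the sketch is the index translation: the verifier marks
instruction 5 of the whole program, and a JIT compiling a subprogram
that starts at instruction 3 finds the mark at local index 2.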

Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com>
---
 include/linux/bpf.h          |  2 ++
 include/linux/bpf_verifier.h |  2 ++
 kernel/bpf/core.c            | 10 ++++++++++
 kernel/bpf/verifier.c        |  7 +++++++
 4 files changed, 21 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index b4b703c90ca9..774a0395c498 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1543,6 +1543,8 @@ void bpf_jit_uncharge_modmem(u32 size);
 bool bpf_prog_has_trampoline(const struct bpf_prog *prog);
 bool bpf_insn_is_indirect_target(const struct bpf_verifier_env *env, const struct bpf_prog *prog,
 				 int insn_idx);
+bool bpf_insn_accesses_stack(const struct bpf_verifier_env *env,
+			     const struct bpf_prog *prog, int insn_idx);
 #else
 static inline int bpf_trampoline_link_prog(struct bpf_tramp_link *link,
 					   struct bpf_trampoline *tr,
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index b148f816f25b..ab99ed4c4227 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -660,6 +660,8 @@ struct bpf_insn_aux_data {
 	u16 const_reg_map_mask;
 	u16 const_reg_subprog_mask;
 	u32 const_reg_vals[10];
+	/* instruction accesses stack */
+	bool accesses_stack;
 };
 
 #define MAX_USED_MAPS 64 /* max number of maps accessed by one eBPF program */
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 8b018ff48875..340abfdadbed 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1582,6 +1582,16 @@ bool bpf_insn_is_indirect_target(const struct bpf_verifier_env *env, const struc
 	insn_idx += prog->aux->subprog_start;
 	return env->insn_aux_data[insn_idx].indirect_target;
 }
+
+bool bpf_insn_accesses_stack(const struct bpf_verifier_env *env,
+			     const struct bpf_prog *prog, int insn_idx)
+{
+	if (!env)
+		return false;
+	insn_idx += prog->aux->subprog_start;
+	return env->insn_aux_data[insn_idx].accesses_stack;
+}
+
 #endif /* CONFIG_BPF_JIT */
 
 /* Base function for offset calculation. Needs to go into .text section,
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 1e36b9e91277..7bce4fb4e540 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3502,6 +3502,11 @@ static void mark_indirect_target(struct bpf_verifier_env *env, int idx)
 	env->insn_aux_data[idx].indirect_target = true;
 }
 
+static void mark_insn_accesses_stack(struct bpf_verifier_env *env, int idx)
+{
+	env->insn_aux_data[idx].accesses_stack = true;
+}
+
 #define LR_FRAMENO_BITS	3
 #define LR_SPI_BITS	6
 #define LR_ENTRY_BITS	(LR_SPI_BITS + LR_FRAMENO_BITS + 1)
@@ -6490,6 +6495,8 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
 		else
 			err = check_stack_write(env, regno, off, size,
 						value_regno, insn_idx);
+
+		mark_insn_accesses_stack(env, insn_idx);
 	} else if (reg_is_pkt_pointer(reg)) {
 		if (t == BPF_WRITE && !may_access_direct_pkt_data(env, NULL, t)) {
 			verbose(env, "cannot write into packet\n");

-- 
2.53.0



* [PATCH RFC bpf-next 0/8] bpf: add support for KASAN checks in JITed programs
From: Alexis Lothoré (eBPF Foundation) @ 2026-04-13 18:28 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Song Liu, Yonghong Song, Jiri Olsa, John Fastabend,
	David S. Miller, David Ahern, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Shuah Khan,
	Maxime Coquelin, Alexandre Torgue, Andrey Ryabinin,
	Alexander Potapenko, Andrey Konovalov, Dmitry Vyukov,
	Vincenzo Frascino, Andrew Morton
  Cc: ebpf, Bastien Curutchet, Thomas Petazzoni, Xu Kuohai, bpf,
	linux-kernel, netdev, linux-kselftest, linux-stm32,
	linux-arm-kernel, kasan-dev, linux-mm,
	Alexis Lothoré (eBPF Foundation)

Hello,
this series aims to bring basic support for KASAN checks to BPF JITed
programs. This follows the first RFC posted in [1].

KASAN helps spot memory management mistakes by reserving a fraction
of memory as "shadow memory" that maps to the rest of the memory and
allows monitoring it. Each memory-accessing instruction is then
instrumented at build time to call an ASAN check function, which
analyzes the corresponding bits in shadow memory and, if it detects
that the access is invalid, triggers a detailed report. The goal of
this series is to replicate this mechanism for BPF programs when they
are JITed into native instructions: it is then the (runtime) JIT
compiler that is in charge of inserting calls to the corresponding
KASAN checks when a program is loaded into the kernel. This task
involves:
- identifying at program load time the instructions performing memory
  accesses
- identifying the properties of those accesses (size, read or write) to
  select the relevant KASAN check function to call
- just before the identified instructions:
  - saving the basic context (i.e. saving registers)
  - inserting a call to the relevant KASAN check function
  - restoring the context
- whenever the instrumented program executes, if it performs an invalid
  access, it triggers a KASAN report identical to those triggered by
  kernel code instrumented at build time.
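
For readers unfamiliar with shadow memory, here is a toy user-space
model in C of the kind of check performed before each access. It follows
the generic KASAN convention of one shadow byte per 8-byte granule, but
the arena size, names and byte-by-byte loop are assumptions made for the
example, not the kernel implementation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GRANULE_SHIFT 3
#define GRANULE_SIZE  (1u << GRANULE_SHIFT)
#define ARENA_SIZE    64u	/* hypothetical tracked region */

/* 0: granule fully accessible, 1..7: first N bytes accessible, <0: poisoned */
static int8_t shadow[ARENA_SIZE >> GRANULE_SHIFT];

/* Poison [off, off + len); off and len must be granule-aligned. */
static void poison(size_t off, size_t len)
{
	memset(&shadow[off >> GRANULE_SHIFT], -1, len >> GRANULE_SHIFT);
}

/* Make [off, off + len) accessible; off must be granule-aligned. */
static void unpoison(size_t off, size_t len)
{
	size_t full = len >> GRANULE_SHIFT;
	size_t rest = len & (GRANULE_SIZE - 1);

	memset(&shadow[off >> GRANULE_SHIFT], 0, full);
	if (rest)
		shadow[(off >> GRANULE_SHIFT) + full] = (int8_t)rest;
}

/* Check one byte against its shadow byte. */
static bool byte_ok(size_t off)
{
	int8_t s = shadow[off >> GRANULE_SHIFT];

	if (s == 0)
		return true;
	if (s < 0)
		return false;
	return (off & (GRANULE_SIZE - 1)) < (size_t)s;
}

/* Model of a load/store check: every accessed byte must be accessible. */
static bool access_ok(size_t off, size_t size)
{
	if (size == 0 || off + size > ARENA_SIZE)
		return false;
	for (size_t i = 0; i < size; i++)
		if (!byte_ok(off + i))
			return false;
	return true;
}
```

In this model, "allocating" 12 bytes at offset 16 makes an 8-byte read at
offset 16 valid, while an 8-byte read at offset 24 runs past the object
and is rejected, which is the out-of-bounds case; poisoning the region
again models a free, after which any access is a use-after-free.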

As discussed in [1], this series is based on some choices and
assumptions:
- it focuses on x86_64 for now, and so only on KASAN_GENERIC
- not all memory accessing BPF instructions are being instrumented:
  - it focuses on STX/LDX instructions
  - it discards instructions accessing BPF program stack (already
    monitored by page guards)
  - it discards possibly faulting instructions, like BPF_PROBE_MEM or
    BPF_PROBE_ATOMIC insns

The series is marked and sent as RFC:
- to allow collecting feedback early and make sure that it goes in the
  right direction
- because it depends on Xu's work to pass data between the verifier and
  JIT compilers. This work is not merged yet, see [2]. I have been
  tracking the various revisions he sent on the ML and based my local
  branch on his work
- because tests brought by this series currently can't run on BPF CI:
  they expect kasan multishot to be enabled, otherwise the first test
  will make all other kasan-related tests fail.
- because some cases like atomic loads/stores are not instrumented yet
  (and are still making me scratch my head)
- because it will hopefully provide a good basis to discuss the topic at
  LSFMMBPF (see [3])

Despite this series not being ready for integration yet, anyone
interested in running it locally can perform the following steps to run
the JITed KASAN instrumentation selftests:
- rebase this series locally on [2]
- build and run the corresponding kernel with kasan_multi_shot
  enabled
- run `test_progs -a kasan`

This should execute a variety of KASAN tests for BPF programs:

  #162/1   kasan/bpf_kasan_uaf_read_1:OK
  #162/2   kasan/bpf_kasan_uaf_read_2:OK
  #162/3   kasan/bpf_kasan_uaf_read_4:OK
  #162/4   kasan/bpf_kasan_uaf_read_8:OK
  #162/5   kasan/bpf_kasan_uaf_write_1:OK
  #162/6   kasan/bpf_kasan_uaf_write_2:OK
  #162/7   kasan/bpf_kasan_uaf_write_4:OK
  #162/8   kasan/bpf_kasan_uaf_write_8:OK
  #162/9   kasan/bpf_kasan_oob_read_1:OK
  #162/10  kasan/bpf_kasan_oob_read_2:OK
  #162/11  kasan/bpf_kasan_oob_read_4:OK
  #162/12  kasan/bpf_kasan_oob_read_8:OK
  #162/13  kasan/bpf_kasan_oob_write_1:OK
  #162/14  kasan/bpf_kasan_oob_write_2:OK
  #162/15  kasan/bpf_kasan_oob_write_4:OK
  #162/16  kasan/bpf_kasan_oob_write_8:OK
  #162     kasan:OK
  Summary: 1/16 PASSED, 0 SKIPPED, 0 FAILED

[1] https://lore.kernel.org/bpf/DG7UG112AVBC.JKYISDTAM30T@bootlin.com/
[2] https://lore.kernel.org/bpf/cover.1776062885.git.xukuohai@hotmail.com/
[3] https://lore.kernel.org/bpf/DGGNCXX79H8O.2P6K8L1QW1M8K@bootlin.com/

Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com>
---
Alexis Lothoré (eBPF Foundation) (8):
      kasan: expose generic kasan helpers
      bpf: mark instructions accessing program stack
      bpf: add BPF_JIT_KASAN for KASAN instrumentation of JITed programs
      bpf, x86: add helper to emit kasan checks in x86 JITed programs
      bpf, x86: emit KASAN checks into x86 JITed programs
      selftests/bpf: do not run verifier JIT tests when BPF_JIT_KASAN is enabled
      bpf, x86: enable KASAN for JITed programs on x86
      selftests/bpf: add tests to validate KASAN on JIT programs

 arch/x86/Kconfig                                   |   1 +
 arch/x86/net/bpf_jit_comp.c                        | 106 +++++++++++++
 include/linux/bpf.h                                |   2 +
 include/linux/bpf_verifier.h                       |   2 +
 include/linux/kasan.h                              |  13 ++
 kernel/bpf/Kconfig                                 |   9 ++
 kernel/bpf/core.c                                  |  10 ++
 kernel/bpf/verifier.c                              |   7 +
 mm/kasan/kasan.h                                   |  10 --
 tools/testing/selftests/bpf/prog_tests/kasan.c     | 165 +++++++++++++++++++++
 tools/testing/selftests/bpf/progs/kasan.c          | 146 ++++++++++++++++++
 .../testing/selftests/bpf/test_kmods/bpf_testmod.c |  79 ++++++++++
 tools/testing/selftests/bpf/test_loader.c          |   5 +
 tools/testing/selftests/bpf/unpriv_helpers.c       |   5 +
 tools/testing/selftests/bpf/unpriv_helpers.h       |   1 +
 15 files changed, 551 insertions(+), 10 deletions(-)
---
base-commit: 7990a071b32887a1a883952e8cf60134b6d6fea0
change-id: 20260126-kasan-fcd68f64cd7b

Best regards,
--  
Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com>



* [PATCH RFC bpf-next 1/8] kasan: expose generic kasan helpers
From: Alexis Lothoré (eBPF Foundation) @ 2026-04-13 18:28 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Song Liu, Yonghong Song, Jiri Olsa, John Fastabend,
	David S. Miller, David Ahern, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Shuah Khan,
	Maxime Coquelin, Alexandre Torgue, Andrey Ryabinin,
	Alexander Potapenko, Andrey Konovalov, Dmitry Vyukov,
	Vincenzo Frascino, Andrew Morton
  Cc: ebpf, Bastien Curutchet, Thomas Petazzoni, Xu Kuohai, bpf,
	linux-kernel, netdev, linux-kselftest, linux-stm32,
	linux-arm-kernel, kasan-dev, linux-mm,
	Alexis Lothoré (eBPF Foundation)
In-Reply-To: <20260413-kasan-v1-0-1a5831230821@bootlin.com>

In order to prepare KASAN helpers to be called from the eBPF subsystem
(to add KASAN instrumentation at runtime when JITing eBPF programs),
expose the __asan_{load,store}X functions in linux/kasan.h.

Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com>
---
 include/linux/kasan.h | 13 +++++++++++++
 mm/kasan/kasan.h      | 10 ----------
 2 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index 338a1921a50a..6f580d4a39e4 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -710,4 +710,17 @@ void kasan_non_canonical_hook(unsigned long addr);
 static inline void kasan_non_canonical_hook(unsigned long addr) { }
 #endif /* CONFIG_KASAN_GENERIC || CONFIG_KASAN_SW_TAGS */
 
+#ifdef CONFIG_KASAN_GENERIC
+void __asan_load1(void *p);
+void __asan_store1(void *p);
+void __asan_load2(void *p);
+void __asan_store2(void *p);
+void __asan_load4(void *p);
+void __asan_store4(void *p);
+void __asan_load8(void *p);
+void __asan_store8(void *p);
+void __asan_load16(void *p);
+void __asan_store16(void *p);
+#endif /* CONFIG_KASAN_GENERIC */
+
 #endif /* LINUX_KASAN_H */
diff --git a/mm/kasan/kasan.h b/mm/kasan/kasan.h
index fc9169a54766..3bfce8eb3135 100644
--- a/mm/kasan/kasan.h
+++ b/mm/kasan/kasan.h
@@ -594,16 +594,6 @@ void __asan_handle_no_return(void);
 void __asan_alloca_poison(void *, ssize_t size);
 void __asan_allocas_unpoison(void *stack_top, ssize_t stack_bottom);
 
-void __asan_load1(void *);
-void __asan_store1(void *);
-void __asan_load2(void *);
-void __asan_store2(void *);
-void __asan_load4(void *);
-void __asan_store4(void *);
-void __asan_load8(void *);
-void __asan_store8(void *);
-void __asan_load16(void *);
-void __asan_store16(void *);
 void __asan_loadN(void *, ssize_t size);
 void __asan_storeN(void *, ssize_t size);
 

-- 
2.53.0



* [PATCH net] ixgbevf: fix use-after-free in VEPA multicast source pruning
From: Michael Bommarito @ 2026-04-13 18:24 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: Tony Nguyen, Przemek Kitszel, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, netdev, stable,
	linux-kernel, Michael Bommarito

ixgbevf_clean_rx_irq() prunes frames whose source MAC matches the VF's
own address (VEPA multicast workaround) by freeing the skb and
continuing to the next descriptor:

    dev_kfree_skb_irq(skb);
    continue;

The skb pointer is declared outside the while loop and persists across
iterations.  Because the continue skips the "skb = NULL" reset at the
bottom of the loop, the next iteration enters the "else if (skb)" path
and calls ixgbevf_add_rx_frag() on the freed skb, dereferencing
skb_shinfo(skb)->nr_frags — a use-after-free in NAPI softirq context.

The sibling driver iavf already handles this correctly by nulling the
pointer before continuing.  Apply the same pattern here.
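
The drop-and-continue pattern can be modelled in user space as follows.
This is only a sketch: struct pkt, the drop[] array and the freed flag
are hypothetical stand-ins for the skb, the pruning condition and
dev_kfree_skb_irq(), so that the stale access can be counted safely
instead of actually touching freed memory.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for an skb; 'freed' is set instead of really freeing it. */
struct pkt {
	bool freed;
	int nr_frags;
};

/*
 * Walk n descriptors. drop[i] tells whether the packet is pruned at
 * descriptor i. Returns how many times a freed buffer was touched.
 */
static int count_stale_uses(struct pkt *pool, const bool *drop, size_t n,
			    bool reset_on_drop)
{
	struct pkt *cur = NULL;	/* persists across iterations, like skb */
	int stale = 0;

	for (size_t i = 0; i < n; i++) {
		if (cur) {
			/* "add fragment" path on the buffer kept from before */
			if (cur->freed)
				stale++;
			else
				cur->nr_frags++;
		} else {
			cur = &pool[i];
			cur->freed = false;
			cur->nr_frags = 0;
		}

		if (drop[i]) {
			cur->freed = true;	/* models dev_kfree_skb_irq() */
			if (reset_on_drop)
				cur = NULL;	/* the one-line fix */
			continue;		/* skips the reset below */
		}

		/* packet handed to the stack, start fresh next time */
		cur = NULL;
	}
	return stale;
}
```

With the first descriptor dropped, the variant without the reset
touches the freed buffer on the next iteration, while the variant with
the reset never does.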

I do not have ixgbevf hardware; the bug was found by static analysis
(scan_drop_continue_loops.py + semgrep drop_continue_in_loop, multi-tool
corroboration with the highest score in the scan).  The UAF was confirmed
under KASAN by loading a test module that reproduces the exact code
pattern (alloc skb, kfree_skb, then read skb_shinfo(skb)->nr_frags):

  BUG: KASAN: slab-use-after-free in ixgbevf_uaf_test_init+0x100/0x1000
  Read of size 8 at addr 000000006163ae78 by task insmod/30
  freed 208-byte region [000000006163adc0, 000000006163ae90)

QEMU emulates igb (82576) but not ixgbe (82599), and the igbvf VF
driver does not include the VEPA source pruning path, so a full
end-to-end reproduction with emulated hardware was not possible.

Fixes: bad17234ba70 ("ixgbevf: Change receive model to use double buffered page based receives")
Cc: stable@vger.kernel.org
Assisted-by: Claude:claude-opus-4-6
Assisted-by: Codex:gpt-5-4
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 42f89a179a3f..4ba3be961ab6 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -1221,6 +1221,7 @@ static int ixgbevf_clean_rx_irq(struct ixgbevf_q_vector *q_vector,
 		    ether_addr_equal(rx_ring->netdev->dev_addr,
 				     eth_hdr(skb)->h_source)) {
 			dev_kfree_skb_irq(skb);
+			skb = NULL;
 			continue;
 		}

-- 
2.53.0


* [PATCH net-next v2] net: check qdisc_pkt_len_segs_init() return value on ingress
From: David Carlier @ 2026-04-13 18:22 UTC (permalink / raw)
  To: Jakub Kicinski, David S . Miller, Eric Dumazet, Paolo Abeni
  Cc: Simon Horman, Stanislav Fomichev, Kuniyuki Iwashima,
	Samiullah Khawaja, Hangbin Liu, Krishna Kumar, netdev,
	linux-kernel, David Carlier

Commit 7fb4c1967011 ("net: pull headers in qdisc_pkt_len_segs_init()")
changed qdisc_pkt_len_segs_init() to return an skb drop reason when
it detects malicious GSO packets. The egress path in __dev_queue_xmit()
checks this return value and drops bad packets, but the ingress path in
sch_handle_ingress() ignores it.

This means malformed GSO packets entering via TC ingress are not dropped
and could be redirected to another interface or cause incorrect qdisc
accounting.

Check the return value and drop the packet when a bad GSO is detected.

Fixes: 7fb4c1967011 ("net: pull headers in qdisc_pkt_len_segs_init()")
Signed-off-by: David Carlier <devnexen@gmail.com>
---

v1 -> v2: reorder variable declarations for reverse xmas tree
v1: https://lore.kernel.org/netdev/20260408172307.46498-1-devnexen@gmail.com/
 net/core/dev.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 5a31f9d2128c..d11c22cafca9 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4459,8 +4459,8 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
 		   struct net_device *orig_dev, bool *another)
 {
 	struct bpf_mprog_entry *entry = rcu_dereference_bh(skb->dev->tcx_ingress);
-	enum skb_drop_reason drop_reason = SKB_DROP_REASON_TC_INGRESS;
 	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
+	enum skb_drop_reason drop_reason;
 	int sch_ret;
 
 	if (!entry)
@@ -4472,7 +4472,15 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
 		*pt_prev = NULL;
 	}
 
-	qdisc_pkt_len_segs_init(skb);
+	drop_reason = qdisc_pkt_len_segs_init(skb);
+	if (unlikely(drop_reason)) {
+		kfree_skb_reason(skb, drop_reason);
+		*ret = NET_RX_DROP;
+		bpf_net_ctx_clear(bpf_net_ctx);
+		return NULL;
+	}
+
+	drop_reason = SKB_DROP_REASON_TC_INGRESS;
 	tcx_set_ingress(skb, true);
 
 	if (static_branch_unlikely(&tcx_needed_key)) {
-- 
2.53.0


