Netdev List


* Re: [PATCH net v2] octeontx2-pf: check DMAC extraction support before filtering
From: Harshitha Ramamurthy @ 2026-06-27  0:50 UTC (permalink / raw)
  To: nshettyj
  Cc: netdev, linux-kernel, sgoutham, gakula, sbhatta, hkelam,
	bbhushan2, andrew+netdev, davem, edumazet, kuba, pabeni, naveenm,
	tduszynski, sumang
In-Reply-To: <20260626062329.871990-1-nshettyj@marvell.com>

On Thu, Jun 25, 2026 at 11:24 PM <nshettyj@marvell.com> wrote:
>
> From: Suman Ghosh <sumang@marvell.com>
>
> Currently, configuring a VF MAC address via the PF (e.g., 'ip link
> set <pf> vf 0 mac <mac>') blindly attempts to install a DMAC-based
> hardware filter. However, the hardware parser profile might not
> support DMAC extraction.
>
> Check if the hardware parsing profile supports DMAC extraction
> before adding the filter. Additionally, emit a warning message
> to inform the operator if the MAC filter installation fails due
> to missing DMAC extraction support.
>
> Fixes: f0c2982aaf98 ("octeontx2-pf: Add support for SR-IOV management functions")
> Signed-off-by: Suman Ghosh <sumang@marvell.com>
> Signed-off-by: Nitin Shetty J <nshettyj@marvell.com>
>
> ---
> v2:
>  - Move the DMAC extraction check from otx2_set_vf_mac() into
>    otx2_do_set_vf_mac() which already holds pf->mbox.lock, so all
>    mbox operations are under a single lock/unlock pair. All error
>    paths now use the existing goto-out pattern, eliminating the
>    scattered mutex_unlock() + return calls from v1.
>  - Return -EOPNOTSUPP instead of 0 when DMAC extraction is not
>    supported, so the caller gets an explicit error rather than a
>    silent success.

Please ensure a minimum of 24 hr gap before posting a new revision and
also don't post patches in reply to a previous posting as documented
in:

https://www.kernel.org/doc/html/next/process/maintainer-netdev.html

> ---
>  .../ethernet/marvell/octeontx2/nic/otx2_pf.c  | 33 +++++++++++++++++++
>  1 file changed, 33 insertions(+)
>
> diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c
> index b63df5737ff2..dc7e4a225dd0 100644
> --- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c
> +++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c
> @@ -2517,10 +2517,43 @@ EXPORT_SYMBOL(otx2_config_hwtstamp_set);
>
>  static int otx2_do_set_vf_mac(struct otx2_nic *pf, int vf, const u8 *mac)
>  {
> +       struct npc_get_field_status_req *freq;
> +       struct npc_get_field_status_rsp *frsp;
>         struct npc_install_flow_req *req;
>         int err;
>
>         mutex_lock(&pf->mbox.lock);
> +
> +       /* Skip installing the DMAC filter if the hardware parser profile
> +        * does not support DMAC extraction.
> +        */
> +       freq = otx2_mbox_alloc_msg_npc_get_field_status(&pf->mbox);
> +       if (!freq) {
> +               err = -ENOMEM;
> +               goto out;
> +       }

I noticed that otx2_set_vf_mac() copies the MAC address into the vf
config structure before the programming is successful. Is that
intended?

> +
> +       freq->field = NPC_DMAC;
> +       if (otx2_sync_mbox_msg(&pf->mbox)) {
> +               err = -EINVAL;
> +               goto out;
> +       }
> +
> +       frsp = (struct npc_get_field_status_rsp *)otx2_mbox_get_rsp
> +              (&pf->mbox.mbox, 0, &freq->hdr);
> +       if (IS_ERR(frsp)) {
> +               err = PTR_ERR(frsp);
> +               goto out;
> +       }
> +
> +       if (!frsp->enable) {
> +               netdev_warn(pf->netdev,
> +                           "VF %d MAC filter not installed: DMAC extraction not supported by parser profile\n",
> +                           vf);

Would a netdev_warn_ratelimited() be better here to avoid spamming the log?

> +               err = -EOPNOTSUPP;
> +               goto out;
> +       }
> +
>         req = otx2_mbox_alloc_msg_npc_install_flow(&pf->mbox);
>         if (!req) {
>                 err = -ENOMEM;
> --
> 2.48.1
>
>

^ permalink raw reply

* Re: [PATCH net 0/7] xsk: fix AF_XDP multi-buffer Tx descriptor reclaim
From: Jason Xing @ 2026-06-27  0:47 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Maciej Fijalkowski, netdev, bpf, magnus.karlsson, stfomichev,
	kuba, pabeni, horms, bjorn
In-Reply-To: <aj6mr-cIcF2Tg73r@devvm7509.cco0.facebook.com>

On Sat, Jun 27, 2026 at 12:30 AM Stanislav Fomichev
<sdf.kernel@gmail.com> wrote:
>
> On 06/26, Jason Xing wrote:
> > On Fri, Jun 26, 2026 at 12:05 AM Stanislav Fomichev
> > <sdf.kernel@gmail.com> wrote:
> > >
> > > On 06/25, Jason Xing wrote:
> > > > On Thu, Jun 25, 2026 at 12:37 AM Maciej Fijalkowski
> > > > <maciej.fijalkowski@intel.com> wrote:
> > > > >
> > > > > On Wed, Jun 24, 2026 at 08:38:20AM -0700, Stanislav Fomichev wrote:
> > > > > > On 06/23, Maciej Fijalkowski wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > This series fixes several AF_XDP multi-buffer Tx paths where descriptors
> > > > > > > consumed from the Tx ring are not consistently returned to userspace
> > > > > > > through the completion ring when the packet is later dropped as invalid.
> > > > > > >
> > > > > > > The affected cases are invalid or oversized multi-buffer Tx packets in
> > > > > > > both the generic and zero-copy paths. In these cases, the kernel can
> > > > > > > consume one or more Tx descriptors while building or validating a
> > > > > > > multi-buffer packet, then drop the packet before it reaches the device.
> > > > > > > Userspace still owns the UMEM buffers only after the corresponding
> > > > > > > addresses are returned through the CQ. Missing completions therefore
> > > > > > > make userspace lose track of those buffers.
> > > > > > >
> > > > > > > The generic path fixes cover three related cases:
> > > > > > > * partially built multi-buffer skbs dropped by xsk_drop_skb();
> > > > > > > * continuation descriptors left in the Tx ring after xsk_build_skb()
> > > > > > >   reports overflow;
> > > > > > > * invalid descriptors encountered in the middle of a multi-buffer
> > > > > > >   packet, including the offending invalid descriptor itself.
> > > > > > >
> > > > > > > The zero-copy path is handled separately. The batched Tx parser now
> > > > > > > distinguishes descriptors that can be passed to the driver from
> > > > > > > descriptors that are consumed only because they belong to an invalid
> > > > > > > multi-buffer packet. Reclaim-only descriptors are written to the CQ
> > > > > > > address area and published in completion order, after any earlier
> > > > > > > driver-visible Tx descriptors.
> > > > > > >
> > > > > > > The ZC batching path can also retain drain state when userspace has not
> > > > > > > yet provided the end of an invalid multi-buffer packet. To keep this
> > > > > > > state local to the singular batched path, the series prevents a second
> > > > > > > Tx socket from joining the same pool while such drain state exists.
> > > > > > > During the singular-to-shared transition, Tx batching is gated,
> > > > > > > pre-existing readers are waited out, and bind fails with -EAGAIN if the
> > > > > > > existing socket still has pending drain state. This avoids adding
> > > > > > > multi-buffer drain handling to the shared-UMEM fallback path.
> > > > > > >
> > > > > > > The last two patches update xskxceiver so the tests account invalid
> > > > > > > multi-buffer Tx packets as descriptors that must be reclaimed, while
> > > > > > > still not expecting those invalid packets on the Rx side.
> > > > > > >
> > > > > > > This is a follow-up to Jason's changes [0] which were addressing generic
> > > > > > > xmit only and this set allows me to pass full xskxceiver test suite run
> > > > > > > against ice driver.
> > > > > >
> > > > > > There is a fair amount of feedback from sashiko already :-( So the meta
> > > > > > question from me is: is it time to scrap our current approach where
> > > > > > we parse descriptor by descriptor? (and maintain half-baked skb and
> > > > > > half-consumed descriptor queues)
> > > > > >
> > > > > > Should we:
> > > > > >
> > > > > > 1. do desc[MAX_SKB_FRAGS] and xskq_cons_peek_desc until we exhaust
> > > > > > PKT_CONT (if the last packet has PKT_CONT, return EOVERFLOW to userspace
> > > > > > and do a full stop here)
> > > > > > 2. now that we really know the number of valid descriptors -> reserve
> > > > > > the cq space (if not -> EAGAIN)
> > > > > > 3. pre-allocate everything here (if at any point we have ENOMEM -> cleanup
> > > > > > locally, don't ever create semi-initialized skb)
> > > > > > 4. construct the skb
> > > > > > 5. xmit
> > > > >
> > > > > Yeah generic xmit became utterly horrible, haven't gone through sashiko
> > > > > reviews yet, but bear in mind this set also aligns zc side to what was
> > > > > previously being addressed by Jason.
> > > > >
> > > > > I believe planned logistics were to get these fixes onto net and then
> > > > > Jason had an implementation of batching on generic xmit, directed towards
> > > > > -next and that's where we could address current flow.
> > > >
> > > > Agreed. That's what I'm hoping for. There would be much more
> > > > discussion on how to do batch xmit in an elegant way, I believe.
> > >
> > > This doesn't have to depend on the batch rewrite, we should be able to rewrite
> > > this non-zc in net, this is still technically fixes, not feature work..
> > >
> > > There were already a couple of revisions with this drain_cont approach
> > > and every time I look at it, it feels like the cure is worse than the
> > > disease :-( Obviously not gonna stop you from going with the current approach,
> > > but these fixes feel a bit of a wasted effort to me (since the bugs keep
> > > coming and we are piling more complexity).
> >
> > I see your point, but rewriting is something that cannot be easily
> > applied to the stable branches? Until now, we fix issues one by one
> > which have an explicit target branch (because of the fixes tag). Cross
> > fingers :(
> >
> > Sashiko has the magic to find out the hidden bugs more than ever and
> > AF_XDP is not the only place where a pile of reports are coming in.
>
> net vs net-next is fixes vs feature work. If we can't fix the current
> code, I think we can justify a rewrite using a better approach and
> route it via net. This series is 7 patches anyway, it's not like
> it is a quick short fix :-) But I'm ok with pushing it as is, I'm just
> trying to see if someone on your side is fed up with that part as well
> and wants to fix it "properly" :-p
>
> > My take is that batch xmit has been pending too long and at least so
> > far fewer and fewer bugs are found by sashiko. I believe if the mode is
> > changed to batch xmit, there are likely to be new and challenging
> > problems to discuss. I prefer to solve questions of the batch xmit
> > series.
>
> We can redo this part separately, without batching. Move from "read
> one chunk at a time" to "pre-read all chunks". Batching vs current issue
> are separate.

If the implementation of 'pre-read' is clear and simple, yes, it's a
better way. (I really want to engage myself in this right now, but
sorry, I can't since I'm writing many slides for Netdev.)

Probably Maciej will give it one last try; we'll see then.

>
> > BTW, would you both come to Netdev 0x1a next month? I believe we could
> > sit around the table and discuss some future plans there (in xdp
> > workshop?).
> > https://netdevconf.info/0x1A/sessions/workshop/xdp-workshop.html
>
> Yes, I plan to be there in person.

Great.

Thanks,
Jason

^ permalink raw reply
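
Step 1 of the "pre-read all chunks" flow proposed in the thread above can be sketched in plain userspace C. This is a minimal illustration under stated assumptions, not kernel code: `struct desc`, `DESC_PKT_CONT`, `MAX_FRAGS` and `preread_packet()` are hypothetical stand-ins for the real descriptor layout, the PKT_CONT flag, MAX_SKB_FRAGS and the Tx ring peek helpers. Returning -EAGAIN for a packet whose end has not been posted yet is also an assumption of this sketch; the thread only specifies -EOVERFLOW for the oversized case.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_FRAGS     17u   /* hypothetical stand-in for MAX_SKB_FRAGS */
#define DESC_PKT_CONT 0x1u  /* hypothetical "more descriptors follow" flag */

/* Hypothetical descriptor layout, for illustration only. */
struct desc {
	uint64_t addr;
	uint32_t len;
	uint32_t options;
};

/*
 * Peek at ring[0..avail) without consuming anything and copy the
 * descriptors that form one complete packet into out[] (which must have
 * room for MAX_FRAGS entries).
 *
 * Returns the descriptor count on success, -EOVERFLOW when MAX_FRAGS
 * descriptors were read and the last one still has DESC_PKT_CONT set,
 * and -EAGAIN when the end of the packet is not in the ring yet.
 */
static int preread_packet(const struct desc *ring, size_t avail,
			  struct desc *out)
{
	size_t n;

	for (n = 0; n < avail && n < MAX_FRAGS; n++) {
		out[n] = ring[n];
		if (!(ring[n].options & DESC_PKT_CONT))
			return (int)(n + 1);
	}

	/* Local array is full and the packet still continues: too big. */
	if (n == MAX_FRAGS)
		return -EOVERFLOW;

	return -EAGAIN;
}
```

Because nothing is consumed until the whole packet is known, the later steps (CQ reservation, allocation, skb construction) would only ever start from a complete, bounded descriptor set. Those steps are intentionally not shown here.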

* Re: [PATCH net-next v3 0/3] net: pse-pd: decouple controller lookup from MDIO probe
From: Jakub Kicinski @ 2026-06-27  0:46 UTC (permalink / raw)
  To: Carlo Szelinsky
  Cc: Oleksij Rempel, Kory Maincent, Andrew Lunn, Heiner Kallweit,
	Russell King, David S . Miller, Eric Dumazet, Paolo Abeni,
	Corey Leavitt, Jonas Jelonek, Simon Horman, netdev, linux-kernel
In-Reply-To: <20260626165929.2908782-1-github@szelinsky.de>

On Fri, 26 Jun 2026 18:59:26 +0200 Carlo Szelinsky wrote:
> Subject: [PATCH net-next v3 0/3] net: pse-pd: decouple controller lookup from MDIO probe

## Form letter - net-next-closed

We have already submitted our pull request with net-next material for v7.2,
and therefore net-next is closed for new drivers, features, code refactoring
and optimizations. We are currently accepting bug fixes only.

Please repost when net-next reopens after June 29th.

RFC patches sent for review only are obviously welcome at any time.

See: https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#development-cycle
-- 
pw-bot: defer
pv-bot: closed

^ permalink raw reply

* Re: [PATCH v4] virtio_net: disable cb when NAPI is busy-polled
From: Jakub Kicinski @ 2026-06-27  0:44 UTC (permalink / raw)
  To: Simon Horman, lange_tang
  Cc: mst, xuanzhuo, jasowang, edumazet, virtualization, netdev,
	tanglongjun
In-Reply-To: <20260626151508.1319440-1-horms@kernel.org>

On Fri, 26 Jun 2026 16:15:08 +0100 Simon Horman wrote:
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 26afa6341d161..c1e252400c0fc 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -3011,6 +3011,8 @@ static int virtnet_poll(struct napi_struct *napi, int budget)
> >  	unsigned int xdp_xmit = 0;
> >  	bool napi_complete;
> >  
> > +	virtqueue_disable_cb(rq->vq);
> > +  
> 
> [Severity: High]
> Can this unconditionally disable the RX callback and cause a permanent network
> stall when polled by netpoll?

Good catch, Longjun just add if (budget)

^ permalink raw reply
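
The suggested `if (budget)` guard can be illustrated with a small userspace mock. This is only a sketch: `struct mock_vq`, `mock_disable_cb()` and `mock_poll()` are hypothetical stand-ins, not the virtio_net code. It relies on the fact that netpoll invokes a NAPI poll routine with a budget of 0, in which case no Rx work is done.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a virtqueue; it only tracks callback state. */
struct mock_vq {
	bool cb_enabled;
};

static void mock_disable_cb(struct mock_vq *vq)
{
	vq->cb_enabled = false;
}

/*
 * Mock poll routine. With budget == 0 (the netpoll case) the Rx callback
 * is left untouched, so a poll that processes no Rx packets cannot leave
 * the callback disabled.
 */
static int mock_poll(struct mock_vq *vq, int budget)
{
	if (budget)
		mock_disable_cb(vq);

	/* Rx processing and callback re-enable elided in this sketch. */
	return 0;
}
```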

* Re: [PATCH net 1/7] xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx
From: Jason Xing @ 2026-06-27  0:36 UTC (permalink / raw)
  To: Fijalkowski, Maciej, Roman Gushchin
  Cc: Zaremba, Larysa, netdev@vger.kernel.org, bpf@vger.kernel.org,
	Karlsson, Magnus, stfomichev@gmail.com, kuba@kernel.org,
	pabeni@redhat.com, horms@kernel.org, bjorn@kernel.org, Jason Xing
In-Reply-To: <IA1PR11MB60978AF3799FD895C53A860882EB2@IA1PR11MB6097.namprd11.prod.outlook.com>

On Fri, Jun 26, 2026 at 9:43 PM Fijalkowski, Maciej
<maciej.fijalkowski@intel.com> wrote:
>
> >
> > On Fri, Jun 26, 2026 at 7:12 PM Larysa Zaremba <larysa.zaremba@intel.com>
> > wrote:
> > >
> > > On Tue, Jun 23, 2026 at 03:32:34PM +0200, Maciej Fijalkowski wrote:
> > > > From: Jason Xing <kernelxing@tencent.com>
> > > >
> > > > This patch is inspired by the check[1] from sashiko. It says when
> > > > overflow happens, the address of cq to be published is invalid.
> > > > > Actually, the more severe problem is that the whole process of publishing
> > > > > the address of cq in this particular case is not right: it should truly
> > > > > publish the address and advance the cached_prod in cq as long as it
> > > > > reads descriptors from txq.
> > > >
> > > > The following is the full analysis.
> > > > xsk_drop_skb() is called in three places, which all discard a partially
> > > > built multi-buffer skb:
> > > > > 1) xsk_build_skb() -EOVERFLOW error path: packet exceeds MAX_SKB_FRAGS
> > > > 2) __xsk_generic_xmit() post-loop cleanup: an invalid descriptor in
> > > >    the TX ring prevents the partial packet from completing
> > > > 3) xsk_release(): socket close while xs->skb holds an incomplete packet
> > > >
> > > > In all three cases, the TX descriptors for the already-processed frags
> > > > have been consumed from the TX ring (xskq_cons_release), and CQ slots
> > > > have been reserved. However, xsk_drop_skb() calls xsk_consume_skb()
> > > > which cancels the CQ reservations via xsk_cq_cancel_locked(). Since
> > > > the buffer addresses never appear in the completion queue, userspace
> > > > permanently loses track of these buffers.
> > > >
> > > > Fix this by letting consume_skb() trigger the existing xsk_destruct_skb
> > > > destructor, which already submits buffer addresses to the CQ via
> > > > xsk_cq_submit_addr_locked().
> > > >
> > > > Note that cancelling the descriptors back to the TX ring (via
> > > > > xskq_cons_cancel_n) is not an appropriate option because an oversized
> > > > > packet that always exceeds MAX_SKB_FRAGS would be retried indefinitely,
> > > > > which is obviously a deadlock bug in the TX path.
> > > >
> > > > Also move the desc->addr assignment in xsk_build_skb() above the
> > > > overflow check so that the current descriptor's address is recorded
> > > > before a potential -EOVERFLOW jump to free_err, consistent with the
> > > > zerocopy path in xsk_build_skb_zerocopy().
> > > >
> > > > [1]:
> > https://lore.kernel.org/all/20260425041726.85FB3C2BCB2@smtp.kernel.org/
> > >
> > > > This change looks good, but overflow case with only 1 descriptor worries me.
> >
> > I presume you referred to xsk_build_skb_zerocopy()?
> >
> > > > In such cases, once we get to following code, kfree_skb() has already happened:
> > >
> > >         if (err == -EOVERFLOW) {
> > >                 if (xs->skb) {
> > >                         /* Drop the packet */
> > >                         xsk_inc_num_desc(xs->skb);
> > >                         xsk_drop_skb(xs->skb);
> > >                 } else {
> > >                         xsk_cq_cancel_locked(xs->pool, 1);
> > >                         xs->tx->invalid_descs++;
> > >                 }
> > >                 xskq_cons_release(xs->tx);
> > >         }
> > >
> > > kfree_skb() should have resulted in submission of the single fat descriptor to
> > > > xsk_cq_submit_addr_locked() via xsk_destruct_skb(), so far consistent with the
> >
> > At least, in the NO_LINEAR case, xsk_skb_init_misc() is not called
> > since the OVERFLOW skips this function, which means kfree_skb()
> > doesn't invoke xsk_destruct_skb() to publish it in the CQ. So it's
> > safe to cancel the cq reservation (in xsk_cq_cancel_locked(xs->pool,
> > 1)).
>
> (responding from outlook so apologies for any broken formatting)
>
> Yes, I have the same understanding here. However, how technically
> possible would it be to produce > MAX_SKB_FRAGS from a single
> AF_XDP descriptor?

Very unlikely. But my viewpoint might change after a wide deployment
internally in the second half of the year.

>
> I know Sashiko has pointed this out and you came up with previous
> fix, but for valid descriptor it is simply not possible. And invalid
> descs wouldn't reach this function.

Yep.

>
> I wouldn't like to stir up the pot too much so let us keep this
> code, but is there any way to give Sashiko additional context?
> I mean, for a case where we would say *this can't happen*, will
> it be able to carry this information onwards?

I don't know about how sashiko works, sorry. Maybe @Roman Gushchin has
unique insights on this?

Thanks,
Jason

>
> >
> > Thanks,
> > Jason
> >
> > > > multi-descriptor behavior you are proposing here.
> > >
> > > But what happens when we cancel a submitted CQ slot via
> > > xsk_cq_cancel_locked(xs->pool, 1) in the above code?
> > >
> > > >
> > > > > Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
> > > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > > ---
> > > >  net/xdp/xsk.c | 13 ++++++++-----
> > > >  1 file changed, 8 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > > index b970f30ea9b9..a7a83dc4546a 100644
> > > > --- a/net/xdp/xsk.c
> > > > +++ b/net/xdp/xsk.c
> > > > @@ -794,8 +794,11 @@ static void xsk_consume_skb(struct sk_buff
> > *skb)
> > > >
> > > >  static void xsk_drop_skb(struct sk_buff *skb)
> > > >  {
> > > > -     xdp_sk(skb->sk)->tx->invalid_descs += xsk_get_num_desc(skb);
> > > > -     xsk_consume_skb(skb);
> > > > +     struct xdp_sock *xs = xdp_sk(skb->sk);
> > > > +
> > > > +     xs->tx->invalid_descs += xsk_get_num_desc(skb);
> > > > +     consume_skb(skb);
> > > > +     xs->skb = NULL;
> > > >  }
> > > >
> > > >  static int xsk_skb_metadata(struct sk_buff *skb, void *buffer,
> > > > @@ -877,7 +880,7 @@ static struct sk_buff
> > *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> > > >                       return ERR_PTR(-ENOMEM);
> > > >
> > > >               /* in case of -EOVERFLOW that could happen below,
> > > > -              * xsk_consume_skb() will release this node as whole skb
> > > > +              * xsk_drop_skb() will release this node as whole skb
> > > >                * would be dropped, which implies freeing all list elements
> > > >                */
> > > >               xsk_addr->addrs[xsk_addr->num_descs] = desc->addr;
> > > > @@ -969,6 +972,8 @@ static struct sk_buff *xsk_build_skb(struct
> > xdp_sock *xs,
> > > >                               goto free_err;
> > > >                       }
> > > >
> > > > +                     xsk_addr->addrs[xsk_addr->num_descs] = desc->addr;
> > > > +
> > > > >                       if (unlikely(nr_frags == (MAX_SKB_FRAGS - 1) && xp_mb_desc(desc))) {
> > > >                               err = -EOVERFLOW;
> > > >                               goto free_err;
> > > > @@ -986,8 +991,6 @@ static struct sk_buff *xsk_build_skb(struct
> > xdp_sock *xs,
> > > >
> > > >                       skb_add_rx_frag(skb, nr_frags, page, 0, len, PAGE_SIZE);
> > > >                       refcount_add(PAGE_SIZE, &xs->sk.sk_wmem_alloc);
> > > > -
> > > > -                     xsk_addr->addrs[xsk_addr->num_descs] = desc->addr;
> > > >               }
> > > >       }
> > > >
> > > > --
> > > > 2.43.0
> > > >
> > > >

^ permalink raw reply

* Re: [PATCH net-next] Documentation: networking: Add a test plan for ethtool pause validation
From: Jakub Kicinski @ 2026-06-27  0:33 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Maxime Chevallier, davem, Eric Dumazet, Paolo Abeni, Simon Horman,
	Russell King, Heiner Kallweit, Jonathan Corbet, Shuah Khan,
	Oleksij Rempel, Vladimir Oltean, Florian Fainelli,
	thomas.petazzoni, netdev, linux-kernel, linux-doc
In-Reply-To: <5b7dbdbc-93fd-4664-abad-0f47855fab55@lunn.ch>

On Fri, 26 Jun 2026 14:39:57 +0200 Andrew Lunn wrote:
> On Fri, Jun 26, 2026 at 10:33:50AM +0200, Maxime Chevallier wrote:
> >   
> > > Sphinx follows Python's object-oriented structure. So you could have a
> > > class test_ethtool_pause_advertising, with class documentation. And
> > > then methods within the class which are individual tests.  The
> > > commented out section would then be method documentation.  
> > 
> > Good point, so maybe something along these lines :
> > 
> >  - A class for the test
> >  - methods for individual tests
> >  - For readability, I've written what the internal test helper would look
> >    like (_adv_test), and how a test would look like without the helper in
> >    adv_rx_on_tx_on().
> > 
> > I'm already diving into coding, but it helps me a bit in the definition of the
> > "description" format :)
> > 
> > this is what the class would look like :  
> 
> I like this :-)

This is very far from what existing python tests do in netdev.

I would prefer to stick to the "bash on steroids" use of Python.

Are you both familiar with the existing tests?

^ permalink raw reply

* Re: [PATCH bpf-next v2 00/15] bpf: A common way to attach struct_ops to a cgroup
From: Roman Gushchin @ 2026-06-26 23:59 UTC (permalink / raw)
  To: Amery Hung
  Cc: bpf, netdev, alexei.starovoitov, andrii, daniel, eddyz87, memxor,
	martin.lau, shakeel.butt, kuniyu, kerneljasonxing, kernel-team
In-Reply-To: <20260623175006.3136053-1-ameryhung@gmail.com>

Amery Hung <ameryhung@gmail.com> writes:

> Hi,
>
> I am continuing Martin's work to support attaching struct_ops to
> cgroup.

Awesome, thank you for working on this!

I'm going to rebase bpf oom work on top of this patchset and will give
it some additional testing.

Thanks!

^ permalink raw reply

* Re: [PATCH net-next v1] tcp/dccp: avoid parity split for socket-local bind range
From: Kuniyuki Iwashima @ 2026-06-26 23:40 UTC (permalink / raw)
  To: xuanqiang.luo
  Cc: Eric Dumazet, Neal Cardwell, netdev, David S . Miller,
	Jakub Kicinski, Paolo Abeni, Simon Horman, luoxuanqiang
In-Reply-To: <20260626093856.61864-1-xuanqiang.luo@linux.dev>

On Fri, Jun 26, 2026 at 2:40 AM <xuanqiang.luo@linux.dev> wrote:
>
> From: luoxuanqiang <luoxuanqiang@kylinos.cn>
>
> IP_LOCAL_PORT_RANGE lets applications override the netns ephemeral port
> range on a per-socket basis.  __inet_hash_connect() already treats such a
> range as an explicit application partition and scans it with step 1 [1].
>
> Do the same in inet_csk_find_open_port():

What's the use case of IP_LOCAL_PORT_RANGE + bind(, 0)
without IP_BIND_ADDRESS_NO_PORT ?


> when a socket-local range is set,
> walk the whole selected range instead of first splitting it by parity.
> Keep the existing step-2 parity behavior for sockets using the netns range,
> so the default bind/connect separation remains unchanged.
>
> [1] https://lore.kernel.org/r/20231214192939.1962891-3-edumazet@google.com
>
> Suggested-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: luoxuanqiang <luoxuanqiang@kylinos.cn>
> ---
>  net/ipv4/inet_connection_sock.c | 20 +++++++++++++-------
>  1 file changed, 13 insertions(+), 7 deletions(-)
>
> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index 56902bba54838..ad8af70c92ca3 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -323,13 +323,16 @@ inet_csk_find_open_port(const struct sock *sk, struct inet_bind_bucket **tb_ret,
>         struct inet_bind2_bucket *tb2;
>         struct inet_bind_bucket *tb;
>         u32 remaining, offset;
> +       bool local_ports;
>         bool relax = false;
> +       int step;
>
>         l3mdev = inet_sk_bound_l3mdev(sk);
>  ports_exhausted:
>         attempt_half = (sk->sk_reuse == SK_CAN_REUSE) ? 1 : 0;
>  other_half_scan:
> -       inet_sk_get_local_port_range(sk, &low, &high);
> +       local_ports = inet_sk_get_local_port_range(sk, &low, &high);
> +       step = local_ports ? 1 : 2;
>         high++; /* [32768, 60999] -> [32768, 61000[ */
>         if (high - low < 4)
>                 attempt_half = 0;
> @@ -342,18 +345,19 @@ inet_csk_find_open_port(const struct sock *sk, struct inet_bind_bucket **tb_ret,
>                         low = half;
>         }
>         remaining = high - low;
> -       if (likely(remaining > 1))
> +       if (!local_ports && remaining > 1)
>                 remaining &= ~1U;
>
>         offset = get_random_u32_below(remaining);
>         /* __inet_hash_connect() favors ports having @low parity
>          * We do the opposite to not pollute connect() users.
>          */
> -       offset |= 1U;
> +       if (!local_ports)
> +               offset |= 1U;
>
>  other_parity_scan:
>         port = low + offset;
> -       for (i = 0; i < remaining; i += 2, port += 2) {
> +       for (i = 0; i < remaining; i += step, port += step) {
>                 if (unlikely(port >= high))
>                         port -= remaining;
>                 if (inet_is_local_reserved_port(net, port))
> @@ -384,9 +388,11 @@ inet_csk_find_open_port(const struct sock *sk, struct inet_bind_bucket **tb_ret,
>                 cond_resched();
>         }
>
> -       offset--;
> -       if (!(offset & 1))
> -               goto other_parity_scan;
> +       if (!local_ports) {
> +               offset--;
> +               if (!(offset & 1))
> +                       goto other_parity_scan;
> +       }
>
>         if (attempt_half == 1) {
>                 /* OK we now try the upper half of the range */
> --
> 2.43.0
>

^ permalink raw reply
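
The effect of the `step` change in the patch above can be seen in a reduced userspace model of the scan loop. This is a sketch under simplifying assumptions: `scan_order()` is a hypothetical helper that only records the order in which ports would be probed. It leaves out attempt_half, reserved ports, bind-bucket lookups and the random offset (the caller passes a fixed offset, which must be smaller than the range size, and `high` is already exclusive).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Record the order in which ports in [low, high) would be probed.
 * local_ports == false models the netns range (two passes, step 2,
 * odd offset first); local_ports == true models a socket-local range
 * (one pass, step 1). Returns the number of ports written to out[].
 */
static size_t scan_order(uint32_t low, uint32_t high, uint32_t offset,
			 bool local_ports, uint16_t *out)
{
	uint32_t remaining = high - low;
	uint32_t step = local_ports ? 1 : 2;
	uint32_t i, port;
	size_t n = 0;

	/* The parity split needs an even-sized range and an odd start. */
	if (!local_ports && remaining > 1)
		remaining &= ~1U;
	if (!local_ports)
		offset |= 1U;

other_parity_scan:
	port = low + offset;
	for (i = 0; i < remaining; i += step, port += step) {
		if (port >= high)
			port -= remaining;
		out[n++] = (uint16_t)port;
	}

	/* Second pass over the other parity, netns range only. */
	if (!local_ports) {
		offset--;
		if (!(offset & 1))
			goto other_parity_scan;
	}

	return n;
}
```

For the range [100, 106) and offset 2, the netns-style scan probes 103, 105, 101 and then 102, 104, 100, while the socket-local scan probes 102, 103, 104, 105, 100, 101 in a single pass.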

* [PATCH net] net: gianfar: dispose irq mappings on probe failure and device removal
From: Rosen Penev @ 2026-06-26 22:52 UTC (permalink / raw)
  To: netdev
  Cc: Claudiu Manoil, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Andy Fleming, open list

irq_of_parse_and_map() creates irqdomain mappings that should be
balanced with irq_dispose_mapping(). The driver never called
irq_dispose_mapping(), leaking mappings on probe failure and
device removal.

Fix by adding irq_dispose_mapping() in free_gfar_dev() and
expanding its loop from priv->num_grps to MAXGROUPS so the
error path also catches partially-initialized groups. All
irqinfo pointers are pre-initialized to NULL in gfar_of_init(),
making the NULL-guarded walk in free_gfar_dev() safe for every
scenario.

gfar_parse_group() itself is left as a simple parse function
with no resource management; cleanup is centralized in the
caller's error path.

Assisted-by: opencode:big-pickle
Fixes: b31a1d8b4151 ("gianfar: Convert gianfar to an of_platform_driver")
Signed-off-by: Rosen Penev <rosenp@gmail.com>
---
 drivers/net/ethernet/freescale/gianfar.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c
index 3271de5844f8..89215e1ddc2d 100644
--- a/drivers/net/ethernet/freescale/gianfar.c
+++ b/drivers/net/ethernet/freescale/gianfar.c
@@ -469,10 +469,13 @@ static void free_gfar_dev(struct gfar_private *priv)
 {
 	int i, j;
 
-	for (i = 0; i < priv->num_grps; i++)
+	for (i = 0; i < MAXGROUPS; i++)
 		for (j = 0; j < GFAR_NUM_IRQS; j++) {
-			kfree(priv->gfargrp[i].irqinfo[j]);
-			priv->gfargrp[i].irqinfo[j] = NULL;
+			if (priv->gfargrp[i].irqinfo[j]) {
+				irq_dispose_mapping(priv->gfargrp[i].irqinfo[j]->irq);
+				kfree(priv->gfargrp[i].irqinfo[j]);
+				priv->gfargrp[i].irqinfo[j] = NULL;
+			}
 		}
 
 	free_netdev(priv->ndev);
@@ -616,7 +619,7 @@ static phy_interface_t gfar_get_interface(struct net_device *dev)
 static int gfar_of_init(struct platform_device *ofdev, struct net_device **pdev)
 {
 	const char *model;
-	int err = 0, i;
+	int err = 0, i, j;
 	phy_interface_t interface;
 	struct net_device *dev = NULL;
 	struct gfar_private *priv = NULL;
@@ -702,8 +705,11 @@ static int gfar_of_init(struct platform_device *ofdev, struct net_device **pdev)
 	priv->rx_list.count = 0;
 	mutex_init(&priv->rx_queue_access);
 
-	for (i = 0; i < MAXGROUPS; i++)
+	for (i = 0; i < MAXGROUPS; i++) {
 		priv->gfargrp[i].regs = NULL;
+		for (j = 0; j < GFAR_NUM_IRQS; j++)
+			priv->gfargrp[i].irqinfo[j] = NULL;
+	}
 
 	/* Parse and initialize group specific information */
 	if (priv->mode == MQ_MG_MODE) {
-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH iwl-next v5 1/2] ethtool: treat RXH_GTP_TEID as intrinsically symmetric
From: Jakub Kicinski @ 2026-06-26 22:29 UTC (permalink / raw)
  To: Aleksandr Loktionov; +Cc: intel-wired-lan, anthony.l.nguyen, netdev
In-Reply-To: <20260626054730.1126969-2-aleksandr.loktionov@intel.com>

On Fri, 26 Jun 2026 07:47:29 +0200 Aleksandr Loktionov wrote:
> A GTP tunnel uses the same TEID value in both directions of a flow;
> including TEID in the hash input does not break src/dst symmetry.
> 
> ethtool_rxfh_config_is_sym() currently rejects any hash field bitmap
> that contains bits outside the four paired L3/L4 fields.  This causes
> drivers that hash GTP flows on TEID to fail the kernel's preflight
> validation in ethtool_check_flow_types(), making it impossible for
> those drivers to support symmetric-xor transforms at all.
> 
> Strip RXH_GTP_TEID from the bitmap before the paired-field check so
> that drivers may honestly report TEID hashing without blocking the
> configuration of symmetric transforms.

I don't know much about GTP, but "the Internet" does not seem to agree
with your claim:

  The TEID uniquely identifies the GSN tunnel endpoints. The tunnels 
  for an uplink and a downlink are separate and use a different TEID.

https://docs.paloaltonetworks.com/service-providers/10-1/mobile-network-infrastructure-getting-started/gtp/mobile-network-protection-profile

So I don't think this will fly..

^ permalink raw reply
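
The objection above can be made concrete with a toy model. If uplink and downlink really carry different TEIDs, as the quoted documentation states, then mixing the TEID into an otherwise symmetric hash sends the two directions of a flow to different hash values. The sketch below is purely illustrative: `mix()` is a generic invertible integer mixer, not Toeplitz, and `toy_sym_hash()` is a hypothetical function, not what any NIC computes.

```c
#include <stdint.h>

/* Toy invertible 32-bit mixer; NOT the Toeplitz hash used by hardware. */
static uint32_t mix(uint32_t x)
{
	x ^= x >> 16;
	x *= 0x7feb352dU;
	x ^= x >> 15;
	x *= 0x846ca68bU;
	x ^= x >> 16;
	return x;
}

/*
 * Hypothetical symmetric hash: src ^ dst is the same in both directions
 * because XOR is commutative, while the TEID is mixed in unchanged.
 */
static uint32_t toy_sym_hash(uint32_t src, uint32_t dst, uint32_t teid)
{
	return mix((src ^ dst) ^ mix(teid));
}
```

With the same TEID in both directions, swapping src and dst leaves the result unchanged. With a different TEID per direction, the inputs to the outer `mix()` differ, and since `mix()` is invertible the two hash values differ as well.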

* Re: [PATCH iwl-next v5 2/2] ice: implement symmetric RSS hash configuration
From: Jakub Kicinski @ 2026-06-26 22:26 UTC (permalink / raw)
  To: Aleksandr Loktionov; +Cc: intel-wired-lan, anthony.l.nguyen, netdev
In-Reply-To: <20260626054730.1126969-3-aleksandr.loktionov@intel.com>

On Fri, 26 Jun 2026 07:47:30 +0200 Aleksandr Loktionov wrote:
> -	/* Update the VSI's hash function */
> -	if (rxfh->input_xfrm & RXH_XFRM_SYM_XOR)
> -		hfunc = ICE_AQ_VSI_Q_OPT_RSS_HASH_SYM_TPLZ;
> +	/* Handle RSS symmetric hash transformation */
> +	if (rxfh->input_xfrm != RXH_XFRM_NO_CHANGE) {
> +		u8 new_hfunc;

I think this is the very bad part. Please extract it out and send it as
a fix to net. Looks like any changes to RSS config on ice randomly
enable xfrm sym. I isolated it to the ntuple.py test which just changes
the indir table, and the driver says:

  ice 0000:e1:00.0 ens1f0np0: Hash function set to: Symmetric Toeplitz

Which we never asked for. I drafted this before seeing your reply:

--- a/drivers/net/ethernet/intel/ice/ice_ethtool.c
+++ b/drivers/net/ethernet/intel/ice/ice_ethtool.c
@@ -3692,10 +3692,10 @@ ice_set_rxfh(struct net_device *netdev, struct ethtool_rxfh_param *rxfh,
             struct netlink_ext_ack *extack)
 {
        struct ice_netdev_priv *np = netdev_priv(netdev);
-       u8 hfunc = ICE_AQ_VSI_Q_OPT_RSS_HASH_TPLZ;
        struct ice_vsi *vsi = np->vsi;
        struct ice_pf *pf = vsi->back;
        struct device *dev;
+       u8 hfunc;
        int err;
 
        dev = ice_pf_to_dev(pf);
@@ -3714,9 +3714,12 @@ ice_set_rxfh(struct net_device *netdev, struct ethtool_rxfh_param *rxfh,
                return -EOPNOTSUPP;
        }
 
-       /* Update the VSI's hash function */
-       if (rxfh->input_xfrm & RXH_XFRM_SYM_XOR)
+       if (rxfh->input_xfrm == RXH_XFRM_NO_CHANGE)
+               hfunc = vsi->rss_hfunc;
+       else if (rxfh->input_xfrm & RXH_XFRM_SYM_XOR)
                hfunc = ICE_AQ_VSI_Q_OPT_RSS_HASH_SYM_TPLZ;
+       else /* input_xfrm == 0; core rejects any other value */
+               hfunc = ICE_AQ_VSI_Q_OPT_RSS_HASH_TPLZ;
 
        err = ice_set_rss_hfunc(vsi, hfunc);
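
Reduced to a standalone sketch, the draft is a three-way choice. The `SKETCH_*` constants and both helpers are hypothetical stand-ins, not the real `RXH_XFRM_*` or `ICE_AQ_VSI_Q_OPT_RSS_HASH_*` definitions. The values are picked on the assumption that the "no change" value has the SYM_XOR bit set, which would explain the symptom; check the uapi header before relying on that:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in values, assumed for this sketch; check the real headers. */
#define SKETCH_XFRM_SYM_XOR	0x01u
#define SKETCH_XFRM_NO_CHANGE	0xffu	/* note: has the SYM_XOR bit set */

enum sketch_hfunc { SKETCH_TPLZ, SKETCH_SYM_TPLZ };

/* Old logic: tests only the SYM_XOR bit, so NO_CHANGE selects symmetric. */
enum sketch_hfunc pick_hfunc_old(uint8_t input_xfrm)
{
	if (input_xfrm & SKETCH_XFRM_SYM_XOR)
		return SKETCH_SYM_TPLZ;
	return SKETCH_TPLZ;
}

/* Drafted logic: NO_CHANGE keeps whatever is currently configured. */
enum sketch_hfunc pick_hfunc(uint8_t input_xfrm, enum sketch_hfunc cur)
{
	if (input_xfrm == SKETCH_XFRM_NO_CHANGE)
		return cur;
	if (input_xfrm & SKETCH_XFRM_SYM_XOR)
		return SKETCH_SYM_TPLZ;
	return SKETCH_TPLZ;
}

void pick_hfunc_demo(void)
{
	/* An indirection-table-only update passes NO_CHANGE. */
	assert(pick_hfunc_old(SKETCH_XFRM_NO_CHANGE) == SKETCH_SYM_TPLZ);
	assert(pick_hfunc(SKETCH_XFRM_NO_CHANGE, SKETCH_TPLZ) == SKETCH_TPLZ);
}
```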

* [PATCH] netfilter: x_tables: replace strlcat() with snprintf()
From: Ian Bridges @ 2026-06-26 22:25 UTC (permalink / raw)
  To: Pablo Neira Ayuso, Florian Westphal, Phil Sutter, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
	netfilter-devel, coreteam, netdev, linux-kernel
  Cc: linux-hardening

In preparation for removing the deprecated strlcat() API[1], replace the
strscpy()/strlcat() pairs in xt_proto_init() and xt_proto_fini() with
snprintf(), which builds each /proc file name in a single call.

Each name is "<prefix><suffix>", where <prefix> is the address-family
string xt_prefix[af] and <suffix> is one of the FORMAT_TABLES,
FORMAT_MATCHES or FORMAT_TARGETS literals. snprintf() with a "%s%s"
format produces the same NUL-terminated, length-bounded string as the
strscpy()/strlcat() chain it replaces, so the proc entry names are
unchanged.
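
A userspace sketch of the single-call pattern, for illustration only. `build_proc_name()`, `SKETCH_MAXNAMELEN` and the sample strings are assumptions of this sketch, not taken from x_tables.c:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_MAXNAMELEN 30	/* assumed buffer size for the sketch */

/* Build "<prefix><suffix>" in one bounded call. Returns the length the
 * full string would have had, as snprintf() does; a return value
 * >= size means the result was truncated (still NUL-terminated).
 */
int build_proc_name(char *buf, size_t size, const char *prefix,
		    const char *suffix)
{
	return snprintf(buf, size, "%s%s", prefix, suffix);
}

void build_proc_name_demo(void)
{
	char buf[SKETCH_MAXNAMELEN];
	char tiny[8];
	int n;

	n = build_proc_name(buf, sizeof(buf), "ip", "_tables_names");
	assert(n == 15 && strcmp(buf, "ip_tables_names") == 0);

	/* Truncation: output is cut to size - 1 bytes plus the NUL. */
	n = build_proc_name(tiny, sizeof(tiny), "ip6", "_tables_targets");
	assert(n == 18 && strcmp(tiny, "ip6_tab") == 0);
}
```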

Link: https://github.com/KSPP/linux/issues/370 [1]
Signed-off-by: Ian Bridges <icb@fastmail.org>
---
 net/netfilter/x_tables.c | 24 ++++++++----------------
 1 file changed, 8 insertions(+), 16 deletions(-)

diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index 4e6708c23922..56f4546be336 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -2033,8 +2033,7 @@ int xt_proto_init(struct net *net, u_int8_t af)
 	root_uid = make_kuid(net->user_ns, 0);
 	root_gid = make_kgid(net->user_ns, 0);
 
-	strscpy(buf, xt_prefix[af], sizeof(buf));
-	strlcat(buf, FORMAT_TABLES, sizeof(buf));
+	snprintf(buf, sizeof(buf), "%s%s", xt_prefix[af], FORMAT_TABLES);
 	proc = proc_create_net_data(buf, 0440, net->proc_net, &xt_table_seq_ops,
 			sizeof(struct seq_net_private),
 			(void *)(unsigned long)af);
@@ -2043,8 +2042,7 @@ int xt_proto_init(struct net *net, u_int8_t af)
 	if (uid_valid(root_uid) && gid_valid(root_gid))
 		proc_set_user(proc, root_uid, root_gid);
 
-	strscpy(buf, xt_prefix[af], sizeof(buf));
-	strlcat(buf, FORMAT_MATCHES, sizeof(buf));
+	snprintf(buf, sizeof(buf), "%s%s", xt_prefix[af], FORMAT_MATCHES);
 	proc = proc_create_seq_private(buf, 0440, net->proc_net,
 			&xt_match_seq_ops, sizeof(struct nf_mttg_trav),
 			(void *)(unsigned long)af);
@@ -2053,8 +2051,7 @@ int xt_proto_init(struct net *net, u_int8_t af)
 	if (uid_valid(root_uid) && gid_valid(root_gid))
 		proc_set_user(proc, root_uid, root_gid);
 
-	strscpy(buf, xt_prefix[af], sizeof(buf));
-	strlcat(buf, FORMAT_TARGETS, sizeof(buf));
+	snprintf(buf, sizeof(buf), "%s%s", xt_prefix[af], FORMAT_TARGETS);
 	proc = proc_create_seq_private(buf, 0440, net->proc_net,
 			 &xt_target_seq_ops, sizeof(struct nf_mttg_trav),
 			 (void *)(unsigned long)af);
@@ -2068,13 +2065,11 @@ int xt_proto_init(struct net *net, u_int8_t af)
 
 #ifdef CONFIG_PROC_FS
 out_remove_matches:
-	strscpy(buf, xt_prefix[af], sizeof(buf));
-	strlcat(buf, FORMAT_MATCHES, sizeof(buf));
+	snprintf(buf, sizeof(buf), "%s%s", xt_prefix[af], FORMAT_MATCHES);
 	remove_proc_entry(buf, net->proc_net);
 
 out_remove_tables:
-	strscpy(buf, xt_prefix[af], sizeof(buf));
-	strlcat(buf, FORMAT_TABLES, sizeof(buf));
+	snprintf(buf, sizeof(buf), "%s%s", xt_prefix[af], FORMAT_TABLES);
 	remove_proc_entry(buf, net->proc_net);
 out:
 	return -1;
@@ -2087,16 +2082,13 @@ void xt_proto_fini(struct net *net, u_int8_t af)
 #ifdef CONFIG_PROC_FS
 	char buf[XT_FUNCTION_MAXNAMELEN];
 
-	strscpy(buf, xt_prefix[af], sizeof(buf));
-	strlcat(buf, FORMAT_TABLES, sizeof(buf));
+	snprintf(buf, sizeof(buf), "%s%s", xt_prefix[af], FORMAT_TABLES);
 	remove_proc_entry(buf, net->proc_net);
 
-	strscpy(buf, xt_prefix[af], sizeof(buf));
-	strlcat(buf, FORMAT_TARGETS, sizeof(buf));
+	snprintf(buf, sizeof(buf), "%s%s", xt_prefix[af], FORMAT_TARGETS);
 	remove_proc_entry(buf, net->proc_net);
 
-	strscpy(buf, xt_prefix[af], sizeof(buf));
-	strlcat(buf, FORMAT_MATCHES, sizeof(buf));
+	snprintf(buf, sizeof(buf), "%s%s", xt_prefix[af], FORMAT_MATCHES);
 	remove_proc_entry(buf, net->proc_net);
 #endif /*CONFIG_PROC_FS*/
 }
-- 
2.47.3


* Re: [PATCH bpf-next v2 02/15] bpf: Make struct_ops tasks_rcu grace period optional
From: Eduard Zingerman @ 2026-06-26 22:20 UTC (permalink / raw)
  To: Amery Hung, bpf
  Cc: netdev, alexei.starovoitov, andrii, daniel, memxor, martin.lau,
	shakeel.butt, roman.gushchin, kuniyu, kerneljasonxing,
	kernel-team
In-Reply-To: <20260623175006.3136053-3-ameryhung@gmail.com>

On Tue, 2026-06-23 at 10:49 -0700, Amery Hung wrote:
> From: Martin KaFai Lau <martin.lau@kernel.org>
> 
> bpf_struct_ops_map_free() currently waits for both a regular RCU grace
> period and a tasks RCU grace period for every struct_ops map through
> synchronize_rcu_mult(call_rcu, call_rcu_tasks).
> 
> A regular RCU grace period is still required for all struct_ops maps
> because the struct_ops trampoline ksyms requires a rcu grace period
> (take a look at the list_del_rcu in __bpf_ksym_del).
> Add a map_free_pre_rcu() callback so the struct_ops map can remove
> ksyms before bpf_map_put() wait for the regular rcu grace period.
> 
> The tasks RCU grace period is only needed by tcp_congestion_ops.
> Add free_after_tasks_rcu_gp only to struct bpf_struct_ops instead
> of the bpf_map.
> 
> When CONFIG_TASKS_RCU=n, synchronize_rcu_tasks() is the same as
> synchronize_rcu(). Since all struct_ops maps now complete a regular RCU
> grace period before bpf_struct_ops_map_free() runs, skip the extra
> synchronize_rcu_tasks() call in this case.
> 
> This cleanup prepares for a later patch that needs to support
> free_after_mult_rcu_gp.
> 
> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
> Signed-off-by: Amery Hung <ameryhung@gmail.com>
> ---

Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>

[...]

> @@ -997,24 +1006,8 @@ static void bpf_struct_ops_map_free(struct bpf_map *map)
>  
>  	bpf_struct_ops_map_dissoc_progs(st_map);
>  
> -	bpf_struct_ops_map_del_ksyms(st_map);
> -
> -	/* The struct_ops's function may switch to another struct_ops.
> -	 *
> -	 * For example, bpf_tcp_cc_x->init() may switch to
> -	 * another tcp_cc_y by calling
> -	 * setsockopt(TCP_CONGESTION, "tcp_cc_y").
> -	 * During the switch,  bpf_struct_ops_put(tcp_cc_x) is called
> -	 * and its refcount may reach 0 which then free its
> -	 * trampoline image while tcp_cc_x is still running.
> -	 *
> -	 * A vanilla rcu gp is to wait for all bpf-tcp-cc prog
> -	 * to finish. bpf-tcp-cc prog is non sleepable.
> -	 * A rcu_tasks gp is to wait for the last few insn
> -	 * in the tramopline image to finish before releasing
> -	 * the trampoline image.
> -	 */
> -	synchronize_rcu_mult(call_rcu, call_rcu_tasks);
> +	if (tasks_rcu && IS_ENABLED(CONFIG_TASKS_RCU))
> +		synchronize_rcu_tasks();

As far as I understand, this removes the synchronize_rcu_tasks()
for qdisc, sched_ext, smc and hid struct ops. As far as I can tell,
each one of them employs separate means to guarantee that there won't
be any pending BPF trampolines referring to the image being freed here.
So, the change appears to be safe.

>  
>  	__bpf_struct_ops_map_free(map);
>  }

[...]

* Re: [PATCH bpf 1/2] bpf, sockmap: Don't leak UDP socks on lookup-bind-release
From: John Fastabend @ 2026-06-26 21:43 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: Kuniyuki Iwashima, Michal Luczaj, Willem de Bruijn, Jiayuan Chen,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Alexei Starovoitov, Cong Wang, Daniel Borkmann,
	Andrii Nakryiko, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Martin KaFai Lau, Song Liu, Yonghong Song, Jiri Olsa,
	Emil Tsalapatis, Shuah Khan, netdev, bpf, linux-kernel,
	linux-kselftest
In-Reply-To: <87o6gyyjxk.fsf@cloudflare.com>

On Thu, Jun 25, 2026 at 12:48:55PM +0200, Jakub Sitnicki wrote:
>On Wed, Jun 24, 2026 at 02:39 PM -07, Kuniyuki Iwashima wrote:
>> On Wed, Jun 24, 2026 at 2:33 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
>>>
>>> On Wed, Jun 24, 2026 at 2:26 PM Michal Luczaj <mhal@rbox.co> wrote:
>>> >
>>> > On 6/24/26 22:01, Willem de Bruijn wrote:
>>> > > Jakub Sitnicki wrote:
>>> > >> On Tue, Jun 23, 2026 at 08:03 PM +02, Michal Luczaj wrote:
>>> > >>> UDP sockets get SOCK_RCU_FREE set when (auto-)bound. This means
>>> > >>> sk_is_refcounted(unbound) = true, while sk_is_refcounted(bound) = false.
>>> > >>>
>>> > >>> Because sockmap accepts unbound UDP sockets, a BPF program can increment a
>>> > >>> socket's refcount via lookup. If the socket is subsequently bound, the
>>> > >>> transition from unbound to bound causes bpf_sk_release() to skip the
>>> > >>> decrement of the refcount, causing a memory leak.
>>> > >>>
>>> > >>> unreferenced object 0xffff88810bc2eb40 (size 1984):
>>> > >>>   comm "test_progs", pid 2451, jiffies 4295320596
>>> > >>>   hex dump (first 32 bytes):
>>> > >>>     7f 00 00 01 7f 00 00 01 d2 04 1b b7 04 d2 00 00  ................
>>> > >>>     02 00 01 40 00 00 00 00 00 00 00 00 00 00 00 00  ...@............
>>> > >>>   backtrace (crc bdee079d):
>>> > >>>     kmem_cache_alloc_noprof+0x557/0x660
>>> > >>>     sk_prot_alloc+0x69/0x240
>>> > >>>     sk_alloc+0x30/0x460
>>> > >>>     inet_create+0x2ce/0xf80
>>> > >>>     __sock_create+0x25b/0x5c0
>>> > >>>     __sys_socket+0x119/0x1d0
>>> > >>>     __x64_sys_socket+0x72/0xd0
>>> > >>>     do_syscall_64+0xa1/0x5f0
>>> > >>>     entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>> > >>>
>>> > >>> Maintain balanced refcounts across sk lookup/release: (re-)set
>>> > >>> SOCK_RCU_FREE on proto update to treat the socket (whether bound or
>>> > >>> unbound) as not requiring a refcount increment on (a RCU protected) lookup.
>>> > >>>
>>> > >>> Fixes: 0c48eefae712 ("sock_map: Lift socket state restriction for datagram sockets")
>>> > >>> Signed-off-by: Michal Luczaj <mhal@rbox.co>

[...]

>Rejecting unhashed UDP sockets on insert to sockmap SGTM.
>It is also in line with disable-problematic-cases strategy.

Agree, ACK. Just disallow it.

* Re: [RFC PATCH 2/2] selftests/landlock: Add test for TCP fast open
From: Mickaël Salaün @ 2026-06-26 21:19 UTC (permalink / raw)
  To: Matthieu Buffet
  Cc: Bryam Vargas, Günther Noack, linux-security-module,
	Mikhail Ivanov, Paul Moore, Eric Dumazet, Neal Cardwell,
	linux-kernel, netdev
In-Reply-To: <20260617180526.15627-3-matthieu@buffet.re>

On Wed, Jun 17, 2026 at 08:05:24PM +0200, Matthieu Buffet wrote:
> Enforce that TCP Fast Open is controlled by
> LANDLOCK_ACCESS_NET_CONNECT_TCP. Semantics of connect() and
> sendmsg(MSG_FASTOPEN) should be identical from Landlock's perspective.
> Also enforce error code consistency, since UDP sockets ignore
> the MSG_FASTOPEN flag while Unix sockets reject it.
> 
> Signed-off-by: Matthieu Buffet <matthieu@buffet.re>
> ---
>  tools/testing/selftests/landlock/net_test.c | 155 ++++++++++++++++++++
>  1 file changed, 155 insertions(+)
> 
> diff --git a/tools/testing/selftests/landlock/net_test.c b/tools/testing/selftests/landlock/net_test.c
> index 0c256e7c8675..177ed28e70f6 100644
> --- a/tools/testing/selftests/landlock/net_test.c
> +++ b/tools/testing/selftests/landlock/net_test.c
> @@ -258,6 +258,64 @@ static int connect_variant(const int sock_fd,
>  	return connect_variant_addrlen(sock_fd, srv, get_addrlen(srv, false));
>  }
>  
> +static int sendto_variant_addrlen(const int sock_fd,
> +				  const struct service_fixture *const srv,
> +				  const socklen_t addrlen, void *buf,
> +				  size_t len, size_t flags)
> +{
> +	const struct sockaddr *dst = NULL;
> +	ssize_t ret;
> +
> +	/*
> +        * We never want our processes to be killed by SIGPIPE: we check return
> +        * codes and errno, so that we have actual error messages.
> +        */

There are some extra spaces above.

> +	flags |= MSG_NOSIGNAL;
> +
> +	if (srv != NULL) {

Just `if (srv) {`

> +		switch (srv->protocol.domain) {
> +		case AF_UNSPEC:
> +		case AF_INET:
> +			dst = (const struct sockaddr *)&srv->ipv4_addr;
> +			break;
> +
> +		case AF_INET6:
> +			dst = (const struct sockaddr *)&srv->ipv6_addr;
> +			break;
> +
> +		case AF_UNIX:
> +			dst = (const struct sockaddr *)&srv->unix_addr;
> +			break;
> +
> +		default:
> +			errno = EAFNOSUPPORT;
> +			return -errno;
> +		}
> +	}
> +
> +	ret = sendto(sock_fd, buf, len, flags, dst, addrlen);
> +	if (ret < 0)
> +		return -errno;
> +
> +	/* errno is not set in cases of partial writes. */
> +	if (ret != len)
> +		return -EINTR;
> +
> +	return 0;
> +}
> +
> +static int sendto_variant(const int sock_fd,
> +			  const struct service_fixture *const srv, void *buf,
> +			  size_t len, size_t flags)
> +{
> +	socklen_t addrlen = 0;
> +
> +	if (srv != NULL)

ditto

> +		addrlen = get_addrlen(srv, false);
> +
> +	return sendto_variant_addrlen(sock_fd, srv, addrlen, buf, len, flags);
> +}
> +
>  FIXTURE(protocol)
>  {
>  	struct service_fixture srv0, srv1, srv2, unspec_any0, unspec_srv0;
> @@ -950,6 +1008,103 @@ TEST_F(protocol, connect_unspec)
>  	EXPECT_EQ(0, close(bind_fd));
>  }
>  
> +TEST_F(protocol, tcp_fastopen)
> +{
> +	const bool restricted = variant->sandbox == TCP_SANDBOX &&
> +				variant->prot.type == SOCK_STREAM &&
> +				(variant->prot.protocol == IPPROTO_TCP ||
> +				 variant->prot.protocol == IPPROTO_IP) &&
> +				(variant->prot.domain == AF_INET ||
> +				 variant->prot.domain == AF_INET6);
> +	const struct landlock_ruleset_attr ruleset_attr = {
> +		.handled_access_net = LANDLOCK_ACCESS_NET_CONNECT_TCP,
> +	};
> +	int bind_fd, client_fd, status;
> +	char buf;
> +	pid_t child;
> +
> +	bind_fd = socket_variant(&self->srv0);
> +	ASSERT_LE(0, bind_fd);
> +	EXPECT_EQ(0, bind_variant(bind_fd, &self->srv0));
> +	if (self->srv0.protocol.type == SOCK_STREAM)
> +		EXPECT_EQ(0, listen(bind_fd, backlog));
> +
> +	child = fork();
> +	ASSERT_LE(0, child);
> +	if (child == 0) {
> +		int connect_fd, ret;
> +
> +		/* Closes listening socket for the child. */
> +		EXPECT_EQ(0, close(bind_fd));
> +
> +		connect_fd = socket_variant(&self->srv0);
> +		ASSERT_LE(0, connect_fd);
> +
> +		if (variant->sandbox == TCP_SANDBOX) {
> +			const int ruleset_fd = landlock_create_ruleset(
> +				&ruleset_attr, sizeof(ruleset_attr), 0);
> +			ASSERT_LE(0, ruleset_fd);
> +
> +			enforce_ruleset(_metadata, ruleset_fd);
> +			EXPECT_EQ(0, close(ruleset_fd));
> +		}
> +
> +		/* Fast Open with no address. */
> +		ret = sendto_variant(connect_fd, NULL, NULL, 0, MSG_FASTOPEN);


> +		if (self->srv0.protocol.domain == AF_UNIX) {
> +			ASSERT_EQ(-ENOTCONN, ret);
> +		} else if (self->srv0.protocol.type == SOCK_DGRAM) {
> +			ASSERT_EQ(-EDESTADDRREQ, ret);
> +		} else {
> +			ASSERT_EQ(-EINVAL, ret);
> +		}
> +
> +		/* Fast Open to a denied address. */
> +		ret = sendto_variant(connect_fd, &self->srv0, "A", 1,
> +				     MSG_FASTOPEN);
> +		if (restricted) {
> +			ASSERT_EQ(-EACCES, ret);
> +		} else if (self->srv0.protocol.domain == AF_UNIX &&
> +			   self->srv0.protocol.type == SOCK_STREAM) {
> +			ASSERT_EQ(-EOPNOTSUPP, ret);
> +		} else {
> +			ASSERT_EQ(0, ret);
> +		}

All these ret checks should be done with EXPECT_EQ because they don't
block the test itself and we can get more info by checking more after
that.
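
A minimal sketch of the difference, using a hypothetical mini-harness rather than the kselftest macros themselves. The `SKETCH_*` macros only mimic the two failure policies: record and continue, versus record and stop.

```c
#include <assert.h>

/* Hypothetical mini-harness, NOT the kselftest macros: it only mimics
 * the two failure policies being discussed.
 */
struct sketch_result {
	int checks_run;
	int failures;
};

/* Record a failure and keep going. */
#define SKETCH_EXPECT_EQ(res, want, got) do {		\
	(res)->checks_run++;				\
	if ((want) != (got))				\
		(res)->failures++;			\
} while (0)

/* Record a failure and leave the enclosing function. */
#define SKETCH_ASSERT_EQ(res, want, got) do {		\
	(res)->checks_run++;				\
	if ((want) != (got)) {				\
		(res)->failures++;			\
		return;					\
	}						\
} while (0)

void run_with_expect(struct sketch_result *res)
{
	SKETCH_EXPECT_EQ(res, 0, -1);	/* fails, continues */
	SKETCH_EXPECT_EQ(res, 0, -2);	/* fails, still reported */
	SKETCH_EXPECT_EQ(res, 0, 0);
}

void run_with_assert(struct sketch_result *res)
{
	SKETCH_ASSERT_EQ(res, 0, -1);	/* fails, stops here */
	SKETCH_ASSERT_EQ(res, 0, -2);	/* never evaluated */
	SKETCH_ASSERT_EQ(res, 0, 0);
}

void sketch_policy_demo(void)
{
	struct sketch_result e = { 0, 0 }, a = { 0, 0 };

	run_with_expect(&e);
	run_with_assert(&a);
	assert(e.checks_run == 3 && e.failures == 2);
	assert(a.checks_run == 1 && a.failures == 1);
}
```

With the expect-style policy the second mismatch is still reported, which is the extra information referred to above.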

> +
> +		EXPECT_EQ(0, close(connect_fd));
> +		_exit(_metadata->exit_code);
> +		return;
> +	}
> +
> +	client_fd = bind_fd;
> +	if (!restricted && self->srv0.protocol.type == SOCK_STREAM &&
> +	    self->srv0.protocol.domain != AF_UNIX) {
> +		client_fd = accept(bind_fd, NULL, 0);
> +		ASSERT_LE(0, client_fd);
> +	}
> +
> +	if (restricted) {
> +		EXPECT_EQ(-1, read(client_fd, &buf, 1));
> +		EXPECT_EQ(ENOTCONN, errno);
> +	} else if (self->srv0.protocol.domain == AF_UNIX &&
> +		   self->srv0.protocol.type == SOCK_STREAM) {
> +		EXPECT_EQ(-1, read(client_fd, &buf, 1));
> +		EXPECT_EQ(EINVAL, errno);
> +	} else {
> +		EXPECT_EQ(1, read(client_fd, &buf, 1));
> +		EXPECT_EQ('A', buf);
> +	}
> +
> +	EXPECT_EQ(child, waitpid(child, &status, 0));
> +	EXPECT_EQ(1, WIFEXITED(status));
> +	EXPECT_EQ(EXIT_SUCCESS, WEXITSTATUS(status));
> +
> +	if (client_fd != bind_fd)
> +		EXPECT_LE(0, close(client_fd));
> +
> +	EXPECT_EQ(0, close(bind_fd));
> +}
> +
>  FIXTURE(ipv4)
>  {
>  	struct service_fixture srv0, srv1;
> -- 
> 2.47.3
> 
> 

* Re: [PATCH bpf v2 3/4] selftests/bpf: Adapt sockmap update error handling
From: bot+bpf-ci @ 2026-06-26 21:14 UTC (permalink / raw)
  To: mhal, edumazet, kuniyu, pabeni, willemb, john.fastabend, jakub,
	jiayuan.chen, davem, kuba, horms, ast, cong.wang, daniel, andrii,
	eddyz87, memxor, martin.lau, song, yonghong.song, jolsa, emil,
	shuah
  Cc: netdev, bpf, linux-kernel, linux-kselftest, mhal, ast, andrii,
	daniel, martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260626-sockmap-lookup-udp-leak-v2-3-7e7e201c951a@rbox.co>

>     selftests/bpf: Adapt sockmap update error handling
>
>     Update sockmap_listen to accommodate the recent change in sockmap that
>     rejects unbound UDP sockets.
>
>     TCP: Reject unbound and bound (unless established or listening).
>     UDP: Accept only bound sockets.
>
>     Signed-off-by: Michal Luczaj <mhal@rbox.co>

This adjusts the test expectations that commit 0c48eefae712 ("sock_map:
Lift socket state restriction for datagram sockets") had introduced, the
per-sotype branching in test_insert_opened() along with the matching
changes in test_insert_bound() and test_ops().

The companion kernel fix 77a34848a723 ("bpf, sockmap: Reject unhashed UDP
sockets on sockmap update") already carries:

  Fixes: 0c48eefae712 ("sock_map: Lift socket state restriction for datagram sockets")

Should this selftest commit carry the same Fixes: tag?

  Fixes: 0c48eefae712 ("sock_map: Lift socket state restriction for datagram sockets")


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/28264556257

* Re: [PATCH bpf v2 4/4] selftests/bpf: Fail unbound UDP on sockmap update
From: Kuniyuki Iwashima @ 2026-06-26 21:03 UTC (permalink / raw)
  To: Michal Luczaj
  Cc: Eric Dumazet, Paolo Abeni, Willem de Bruijn, John Fastabend,
	Jakub Sitnicki, Jiayuan Chen, David S. Miller, Jakub Kicinski,
	Simon Horman, Alexei Starovoitov, Cong Wang, Daniel Borkmann,
	Andrii Nakryiko, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Martin KaFai Lau, Song Liu, Yonghong Song, Jiri Olsa,
	Emil Tsalapatis, Shuah Khan, netdev, bpf, linux-kernel,
	linux-kselftest
In-Reply-To: <20260626-sockmap-lookup-udp-leak-v2-4-7e7e201c951a@rbox.co>

On Fri, Jun 26, 2026 at 1:37 PM Michal Luczaj <mhal@rbox.co> wrote:
>
> sockmap now rejects unbound UDP sockets. Adjust test_maps.
>
> This effectively reverts commit c39aa2159974 ("bpf, selftests: Fix
> test_maps now that sockmap supports UDP").
>
> Signed-off-by: Michal Luczaj <mhal@rbox.co>
> ---
>  tools/testing/selftests/bpf/test_maps.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
> index c32da7bd8be2..81cd5d0d69c1 100644
> --- a/tools/testing/selftests/bpf/test_maps.c
> +++ b/tools/testing/selftests/bpf/test_maps.c
> @@ -759,12 +759,12 @@ static void test_sockmap(unsigned int tasks, void *data)
>                 goto out_sockmap;
>         }
>
> -       /* Test update with unsupported UDP socket */
> +       /* Test update with unsupported unbound UDP socket */
>         udp = socket(AF_INET, SOCK_DGRAM, 0);
>         i = 0;
>         err = bpf_map_update_elem(fd, &i, &udp, BPF_ANY);
> -       if (err) {
> -               printf("Failed socket update SOCK_DGRAM '%i:%i'\n",
> +       if (!err) {
> +               printf("Failed allowed unbound SOCK_DGRAM socket update '%i:%i'\n",

nit: Maybe s/Failed/Unexpectedly succeeded/ ?

If we want to avoid breakage, this patch needs to be squashed into
the fix patch, but that's discouraged in netdev; not sure about the bpf tree.

Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>

* Re: [PATCH bpf v2 3/4] selftests/bpf: Adapt sockmap update error handling
From: Kuniyuki Iwashima @ 2026-06-26 20:58 UTC (permalink / raw)
  To: Michal Luczaj
  Cc: Eric Dumazet, Paolo Abeni, Willem de Bruijn, John Fastabend,
	Jakub Sitnicki, Jiayuan Chen, David S. Miller, Jakub Kicinski,
	Simon Horman, Alexei Starovoitov, Cong Wang, Daniel Borkmann,
	Andrii Nakryiko, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Martin KaFai Lau, Song Liu, Yonghong Song, Jiri Olsa,
	Emil Tsalapatis, Shuah Khan, netdev, bpf, linux-kernel,
	linux-kselftest
In-Reply-To: <20260626-sockmap-lookup-udp-leak-v2-3-7e7e201c951a@rbox.co>

On Fri, Jun 26, 2026 at 1:37 PM Michal Luczaj <mhal@rbox.co> wrote:
>
> Update sockmap_listen to accommodate the recent change in sockmap that
> rejects unbound UDP sockets.
>
> TCP: Reject unbound and bound (unless established or listening).
> UDP: Accept only bound sockets.
>
> Signed-off-by: Michal Luczaj <mhal@rbox.co>
> ---
>  tools/testing/selftests/bpf/prog_tests/sockmap_listen.c | 17 +++++++++--------
>  1 file changed, 9 insertions(+), 8 deletions(-)
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c b/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c
> index cc0c68bab907..6ee1bc6b3b23 100644
> --- a/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c
> +++ b/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c
> @@ -63,11 +63,8 @@ static void test_insert_opened(struct test_sockmap_listen *skel __always_unused,
>         errno = 0;
>         value = s;
>         err = bpf_map_update_elem(mapfd, &key, &value, BPF_NOEXIST);
> -       if (sotype == SOCK_STREAM) {
> -               if (!err || errno != EOPNOTSUPP)
> -                       FAIL_ERRNO("map_update: expected EOPNOTSUPP");
> -       } else if (err)
> -               FAIL_ERRNO("map_update: expected success");

Initially I thought AF_UNIX still exercised this path but it was removed
in f3de1cf621f7.  The leftover in family_str() was a bit confusing, so please
follow up on bpf-next.

Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>

* Re: [PATCH bpf v2 2/4] selftests/bpf: Ensure UDP sockets are bound
From: Kuniyuki Iwashima @ 2026-06-26 20:47 UTC (permalink / raw)
  To: Michal Luczaj
  Cc: Eric Dumazet, Paolo Abeni, Willem de Bruijn, John Fastabend,
	Jakub Sitnicki, Jiayuan Chen, David S. Miller, Jakub Kicinski,
	Simon Horman, Alexei Starovoitov, Cong Wang, Daniel Borkmann,
	Andrii Nakryiko, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Martin KaFai Lau, Song Liu, Yonghong Song, Jiri Olsa,
	Emil Tsalapatis, Shuah Khan, netdev, bpf, linux-kernel,
	linux-kselftest
In-Reply-To: <20260626-sockmap-lookup-udp-leak-v2-2-7e7e201c951a@rbox.co>

On Fri, Jun 26, 2026 at 1:37 PM Michal Luczaj <mhal@rbox.co> wrote:
>
> Update sockmap_basic tests to bind sockets before they are used. This
> accommodates the recent change in sockmap that rejects unbound UDP sockets.
>
> Signed-off-by: Michal Luczaj <mhal@rbox.co>

nit: this should be patch 1.

Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>

* Re: [PATCH bpf v2 1/4] bpf, sockmap: Reject unhashed UDP sockets on sockmap update
From: Kuniyuki Iwashima @ 2026-06-26 20:45 UTC (permalink / raw)
  To: Michal Luczaj
  Cc: Eric Dumazet, Paolo Abeni, Willem de Bruijn, John Fastabend,
	Jakub Sitnicki, Jiayuan Chen, David S. Miller, Jakub Kicinski,
	Simon Horman, Alexei Starovoitov, Cong Wang, Daniel Borkmann,
	Andrii Nakryiko, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Martin KaFai Lau, Song Liu, Yonghong Song, Jiri Olsa,
	Emil Tsalapatis, Shuah Khan, netdev, bpf, linux-kernel,
	linux-kselftest
In-Reply-To: <20260626-sockmap-lookup-udp-leak-v2-1-7e7e201c951a@rbox.co>

On Fri, Jun 26, 2026 at 1:37 PM Michal Luczaj <mhal@rbox.co> wrote:
>
> UDP sockets get SOCK_RCU_FREE set when (auto-)bound. This means
> sk_is_refcounted(unbound) = true, while sk_is_refcounted(bound) = false.
>
> Because sockmap accepts unbound UDP sockets, a BPF program can increment a
> socket's refcount via lookup. If the socket is subsequently bound, the
> transition from unbound to bound causes bpf_sk_release() to skip the
> decrement of the refcount, causing a memory leak.
>
> unreferenced object 0xffff88810bc2eb40 (size 1984):
>   comm "test_progs", pid 2451, jiffies 4295320596
>   hex dump (first 32 bytes):
>     7f 00 00 01 7f 00 00 01 d2 04 1b b7 04 d2 00 00  ................
>     02 00 01 40 00 00 00 00 00 00 00 00 00 00 00 00  ...@............
>   backtrace (crc bdee079d):
>     kmem_cache_alloc_noprof+0x557/0x660
>     sk_prot_alloc+0x69/0x240
>     sk_alloc+0x30/0x460
>     inet_create+0x2ce/0xf80
>     __sock_create+0x25b/0x5c0
>     __sys_socket+0x119/0x1d0
>     __x64_sys_socket+0x72/0xd0
>     do_syscall_64+0xa1/0x5f0
>     entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> Instead of special-casing for refcounted sockets, reject unhashed UDP
> sockets during sockmap updates, as there is no benefit to supporting those.
> This effectively reverts the commit under Fixes, with two exceptions:
>
> 1. sock_map_sk_state_allowed() maintains a fall-through `return true`.
> 2. In the spirit of commit b8b8315e39ff ("bpf, sockmap: Remove unhash
>    handler for BPF sockmap usage"), the proto::unhash BPF handler is not
>    reintroduced.
>
> Historical note: this issue is related to commit 67312adc96b5 ("bpf: reject
> unhashed sockets in bpf_sk_assign").
>
> Fixes: 0c48eefae712 ("sock_map: Lift socket state restriction for datagram sockets")
> Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
> Signed-off-by: Michal Luczaj <mhal@rbox.co>

Looks good, thanks !

Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
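
For readers following the thread, a toy userspace model of the imbalance described in the commit message. This is not kernel code: `struct toy_sock` and the `toy_*()` helpers are hypothetical stand-ins for the socket refcount, the SOCK_RCU_FREE flag, and the lookup/bind/release steps.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the imbalance, not kernel code: all names are
 * hypothetical stand-ins for the socket fields and helpers involved.
 */
struct toy_sock {
	int refcnt;
	bool rcu_free;	/* stands in for SOCK_RCU_FREE */
};

/* Mirrors the idea of sk_is_refcounted(): RCU-freed sockets are used
 * under RCU protection without taking a reference.
 */
bool toy_is_refcounted(const struct toy_sock *sk)
{
	return !sk->rcu_free;
}

void toy_lookup(struct toy_sock *sk)
{
	if (toy_is_refcounted(sk))
		sk->refcnt++;
}

void toy_bind(struct toy_sock *sk)
{
	sk->rcu_free = true;	/* UDP sets the flag when (auto-)bound */
}

void toy_release(struct toy_sock *sk)
{
	if (toy_is_refcounted(sk)) {
		assert(sk->refcnt > 0);
		sk->refcnt--;
	}
}

/* Returns the number of references leaked by lookup-bind-release. */
int toy_leak_demo(bool bound_before_lookup)
{
	struct toy_sock sk = { .refcnt = 1, .rcu_free = false };

	if (bound_before_lookup)
		toy_bind(&sk);
	toy_lookup(&sk);
	toy_bind(&sk);
	toy_release(&sk);
	return sk.refcnt - 1;
}
```

In this model `toy_leak_demo(false)` leaves one reference behind, while a socket that is already bound at lookup time stays balanced. That is the case the fix keeps by rejecting unhashed UDP sockets on update.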

* Re: [PATCH bpf 1/2] bpf, sockmap: Don't leak UDP socks on lookup-bind-release
From: Michal Luczaj @ 2026-06-26 20:42 UTC (permalink / raw)
  To: Jakub Sitnicki, Kuniyuki Iwashima
  Cc: Willem de Bruijn, John Fastabend, Jiayuan Chen, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Alexei Starovoitov, Cong Wang, Daniel Borkmann, Andrii Nakryiko,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, Martin KaFai Lau,
	Song Liu, Yonghong Song, Jiri Olsa, Emil Tsalapatis, Shuah Khan,
	netdev, bpf, linux-kernel, linux-kselftest
In-Reply-To: <87o6gyyjxk.fsf@cloudflare.com>

On 6/25/26 12:48, Jakub Sitnicki wrote:
> On Wed, Jun 24, 2026 at 02:39 PM -07, Kuniyuki Iwashima wrote:
>> On Wed, Jun 24, 2026 at 2:33 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
>>> ...
>>> Setting SOCK_RCU_FREE itself should not cause a problem, but I think
>>> we should take a step back.
>>>
>>> AFAIU, 0c48eefae712 was to allow putting AF_UNIX SOCK_DGRAM sockets
>>> into sockmap, not to allow using unconnected UDP sockets in sk_lookup etc.
>>>
>>> Actually, v4 of the patch was implemented as such but did not get any feedback,
>>> https://lore.kernel.org/bpf/20210508220835.53801-9-xiyou.wangcong@gmail.com/#t
>>>
>>> ... and v5 (the final commit) somehow removed the restriction for unconnected
>>> UDP socket as well.
>>> https://lore.kernel.org/bpf/20210704190252.11866-3-xiyou.wangcong@gmail.com/
>>>
>>> Given the initial use case, sockmap redirect, is still blocked by
>>> TCP_ESTABLISHED
>>> check in sock_map_redirect_allowed(), I feel there is no point in supporting
>>> unconnected UDP sockets in sockmap.  It cannot get any skb from anywhere
>>> (without buggy sk_lookup).
>>
>> s/unconnected/unhashed/g :)
> 
> Rejecting unhashed UDP sockets on insert to sockmap SGTM.
> It is also in line with disable-problematic-cases strategy.

OK, here's v2 with the sock_map_sk_state_allowed() check reintroduced:
https://lore.kernel.org/bpf/20260626-sockmap-lookup-udp-leak-v2-0-7e7e201c951a@rbox.co/

Thanks,
Michal


* Re: [RFC PATCH 1/2] landlock: fix TCP Fast Open connection bypass
From: Mickaël Salaün @ 2026-06-26 20:40 UTC (permalink / raw)
  To: Matthieu Buffet
  Cc: Bryam Vargas, Günther Noack, linux-security-module,
	Mikhail Ivanov, Paul Moore, Eric Dumazet, Neal Cardwell,
	linux-kernel, netdev
In-Reply-To: <20260617180526.15627-2-matthieu@buffet.re>

Thanks Matthieu, could you please rebase this series on the master
branch (especially on top of your UDP changes)?

This patch will be useful for backports though.

On Wed, Jun 17, 2026 at 08:05:23PM +0200, Matthieu Buffet wrote:
> The documentation of the socket_connect() LSM hook states that it
> controls connecting a socket to a remote address. It has not been the
> case since the addition of TCP Fast Open (RFC 7413) support, which allows
> opening a TCP connection (thus, setting a socket's destination address)
> via the MSG_FASTOPEN flag passed to sendto()/sendmsg()/sendmmsg(). The
> problem then got duplicated into MPTCP.
> 
> Landlock did not take it into account when its TCP support was added,
> leaving a bypass of TCP connect policy.
> 
> Ideally a call to the LSM hook would be added in the fastopen code path,
> in order to fix this generically. But connect() hooks are designed to run
> with the socket locked, unlike sendmsg() hooks.
> 
> Closes: https://github.com/landlock-lsm/linux/issues/41
> Fixes: fff69fb03dde ("landlock: Support network rules with TCP bind and connect")
> Signed-off-by: Matthieu Buffet <matthieu@buffet.re>
> ---
>  security/landlock/net.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/security/landlock/net.c b/security/landlock/net.c
> index 4ee4002a8f56..a2375762c18b 100644
> --- a/security/landlock/net.c
> +++ b/security/landlock/net.c
> @@ -246,9 +246,26 @@ static int hook_socket_connect(struct socket *const sock,
>  					   access_request);
>  }
>  
> +static int hook_socket_sendmsg(struct socket *const sock,
> +			       struct msghdr *const msg, const int size)
> +{
> +	struct sockaddr *const address = msg->msg_name;
> +	const int addrlen = msg->msg_namelen;
> +
> +	if (sk_is_tcp(sock->sk) && address != NULL &&
> +	    (msg->msg_flags & MSG_FASTOPEN) != 0) {

This might be a bit better:

  if ((msg->msg_flags & MSG_FASTOPEN) && address && sk_is_tcp(sock->sk))
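
As a standalone sketch, the same condition can be written as a userspace predicate. `sketch_needs_connect_check()` and its parameters are hypothetical, and the boolean `is_tcp` stands in for `sk_is_tcp(sock->sk)`:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>

#ifndef MSG_FASTOPEN
#define MSG_FASTOPEN 0x20000000	/* Linux value, for non-Linux builds */
#endif

/* Hypothetical userspace predicate with the suggested ordering: the
 * flag test first, then the address, then the socket type.
 */
bool sketch_needs_connect_check(unsigned int msg_flags, const void *address,
				bool is_tcp)
{
	return (msg_flags & MSG_FASTOPEN) && address && is_tcp;
}

void sketch_needs_connect_check_demo(void)
{
	int dummy_addr = 0;

	assert(sketch_needs_connect_check(MSG_FASTOPEN, &dummy_addr, true));
	assert(!sketch_needs_connect_check(0, &dummy_addr, true));
	assert(!sketch_needs_connect_check(MSG_FASTOPEN, NULL, true));
	assert(!sketch_needs_connect_check(MSG_FASTOPEN, &dummy_addr, false));
}
```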

> +		return current_check_access_socket(
> +			sock, address, addrlen,
> +			LANDLOCK_ACCESS_NET_CONNECT_TCP);
> +	}
> +
> +	return 0;
> +}
> +
>  static struct security_hook_list landlock_hooks[] __ro_after_init = {
>  	LSM_HOOK_INIT(socket_bind, hook_socket_bind),
>  	LSM_HOOK_INIT(socket_connect, hook_socket_connect),
> +	LSM_HOOK_INIT(socket_sendmsg, hook_socket_sendmsg),
>  };
>  
>  __init void landlock_add_net_hooks(void)
> -- 
> 2.47.3
> 
> 
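
For reference, the reordered condition suggested above can be modelled in
userspace as a small predicate. This is only a sketch under stated
assumptions: sendmsg_needs_connect_check() and its parameters are
hypothetical names for illustration, not part of Landlock or the patch. The
fallback MSG_FASTOPEN value is the Linux UAPI constant, supplied in case the
libc headers do not define it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>

#ifndef MSG_FASTOPEN
#define MSG_FASTOPEN 0x20000000	/* Linux value, fallback for older headers */
#endif

/*
 * Hypothetical userspace model of the hook's condition: a sendmsg()-style
 * call only behaves like connect() when the caller asked for TCP Fast Open,
 * supplied a destination address, and the socket is TCP.
 * The flag test comes first because it is the cheapest check and is false
 * for the vast majority of sends.
 */
static bool sendmsg_needs_connect_check(int msg_flags, const void *msg_name,
					bool is_tcp)
{
	return (msg_flags & MSG_FASTOPEN) && msg_name != NULL && is_tcp;
}
```

All three conditions must hold for the connect policy to apply; an ordinary
send on an already-connected TCP socket (no MSG_FASTOPEN, or no address)
falls through to "return 0" as in the quoted hunk.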


* [PATCH bpf v2 2/4] selftests/bpf: Ensure UDP sockets are bound
From: Michal Luczaj @ 2026-06-26 20:36 UTC (permalink / raw)
  To: Eric Dumazet, Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn,
	John Fastabend, Jakub Sitnicki, Jiayuan Chen, David S. Miller,
	Jakub Kicinski, Simon Horman, Alexei Starovoitov, Cong Wang,
	Daniel Borkmann, Andrii Nakryiko, Eduard Zingerman,
	Kumar Kartikeya Dwivedi, Martin KaFai Lau, Song Liu,
	Yonghong Song, Jiri Olsa, Emil Tsalapatis, Shuah Khan
  Cc: netdev, bpf, linux-kernel, linux-kselftest, Michal Luczaj
In-Reply-To: <20260626-sockmap-lookup-udp-leak-v2-0-7e7e201c951a@rbox.co>

Update sockmap_basic tests to bind sockets before they are used. This
accommodates the recent change in sockmap that rejects unbound UDP sockets.

Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
 tools/testing/selftests/bpf/prog_tests/sockmap_basic.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
index cb3229711f93..2d22a9058a8e 100644
--- a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
+++ b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
@@ -853,7 +853,7 @@ static void test_sockmap_many_socket(void)
 		return;
 	}
 
-	udp = xsocket(AF_INET, SOCK_DGRAM | SOCK_NONBLOCK, 0);
+	udp = socket_loopback(AF_INET, SOCK_DGRAM | SOCK_NONBLOCK);
 	if (udp < 0) {
 		close(dgram);
 		close(tcp);
@@ -922,7 +922,7 @@ static void test_sockmap_many_maps(void)
 		return;
 	}
 
-	udp = xsocket(AF_INET, SOCK_DGRAM | SOCK_NONBLOCK, 0);
+	udp = socket_loopback(AF_INET, SOCK_DGRAM | SOCK_NONBLOCK);
 	if (udp < 0) {
 		close(dgram);
 		close(tcp);
@@ -993,7 +993,7 @@ static void test_sockmap_same_sock(void)
 		return;
 	}
 
-	udp = xsocket(AF_INET, SOCK_DGRAM | SOCK_NONBLOCK, 0);
+	udp = socket_loopback(AF_INET, SOCK_DGRAM | SOCK_NONBLOCK);
 	if (udp < 0) {
 		close(dgram);
 		close(tcp);

-- 
2.54.0
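
To illustrate what "bound" means for the tests above, here is a minimal
userspace sketch of creating a UDP socket and binding it to the IPv4
loopback address with a kernel-chosen port. udp_bound_loopback() and
udp_local_port() are hypothetical helpers written for this note. They are
not the selftests' socket_loopback() implementation, which is not shown in
this patch.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: return a UDP socket bound to 127.0.0.1, or -1. */
static int udp_bound_loopback(void)
{
	struct sockaddr_in addr;
	int fd;

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	addr.sin_port = 0;	/* let the kernel pick an ephemeral port */

	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		close(fd);
		return -1;
	}

	return fd;
}

/* Hypothetical helper: local port in host byte order, or -1 on error. */
static int udp_local_port(int fd)
{
	struct sockaddr_in addr;
	socklen_t len = sizeof(addr);

	if (getsockname(fd, (struct sockaddr *)&addr, &len) < 0)
		return -1;

	return ntohs(addr.sin_port);
}
```

A socket straight from socket(AF_INET, SOCK_DGRAM, 0) reports local port 0
via getsockname(), while one returned by udp_bound_loopback() reports a
non-zero port. The former is the unbound case that the series makes sockmap
reject.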



* [PATCH bpf v2 0/4] bpf, sockmap: Fix sockmap leaking UDP socks
From: Michal Luczaj @ 2026-06-26 20:36 UTC (permalink / raw)
  To: Eric Dumazet, Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn,
	John Fastabend, Jakub Sitnicki, Jiayuan Chen, David S. Miller,
	Jakub Kicinski, Simon Horman, Alexei Starovoitov, Cong Wang,
	Daniel Borkmann, Andrii Nakryiko, Eduard Zingerman,
	Kumar Kartikeya Dwivedi, Martin KaFai Lau, Song Liu,
	Yonghong Song, Jiri Olsa, Emil Tsalapatis, Shuah Khan
  Cc: netdev, bpf, linux-kernel, linux-kselftest, Michal Luczaj

Fix for UDP sockets getting leaked during sockmap lookup/release.
Accompanied by selftests updates.

Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
Changes in v2:
- selftest: drop the original, adapt old tests
- fix: change approach to rejecting unbound UDP [Kuniyuki]
- Link to v1: https://patch.msgid.link/20260623-sockmap-lookup-udp-leak-v1-0-05804f9308e4@rbox.co

---
Michal Luczaj (4):
      bpf, sockmap: Reject unhashed UDP sockets on sockmap update
      selftests/bpf: Ensure UDP sockets are bound
      selftests/bpf: Adapt sockmap update error handling
      selftests/bpf: Fail unbound UDP on sockmap update

 net/core/sock_map.c                                     |  2 ++
 tools/testing/selftests/bpf/prog_tests/sockmap_basic.c  |  6 +++---
 tools/testing/selftests/bpf/prog_tests/sockmap_listen.c | 17 +++++++++--------
 tools/testing/selftests/bpf/test_maps.c                 |  6 +++---
 4 files changed, 17 insertions(+), 14 deletions(-)
---
base-commit: 26490a375cb9be9bac96b5171610fd85ca6c2305
change-id: 20260617-sockmap-lookup-udp-leak-bc4e5c5481d7

Best regards,
--  
Michal Luczaj <mhal@rbox.co>



* [PATCH bpf v2 4/4] selftests/bpf: Fail unbound UDP on sockmap update
From: Michal Luczaj @ 2026-06-26 20:36 UTC (permalink / raw)
  To: Eric Dumazet, Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn,
	John Fastabend, Jakub Sitnicki, Jiayuan Chen, David S. Miller,
	Jakub Kicinski, Simon Horman, Alexei Starovoitov, Cong Wang,
	Daniel Borkmann, Andrii Nakryiko, Eduard Zingerman,
	Kumar Kartikeya Dwivedi, Martin KaFai Lau, Song Liu,
	Yonghong Song, Jiri Olsa, Emil Tsalapatis, Shuah Khan
  Cc: netdev, bpf, linux-kernel, linux-kselftest, Michal Luczaj
In-Reply-To: <20260626-sockmap-lookup-udp-leak-v2-0-7e7e201c951a@rbox.co>

sockmap now rejects unbound UDP sockets. Adjust test_maps.

This effectively reverts commit c39aa2159974 ("bpf, selftests: Fix
test_maps now that sockmap supports UDP").

Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
 tools/testing/selftests/bpf/test_maps.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
index c32da7bd8be2..81cd5d0d69c1 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -759,12 +759,12 @@ static void test_sockmap(unsigned int tasks, void *data)
 		goto out_sockmap;
 	}
 
-	/* Test update with unsupported UDP socket */
+	/* Test update with unsupported unbound UDP socket */
 	udp = socket(AF_INET, SOCK_DGRAM, 0);
 	i = 0;
 	err = bpf_map_update_elem(fd, &i, &udp, BPF_ANY);
-	if (err) {
-		printf("Failed socket update SOCK_DGRAM '%i:%i'\n",
+	if (!err) {
+		printf("Failed allowed unbound SOCK_DGRAM socket update '%i:%i'\n",
 		       i, udp);
 		goto out_sockmap;
 	}

-- 
2.54.0

