* Re: [PATCH net v2] tipc: fix out-of-bounds read in broadcast Gap ACK blocks
From: Sam P @ 2026-06-25 10:00 UTC (permalink / raw)
To: Tung Quang Nguyen
Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, netdev@vger.kernel.org,
tipc-discussion@lists.sourceforge.net,
linux-kernel@vger.kernel.org, Jon Maloy
In-Reply-To: <GV1P189MB19881AD711511094760906DCC6EC2@GV1P189MB1988.EURP189.PROD.OUTLOOK.COM>
On 25/06/2026 10:23, Tung Quang Nguyen wrote:
>> diff --git a/net/tipc/bcast.c b/net/tipc/bcast.c index
>> 76a1585d3f6b..08637c3c9db0 100644
>> --- a/net/tipc/bcast.c
>> +++ b/net/tipc/bcast.c
>> @@ -497,11 +497,12 @@ void tipc_bcast_ack_rcv(struct net *net, struct
>> tipc_link *l,
>> */
>> int tipc_bcast_sync_rcv(struct net *net, struct tipc_link *l,
>> struct tipc_msg *hdr,
>> - struct sk_buff_head *retrq)
>> + struct sk_buff_head *retrq, bool *valid)
>> {
>> struct sk_buff_head *inputq = &tipc_bc_base(net)->inputq;
>> struct tipc_gap_ack_blks *ga;
>> struct sk_buff_head xmitq;
>> + u16 glen;
>
> Move this variable declaration to the bottom to follow reverse xmas tree style.
>
>> int rc = 0;
>>
>> __skb_queue_head_init(&xmitq);
>> @@ -510,13 +511,18 @@ int tipc_bcast_sync_rcv(struct net *net, struct
>> tipc_link *l,
>> if (msg_type(hdr) != STATE_MSG) {
>> tipc_link_bc_init_rcv(l, hdr);
>> } else if (!msg_bc_ack_invalid(hdr)) {
>> - tipc_get_gap_ack_blks(&ga, l, hdr, false);
>> - if (!sysctl_tipc_bc_retruni)
>> - retrq = &xmitq;
>> - rc = tipc_link_bc_ack_rcv(l, msg_bcast_ack(hdr),
>> - msg_bc_gap(hdr), ga, &xmitq,
>> - retrq);
>> - rc |= tipc_link_bc_sync_rcv(l, hdr, &xmitq);
>> + glen = tipc_get_gap_ack_blks(&ga, l, hdr, false);
>> + if (glen > msg_data_sz(hdr)) {
>> + /* Malformed Gap ACK blocks; caller drops the msg */
>> + *valid = false;
>> + } else {
>> + if (!sysctl_tipc_bc_retruni)
>> + retrq = &xmitq;
>> + rc = tipc_link_bc_ack_rcv(l, msg_bcast_ack(hdr),
>> + msg_bc_gap(hdr), ga, &xmitq,
>> + retrq);
>> + rc |= tipc_link_bc_sync_rcv(l, hdr, &xmitq);
>> + }
>> }
>> tipc_bcast_unlock(net);
>>
>> diff --git a/net/tipc/bcast.h b/net/tipc/bcast.h index
>> 2d9352dc7b0e..55d17b5413e1 100644
>> --- a/net/tipc/bcast.h
>> +++ b/net/tipc/bcast.h
>> @@ -97,7 +97,7 @@ void tipc_bcast_ack_rcv(struct net *net, struct tipc_link *l,
>> struct tipc_msg *hdr);
>> int tipc_bcast_sync_rcv(struct net *net, struct tipc_link *l,
>> struct tipc_msg *hdr,
>> - struct sk_buff_head *retrq);
>> + struct sk_buff_head *retrq, bool *valid);
>> int tipc_nl_add_bc_link(struct net *net, struct tipc_nl_msg *msg,
>> struct tipc_link *bcl);
>> int tipc_nl_bc_link_set(struct net *net, struct nlattr *attrs[]); diff --git
>> a/net/tipc/node.c b/net/tipc/node.c index 97aa970a0d83..2887f94ee28f
>> 100644
>> --- a/net/tipc/node.c
>> +++ b/net/tipc/node.c
>> @@ -1831,12 +1831,13 @@ static void tipc_node_mcast_rcv(struct tipc_node
>> *n) }
>>
>> static void tipc_node_bc_sync_rcv(struct tipc_node *n, struct tipc_msg *hdr,
>> - int bearer_id, struct sk_buff_head *xmitq)
>> + int bearer_id, struct sk_buff_head *xmitq,
>> + bool *valid)
>> {
>> struct tipc_link *ucl;
>> int rc;
>>
>> - rc = tipc_bcast_sync_rcv(n->net, n->bc_entry.link, hdr, xmitq);
>> + rc = tipc_bcast_sync_rcv(n->net, n->bc_entry.link, hdr, xmitq, valid);
>
> 'valid' needs to be checked after this call. Then, return immediately if it is false.
>
>>
>> if (rc & TIPC_LINK_DOWN_EVT) {
>> tipc_node_reset_links(n);
>> @@ -2140,12 +2141,18 @@ void tipc_rcv(struct net *net, struct sk_buff *skb,
>> struct tipc_bearer *b)
>>
>> /* Ensure broadcast reception is in synch with peer's send state */
>> if (unlikely(usr == LINK_PROTOCOL)) {
>> + bool valid = true;
>> +
>> if (unlikely(skb_linearize(skb))) {
>> tipc_node_put(n);
>> goto discard;
>> }
>> hdr = buf_msg(skb);
>> - tipc_node_bc_sync_rcv(n, hdr, bearer_id, &xmitq);
>> + tipc_node_bc_sync_rcv(n, hdr, bearer_id, &xmitq, &valid);
>> + if (!valid) {
>> + tipc_node_put(n);
>> + goto discard;
>> + }
>> } else if (unlikely(tipc_link_acked(n->bc_entry.link) != bc_ack)) {
>> tipc_bcast_ack_rcv(net, n->bc_entry.link, hdr);
>> }
>>
>> base-commit: a986fde914d88af47eb78fd29c5d1af7952c3500
>> --
>> 2.54.0
>
Thanks for the review, I'll address this in a v3.
^ permalink raw reply
* Re: [PATCH net] nfc: nci: fix uninit-value in the RF discover/activated NTF handlers
From: Sam P @ 2026-06-25 10:04 UTC (permalink / raw)
To: Simon Horman
Cc: david, davem, edumazet, kuba, pabeni, oe-linux-nfc, netdev,
linux-kernel, stable
In-Reply-To: <20260625091828.1170615-2-horms@kernel.org>
On 25/06/2026 10:18, Simon Horman wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> Full review at: https://netdev-ai.bots.linux.dev/sashiko/
> ---
> nfc: nci: fix uninit-value in the RF discover/activated NTF handlers
>
> This zero-initialises the on-stack nci_rf_discover_ntf and
> nci_rf_intf_activated_ntf structs in the RF discover and interface
> activated NTF handlers so that the technology- and activation-specific
> unions read back as zero when the corresponding length fields are zero.
>
> Link: https://lore.kernel.org/netdev/20260623172109.1105965-2-horms@kernel.org/
>
>> - activated: with rf_tech_specific_params_len == 0 the same union is read
>> via nci_target_auto_activated(); with activation_params_len == 0 the
>> activation_params union is read by nci_store_ats_nfc_iso_dep() into
>> ndev->target_ats.
>
> [Low]
> Does the third scenario in this paragraph match the code?
>
> Looking at nci_store_ats_nfc_iso_dep():
>
> static int nci_store_ats_nfc_iso_dep(struct nci_dev *ndev,
> const struct nci_rf_intf_activated_ntf *ntf)
> {
> ndev->target_ats_len = 0;
>
> if (ntf->activation_params_len <= 0)
> return NCI_STATUS_OK;
>
> if (ntf->activation_params.nfca_poll_iso_dep.rats_res_len > NFC_ATS_MAXSIZE) {
> ...
>
> The function returns NCI_STATUS_OK before any access to the
> activation_params union when activation_params_len is 0, so the union
> is not read in that case.
>
> The sibling nci_store_general_bytes_nfc_dep() has the same early return
> on activation_params_len <= 0.
>
> The two rf_tech_specific_params_len == 0 cases listed earlier in the
> commit message do match the code (nci_add_new_protocol() and
> nci_target_auto_activated() unconditionally pass
> &ntf.rf_tech_specific_params even when rf_tech_specific_params_len is
> 0), and the KMSAN trace quoted in the commit message only mentions the
> nci_add_new_protocol() path.
>
> The ntf = {} change still looks correct and necessary for the
> rf_tech_specific_params union. Could the third bullet be dropped or
> adjusted so the commit message reflects only the paths that actually
> read uninitialised bytes?
Thanks for sharing the review.
This is valid feedback, I'll fix up the commit message in a v2.
^ permalink raw reply
* Re: [PATCH v2] net: mdio: airoha: fix reset control leak in error path
From: Larysa Zaremba @ 2026-06-25 10:10 UTC (permalink / raw)
To: Wentao Liang
Cc: Andrew Lunn, Heiner Kallweit, Russell King, David S . Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, netdev, linux-kernel
In-Reply-To: <20260622115403.39772-1-vulab@iscas.ac.cn>
Please, specify the target tree (probably, net) via a subject prefix.
Also, looks like you are missing to many CCs [0]
[0]
https://netdev-ctrl.bots.linux.dev/logs/build/1114710/14639274/cc_maintainers/summary
On Mon, Jun 22, 2026 at 07:54:03PM +0800, Wentao Liang wrote:
> In airoha_mdio_probe(), after calling reset_control_deassert(),
> if clk_set_rate() fails, the function returns immediately without
> calling reset_control_assert(). This leaves the reset line
> deasserted and causes a reference count leak on shared reset
> controllers.
>
Sashiko correctly points out that since the reset controller is exclusive,
there is no refcount leak. [1]
So the problem is missing rstc->rcdev->ops->assert(rstc->rcdev, rstc->id) call.
It would be great if you could describe, what problem this can cause.
> Fix this by reorganizing the error handling to use a goto label,
> ensuring reset_control_assert() is called on all error paths
> before returning.
>
> Also add error checking for reset_control_deassert().
Sashiko correctly points out you do not actually do this, which is fine, just
update the commit message. [1]
[1] https://sashiko.dev/#/patchset/20260622115403.39772-1-vulab%40iscas.ac.cn
> Fixes: 67e3ba978361 ("net: mdio: Add MDIO bus controller for Airoha AN7583")
> Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
> ---
> drivers/net/mdio/mdio-airoha.c | 12 +++++++-----
> 1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/net/mdio/mdio-airoha.c b/drivers/net/mdio/mdio-airoha.c
> index 52e7475121ea..4c1b2415687c 100644
> --- a/drivers/net/mdio/mdio-airoha.c
> +++ b/drivers/net/mdio/mdio-airoha.c
> @@ -246,15 +246,17 @@ static int airoha_mdio_probe(struct platform_device *pdev)
>
> ret = clk_set_rate(priv->clk, freq);
> if (ret)
> - return ret;
> + goto err_reset_assert;
>
> ret = devm_of_mdiobus_register(dev, bus, dev->of_node);
> - if (ret) {
> - reset_control_assert(priv->reset);
> - return ret;
> - }
> + if (ret)
> + goto err_reset_assert;
>
> return 0;
> +
> +err_reset_assert:
> + reset_control_assert(priv->reset);
> + return ret;
> }
>
> static const struct of_device_id airoha_mdio_dt_ids[] = {
> --
> 2.39.5 (Apple Git-154)
>
>
^ permalink raw reply
* Re: [PATCH v2 0/7] vmsplice: fix some problems in my previous vmsplice patchset
From: Askar Safin @ 2026-06-25 10:11 UTC (permalink / raw)
To: david
Cc: akpm, avagin, axboe, brauner, collin.funk1, david.laight.linux,
dhowells, fuse-devel, hch, jack, joannelkoong, kernel, linux-api,
linux-fsdevel, linux-kernel, linux-mm, luto, metze, miklos,
netdev, patches, pfalcato, safinaskar, torvalds, val, viro, w,
willy
In-Reply-To: <89ea76b3-e956-4232-8180-ee3929adf905@kernel.org>
"David Hildenbrand (Arm)" <david@kernel.org>:
> I think we concluded that we cannot rip out vmsplice that way at this point, and
> I suspect that Christian will drop that topic branch from -next after -rc1.
I think my patches still have a chance.
On fuse regression: I return EINVAL for particular combination of
flags used by fuse. This causes fuse to fail-back to non-vmsplice
code path. I did Debian code search, and I found none significant
packages, which use same combination of options.
So I think I was able to deal with fuse regression.
On CRIU named fifo "Not supported" regression: it is handled.
On CRIU major performance regression: it is NOT handled. But I still
think my approach is right. (See cover letter for details.)
(I wrote about all these in cover letter for this v2 patchset.)
So all regressions found so far (except for CRIU major performance
regression) are handled.
Other option is to introduce some deprecation period (as
suggested by Andrei Vagin). I can do this, if needed.
--
Askar Safin
^ permalink raw reply
* Re: [PATCH net] net: libwx: fix VMDQ mask for 1-queue mode
From: Larysa Zaremba @ 2026-06-25 10:14 UTC (permalink / raw)
To: Jiawen Wu
Cc: netdev, 'Mengyuan Lou', 'Andrew Lunn',
'David S. Miller', 'Eric Dumazet',
'Jakub Kicinski', 'Paolo Abeni',
'Simon Horman', 'Kees Cook'
In-Reply-To: <062401dd0487$3a3152f0$ae93f8d0$@trustnetic.com>
On Thu, Jun 25, 2026 at 05:44:33PM +0800, Jiawen Wu wrote:
> On Thu, Jun 25, 2026 5:39 PM, Larysa Zaremba wrote:
> > On Thu, Jun 25, 2026 at 05:08:51PM +0800, Jiawen Wu wrote:
> > > In wx_set_vmdq_queues(), the VMDQ mask was not set for the devices not
> > > support WX_FLAG_MULTI_64_FUNC, i.e., NGBE devices. A mask of 0 causes
> > > __ALIGN_MASK(1, ~vmdq->mask) to return 0, which incorrectly sets
> > > q_per_pool to 0 in wx_write_qde().
> > >
> > > Fix the VMDQ 1-queue mask to 0x7F then ensures that __ALIGN_MASK(1,
> > > 0x7F) correctly evaluates to 1.
> >
> > __ALIGN_MASK(1, 0x7F) evaulates to 0x80 (128), not to 1. __ALIGN_MASK(1, 0x7E)
> > evaluates to 1. Maybe you need 0x7D for 2 queues and 0x7E for 1 queue?
>
> Sorry, the commit log is so wrong for that '~' is missing...
> I want to describe that __ALIGN_MASK(1, ~0x7F) evaluates to 1.
>
Then I do not have any further concerns. Given you fix the lack of "~" in the
commit message and change "not support" to "not supporting" above, I approve
this patch.
Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>
> >
> > >
> > > Fixes: c52d4b898901 ("net: libwx: Redesign flow when sriov is enabled")
> > > Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
> > > ---
> > > drivers/net/ethernet/wangxun/libwx/wx_lib.c | 1 +
> > > drivers/net/ethernet/wangxun/libwx/wx_type.h | 1 +
> > > 2 files changed, 2 insertions(+)
> > >
> > > diff --git a/drivers/net/ethernet/wangxun/libwx/wx_lib.c b/drivers/net/ethernet/wangxun/libwx/wx_lib.c
> > > index d042567b8128..814d88d2aee4 100644
> > > --- a/drivers/net/ethernet/wangxun/libwx/wx_lib.c
> > > +++ b/drivers/net/ethernet/wangxun/libwx/wx_lib.c
> > > @@ -1802,6 +1802,7 @@ static bool wx_set_vmdq_queues(struct wx *wx)
> > > rss_i = 4;
> > > }
> > > } else {
> > > + vmdq_m = WX_VMDQ_1Q_MASK;
> > > /* double check we are limited to maximum pools */
> > > vmdq_i = min_t(u16, 8, vmdq_i);
> > >
> > > diff --git a/drivers/net/ethernet/wangxun/libwx/wx_type.h b/drivers/net/ethernet/wangxun/libwx/wx_type.h
> > > index c7befe4cdfe9..65e3e55db1cf 100644
> > > --- a/drivers/net/ethernet/wangxun/libwx/wx_type.h
> > > +++ b/drivers/net/ethernet/wangxun/libwx/wx_type.h
> > > @@ -486,6 +486,7 @@ enum WX_MSCA_CMD_value {
> > >
> > > #define WX_VMDQ_4Q_MASK 0x7C
> > > #define WX_VMDQ_2Q_MASK 0x7E
> > > +#define WX_VMDQ_1Q_MASK 0x7F
> > >
> > > /****************** Manageablility Host Interface defines ********************/
> > > #define WX_HI_MAX_BLOCK_BYTE_LENGTH 256 /* Num of bytes in range */
> > > --
> > > 2.51.0
> > >
> >
>
^ permalink raw reply
* Re: [PATCH iwl v3] ice: retry reading NVM if admin queue returns EBUSY
From: Robert Malz @ 2026-06-25 10:14 UTC (permalink / raw)
To: Przemek Kitszel
Cc: Simon Horman, Grzegorz Nitka, anthony.l.nguyen, intel-wired-lan,
netdev
In-Reply-To: <5658849b-0425-4132-ba32-5801e2907c60@intel.com>
Hey Przemek,
Thanks a lot for the feedback.
I was sure that we use ICE_NVM_TIMEOUT (180s) as a timeout every time
(ice_acquire_nvm) but your proposal made me rethink it a little.
First of all, the datasheet for E810 specifies the timeout as: "As an
input, the software might specify timeout longer than the default
taken for this resource, and up to one minute."
180s is greater than one minute so I took a look into AQC logs:
[ 110.698471] ice 0000:05:00.0: CQ CMD: opcode 0x0008, flags 0x2000,
datalen 0x0000, retval 0x0000
[ 110.698474] ice 0000:05:00.0: cookie (h,l) 0x00000000 0x00000000
[ 110.698477] ice 0000:05:00.0: param (0,1) 0x00010001 0x0002BF20
[ 110.698480] ice 0000:05:00.0: addr (h,l) 0x00000000 0x00000000
[ 110.698645] ice 0000:05:00.0: ATQ: desc and buffer writeback:
[ 110.698648] ice 0000:05:00.0: CQ CMD: opcode 0x0008, flags 0x2003,
datalen 0x0000, retval 0x0000
[ 110.698651] ice 0000:05:00.0: cookie (h,l) 0x00000000 0x00000000
[ 110.698654] ice 0000:05:00.0: param (0,1) 0x00010001 0x00000BB8
[ 110.698657] ice 0000:05:00.0: addr (h,l) 0x00000000 0x00000000
Based on the above, the driver requested a 0x0002BF20 timeout (180 000
ms) but the FW returned only 0x00000BB8 (3s).
I'm assuming this is expected behavior since the maximum timeout for
NVM read should be 60,000 ms.
If changing the timeout requested by the driver to 60s for read ops is
handled correctly by the FW and the FW respects that lock, the retry
patch submitted in this email thread might not be required at all.
Let me quickly prepare a new patch and test it. I'll update this
thread once I have a results.
Regards,
Robert
On Thu, Jun 25, 2026 at 9:51 AM Przemek Kitszel
<przemyslaw.kitszel@intel.com> wrote:
>
> >> This is an AI-generated review of your patch. The human sending this
>
>
> >> [Severity: Low]
> >> Does this check allow an extra retry execution? Since retry_cnt starts at 0,
> >> using a strictly greater-than check allows the loop to retry 4 times
> >> (5 total executions) instead of the 3 executions intended by
> >> ICE_SQ_SEND_MAX_EXECUTE.
> >>
> >>> + break;
> >>> + ice_debug(hw, ICE_DBG_NVM,
> >>> + "NVM read EBUSY error, retry %d\n",
> >>> + retry_cnt + 1);
> >>> + last_cmd = false;
> >>> + ice_release_nvm(hw);
> >>> + msleep(ICE_SQ_SEND_DELAY_TIME_MS);
> >>
> > The review is correct that this allows one extra execution. However, I
> > intentionally kept this verbatim from the OOT driver (1.15.4) to
>
> This OOT change was made entirely by Grzegorz, for E825C. Would be great
> to acknowledge his authorship (+ SoB ofc, below).
> Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
>
> What was the platform that you have encountered the bug on?
>
> > retain the significant test exposure it has there. The absolute worst
> > case is one harmless extra retry before giving up. If you strongly
> > prefer strict adherence to ICE_SQ_SEND_MAX_EXECUTE, I can update the
> > condition in a v4.
>
> Please update (to have one attempt less).
>
> >
> >> [Severity: High]
> >> Does dropping the NVM lock here violate the caller's assumption of mutual
> >> exclusion?
> >>
> >> Callers of ice_read_flat_nvm() typically acquire the NVM lock beforehand to
> >> ensure an atomic read of multiple sectors without interleaved writes. By
> >> explicitly dropping the lock during the msleep() window, another thread, VF,
> >> or PF could acquire the NVM lock and modify the NVM before the read resumes,
> >> potentially resulting in reading torn or inconsistent NVM data.
> >>
> >>> + status = ice_acquire_nvm(hw, ICE_RES_READ);
> >>> + if (status)
> >>> + break;
> >>
> >
> > I don't think this is an issue. The NVM resource (req_res 0x0008) is
> > an ownership lease, not a content snapshot. The FW grants it with a
> > hold timeout (see the timeout field handling in ice_aq_req_res). It
> > never guaranteed a point-in-time view of the flash.
> > Very similar behavior already exists in ice_devlink_nvm_snapshot(),
> > which reads the NVM in chunks via ice_read_flat_nvm() and deliberately
> > acquires/releases the lock between chunks (with a comment explaining
> > the FW reclaim).
> > Additionally, the lock is only dropped on the failure path, where
> > neither bytes_read nor offset have been advanced. The driver simply
> > re-reads the failed chunk after re-acquiring. So no data that is
> > returned to the caller is committed across the drop window.
> > Overall the lock drop does not weaken any guarantee callers actually
> > have. If we wanted to introduce a true snapshot guarantee, that would
> > probably require different approach and separate patch.
> >
> >
> >> [Severity: High]
> >> Can this lead to an unbalanced NVM resource lock release and potential data
> >> corruption?
> >>
> >> If ice_acquire_nvm() fails here, the loop breaks and returns an error with the
> >> lock currently unheld. However, callers like ice_read_flash_module() are
> >> unaware of this internal lock state change and will unconditionally call
> >> ice_release_nvm() upon failure:
> >>
> >> drivers/net/ethernet/intel/ice/ice_nvm.c:ice_read_flash_module() {
> >> status = ice_acquire_nvm(hw, ICE_RES_READ);
> >> if (status)
> >> return status;
> >>
> >> status = ice_read_flat_nvm(hw, start + offset, &length, data, false);
> >>
> >> ice_release_nvm(hw);
> >> ...
> >>
> >> Because firmware tracks resource locks per-PF rather than per-OS-thread, this
> >> errant second release could forcefully strip the lock from another thread on
> >> the same PF that successfully acquired it, exposing the NVM to concurrent
> >> modification.
> >>
> >
> > Agreed, this might be a real bug, and the one of the three I think is
> > worth investigating.
> > If ice_acquire_nvm() fails after the drop, ice_read_flat_nvm() returns
> > with the lock unheld while callers unconditionally call
> > ice_release_nvm(), so a stray release is issued.
> >
> > On probability, though, the window is very small. Reaching it requires
> > sustained EBUSY across the retry budget plus a failed re-acquire
> > (which itself polls up to ICE_NVM_TIMEOUT), and concurrently another
> > requester taking the lock. Most reads happen during init (ice_probe,
> > and reset/rebuild via ice_init_nvm), and NVM writes only happen on an
> > already initialized driver. The devlink/ethtool nvm_read paths are
> > also exposed, but hitting this race would require precise timing
> > against a concurrent NVM owner on the device.
> >
> > I'd prefer to keep the scope of this patch limited to the EBUSY retry
> > path and not take on the unbalanced-release fix here. A proper fix
> > should change the lock-ownership contract of ice_read_flat_nvm() (on
> > error, the lock must be released by ice_read_flat_nvm(), callers
> > release only on success) and update all callers. Code change sould be
> > simple for all callers but ice_discover_flash_size(), it intentionally
> > holds one lease across a read loop and would need to re-acquire after
> > each expected boundary failure.
> >
> > Given how small the original window is, I'd rather not trade tested
> > OOT behavior for the risk of a complex unbalanced NVM lock fix. I
> > actually have a patch mostly ready that fixes the lock-ownership
> > contract, but I really don't like it. It changes the design of
> > ice_read_flat_nvm(), making it less intuitive for callers. More
> > importantly, I just don't have the resources or test coverage right
> > now to properly verify such change.
> >
> > However, I can modify the failure path for ice_acquire_nvm inside
> > ice_read_flat_nvm. Instead of bailing out immediately, we can just
> > retry it within the existing retry budget. In this case, the
> > probability of leaving ice_read_flat_nvm without holding the lock is
> > reduced even further without needing a refactor.
> >
> > Please let me know what you think about my thought process on this.
>
> I think that both AI-reported issues against the lock are valid
> concerns.
>
> I think that sleep was the actual fix, and re-locking were merely
> a necessity due to their expiration (as you said).
>
> A proper fix would be to just increase lock-timeout to accommodate all
> attempts (and still do the retries&sleep, but without unlocking).
>
>
> >
> >
> >
> >>> + retry_cnt++;
> >>> + } else {
> >>> + bytes_read += read_size;
> >>> + offset += read_size;
> >>> + retry_cnt = 0;
> >>> + }
> >>> } while (!last_cmd);
> >>>
> >>> *length = bytes_read;
> >
> > Thanks,
> > Robert
>
^ permalink raw reply
* [PATCH net-next] net: neigh: avoid calling neigh_forced_gc on every alloc when table is full
From: Vimal Agrawal @ 2026-06-25 10:20 UTC (permalink / raw)
To: netdev; +Cc: kuniyu, edumazet, vimal.agrawal, avimalin
In-Reply-To: <CALkUMdSCpx_ywYCx_ePLdm6yioO1nQWx7sSM=AEgsq0kywHxTw@mail.gmail.com>
Once the neighbour table exceeds gc_thresh3, neigh_forced_gc() is called
on every allocation attempt with no rate limiting. In workloads with mostly
active/reachable entries, the GC walk traverses a large portion of the
neighbour table without reclaiming entries, holding tbl->lock for an
extended period. This causes severe lock contention and allocation
latencies exceeding 16ms under sustained neighbour creation.
Add a pre-lock check in neigh_forced_gc() to skip the GC run if one was
performed within the last second, avoiding repeated full table scans and
lock acquisitions on the hot allocation path.
Profiling of neigh_create() shows ~3 orders of magnitude latency
improvement with this change.
Link:https://lore.kernel.org/netdev/CALkUMdSCpx_ywYCx_ePLdm6yioO1nQWx7sSM=AEgsq0kywHxTw@mail.gmail.com/
Signed-off-by: Vimal Agrawal <vimal.agrawal@sophos.com>
---
net/core/neighbour.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 1349c0eedb64..078842db3c5f 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -260,6 +260,9 @@ static int neigh_forced_gc(struct neigh_table *tbl)
int shrunk = 0;
int loop = 0;
+ if (!time_after(jiffies, READ_ONCE(tbl->last_flush) + HZ))
+ return 0;
+
NEIGH_CACHE_STAT_INC(tbl, forced_gc_runs);
spin_lock_bh(&tbl->lock);
--
2.17.1
^ permalink raw reply related
* Re: [Intel-wired-lan] [PATCH net] igc: Fix RX HW timestamp reporting when NET_RX_BUSY_POLL is disabled
From: Ding Meng @ 2026-06-25 10:33 UTC (permalink / raw)
To: Paul Menzel
Cc: Florian Bezdeka, anthony.l.nguyen, przemyslaw.kitszel,
andrew+netdev, davem, edumazet, kuba, pabeni, jan.kiszka,
intel-wired-lan, linux-kernel, netdev, wq.wang
In-Reply-To: <40abd0b5-7f3f-4cd4-9975-9db4498d15d3@molgen.mpg.de>
Dear Paul,
Thanks for your comments.
On Mon, Jun 22, 2026 at 05:59:53PM +0200, Paul Menzel wrote:
> Am 22.06.26 um 06:13 schrieb Ding Meng via Intel-wired-lan:
> > When CONFIG_NET_RX_BUSY_POLL is deactivated, fetching RX HW timestamps
> > from the NIC no longer works as expected.
>
> Maybe paste some logs/errors, so it can be easier found by people with the
> same issue.
Will do.
> > This occurs because disabling CONFIG_NET_RX_BUSY_POLL disables the
> > SKB NAPI mapping in __skb_mark_napi_id(). Consequently, get_timestamp()
> > fails to perform its driver lookup, and the igc driver's struct
> > net_device_ops::ndo_get_tstamp is never invoked.
> >
> > Instead, get_timestamp() falls back to use shhwtstamps(skb)->hwtstamp,
> > a field that the driver has not populated.
> >
> > Fix this by populating the hwtstamp field with the correct timestamp
> > in the default timer when CONFIG_NET_RX_BUSY_POLL is disabled.
>
> Maybe detail, why the adapter needs to be passed now.
>
> Also, please describe a test case to check the change.
Will do.
> > diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c
> > index 8ac16808023..1da8d7aa76d 100644
> > --- a/drivers/net/ethernet/intel/igc/igc_main.c
> > +++ b/drivers/net/ethernet/intel/igc/igc_main.c
> > @@ -1992,7 +1992,26 @@ static struct sk_buff *igc_build_skb(struct igc_ring *rx_ring,
> > return skb;
> > }
> > -static struct sk_buff *igc_construct_skb(struct igc_ring *rx_ring,
> > +static void igc_construct_skb_timestamps(struct igc_adapter *adapter,
> > + struct sk_buff *skb,
> > + struct igc_xdp_buff *ctx)
> > +{
> > + if (!ctx->rx_ts)
> > + return;
> > +#ifdef CONFIG_NET_RX_BUSY_POLL
>
> Is there a way to do this in C instead of the pre-processor. That way all
> the code gets build tested. (Is there a config with disabled
> NET_RX_BUSY_POLL?)
>
How about defining a function to replace the pre-processor:
static inline bool is_net_rx_busy_poll()
{
#ifdef CONFIG_NET_RX_BUSY_POLL
return true;
#else
return false;
#endif
}
CONFIG_PREEMPT_RT=y && CONFIG_NETCONSOLE=y will cause NET_RX_BUSY_POLL
disabled.
Kind regards,
Ding Meng
^ permalink raw reply
* Re: [PATCH v2 0/7] vmsplice: fix some problems in my previous vmsplice patchset
From: David Hildenbrand (Arm) @ 2026-06-25 10:35 UTC (permalink / raw)
To: Askar Safin
Cc: akpm, avagin, axboe, brauner, collin.funk1, david.laight.linux,
dhowells, fuse-devel, hch, jack, joannelkoong, kernel, linux-api,
linux-fsdevel, linux-kernel, linux-mm, luto, metze, miklos,
netdev, patches, pfalcato, torvalds, val, viro, w, willy
In-Reply-To: <20260625101132.3859505-1-safinaskar@gmail.com>
On 6/25/26 12:11, Askar Safin wrote:
> "David Hildenbrand (Arm)" <david@kernel.org>:
>> I think we concluded that we cannot rip out vmsplice that way at this point, and
>> I suspect that Christian will drop that topic branch from -next after -rc1.
>
> I think my patches still have a chance.
I talked to Christian and it doesn't sound like it.
--
Cheers,
David
^ permalink raw reply
* Re: [PATCH net-next] selftests/xsk: preserve UMEM view in bidi test
From: Maciej Fijalkowski @ 2026-06-25 10:35 UTC (permalink / raw)
To: Jakub Kicinski
Cc: netdev, bpf, magnus.karlsson, stfomichev, pabeni, horms,
tushar.vyavahare, kerneljasonxing
In-Reply-To: <20260624193326.295e3711@kernel.org>
On Wed, Jun 24, 2026 at 07:33:26PM -0700, Jakub Kicinski wrote:
> On Tue, 23 Jun 2026 11:10:08 +0200 Maciej Fijalkowski wrote:
> > Subject: [PATCH net-next] selftests/xsk: preserve UMEM view in bidi test
>
> Do you want it in net? Either way - we'll need a rebase
I have not checked if this has been -net propagated already, but the rule
of thumb on bpf side was that all selftests related effort goes to -next.
Is it different on netdev side?
>
> > Signed-off-by: Maciej Fijalkowski maciej.fijalkowski@intel.com
>
> missing <> around the email
oof.
>
^ permalink raw reply
* Re: [PATCH 5.10] netfilter: nf_log: validate MAC header was set before dumping it
From: Sasha Levin @ 2026-06-25 10:41 UTC (permalink / raw)
To: stable, Greg Kroah-Hartman
Cc: Sasha Levin, Alexander Martyniuk, Pablo Neira Ayuso,
Jozsef Kadlecsik, Florian Westphal, David S. Miller,
Alexey Kuznetsov, Hideaki YOSHIFUJI, Jakub Kicinski,
Patrick McHardy, netfilter-devel, coreteam, netdev, linux-kernel,
Weiming Shi, Xiang Mei
In-Reply-To: <20260624140117.19799-1-alexevgmart@gmail.com>
> [PATCH 5.10] netfilter: nf_log: validate MAC header was set before
> dumping it
>
> --- a/net/ipv4/netfilter/nf_log_ipv4.c
> +++ b/net/ipv4/netfilter/nf_log_ipv4.c
Thanks for the backport - the retarget to nf_log_ipv4.c is right for 5.10.
One gap though: upstream fixed both loggers via the consolidated
nf_log_syslog.c, but in 5.10 the IPv6 logger (net/ipv6/netfilter/
nf_log_ipv6.c) still has the identical unguarded fallback and is left
vulnerable here - which is also Pablo's "why only 5.10?" point.
--
Thanks,
Sasha
^ permalink raw reply
* Re: [PATCH 5.15/6.1/6.6] af_unix: Reject SIOCATMARK on non-stream sockets
From: Sasha Levin @ 2026-06-25 10:41 UTC (permalink / raw)
To: stable, Greg Kroah-Hartman
Cc: Sasha Levin, Alexander Martyniuk, David S. Miller, Jakub Kicinski,
Paolo Abeni, Kuniyuki Iwashima, Jann Horn, Lee Jones, Rao Shoaib,
netdev, linux-kernel, stable, Yuan Tan, Yifan Wu, Juefei Pu,
Xin Liu, Jiexun Wang, Ren Wei
In-Reply-To: <20260624151651.38894-1-alexevgmart@gmail.com>
> [PATCH 5.15/6.1/6.6] af_unix: Reject SIOCATMARK on non-stream sockets
>
> Backport fix for CVE-2026-52928. Reject SIOCATMARK in unix_ioctl()
> for non-stream sockets.
Queued for 6.6, 6.1 and 5.15, thanks!
--
Thanks,
Sasha
^ permalink raw reply
* Re: Please backport bridge multicast exponential field encoding fix series to stable kernels
From: Sasha Levin @ 2026-06-25 10:42 UTC (permalink / raw)
To: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Nikolay Aleksandrov, Ido Schimmel, David Ahern,
Shuah Khan, Andy Roulin, Yong Wang, Petr Machata, stable, Greg KH,
Greg Kroah-Hartman
Cc: Sasha Levin, Ujjal Roy, bridge, Kernel, Kernel, linux-kselftest,
Ujjal Roy
In-Reply-To: <CAE2MWknz4X_gcNo6jkR87Lg8F0zfubkOc4Ujr57CS3aBMWrjEA@mail.gmail.com>
> Please backport the 5-patch bridge multicast exponential field
> encoding series (726fa7da2d8c, 12cfb4ecc471, 95bfd196f0dc,
> e51560f4220a, 529dbe762de0) to the stable kernels.
I tried, but it doesn't apply to 7.1. Could you provide a backport please?
--
Thanks,
Sasha
^ permalink raw reply
* Re: [PATCH net v3 1/2] iov_iter: export iov_iter_restore
From: Christian Brauner @ 2026-06-25 10:43 UTC (permalink / raw)
To: Octavian Purdila
Cc: netdev, Alexander Viro, Andrew Morton, Arseniy Krasnov,
David S. Miller, Eric Dumazet, Eugenio Pérez, Jakub Kicinski,
Jason Wang, kvm, linux-block, linux-fsdevel, linux-kernel,
Michael S. Tsirkin, Paolo Abeni, Simon Horman, Stefan Hajnoczi,
Stefano Garzarella, virtualization, Xuan Zhuo, Jens Axboe
In-Reply-To: <20260622222757.2130402-2-tavip@google.com>
> Export iov_iter_restore so that it can be used by modules.
>
> This is needed by the virtio vsock transport (which can be built as a
> module) to restore the msg_iter state when transmission fails.
>
> Acked-by: Stefano Garzarella <sgarzare@redhat.com>
> Signed-off-by: Octavian Purdila <tavip@google.com>
>
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 273919b16161..f5df63961fb2 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -1491,6 +1491,7 @@ void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state)
> i->__iov -= state->nr_segs - i->nr_segs;
> i->nr_segs = state->nr_segs;
> }
> +EXPORT_SYMBOL_GPL(iov_iter_restore);
At least only export it for the module that really needs it. For
example, see:
EXPORT_SYMBOL_FOR_MODULES(__kernel_write, "autofs4");
--
Christian Brauner <brauner@kernel.org>
^ permalink raw reply
* Re: [PATCH net-next] Documentation: networking: Add a test plan for ethtool pause validation
From: Maxime Chevallier @ 2026-06-25 10:46 UTC (permalink / raw)
To: Andrew Lunn
Cc: Jakub Kicinski, davem, Eric Dumazet, Paolo Abeni, Simon Horman,
Russell King, Heiner Kallweit, Jonathan Corbet, Shuah Khan,
Oleksij Rempel, Vladimir Oltean, Florian Fainelli,
thomas.petazzoni, netdev, linux-kernel, linux-doc
In-Reply-To: <b7de216a-fd1a-42a0-8711-d822a1ad9319@lunn.ch>
Hi Andrew,
On 5/29/26 14:59, Andrew Lunn wrote:
(This discussion was a while ago, but this bit of context should be enough)
> But we also need to consider that for some APIs, we have decided that
> a configuration can be set now, which does not actually apply in our
> current conditions, but it will be stored away for when conditions
> change and it is applicable. The half duplex case could fit that. When
> the link is currently half duplex, you can configure pause, but you
> don't expect it to actually change the current behaviour. It only
> kicks in when the link renegotiates to full duplex sometime in the
> future. We have to also consider this the other way around. The link
> is full duplex and pause is configured by the user. Something happens
> with the LP and the link renegotiates to half duplex. The local end
> should not throw away the configuration, it simply cannot apply it
> given the current situation.
I'm writing the test description for HD with a better formatting, so the
HD test wouldn't be about "are we using pause stuff while in HD" as it
doesn't make sense, but rather "do we correctly store the pause settings
aside for later".
I'm realising that we don't really have an API to report the *true* in-use pause
settings. Taking HD as an example :
# ethtool -s eth2 duplex half
[588209.379363] mvpp2 f4000000.ethernet eth2: Link is Up - 100Mbps/Half - flow control off
# ethtool eth2
[...]
Supported pause frame use: Symmetric Receive-only
Advertised pause frame use: Symmetric Receive-only
Link partner advertised pause frame use: Symmetric Receive-only
# ethtool -a eth2
Autonegotiate: on
RX: off
TX: off
RX negotiated: on
TX negotiated: on
Sure, pause and HD don't make sense, however what I find confusing to some
extent is that the only place we have information about the *actual* pause
settings is the "link is Up" log in dmesg.
Maybe the problem in the above situation is that whoever advertises
half-duplex only modes should also not advertise pause ?
Still, I'm wondering if we should even care about all that actually, HD and
Pause are incompatible, and that's it. If you have any thought on this, let
me know.
Maxime
^ permalink raw reply
* Re: [PATCH bpf 1/2] bpf, sockmap: Don't leak UDP socks on lookup-bind-release
From: Jakub Sitnicki @ 2026-06-25 10:48 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: Michal Luczaj, Willem de Bruijn, John Fastabend, Jiayuan Chen,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Alexei Starovoitov, Cong Wang, Daniel Borkmann,
Andrii Nakryiko, Eduard Zingerman, Kumar Kartikeya Dwivedi,
Martin KaFai Lau, Song Liu, Yonghong Song, Jiri Olsa,
Emil Tsalapatis, Shuah Khan, netdev, bpf, linux-kernel,
linux-kselftest
In-Reply-To: <CAAVpQUAy=EVZcZRmXJPr=neJh7Q+UbYML5L4+nBynVrDPidUkw@mail.gmail.com>
On Wed, Jun 24, 2026 at 02:39 PM -07, Kuniyuki Iwashima wrote:
> On Wed, Jun 24, 2026 at 2:33 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
>>
>> On Wed, Jun 24, 2026 at 2:26 PM Michal Luczaj <mhal@rbox.co> wrote:
>> >
>> > On 6/24/26 22:01, Willem de Bruijn wrote:
>> > > Jakub Sitnicki wrote:
>> > >> On Tue, Jun 23, 2026 at 08:03 PM +02, Michal Luczaj wrote:
>> > >>> UDP sockets get SOCK_RCU_FREE set when (auto-)bound. This means
>> > >>> sk_is_refcounted(unbound) = true, while sk_is_refcounted(bound) = false.
>> > >>>
>> > >>> Because sockmap accepts unbound UDP sockets, a BPF program can increment a
>> > >>> socket's refcount via lookup. If the socket is subsequently bound, the
>> > >>> transition from unbound to bound causes bpf_sk_release() to skip the
>> > >>> decrement of the refcount, causing a memory leak.
>> > >>>
>> > >>> unreferenced object 0xffff88810bc2eb40 (size 1984):
>> > >>> comm "test_progs", pid 2451, jiffies 4295320596
>> > >>> hex dump (first 32 bytes):
>> > >>> 7f 00 00 01 7f 00 00 01 d2 04 1b b7 04 d2 00 00 ................
>> > >>> 02 00 01 40 00 00 00 00 00 00 00 00 00 00 00 00 ...@............
>> > >>> backtrace (crc bdee079d):
>> > >>> kmem_cache_alloc_noprof+0x557/0x660
>> > >>> sk_prot_alloc+0x69/0x240
>> > >>> sk_alloc+0x30/0x460
>> > >>> inet_create+0x2ce/0xf80
>> > >>> __sock_create+0x25b/0x5c0
>> > >>> __sys_socket+0x119/0x1d0
>> > >>> __x64_sys_socket+0x72/0xd0
>> > >>> do_syscall_64+0xa1/0x5f0
>> > >>> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>> > >>>
>> > >>> Maintain balanced refcounts across sk lookup/release: (re-)set
>> > >>> SOCK_RCU_FREE on proto update to treat the socket (whether bound or
>> > >>> unbound) as not requiring a refcount increment on (a RCU protected) lookup.
>> > >>>
>> > >>> Fixes: 0c48eefae712 ("sock_map: Lift socket state restriction for datagram sockets")
>> > >>> Signed-off-by: Michal Luczaj <mhal@rbox.co>
>> > >>> ---
>> > >>> Note: this issue is related to commit 67312adc96b5 ("bpf: reject unhashed
>> > >>> sockets in bpf_sk_assign").
>> > >>> ---
>> > >>> net/ipv4/udp_bpf.c | 3 +++
>> > >>> 1 file changed, 3 insertions(+)
>> > >>>
>> > >>> diff --git a/net/ipv4/udp_bpf.c b/net/ipv4/udp_bpf.c
>> > >>> index ad57c4c9eaab..970327b59582 100644
>> > >>> --- a/net/ipv4/udp_bpf.c
>> > >>> +++ b/net/ipv4/udp_bpf.c
>> > >>> @@ -173,6 +173,9 @@ int udp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
>> > >>> if (sk->sk_family == AF_INET6)
>> > >>> udp_bpf_check_v6_needs_rebuild(psock->sk_proto);
>> > >>>
>> > >>> + /* Treat all sockets as non-refcounted, regardless of binding state. */
>> > >>> + sock_set_flag(sk, SOCK_RCU_FREE);
>> > >>> +
>> > >>> sock_replace_proto(sk, &udp_bpf_prots[family]);
>> > >>> return 0;
>> > >>> }
>> > >>
>> > >> There is a side effect that an unhashed (unbound) UDP socket can now be
>> > >> selected in sk_lookup with bpf_sk_assign.
>> > >
>> > > The commit does mention a related fix, beneath the ---, commit
>> > > 67312adc96b5 ("bpf: reject unhashed sockets in bpf_sk_assign").
>> > > That fixes a similar issue by exactly disallowing this:
>> > >
>> > > Fix the problem by rejecting unhashed sockets in bpf_sk_assign().
>> > > This matches the behaviour of __inet_lookup_skb which is ultimately
>> > > the goal of bpf_sk_assign().
>> > >
>> > > So ..
>> > >
>> > >> Though perhaps that's for the
>> > >> better because TC bpf_sk_assign doesn't reject non-refcounted UDP
>> > >> sockets either, so we would have both socket dispatch sites behave the
>> > >> same way.
>> > >
>> > > .. there are two conflicting types of consistency here? Consistent with
>> > > __inet_lookup_skb or the TC bpf hook. Of those the first is the more
>> > > canonical.
>> > >
>> > >> Also, with this patch, if we insert & remove an unhashed UDP socket
>> > >> into/from a sockmap, we end up with an unhashed non-refcounted UDP
>> > >> socket. Not entirely sure if that is actually a problem or not.
>> > >>
>> > >> Willem, what is your take on having unhashed non-refcoted UDP sockets?
>> > >
>> > > I don't immediately see a problem, but I'm not an expert on SOCK_RCU_FREE.
>> >
>> > Perhaps it's worth mentioning that unhashed non-refcounted UDP socket is
>> > already possible: first auto-bind via connect(AF_INET) (which also sets
>> > SOCK_RCU_FREE), then unhash via connect(AF_UNSPEC).
>>
>> Setting SOCK_RCU_FREE itself should not cause a problem, but I think
>> we should take a step back.
>>
>> AFAIU, 0c48eefae712 was to allow putting AF_UNIX SOCK_DGRAM sockets
>> into sockmap, not to allow using unconnected UDP sockets in sk_lookup etc.
>>
>> Actually, v4 of the patch was implemented as such but did not get any feedback,
>> https://lore.kernel.org/bpf/20210508220835.53801-9-xiyou.wangcong@gmail.com/#t
>>
>> ... and v5 (the final commit) somehow removed the restriction for unconnected
>> UDP socket as well.
>> https://lore.kernel.org/bpf/20210704190252.11866-3-xiyou.wangcong@gmail.com/
>>
>> Given the initial use case, sockmap redirect, is still blocked by
>> TCP_ESTABLISHED
>> check in sock_map_redirect_allowed(), I feel there is no point in supporting
>> unconnected UDP sockets in sockmap. It cannot get any skb from anywhere
>> (without buggy sk_lookup).
>
> s/unconnected/unhashed/g :)
Rejecting unhashed UDP sockets on insert to sockmap SGTM.
It is also in line with disable-problematic-cases strategy.
^ permalink raw reply
* Re: [PATCH net] netpoll: fix a use-after-free on shutdown path
From: Breno Leitao @ 2026-06-25 10:55 UTC (permalink / raw)
To: Jakub Kicinski
Cc: David S. Miller, Eric Dumazet, Paolo Abeni, Simon Horman,
Amerigo Wang, netdev, linux-kernel, vlad.wing, asantostc,
kernel-team, stable
In-Reply-To: <20260624192513.33023e54@kernel.org>
On Wed, Jun 24, 2026 at 07:25:13PM -0700, Jakub Kicinski wrote:
> On Mon, 22 Jun 2026 08:01:23 -0700 Breno Leitao wrote:
> > + * synchronize_net() does not protect the worker
> > + * (queue_process() is not an RCU reader). It fences the
> > + * senders -- the real RCU readers -- so they cannot re-arm
> > + * tx_work after the np->dev->npinfo was set to NULL.
> > + */
> > + synchronize_net();
> > + cancel_delayed_work_sync(&npinfo->tx_work);
>
> Maybe we can avoid the sync_net and the comment by using
> disable_delayed_work_sync() ?
I've been thinking about it, and I think you have a good point.
queue_process() is the only place that take npinfo without RCU
protection.
This is what it happening right now:
CPU0 {
run tx_work (queue_process())
npinfo = container_of()...
while {
A: deqeue skb from the txq
try to send
}
}
CPU 1 {
call_rcu() -> rcu_cleanup_netpoll_info()
np->dev->npinfo, NULL
B: kfree(npinfo);
}
Then, if B happens before A, we have the UAF. That said, if we make sure
that tx_work() is done, then we are OK with rcu_cleanup_netpoll_info
I am not totally sure if the order of pointer zero'ing and disabling
tx work is important, but, it doesn't seem so, any order would be OK
for:
RCU_INIT_POINTER(np->dev->npinfo, NULL);
disable_delayed_work_sync(&npinfo->tx_work);
Given that npinfo is not read inside queue_process(), then, order doesn't
matter.
Thanks for the point, I will update.
--breno
---
pw-bot: cr
^ permalink raw reply
* Re: [PATCH v3 1/7] list: Add mutable iterator variants
From: Jani Nikula @ 2026-06-25 11:00 UTC (permalink / raw)
To: Kaitao Cheng, David Laight, Christian König,
David Hildenbrand (Arm), Alexei Starovoitov
Cc: Andrew Morton, David Hildenbrand, Jens Axboe, Tejun Heo,
Alexander Viro, Christian Brauner, Daniel Borkmann,
Andrii Nakryiko, Johannes Weiner, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
Juri Lelli, Vincent Guittot, Paul Moore, Andy Shevchenko,
Paul E. McKenney, Shakeel Butt, David Howells, Simona Vetter,
Randy Dunlap, Luca Ceresoli, Philipp Stanner, linux-block,
linux-kernel, cgroups, linux-ntfs-dev, linux-fsdevel, io-uring,
audit, bpf, netdev, dri-devel, linux-perf-users,
linux-trace-kernel, kexec, live-patching, linux-modules,
linux-crypto, linux-pm, rcu, sched-ext, linux-mm, virtualization,
damon, llvm, Kaitao Cheng, Muchun Song
In-Reply-To: <0ed6b5c3-e955-46e2-9fc6-075a0dfd1c4f@linux.dev>
On Thu, 25 Jun 2026, Kaitao Cheng <kaitao.cheng@linux.dev> wrote:
> 在 2026/6/24 22:23, David Laight 写道:
>> On Wed, 24 Jun 2026 15:23:47 +0200
>> Christian König <christian.koenig@amd.com> wrote:
>>> On 6/24/26 15:14, Kaitao Cheng wrote:
>>>> 在 2026/6/22 16:42, David Laight 写道:
>>>>> On Mon, 22 Jun 2026 12:05:31 +0800
>>>>> Kaitao Cheng <kaitao.cheng@linux.dev> wrote:
>>>>>
>>>>>> From: Kaitao Cheng <chengkaitao@kylinos.cn>
>>>>>>
>>>>>> The list_for_each*_safe() helpers are used when the loop body may
>>>>>> remove the current entry. Their API exposes the temporary cursor at
>>>>>> every call site, even though most users only need it for the iterator
>>>>>> implementation and never reference it in the loop body.
>>>>>>
>>>>>> Add *_mutable() variants for list and hlist iteration. The new helpers
>>>>>> support both forms: callers may keep passing an explicit temporary cursor
>>>>>> when they need to inspect or reset it, or omit it and let the helper use
>>>>>> a unique internal cursor.
>>>>>
>>>>> I'm not really sure 'mutable' means anything either.
>>>>> It is possible to make it valid for the loop body (or even other threads)
>>>>> to delete arbitrary list items - but that needs significant extra overheads.
>>>>>
>>>>> It might be worth doing something that doesn't need the extra variable,
>>>>> but there is little point doing all the churn just to rename things.
>>>>>
>>>>>>
>>>>>> This makes call sites that only mutate the list through the current entry
>>>>>> less noisy, while keeping the existing *_safe() helpers available for
>>>>>> compatibility.
>>>>>>
>>>>>> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
>>>>>> ---
>>>>>> include/linux/list.h | 269 +++++++++++++++++++++++++++++++++++++------
>>>>>> 1 file changed, 231 insertions(+), 38 deletions(-)
>>>>>>
>>>>>> diff --git a/include/linux/list.h b/include/linux/list.h
>>>>>> index 09d979976b3b..1081def7cea9 100644
>>>>>> --- a/include/linux/list.h
>>>>>> +++ b/include/linux/list.h
>>>>>> @@ -7,6 +7,7 @@
>>>>>> #include <linux/stddef.h>
>>>>>> #include <linux/poison.h>
>>>>>> #include <linux/const.h>
>>>>>> +#include <linux/args.h>
>>>>>>
>>>>>> #include <asm/barrier.h>
>>>>>>
>>>>>> @@ -763,28 +764,72 @@ static inline void list_splice_tail_init(struct list_head *list,
>>>>>> #define list_for_each_prev(pos, head) \
>>>>>> for (pos = (head)->prev; !list_is_head(pos, (head)); pos = pos->prev)
>>>>>>
>>>>>> -/**
>>>>>> - * list_for_each_safe - iterate over a list safe against removal of list entry
>>>>>> - * @pos: the &struct list_head to use as a loop cursor.
>>>>>> - * @n: another &struct list_head to use as temporary storage
>>>>>> - * @head: the head for your list.
>>>>>> +/*
>>>>>> + * list_for_each_safe is an old interface, use list_for_each_mutable instead.
>>>>>> */
>>>>>> #define list_for_each_safe(pos, n, head) \
>>>>>> for (pos = (head)->next, n = pos->next; \
>>>>>> !list_is_head(pos, (head)); \
>>>>>> pos = n, n = pos->next)
>>>>>>
>>>>>> +#define __list_for_each_mutable_internal(pos, tmp, head) \
>>>>>> + for (typeof(pos) tmp = (pos = (head)->next)->next; \
>>>>>
>>>>> Use auto
>>>>>
>>>>>> + !list_is_head(pos, (head)); \
>>>>>> + pos = tmp, tmp = pos->next)
>>>>>> +
>>>>>> +#define __list_for_each_mutable1(pos, head) \
>>>>>> + __list_for_each_mutable_internal(pos, __UNIQUE_ID(next), head)
>>>>>> +
>>>>>> +#define __list_for_each_mutable2(pos, next, head) \
>>>>>> + list_for_each_safe(pos, next, head)
>>>>>> +
>>>>>> /**
>>>>>> - * list_for_each_prev_safe - iterate over a list backwards safe against removal of list entry
>>>>>> + * list_for_each_mutable - iterate over a list safe against entry removal
>>>>>> * @pos: the &struct list_head to use as a loop cursor.
>>>>>> - * @n: another &struct list_head to use as temporary storage
>>>>>> - * @head: the head for your list.
>>>>>> + * @...: either (head) or (next, head)
>>>>>> + *
>>>>>> + * next: another &struct list_head to use as optional temporary storage.
>>>>>> + * The temporary cursor is internal unless explicitly supplied by
>>>>>> + * the caller.
>>>>>> + * head: the head for your list.
>>>>>> + */
>>>>>> +#define list_for_each_mutable(pos, ...) \
>>>>>> + CONCATENATE(__list_for_each_mutable, COUNT_ARGS(__VA_ARGS__)) \
>>>>>> + (pos, __VA_ARGS__)
>>>>>
>>>>> The variable argument count logic really just slows down compilation.
>>>>> Maybe there aren't enough copies of this code to make that significant.
>>>>> But just because you can do it doesn't mean it is a gooD idea.
>>>>> I'm also not sure it really adds anything to the readability.
>>>>>
>>>>> And, it you are going to make the middle argument optional there is
>>>>> no need to change the macro name.
>>>>
>>>> Christian König and Jani Nikula also disagree with the variadic-argument
>>>> implementation approach. If we abandon that method, it means we will
>>>> inevitably need to add some new macros. If mutable is not a good name,
>>>> suggestions for better alternatives would be welcome; coming up with a
>>>> suitable name is indeed rather tricky.
>>>
>>> I don't think you need to add a new macro for the specific use case that people want to modify the next element of the iteration.
>>>
>>> If I remember your numbers correctly that is a really corner case and keeping using the existing *_safe() macros for that sounds perfectly fine to me.
>>
>> IIRC currently you have a choice of either:
>> define Item that can't be deleted
>> list_for_each() The current item.
>> list_for_each_safe() The next item.
>> There is also likely to be code that updates the variables to allow
>> for other scenarios.
>>
>> Note that if increase a reference count and release a lock then list_for_each()
>> is likely safer than list_for_each_safe() :-)
>>
>> list.h has 9 variants of the 'safe' loop.
>> The bloat of another 9 is getting excessive.
>>
>> It has to be said that this is one of my least favourite type of list...
>
> Hi Christian König, David Laight, Jani Nikula, David Hildenbrand,
> Andy Shevchenko, Alexei Starovoitov
>
> For ease of discussion, I need to summarize the currently possible
> approaches and briefly describe their respective pros and cons,
> using the list_for_each_entry* interfaces as examples.
>
> 1. Add list_for_each_entry_mutable, while keeping list_for_each_entry
> and list_for_each_entry_safe unchanged. list_for_each_entry_mutable
> would be used specifically for safe deletion scenarios that do not
> need to expose the temporary cursor externally. The code can refer to
> the v1 version.
>
> Pros: Does not depend on immediate per-subsystem adaptation and can be
> merged directly.
> Cons: Requires adding a whole set of mutable interfaces, which makes the
> code somewhat redundant.
Seems fine, and the original _safe naming is ambiguous anyway.
> 2. Directly optimize away the temporary cursor in list_for_each_entry_safe
> and define it inside the loop instead, changing the interface from four
> arguments to three.
>
> Pros: Does not add redundant interfaces.
> Cons: (1) Users need to manually update special cases that use the
> traversal variable of list_for_each_entry_safe, the new
> list_for_each_entry_safe would no longer apply there and would
> need to be open-coded.
> (2) Because the macro arguments changes, all list_for_each_entry_safe
> callers would need to be modified and merged together, making it
> difficult to merge such a large amount of code at once.
This won't fly because there are literally thousands of
list_for_each_entry_safe() users.
> 3. Use a variadic macro approach to optimize list_for_each_entry_safe,
> so that it supports both three and four arguments.
>
> Pros: (1) Does not add redundant interfaces.
> (2) Does not depend on immediate per-subsystem adaptation and can
> be merged directly.
> Cons: (1) Increases compile time.
> (2) Makes the interface harder for users to use.
Basically I'm against any variadic macro tricks where the optional
argument is not the last argument. That's just way too surprising, and
goes against common practice in just about all other languages.
> 4. Optimize list_for_each_entry by defining the temporary cursor internally,
> making it compatible with the functionality of list_for_each_entry_safe.
> The code can refer to the v2 version.
>
> Pros: (1) Does not add redundant interfaces.
> (2) The number of externally visible arguments of list_for_each_entry
> remains unchanged, still three.
> Cons: (1) list_for_each_entry and list_for_each_entry_safe would be merged
> into one, and list_for_each_entry_safe would gradually be deprecated.
> (2) Users need to manually update special cases that use the traversal
> variable of list_for_each_entry, the new list_for_each_entry would no
> longer apply there and would need to be open-coded. There are 15 such
> cases in total.
This sounds good to me, though I take it there's some code size increase
and/or performance penalty?
Maybe the 15 cases are questionable anyway?
> 5. Use a variadic macro approach to optimize list_for_each_entry, so that
> it supports both three and four arguments.
>
> Pros: (1) Does not add redundant interfaces.
> (2) Does not depend on immediate per-subsystem adaptation and can be
> merged directly.
> Cons: (1) Increases compile time.
> (2) list_for_each_entry and list_for_each_entry_safe would be merged
> into one, and list_for_each_entry_safe would gradually be deprecated.
Please don't do the macro tricks.
> 6. Make no changes, keep the current logic unchanged, and close the current
> email discussion.
I like hiding the temporary stuff when possible.
BR,
Jani.
--
Jani Nikula, Intel
^ permalink raw reply
* [PATCH bpf-next v10 0/5] bpf: add icmp_send kfunc
From: Mahe Tardy @ 2026-06-25 11:03 UTC (permalink / raw)
To: bpf
Cc: andrii, ast, daniel, john.fastabend, jordan, martin.lau,
yonghong.song, emil, netdev, edumazet, kuba, pabeni, davem, horms,
Mahe Tardy
Hello,
This is v10 of adding the icmp_send kfunc, as suggested during
LSF/MM/BPF 2025[^1]. The goal is to allow cgroup_skb programs to
actively reject east-west traffic, similarly to what is possible to do
with netfilter reject target. Applications can receive early feedback
that something went wrong during the TCP handshake.
The first step to implement this is using ICMP control messages, with
the ICMP_DEST_UNREACH type with various code ICMP_NET_UNREACH,
ICMP_HOST_UNREACH, ICMP_PROT_UNREACH, etc. This is easier to implement
than a TCP RST reply and will already hint the client TCP stack to abort
the connection and not retry extensively.
Note that this is different than the sock_destroy kfunc, that along
calls tcp_abort and thus sends a reset, destroying the underlying
socket.
Caveats of this kfunc design are that a program can call this function N
times, thus send N ICMP unreach control messages and that the program
can return from the BPF filter with pass leading to a potential
confusing situation where the TCP connection was established while the
client received ICMP_DEST_UNREACH messages.
v2 updates:
- fix a build error from a missing function call rename;
- avoid changing return line in bpf_kfunc_init;
- return SK_DROP from the kfunc (similarly to bpf_redirect);
- check the return value in the selftest.
v3 update:
- fix an undefined reference build error.
v4 updates:
- prevent the kfunc to be called recursively and add a test (thanks to
Martin).
- do not fetch dst route when unnecessary (thanks to Martin).
- extend the test for IPv6 (thanks to Martin).
- use SK_DROP in examples and use non blocking sockets for testing
(thanks to Martin).
- test when the kfunc returns -EINVAL (thanks to Jordan).
- add the kfunc to bpf_kfunc_set_skb as suggested by Alexei.
- guard the IPv4 parts with IS_ENABLED(CONFIG_INET).
- fix a wrong initial value for client_fd (thanks to Yonghong).
- add documentation to the kfunc.
- to Jordan: I couldn't include <linux/icmp.h> because of redefines from
<network_helpers.h>.
v5 updates:
- kfunc name is now icmp_send and takes the control message type as
parameter for future potential extension (daniel)
- drop the net patches to route packet since now the kfunc is limited to
cgroup_skb and tc progs (daniel & martin)
- linearize skb headers (sashiko)
- zero SKB control block (sashiko)
- bind to port 0 instead of fixed port (sashiko)
- poll to wait for POLLERR event (sashiko)
- do not use ASSERT_EQ in CMSG_NXTHDR loop (sashiko)
- fix comment about byte order (sashiko)
- fix endianness IP address issue (sashiko)
- add forgotten cleanup_cgroup_environment (sashiko)
- let packets pass in recursion test (sashiko)
- clarify evaluation order for recursion test (sashiko)
v6 updates (all from sashiko):
- bring back the net patches to route packet since tc ingress needs it.
- rename the ip_route_reply helpers from fetch to fill.
- call pskb_network_may_pull on the cloned pkt.
- check explicitly that we received one and only one ICMP err ctrl msg.
v7 updates:
- use consume_skb on success path (stanislav)
- replace recursion protection with CPU_ARRAY by checking the nature of
the sk (daniel, offline)
- use reverse xmas tree in read_icmp_errqueue (jordan)
- use ASSERT_OK_FD instead of ASSERT_GE whenever possible (jordan)
- add a test for tc (jordan)
- better filtering from host cgroup test progs (sashiko)
v8 updates:
- mostly a resend as it's been sitting as "New" in the queue for almost
one month, fixed a few nits.
- on new bpf_icmp_send kfunc cgroup_skb test (patch 4/7):
- guard a close fd with fd >= 0 (jordan)
- use ASSERT_OK_FD instead of ASSERT_GE (jordan)
- fixed comment style (sashiko)
- on recursion test (patch 7/7):
- guard a close fd with fd >= 0 (jordan)
- fixed comments style (sashiko)
- filter bpf prog on pid and ICMP message types (sashiko)
v9 updates:
- first, there was a v8.5 that I discussed here[^2] with Emil
Tsalapatis. I tried once again to make tc work but the ai review found
something fundamentally wrong. This version removes the tc support for
now and focuses on cgroup_skb.
- use helper get_socket_local_port instead of getsockname (sashiko)
- use if_nametoindex("lo") instead of value 1 (bpf-ci)
- fix IPV6_RECVERR appearance before IPv6 patch (bpf-ci)
- precise that 0 on success mean icmp_send was called but it was just an
attempt since this function does not return anything (sashiko)
- explicitly consider ICMP_FRAG_NEEDED as invalid in bpf_icmp_send as
it would miss the next-hop MTU info. Also test it. (sashiko)
- test for max_code + 1 for invalid (sashiko)
- add review-by tags from Jordan and Emil but remove it on the main
patch as I have significantly changed it.
- check for rec_count in recursion test (sashiko)
- re-order setup_cgroup_environmment in test (sashiko)
- reset kfunc_ret on every test run (sashiko)
- check for skb route for icmp_send as the function would quietly fail
and add a test (sashiko)
v10 updates:
- guard against skbs with metadata_dst before calling icmpv6_send
(sashiko)
- add more review-by tags from Emil and Jordan.
[^1]: https://lwn.net/Articles/1022034/
[^2]: https://lore.kernel.org/bpf/ajvDRCw8cPqXAqQq@gmail.com/
Link to v9: https://lore.kernel.org/bpf/20260624185554.362555-1-mahe.tardy@gmail.com/
Mahe Tardy (5):
bpf: add bpf_icmp_send kfunc
selftests/bpf: add bpf_icmp_send kfunc cgroup_skb tests
selftests/bpf: add bpf_icmp_send kfunc cgroup_skb IPv6 tests
selftests/bpf: add bpf_icmp_send recursion test
selftests/bpf: add bpf_icmp_send no route test
net/core/filter.c | 95 +++++++
.../bpf/prog_tests/icmp_send_kfunc.c | 269 ++++++++++++++++++
tools/testing/selftests/bpf/progs/icmp_send.c | 123 ++++++++
3 files changed, 487 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
create mode 100644 tools/testing/selftests/bpf/progs/icmp_send.c
--
2.34.1
^ permalink raw reply
* [PATCH bpf-next v10 1/5] bpf: add bpf_icmp_send kfunc
From: Mahe Tardy @ 2026-06-25 11:03 UTC (permalink / raw)
To: bpf
Cc: andrii, ast, daniel, john.fastabend, jordan, martin.lau,
yonghong.song, emil, netdev, edumazet, kuba, pabeni, davem, horms,
Mahe Tardy
In-Reply-To: <20260625110321.28236-1-mahe.tardy@gmail.com>
This is needed in the context of Tetragon to provide improved feedback
(in contrast to just dropping packets) to east-west traffic when blocked
by policies using cgroup_skb programs.
This reuses concepts from netfilter reject target codepath with the
differences that:
* Packets are cloned since the BPF user can still let the packet pass
(SK_PASS from the cgroup_skb progs for example) and the current skb
need to stay untouched (cgroup_skb hooks only allow read-only skb
payload).
* We protect against recursion since the kfunc, by generating an ICMP
error message, could retrigger the BPF prog that invoked it.
Only ICMP_DEST_UNREACH and ICMPV6_DEST_UNREACH are currently supported.
The interface accepts a type parameter to facilitate future extension to
other ICMP control message types.
For normal cgroup_skb paths, the skb dst route should already be set.
However, bpf_prog_test_run_skb can create synthetic IPv4 skbs without an
attached route. In that case, icmp_send returns early, and the kfunc
would otherwise report success despite no ICMP reply being sent. The
check also rejects metadata dsts, which are not valid struct rtable
instances. For IPv6, reject metadata dsts only: icmpv6_send can reach
icmp6_dev, where skb_rt6_info treats any non-NULL skb dst as a struct
rt6_info, which is not valid for metadata_dst.
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Reviewed-by: Jordan Rife <jordan@jrife.io>
Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
net/core/filter.c | 95 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 95 insertions(+)
diff --git a/net/core/filter.c b/net/core/filter.c
index 2e96b4b847ce..0a0191586b44 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -84,6 +84,9 @@
#include <linux/un.h>
#include <net/xdp_sock_drv.h>
#include <net/inet_dscp.h>
+#include <linux/icmpv6.h>
+#include <net/icmp.h>
+#include <net/ip6_route.h>
#include "dev.h"
@@ -12546,6 +12549,88 @@ __bpf_kfunc int bpf_xdp_pull_data(struct xdp_md *x, u32 len)
return 0;
}
+/**
+ * bpf_icmp_send - Send an ICMP control message
+ * @skb_ctx: Packet that triggered the control message
+ * @type: ICMP type (only ICMP_DEST_UNREACH/ICMPV6_DEST_UNREACH supported)
+ * @code: ICMP code (0-15 except ICMP_FRAG_NEEDED for IPv4, 0-6 for IPv6)
+ *
+ * Sends an ICMP control message in response to the packet. The original packet
+ * is cloned before sending the ICMP message, so the BPF program can still let
+ * the packet pass if desired.
+ *
+ * Currently only ICMP_DEST_UNREACH (IPv4) and ICMPV6_DEST_UNREACH (IPv6) are
+ * supported.
+ *
+ * Return: 0 on success (send attempt), negative error code on failure:
+ * -EBUSY: Recursion detected
+ * -EPROTONOSUPPORT: Non-IP protocol
+ * -EOPNOTSUPP: Unsupported ICMP type
+ * -EINVAL: Invalid code parameter
+ * -ENETUNREACH: No usable route/dst for the ICMP reply
+ * -ENOMEM: Memory allocation failed
+ */
+__bpf_kfunc int bpf_icmp_send(struct __sk_buff *skb_ctx, int type, int code)
+{
+ struct sk_buff *skb = (struct sk_buff *)skb_ctx;
+ struct sk_buff *nskb;
+ struct sock *sk;
+
+ sk = skb_to_full_sk(skb);
+ if (sk && sk->sk_kern_sock &&
+ (sk->sk_protocol == IPPROTO_ICMP || sk->sk_protocol == IPPROTO_ICMPV6))
+ return -EBUSY;
+
+ switch (skb->protocol) {
+#if IS_ENABLED(CONFIG_INET)
+ case htons(ETH_P_IP): {
+ if (type != ICMP_DEST_UNREACH)
+ return -EOPNOTSUPP;
+ if (code < 0 || code > NR_ICMP_UNREACH ||
+ code == ICMP_FRAG_NEEDED) /* needs a valid next-hop MTU */
+ return -EINVAL;
+
+ /* icmp_send expects skb_dst to be a real rtable. */
+ if (!skb_valid_dst(skb))
+ return -ENETUNREACH;
+
+ nskb = skb_clone(skb, GFP_ATOMIC);
+ if (!nskb)
+ return -ENOMEM;
+
+ memset(IPCB(nskb), 0, sizeof(*IPCB(nskb)));
+ icmp_send(nskb, type, code, 0);
+ consume_skb(nskb);
+ break;
+ }
+#endif
+#if IS_ENABLED(CONFIG_IPV6)
+ case htons(ETH_P_IPV6):
+ if (type != ICMPV6_DEST_UNREACH)
+ return -EOPNOTSUPP;
+ if (code < 0 || code > ICMPV6_REJECT_ROUTE)
+ return -EINVAL;
+
+ /* icmpv6_send may treat skb_dst as rt6_info. */
+ if (skb_metadata_dst(skb))
+ return -ENETUNREACH;
+
+ nskb = skb_clone(skb, GFP_ATOMIC);
+ if (!nskb)
+ return -ENOMEM;
+
+ memset(IP6CB(nskb), 0, sizeof(*IP6CB(nskb)));
+ icmpv6_send(nskb, type, code, 0);
+ consume_skb(nskb);
+ break;
+#endif
+ default:
+ return -EPROTONOSUPPORT;
+ }
+
+ return 0;
+}
+
__bpf_kfunc_end_defs();
int bpf_dynptr_from_skb_rdonly(struct __sk_buff *skb, u64 flags,
@@ -12588,6 +12673,10 @@ BTF_KFUNCS_START(bpf_kfunc_check_set_sock_ops)
BTF_ID_FLAGS(func, bpf_sock_ops_enable_tx_tstamp)
BTF_KFUNCS_END(bpf_kfunc_check_set_sock_ops)
+BTF_KFUNCS_START(bpf_kfunc_check_set_icmp_send)
+BTF_ID_FLAGS(func, bpf_icmp_send)
+BTF_KFUNCS_END(bpf_kfunc_check_set_icmp_send)
+
static const struct btf_kfunc_id_set bpf_kfunc_set_skb = {
.owner = THIS_MODULE,
.set = &bpf_kfunc_check_set_skb,
@@ -12618,6 +12707,11 @@ static const struct btf_kfunc_id_set bpf_kfunc_set_sock_ops = {
.set = &bpf_kfunc_check_set_sock_ops,
};
+static const struct btf_kfunc_id_set bpf_kfunc_set_icmp_send = {
+ .owner = THIS_MODULE,
+ .set = &bpf_kfunc_check_set_icmp_send,
+};
+
static int __init bpf_kfunc_init(void)
{
int ret;
@@ -12639,6 +12733,7 @@ static int __init bpf_kfunc_init(void)
ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
&bpf_kfunc_set_sock_addr);
ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &bpf_kfunc_set_tcp_reqsk);
+ ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SKB, &bpf_kfunc_set_icmp_send);
return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SOCK_OPS, &bpf_kfunc_set_sock_ops);
}
late_initcall(bpf_kfunc_init);
--
2.34.1
^ permalink raw reply related
* [PATCH bpf-next v10 2/5] selftests/bpf: add bpf_icmp_send kfunc cgroup_skb tests
From: Mahe Tardy @ 2026-06-25 11:03 UTC (permalink / raw)
To: bpf
Cc: andrii, ast, daniel, john.fastabend, jordan, martin.lau,
yonghong.song, emil, netdev, edumazet, kuba, pabeni, davem, horms,
Mahe Tardy
In-Reply-To: <20260625110321.28236-1-mahe.tardy@gmail.com>
This test opens a server and client, enters a new cgroup, attach a
cgroup_skb program on egress and calls the bpf_icmp_send function from
the client egress so that an ICMP unreach control message is sent back
to the client. It then fetches the message from the error queue to
confirm the correct ICMP unreach code has been sent.
Note that, for the client, we have to connect in non-blocking mode to
let the test execute faster. Otherwise, we need to wait for the TCP
three-way handshake to timeout in the kernel before reading the errno.
Also note that we don't set IP_RECVERR on the socket in
connect_to_fd_nonblock since the error will be transferred anyway in our
test because the connection is rejected at the beginning of the TCP
handshake. See in net/ipv4/tcp_ipv4.c:tcp_v4_err for more details.
Reviewed-by: Jordan Rife <jordan@jrife.io>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
.../bpf/prog_tests/icmp_send_kfunc.c | 164 ++++++++++++++++++
tools/testing/selftests/bpf/progs/icmp_send.c | 38 ++++
2 files changed, 202 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
create mode 100644 tools/testing/selftests/bpf/progs/icmp_send.c
diff --git a/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c b/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
new file mode 100644
index 000000000000..b8a98c90053e
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
@@ -0,0 +1,164 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+#include <network_helpers.h>
+#include <linux/errqueue.h>
+#include <poll.h>
+#include "icmp_send.skel.h"
+
+#define TIMEOUT_MS 1000
+
+#define ICMP_DEST_UNREACH 3
+
+#define ICMP_FRAG_NEEDED 4
+#define NR_ICMP_UNREACH 15
+
+#define KFUNC_RET_UNSET -1
+
+static int connect_to_fd_nonblock(int server_fd)
+{
+ struct sockaddr_storage addr;
+ socklen_t len = sizeof(addr);
+ int fd, err;
+
+ if (getsockname(server_fd, (struct sockaddr *)&addr, &len))
+ return -1;
+
+ fd = socket(addr.ss_family, SOCK_STREAM | SOCK_NONBLOCK, 0);
+ if (fd < 0)
+ return -1;
+
+ err = connect(fd, (struct sockaddr *)&addr, len);
+ if (err < 0 && errno != EINPROGRESS) {
+ close(fd);
+ return -1;
+ }
+
+ return fd;
+}
+
+static void read_icmp_errqueue(int sockfd, int expected_code)
+{
+ struct sock_extended_err *sock_err;
+ char ctrl_buf[512];
+ struct msghdr msg = {
+ .msg_control = ctrl_buf,
+ .msg_controllen = sizeof(ctrl_buf),
+ };
+ struct pollfd pfd = {
+ .fd = sockfd,
+ .events = POLLERR,
+ };
+ struct cmsghdr *cm;
+ ssize_t n;
+
+ if (!ASSERT_GE(poll(&pfd, 1, TIMEOUT_MS), 1, "poll_errqueue"))
+ return;
+
+ n = recvmsg(sockfd, &msg, MSG_ERRQUEUE);
+ if (!ASSERT_GE(n, 0, "recvmsg_errqueue"))
+ return;
+
+ cm = CMSG_FIRSTHDR(&msg);
+ if (!ASSERT_NEQ(cm, NULL, "cm_firsthdr_null"))
+ return;
+
+ for (; cm; cm = CMSG_NXTHDR(&msg, cm)) {
+ if (cm->cmsg_level != IPPROTO_IP || cm->cmsg_type != IP_RECVERR)
+ continue;
+
+ sock_err = (struct sock_extended_err *)CMSG_DATA(cm);
+
+ if (!ASSERT_EQ(sock_err->ee_origin, SO_EE_ORIGIN_ICMP,
+ "sock_err_origin_icmp"))
+ return;
+ if (!ASSERT_EQ(sock_err->ee_type, ICMP_DEST_UNREACH,
+ "sock_err_type_dest_unreach"))
+ return;
+ ASSERT_EQ(sock_err->ee_code, expected_code, "sock_err_code");
+ return;
+ }
+
+ ASSERT_FAIL("no IP_RECVERR control message found");
+}
+
+static bool valid_unreach_code(int code)
+{
+ if (code < 0)
+ return false;
+
+ return code <= NR_ICMP_UNREACH && code != ICMP_FRAG_NEEDED;
+}
+
+static void trigger_prog_read_icmp_errqueue(struct icmp_send *skel, int code)
+{
+ int srv_fd = -1, client_fd = -1;
+ int port;
+
+ srv_fd = start_server(AF_INET, SOCK_STREAM, "127.0.0.1", 0, TIMEOUT_MS);
+ if (!ASSERT_OK_FD(srv_fd, "start_server"))
+ return;
+
+ port = get_socket_local_port(srv_fd);
+ if (!ASSERT_GE(port, 0, "get_socket_local_port")) {
+ close(srv_fd);
+ return;
+ }
+
+ skel->bss->server_port = ntohs(port);
+ skel->bss->unreach_code = code;
+ skel->data->kfunc_ret = KFUNC_RET_UNSET;
+
+ client_fd = connect_to_fd_nonblock(srv_fd);
+ if (!ASSERT_OK_FD(client_fd, "client_connect_nonblock")) {
+ close(srv_fd);
+ return;
+ }
+
+ if (valid_unreach_code(code))
+ read_icmp_errqueue(client_fd, code);
+
+ close(client_fd);
+ close(srv_fd);
+}
+
+void test_icmp_send_unreach_cgroup(void)
+{
+ struct icmp_send *skel;
+ int cgroup_fd = -1;
+
+ skel = icmp_send__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "skel_open"))
+ goto cleanup;
+
+ cgroup_fd = test__join_cgroup("/icmp_send_unreach_cgroup");
+ if (!ASSERT_OK_FD(cgroup_fd, "join_cgroup"))
+ goto cleanup;
+
+ skel->links.egress =
+ bpf_program__attach_cgroup(skel->progs.egress, cgroup_fd);
+ if (!ASSERT_OK_PTR(skel->links.egress, "prog_attach_cgroup"))
+ goto cleanup;
+
+ for (int code = 0; code <= NR_ICMP_UNREACH; code++) {
+ if (code == ICMP_FRAG_NEEDED)
+ continue;
+
+ trigger_prog_read_icmp_errqueue(skel, code);
+ ASSERT_EQ(skel->data->kfunc_ret, 0, "kfunc_ret");
+ }
+
+ /* Test invalid codes */
+ trigger_prog_read_icmp_errqueue(skel, -1);
+ ASSERT_EQ(skel->data->kfunc_ret, -EINVAL, "kfunc_ret");
+
+ trigger_prog_read_icmp_errqueue(skel, NR_ICMP_UNREACH + 1);
+ ASSERT_EQ(skel->data->kfunc_ret, -EINVAL, "kfunc_ret");
+
+ trigger_prog_read_icmp_errqueue(skel, ICMP_FRAG_NEEDED);
+ ASSERT_EQ(skel->data->kfunc_ret, -EINVAL, "kfunc_ret");
+
+cleanup:
+ icmp_send__destroy(skel);
+ if (cgroup_fd >= 0)
+ close(cgroup_fd);
+}
diff --git a/tools/testing/selftests/bpf/progs/icmp_send.c b/tools/testing/selftests/bpf/progs/icmp_send.c
new file mode 100644
index 000000000000..6d0be0a9afe1
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/icmp_send.c
@@ -0,0 +1,38 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+/* 127.0.0.1 in host byte order */
+#define SERVER_IP 0x7F000001
+
+#define ICMP_DEST_UNREACH 3
+
+__u16 server_port = 0;
+int unreach_code = 0;
+int kfunc_ret = -1;
+
+SEC("cgroup_skb/egress")
+int egress(struct __sk_buff *skb)
+{
+ void *data = (void *)(long)skb->data;
+ void *data_end = (void *)(long)skb->data_end;
+ struct iphdr *iph;
+ struct tcphdr *tcph;
+
+ iph = data;
+ if ((void *)(iph + 1) > data_end || iph->version != 4 ||
+ iph->protocol != IPPROTO_TCP || iph->daddr != bpf_htonl(SERVER_IP))
+ return SK_PASS;
+
+ tcph = (void *)iph + iph->ihl * 4;
+ if ((void *)(tcph + 1) > data_end ||
+ tcph->dest != bpf_htons(server_port))
+ return SK_PASS;
+
+ kfunc_ret = bpf_icmp_send(skb, ICMP_DEST_UNREACH, unreach_code);
+
+ return SK_DROP;
+}
+
+char LICENSE[] SEC("license") = "Dual BSD/GPL";
--
2.34.1
^ permalink raw reply related
* [PATCH bpf-next v10 3/5] selftests/bpf: add bpf_icmp_send kfunc cgroup_skb IPv6 tests
From: Mahe Tardy @ 2026-06-25 11:03 UTC (permalink / raw)
To: bpf
Cc: andrii, ast, daniel, john.fastabend, jordan, martin.lau,
yonghong.song, emil, netdev, edumazet, kuba, pabeni, davem, horms,
Mahe Tardy
In-Reply-To: <20260625110321.28236-1-mahe.tardy@gmail.com>
This test extends the existing cgroup_skb tests with IPv6 support.
Note that we need to set IPV6_RECVERR on the socket for IPv6 in
connect_to_fd_nonblock otherwise the error will be ignored even if we
are in the middle of the TCP handshake. See in
net/ipv6/datagram.c:ipv6_icmp_error for more details.
Reviewed-by: Jordan Rife <jordan@jrife.io>
Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
.../bpf/prog_tests/icmp_send_kfunc.c | 91 +++++++++++++------
tools/testing/selftests/bpf/progs/icmp_send.c | 48 ++++++++--
2 files changed, 101 insertions(+), 38 deletions(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c b/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
index b8a98c90053e..bbb3c3d4509c 100644
--- a/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
+++ b/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
@@ -8,9 +8,11 @@
#define TIMEOUT_MS 1000
#define ICMP_DEST_UNREACH 3
+#define ICMPV6_DEST_UNREACH 1
#define ICMP_FRAG_NEEDED 4
#define NR_ICMP_UNREACH 15
+#define ICMPV6_REJECT_ROUTE 6
#define KFUNC_RET_UNSET -1
@@ -18,7 +20,7 @@ static int connect_to_fd_nonblock(int server_fd)
{
struct sockaddr_storage addr;
socklen_t len = sizeof(addr);
- int fd, err;
+ int fd, err, on = 1;
if (getsockname(server_fd, (struct sockaddr *)&addr, &len))
return -1;
@@ -27,6 +29,12 @@ static int connect_to_fd_nonblock(int server_fd)
if (fd < 0)
return -1;
+ if (addr.ss_family == AF_INET6 &&
+ setsockopt(fd, IPPROTO_IPV6, IPV6_RECVERR, &on, sizeof(on)) < 0) {
+ close(fd);
+ return -1;
+ }
+
err = connect(fd, (struct sockaddr *)&addr, len);
if (err < 0 && errno != EINPROGRESS) {
close(fd);
@@ -36,8 +44,14 @@ static int connect_to_fd_nonblock(int server_fd)
return fd;
}
-static void read_icmp_errqueue(int sockfd, int expected_code)
+static void read_icmp_errqueue(int sockfd, int expected_code, int af)
{
+ int expected_ee_type = (af == AF_INET) ? ICMP_DEST_UNREACH :
+ ICMPV6_DEST_UNREACH;
+ int expected_origin = (af == AF_INET) ? SO_EE_ORIGIN_ICMP :
+ SO_EE_ORIGIN_ICMP6;
+ int expected_level = (af == AF_INET) ? IPPROTO_IP : IPPROTO_IPV6;
+ int expected_type = (af == AF_INET) ? IP_RECVERR : IPV6_RECVERR;
struct sock_extended_err *sock_err;
char ctrl_buf[512];
struct msghdr msg = {
@@ -63,38 +77,43 @@ static void read_icmp_errqueue(int sockfd, int expected_code)
return;
for (; cm; cm = CMSG_NXTHDR(&msg, cm)) {
- if (cm->cmsg_level != IPPROTO_IP || cm->cmsg_type != IP_RECVERR)
+ if (cm->cmsg_level != expected_level ||
+ cm->cmsg_type != expected_type)
continue;
sock_err = (struct sock_extended_err *)CMSG_DATA(cm);
- if (!ASSERT_EQ(sock_err->ee_origin, SO_EE_ORIGIN_ICMP,
- "sock_err_origin_icmp"))
+ if (!ASSERT_EQ(sock_err->ee_origin, expected_origin,
+ "sock_err_origin"))
return;
- if (!ASSERT_EQ(sock_err->ee_type, ICMP_DEST_UNREACH,
+ if (!ASSERT_EQ(sock_err->ee_type, expected_ee_type,
"sock_err_type_dest_unreach"))
return;
ASSERT_EQ(sock_err->ee_code, expected_code, "sock_err_code");
return;
}
- ASSERT_FAIL("no IP_RECVERR control message found");
+ ASSERT_FAIL("no IP_RECVERR/IPV6_RECVERR control message found");
}
-static bool valid_unreach_code(int code)
+static bool valid_unreach_code(int code, int af)
{
if (code < 0)
return false;
- return code <= NR_ICMP_UNREACH && code != ICMP_FRAG_NEEDED;
+ if (af == AF_INET)
+ return code <= NR_ICMP_UNREACH && code != ICMP_FRAG_NEEDED;
+
+ return code <= ICMPV6_REJECT_ROUTE;
}
-static void trigger_prog_read_icmp_errqueue(struct icmp_send *skel, int code)
+static void trigger_prog_read_icmp_errqueue(struct icmp_send *skel, int code,
+ int af, const char *ip)
{
int srv_fd = -1, client_fd = -1;
int port;
- srv_fd = start_server(AF_INET, SOCK_STREAM, "127.0.0.1", 0, TIMEOUT_MS);
+ srv_fd = start_server(af, SOCK_STREAM, ip, 0, TIMEOUT_MS);
if (!ASSERT_OK_FD(srv_fd, "start_server"))
return;
@@ -105,6 +124,8 @@ static void trigger_prog_read_icmp_errqueue(struct icmp_send *skel, int code)
}
skel->bss->server_port = ntohs(port);
+ skel->bss->unreach_type = (af == AF_INET) ? ICMP_DEST_UNREACH :
+ ICMPV6_DEST_UNREACH;
skel->bss->unreach_code = code;
skel->data->kfunc_ret = KFUNC_RET_UNSET;
@@ -114,13 +135,37 @@ static void trigger_prog_read_icmp_errqueue(struct icmp_send *skel, int code)
return;
}
- if (valid_unreach_code(code))
- read_icmp_errqueue(client_fd, code);
+ if (valid_unreach_code(code, af))
+ read_icmp_errqueue(client_fd, code, af);
close(client_fd);
close(srv_fd);
}
+static void run_icmp_test(struct icmp_send *skel, int af, const char *ip,
+ int max_code)
+{
+ for (int code = 0; code <= max_code; code++) {
+ if (af == AF_INET && code == ICMP_FRAG_NEEDED)
+ continue;
+
+ trigger_prog_read_icmp_errqueue(skel, code, af, ip);
+ ASSERT_EQ(skel->data->kfunc_ret, 0, "kfunc_ret");
+ }
+
+ /* Test invalid codes */
+ trigger_prog_read_icmp_errqueue(skel, -1, af, ip);
+ ASSERT_EQ(skel->data->kfunc_ret, -EINVAL, "kfunc_ret");
+
+ trigger_prog_read_icmp_errqueue(skel, max_code + 1, af, ip);
+ ASSERT_EQ(skel->data->kfunc_ret, -EINVAL, "kfunc_ret");
+
+ if (af == AF_INET) {
+ trigger_prog_read_icmp_errqueue(skel, ICMP_FRAG_NEEDED, af, ip);
+ ASSERT_EQ(skel->data->kfunc_ret, -EINVAL, "kfunc_ret");
+ }
+}
+
void test_icmp_send_unreach_cgroup(void)
{
struct icmp_send *skel;
@@ -139,23 +184,11 @@ void test_icmp_send_unreach_cgroup(void)
if (!ASSERT_OK_PTR(skel->links.egress, "prog_attach_cgroup"))
goto cleanup;
- for (int code = 0; code <= NR_ICMP_UNREACH; code++) {
- if (code == ICMP_FRAG_NEEDED)
- continue;
-
- trigger_prog_read_icmp_errqueue(skel, code);
- ASSERT_EQ(skel->data->kfunc_ret, 0, "kfunc_ret");
- }
-
- /* Test invalid codes */
- trigger_prog_read_icmp_errqueue(skel, -1);
- ASSERT_EQ(skel->data->kfunc_ret, -EINVAL, "kfunc_ret");
+ if (test__start_subtest("ipv4"))
+ run_icmp_test(skel, AF_INET, "127.0.0.1", NR_ICMP_UNREACH);
- trigger_prog_read_icmp_errqueue(skel, NR_ICMP_UNREACH + 1);
- ASSERT_EQ(skel->data->kfunc_ret, -EINVAL, "kfunc_ret");
-
- trigger_prog_read_icmp_errqueue(skel, ICMP_FRAG_NEEDED);
- ASSERT_EQ(skel->data->kfunc_ret, -EINVAL, "kfunc_ret");
+ if (test__start_subtest("ipv6"))
+ run_icmp_test(skel, AF_INET6, "::1", ICMPV6_REJECT_ROUTE);
cleanup:
icmp_send__destroy(skel);
diff --git a/tools/testing/selftests/bpf/progs/icmp_send.c b/tools/testing/selftests/bpf/progs/icmp_send.c
index 6d0be0a9afe1..6e1ba539eeb0 100644
--- a/tools/testing/selftests/bpf/progs/icmp_send.c
+++ b/tools/testing/selftests/bpf/progs/icmp_send.c
@@ -5,10 +5,11 @@
/* 127.0.0.1 in host byte order */
#define SERVER_IP 0x7F000001
-
-#define ICMP_DEST_UNREACH 3
+/* ::1 in host byte order (last 32-bit word) */
+#define SERVER_IP6_LO 0x00000001
__u16 server_port = 0;
+int unreach_type = 0;
int unreach_code = 0;
int kfunc_ret = -1;
@@ -18,19 +19,48 @@ int egress(struct __sk_buff *skb)
void *data = (void *)(long)skb->data;
void *data_end = (void *)(long)skb->data_end;
struct iphdr *iph;
+ struct ipv6hdr *ip6h;
struct tcphdr *tcph;
+ __u8 version;
- iph = data;
- if ((void *)(iph + 1) > data_end || iph->version != 4 ||
- iph->protocol != IPPROTO_TCP || iph->daddr != bpf_htonl(SERVER_IP))
+ if (data + 1 > data_end)
return SK_PASS;
- tcph = (void *)iph + iph->ihl * 4;
- if ((void *)(tcph + 1) > data_end ||
- tcph->dest != bpf_htons(server_port))
+ version = (*((__u8 *)data)) >> 4;
+
+ if (version == 4) {
+ iph = data;
+ if ((void *)(iph + 1) > data_end ||
+ iph->protocol != IPPROTO_TCP ||
+ iph->daddr != bpf_htonl(SERVER_IP))
+ return SK_PASS;
+
+ tcph = (void *)iph + iph->ihl * 4;
+ if ((void *)(tcph + 1) > data_end ||
+ tcph->dest != bpf_htons(server_port))
+ return SK_PASS;
+
+ } else if (version == 6) {
+ ip6h = data;
+ if ((void *)(ip6h + 1) > data_end ||
+ ip6h->nexthdr != IPPROTO_TCP)
+ return SK_PASS;
+
+ if (ip6h->daddr.in6_u.u6_addr32[0] != 0 ||
+ ip6h->daddr.in6_u.u6_addr32[1] != 0 ||
+ ip6h->daddr.in6_u.u6_addr32[2] != 0 ||
+ ip6h->daddr.in6_u.u6_addr32[3] != bpf_htonl(SERVER_IP6_LO))
+ return SK_PASS;
+
+ tcph = (void *)(ip6h + 1);
+ if ((void *)(tcph + 1) > data_end ||
+ tcph->dest != bpf_htons(server_port))
+ return SK_PASS;
+ } else {
return SK_PASS;
+ }
- kfunc_ret = bpf_icmp_send(skb, ICMP_DEST_UNREACH, unreach_code);
+ kfunc_ret = bpf_icmp_send(skb, unreach_type, unreach_code);
return SK_DROP;
}
--
2.34.1
^ permalink raw reply related
* [PATCH bpf-next v10 4/5] selftests/bpf: add bpf_icmp_send recursion test
From: Mahe Tardy @ 2026-06-25 11:03 UTC (permalink / raw)
To: bpf
Cc: andrii, ast, daniel, john.fastabend, jordan, martin.lau,
yonghong.song, emil, netdev, edumazet, kuba, pabeni, davem, horms,
Mahe Tardy
In-Reply-To: <20260625110321.28236-1-mahe.tardy@gmail.com>
This test is similar to test_icmp_send_unreach_cgroup but checks that,
in case of recursion, meaning that the BPF program calling the kfunc was
re-triggered by the icmp_send done by the kfunc, the kfunc will stop
early and return -EBUSY.
The test attaches to the root cgroup to ensure the ICMP packet generated
by the kfunc re-triggers the BPF program.
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Reviewed-by: Jordan Rife <jordan@jrife.io>
Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
.../bpf/prog_tests/icmp_send_kfunc.c | 46 ++++++++++++++++
tools/testing/selftests/bpf/progs/icmp_send.c | 55 +++++++++++++++++++
2 files changed, 101 insertions(+)
diff --git a/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c b/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
index bbb3c3d4509c..bb532aa0d158 100644
--- a/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
+++ b/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
@@ -1,8 +1,10 @@
// SPDX-License-Identifier: GPL-2.0
#include <test_progs.h>
#include <network_helpers.h>
+#include <cgroup_helpers.h>
#include <linux/errqueue.h>
#include <poll.h>
+#include <unistd.h>
#include "icmp_send.skel.h"
#define TIMEOUT_MS 1000
@@ -10,6 +12,7 @@
#define ICMP_DEST_UNREACH 3
#define ICMPV6_DEST_UNREACH 1
+#define ICMP_HOST_UNREACH 1
#define ICMP_FRAG_NEEDED 4
#define NR_ICMP_UNREACH 15
#define ICMPV6_REJECT_ROUTE 6
@@ -195,3 +198,46 @@ void test_icmp_send_unreach_cgroup(void)
if (cgroup_fd >= 0)
close(cgroup_fd);
}
+
+void test_icmp_send_unreach_recursion(void)
+{
+ struct icmp_send *skel;
+ int cgroup_fd = -1;
+ int err;
+
+ err = setup_cgroup_environment();
+ if (!ASSERT_OK(err, "setup_cgroup_environment"))
+ return;
+
+ skel = icmp_send__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "skel_open"))
+ goto cleanup;
+
+ cgroup_fd = get_root_cgroup();
+ if (!ASSERT_OK_FD(cgroup_fd, "get_root_cgroup"))
+ goto cleanup;
+
+ skel->data->target_pid = getpid();
+ skel->links.recursion =
+ bpf_program__attach_cgroup(skel->progs.recursion, cgroup_fd);
+ if (!ASSERT_OK_PTR(skel->links.recursion, "prog_attach_cgroup"))
+ goto cleanup;
+
+ trigger_prog_read_icmp_errqueue(skel, ICMP_HOST_UNREACH, AF_INET,
+ "127.0.0.1");
+
+ /*
+ * Because there's recursion involved, the first call will return at
+ * index 1 since it will return the second, and the second call will
+ * return at index 0 since it will return the first.
+ */
+ ASSERT_EQ(skel->bss->rec_count, 2, "rec_count");
+ ASSERT_EQ(skel->data->rec_kfunc_rets[0], -EBUSY, "kfunc_rets[0]");
+ ASSERT_EQ(skel->data->rec_kfunc_rets[1], 0, "kfunc_rets[1]");
+
+cleanup:
+ icmp_send__destroy(skel);
+ if (cgroup_fd >= 0)
+ close(cgroup_fd);
+ cleanup_cgroup_environment();
+}
diff --git a/tools/testing/selftests/bpf/progs/icmp_send.c b/tools/testing/selftests/bpf/progs/icmp_send.c
index 6e1ba539eeb0..c642ccdf9fd5 100644
--- a/tools/testing/selftests/bpf/progs/icmp_send.c
+++ b/tools/testing/selftests/bpf/progs/icmp_send.c
@@ -12,6 +12,10 @@ __u16 server_port = 0;
int unreach_type = 0;
int unreach_code = 0;
int kfunc_ret = -1;
+int target_pid = -1;
+
+unsigned int rec_count = 0;
+int rec_kfunc_rets[] = { -1, -1 };
SEC("cgroup_skb/egress")
int egress(struct __sk_buff *skb)
@@ -65,4 +69,55 @@ int egress(struct __sk_buff *skb)
return SK_DROP;
}
+SEC("cgroup_skb/egress")
+int recursion(struct __sk_buff *skb)
+{
+ void *data = (void *)(long)skb->data;
+ void *data_end = (void *)(long)skb->data_end;
+ struct icmphdr *icmph;
+ struct tcphdr *tcph;
+ struct iphdr *iph;
+ int ret;
+
+ if ((bpf_get_current_pid_tgid() >> 32) != target_pid)
+ return SK_PASS;
+
+ iph = data;
+ if ((void *)(iph + 1) > data_end || iph->version != 4)
+ return SK_PASS;
+
+ if (iph->daddr != bpf_htonl(SERVER_IP))
+ return SK_PASS;
+
+ if (iph->protocol == IPPROTO_TCP) {
+ tcph = (void *)iph + iph->ihl * 4;
+ if ((void *)(tcph + 1) > data_end ||
+ tcph->dest != bpf_htons(server_port))
+ return SK_PASS;
+ } else if (iph->protocol == IPPROTO_ICMP) {
+ icmph = (void *)iph + iph->ihl * 4;
+ if ((void *)(icmph + 1) > data_end ||
+ icmph->type != unreach_type || icmph->code != unreach_code)
+ return SK_PASS;
+ } else {
+ return SK_PASS;
+ }
+
+ /*
+ * This call will provoke a recursion: the ICMP packet generated by the
+ * kfunc will re-trigger this program since we are in the root cgroup in
+ * which the kernel ICMP socket belongs. However when re-entering the
+ * kfunc, it should return EBUSY.
+ */
+ ret = bpf_icmp_send(skb, unreach_type, unreach_code);
+ rec_kfunc_rets[rec_count & 1] = ret;
+ __sync_fetch_and_add(&rec_count, 1);
+
+ /* Let the first ICMP error message pass */
+ if (iph->protocol == IPPROTO_ICMP)
+ return SK_PASS;
+
+ return SK_DROP;
+}
+
char LICENSE[] SEC("license") = "Dual BSD/GPL";
--
2.34.1
^ permalink raw reply related
* [PATCH bpf-next v10 5/5] selftests/bpf: add bpf_icmp_send no route test
From: Mahe Tardy @ 2026-06-25 11:03 UTC (permalink / raw)
To: bpf
Cc: andrii, ast, daniel, john.fastabend, jordan, martin.lau,
yonghong.song, emil, netdev, edumazet, kuba, pabeni, davem, horms,
Mahe Tardy
In-Reply-To: <20260625110321.28236-1-mahe.tardy@gmail.com>
For normal live cgroup_skb paths, the skb should already be routed. The
exception is for test run via BPF_PROG_TEST_RUN with packets created
via bpf_prog_test_run_skb. Those lack dst route and thus the icmp_send
would quietly fail by returning early.
This test exercises this and makes sure the kfunc returns -ENETUNREACH.
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Reviewed-by: Jordan Rife <jordan@jrife.io>
Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
.../bpf/prog_tests/icmp_send_kfunc.c | 26 +++++++++++++++++++
1 file changed, 26 insertions(+)
diff --git a/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c b/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
index bb532aa0d158..ffaf0fe1880b 100644
--- a/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
+++ b/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
@@ -169,6 +169,29 @@ static void run_icmp_test(struct icmp_send *skel, int af, const char *ip,
}
}
+static void run_icmp_no_route_test(struct icmp_send *skel)
+{
+ struct ipv4_packet pkt = pkt_v4;
+ LIBBPF_OPTS(bpf_test_run_opts, opts,
+ .data_in = &pkt,
+ .data_size_in = sizeof(pkt),
+ );
+ int err;
+
+ pkt.iph.version = 4;
+ pkt.iph.daddr = inet_addr("127.0.0.1");
+ pkt.tcp.dest = htons(80);
+ skel->bss->server_port = 80;
+ skel->bss->unreach_type = ICMP_DEST_UNREACH;
+ skel->bss->unreach_code = ICMP_HOST_UNREACH;
+ skel->data->kfunc_ret = KFUNC_RET_UNSET;
+
+ err = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.egress), &opts);
+ if (!ASSERT_OK(err, "test_run"))
+ return;
+ ASSERT_EQ(skel->data->kfunc_ret, -ENETUNREACH, "kfunc_ret_no_route");
+}
+
void test_icmp_send_unreach_cgroup(void)
{
struct icmp_send *skel;
@@ -193,6 +216,9 @@ void test_icmp_send_unreach_cgroup(void)
if (test__start_subtest("ipv6"))
run_icmp_test(skel, AF_INET6, "::1", ICMPV6_REJECT_ROUTE);
+ if (test__start_subtest("no_route"))
+ run_icmp_no_route_test(skel);
+
cleanup:
icmp_send__destroy(skel);
if (cgroup_fd >= 0)
--
2.34.1
^ permalink raw reply related
* Re: [Intel-wired-lan] [PATCH net] igc: Fix RX HW timestamp reporting when NET_RX_BUSY_POLL is disabled
From: Marcin Szycik @ 2026-06-25 11:07 UTC (permalink / raw)
To: Florian Bezdeka, Kwapulinski, Piotr, Ding Meng, Nguyen, Anthony L,
Kitszel, Przemyslaw, andrew+netdev@lunn.ch, davem@davemloft.net,
edumazet@google.com, kuba@kernel.org, pabeni@redhat.com,
Kiszka, Jan
Cc: intel-wired-lan@lists.osuosl.org, linux-kernel@vger.kernel.org,
netdev@vger.kernel.org, wq.wang@siemens.com
In-Reply-To: <d058b0fa9ad923514084a44f51c78ae8355c4ebb.camel@siemens.com>
On 24/06/2026 11:05, Florian Bezdeka via Intel-wired-lan wrote:
> On Tue, 2026-06-23 at 09:46 +0000, Kwapulinski, Piotr wrote:
>>> -----Original Message-----
>>> From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On Behalf Of Ding Meng via Intel-wired-lan
>>> Sent: Monday, June 22, 2026 6:13 AM
>>> To: Nguyen, Anthony L <anthony.l.nguyen@intel.com>; Kitszel, Przemyslaw <przemyslaw.kitszel@intel.com>; andrew+netdev@lunn.ch; davem@davemloft.net; edumazet@google.com; kuba@kernel.org; pabeni@redhat.com; Kiszka, Jan <jan.kiszka@siemens.com>; Bezdeka, Florian <florian.bezdeka@siemens.com>
>>> Cc: intel-wired-lan@lists.osuosl.org; linux-kernel@vger.kernel.org; netdev@vger.kernel.org; meng.ding@siemens.com; wq.wang@siemens.com
>>> Subject: [Intel-wired-lan] [PATCH net] igc: Fix RX HW timestamp reporting when NET_RX_BUSY_POLL is disabled
>>>
>>> When CONFIG_NET_RX_BUSY_POLL is deactivated, fetching RX HW timestamps from the NIC no longer works as expected.
>>>
>>> This occurs because disabling CONFIG_NET_RX_BUSY_POLL disables the SKB NAPI mapping in __skb_mark_napi_id(). Consequently, get_timestamp() fails to perform its driver lookup, and the igc driver's struct net_device_ops::ndo_get_tstamp is never invoked.
>>>
>>> Instead, get_timestamp() falls back to use shhwtstamps(skb)->hwtstamp, a field that the driver has not populated.
>>>
>>> Fix this by populating the hwtstamp field with the correct timestamp in the default timer when CONFIG_NET_RX_BUSY_POLL is disabled.
>>>
>>> Fixes: 069b142f5819 ("igc: Add support for PTP .getcyclesx64()")
>>> Co-developed-by: Florian Bezdeka <florian.bezdeka@siemens.com>
>>> Signed-off-by: Florian Bezdeka <florian.bezdeka@siemens.com>
>>> Signed-off-by: Ding Meng <meng.ding@siemens.com>
>>> ---
>>> drivers/net/ethernet/intel/igc/igc_main.c | 38 ++++++++++++++++-------
>>> 1 file changed, 26 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c
>>> index 8ac16808023..1da8d7aa76d 100644
>>> --- a/drivers/net/ethernet/intel/igc/igc_main.c
>>> +++ b/drivers/net/ethernet/intel/igc/igc_main.c
>>> @@ -1992,7 +1992,26 @@ static struct sk_buff *igc_build_skb(struct igc_ring *rx_ring,
>>> return skb;
>>> }
>>>
>>> -static struct sk_buff *igc_construct_skb(struct igc_ring *rx_ring,
>>> +static void igc_construct_skb_timestamps(struct igc_adapter *adapter,
>>> + struct sk_buff *skb,
>>> + struct igc_xdp_buff *ctx)
>>> +{
>>> + if (!ctx->rx_ts)
>>> + return;
>>> +#ifdef CONFIG_NET_RX_BUSY_POLL
>>> + skb_shinfo(skb)->tx_flags |= SKBTX_HW_TSTAMP_NETDEV;
>>> + skb_hwtstamps(skb)->netdev_data = ctx->rx_ts; #else
>>> + struct igc_inline_rx_tstamps *tstamps;
>> Please move at the top of the function and add:
>
> That would trigger a "unused variable" warning in the
> CONFIG_NET_RX_BUSY_POLL case.
Put it under #ifndef CONFIG_NET_RX_BUSY_POLL. Variable declarations
need to be on top.
Thanks,
Marcin
> Btw: I was really confused that the #else statement moved to the end of
> the previous line. Might someone be using a wrongly configured mail
> client here?
>
> Florian
>
>> Reviewed-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com
>>
>>> +
>>> + tstamps = ctx->rx_ts;
>>> + skb_hwtstamps(skb)->hwtstamp = igc_ptp_rx_pktstamp(adapter,
>>> + tstamps->timer0);
>>> +#endif
>>> +}
>>> +
>
> [snip]
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox